Method, device, medium, and program product for training question-answer system
US-2025232214-A1 · Jul 17, 2025 · US
US12566971B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12566971-B2 |
| Application number | US-202419001635-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 26, 2024 |
| Priority date | Jan 10, 2024 |
| Publication date | Mar 3, 2026 |
| Grant date | Mar 3, 2026 |
Official abstract text for this publication.
The present disclosure discloses a method for training an industrial question-answering model based on reinforcement learning and knowledge base matching, comprising: S1, training a reward model, and for an industrial knowledge question-answering, matching and comparing an output of an industrial question-answering model with a content of an industrial knowledge base, and generating a reward value based on a similarity between the output of the industrial question-answering model and the content of the industrial knowledge base; S2, ranking a plurality of reward values corresponding to a plurality of outputs of the industrial question-answering model and training and updating network parameters of the reward model based on a ranking loss function; and S3, training the industrial question-answering model, adding the plurality of reward values to a penalty term, and obtaining an optimal policy after performing a plurality of times of reinforcement learning on the industrial question-answering model using a reinforcement learning algorithm.
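The similarity-based reward of step S1 can be illustrated with a minimal sketch. The claims state that feature vectors are built from a preset count of keywords and compared with a similarity function that includes a Euclidean distance and a cosine similarity; this sketch uses cosine similarity only. The keyword extraction shown here (most frequent non-stopword tokens of the prior knowledge), the `STOPWORDS` set, the `top_k` parameter, and all function names are assumptions made for illustration and are not taken from the patent.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; the patent does not specify how keywords are chosen.
STOPWORDS = {"the", "to", "in", "a", "of", "and"}

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def keyword_vector(text, keywords):
    """Count how often each selected keyword occurs in the text."""
    counts = Counter(tokenize(text))
    return [float(counts[k]) for k in keywords]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_reward(answer, prior_knowledge, top_k=5):
    """r_i = Sim(a_i, a_0): compare answer a_i with prior knowledge a_0."""
    # Assumption: the "preset count of keywords" is the top_k most frequent tokens of a_0.
    keywords = [w for w, _ in Counter(tokenize(prior_knowledge)).most_common(top_k)]
    first_vector = keyword_vector(prior_knowledge, keywords)   # from a_0
    second_vector = keyword_vector(answer, keywords)           # from a_i
    return cosine_similarity(first_vector, second_vector)

a0 = "Torque the flange bolts to 45 Nm in a star pattern"
r_close = similarity_reward("Torque the flange bolts to 45 Nm", a0)
r_partial = similarity_reward("Tighten the bolts", a0)
r_unrelated = similarity_reward("Check coolant level", a0)
```

In this toy example the answer that shares all selected keywords with the prior knowledge receives the highest reward and the unrelated answer receives zero, matching the claim language that a shorter vector distance gives a greater reward.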
Opening claim text (preview).
What is claimed is:

1. A method for training an industrial question-answering model based on reinforcement learning and knowledge base matching, comprising:

S1, constructing an industrial knowledge base, training a reward model, and for an industrial knowledge question-answering, matching and comparing an output of an industrial question-answering model with a content of the industrial knowledge base, and generating a reward value based on a similarity between the output of the industrial question-answering model and the content of the industrial knowledge base, including: collecting a professional question-answering in an industrial domain, constructing the industrial knowledge base, training the reward model, matching and comparing the output of the industrial question-answering model with the content of the industrial knowledge base, and generating the reward value $r_i$ based on a similarity function, wherein the reward value $r_i$ is expressed as follows:

$$r_i = \mathrm{Sim}(a_i, a_0);$$

wherein $a_0$ is prior knowledge, and $a_i$ indicates different answers of the industrial question-answering model; wherein the reward value $r_i$ is determined as follows: extracting a first feature vector of the prior knowledge $a_0$; extracting a second feature vector corresponding to each of the different answers of the industrial question-answering model $a_i$; calculating, via the similarity function, a vector distance between the first feature vector and each of second feature vectors; and determining the reward value $r_i$ based on the vector distance; the similarity function includes a Euclidean distance and a cosine similarity; the first feature vector is a feature vector including a preset count of keywords extracted based on the prior knowledge; the second feature vector is a feature vector including a preset count of keywords extracted based on a second answer; the shorter the vector distance between the first feature vector and the second feature vector, the greater the reward value $r_i$;

S2, ranking a plurality of reward values corresponding to a plurality of outputs of the industrial question-answering model and training and updating network parameters of the reward model based on a ranking loss function; wherein the ranking loss function is expressed as follows:

$$\mathrm{loss} = E\left[\log\left(\sigma(r_i - r_j)\right)\right] \quad \forall\, i, j;$$

wherein $r_i$ and $r_j$ are reward values corresponding to different texts, $\sigma$ is a Sigmoid function;

S3, training the industrial question-answering model, adding the plurality of reward values to a penalty term, and obtaining an optimal policy after performing a plurality of times of reinforcement learning on the industrial question-answering model using a reinforcement learning algorithm, wherein the reinforcement learning includes steps S31-S33:

S31, calculating the reward value $r_i$ based on the reward model in S1, adding the reward value to the penalty term, wherein a final reward value $R_i$ is expressed as follows:

$$R_i = r_i - \beta \log \frac{\pi_i}{\pi_0};$$

wherein $\beta$ is a coefficient of the penalty term, $\pi_i$ is a policy output of a last layer of the industrial question-answering model in a current iteration cascading a first linearly fully connected layer, and $\pi_0$ is a policy output of a last layer of an initial industrial question-answering model cascading a second linearly fully connected layer;

S32, training a plurality of times using the reinforcement learning until a reward function converges, including: performing, based on an Actor-Critic network, policy optimization on the reward model using a reinforcement learning proximal policy optimization (PPO) algorithm, the reinforcement learning PPO algorithm limiting an updated magnitude of a policy by a proximal ratio clipping loss, a reinforcement learning process of the industrial question-answering model based on the PPO algorithm including:

inputting questions of known industrial question-answering into the initial industrial question-answering model or an iterative industrial question-answering model, obtaining log-probs of different first answers corresponding to a question that are output by the initial industrial question-answering model or the iterative industrial question-answering model;

obtaining penalty terms corresponding to processed answers based on the log-probs;

obtaining a final reward based on the penalty terms and a scalar reward;

inputting the final reward into the industrial question-answering model based on the PPO algorithm, obtaining an output of the industrial question-answering model; and

based on the output of the industrial question-answering model and a loss function, updating parameters of the industrial question-answering model until the industrial question-answering model reaches a preset performance or a count of iterations to obtain a trained industrial question-answering model, which as follows: L_t(θ) = Ê_t[min(
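Two of the formulas in claim 1, the ranking expression of S2 and the penalized final reward of S31, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the patented implementation: the function names, the default `beta=0.1`, and the convention that the input list is ordered from best-ranked to worst-ranked answer are all hypothetical. The claim text gives the ranking expression without a leading minus sign, so the sketch evaluates it as written; a training loop that minimizes a loss would typically use its negative, which is an assumption and not something the claim states.

```python
import math
from itertools import combinations

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def ranking_objective(ranked_rewards):
    """Mean of log(sigmoid(r_i - r_j)) over pairs where answer i is ranked above answer j.

    Assumption: ranked_rewards is ordered from best-ranked to worst-ranked answer.
    """
    pairs = list(combinations(ranked_rewards, 2))
    if not pairs:
        raise ValueError("need at least two reward values to form a pair")
    return sum(log_sigmoid(r_i - r_j) for r_i, r_j in pairs) / len(pairs)

def final_reward(r_i, logp_current, logp_initial, beta=0.1):
    """R_i = r_i - beta * log(pi_i / pi_0), with the ratio computed from log-probs."""
    return r_i - beta * (logp_current - logp_initial)

# Rewards that agree with the ranking score higher than rewards that contradict it.
consistent = ranking_objective([0.9, 0.5, 0.1])
inverted = ranking_objective([0.1, 0.5, 0.9])

# No drift from the initial policy means no penalty; drift reduces the final reward.
no_drift = final_reward(0.8, logp_current=-1.0, logp_initial=-1.0)
with_drift = final_reward(0.8, logp_current=-0.5, logp_initial=-1.0)
```

Working with log-probabilities instead of raw probabilities turns the ratio inside the logarithm into a subtraction, which is consistent with the claim's mention of obtaining penalty terms from log-probs.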
Inference or reasoning models · CPC title
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title
Learning methods · CPC title
Combinations of networks · CPC title
Reinforcement learning · CPC title