Methods for training an industrial question-answering model based on reinforcement learning and knowledge base matching

US12566971B2 · US · B2

Patent metadata
Publication number: US-12566971-B2
Application number: US-202419001635-A
Country: US
Kind code: B2
Filing date: Dec 26, 2024
Priority date: Jan 10, 2024
Publication date: Mar 3, 2026
Grant date: Mar 3, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure discloses a method for training an industrial question-answering model based on reinforcement learning and knowledge base matching, comprising: S1, training a reward model, and for an industrial knowledge question-answering, matching and comparing an output of an industrial question-answering model with a content of an industrial knowledge base, and generating a reward value based on a similarity between the output of the industrial question-answering model and the content of the industrial knowledge base; S2, ranking a plurality of reward values corresponding to a plurality of outputs of the industrial question-answering model and training and updating network parameters of the reward model based on a ranking loss function; and S3, training the industrial question-answering model, adding the plurality of reward values to a penalty term, and obtaining an optimal policy after performing a plurality of times of reinforcement learning on the industrial question-answering model using a reinforcement learning algorithm.
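
The similarity-based reward in step S1 can be illustrated with a short sketch. The first claim on this page names cosine similarity and Euclidean distance over keyword feature vectors; everything else here is an assumption made for illustration, not the patented implementation: the function names, the whitespace tokenizer, the fixed `vocabulary` list, and the `1 / (1 + d)` mapping from distance to reward are all hypothetical.

```python
import math
from collections import Counter


def keyword_vector(text, vocabulary):
    """Count occurrences of each vocabulary keyword in the text.

    Assumption: plain lowercase whitespace tokenization, so punctuation
    attached to a word would prevent a match.
    """
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity_reward(answer, reference, vocabulary):
    """Reward r_i = Sim(a_i, a_0), here taken to be cosine similarity."""
    return cosine_similarity(
        keyword_vector(answer, vocabulary),
        keyword_vector(reference, vocabulary),
    )


def distance_reward(answer, reference, vocabulary):
    """Hypothetical distance-based variant: a shorter distance gives a larger reward."""
    d = euclidean_distance(
        keyword_vector(answer, vocabulary),
        keyword_vector(reference, vocabulary),
    )
    return 1.0 / (1.0 + d)
```

With a knowledge-base entry as `reference` and a model output as `answer`, an answer that shares more of the reference keywords scores closer to 1, and an answer that shares none scores 0 under the cosine variant.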

First claim

Opening claim text (preview).

What is claimed is: 1. A method for training an industrial question-answering model based on reinforcement learning and knowledge base matching, comprising:

S1, constructing an industrial knowledge base, training a reward model, and for an industrial knowledge question-answering, matching and comparing an output of an industrial question-answering model with a content of the industrial knowledge base, and generating a reward value based on a similarity between the output of the industrial question-answering model and the content of the industrial knowledge base, including: collecting a professional question-answering in an industrial domain, constructing the industrial knowledge base, training the reward model, matching and comparing the output of the industrial question-answering model with the content of the industrial knowledge base, and generating the reward value $r_i$ based on a similarity function, wherein the reward value $r_i$ is expressed as follows:

$$r_i = \operatorname{Sim}(a_i, a_0);$$

wherein $a_0$ is prior knowledge, and $a_i$ indicates different answers of the industrial question-answering model; wherein the reward value $r_i$ is determined as follows: extracting a first feature vector of the prior knowledge $a_0$; extracting a second feature vector corresponding to each of the different answers of the industrial question-answering model $a_i$; calculating, via the similarity function, a vector distance between the first feature vector and each of second feature vectors; and determining the reward value $r_i$ based on the vector distance; the similarity function includes a Euclidean distance and a cosine similarity; the first feature vector is a feature vector including a preset count of keywords extracted based on the prior knowledge; the second feature vector is a feature vector including a preset count of keywords extracted based on a second answer; the shorter the vector distance between the first feature vector and the second feature vector, the greater the reward value $r_i$;

S2, ranking a plurality of reward values corresponding to a plurality of outputs of the industrial question-answering model and training and updating network parameters of the reward model based on a ranking loss function; wherein the ranking loss function is expressed as follows:

$$\text{loss} = E\left[\log\left(\sigma(r_i - r_j)\right)\right] \quad \forall\, i, j;$$

wherein $r_i$ and $r_j$ are reward values corresponding to different texts, $\sigma$ is a Sigmoid function;

S3, training the industrial question-answering model, adding the plurality of reward values to a penalty term, and obtaining an optimal policy after performing a plurality of times of reinforcement learning on the industrial question-answering model using a reinforcement learning algorithm, wherein the reinforcement learning includes steps S31-S33:

S31, calculating the reward value $r_i$ based on the reward model in S1, adding the reward value to the penalty term, wherein a final reward value $R_i$ is expressed as follows:

$$R_i = r_i - \beta \log \frac{\pi_i}{\pi_0};$$

wherein $\beta$ is a coefficient of the penalty term, $\pi_i$ is a policy output of a last layer of the industrial question-answering model in a current iteration cascading a first linearly fully connected layer, and $\pi_0$ is a policy output of a last layer of an initial industrial question-answering model cascading a second linearly fully connected layer;

S32, training a plurality of times using the reinforcement learning until a reward function converges, including: performing, based on an Actor-Critic network, policy optimization on the reward model using a reinforcement learning proximal policy optimization (PPO) algorithm, the reinforcement learning PPO algorithm limiting an updated magnitude of a policy by a proximal ratio clipping loss, a reinforcement learning process of the industrial question-answering model based on the PPO algorithm including: inputting questions of known industrial question-answering into the initial industrial question-answering model or an iterative industrial question-answering model, obtaining log-probs of different first answers corresponding to a question that are output by the initial industrial question-answering model or the iterative industrial question-answering model; obtaining penalty terms corresponding to processed answers based on the log-probs; obtaining a final reward based on the penalty terms and a scalar reward; inputting the final reward into the industrial question-answering model based on the PPO algorithm, obtaining an output of the industrial question-answering model; and based on the output of the industrial question-answering model and a loss function, updating parameters of the industrial question-answering model until the industrial question-answering model reaches a preset performance or a count of iterations to obtain a trained industrial question-answering model, which as follows: $L_t(\theta) = \hat{E}_t[\min($ …
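
The ranking objective in S2 and the penalized reward in S31 can be sketched numerically. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, the rewards are assumed to arrive already sorted from best to worst, and the log-ratio log(π_i / π_0) is assumed to be computed as a difference of the log-probs mentioned in S32. The objective is evaluated with the sign written in the claim; pairwise-ranking implementations commonly minimize its negative, which is a convention not stated on this page.

```python
import math


def sigmoid(x):
    """Logistic function; adequate for the moderate values used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def ranking_objective(rewards_ranked):
    """Mean of log(sigmoid(r_i - r_j)) over all pairs with i ranked above j.

    Assumption: rewards_ranked is ordered from the best answer to the worst.
    """
    if len(rewards_ranked) < 2:
        raise ValueError("need at least two rewards to form a pair")
    terms = [
        math.log(sigmoid(rewards_ranked[i] - rewards_ranked[j]))
        for i in range(len(rewards_ranked))
        for j in range(i + 1, len(rewards_ranked))
    ]
    return sum(terms) / len(terms)


def final_reward(r_i, logprob_current, logprob_initial, beta):
    """R_i = r_i - beta * log(pi_i / pi_0), using log(a / b) = log(a) - log(b)."""
    return r_i - beta * (logprob_current - logprob_initial)
```

The objective is always negative and moves toward zero as the reward gaps agree more strongly with the ranking. In `final_reward`, a current policy that assigns more probability to an answer than the initial policy did has its reward reduced in proportion to `beta`.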

Assignees

Inventors

Classifications

  • G06N5/04 · Primary

    Inference or reasoning models · CPC title

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • Learning methods · CPC title

  • Combinations of networks · CPC title

  • G06N3/092 · Primary

    Reinforcement learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12566971B2 cover?
The present disclosure discloses a method for training an industrial question-answering model based on reinforcement learning and knowledge base matching, comprising: S1, training a reward model, and for an industrial knowledge question-answering, matching and comparing an output of an industrial question-answering model with a content of an industrial knowledge base, and generating a reward …
Who is the assignee on this patent?
Univ Nanjing Sci & Tech
What technology area does this patent fall under?
Primary CPC classification G06N5/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 3, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).