Machine-learning natural language processing classifier for content classification

US11907672B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11907672-B2
Application number: US-202016893146-A
Country: US
Kind code: B2
Filing date: Jun 4, 2020
Priority date: Jun 5, 2019
Publication date: Feb 20, 2024
Grant date: Feb 20, 2024

Abstract

Official abstract text for this publication.

Computer-readable media, systems and methods may improve classification of content based on a machine-learning natural language processing (ML-NLP) classifier. The system may train a general language model based on a general corpus, further train the general language model based on a domain-specific corpus to generate a domain-specific language model, and conduct supervised machine-learning based on the domain-specific language using topic-specific corpus labeled as relating to topics of interest to generate the ML-NLP classifier. Accordingly, the ML-NLP classifier may be trained on a general corpus, further trained on a domain-specific corpus, and fine-tuned on a topic-specific corpus. In this manner, domain-specific content may be classified into topics of interest. The ML-NLP classifier may classify content into the topics of interest.
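
The order of the three training stages in the abstract can be illustrated with a deliberately small sketch. This is not the patented implementation and not a neural network: `ToyLanguageModel`, `ToyTopicClassifier`, the sample sentences, and the inverse-frequency weighting are all hypothetical stand-ins, chosen only so the example runs with the Python standard library. It shows unsupervised training on a general corpus, continued unsupervised training of the same model on a domain-specific corpus, and supervised fine-tuning of an appended output layer on a labeled topic-specific corpus.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class ToyLanguageModel:
    """Hypothetical stand-in for a base model: it only tracks unigram counts."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def pretrain(self, corpus):
        # Unsupervised step: raw text only, no labels.
        for doc in corpus:
            tokens = tokenize(doc)
            self.counts.update(tokens)
            self.total += len(tokens)

    def weight(self, token):
        # Assumed weighting: rarer tokens score higher (add-one smoothing).
        return math.log((self.total + 1) / (self.counts[token] + 1))


class ToyTopicClassifier:
    """Base model plus an appended 'output layer' of per-topic token weights."""

    def __init__(self, base, topics):
        self.base = base
        self.head = {topic: Counter() for topic in topics}

    def fine_tune(self, labeled_corpus):
        # Supervised step: each example is (text, set of topic labels).
        for text, labels in labeled_corpus:
            for token in tokenize(text):
                for topic in labels:
                    self.head[topic][token] += self.base.weight(token)

    def predict(self, text):
        # Returns normalized scores per topic, not calibrated probabilities.
        tokens = tokenize(text)
        scores = {
            topic: sum(weights[t] for t in tokens)
            for topic, weights in self.head.items()
        }
        total = sum(scores.values())
        if total == 0:
            return {topic: 0.0 for topic in scores}
        return {topic: score / total for topic, score in scores.items()}


# Made-up corpora, for illustration only.
general_corpus = ["the cat sat on the mat", "the market opened and the sun rose"]
domain_corpus = ["the company reported quarterly earnings",
                 "the board approved the dividend"]
topic_corpus = [
    ("the company cut carbon emissions", {"environmental"}),
    ("the company improved worker safety", {"social"}),
    ("the board replaced the chief executive", {"governance"}),
]

base = ToyLanguageModel()
base.pretrain(general_corpus)   # stage 1: general corpus
base.pretrain(domain_corpus)    # stage 2: domain-specific corpus, same model
classifier = ToyTopicClassifier(base, ["environmental", "social", "governance"])
classifier.fine_tune(topic_corpus)  # stage 3: labeled topic-specific corpus

scores = classifier.predict("carbon emissions fell")
best_topic = max(scores, key=scores.get)
```

In a real system the base model would be a pretrained transformer and the output layer a trainable classification head; the sketch keeps only the sequencing of the stages and the fact that labels enter in the last stage alone.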

First claim

Opening claim text (preview).

What is claimed is:

1. A computer system to classify content based on a machine-learning natural language processing (ML-NLP) classifier, the computer system comprising: a processor programmed to: receive content comprising natural language text; access an identification of a plurality of topics of interest, which are received from a user, that are associated with a domain; provide an input based on the content to the ML-NLP classifier, the ML-NLP classifier being based on a base model having a plurality of layers that is pre-trained on a general corpus of data, and then further pre-trained a domain-specific corpus of data specific to the domain, wherein at least one output layer for performing a classification task that predicts whether the content relates to one of the plurality of topics of interest is appended to the base model, wherein the base model with the at least one appended output layer is further trained on a topic-specific corpus of data labeled according to the plurality of topics of interest, wherein the general corpus of data, the domain-specific corpus of data, and the topic-specific corpus of data are independent of one another; generate, as an output of the ML-NLP classifier using the at least one output layer for performing the classification task, a prediction that the content relates to a corresponding topic from among the plurality of topics of interest; and generate a report based on the one or more classifications.

2. The system of claim 1, wherein the processor is further programmed to: determine that the content is related to at least a first topic of interest and a second topic of interest.

3. The system of claim 1, wherein the processor is further programmed to: reclassify the content into one or more of a second plurality of topics of interest.

4. The system of claim 1, wherein the ML-NLP classifier comprises a bidirectional encoder representations from transformers model, the domain-specific corpus of data relates to a financial domain, and the topic-specific corpus relates to environmental, social, or governance.

5. The system of claim 1, wherein the domain-specific corpus of data is in a first language, and where the processor is further programmed to: translate, based on a machine translation, the domain-specific corpus of data from the first language to a second language; translate, via the machine translation, the domain-specific corpus of data from the second language back to the first language to generate a second version of the domain-specific corpus of data, wherein the second version of the domain-specific corpus of data is different than the domain-specific corpus of data; and expand the domain-specific corpus of data based on the second version of the domain-specific corpus of data.

6. The system of claim 1, wherein the processor is further programmed to: receive an identification of one or more new topics of interest; update the plurality of topics of interest based on the one or more new topics of interest; and retrain the ML-NLP classifier based on the updated plurality of topics.

7. The system of claim 1, wherein the processor is further programmed to: access a relevance score that indicates a level of relevance of the content to an entity of interest; and weight the respective probabilities based on the relevance score.

8. The system of claim 1, wherein the processor is further programmed to: receive, via a graphical user interface, an identification of at least a first topic; identify, one or more content that was classified into the first topic; and provide data based on the identified one or more content.

9. The system of claim 1, wherein the processor is further programmed to: receive, from a manual curator, feedback comprising an indication of whether or not the content was correctly identified as being related to a particular topic of interest from among the plurality of topics of interest; and add the content to the topic-specific corpus as labeled training data based on the indication.

10. The system of claim 1, wherein the processor is further programmed to: for each of the one or more classifications of the content: compare the respective probability that the content relates to the corresponding topic with a confidence threshold, and determine whether the content relates to the corresponding topic based on the comparison; and populate the report based on the determinations.

11. The system of claim 10, wherein to determine whether the content relates to the corresponding topic based on the comparison, the processor is further programmed to: determine that the respective probability meets or exceeds a first confidence threshold; and automatically classify the content as relating to the corresponding topic based on the determination that the content meets or exceeds the first confidence threshold.

12. The system of claim 10, wherein to determine whether the content relates to the corresponding topic based on the comparison, the processor is further programmed to: determine that the respective probability does not meet or exceed a first confidence threshold but meets or exceeds a second confidence threshold lower than the first confidence threshold; and flag the content to be verified as relating to the corresponding topic based on the determination that the respective probability does not meet or exceed the first confidence threshold but meets or exceeds the second confidence threshold.

13. The system of claim 10, wherein the content relates to an entity, and wherein the processor is further programmed to: receive a request for contents relating to the entity and the plurality of topics of interest; determine that the content relates to the entity, wherein the report is generated responsive to the request; and transmit the report to a user.

14. The system of claim 10, wherein to determine whether the content relates to the corresponding topic based on the comparison, the processor is further programmed to: determine that the respective probability does not meet or exceed a first confidence threshold and does not meet or exceed a second confidence threshold lower than the first confidence threshold; and determine that the content does not relate to the corresponding topic based on the determination that the respective probability does not meet or exceed the first confidence threshold and does not meet exceed the second confidence threshold.

15. The system of claim 14, wherein the processor is further programmed to: determine that none of the respective probabilities meet or exceed the second confidence threshold; and remove the content from being considered for a pool of content determined to be relevant to the plurality of topics of interest.

16. A non-transitory computer-readable medium that stores instructions to classify content based on a machine-learning natural language processing (ML-NLP) classifier, the instructions, when executed by a processor, program the processor to: access a general corpus of data comprising text; train the ML-NLP classifier based on a base model having a plurality of layers that is pre-trained on the general corpus of data; access a domain-specific corpus of data comprising text relating to a specific domain; further train the ML-NLP classifier based on the domain-specific corpus of data; access a topic-specific corpus of data labeled according to a plurality of topics of interest, wherein at least one output layer for performing a classification task that predicts whether the content relates to one of the plurality of topics of interes
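
The two-threshold decision logic described in claims 10 to 12, 14, and 15, together with the relevance weighting of claim 7, can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names, the threshold values 0.8 and 0.5, the sample probabilities, and the choice to apply the relevance score by multiplication are all hypothetical, since the claims do not fix any of them.

```python
AUTO = "auto_classified"        # meets or exceeds the first threshold
REVIEW = "flagged_for_review"   # between the second and first thresholds
REJECT = "not_related"          # below the second threshold


def triage(probability, first_threshold, second_threshold):
    """Map one topic probability to a decision using two thresholds."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be lower than the first")
    if probability >= first_threshold:
        return AUTO
    if probability >= second_threshold:
        return REVIEW
    return REJECT


def triage_content(probabilities, first_threshold=0.8, second_threshold=0.5,
                   relevance=1.0):
    """Triage every topic for one piece of content.

    Returns (decisions, keep). `keep` is False when no topic reaches the
    second threshold, in which case the content would leave the pool.
    Multiplying by `relevance` is an assumed way to weight probabilities.
    """
    decisions = {
        topic: triage(p * relevance, first_threshold, second_threshold)
        for topic, p in probabilities.items()
    }
    keep = any(decision != REJECT for decision in decisions.values())
    return decisions, keep


# Made-up classifier outputs for one document.
probabilities = {"environmental": 0.91, "social": 0.62, "governance": 0.12}

decisions, keep = triage_content(probabilities)
weighted_decisions, weighted_keep = triage_content(probabilities, relevance=0.5)
```

With full relevance the document is auto-classified for one topic and flagged for manual verification on another, so it stays in the pool. Halving the relevance pushes every weighted probability below the second threshold, and the document is dropped.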

Assignees

Refinitiv US Organization LLC

Inventors

Classifications

  • Supervised learning · CPC title

  • Transfer learning · CPC title

  • Feedforward networks · CPC title

  • G06F40/40 (Primary)

    Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • Combinations of networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11907672B2 cover?
Computer-readable media, systems and methods may improve classification of content based on a machine-learning natural language processing (ML-NLP) classifier. The system may train a general language model based on a general corpus, further train the general language model based on a domain-specific corpus to generate a domain-specific language model, and conduct supervised machine-learning bas…
Who is the assignee on this patent?
Refinitiv US Organization LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/40. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 20, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).