Active learning for concept disambiguation

US11636376B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11636376-B2
Application number: US-201815996491-A
Country: US
Kind code: B2
Filing date: Jun 3, 2018
Priority date: Jun 3, 2018
Publication date: Apr 25, 2023
Grant date: Apr 25, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, computer system, and a computer program product for active machine learning is provided. The present invention may include annotating a plurality of data entries. The present invention may also include building a first dataset based on the annotated plurality of data entries. The present invention may then include receiving user feedback based on the built first dataset. The present invention may further include assigning a plurality of weights to a plurality of data entry subsets. The present invention may also include generating a second weighted dataset based on the received user feedback.
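The flow in the abstract (annotate, build a first dataset, collect feedback, weight, build a second dataset) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `LEXICON` table, the `Row` fields, the `feedback_weight` value of 3.0, and the choice to drop rows that contradict the reviewer.

```python
from dataclasses import dataclass

# Hypothetical lexicon (illustrative only): surface term -> candidate concepts.
LEXICON = {
    "cold": ["common_cold", "low_temperature"],
    "fever": ["fever"],
}


@dataclass
class Row:
    entry: str      # the source data entry
    term: str       # the annotated term found in the entry
    concept: str    # one candidate meaning for the term
    weight: float   # training weight for this row


def annotate(entries):
    """Rule-based annotation: one row per (entry, term, candidate concept)."""
    rows = []
    for entry in entries:
        for token in entry.lower().split():
            term = token.strip(".,;:!?")
            for concept in LEXICON.get(term, []):
                rows.append(Row(entry, term, concept, weight=1.0))
    return rows


def apply_feedback(first_dataset, feedback, feedback_weight=3.0):
    """Build a second dataset in which reviewer-confirmed rows weigh more.

    feedback maps (entry, term) -> the concept chosen by the reviewer.
    Assumption: rows that contradict the reviewer are dropped.
    """
    second = []
    for row in first_dataset:
        chosen = feedback.get((row.entry, row.term))
        if chosen is None:
            second.append(row)  # no feedback: keep the original weight
        elif chosen == row.concept:
            second.append(Row(row.entry, row.term, row.concept, feedback_weight))
    return second
```

In this sketch a reviewer who resolves "cold" to `common_cold` for one entry leaves a single, more heavily weighted row for that term, while rows that received no feedback keep their default weight.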

First claim

Opening claim text (preview).

What is claimed is:
1. A method for generating ground truth using active machine learning, the method comprising: annotating a plurality of data entries using rule-based natural language processing; parsing the plurality of data entries into a first dataset that includes entities, features, and classifications; building, using a bootstrap aggregation, the first dataset based on the annotated plurality of data entries using coreference resolution and entity analysis, wherein each row in the first dataset represents an annotation from the annotated plurality of data entries; receiving user feedback based on the built first dataset in response to detecting an ambiguity associated with a data entry in the built first dataset, wherein the ambiguity indicates the data entry comprises more than one meaning; assigning a plurality of weights to a plurality of data entry subsets; generating a second weighted dataset that is weighted higher than the first dataset because the second weighted dataset is based on the received user feedback; and transmitting the second weighted dataset to create a trained model.
2. The method of claim 1, wherein the plurality of data entries are derived from a source, and wherein the source is selected from a group consisting of a database, a corpus, a knowledgebase or an individual.
3. The method of claim 1, wherein the second weighted dataset includes the plurality of data entry subsets that create a machine learning model.
4. The method of claim 1, wherein the first dataset is data obtained by a domain specific logic and natural language processing of the plurality of data entries.
5. The method of claim 1, wherein the user feedback is created by a subject matter expert (SME) based on rule-based logic applied to the first dataset.
6. The method of claim 1, wherein the second weighted dataset is ground truth data that include features selected from a group consisting of an ambiguous entity, an entity characteristic, an NLP trigger, an NLP trigger distance, an NLP parse tree characteristic and a plurality of a parts of speech tag.
7. The method of claim 1, wherein the higher weight is a more accurate plurality of data.
8. A computer system for active machine learning, comprising: one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage media, and program instructions stored on at least one of the one or more computer-readable tangible storage media for execution by at least one of the one or more processors via at least one of the one or more computer-readable memories, wherein the computer system is capable of performing a method comprising: annotating a plurality of data entries using rule-based natural language processing; parsing the plurality of data entries into a first dataset that includes entities, features, and classifications; building, using a bootstrap aggregation, the first dataset based on the annotated plurality of data entries using coreference resolution and entity analysis, wherein each row in the first dataset represents an annotation from the annotated plurality of data entries; receiving user feedback based on the built first dataset in response to detecting an ambiguity associated with a data entry in the built first dataset, wherein the ambiguity indicates the data entry comprises more than one meaning; assigning a plurality of weights to a plurality of data entry subsets; generating a second weighted dataset that is weighted higher than the first dataset because the second weighted dataset is based on the received user feedback; and transmitting the second weighted dataset to create a trained model.
9. The computer system of claim 8, wherein the plurality of data entries are derived from a source, and wherein the source is selected from a group consisting of a database, a corpus, a knowledgebase or an individual.
10. The computer system of claim 8, wherein the second weighted dataset includes the plurality of data entry subsets that create a machine learning model.
11. The computer system of claim 8, wherein the first dataset is data obtained by a domain specific logic and natural language processing of the plurality of data entries.
12. The computer system of claim 8, wherein the user feedback is created by a subject matter expert (SME) based on rule-based logic applied to the first dataset.
13. The computer system of claim 8, wherein the second weighted dataset is ground truth data that include features selected from a group consisting of an ambiguous entity, an entity characteristic, an NLP trigger, an NLP trigger distance, an NLP parse tree characteristic and a plurality of a parts of speech tag.
14. The computer system of claim 8, wherein the higher weight is a more accurate plurality of data.
15. A computer program product for active machine learning, comprising: one or more computer-readable tangible storage media and program instructions stored on at least one of the one or more computer-readable tangible storage media, the program instructions executable by a processor to cause the processor to perform a method comprising: annotating a plurality of data entries using rule-based natural language processing; parsing the plurality of data entries into a first dataset that includes entities, features, and classifications; building, using a bootstrap aggregation, the first dataset based on the annotated plurality of data entries using coreference resolution and entity analysis, wherein each row in the first dataset represents an annotation from the annotated plurality of data entries; receiving user feedback based on the built first dataset in response to detecting an ambiguity associated with a data entry in the built first dataset, wherein the ambiguity indicates the data entry comprises more than one meaning; assigning a plurality of weights to a plurality of data entry subsets; generating a second weighted dataset that is weighted higher than the first dataset because the second weighted dataset is based on the received user feedback; and transmitting the second weighted dataset to create a trained model.
16. The computer program product of claim 15, wherein the plurality of data entries are derived from a source, and wherein the source is selected from a group consisting of a database, a corpus, a knowledgebase or an individual.
17. The computer program product of claim 15, wherein the second weighted dataset includes the plurality of data entry subsets that create a machine learning model.
18. The computer program product of claim 15, wherein the first dataset is data obtained by a domain specific logic and natural language processing of the plurality of data entries.
19. The computer program product of claim 15, wherein the user feedback is created by a subject matter expert (SME) based on rule-based logic applied to the first dataset.
20. The computer program product of claim 15, wherein the second weighted dataset is ground truth data that include features selected from a group consisting of an ambiguous entity, an entity characteristic, an NLP trigger, an NLP trigger distance, an NLP parse tree characteristic and a plurality of a parts of speech tag.
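Claim 1 combines bootstrap aggregation with detection of entries that carry more than one meaning. The claim text does not say how the two interact, so the Python sketch below is only one possible reading: annotation rows are resampled with replacement, each resample votes for the most common concept per term, and a term is flagged for expert review when the aggregated votes name more than one concept. The function names, the `(term, concept)` row shape, and the `n_rounds` and `seed` parameters are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def bagged_votes(rows, n_rounds=25, seed=0):
    """Bootstrap aggregation over annotation rows.

    rows is a list of (term, concept) tuples. Each round resamples the rows
    with replacement, picks the most common concept per term within that
    resample, and casts one vote for it.
    """
    rng = random.Random(seed)
    votes = defaultdict(Counter)
    for _ in range(n_rounds):
        sample = [rng.choice(rows) for _ in rows]
        per_term = defaultdict(Counter)
        for term, concept in sample:
            per_term[term][concept] += 1
        for term, counts in per_term.items():
            winner = counts.most_common(1)[0][0]
            votes[term][winner] += 1
    return votes


def ambiguous_terms(votes):
    """Flag a term when the aggregated votes name more than one meaning."""
    return {term for term, counts in votes.items() if len(counts) > 1}
```

Terms returned by `ambiguous_terms` would be the ones routed to a subject matter expert in this reading; terms whose votes all agree would pass through without review.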

Assignees

IBM

Inventors

Classifications

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • Extracting rules from data · CPC title

  • Natural language analysis (semantic analysis of natural language G06F40/30) · CPC title

  • G06F40/169 · Primary

    Annotation, e.g. comment data or footnotes · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11636376B2 cover?
A method, computer system, and a computer program product for active machine learning is provided. The present invention may include annotating a plurality of data entries. The present invention may also include building a first dataset based on the annotated plurality of data entries. The present invention may then include receiving user feedback based on the built first dataset. The present i…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 25, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).