Method and apparatus for informative training repository building in sentiment analysis model learning and customization

US10824812B2 · US · B2

Patent metadata
Publication number: US-10824812-B2
Application number: US-201615175808-A
Country: US
Kind code: B2
Filing date: Jun 7, 2016
Priority date: Jun 7, 2016
Publication date: Nov 3, 2020
Grant date: Nov 3, 2020

Abstract

Official abstract text for this publication.

The methods, systems, and computer program products described herein provide ways to generate an informative training corpus of samples for use in machine training a high-quality sentiment analysis computer model. In some aspects, a method is disclosed including receiving a plurality of training samples, extracting semantic and sentiment elements of one or more of the training samples, generalizing the semantic and sentiment elements of the one or more of the training samples, generating an informative ranking score for one or more of the training samples based on the generalized semantic and sentiment elements, selecting informative training samples from the plurality of training samples based at least in part on the generated informative ranking scores, and adding the selected informative training samples to an informative training corpus.
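The steps named in the abstract (receive, extract, generalize, score, select, add) can be sketched end to end as follows. This is only an illustrative toy, not the patented implementation: the lexicons `ASPECT_OF` and `OPINION_GROUP_OF`, the `Sample` class, the function names, and the placeholder score (a count of distinct generalized elements) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical toy lexicons; the abstract does not say how elements are detected.
ASPECT_OF = {"battery": "power", "charger": "power", "screen": "display"}
OPINION_GROUP_OF = {"great": "positive", "sharp": "positive", "weak": "negative"}


@dataclass
class Sample:
    text: str
    aspects: set = field(default_factory=set)         # generalized aspect categories
    opinion_groups: set = field(default_factory=set)  # generalized opinion word groups
    score: float = 0.0


def extract_and_generalize(sample: Sample) -> None:
    """Find feature terms and opinion words, then map them to coarser groups."""
    for token in sample.text.lower().split():
        token = token.strip(".,!?")
        if token in ASPECT_OF:
            sample.aspects.add(ASPECT_OF[token])
        if token in OPINION_GROUP_OF:
            sample.opinion_groups.add(OPINION_GROUP_OF[token])


def build_corpus(texts, k):
    """Score every sample, keep the k highest-ranked ones, return their texts."""
    samples = [Sample(t) for t in texts]
    for s in samples:
        extract_and_generalize(s)
        # Placeholder score (assumption): number of distinct generalized elements.
        s.score = len(s.aspects) + len(s.opinion_groups)
    ranked = sorted(samples, key=lambda s: s.score, reverse=True)
    return [s.text for s in ranked[:k]]


corpus = build_corpus(
    [
        "The battery is weak but the screen is sharp.",
        "I bought it on Tuesday.",
        "Great screen.",
    ],
    k=2,
)
```

In this toy run the sentence with no feature terms or opinion words ranks last and is left out of the corpus, which is the intended effect of selecting by an informative ranking score.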

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: receiving, by at least one processor, a plurality of training samples; extracting, by at least one processor, semantic and sentiment elements of one or more of the training samples; generalizing, by at least one processor, the extracted semantic and sentiment elements of the one or more of the training samples; generating, by at least one processor, an informative ranking score for one or more of the training samples based on the generalized semantic and sentiment elements, the informative ranking score generated as a function of a weighted element density associated with said one or more of the training samples and a weighted semantic diversity associated with said one or more of the training samples, wherein the weighted semantic diversity is determined as a sum of, a weighted ratio between aspect categories of a given training sample and all training samples being considered, and a weighted ratio between extracted elements of the given training sample and said all training samples being considered; selecting, by at least one processor, informative training samples from the plurality of training samples based at least in part on the generated informative ranking scores; adding, by at least one processor, the selected informative training samples to an informative training corpus; and machine training a semantic analysis computer model using the informative training corpus.

2. The method of claim 1, wherein extracting semantic and sentiment elements from one or more of the training samples includes, for each of the one or more of the training samples: analyzing the training sample to identify an entity that is the subject of the training sample; analyzing the training sample to identify one or more feature terms of the training sample; and analyzing the training sample to identify one or more opinion words of the training sample.

3. The method of claim 2, wherein generalizing the extracted semantic and sentiment elements of the one or more of the training samples includes, for each of the one or more training samples: categorizing the identified entity and feature terms into one or more aspect categories; grouping the identified opinion words into one or more opinion word groups; and extracting a latent semantic association context structure for one or more of the aspect categories.

4. The method of claim 3, wherein generating the informative ranking score for the one or more of the training samples includes, for each of the one or more training samples: constructing a sentiment topic feature vector for the training sample based on at least one of the identified entity, the identified feature terms, the identified opinion words, the one or more aspect categories, the one or more opinion word groups, and the latent semantic association context for the one or more aspect categories; and generating an informative ranking score for the training sample based at least in part on the generated sentiment topic feature vector.

5. The method of claim 3, wherein the latent semantic association context structure comprises one or more context pairs, each pair comprising: at least one of the identified entity and one of the feature terms; and an opinion word.

6. The method of claim 1, wherein the extracting, generalizing, generating, selecting, and adding are performed autonomously.

7. A system, comprising: at least one hardware processor programmed to: receive a plurality of training samples; extract semantic and sentiment elements of one or more of the training samples; generalize the extracted semantic and sentiment elements of the one or more of the training samples; generate an informative ranking score for one or more of the training samples based on the generalized semantic and sentiment elements, the informative ranking score generated as a function of a weighted element density associated with said one or more of the training samples and a weighted semantic diversity associated with said one or more of the training samples, wherein the weighted semantic diversity is determined as a sum of, a weighted ratio between aspect categories of a given training sample and all training samples being considered, and a weighted ratio between extracted elements of the given training sample and said all training samples being considered; select informative training samples from the plurality of training samples based at least in part on the generated informative ranking scores; add the selected informative training samples to an informative training corpus; and a storage device coupled to the at least one hardware processor, wherein the at least one hardware processor is further programmed for machine training a semantic analysis computer model using the informative training corpus.

8. The system of claim 7, wherein to extract the semantic and sentiment elements from one or more of the training samples, for each of the one or more of the training samples, the at least one hardware processor: analyzes the training sample to identify an entity that is the subject of the training sample; analyzes the training sample to identify one or more feature terms of the training sample; and analyzes the training sample to identify one or more opinion words of the training sample.

9. The system of claim 8, wherein to generalize the extracted semantic and sentiment elements of the one or more of the training samples, for each of the one or more training samples, the at least one hardware processor: categorizes the identified entity and feature terms into one or more aspect categories; groups the identified opinion words into one or more opinion word groups; and extracts a latent semantic association context structure for one or more of the aspect categories.

10. The system of claim 9, wherein to generate the informative ranking score for the one or more of the training samples, for each of the one or more training samples, the at least one hardware processor: constructs a sentiment topic feature vector for the training sample based on at least one of the identified entity, the identified feature terms, the identified opinion words, the one or more aspect categories, the one or more opinion word groups, and the latent semantic association context for the one or more aspect categories; and generates an informative ranking score for the training sample based at least in part on the generated sentiment topic feature vector.

11. The system of claim 9, wherein the latent semantic association context structure comprises one or more context pairs, each context pair comprising: at least one of the identified entity and one of the feature terms; and an opinion word.

12. The system of claim 7, wherein the at least one hardware processor performs extracting, generalizing, generating, selecting, and adding autonomously.

13. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions readable by a processor to cause the processor to perform a method comprising: receiving a plurality of training samples; extracting semantic and sentiment elements of one or more of the training samples; generalizing the extracted semantic and sentiment elements of the one or more of the training samples; generating an informative ranking score for one or more of the training samples based on the generalized semantic and sentiment elements, the informative ranking score generated as a function of a weighted element density associated with said one or
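The scoring language in claim 1 can be read as a small formula, sketched below. The claim fixes only that semantic diversity is a sum of two weighted ratios (aspect categories and extracted elements of a sample versus all samples considered) and that the score is "a function of" density and diversity. Everything else here is an assumption for illustration: the weights and their default values, the definition of element density as elements per token, the choice of a weighted sum as the combining function, and the dictionary layout of a sample.

```python
def semantic_diversity(sample_aspects, sample_elements, all_aspects, all_elements,
                       w_aspect=0.5, w_element=0.5):
    """Sum of two weighted ratios, following the wording of claim 1."""
    aspect_ratio = len(sample_aspects) / len(all_aspects) if all_aspects else 0.0
    element_ratio = len(sample_elements) / len(all_elements) if all_elements else 0.0
    return w_aspect * aspect_ratio + w_element * element_ratio


def informative_score(sample, pool, w_density=0.5, w_diversity=0.5):
    """Combine element density and semantic diversity into one ranking score."""
    # Union of aspect categories and extracted elements over all samples considered.
    all_aspects = set().union(*(s["aspects"] for s in pool))
    all_elements = set().union(*(s["elements"] for s in pool))
    # Assumed density definition: extracted elements per token of the sample.
    n_tokens = sample["n_tokens"]
    density = len(sample["elements"]) / n_tokens if n_tokens else 0.0
    diversity = semantic_diversity(
        sample["aspects"], sample["elements"], all_aspects, all_elements
    )
    # Assumed combining function: a weighted sum.
    return w_density * density + w_diversity * diversity


# Two hypothetical samples with already extracted and generalized elements.
a = {"aspects": {"power", "display"},
     "elements": {"battery", "screen", "weak", "sharp"},
     "n_tokens": 8}
b = {"aspects": {"display"},
     "elements": {"screen", "great"},
     "n_tokens": 4}

score_a = informative_score(a, [a, b])
score_b = informative_score(b, [a, b])
```

Both samples have the same density under this definition, so sample `a` ranks higher only because it covers more of the pool's aspect categories and elements.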

Assignees

IBM

Inventors

Classifications

  • G06Q30/02 (Primary)

    Marketing; Price estimation or determination; Fundraising · CPC title

  • G06F40/30 (Primary)

    Semantic analysis · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10824812B2 cover?
The methods, systems, and computer program products described herein provide ways to generate an informative training corpus of samples for use in machine training a high-quality sentiment analysis computer model. In some aspects, a method is disclosed including receiving a plurality of training samples, extracting semantic and sentiment elements of one or more of the training samples, generali…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06Q30/02. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 3, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).