Evaluating text classifier parameters based on semantic features

US10078688B2 · US · B2

Patent metadata
Field | Value
Publication number | US-10078688-B2
Application number | US-201615157722-A
Country | US
Kind code | B2
Filing date | May 18, 2016
Priority date | Apr 12, 2016
Publication date | Sep 18, 2018
Grant date | Sep 18, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for evaluating text classifier parameters based on semantic features. An example method comprises: performing a semantico-syntactic analysis of a natural language text of a corpus of natural language texts to produce a semantic structure representing a set of semantic classes; identifying a natural language text feature to be extracted using a set of values of a plurality of feature extraction parameters; partitioning the corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts; determining, in view of the category of the training data set, the set of values of the feature extraction parameters; validating the set of values of the feature extraction parameters using the validation data set.
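The partition-then-validate loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: the toy corpus, the `(semantic_class, level)` representation standing in for a semantic structure, the single feature extraction parameter `max_level`, and the simple overlap-based classifier.

```python
from collections import Counter

# Hypothetical toy corpus. Each text is pre-reduced to (semantic_class, level)
# pairs, standing in for the semantic structure that a real semantico-syntactic
# analyzer would produce; the labels are the text categories.
CORPUS = [
    ([("ORGANIZATION", 1), ("BANK", 2)], "finance"),
    ([("ORGANIZATION", 1), ("LOAN", 2)], "finance"),
    ([("ORGANIZATION", 1), ("TEAM", 2)], "sport"),
    ([("ORGANIZATION", 1), ("MATCH", 2)], "sport"),
    ([("ORGANIZATION", 1), ("BANK", 2)], "finance"),
    ([("ORGANIZATION", 1), ("TEAM", 2)], "sport"),
]


def partition(corpus, n_validation):
    """Split into training and validation sets (no shuffling, to stay deterministic)."""
    return corpus[:-n_validation], corpus[-n_validation:]


def extract_features(structure, max_level):
    """Count semantic classes found at levels 1..max_level of the structure."""
    return Counter(cls for cls, level in structure if level <= max_level)


def train(training_set, max_level):
    """Build one summed feature profile per category."""
    profiles = {}
    for structure, category in training_set:
        profiles.setdefault(category, Counter()).update(
            extract_features(structure, max_level)
        )
    return profiles


def classify(profiles, structure, max_level):
    """Return the category whose profile overlaps most with the text's features."""
    feats = extract_features(structure, max_level)
    scores = {
        cat: sum(min(n, prof[cls]) for cls, n in feats.items())
        for cat, prof in profiles.items()
    }
    # Ties are broken alphabetically so the result is reproducible.
    return max(sorted(scores), key=scores.get)


def select_max_level(training_set, validation_set, candidates):
    """Pick the candidate value that classifies the most validation texts correctly."""
    best = None
    for max_level in candidates:
        profiles = train(training_set, max_level)
        correct = sum(
            classify(profiles, s, max_level) == cat for s, cat in validation_set
        )
        if best is None or correct > best[1]:
            best = (max_level, correct)
    return best


training_set, validation_set = partition(CORPUS, n_validation=2)
best_level, best_correct = select_max_level(training_set, validation_set, [1, 2, 3])
print(best_level, best_correct)
```

In this toy data the first level holds only a generic class shared by every text, so the search has to include the second level before the validation texts can be told apart. A real system would shuffle or cross-validate the split rather than slice the corpus in order.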

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: identifying a plurality of feature extraction parameters of a text classifier model, wherein the plurality of feature extraction parameters comprises a first attribute of a first semantic class and a second attribute of a second semantic class, wherein a value of the second attribute is produced by applying a pre-defined transformation to a value of the first attribute; partitioning a corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts; determining, in view of the training data set, a set of values of the feature extraction parameters, which maximizes a number of natural language texts of the validation data set that are classified correctly by the text classifier model using the set of values of the feature extraction parameters; performing, by a processing device, a semantico-syntactic analysis of an input natural language text to produce a semantic structure representing a set of semantic classes; producing a plurality of values by applying, to the semantic structure representing the input natural language text, the text classifier model using the set of values of the feature extraction parameters, wherein each value of the plurality of values reflects a degree of association of the input natural language text with a particular category of natural language texts; associating the input natural language text with a category corresponding to an optimal value among the plurality of values; and utilizing the category to perform a natural language processing task.

2. The method of claim 1, wherein partitioning the corpus of natural language texts comprises cross-validating the first plurality of natural language texts and the second plurality of natural language texts.

3. The method of claim 1, further comprising: validating the set of values of the feature extraction parameters to produce a quasi-optimal set of values relative to a third plurality of natural language texts.

4. The method of claim 1, wherein the plurality of feature extraction parameters comprises a number of levels of the semantic structure to be analyzed by the text classifier model.

5. The method of claim 1, wherein applying the pre-defined transformation comprises multiplying the value of the first attribute by a pre-defined multiplier.

6. A method, comprising: identifying a plurality of hyper-parameters of a text classifier model, wherein the plurality of hyper-parameters include a number of nearest neighbors to be analyzed by the text classifier model; partitioning a corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts; determining, in view of the training data set, a set of values of the hyper-parameters of the text classifier model, which maximizes a number of natural language texts of the validation data set that are classified correctly by the text classifier model using the set of values of the hyper-parameters; performing, by a processing device, a semantico-syntactic analysis of an input natural language text to produce a semantic structure representing a set of semantic classes; producing a plurality of values by applying, to the semantic structure representing the input natural language text, the text classifier model using the set of values of the hyper-parameters, wherein each value of the plurality of values reflects a degree of association of the input natural language text with a particular category of natural language texts; associating the input natural language text with a category corresponding to an optimal value among the plurality of values; and utilizing the category to perform a natural language processing task.

7. The method of claim 6, wherein partitioning the corpus of natural language texts comprises cross-validating the first plurality of natural language texts and the second plurality of natural language texts.

8. The method of claim 6, further comprising: validating the set of values of the feature extraction parameters to produce a quasi-optimal set of values relative to a third plurality of natural language texts.

9. The method of claim 6, wherein the plurality of hyper-parameters further comprises a regularization parameter of the text classifier model.

10. The method of claim 6, wherein determining the set of values of the hyper-parameters of the text classifier model further comprises: modifying, using a pre-defined transformation, values of one or more hyper-parameters of the text classifier model to produce a modified set of values of the hyper-parameters; evaluating the number of natural language texts of the validation data set that are classified correctly by the text classifier model using the modified set of values of the hyper-parameters; responsive to determining that the number of natural language texts falls below a threshold number, repeating the modifying operation.

11. A system, comprising: a memory; a processor, coupled to the memory, the processor configured to: identify a plurality of feature extraction parameters of a text classifier model, wherein the plurality of feature extraction parameters comprises a first attribute of a first semantic class and a second attribute of a second semantic class, wherein a value of the second attribute is produced by applying a pre-defined transformation to a value of the first attribute; partition a corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts; determine, in view of the training data set, a set of values of the feature extraction parameters, which maximizes a number of natural language texts of the validation data set that are classified correctly by the text classifier model using the set of values of the feature extraction parameters; perform a semantico-syntactic analysis of an input natural language text to produce a semantic structure representing a set of semantic classes; produce a plurality of values by applying, to the semantic structure representing the input natural language text, the text classifier model using the set of values of the feature extraction parameters, wherein each value of the plurality of values reflects a degree of association of the input natural language text with a particular category of natural language text; associate the input natural language text with a category corresponding to an optimal value among the plurality of values; and utilize the category to perform a natural language processing task.

12. The system of claim 11, wherein partitioning the corpus of natural language texts comprises cross-validating the first plurality of natural language texts and the second plurality of natural language texts.

13. The system of claim 11, wherein the plurality of feature extraction parameters comprises a number of levels of the semantic structure to be analyzed by the text classifier model.

14. A computer-readable non-transitory storage medium comprising executable instructions that, when executed by a computer system, cause the computer system to: identify a plurality of hyper-parameters of a text classifier model, wherein the plurality of hyper-parameters include a number of nearest neighbors to be analyzed by the text classifier model; partition a corpus of natural language texts into a training data set comprising a first plurality of
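The hyper-parameter claims (a number of nearest neighbors, repeated modification until a threshold of correctly classified validation texts is reached, and association with the category holding the optimal value) can be sketched loosely as follows. This is an illustration under stated assumptions, not the patented implementation: the texts are reduced to hypothetical sets of semantic class names, Jaccard similarity is a swapped-in distance measure, the "pre-defined transformation" is assumed to be adding a fixed step to `k`, and the `max_k` bound is added only so the loop always terminates.

```python
from collections import Counter

# Hypothetical training data: each text is a set of semantic class names.
# The last item is deliberately mislabeled to show why k = 1 can fail.
TRAINING = [
    ({"BANK", "LOAN"}, "finance"),
    ({"BANK", "MONEY"}, "finance"),
    ({"LOAN", "MONEY"}, "finance"),
    ({"TEAM", "MATCH"}, "sport"),
    ({"TEAM", "GOAL"}, "sport"),
    ({"MATCH", "GOAL"}, "sport"),
    ({"BANK", "LOAN", "MONEY"}, "sport"),  # noisy label
]
VALIDATION = [
    ({"BANK", "LOAN", "MONEY"}, "finance"),
    ({"TEAM", "MATCH", "GOAL"}, "sport"),
]


def jaccard(a, b):
    """Similarity of two sets of semantic classes (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def category_values(training_set, classes, k):
    """Return one value per category: its share of the k nearest training texts."""
    nearest = sorted(
        training_set, key=lambda item: jaccard(item[0], classes), reverse=True
    )[:k]
    votes = Counter(category for _, category in nearest)
    return {category: n / k for category, n in votes.items()}


def associate(values):
    """Pick the category with the optimal (here: largest) value; ties go alphabetically."""
    return max(sorted(values), key=values.get)


def count_correct(training_set, validation_set, k):
    """Number of validation texts classified correctly with k neighbors."""
    return sum(
        associate(category_values(training_set, classes, k)) == category
        for classes, category in validation_set
    )


def tune_k(training_set, validation_set, threshold, k=1, step=2, max_k=None):
    """Modify k until enough validation texts are classified correctly."""
    max_k = max_k or len(training_set)
    correct = count_correct(training_set, validation_set, k)
    while correct < threshold and k + step <= max_k:
        k += step  # assumed pre-defined transformation: add a fixed step
        correct = count_correct(training_set, validation_set, k)
    return k, correct


best_k, correct = tune_k(TRAINING, VALIDATION, threshold=2)
print(best_k, correct)
print(category_values(TRAINING, {"BANK", "LOAN", "MONEY"}, best_k))
```

With a single neighbor the mislabeled training text decides the outcome for the first validation text, so the count stays below the threshold and `k` is modified once more. The per-category shares returned by `category_values` play the role of the "plurality of values" in the claims.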

Assignees

Abbyy Infopoisk Llc, Abbyy Production Llc

Inventors

Classifications

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • Morphological analysis · CPC title

  • G06F16/36 · Primary

    Creation of semantic tools, e.g. ontology or thesauri · CPC title

  • Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars · CPC title

  • Semantic analysis · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10078688B2 cover?
Systems and methods for evaluating text classifier parameters based on semantic features. An example method comprises: performing a semantico-syntactic analysis of a natural language text of a corpus of natural language texts to produce a semantic structure representing a set of semantic classes; identifying a natural language text feature to be extracted using a set of values of a plurality of…
Who is the assignee on this patent?
Abbyy Infopoisk Llc, Abbyy Production Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/36. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 18, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).