Cross-language text classification

US9588958B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9588958-B2
Application number  US-201213535638-A
Country             US
Kind code           B2
Filing date         Jun 28, 2012
Priority date       Oct 10, 2006
Publication date    Mar 7, 2017
Grant date          Mar 7, 2017

Abstract

Official abstract text for this publication.

Methods are described for performing classification (categorization) of text documents written in various languages. Language-independent semantic structures are constructed before classifying documents. These structures reflect lexical, morphological, syntactic, and semantic properties of documents. The methods suggested are able to perform cross-language text classification which is based on document properties reflecting their meaning. The methods are applicable to genre classification, topic detection, news analysis, authorship analysis, etc.
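The core idea in the abstract, that documents in different languages can share one feature space once they are reduced to language-independent semantic structures, can be illustrated with a minimal sketch. The lexicon, the class labels, and the `semantic_features` helper below are all hypothetical, and a plain dictionary lookup stands in for the lexical, morphological, syntactic, and semantic analysis the patent describes; it is not the patented method.

```python
from collections import Counter

# Hypothetical toy lexicon: surface words in two languages mapped to shared,
# language-independent semantic class labels. The patent derives such
# structures from full syntactic and semantic analysis, not a lookup table.
LEXICON = {
    "en": {
        "bank": "FINANCIAL_ORGANIZATION",
        "loan": "CREDIT",
        "goal": "SPORT_SCORE",
        "team": "SPORT_TEAM",
    },
    "de": {
        "bank": "FINANCIAL_ORGANIZATION",
        "kredit": "CREDIT",
        "tor": "SPORT_SCORE",
        "mannschaft": "SPORT_TEAM",
    },
}


def semantic_features(text: str, lang: str) -> Counter:
    """Count semantic class labels for the words of `text` found in the lexicon."""
    table = LEXICON[lang]
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return Counter(table[w] for w in words if w in table)


english = semantic_features("The bank approved the loan.", "en")
german = semantic_features("Die Bank genehmigte den Kredit.", "de")

# Both sentences reduce to the same language-independent feature bag.
print(english == german)  # prints True
```

Because both sentences produce identical features, a classifier trained on documents in one language could, under these assumptions, be applied to documents in the other.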

First claim

Opening claim text (preview).

We claim:

1. A method of performing text classification based on language-independent text features, the method comprising:
    performing, by a processor, a first syntactic and semantic analysis of a training natural language text to produce a first plurality of language-independent semantic structures representing a plurality of sentences of the training natural language text;
    producing, based on the first plurality of language-independent semantic structures, a text classifier model;
    performing a second syntactic and semantic analysis of an input natural language text to produce a second plurality of language-independent semantic structures representing a plurality of sentences of the input natural language text;
    extracting, using the second plurality of language-independent semantic structures, a set of features, wherein at least one feature references a semantic class of a language-independent semantic hierarchy comprising a plurality of semantic classes, in which the semantic class exhibits one or more properties inherited from its parent semantic class;
    applying the text classifier model to the set of features to produce a classification spectrum comprising a plurality of weight values, wherein each weight value reflects a degree of association of the input natural language text with a particular category of natural language texts; and
    associating the input natural language text with one or more categories using the classification spectrum.

2. The method of claim 1, wherein the second syntactic and semantic analysis further includes determining a grammatical feature of the input natural language text.

3. The method of claim 1, wherein the second syntactic and semantic analysis further includes determining a lexical feature of the input natural language text.

4. The method of claim 1, wherein the second syntactic and semantic analysis further includes determining a syntactic feature of the input natural language text.

5. The method of claim 1, wherein the second syntactic and semantic analysis further includes determining a semantic feature of the input natural language text.

6. The method of claim 1, wherein the second syntactic and semantic analysis further includes generating a syntactic structure of a sentence of the input natural language text.

7. The method of claim 1, wherein the categories are represented by language independent categories.

8. A non-transitory computer readable storage medium comprising executable instructions for causing a computing system to perform operations comprising:
    performing a first syntactic and semantic analysis of a training natural language text to produce a first plurality of language-independent semantic structures representing a plurality of sentences of the training natural language text;
    producing, based on the first plurality of language-independent semantic structures, a text classifier model;
    performing a second syntactic and semantic analysis of an input natural language text to produce a second plurality of language-independent semantic structures representing a plurality of sentences of the input natural language text;
    extracting, using the second plurality of language-independent semantic structures, a set of features, wherein at least one feature references a semantic class of a language-independent semantic hierarchy comprising a plurality of semantic classes, in which the semantic class exhibits one or more properties inherited from its parent semantic class;
    applying the text classifier model to the set of features to produce a classification spectrum comprising a plurality of weight values, wherein each weight value references a degree of association of the input natural language text with a particular category of natural language texts; and
    associating the input natural language text with one or more categories using the classification spectrum.

9. The non-transitory computer readable storage medium of claim 8, wherein the second syntactic and semantic analysis further includes determining a grammatical feature of the input natural language text.

10. The non-transitory computer readable medium of claim 8, wherein the second syntactic and semantic analysis further includes determining a lexical feature of the input natural language text.

11. The non-transitory computer readable medium of claim 8, wherein the second syntactic and semantic analysis further includes determining a syntactic feature of the input natural language text.

12. The non-transitory computer readable medium of claim 8, wherein the second syntactic and semantic analysis further includes determining a semantic feature of the input natural language text.

13. The non-transitory computer readable medium of claim 8, wherein the second syntactic and semantic analysis further includes generating a syntactic structure of a sentence of the input natural language text.

14. The non-transitory computer readable medium of claim 8, wherein the categories are represented by language independent categories.

15. A computer system adapted to perform text classification based on language-independent text features, the computer system comprising:
    a feature extractor adapted to perform operations comprising:
        performing a first syntactic and semantic analysis of a training natural language text to produce a first plurality of language-independent semantic structures representing a plurality of sentences of the training natural language text;
        producing, based on the first plurality of language-independent semantic structures, a text classifier model;
        performing a second syntactic and semantic analysis of an input natural language text to produce a second plurality of language-independent semantic structures representing a plurality of sentences of the input natural language text;
        extracting, using the second plurality of language-independent semantic structures, a set of features, wherein at least one feature references a semantic class of a language-independent semantic hierarchy comprising a plurality of semantic classes, in which the semantic class exhibits one or more properties inherited from its parent semantic class; and
    a text classifier adapted to perform operations comprising:
        applying the text classifier model to the set of features to generate a classification spectrum comprising a plurality of weight values, wherein each weight value references a degree of association of the input natural language text with a particular category of natural language texts; and
        associating the input natural language text with one or more categories using the classification spectrum.

16. The computer system of claim 15, wherein the feature extractor is further adapted to perform operations comprising: determining a grammatical feature of the input natural language text.

17. The computer system of claim 15, wherein the feature extractor is further adapted to perform operations comprising: determining a lexical feature of the input natural language text.

18. The computer system of claim 15, wherein the feature extractor is further adapted to perform operations comprising: determining a syntactic feature of the input natural language text.

19. The computer system of claim 15, wherein the feature extractor is further adapted to perform operations comprising: determining a semantic feature of the input natural language text.

20. The computer system of claim 15, wherein the feature extractor is further adapted to perform operations comprising: generating a syntactic structure of a se
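The steps recited in the independent claims (train a model, extract features that reference a semantic hierarchy, produce a classification spectrum of weights, then associate categories) can be sketched roughly as follows. Everything here is an assumption made for illustration: the `PARENT` hierarchy, the class and category names, the count-based model, and the 0.4 threshold are hypothetical, and inheritance from a parent class is modelled simply by crediting every ancestor class as a feature. The claims do not prescribe any of these choices.

```python
from collections import Counter, defaultdict

# Hypothetical semantic hierarchy: child class -> parent class.
PARENT = {
    "BANK": "ORGANIZATION",
    "FOOTBALL_CLUB": "ORGANIZATION",
    "ORGANIZATION": "ENTITY",
    "LOAN": "FINANCIAL_OPERATION",
    "FINANCIAL_OPERATION": "ACTION",
    "MATCH": "SPORT_EVENT",
    "SPORT_EVENT": "EVENT",
}


def with_ancestors(semantic_class):
    """Return the class followed by each of its ancestors up to the root."""
    chain = [semantic_class]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def extract_features(semantic_classes):
    """Count each class and, as a stand-in for inheritance, its ancestors."""
    features = Counter()
    for cls in semantic_classes:
        features.update(with_ancestors(cls))
    return features


def train(labelled_docs):
    """Build a toy model: accumulated feature counts per category."""
    model = defaultdict(Counter)
    for semantic_classes, category in labelled_docs:
        model[category].update(extract_features(semantic_classes))
    return dict(model)


def classification_spectrum(model, semantic_classes):
    """Return one weight per category, normalised so the weights sum to 1."""
    features = extract_features(semantic_classes)
    raw = {
        category: sum(n * counts[f] for f, n in features.items())
        for category, counts in model.items()
    }
    total = sum(raw.values())
    return {c: (score / total if total else 0.0) for c, score in raw.items()}


def associate(spectrum, threshold=0.4):
    """Pick every category whose weight reaches the (assumed) threshold."""
    return sorted(c for c, weight in spectrum.items() if weight >= threshold)


# Each document is given directly as the semantic classes found in it.
model = train([
    (["BANK", "LOAN"], "finance"),
    (["FOOTBALL_CLUB", "MATCH"], "sport"),
])
spectrum = classification_spectrum(model, ["LOAN", "BANK"])
print(spectrum)             # prints {'finance': 0.75, 'sport': 0.25}
print(associate(spectrum))  # prints ['finance']
```

The non-zero weight for "sport" comes entirely from the shared ancestors ORGANIZATION and ENTITY, which shows how hierarchy-aware features let related classes contribute to more than one category while the spectrum still ranks the closest category first.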

Assignees

Inventors

Classifications

  • Rule-based translation · CPC title

  • Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars · CPC title

  • G06F40/268 · Primary

    Morphological analysis · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Semantic analysis · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9588958B2 cover?
Methods are described for performing classification (categorization) of text documents written in various languages. Language-independent semantic structures are constructed before classifying documents. These structures reflect lexical, morphological, syntactic, and semantic properties of documents. The methods suggested are able to perform cross-language text classification which is based on …
Who is the assignee on this patent?
Danielyan Tatiana, Zuev Konstantin, Anisimovich Konstantin, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06F40/268. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 7, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).