Intent classification using non-correlated features

US11966699B2 · US · B2

Patent metadata
Publication number: US-11966699-B2
Application number: US-202117350116-A
Country: US
Kind code: B2
Filing date: Jun 17, 2021
Priority date: Jun 17, 2021
Publication date: Apr 23, 2024
Grant date: Apr 23, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system for classifying a language sample intent by receiving a language sample including a set of features, identifying language sample features, determining a tokenization score for the language sample according to the language sample features, eliminating duplicate features according to the tokenization score, determining a term frequency (tf) according to the identified features and the tokenization score, determining an inverse document frequency (idf) according to the identified features and the tokenization score, and generating a term frequency-inverse document frequency (tf-idf) matrix for the identified features.
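As a rough illustration of the final step in the abstract, the sketch below builds a conventional tf-idf matrix over a few language samples. It uses the textbook definitions (term count divided by sample length, and the natural log of the number of samples divided by document frequency) and plain white-space tokenization. This excerpt does not say how the tokenization score enters the tf and idf terms, so the sketch leaves it out. The function name `tfidf_matrix` and the sample sentences are illustrative assumptions, not taken from the patent.

```python
import math
from collections import Counter


def tfidf_matrix(samples):
    """Return (vocabulary, matrix); matrix[i][j] is the tf-idf weight of
    vocabulary[j] in samples[i].

    Textbook tf-idf only: the patent's tokenization-score weighting is
    NOT modelled here, because this excerpt does not define it.
    """
    # White-space tokenization is an assumption made for this sketch.
    tokenized = [sample.lower().split() for sample in samples]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of samples that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocabulary}

    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        # Term frequency is the token count normalised by sample length.
        row = [(counts[tok] / total) * idf[tok] if total else 0.0
               for tok in vocabulary]
        matrix.append(row)
    return vocabulary, matrix


samples = ["book a flight", "cancel a flight", "book a hotel"]
vocab, matrix = tfidf_matrix(samples)
```

A token that appears in every sample (here "a") gets an idf of zero and so contributes nothing to any row, while a token unique to one sample (here "cancel") gets the largest weight.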

First claim

Opening claim text (preview).

What is claimed is:

1. A computer system comprising computer processors to implement a method for classifying a language sample, the computer system comprising: one or more computer processors; one or more computer readable storage devices; stored program instructions on one or more computer readable storage devices for execution by the one or more computer processors, the stored program instructions comprising: program instructions to train a term frequency-inverse document frequency (tf-idf) matrix using a training data set including labeled language samples for multiple languages by: extracting language characteristics for each language sample, determining a probability of using white-space tokenizer (PWST) score for each language, training weights of a tf-idf matrix for each language according to the PWST and language characteristics for each language, program instructions to receive a first language sample comprising a set of features; program instructions to identify language sample features of the first language sample; program instructions to determine a tokenization score for the first language sample according to the language sample features; program instructions to determine a term frequency (tf) according to the identified language sample features and the tokenization score; program instructions to determine an inverse document frequency (idf) according to the identified language sample features and the tokenization score; program instructions to generate a revised term frequency-inverse document frequency (tf-idf) matrix for the identified language sample features using the trained tf-idf matrix, the tf, and the idf; program instructions to receive input text; program instructions to identify tokens in the input text; and program instructions to translate the tokens into entries in the revised term frequency-inverse document frequency (tf-idf) matrix to classify an intent of the input text.

2. The computer system according to claim 1, the stored program instructions further comprising program instructions to identify the language according to language characteristics, white-space proportions, and average token length; and program instructions to determine the tokenization score according to the language and the language features.

3. The computer system according to claim 1, wherein determining a tokenization score for the language sample according to the language comprises determining a tokenization score according to information including a white-space proportion estimate and an average token length.

4. The computer system according to claim 1, the stored program instructions further comprising program instructions to eliminate duplicate features according to the tokenization score.

5. The computer system according to claim 4, wherein eliminating duplicate features comprises keeping white-space features for a high tokenization score.

6. The computer system according to claim 4, wherein eliminating duplicate features comprises keeping sentencepiece features for a low tokenization score.

7. The computer system according to claim 1, the stored program instructions further comprising program instructions to train a language sample classifier according to training sample features and tokenization scores.

8. A method for classifying a language sample, the method comprising: training, by one or more computer processors, a term frequency-inverse document frequency (tf-idf) matrix using a training data set including labeled language samples for multiple languages by: extracting language characteristics for each language sample, determining a probability of using white-space tokenizer (PWST) score for each language, training weights of a tf-idf matrix for each language according to the PWST and language characteristics for each language, receiving, by the one or more computer processors, a first language sample comprising a set of features; identifying, by the one or more computer processors, language sample features of the first language sample; determining, by the one or more computer processors, a tokenization score for the first language sample according to the language sample features; determining, by the one or more computer processors, a term frequency (tf) according to the identified language sample features and the tokenization score; determining, by the one or more computer processors, an inverse document frequency (idf) according to the identified language sample features and the tokenization score; generating, by the one or more computer processors, a revised term frequency-inverse document frequency (tf-idf) matrix for the identified language sample features using the trained tf-idf matrix, the tf and the idf; receiving, by the one or more computer processors, input text; identifying, by the one or more computer processors, tokens in the input text; and translating, by the one or more computer processors, the tokens into entries in the revised term frequency-inverse document frequency (tf-idf) matrix to classify an intent of the input text.

9. The method according to claim 8, further comprising identifying, by the one or more computer processors, the language according to language characteristics, white-space proportions, and average token length; and determining, by the one or more computer processors, the tokenization score according to the language and the language features.

10. The method according to claim 8, wherein determining a tokenization score for the language sample according to the language comprises determining a tokenization score according to information including a white-space proportion estimate and an average token length.

11. The method according to claim 8, further comprising eliminating, by the one or more computer processors, duplicate features according to the tokenization score.

12. The method according to claim 11, wherein eliminating duplicate features comprises keeping white-space features for a high tokenization score.

13. The method according to claim 11, wherein eliminating duplicate features comprises keeping sentencepiece features for a low tokenization score.

14. The method according to claim 8, further comprising training, by the one or more computer processors, a language sample classifier according to training sample features and tokenization scores.

15. A computer program product for classifying a language sample, the computer program product comprising one or more computer readable storage devices and collectively stored program instructions on the one or more computer readable storage devices, the stored program instructions comprising: program instructions to train a term frequency-inverse document frequency (tf-idf) matrix using a training data set including labeled language samples for multiple languages by: extracting language characteristics for each language sample, determining a probability of using white-space tokenizer (PWST) score for each language, training weights of a tf-idf matrix for each language according to the PWST and language characteristics for each language, program instructions to receive a first language sample comprising a set of features; program instructions to identify language sample features of the first language sample; program instructions to determine a tokenization score for the first language sample according to the language sample features; program instructions to determine a term frequency (tf) according to the identified language sample features and the tokenization score; program instructions to determine an inverse document frequency (idf) according to the identi
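The dependent claims tie the tokenization score to a white-space proportion and an average token length, and say that white-space features are kept when the score is high and sentencepiece features when it is low. The sketch below is one hypothetical way to wire those pieces together. The scoring formula, the constants `ws_target` and `max_avg_len`, the `threshold`, and every function name are assumptions made for illustration; the claims shown here give no formula. Character bigrams stand in for sentencepiece pieces so the example runs without a trained subword model.

```python
def whitespace_proportion(text):
    """Fraction of characters in the text that are white space."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def average_token_length(text):
    """Mean length of the white-space-delimited tokens in the text."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(tok) for tok in tokens) / len(tokens)


def tokenization_score(text, ws_target=0.1, max_avg_len=12.0):
    """Hypothetical score in [0, 1]; higher suggests white-space
    tokenization suits the text. The formula is an assumption."""
    ws = whitespace_proportion(text)
    avg = average_token_length(text)
    if avg == 0.0:
        return 0.0
    ws_factor = min(1.0, ws / ws_target)          # enough spaces present?
    length_factor = min(1.0, max_avg_len / avg)   # tokens reasonably short?
    return ws_factor * length_factor


def char_bigrams(text):
    """Stand-in for subword (sentencepiece-style) features."""
    compact = "".join(text.split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


def select_features(text, threshold=0.5):
    """Keep one feature family per sample so the same content is not
    represented twice: white-space tokens for a high score, subword
    features for a low score."""
    if tokenization_score(text) >= threshold:
        return "white-space", text.split()
    return "subword", char_bigrams(text)


english = "book a flight to Paris"
japanese = "パリ行きの飛行機を予約して"
kind_en, feats_en = select_features(english)
kind_ja, feats_ja = select_features(japanese)
```

Under these assumed constants, the space-delimited English sample keeps its white-space tokens, while the Japanese sample, which contains no spaces, falls back to subword features.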

Assignees

IBM

Inventors

Classifications

  • G06F40/284 · Primary

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • using natural language analysis · CPC title

  • Creation or modification of classes or clusters · CPC title

  • Inference or reasoning models · CPC title

  • Machine learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11966699B2 cover?
A system for classifying a language sample intent by receiving a language sample including a set of features, identifying language sample features, determining a tokenization score for the language sample according to the language sample features, eliminating duplicate features according to the tokenization score, determining a term frequency (tf) according to the identified features and the to…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 23, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).