What technology area does this patent fall under?

Primary CPC classification G06F40/216. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jul 17, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method for natural language processing using synthetic text

US10025773B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10025773-B2
Application number	US-201514809001-A
Country	US
Kind code	B2
Filing date	Jul 24, 2015
Priority date	Jul 24, 2015
Publication date	Jul 17, 2018
Grant date	Jul 17, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for performing natural language processing includes receiving a primary text file. The received primary text file is scanned to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file. A probabilistic word generator is created based on the determined set of statistics. The probabilistic word generator generates synthetic text exhibiting the determined set of statistics. Synthetic text exhibiting the determined set of statistics is generated using the created probabilistic word generator. Word vectorization is performed on the synthetic text. Results of the performed vectorization are used to perform machine learning tasks.
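The scanning and generation steps in the abstract can be illustrated with a small sketch. This is not the patented implementation: it assumes the simplest reading, where the statistics are counts of which word immediately follows which, and the generator samples each next word in proportion to those counts. The function names `scan_follow_statistics` and `generate_synthetic_text` and the sample sentence are hypothetical, chosen only for illustration. Word vectorization and the downstream machine learning tasks are not shown.

```python
import random
from collections import Counter, defaultdict


def scan_follow_statistics(text):
    """Count how often each word is immediately followed by each other word."""
    words = text.split()
    follows = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        follows[first][second] += 1
    return follows


def generate_synthetic_text(follows, start, length, rng):
    """Emit up to `length` words, sampling each successor in proportion to its count."""
    out = [start]
    while len(out) < length:
        successors = follows.get(out[-1])
        if not successors:
            # Assumption: stop early when the last word never had a successor.
            break
        candidates = list(successors)
        weights = [successors[w] for w in candidates]
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return out


# Hypothetical primary text, used only to exercise the sketch.
primary = "the cat sat on the mat and the dog sat on the rug"
stats = scan_follow_statistics(primary)
synthetic = generate_synthetic_text(stats, "the", 12, random.Random(0))
print(" ".join(synthetic))
```

Every adjacent word pair in the output also occurs in the primary text, which is the sense in which the synthetic text exhibits the scanned statistics under this simplified reading.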

First claim

Opening claim text (preview).

What is claimed is:
1. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text including a plurality of sentences, each of which including a predetermined number of probabilistically selected words exhibiting the determined set of statistics, related to a frequency at which various words of the primary text file follow other words of the primary text file, using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein, in generating the synthetic text that is used to perform machine learning directly into the main memory, no text data is loaded into the main memory from a memory storage device.
2. The method of claim 1, wherein the machine learning tasks include natural language processing.
3. The method of claim 1, wherein word vectorization is performed on the synthetic text as each word or each group of words of the synthetic text is generated.
4. The method of claim 1, wherein receiving the primary text file includes storing the primary text file in a secondary storage that includes either a hard disk drive or a solid state drive.
5. The method of claim 1, wherein the set of statistics includes an indication of a frequency with which a second given word in the primary text file immediately follows a first given word in the primary text file.
6. The method of claim 1, wherein the set of statistics includes n-tuple statistics.
7. The method of claim 1, wherein the set of statistics includes an indication of a frequency with which a given word in the primary text file immediately follows a given sequence of words in the primary text file.
8. The method of claim 7, wherein the given sequence of words includes a sequence of two words.
9. The method of claim 7, wherein the given sequence of words includes a sequence of two, three, four, or five words.
10. The method of claim 1, wherein the probabilistic word generator generates synthetic text directly to a main memory.
11. The method of claim 1, wherein there are multiple instances of generating synthetic text and performing word vectorization on the synthetic text, running in parallel.
12. The method of claim 1, wherein vectorization is additionally performed in parallel on the primary text file and on generated text.
13. The method of claim 12, wherein generating synthetic text and performing word vectorization on the synthetic text are performed during periods in which text from the primary text file is loaded from storage memory into main memory for vectorization thereupon.
14. The method of claim 1, wherein after the primary text file is received, each word of the primary text file is replaced by a corresponding integer, the synthetic text is generated as a sequence of corresponding integers, rather than words, and word vectorization is performed on the sequence of corresponding integers.
15. The method of claim 1, wherein each sentence of the plurality of sentences of the synthetic text is generated, by the created probabilistic word generator, by probabilistically selecting a starting word according to a frequency of word use, which is a first statistic of the determined set of statistics, and then generating a sequence of successive words thereafter by probabilistically selecting each word thereof, according to a frequency by which one word follows another, which is a second statistic of the determined set of statistics.
16. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text exhibiting the determined set of statistics using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein the set of statistics includes n-tuple statistics, wherein the generating of the synthetic text comprises generating a next word by taking into account statistics for the word generated n-words prior, within the generated synthetic text, or the word generated another fixed distance prior, within the generated synthetic text, by using m-bit vectors and hashing, and wherein the steps of generating synthetic text, performing word vectorization thereon, and using the results of the vectorization to perform machine learning tasks are implemented as multi-threaded processes so that a plurality of instances of generating synthetic text and vectorizing/performing machine learning are executed in parallel.
17. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text exhibiting the determined set of statistics using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein vectorization is additionally performed in parallel on the primary text file and on generated text, and wherein different weights are afforded to the vectorization of the primary text file and the generated text in providing adjustments to vector parameters.
18. The method of claim 17, wherein the different weights are adjusted as the vectorization progresses.
19. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic
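Claims 14 and 15 describe replacing words with integers and building each sentence from a frequency-weighted starting word followed by successor-weighted words. The sketch below is one possible reading, not the claimed implementation. The helper names (`encode`, `build_statistics`, `sample`, `generate_sentence`) and the toy input are hypothetical, and the fallback to word-use frequencies when a word has no recorded successor is an assumption made so that every sentence reaches the predetermined length. The claims do not say how that case is handled.

```python
import random
from collections import Counter, defaultdict


def encode(words):
    """Replace each distinct word with an integer ID, assigned in first-seen order."""
    vocab = {}
    ids = [vocab.setdefault(w, len(vocab)) for w in words]
    return ids, vocab


def build_statistics(ids):
    """Return (frequency of word use, frequency with which one word follows another)."""
    unigram = Counter(ids)
    follows = defaultdict(Counter)
    for a, b in zip(ids, ids[1:]):
        follows[a][b] += 1
    return unigram, follows


def sample(counter, rng):
    """Draw one key from a Counter with probability proportional to its count."""
    items = list(counter)
    return rng.choices(items, weights=[counter[i] for i in items], k=1)[0]


def generate_sentence(unigram, follows, n_words, rng):
    """Generate one sentence of exactly n_words integer IDs."""
    sentence = [sample(unigram, rng)]  # starting word by frequency of use
    while len(sentence) < n_words:
        successors = follows.get(sentence[-1])
        # Assumption: fall back to word-use frequencies at a dead end.
        sentence.append(sample(successors or unigram, rng))
    return sentence


# Hypothetical toy input, used only to exercise the sketch.
ids, vocab = encode("a b a c a b".split())
unigram, follows = build_statistics(ids)
rng = random.Random(1)
sentences = [generate_sentence(unigram, follows, 5, rng) for _ in range(3)]
print(sentences)
```

Working on integer IDs keeps the generated sequences compact in memory, which fits the claims' emphasis on generating and vectorizing text directly in main memory.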

Assignees

Inventors

Classifications

G06F40/216 · Primary
using statistical methods · CPC title
G06F40/56
Natural language generation · CPC title
G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06F40/274 · Primary
Converting codes to words; Guess-ahead of partial word inputs · CPC title
G06F17/2715
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 57837130

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10025773B2 cover?: A method for performing natural language processing includes receiving a primary text file. The received primary text file is scanned to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file. A probabilistic word generator is created based on the determined set of statistics. The probabilistic word genera…
Who is the assignee on this patent?: IBM