System and method for natural language processing using synthetic text

US10025773B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10025773-B2
Application number  US-201514809001-A
Country             US
Kind code           B2
Filing date         Jul 24, 2015
Priority date       Jul 24, 2015
Publication date    Jul 17, 2018
Grant date          Jul 17, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for performing natural language processing includes receiving a primary text file. The received primary text file is scanned to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file. A probabilistic word generator is created based on the determined set of statistics. The probabilistic word generator generates synthetic text exhibiting the determined set of statistics. Synthetic text exhibiting the determined set of statistics is generated using the created probabilistic word generator. Word vectorization is performed on the synthetic text. Results of the performed vectorization are used to perform machine learning tasks.
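As an illustration of the scanning and generation steps in the abstract, here is a minimal Python sketch, assuming the simplest case where the statistics are counts of which word immediately follows which. The function names `scan_bigrams` and `generate` are hypothetical and do not come from the patent, and whitespace splitting stands in for whatever tokenization a real system would use.

```python
import random
from collections import Counter, defaultdict


def scan_bigrams(text):
    """Count how often each word immediately follows each other word."""
    words = text.split()  # assumption: whitespace tokenization
    follow = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follow[prev][nxt] += 1
    return follow


def generate(follow, start, length, rng):
    """Emit up to `length` words, sampling each next word in proportion
    to how often it followed the previous word in the primary text."""
    out = [start]
    while len(out) < length:
        options = follow.get(out[-1])
        if not options:  # the last word was never followed by anything
            break
        candidates, counts = zip(*options.items())
        out.append(rng.choices(candidates, weights=counts, k=1)[0])
    return out


follow = scan_bigrams("a b a c a b")
synthetic = generate(follow, "a", 8, random.Random(0))
```

The generated list stays in memory and could be passed straight to a word vectorization routine, which is the step the abstract describes next. This sketch does not implement the vectorization itself.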

First claim

Opening claim text (preview).

What is claimed is:

1. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text including a plurality of sentences, each of which including a predetermined number of probabilistically selected words exhibiting the determined set of statistics, related to a frequency at which various words of the primary text file follow other words of the primary text file, using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein, in generating the synthetic text that is used to perform machine learning directly into the main memory, no text data is loaded into the main memory from a memory storage device.

2. The method of claim 1, wherein the machine learning tasks include natural language processing.

3. The method of claim 1, wherein word vectorization is performed on the synthetic text as each word or each group of words of the synthetic text is generated.

4. The method of claim 1, wherein receiving the primary text file includes storing the primary text file in a secondary storage that includes either a hard disk drive or a solid state drive.

5. The method of claim 1, wherein the set of statistics includes an indication of a frequency with which a second given word in the primary text file immediately follows a first given word in the primary text file.

6. The method of claim 1, wherein the set of statistics includes n-tuple statistics.

7. The method of claim 1, wherein the set of statistics includes an indication of a frequency with which a given word in the primary text file immediately follows a given sequence of words in the primary text file.

8. The method of claim 7, wherein the given sequence of words includes a sequence of two words.

9. The method of claim 7, wherein the given sequence of words includes a sequence of two, three, four, or five words.

10. The method of claim 1, wherein the probabilistic word generator generates synthetic text directly to a main memory.

11. The method of claim 1, wherein there are multiple instances of generating synthetic text and performing word vectorization on the synthetic text, running in parallel.

12. The method of claim 1, wherein vectorization is additionally performed in parallel on the primary text file and on generated text.

13. The method of claim 12, wherein generating synthetic text and performing word vectorization on the synthetic text are performed during periods in which text from the primary text file is loaded from storage memory into main memory for vectorization thereupon.

14. The method of claim 1, wherein after the primary text file is received, each word of the primary text file is replaced by a corresponding integer, the synthetic text is generated as a sequence of corresponding integers, rather than words, and word vectorization is performed on the sequence of corresponding integers.

15. The method of claim 1, wherein each sentence of the plurality of sentences of the synthetic text is generated, by the created probabilistic word generator, by probabilistically selecting a starting word according to a frequency of word use, which is a first statistic of the determined set of statistics, and then generating a sequence of successive words thereafter by probabilistically selecting each word thereof, according to a frequency by which one word follows another, which is a second statistic of the determined set of statistics.

16. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text exhibiting the determined set of statistics using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein the set of statistics includes n-tuple statistics, wherein the generating of the synthetic text comprises generating a next word by taking into account statistics for the word generated n-words prior, within the generated synthetic text, or the word generated another fixed distance prior, within the generated synthetic text, by using m-bit vectors and hashing, and wherein the steps of generating synthetic text, performing word vectorization thereon, and using the results of the vectorization to perform machine learning tasks are implemented as multi-threaded processes so that a plurality of instances of generating synthetic text and vectorizing/performing machine learning are executed in parallel.

17. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text exhibiting the determined set of statistics using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein vectorization is additionally performed in parallel on the primary text file and on generated text, and wherein different weights are afforded to the vectorization of the primary text file and the generated text in providing adjustments to vector parameters.

18. The method of claim 17, wherein the different weights are adjusted as the vectorization progresses.

19. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic
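Claims 14 and 15 describe replacing words with integers and building each synthetic sentence from a frequency-weighted starting word followed by frequency-weighted successors. A minimal Python sketch of that combination follows. All function names are hypothetical, and the fallback to overall word frequency when a word has no recorded successor is an assumption of this sketch, not something the claims specify.

```python
import random
from collections import Counter, defaultdict


def encode(words):
    """Replace each word with an integer id, assigned in order of first use."""
    vocab = {}
    ids = [vocab.setdefault(w, len(vocab)) for w in words]
    return ids, vocab


def build_stats(ids):
    """Return (word-use counts, follow counts) over the integer sequence."""
    unigram = Counter(ids)
    follow = defaultdict(Counter)
    for prev, nxt in zip(ids, ids[1:]):
        follow[prev][nxt] += 1
    return unigram, follow


def sample(counter, rng):
    """Pick one key with probability proportional to its count."""
    keys = list(counter)
    return rng.choices(keys, weights=[counter[k] for k in keys], k=1)[0]


def synth_sentence(unigram, follow, n_words, rng):
    """Generate a sentence of exactly n_words integer ids."""
    sent = [sample(unigram, rng)]  # starting word chosen by frequency of use
    while len(sent) < n_words:
        options = follow.get(sent[-1])
        # Assumption: fall back to word-use frequency at a dead end.
        sent.append(sample(options if options else unigram, rng))
    return sent


ids, vocab = encode("the cat sat on the mat".split())
unigram, follow = build_stats(ids)
sentence = synth_sentence(unigram, follow, 5, random.Random(1))
```

Working on integer ids keeps the generator's output in the form a vectorization step would typically index by, so no string lookup is needed between generation and training.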

Assignees

IBM

Inventors

Classifications

  • G06F40/216 (Primary)

    using statistical methods · CPC title

  • Natural language generation · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • G06F40/274 (Primary)

    Converting codes to words; Guess-ahead of partial word inputs · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10025773B2 cover?
A method for performing natural language processing includes receiving a primary text file. The received primary text file is scanned to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file. A probabilistic word generator is created based on the determined set of statistics. The probabilistic word genera…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/216. Mapped technology areas include Physics.
When was this patent published?
Published Jul 17, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).