Structure aware transformers for natural language processing
US-2024370714-A1 · Nov 7, 2024 · US
US10025773B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10025773-B2 |
| Application number | US-201514809001-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 24, 2015 |
| Priority date | Jul 24, 2015 |
| Publication date | Jul 17, 2018 |
| Grant date | Jul 17, 2018 |
Official abstract text for this publication.
A method for performing natural language processing includes receiving a primary text file. The received primary text file is scanned to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file. A probabilistic word generator is created based on the determined set of statistics. The probabilistic word generator generates synthetic text exhibiting the determined set of statistics. Synthetic text exhibiting the determined set of statistics is generated using the created probabilistic word generator. Word vectorization is performed on the synthetic text. Results of the performed vectorization are used to perform machine learning tasks.
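The first steps of the abstract (scan the text for word-follows-word frequencies, then build a generator that reproduces them) can be illustrated with a small bigram sampler. This is a minimal sketch, not the patented implementation: the function names `scan_statistics` and `generate_sentence`, the whitespace tokenization, and the restart-on-dead-end rule are all assumptions made for illustration, and the vectorization and machine learning steps are left out.

```python
import random
from collections import Counter, defaultdict


def scan_statistics(text):
    """Scan a text and count word frequencies and word-follows-word frequencies.

    Hypothetical helper: tokenization is plain whitespace splitting.
    """
    words = text.split()
    unigram = Counter(words)              # how often each word occurs
    follows = defaultdict(Counter)        # follows[a][b] = times b follows a
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return unigram, follows


def _sample(counter, rng):
    """Draw one key from a Counter with probability proportional to its count."""
    keys = list(counter)
    return rng.choices(keys, weights=[counter[k] for k in keys])[0]


def generate_sentence(unigram, follows, length, rng):
    """Generate `length` words whose transitions follow the scanned statistics."""
    word = _sample(unigram, rng)          # starting word by frequency of use
    sentence = [word]
    while len(sentence) < length:
        successors = follows.get(word)
        if successors:
            word = _sample(successors, rng)
        else:
            # Assumption: if a word was never followed by anything in the
            # primary text, restart from the overall word frequencies.
            word = _sample(unigram, rng)
        sentence.append(word)
    return sentence


primary = "the cat sat on the mat the cat ran"
unigram, follows = scan_statistics(primary)
synthetic = generate_sentence(unigram, follows, 8, random.Random(0))
print(" ".join(synthetic))
```

The generated words live only in a Python list in memory, which loosely mirrors the claim language about generating synthetic text directly into main memory; a word vectorizer could consume `synthetic` without it ever being written to disk.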
Opening claim text (preview).
What is claimed is:
1. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text including a plurality of sentences, each of which including a predetermined number of probabilistically selected words exhibiting the determined set of statistics, related to a frequency at which various words of the primary text file follow other words of the primary text file, using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein, in generating the synthetic text that is used to perform machine learning directly into the main memory, no text data is loaded into the main memory from a memory storage device.
2. The method of claim 1, wherein the machine learning tasks include natural language processing.
3. The method of claim 1, wherein word vectorization is performed on the synthetic text as each word or each group of words of the synthetic text is generated.
4. The method of claim 1, wherein receiving the primary text file includes storing the primary text file in a secondary storage that includes either a hard disk drive or a solid state drive.
5. The method of claim 1, wherein the set of statistics includes an indication of a frequency with which a second given word in the primary text file immediately follows a first given word in the primary text file.
6. The method of claim 1, wherein the set of statistics includes n-tuple statistics.
7. The method of claim 1, wherein the set of statistics includes an indication of a frequency with which a given word in the primary text file immediately follows a given sequence of words in the primary text file.
8. The method of claim 7, wherein the given sequence of words includes a sequence of two words.
9. The method of claim 7, wherein the given sequence of words includes a sequence of two, three, four, or five words.
10. The method of claim 1, wherein the probabilistic word generator generates synthetic text directly to a main memory.
11. The method of claim 1, wherein there are multiple instances of generating synthetic text and performing word vectorization on the synthetic text, running in parallel.
12. The method of claim 1, wherein vectorization is additionally performed in parallel on the primary text file and on generated text.
13. The method of claim 12, wherein generating synthetic text and performing word vectorization on the synthetic text are performed during periods in which text from the primary text file is loaded from storage memory into main memory for vectorization thereupon.
14. The method of claim 1, wherein after the primary text file is received, each word of the primary text file is replaced by a corresponding integer, the synthetic text is generated as a sequence of corresponding integers, rather than words, and word vectorization is performed on the sequence of corresponding integers.
15. The method of claim 1, wherein each sentence of the plurality of sentences of the synthetic text is generated, by the created probabilistic word generator, by probabilistically selecting a starting word according to a frequency of word use, which is a first statistic of the determined set of statistics, and then generating a sequence of successive words thereafter by probabilistically selecting each word thereof, according to a frequency by which one word follows another, which is a second statistic of the determined set of statistics.
16. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text exhibiting the determined set of statistics using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein the set of statistics includes n-tuple statistics, wherein the generating of the synthetic text comprises generating a next word by taking into account statistics for the word generated n-words prior, within the generated synthetic text, or the word generated another fixed distance prior, within the generated synthetic text, by using m-bit vectors and hashing, and wherein the steps of generating synthetic text, performing word vectorization thereon, and using the results of the vectorization to perform machine learning tasks are implemented as multi-threaded processes so that a plurality of instances of generating synthetic text and vectorizing/performing machine learning are executed in parallel.
17. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic text directly into a main memory of a computer system, the generated synthetic text exhibiting the determined set of statistics using the created probabilistic word generator; performing word vectorization on the synthetic text, within the main memory of the computer system; using results of the performed vectorization to perform machine learning tasks; and using the machine learning tasks to perform natural language processing to interpret a subsequent text, wherein vectorization is additionally performed in parallel on the primary text file and on generated text, and wherein different weights are afforded to the vectorization of the primary text file and the generated text in providing adjustments to vector parameters.
18. The method of claim 17, wherein the different weights are adjusted as the vectorization progresses.
19. A method for performing natural language processing, comprising: receiving a primary text file; scanning the received primary text file to determine a set of statistics related to a frequency at which various words of the primary text file follow other words of the primary text file; creating a probabilistic word generator, based on the determined set of statistics, that generates synthetic text exhibiting the determined set of statistics; generating synthetic
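Claims 7, 8, and 14 describe statistics over a preceding sequence of words and generation over integers rather than words. One way these ideas might fit together is sketched below. The names `encode`, `context_statistics`, and `generate_ids` are hypothetical, the stop-on-unseen-context rule is an assumption, and the m-bit vector hashing mentioned in claim 16 is deliberately not modeled because the preview text does not describe it.

```python
import random
from collections import Counter, defaultdict


def encode(words):
    """Replace each word with an integer ID, assigned in order of first use."""
    vocab = {}
    ids = [vocab.setdefault(w, len(vocab)) for w in words]
    return ids, vocab


def context_statistics(ids, n=2):
    """Count how often each ID immediately follows each run of n IDs."""
    stats = defaultdict(Counter)
    for i in range(len(ids) - n):
        stats[tuple(ids[i:i + n])][ids[i + n]] += 1
    return stats


def generate_ids(stats, seed_context, length, rng):
    """Extend seed_context to `length` IDs by sampling from context counts."""
    out = list(seed_context)
    n = len(seed_context)
    while len(out) < length:
        successors = stats.get(tuple(out[-n:]))
        if not successors:
            break  # assumption: stop when the context was never seen
        candidates = list(successors)
        weights = [successors[c] for c in candidates]
        out.append(rng.choices(candidates, weights=weights)[0])
    return out


words = "a b c a b d a b c".split()
ids, vocab = encode(words)
stats = context_statistics(ids, n=2)
generated = generate_ids(stats, (0, 1), 6, random.Random(1))
print(ids)        # [0, 1, 2, 0, 1, 3, 0, 1, 2]
print(generated)
```

Working on integer IDs keeps the generated sequence compact and lets a downstream vectorizer index its parameter table directly, which is one plausible motivation for the integer replacement in claim 14.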
using statistical methods · CPC title
Natural language generation · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Converting codes to words; Guess-ahead of partial word inputs · CPC title
Physics · mapped topic