Treebank synthesis for training production parsers

US11769007B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11769007-B2
Application number  US-202117303349-A
Country             US
Kind code           B2
Filing date         May 27, 2021
Priority date       May 27, 2021
Publication date    Sep 26, 2023
Grant date          Sep 26, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An approach for generating synthetic treebanks to be used in training a parser in a production system is provided. A processor receives a request to generate one or more synthetic treebanks from a production system, wherein the request indicates a language for the one or more synthetic treebanks. A processor retrieves at least one corpus of text in which the requested language is present. A processor provides the at least one corpus to a transformer enhanced parser neural network model. A processor generates at least one synthetic treebank associated with a string of text from the at least one corpus of text in which the requested language is present. A processor sends the at least one synthetic treebank to the production system, wherein the production system trains a parser utilized by the production system with the at least one synthetic treebank.
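
The request-to-delivery flow in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: `TreebankRequest`, `CORPORA`, `toy_parse`, `generate_treebanks`, and `ProductionSystem` are all hypothetical names, and `toy_parse` is a trivial placeholder heuristic standing in for the transformer enhanced parser neural network model.

```python
from dataclasses import dataclass, field


@dataclass
class TreebankRequest:
    """Hypothetical request sent by the production system."""
    language: str           # language the treebanks should cover
    max_sentences: int = 2  # assumed limit, not part of the patent text


@dataclass
class Token:
    index: int  # 1-based position in the sentence
    form: str   # surface word
    head: int   # index of the governing token, 0 for the root


# Hypothetical corpus store keyed by language code.
CORPORA = {
    "en": ["The parser reads text", "Treebanks train parsers"],
    "de": ["Der Parser liest Text"],
}


def toy_parse(sentence: str) -> list[Token]:
    """Placeholder for the transformer enhanced parser model.

    Assumption: the first word is the root and every later word
    attaches to the word directly before it.
    """
    words = sentence.split()
    return [Token(i, w, i - 1) for i, w in enumerate(words, start=1)]


def generate_treebanks(request: TreebankRequest) -> list[list[Token]]:
    """Retrieve a corpus in the requested language and parse each sentence."""
    corpus = CORPORA.get(request.language, [])
    return [toy_parse(s) for s in corpus[: request.max_sentences]]


@dataclass
class ProductionSystem:
    """Hypothetical consumer that collects treebanks as parser training data."""
    language: str
    training_data: list = field(default_factory=list)

    def receive(self, treebanks: list[list[Token]]) -> None:
        self.training_data.extend(treebanks)


production = ProductionSystem(language="en")
treebanks = generate_treebanks(TreebankRequest(language=production.language))
production.receive(treebanks)

# Print the first synthetic tree as index, word, head columns.
for tok in treebanks[0]:
    print(tok.index, tok.form, tok.head, sep="\t")
```

In a real system the placeholder parser would be replaced by a trained model, and the production system would use the received trees to train its own parser.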

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for generating synthetic treebanks to be used in training a parser in a production system, the method comprising: receiving, by one or more processors, a request to generate one or more synthetic treebanks from a production system, wherein the request indicates a language for the one or more synthetic treebanks; retrieving, by the one or more processors, at least one corpus of text in which the requested language is present; providing, by the one or more processors, the at least one corpus to a transformer enhanced parser neural network model; generating, by the one or more processors, at least one synthetic treebank associated with a string of text from the at least one corpus of text in which the requested language is present, wherein the at least one synthetic treebank is generated with unsupervised training of the transformer enhanced parser neural network model; and sending, by the one or more processors, the at least one synthetic treebank to the production system, wherein the production system trains a parser utilized by the production system with the at least one synthetic treebank.

2. The computer-implemented method of claim 1, wherein the at least one corpus of text includes a corpus directed towards a limited language or domain.

3. The computer-implemented method of claim 2, wherein the transformer enhanced parser neural network model includes one of the following pretrained transformer models: a bidirectional encoder representations for transformers (BERT) model or a cross-lingual language model (XLM).

4. The computer-implemented method of claim 1, the transformer enhanced parser neural network model includes a neural-network parser.

5. The computer-implemented method of claim 4, wherein the parser utilized by the production system is of lower quality than the neural-network parser.

6. The computer-implemented method of claim 1, wherein the transformer enhanced parser neural network model separates one or more words of the at least one corpus of text into subwords.

7. A computer program product for generating synthetic treebanks to be used in training of a parser in a production system, the computer program product comprising: one or more computer-readable storage media and program instructions stored on the one or more computer-readable storage media, the program instructions comprising: program instructions to receive a request to generate one or more synthetic treebanks from a production system, wherein the request indicates a language for the one or more synthetic treebanks; program instructions to retrieve at least one corpus of text in which the requested language is present; program instructions to provide the at least one corpus to a transformer enhanced parser neural network model; program instructions to generate at least one synthetic treebank associated with a string of text from the at least one corpus of text in which the requested language is present, wherein the at least one synthetic treebank is generated with unsupervised training of the transformer enhanced parser neural network model; and program instructions to send the at least one synthetic treebank to the production system, wherein the production system trains a parser utilized by the production system with the at least one synthetic treebank.

8. The computer program product of claim 7, wherein the at least one corpus of text includes a corpus directed towards a limited language or domain.

9. The computer program product of claim 8, wherein the transformer enhanced parser neural network model includes one of the following pretrained transformer models: a bidirectional encoder representations for transformers (BERT) model or a cross-lingual language model (XLM).

10. The computer program product of claim 7, the transformer enhanced parser neural network model includes a neural-network parser.

11. The computer program product of claim 10, wherein the parser utilized by the production system is of lower quality than the neural-network parser.

12. The computer program product of claim 7, wherein the transformer enhanced parser neural network model separates one or more words of the at least one corpus of text into subwords.

13. A computer system for generating synthetic treebanks to be used in training of a parser in a production system, the computer system comprising: one or more computer processors; one or more computer readable storage media; and program instructions stored on the computer readable storage media for execution by at least one of the one or more processors, the program instructions comprising: program instructions to receive a request to generate one or more synthetic treebanks from a production system, wherein the request indicates a language for the one or more synthetic treebanks; program instructions to retrieve at least one corpus of text in which the requested language is present; program instructions to provide the at least one corpus to a transformer enhanced parser neural network model; program instructions to generate at least one synthetic treebank associated with a string of text from the at least one corpus of text in which the requested language is present, wherein the at least one synthetic treebank is generated with unsupervised training of the transformer enhanced parser neural network model; and program instructions to send the at least one synthetic treebank to the production system, wherein the production system trains a parser utilized by the production system with the at least one synthetic treebank.

14. The computer system of claim 13, wherein the at least one corpus of text includes a corpus directed towards a limited language or domain.

15. The computer system of claim 14, wherein the transformer enhanced parser neural network model includes one of the following pretrained transformer models: a bidirectional encoder representations for transformers (BERT) model or a cross-lingual language model (XLM).

16. The computer system of claim 13, the transformer enhanced parser neural network model includes a neural-network parser.

17. The computer system of claim 16, wherein the parser utilized by the production system is of lower quality than the neural-network parser.

18. The computer system of claim 13, wherein the transformer enhanced parser neural network model separates one or more words of the at least one corpus of text into subwords.
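
Claims 6, 12, and 18 recite separating words into subwords but do not say which scheme is used. As one illustration, the sketch below assumes greedy longest-match splitting in the style of WordPiece tokenizers, where continuation pieces carry a `##` prefix. The function name `split_into_subwords` and the tiny vocabulary `VOCAB` are hypothetical and chosen only for the example.

```python
def split_into_subwords(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split a word into the longest vocabulary pieces, scanning left to right.

    Assumption: WordPiece-style matching, where every piece after the
    first is looked up with a "##" prefix. If any part of the word
    cannot be matched, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
VOCAB = {"tree", "##bank", "##s", "pars", "##er", "##ing"}

print(split_into_subwords("treebanks", VOCAB))  # ['tree', '##bank', '##s']
print(split_into_subwords("parsing", VOCAB))    # ['pars', '##ing']
print(split_into_subwords("xyz", VOCAB))        # ['[UNK]']
```

Subword splitting of this kind lets a model handle words that never appeared whole in its training data, which matters for the limited-language or limited-domain corpora mentioned in claims 2, 8, and 14.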

Assignees

IBM

Inventors

Classifications

  • Supervised learning · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • G06F40/205 · Primary

    Parsing · CPC title

  • Machine-assisted translation, e.g. using translation memory · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11769007B2 cover?
An approach for generating synthetic treebanks to be used in training a parser in a production system is provided. A processor receives a request to generate one or more synthetic treebanks from a production system, wherein the request indicates a language for the one or more synthetic treebanks. A processor retrieves at least one corpus of text in which the requested language is present. A pro…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/205. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 26, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).