System and method for extracting information from unstructured text

US10002129B1 · US · B1

Patent metadata
Field               Value
Publication number  US-10002129-B1
Application number  US-201715474194-A
Country             US
Kind code           B1
Filing date         Mar 30, 2017
Priority date       Feb 15, 2017
Publication date    Jun 19, 2018
Grant date          Jun 19, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This disclosure relates generally to natural language processing, and more particularly to a system and method for extracting subject-verb-object (SVO) chunked text from an unstructured text. In one embodiment, a method is provided for extracting SVO chunked text from an unstructured text. The method comprises identifying a plurality of part of speech (PoS) tokens in the unstructured text, and determining a plurality of SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model. The machine learning chunker model is trained on a subject-verb-object (SVO) annotated training data.
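
The two stages named in the abstract can be pictured as a small pipeline. The sketch below is illustrative only and is not taken from the patent: the function name `extract_svo_tags`, the whitespace tokenisation, the Penn Treebank style PoS tags, and the `B-SUBJ` / `B-VERB` / `B-OBJ` labels are all assumptions. The two `fake_` functions return canned values in the places where a real PoS tagger and a trained chunker model would be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

PosToken = Tuple[str, str]  # (token, PoS tag)


def extract_svo_tags(
    text: str,
    pos_tagger: Callable[[Sequence[str]], List[str]],
    chunker: Callable[[Sequence[PosToken]], List[str]],
) -> List[Tuple[str, str, str]]:
    """Return (token, PoS tag, SVO tag) triples for one piece of text."""
    # Stage 1: identify PoS tokens (assumption: naive whitespace tokenisation).
    tokens = text.split()
    pos_tags = pos_tagger(tokens)
    if len(pos_tags) != len(tokens):
        raise ValueError("pos_tagger must return one tag per token")
    pos_tokens = list(zip(tokens, pos_tags))

    # Stage 2: the chunker model labels the PoS tokens directly.
    svo_tags = chunker(pos_tokens)
    if len(svo_tags) != len(pos_tokens):
        raise ValueError("chunker must return one tag per PoS token")
    return [(tok, pos, svo) for (tok, pos), svo in zip(pos_tokens, svo_tags)]


# Canned stand-ins so the sketch runs; a real system would plug in a PoS
# tagger and a chunker model trained on SVO-annotated data here.
def fake_tagger(tokens: Sequence[str]) -> List[str]:
    return ["NNP", "VBZ", "NN"]


def fake_chunker(pos_tokens: Sequence[PosToken]) -> List[str]:
    return ["B-SUBJ", "B-VERB", "B-OBJ"]


result = extract_svo_tags("Alice writes code", fake_tagger, fake_chunker)
print(result)
```

Passing the tagger and the chunker as callables keeps the sketch independent of any particular NLP library, so either stage can be replaced without touching the other.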

First claim

Opening claim text (preview).

What is claimed is:

1. A method for extracting subject-verb-object (SVO) chunked text from unstructured text, the method comprising: identifying, by a SVO chunked text computing device, a plurality of part of speech (PoS) tokens in an unstructured text; and determining, by the SVO chunked text computing device, a SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model, wherein the machine learning chunker model is trained on an SVO annotated training data, wherein the SVO annotated training data comprises a plurality of tokens, a plurality of corresponding PoS tags, and a plurality of corresponding SVO tags, the plurality of corresponding SVO tags comprises one or more of a subject tag, a verb tag, an object tag, or an object-subject tag, and the plurality of corresponding SVO tags is in beginning-inside-other (BIO) format, and wherein the SVO annotated training data is generated based on a plurality of corresponding span information for the plurality of tokens by for each of a plurality of PoS tokens in each of a plurality of sets of syntactically related PoS tokens in a sentence, detecting a span information for a PoS token and tagging the PoS token as a subject, a verb, an object, or an object-subject based on the span information and a previous tagging of the PoS token.

2. The method of claim 1, wherein identifying the plurality of PoS tokens comprises: extracting a plurality of tokens from the input text, wherein each of the plurality of tokens comprises a word or a phrase; and determining a PoS tag for each of the plurality of tokens.

3. The method of claim 1, wherein each of the plurality of SVO chunked text is a set of semantically related PoS tokens and comprises a verb phrase and at least two of a subject phrase, an object phrase, or an object-subject phrase and the object-subject phrase corresponds to an overlapping contiguous chunks that is an object phrase in an initial part of a sentence and a subject phrase in the subsequent part of the sentence.

4. The method of claim 1, wherein the machine learning chunker model is trained on one or more of: a non-overlapping SVO annotated training data comprising one set of subject, verb, and object in each of the sentences; or an overlapping SVO annotated training data comprising one or more sets of subject, verb, object, and object-subject in each of the sentences.

5. The method of claim 1, wherein the machine learning chunker model determines the plurality of SVO chunked text directly from the plurality of PoS tokens without a set of heuristics or a set of rules.

6. A subject-verb-object (SVO) chunked computing device, comprising; at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: identify a plurality of part of speech (PoS) tokens in an unstructured text; and determine a SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model, wherein the machine learning chunker model is trained on an SVO annotated training data, wherein the SVO annotated training data comprises a plurality of tokens, a plurality of corresponding PoS tags, and a plurality of corresponding SVO tags, the plurality of corresponding SVO tags comprises one or more of a subject tag, a verb tag, an object tag, or an object-subject tag, and the plurality of corresponding SVO tags is in beginning-inside-other (BIO) format, and wherein the SVO annotated training data is generated based on a plurality of corresponding span information for the plurality of tokens by for each of a plurality of PoS tokens in each of a plurality of sets of syntactically related PoS tokens in a sentence, detecting a span information for a PoS token and tagging the PoS token as a subject, a verb, an object, or an object-subject based on the span information and a previous tagging of the PoS token.

7. The SVO chunked computing device of claim 6, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to: extract a plurality of tokens from the input text, wherein each of the plurality of tokens comprises a word or a phrase; and determine a PoS tag for each of the plurality of tokens.

8. The SVO chunked computing device of claim 6, wherein each of the plurality of SVO chunked text is a set of semantically related PoS tokens and comprises a verb phrase and at least two of a subject phrase, an object phrase, or an object-subject phrase and the object-subject phrase corresponds to an overlapping contiguous chunks that is an object phrase in an initial part of a sentence and a subject phrase in the subsequent part of the sentence.

9. The SVO chunked computing device of claim 6, wherein the machine learning chunker model is trained on one or more of: a non-overlapping SVO annotated training data comprising one set of subject, verb, and object in each of the sentences; or an overlapping SVO annotated training data comprising one or more sets of subject, verb, object, and object-subject in each of the sentences.

10. The SVO chunked computing device of claim 6, wherein the machine learning chunker model determines the plurality of SVO chunked text directly from the plurality of PoS tokens without a set of heuristics or a set of rules.

11. A non-transitory computer-readable medium having stored thereon instructions for extracting subject-verb-object (SVO) chunked text from unstructured text comprising executable code which, when executed by one or more processors, causes the one or more processors to: identify a plurality of part of speech (PoS) tokens in the unstructured text; and determine a plurality of SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model, wherein the machine learning chunker model is trained on a subject-verb-object (SVO) annotated training data, wherein the SVO annotated training data comprises a plurality of tokens, a plurality of corresponding PoS tags, and a plurality of corresponding SVO tags, the plurality of corresponding SVO tags comprises one or more of a subject tag, a verb tag, an object tag, or an object-subject tag, and the plurality of corresponding SVO tags is in beginning-inside-other (BIO) format, and wherein the SVO annotated training data is generated based on a plurality of corresponding span information for the plurality of tokens by for each of a plurality of PoS tokens in each of a plurality of sets of syntactically related PoS tokens in a sentence, detecting a span information for a PoS token and tagging the PoS token as a subject, a verb, an object, or an object-subject based on the span information and a previous tagging of the PoS token.

12. The non-transitory computer-readable medium of claim 11, wherein the executable code, when executed by the one or more processor, further causes the one or more processor to: extract a plurality of tokens from the input text, wherein each of the plurality of tokens comprises a word or a phrase; and determine a PoS tag for each of the plurality of tokens.

13. The non-transitory computer-readable medium of claim 11, wherein each of the plurality of SVO chunked text is a set of semantically related PoS tokens and comprises a verb phrase and at least two of a subject phrase, an object phrase, or an object-subject phrase and the object-subject phrase corresponds to an overlapping contiguous chunks that is an object phrase in an initial part of a sentence and a subject phrase in the subsequent part of the sentence.

14. The non-transitory computer-readable medium
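
The independent claims describe training data built from span information, with each token receiving a subject, verb, object, or object-subject tag in BIO format depending on its span and on how it was tagged before. A minimal sketch of one way to read that step follows. It is an interpretation, not the patented procedure: the function name `bio_tags_from_spans`, the half-open `(start, end)` span encoding, the tag names `SUBJ` / `VERB` / `OBJ` / `OBJSUBJ`, the use of a bare `O` for tokens outside every span, and the rule for any clash other than object-then-subject are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]    # [start, end) token indices (assumed encoding)
SvoSet = Dict[str, Span]  # role name -> span; roles: "SUBJ", "VERB", "OBJ"


def bio_tags_from_spans(n_tokens: int, svo_sets: List[SvoSet]) -> List[str]:
    """Turn per-set role spans into one BIO tag per token."""
    roles: List[Optional[str]] = [None] * n_tokens
    span_starts = set()
    for svo in svo_sets:
        for role, (start, end) in svo.items():
            span_starts.add(start)
            for i in range(start, end):
                previous = roles[i]
                if previous is None:
                    roles[i] = role
                elif previous == "OBJ" and role == "SUBJ":
                    # Object of an earlier set, subject of a later one.
                    roles[i] = "OBJSUBJ"
                # Assumption: any other clash keeps the earlier tag.

    tags = []
    for i, role in enumerate(roles):
        if role is None:
            tags.append("O")  # token belongs to no span
        elif i in span_starts or i == 0 or roles[i - 1] != role:
            tags.append("B-" + role)
        else:
            tags.append("I-" + role)
    return tags


tokens = ["John", "saw", "the", "dog", "chase", "the", "cat", "yesterday"]
pos_tags = ["NNP", "VBD", "DT", "NN", "VB", "DT", "NN", "NN"]
svo_sets = [
    {"SUBJ": (0, 1), "VERB": (1, 2), "OBJ": (2, 4)},  # John / saw / the dog
    {"SUBJ": (2, 4), "VERB": (4, 5), "OBJ": (5, 7)},  # the dog / chase / the cat
]
svo_tags = bio_tags_from_spans(len(tokens), svo_sets)
training_rows = list(zip(tokens, pos_tags, svo_tags))
for row in training_rows:
    print(row)
```

In this example "the dog" is the object of the first set and the subject of the second, so both of its tokens end up with the object-subject tag. Each resulting row holds a token, its PoS tag, and its SVO tag, which matches the three components the claims list for the annotated training data.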

Assignees

Wipro Ltd

Inventors

Classifications

  • G06F40/289 · Primary

    Phrasal analysis, e.g. finite state techniques or chunking · CPC title

  • Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars · CPC title

  • Physics · mapped topic

  • G06F17/278 · Primary

    Physics · mapped topic

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10002129B1 cover?
This disclosure relates generally to natural language processing, and more particularly to a system and method for extracting subject-verb-object (SVO) chunked text from an unstructured text. In one embodiment, a method is provided for extracting SVO chunked text from an unstructured text. The method comprises identifying a plurality of part of speech (PoS) tokens in the unstructured text, and …
Who is the assignee on this patent?
Wipro Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/289. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 19, 2018 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).