Method and apparatus for structuring data, related computer device and medium

US11615242B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11615242-B2
Application number  US-202016940703-A
Country             US
Kind code           B2
Filing date         Jul 28, 2020
Priority date       Dec 20, 2019
Publication date    Mar 28, 2023
Grant date          Mar 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and an apparatus for structuring data are related to information processing technologies in the field of natural language processing. By acquiring an unstructured text and inputting the unstructured text into an encoder-decoder model, an output sequence is obtained. The encoder-decoder model is trained using a training text marked with the attribute value of each attribute. A structured representation is generated based on the attributes corresponding to the attribute elements included in the output sequence and the attribute values comprised in the attribute elements.
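
The final step of the abstract, generating a structured representation from the output sequence, can be illustrated with a minimal Python sketch. It assumes the output sequence is a flat list of attribute values emitted in a fixed attribute order, so that the position of each element identifies its attribute. The attribute names in `ATTRIBUTES`, the helper `to_structured`, and the sample values are hypothetical and do not come from the patent; the trained encoder-decoder model is replaced by a hard-coded list.

```python
from typing import Dict, List

# Hypothetical attribute schema; the patent does not name specific attributes.
ATTRIBUTES: List[str] = ["product", "price", "color"]


def to_structured(output_sequence: List[str], attributes: List[str]) -> List[Dict[str, str]]:
    """Group a flat output sequence into one record per entity.

    Assumption: attribute elements appear in the same order as `attributes`,
    so the position of an element identifies its attribute.
    """
    width = len(attributes)
    if len(output_sequence) % width != 0:
        raise ValueError("output sequence length must be a multiple of the schema width")
    records = []
    for start in range(0, len(output_sequence), width):
        values = output_sequence[start:start + width]
        records.append(dict(zip(attributes, values)))
    return records


# Stand-in for what a trained encoder-decoder model might emit for two entities.
sample_output = ["phone", "699", "black", "case", "15", "red"]
records = to_structured(sample_output, ATTRIBUTES)
print(records)
```

Mapping by position is only one possible reading of "each attribute element corresponds to a respective attribute"; an implementation could equally carry the attribute name inside each element.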

First claim

Opening claim text (preview).

What is claimed is:

1. A method for structuring data, comprising: acquiring an unstructured text; inputting the unstructured text into an encoder-decoder model to obtain an output sequence, wherein the output sequence comprises a plurality of attribute elements, each attribute element corresponds to a respective attribute, and each attribute element comprises an attribute value of the respective attribute, wherein the encoder-decoder model is trained using a training text marked with the attribute value of each attribute; and generating a structured representation based on the attributes corresponding to the attribute elements comprised in the output sequence and the attribute values comprised in the attribute elements, wherein the encoder-decoder model comprises an encoder and a decoder, and inputting the unstructured text into the encoder-decoder model to obtain the output sequence comprises: performing a word segmentation on the unstructured text to obtain a plurality of word elements; sorting the plurality of word elements in order, to obtain an input sequence; inputting the word elements of the input sequence into the encoder to semantically encode the word elements to obtain a hidden state vector of each word element, wherein the hidden state vector indicates semantics of the respective word element and a context thereof; and decoding each hidden state vector by the decoder to obtain the attribute values of the output sequence, wherein the decoder has learned an attention weight of each hidden state vector with respect to each attribute value and a mapping relation between the hidden state vector that is weighted by the attention weight and the attribute value.

2. The method of claim 1, wherein the output sequence is in a data exchange format, the output sequence in the data exchange format comprises at least one object, and each object comprises a plurality of attribute elements, wherein before inputting the unstructured text into the encoder-decoder model to obtain the output sequence, the method further comprises: acquiring a plurality of training texts, wherein each training text has marked information in the data exchange format, the marked information comprises at least one object corresponding to an entity described by the training text, and each object comprises the attribute value of the attribute for describing the entity, wherein an order of the attribute values of the attributes in the object is the same as an order of the attribute elements of the attributes in the output sequence; and training the encoder-decoder model by adopting the plurality of training texts to minimize an error between the output sequence of the encoder-decoder model and the marked information.

3. The method of claim 2, wherein generating the structured representation based on the attributes corresponding to the attribute elements comprised in the output sequence and the attribute values comprised in the attribute elements comprises: for each object, extracting attribute elements belonging to the object from the output sequence in the data exchange format; generating the structured representation of the object based on the attribute value of each attribute comprised in the attribute elements extracted; and generating the structured representation of the unstructured text based on the structured representation of each object.

4. The method of claim 2, wherein the attribute value of each attribute is one of a text position and an actual text, the attribute value is determined based on a value range of the attribute, and in cases that the value range is limited, the attribute value is the actual text, and in cases that the value range is unlimited, the attribute value is the text position, wherein before generating the structured representation, the method further comprises: for each attribute element, in cases that the attribute value is the text position, updating the attribute value to the word element at the text position in the unstructured text.

5. The method of claim 1, wherein sorting the plurality of word elements in order, to obtain the input sequence comprises: inputting the plurality of word elements into an entity recognition model, to obtain an entity label of each word element; and splicing each word element with a respective entity label as a word element of the input sequence.

6. A computer device, comprising: at least one processor; and a memory, communicatively coupled to the at least one processor, wherein the memory has instructions executable by the at least one processor stored therein, when the instructions are executed by the at least one processor, wherein the at least one processor is configured to: acquire an unstructured text; input the unstructured text into an encoder-decoder model to obtain an output sequence, wherein the output sequence comprises a plurality of attribute elements, each attribute element corresponds to a respective attribute, and each attribute element comprises an attribute value of the respective attribute, wherein the encoder-decoder model is trained using a training text marked with the attribute value of each attribute; and generate a structured representation based on the attributes corresponding to the attribute elements comprised in the output sequence and the attribute values comprised in the attribute elements, wherein the encoder-decoder model comprises an encoder and a decoder, and the at least one processor is further configured to: perform a word segmentation on the unstructured text to obtain a plurality of word elements; sort the plurality of word elements in order, to obtain an input sequence; input the word elements of the input sequence into the encoder to semantically encode the word elements to obtain a hidden state vector of each word element, wherein the hidden state vector indicates semantics of the respective word element and a context thereof; and decode each hidden state vector by adopting the decoder to obtain the attribute values of the output sequence, wherein the decoder has learned an attention weight of each hidden state vector with respect to each attribute value and a mapping relation between the hidden state vector that is weighted by the attention weight and the attribute value.

7. The computer device of claim 6, wherein the output sequence is in a data exchange format, the output sequence in the data exchange format comprises at least one object, and each object comprises a plurality of attribute elements, wherein the at least one processor is further configured to: acquire a plurality of training texts, wherein each training text has marked information in the data exchange format, the marked information comprises at least one object corresponding to an entity described by the training text, and each object comprises the attribute value of the attribute for describing the entity, wherein an order of the attribute values of the attributes of the object is the same as an order of the attribute elements of the attributes in the output sequence; and train the encoder-decoder model by adopting the plurality of training texts to minimize an error between the output sequence of the encoder-decoder model and the marked information.

8. The computer device of claim 7, wherein the at least one processor is further configured to: for each object, extract attribute elements belonging to the object from the output sequence in the data exchange format; generate the structured representation of the object based on the attribute value of each attribute comprised in the attribute elements extracted; and generate the structured representation of the unstructured text based on the structured representation of each object.

9. The comput
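
Claims 3 to 5 describe splicing entity labels onto word elements, extracting per-object attribute elements from an output sequence in a data exchange format, and replacing text positions with the word elements they point to. The Python sketch below walks through those steps under several assumptions that the claims do not state: JSON is used as the data exchange format, a text position is an integer index into the list of word elements, an entity label is joined to its word with a `/` separator, and the attribute names (`drug`, `route`) and labels are invented for illustration. The encoder-decoder model and the entity recognition model are not implemented; their outputs are hard-coded stand-ins.

```python
import json
from typing import Any, Dict, List


def splice_labels(word_elements: List[str], entity_labels: List[str], sep: str = "/") -> List[str]:
    """Join each word element with its entity label (claim 5).

    Assumption: the separator is arbitrary; the claims only say "splicing".
    """
    if len(word_elements) != len(entity_labels):
        raise ValueError("need exactly one entity label per word element")
    return [f"{word}{sep}{label}" for word, label in zip(word_elements, entity_labels)]


def resolve_value(value: Any, word_elements: List[str]) -> str:
    """Replace a text position with the word element at that position (claim 4).

    Assumption: an int is a text position; anything else is actual text.
    """
    if isinstance(value, int):
        return word_elements[value]
    return str(value)


def structure(output_sequence: str, word_elements: List[str]) -> List[Dict[str, str]]:
    """Build one structured record per object in the output sequence (claim 3)."""
    objects = json.loads(output_sequence)
    return [
        {attr: resolve_value(val, word_elements) for attr, val in obj.items()}
        for obj in objects
    ]


# Word elements from a hypothetical segmentation of the unstructured text.
word_elements = ["patient", "took", "aspirin", "twice", "daily"]

# Stand-in for the output of an entity recognition model ("O" = no entity).
entity_labels = ["O", "O", "DRUG", "FREQ", "FREQ"]
input_sequence = splice_labels(word_elements, entity_labels)

# Stand-in for decoder output. "drug" is treated as having an unlimited value
# range, so it is given as a position; "route" is treated as having a limited
# range, so it is given as actual text.
output_sequence = '[{"drug": 2, "route": "oral"}]'

result = structure(output_sequence, word_elements)
print(input_sequence)
print(result)
```

Emitting positions for open-ended attributes means the decoder only has to point into the input rather than generate arbitrary vocabulary, which is presumably why the claims distinguish limited from unlimited value ranges.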

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • G06F40/284 · Primary

    Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11615242B2 cover?
A method and an apparatus for structuring data are related to information processing technologies in the field of natural language processing. By acquiring an unstructured text and inputting the unstructured text into an encoder-decoder model, an output sequence is obtained. The encoder-decoder model is trained using a training text marked with the attribute value of each attribute. A structure…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 28, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).