Character-based attribute value extraction system

US2016321358A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2016321358-A1
Application number  US-201514700683-A
Country             US
Kind code           A1
Filing date         Apr 30, 2015
Priority date       Apr 30, 2015
Publication date    Nov 3, 2016
Grant date          (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system is provided that extracts attribute values. The system receives data including unstructured text from a data store. The system further tokenizes the unstructured text into tokens, where a token is a character of the unstructured text. The system further annotates the tokens with attribute labels, where an attribute label for a token is determined, at least in part, based on a word that the token originates from within the unstructured text. The system further groups the tokens into text segments based on the attribute labels, where a set of tokens that are annotated with an identical attribute label are grouped into a text segment, and where the text segments define attribute values. The system further stores the attribute labels and the attribute values within the data store.
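
The tokenizing and grouping steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the names `CharToken`, `tokenize`, and `group_segments`, the `"O"` label for characters outside any attribute, and the hand-written label list are all assumptions. The sketch also assumes that only consecutive tokens with an identical label form a segment, and it stubs out the annotation step entirely.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class CharToken:
    char: str   # the token itself: one character of the text
    index: int  # position of the character within the whole text
    word: str   # word the character originates from ("" for whitespace)


def tokenize(text: str) -> list[CharToken]:
    """Split text into one token per character, remembering each character's word."""
    tokens = []
    pos = 0
    # Walk alternating runs of whitespace and non-whitespace characters.
    for is_space, run in groupby(text, key=str.isspace):
        chunk = "".join(run)
        for ch in chunk:
            tokens.append(CharToken(ch, pos, "" if is_space else chunk))
            pos += 1
    return tokens


def group_segments(tokens, labels, outside="O"):
    """Merge consecutive tokens that carry an identical label into text segments."""
    segments = []
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        value = "".join(tok.char for tok, _ in run).strip()
        if label != outside and value:
            segments.append((label, value))
    return segments


text = "Acme red mug"
tokens = tokenize(text)
# Hypothetical labels standing in for the output of the annotation step.
labels = ["BRAND"] * 4 + ["O"] + ["COLOR"] * 3 + ["O"] * 4
segments = group_segments(tokens, labels)
print(segments)
```

Each resulting `(label, value)` pair corresponds to an attribute label and attribute value that the abstract describes storing back in the data store.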

First claim

Opening claim text (preview).

We claim:

1. A computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to extract attribute values, the extracting comprising: receiving data comprising unstructured text from a data store; tokenizing the unstructured text into one or more tokens, wherein a token is a character of the unstructured text; annotating the one or more tokens with one or more attribute labels, wherein an attribute label for a token is determined, at least in part, based on a word that the token originates from within the unstructured text; grouping the one or more tokens into one or more text segments based on the one or more attribute labels, wherein a set of one or more tokens that are annotated with an identical attribute label are grouped into a text segment, and wherein the one or more text segments define one or more attribute values; and storing the one or more attribute labels and the one or more attribute values within the data store.

2. The computer-readable medium of claim 1, the extracting further comprising normalizing at least one attribute value of the one or more attribute values.

3. The computer-readable medium of claim 2, the normalizing the at least one attribute value further comprising: pairing an attribute value of the one or more attribute values with one or more target attribute values; selecting a target attribute value that has a highest probability of matching the attribute value; and replacing the attribute value with the selected target attribute value.

4. The computer-readable medium of claim 3, the extracting further comprising replacing at least one attribute label that is annotated for at least one token with at least one new attribute label in response to a user interaction.

5. The computer-readable medium of claim 1, wherein the unstructured text comprises a product description.

6. The computer-readable medium of claim 1, wherein the data store comprises a database.

7. The computer-readable medium of claim 1, wherein the attribute label for the token is further determined, at least in part, based on a character-based conditional random field.

8. The computer-readable medium of claim 1, wherein the one or more tokens are character-based tokens.

9. The computer-readable medium of claim 1, wherein the attribute label for the token is further determined, at least in part, based on at least one of: whether the token is a lowercase character; a shape of the token; a punctuation of the token, one or more surrounding tokens; a size of the word that the token originates from within the unstructured text; a position of the token relative to the word that the token originates from within the unstructured text; or a position of the token relative to the unstructured text.

10. The computer-readable medium of claim 1, the extracting further comprising: receiving one or more pre-defined attribute values; annotating one or more characters of the unstructured text with one or more attribute labels by matching the one or more pre-defined attribute values with one or more text segments of the unstructured text; and replacing at least one attribute label that is annotated for at least one character with at least one new attribute label in response to a user interaction.

11. A computer-implemented method for extracting attribute values, the computer-implemented method comprising: receiving data comprising unstructured text from a data store; tokenizing the unstructured text into one or more tokens, wherein a token is a character of the unstructured text; annotating the one or more tokens with one or more attribute labels, wherein an attribute label for a token is determined, at least in part, based on a word that the token originates from within the unstructured text; grouping the one or more tokens into one or more text segments based on the one or more attribute labels, wherein a set of one or more tokens that are annotated with an identical attribute label are grouped into a text segment, and wherein the one or more text segments define one or more attribute values; and storing the one or more attribute labels and the one or more attribute values within the data store.

12. The computer-implemented method of claim 11, further comprising normalizing at least one attribute value of the one or more attribute values.

13. The computer-implemented method of claim 12, the normalizing the at least one attribute value further comprising: pairing an attribute value of the one or more attribute values with one or more target attribute values; selecting a target attribute value that has a highest probability of matching the attribute value; and replacing the attribute value with the selected target attribute value.

14. The computer-implemented method of claim 13, further comprising replacing at least one attribute label that is annotated for at least one token with at least one new attribute label in response to a user interaction.

15. The computer-implemented method of claim 11, wherein the attribute label for the token is further determined, at least in part, based on at least one of: whether the token is a lowercase character; a shape of the token; a punctuation of the token, one or more surrounding tokens; a size of the word that the token originates from within the unstructured text; a position of the token relative to the word that the token originates from within the unstructured text; or a position of the token relative to the unstructured text.

16. A system for extracting attribute values, the system comprising: a data reception module configured to receive data comprising unstructured text from a data store; a tokenization module configured to tokenize the unstructured text into one or more tokens, wherein a token is a character of the unstructured text; an annotation module configured to annotate the one or more tokens with one or more attribute labels, wherein an attribute label for a token is determined, at least in part, based on a word that the token originates from within the unstructured text; a token grouping module configured to group the one or more tokens into one or more text segments based on the one or more attribute labels, wherein a set of one or more tokens that are annotated with an identical attribute label are grouped into a text segment, and wherein the one or more text segments define one or more attribute values; and an attribute storage module configured to store the one or more attribute labels and the one or more attribute values within the data store.

17. The system of claim 16, further comprising a normalization module configured to normalize at least one attribute value of the one or more attribute values.

18. The system of claim 17, wherein the normalization module is further configured to pair an attribute value of the one or more attribute values with one or more target attribute values; wherein the normalization module is further configured to select a target attribute value that has a highest probability of matching the attribute value; and wherein the normalization module is further configured to replace the attribute value with the selected target attribute value.

19. The system of claim 18, further comprising a manual annotation module configured to replace at least one attribute label that is annotated for at least one token with at least one new attribute label in response to a user interaction.

20. The system of claim 16,
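
Two of the dependent claims lend themselves to a short illustration: the per-character features listed in claims 9 and 15, and the normalization step of claims 3, 13, and 18. The Python sketch below is a hedged reading of that claim language, not the patent's implementation. The function names, the feature keys, and the `A`/`a`/`9` shape encoding are assumptions, and the `difflib` similarity ratio is only a stand-in for the "probability of matching", which the claims do not define.

```python
import string
from difflib import SequenceMatcher


def char_features(text: str, i: int) -> dict:
    """Features for the character at index i, loosely following claims 9 and 15."""
    ch = text[i]
    # Bounds of the whitespace-delimited word that contains position i.
    start = end = i
    if not ch.isspace():
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        while end < len(text) and not text[end].isspace():
            end += 1
    # Assumed shape encoding: A = uppercase, a = lowercase, 9 = digit, else the character.
    shape = "A" if ch.isupper() else "a" if ch.islower() else "9" if ch.isdigit() else ch
    return {
        "is_lower": ch.islower(),
        "shape": shape,
        "is_punct": ch in string.punctuation,
        "prev": text[i - 1] if i > 0 else "",
        "next": text[i + 1] if i + 1 < len(text) else "",
        "word_size": end - start,     # size of the originating word
        "pos_in_word": i - start,     # position relative to that word
        "pos_in_text": i,             # position relative to the whole text
    }


def normalize(value: str, targets: list[str]) -> str:
    """Pair value with every target and return the best-scoring target."""
    def score(target: str) -> float:
        # String similarity used here as a proxy for a matching probability.
        return SequenceMatcher(None, value.lower(), target.lower()).ratio()
    return max(targets, key=score)


features = char_features("Acme X-200", 6)  # the "-" inside "X-200"
normalized = normalize("blk", ["Black", "Blue", "Red"])
print(features)
print(normalized)
```

In a fuller system, feature dictionaries like these would feed a sequence labeler such as the character-based conditional random field named in claim 7, and the similarity score would be replaced by whatever matching model supplies real probabilities.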

Assignees

Oracle Int Corp

Inventors

Classifications

  • G06Q30/00 · Primary

    Commerce · CPC title

  • Mark-up to mark-up conversion (conversion for visualization in web browsing G06F16/9577) · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • using context analysis, e.g. recognition aided by known co-occurring patterns · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016321358A1 cover?
A system is provided that extracts attribute values. The system receives data including unstructured text from a data store. The system further tokenizes the unstructured text into tokens, where a token is a character of the unstructured text. The system further annotates the tokens with attribute labels, where an attribute label for a token is determined, at least in part, based on a word that…
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06Q30/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 3, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).