Weakly supervised extraction of attributes from unstructured data to generate training data for machine learning models

US12210591B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12210591-B2
Application number: US-202117407158-A
Country: US
Kind code: B2
Filing date: Aug 19, 2021
Priority date: Aug 19, 2021
Publication date: Jan 28, 2025
Grant date: Jan 28, 2025

Abstract

Official abstract text for this publication.

An online concierge system receives unstructured data describing items offered for purchase by various warehouses. To generate attributes for products from the unstructured data, the online concierge system extracts candidate values for attributes from the unstructured data through natural language processing. One or more users associate a subset candidate values with corresponding attributes, and the online concierge system clusters the remaining candidate values with the candidate values of the subset associated with attributes. One or more users provide input on the accuracy of the generated clusters. The candidate values are applied as labels to items by the online concierge system, which uses the labeled items as training data for an attribute extraction model to predict values for one or more attributes from unstructured data about an item.
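
The extraction and frequency-selection steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`extract_candidates`, `select_by_threshold`, `select_top_k`), the use of word n-grams as "segments", the per-item counting, and the sample item names are all assumptions made for the example.

```python
from collections import Counter

def extract_candidates(names, max_len=2):
    """Count word n-grams (hypothetical 'segments') across item names.

    Each segment is counted at most once per item name, so the count is
    the number of items in the set whose name contains the segment.
    """
    counts = Counter()
    for name in names:
        tokens = name.lower().split()
        segments = set()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                segments.add(" ".join(tokens[i:i + n]))
        counts.update(segments)
    return counts

def select_by_threshold(counts, min_count):
    """Keep candidates occurring in at least `min_count` items."""
    return {seg for seg, c in counts.items() if c >= min_count}

def select_top_k(counts, k):
    """Keep the `k` highest-ranked candidates by frequency."""
    return [seg for seg, _ in counts.most_common(k)]

# Made-up item names sharing a common characteristic (all are milk).
names = [
    "Organic Whole Milk 1 Gallon",
    "Organic 2% Milk Half Gallon",
    "Lactose Free Whole Milk",
]
counts = extract_candidates(names)
frequent = select_by_threshold(counts, min_count=2)
top = select_top_k(counts, 1)
```

In this sketch the frequent candidates are the ones a human reviewer would be asked to map to attributes, forming the seed set; the remaining candidates are left for the clustering step.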

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining unstructured data describing items for display by an online concierge system, the unstructured data including a name of each item; identifying a set of items each having a common characteristic; extracting candidate values for attributes as segments from names of each item of the set, each candidate value associated with a frequency with which the segment occurs in the set of items; identifying a subset of candidate values based on frequency of occurrence in the set of items; generating a seed set of candidate values for one or more attributes from inputs received from one or more users associating candidate values of the subset with one or more attributes; generating clusters of candidate values from distances between candidate values not included in the subset and candidate values of the subset associated with one or more attributes, each cluster of candidate value corresponding to an attribute and including candidate values that are potential values for the attribute; receiving input from one or more users manually reviewing the generated clusters for accuracy; applying one or more labels to each item of the set, a label applied to an item of the set indicating a candidate value corresponding to an attribute matching a segment extracted from a name of the item of the set; generating training data including a plurality of examples, each example including an identifier of the item of the set and the one or more labels applied to the item of the set; and training an attribute extraction model to predict values for one or more attributes of an item from unstructured data describing the item by applying the attribute extraction model to the plurality of examples of the training data.

2. The method of claim 1, wherein identifying the subset of candidate values based on frequency of occurrence in the set of items comprises: selecting candidate values having at least a threshold frequency of occurrence in the set of items.

3. The method of claim 1, wherein identifying the subset of candidate values based on frequency of occurrence in the set of items comprises: ranking candidate values based on frequency of occurrence in the set of items; and selecting candidate values having at least a threshold position in the ranking.

4. The method of claim 1, wherein generating clusters of candidate values from distances between candidate values not included in the subset and candidate values of the subset associated with one or more attributes comprises: identifying seed clusters as candidate values of the subset associated each associated with a common attribute; identifying a candidate value not included in the subset; determining a distance between the candidate value not included in the subset and each seed cluster; and generating a cluster including the candidate value not included in the subset and a seed cluster having less than a threshold distance to the candidate value not included in the subset.

5. The method of claim 4, wherein the distance between the candidate value not included in the subset a seed cluster is based on semantic distances between the not included in the subset and candidate values included in the seed cluster.

6. The method of claim 4, wherein the distance between the candidate value not included in the subset a seed cluster is based on syntactic distances between the not included in the subset and candidate values included in the seed cluster.

7. The method of claim 1, wherein generating clusters of candidate values from distances between candidate values not included in the subset and candidate values of the subset associated with one or more attributes comprises: identifying seed clusters as candidate values of the subset associated each associated with a common attribute; identifying a candidate value not included in the subset; determining a distance between the candidate value not included in the subset and each seed cluster; and generating a cluster including the candidate value not included in the subset and a seed cluster having a minimum distance to the candidate value not included in the subset.

8. The method of claim 7, wherein the distance between the candidate value not included in the subset a seed cluster is based on semantic distances between the not included in the subset and candidate values included in the seed cluster.

9. The method of claim 7, wherein the distance between the candidate value not included in the subset a seed cluster is based on syntactic distances between the not included in the subset and candidate values included in the seed cluster.

10. The method of claim 1, further comprising: applying the trained attribute extraction model to unstructured data describing additional items; displaying one or more of the additional items to one or more users using predicted values for one or more of the additional items; receiving feedback from a user to whom the one or more additional items were displayed; and updating the training data based on the received feedback.

11. The method of claim 1, wherein receiving input from one or more users manually reviewing the generated clusters for accuracy comprises: receiving an input from a user identifying a generated cluster corresponding to an attribute and an indication whether candidate values included in the generated cluster are accurate.

12. The method of claim 11, wherein the input identifies one or more candidate values to remove from the generated cluster.

13. A non-transitory computer readable medium having instructions encoded thereon that, when executed by a processor, cause the processor to: obtain unstructured data describing items for display by an online concierge system, the unstructured data including a name of each item; identify a set of items each having a common characteristic; extract candidate values for attributes as segments from names of each item of the set, each candidate value associated with a frequency with which the segment occurs in the set of items; identify a subset of candidate values based on frequency of occurrence in the set of items; generate a seed set of candidate values for one or more attributes from inputs received from one or more users associating candidate values of the subset with one or more attributes; generate clusters of candidate values from distances between candidate values not included in the subset and candidate values of the subset associated with one or more attributes, each cluster of candidate value corresponding to an attribute and including candidate values that are potential values for the attribute; receive input from one or more users manually reviewing the generated clusters for accuracy; apply one or more labels to each item of the set, a label applied to an item of the set indicating a candidate value corresponding to an attribute matching a segment extracted from a name of the item of the set; generate training data including a plurality of examples, each example including an identifier of the item of the set and the one or more labels applied to the item of the set; and train an attribute extraction model to predict values for one or more attributes of an item from unstructured data describing the item by applying the attribute extraction model to the plurality of examples of the training data.

14. The non-transitory computer readable medium of claim 13, wherein identify the subset of candidate values based on frequency of occurrence in the set of items comprises: select candidate values having at least a threshold frequency of occurrence in the set of items.
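
As a rough illustration of the clustering and labeling steps recited in the claims, the sketch below assigns each non-seed candidate value to its nearest seed cluster (in the spirit of claim 7), optionally rejecting assignments at or beyond a threshold distance (loosely following claim 4), and then builds training examples by matching cluster values against item names. Everything here is an assumption for the example: the helper names, the use of `difflib.SequenceMatcher` as a stand-in syntactic distance, the closest-member rule for distance to a cluster, and the sample data. The claims do not specify any of these choices.

```python
from difflib import SequenceMatcher

def syntactic_distance(a, b):
    """Stand-in string distance in [0, 1]; 0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def cluster_distance(value, members):
    """Distance to a seed cluster = distance to its closest member (assumption)."""
    return min(syntactic_distance(value, m) for m in members)

def assign_to_clusters(seed, others, max_distance=None):
    """Add each non-seed value to the nearest seed cluster.

    If `max_distance` is given, a value is only added when its distance
    to the nearest cluster is below that threshold.
    """
    clusters = {attr: set(values) for attr, values in seed.items()}
    for value in others:
        attr, dist = min(
            ((a, cluster_distance(value, members)) for a, members in seed.items()),
            key=lambda pair: pair[1],
        )
        if max_distance is None or dist < max_distance:
            clusters[attr].add(value)
    return clusters

def label_items(items, clusters):
    """Build one training example per item: its id plus (attribute, value) labels."""
    examples = []
    for item_id, name in items.items():
        padded = f" {name.lower()} "  # pad so values match whole words only
        labels = sorted(
            (attr, value)
            for attr, values in clusters.items()
            for value in values
            if f" {value} " in padded
        )
        examples.append({"item_id": item_id, "labels": labels})
    return examples

# Hypothetical seed set produced by human reviewers, plus unreviewed values.
seed = {"size": ["1 gallon", "half gallon"], "fat_content": ["whole", "2%"]}
clusters = assign_to_clusters(seed, ["2 gallon", "1%"])
strict = assign_to_clusters(seed, ["oat"], max_distance=0.3)

items = {"sku-1": "Organic Whole Milk 1 Gallon", "sku-2": "Organic 1% Milk 2 Gallon"}
examples = label_items(items, clusters)
```

The resulting `examples` list pairs each item identifier with its labels, which is the shape of training data the claims describe feeding to an attribute extraction model. The manual cluster review step would sit between `assign_to_clusters` and `label_items`, for instance by removing values a reviewer flags as inaccurate.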

Assignees

Maplebear Inc

Inventors

Classifications

  • based on feedback of a supervisor · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Matching criteria, e.g. proximity measures · CPC title

  • utilising user interfaces specially adapted for shopping · CPC title

  • G06F18/23 · Primary

    Clustering techniques · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12210591B2 cover?
An online concierge system receives unstructured data describing items offered for purchase by various warehouses. To generate attributes for products from the unstructured data, the online concierge system extracts candidate values for attributes from the unstructured data through natural language processing. One or more users associate a subset candidate values with corresponding attributes, …
Who is the assignee on this patent?
Maplebear Inc
What technology area does this patent fall under?
Primary CPC classification G06Q30/0641. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 28, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).