Method for systematic mass normalization of titles

US9342592B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9342592-B2
Application number: US-201313953444-A
Country: US
Kind code: B2
Filing date: Jul 29, 2013
Priority date: Jul 29, 2013
Publication date: May 17, 2016
Grant date: May 17, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for normalizing raw titles to canonical titles is described. The method includes designating a set of canonical titles, generating a set of n-grams for each canonical title, assigning a set of attributes to each n-gram, assigning a set of labels to each of the attributes, and storing the labeled canonical title and labeled n-grams in a database. In some examples, a new title may be mapped to an existing canonical title in the database by generating a set of n-grams for the new title, looking up the n-grams in the database of canonical titles, retrieving the set of labels assigned to n-grams in the database that match n-grams from the new title, and assigning those labels to the corresponding attributes of the new title. The new title may then be mapped to a canonical title on the basis of similarly labeled attributes.
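The flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the attribute names (`seniority`, `role`), the labels, the `NGRAM_LABELS` table, and the helper functions are all hypothetical, and an in-memory dictionary stands in for the database. The n-gram length is capped at three words as an assumption.

```python
def ngrams(title, max_n=3):
    """Return every contiguous word n-gram of length 1..max_n, lowercased."""
    words = title.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]

# Hypothetical labeled n-grams: n-gram -> {attribute: label}.
# In the abstract these come from labeling the canonical titles.
NGRAM_LABELS = {
    "senior": {"seniority": "senior"},
    "sr": {"seniority": "senior"},
    "software engineer": {"role": "software_engineer"},
    "developer": {"role": "software_engineer"},
}

def label_title(title, ngram_labels):
    """Look up each n-gram of the title and collect the labels it carries."""
    labels = {}
    for gram in ngrams(title):
        for attribute, label in ngram_labels.get(gram, {}).items():
            labels[attribute] = label
    return labels

def build_database(canonical_titles, ngram_labels):
    """Store each canonical title together with its attribute labels."""
    return {t: label_title(t, ngram_labels) for t in canonical_titles}

db = build_database(["Senior Software Engineer", "Software Engineer"], NGRAM_LABELS)

# Map a new title to the canonical titles that carry the same labels.
new_labels = label_title("Sr Developer", NGRAM_LABELS)
matches = [title for title, labels in db.items() if labels == new_labels]
print(matches)
```

Here "Sr Developer" shares no words with "Senior Software Engineer", yet both receive the same labels, which is what lets the lookup connect them.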

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating a database of labeled canonical titles, the method comprising: designating a set of canonical titles; generating a set of canonical n-grams for each canonical title, wherein each canonical n-gram includes one or more contiguous words in the canonical title; assigning a set of canonical attributes to each canonical n-gram in the set of canonical n-grams; assigning a set of canonical labels to one or more of the canonical attributes for each canonical n-gram; storing at least one of each canonical title, the set of canonical n-grams generated for each canonical title, the set of canonical attributes assigned to each of the canonical n-grams, or the set of canonical labels assigned to each of the canonical attributes in the database of labeled canonical titles; receiving a set of raw titles; generating a set of raw n-grams for each raw title, wherein each raw n-gram includes one or more contiguous words in the raw title; assigning a set of raw labels to one or more attributes in a set of attributes assigned to each raw n-gram, wherein the set of attributes assigned to the raw n-grams and the set of canonical attributes assigned to the canonical n-grams are the same set; grouping the raw titles with identical raw labels into representative groups; selecting a raw title from each representative group to be a representative title; mapping each representative title to one of the canonical titles based on a comparison of the raw labels associated with each representative title to the canonical labels associated with each canonical title; and verifying that the representative titles are correctly mapped to the canonical titles, and that the raw titles are correctly represented by the representative titles.

2. The method of claim 1, wherein receiving a set of raw titles comprises receiving the set of raw titles from a title search.

3. The method of claim 1, wherein receiving a set of raw titles comprises receiving the set of raw titles entered by a user.

4. The method of claim 1, wherein generating a set of raw n-grams for a raw title comprises: generating a set of raw unigrams, wherein each raw unigram is one word in the raw title; generating a set of raw bi-grams, wherein each raw bi-gram is two contiguous words in the raw title; and generating a set of raw tri-grams, wherein each raw tri-gram is three contiguous words in the raw title.

5. The method of claim 1, wherein assigning a set of raw labels comprises: searching the database of labeled canonical titles for the canonical n-grams that match the raw n-grams of one or more of the raw titles; and for each canonical n-gram that matches a raw n-gram of the one or more raw titles: retrieving the canonical labels assigned to each canonical attribute of the matched canonical n-gram; and assigning the retrieved canonical labels to the raw attribute of the raw n-gram of the one or more raw titles, the raw attribute being the same as the canonical attribute associated with the retrieved canonical labels.

6. The method of claim 1, wherein selecting a raw title comprises selecting the most frequently occurring raw title in the representative group as the representative title.

7. The method of claim 1, wherein mapping each representative title to one of the canonical titles comprises: searching the database of labeled canonical titles for a canonical title having labels that are identical to the set of raw labels assigned to the representative title; selecting the canonical title having the identical labels as a best match title; and if no best match title is found: assigning weighting factors to the raw attributes; ranking the canonical titles in the database of labeled canonical titles based on the weighting factors; and selecting the highest ranked canonical title as the best match title.

8. A method for generating a database of labeled canonical titles, the method comprising: designating a set of canonical titles; generating a set of canonical n-grams for each canonical title, wherein each canonical n-gram includes one or more contiguous words in the canonical title; assigning a set of canonical attributes to each canonical n-gram in the set of canonical n-grams; assigning a set of canonical labels to one or more of the canonical attributes for each canonical n-gram; storing at least one of each canonical title, the set of canonical n-grams generated for each canonical title, the set of canonical attributes assigned to each of the canonical n-grams, or the set of canonical labels assigned to each of the canonical attributes in the database of labeled canonical titles; receiving a set of raw titles; generating a set of raw n-grams for each raw title, wherein each raw n-gram includes one or more contiguous words in the raw title; assigning a set of raw labels to one or more attributes in a set of attributes assigned to each raw n-gram, wherein the set of attributes assigned to the raw n-grams and the set of attributes assigned to the canonical n-grams are the same set; mapping the set of raw titles to a first set of the canonical titles; mapping the set of raw titles to a second set of the canonical titles; and comparing the first set of the canonical titles to the second set of the canonical titles to determine differences therebetween.

9. The method of claim 8, wherein mapping the set of raw titles to a first set of the canonical titles comprises selecting canonical titles from the database of labeled canonical titles to represent each raw title in the set of raw titles.

10. The method of claim 8, wherein mapping the set of raw titles to a second set of the canonical titles comprises comparing the raw labels assigned to each attribute of each raw n-gram of each raw title to the canonical labels assigned to each attribute of each canonical n-gram of each canonical title to find a best match.

11. A method for generating a database of labeled canonical titles, the method comprising: designating a set of canonical titles; generating a set of canonical n-grams for each canonical title, wherein each canonical n-gram includes one or more contiguous words in the canonical title; assigning a set of canonical attributes to each canonical n-gram in the set of canonical n-grams; assigning a set of canonical labels to one or more of the canonical attributes for each canonical n-gram; storing at least one of each canonical title, the set of canonical n-grams generated for each canonical title, the set of canonical attributes assigned to each of the canonical n-grams, or the set of canonical labels assigned to each of the canonical attributes in the database of labeled canonical titles; receiving a subset of the canonical titles, including the set of canonical attributes assigned to the canonical titles in the subset and the set of canonical labels assigned to the set of canonical attributes; assigning a weight to each of the canonical attributes in the subset; assigning a weight to each of the canonical labels in the subset; ranking the subset of canonical titles by the canonical attribute weight and the canonical label weight; and displaying the subset of canonical titles arranged in order of ranking.

12. The method of claim 11, wherein displaying comprises: displaying the canonical titles in the subset having the same ranking on the same level; and displaying the canonical titles in the subset having higher rankings on a higher level than the canonical titles in the subset having lower rankings.

13. A method of mapping a raw title to a canonical title in a database of labeled canonical titles, the method c
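The grouping, representative-selection, and mapping steps recited in claims 1, 6, and 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the raw titles arrive already labeled, the attribute names and weights are invented, and the fallback score (sum of the weights of attributes whose labels agree) is only one possible way to rank canonical titles by weighting factors. The verification step of claim 1 is not modeled.

```python
from collections import Counter, defaultdict

def group_by_labels(labeled_raw):
    """Group raw titles whose attribute labels are identical."""
    groups = defaultdict(list)
    for title, labels in labeled_raw:
        groups[frozenset(labels.items())].append(title)
    return groups

def representative(titles):
    """Pick the most frequently occurring raw title in a group (claim 6)."""
    return Counter(titles).most_common(1)[0][0]

def best_match(labels, canonical, weights):
    """Prefer an identical label set; otherwise rank by weighted agreement."""
    for title, canonical_labels in canonical.items():
        if canonical_labels == labels:
            return title

    # Assumed scoring: add the weight of every attribute whose label agrees.
    def score(canonical_labels):
        return sum(
            weights.get(attribute, 1.0)
            for attribute, label in labels.items()
            if canonical_labels.get(attribute) == label
        )

    return max(canonical, key=lambda title: score(canonical[title]))

# Hypothetical inputs: pre-labeled raw titles, canonical titles, weights.
senior_dev = {"seniority": "senior", "role": "software_engineer"}
labeled_raw = [
    ("Sr Developer", senior_dev),
    ("Senior Dev", senior_dev),
    ("Sr Developer", senior_dev),
    ("Principal Accountant", {"seniority": "principal", "role": "accountant"}),
]
canonical = {
    "Senior Software Engineer": {"seniority": "senior", "role": "software_engineer"},
    "Senior Accountant": {"seniority": "senior", "role": "accountant"},
}
weights = {"role": 2.0, "seniority": 1.0}

mapping = {}
for key, titles in group_by_labels(labeled_raw).items():
    mapping[representative(titles)] = best_match(dict(key), canonical, weights)
print(mapping)
```

Mapping one representative per group, rather than every raw title, keeps the number of comparisons proportional to the number of distinct label sets instead of the number of raw titles.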

Assignees

Workday Inc

Inventors

Classifications

  • G06Q10/40 · Primary

    Business processes related to social networking or social networking services · CPC title

  • using statistical methods · CPC title

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Creation or modification of classes or clusters · CPC title

  • File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Workday Inc
What technology area does this patent fall under?
Primary CPC classification G06Q10/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 17, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).