US9348902B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9348902-B2 |
| Application number | US-201313754802-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 30, 2013 |
| Priority date | Jan 30, 2013 |
| Publication date | May 24, 2016 |
| Grant date | May 24, 2016 |
Official abstract text for this publication.
Systems and methods are disclosed herein for performing classification of documents or performing other tasks based on rules. The rules may include context rules that define a mapping that relates a value and context in a document to an attribute to which the value corresponds. Products are selected for labeling with attributes by identifying patterns, e.g. values and contexts that are not covered by a current rule set. Those products having a highest score are selected for labeling in a crowd sourcing forum, where the score is based on the number of non-covered patterns and a frequency of occurrence of the non-covered patterns in a document corpus. Proposed rules are generated for frequently occurring patterns and submitted to analysts for one or both of completion and validation. Proposed rules may include a proposed attribute for a frequently occurring value and corresponding context.
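The selection step in the abstract (and in steps (a) of claims 1 and 10) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes patterns have already been extracted as plain strings, and it assumes "frequency of usage" means the number of corpus documents containing a pattern. The names `select_batch`, `doc_patterns`, `covered`, and `batch_size` are hypothetical.

```python
from collections import Counter


def select_batch(doc_patterns, covered, batch_size):
    """Rank documents by the corpus-wide frequency of their uncovered patterns.

    doc_patterns: dict mapping a document id to its set of patterns
                  (assumed to be extracted beforehand).
    covered:      set of patterns already handled by validated rules.
    Returns (ids of the top `batch_size` documents, all scores).
    """
    # Assumption: usage frequency = number of documents containing the pattern.
    freq = Counter(p for pats in doc_patterns.values() for p in set(pats))

    scores = {}
    for doc_id, pats in doc_patterns.items():
        # Remove patterns for which a validated rule already applies.
        uncovered = set(pats) - covered
        # Score is the sum of usage frequencies of the remaining patterns.
        scores[doc_id] = sum(freq[p] for p in uncovered)

    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:batch_size], scores


docs = {
    "d1": {"<num> GB RAM", "<num> inch screen"},
    "d2": {"<num> GB RAM", "<num> year warranty"},
    "d3": {"<num> inch screen"},
}
batch, scores = select_batch(docs, covered={"<num> inch screen"}, batch_size=2)
print(batch)   # ['d2', 'd1']
print(scores)  # {'d1': 2, 'd2': 3, 'd3': 0}
```

In this toy corpus, `d2` ranks first because both of its patterns are uncovered, while `d3` scores zero because its only pattern is already covered by a rule.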
Opening claim text (preview).
The invention claimed is:

1. A method for labeling, the method comprising: identifying, by a computer system comprising one or more processors, a plurality of patterns in a plurality of documents, the plurality of patterns not covered by a set of validated contextual rules; (a) selecting, by the computer system, a batch of documents of the plurality of documents according to a frequency of usage of patterns included in the batch of documents of the plurality of documents, where the frequency of usage is for the plurality of documents, the batch of documents of the plurality of documents is selected by, for each document of the plurality of documents: identifying a pattern set for the each document; removing from the pattern set those patterns for which a rule of the set of validated contextual rules applies; and computing a score for the each document according to a sum of usage frequencies for the patterns of the pattern set for the plurality of documents; (b) submitting, by the computer system, the batch of documents of the plurality of documents to a crowdsourcing community for labeling; (c) receiving, by the computer system, labels for the batch of documents of the plurality of documents; (d) generating, by the computer system, proposed contextual rules based on labeled documents of the plurality of documents; (e) submitting, by the computer system, the proposed contextual rules to an analyst community for validation; (f) receiving, by the computer system, validation of a portion of the proposed contextual rules; and (g) adding the portion validated of the proposed contextual rules to the set of validated contextual rules.

2. The method of claim 1, further comprising, repeating steps (a) through (g) repeatedly.

3. The method of claim 1, wherein generating, by the computer system, the proposed contextual rules based on the labeled documents of the plurality of documents comprises: identifying one or more patterns in the labeled documents and a usage frequency for the one or more patterns identified; selecting a top N patterns of the one or more patterns having highest usage frequencies; if a label of the labeled documents has a high correspondence to a pattern of the top N patterns of the one or more patterns, generating a proposed contextual rule relating the pattern of the top N patterns of the one or more patterns to the label; and otherwise, generating a proposed contextual rule that includes the pattern of the top N patterns of the one or more patterns but no related label.

4. The method of claim 1, further comprising applying the set of validated contextual rules to a document of the plurality of documents in order to determine labels thereof.

5. The method of claim 1, further comprising applying the set of validated contextual rules to the plurality of documents by, for a current document of the plurality of documents: identifying one or more tokens in the current document; searching for the one or more tokens identified in an index relating words to rules; counting a number of hits for each applicable rule of applicable rules having a corresponding token in the one or more tokens identified; comparing the number of hits for the each applicable rule of the applicable rules; and labeling the current document according to those applicable rules having a pattern word count greater than or equal to the number of hits for the each applicable rule of the applicable rules.

6. The method of claim 1, wherein the set of validated contextual rules (a) take as inputs a value and a value context and (b) output an attribute label for the value.

7. The method of claim 6, wherein the value context includes words adjacent the value.

8. The method of claim 7, wherein the value context includes a part of speech for the words adjacent the value.

9. The method of claim 1, wherein the plurality of documents are product descriptions in a product taxonomy.

10. A system for labeling, the system comprising one or more processors and one or more memory devices operably coupled to the one or more processors, the one or more memory devices storing executable and operational data effective to cause the one or more processors to: identify a plurality of patterns in a plurality of documents, the plurality of patterns not covered by a set of validated contextual rules; (a) select a batch of documents of the plurality of documents according to a frequency of usage of patterns included in the batch of documents of the plurality of documents, where the frequency of usage is for the plurality of documents, the batch of documents of the plurality of documents is selected by, for each document of the plurality of documents: identifying a pattern set for the each document; removing from the pattern set those patterns for which a rule of the set of validated contextual rules applies; and computing a score for the each document according to a sum of usage frequencies for the patterns of the pattern set for the plurality of documents; (b) submit the batch of documents of the plurality of documents to a crowdsourcing community for labeling; (c) receive labels for the batch of documents of the plurality of documents; (d) generate proposed contextual rules based on labeled documents of the plurality of documents; (e) submit the proposed contextual rules to an analyst community for validation; (f) receive validation of a portion of the proposed contextual rules; and (g) add the portion validated of the proposed contextual rules to the set of validated contextual rules.

11. The system of claim 10, wherein the executable and operational data are further effective to cause the one or more processors to repeat steps (a) through (g) repeatedly.

12. The system of claim 10, wherein the executable and operational data are further effective to cause the one or more processors to generate the proposed contextual rules based on the labeled documents of the plurality of documents by: identifying one or more patterns in the labeled documents and a usage frequency for the one or more patterns identified; selecting a top N patterns of the one or more patterns having highest usage frequencies; if a label of the labeled documents has a high correspondence to a pattern of the top N patterns of the one or more patterns, generating a proposed contextual rule relating the pattern of the top N patterns of the one or more patterns to the label; and otherwise, generating a proposed contextual rule that includes the pattern of the top N patterns of the one or more patterns but no related label.

13. The system of claim 10, wherein the executable and operational data are further effective to cause the one or more processors to apply the set of validated contextual rules to a document of the plurality of documents in order to determine labels thereof.

14. The system of claim 10, wherein the executable and operational data are further effective to cause the one or more processors to apply the set of validated contextual rules to the plurality of documents by, for a current document of the plurality of documents: identifying one or more tokens in the current document; searching for the one or more tokens identified in an index relating words to rules; counting a number of hits for each applicable rule of applicable rules having a corresponding token in the one or more tokens identified; comparing the number of hits for the each applicable rule of the applicable rules; and labeling the current document according to those applicable rules having a pattern word count greater than or equal to the number of hits for the each applicable rule o
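The rule-proposal step recited in claims 3 and 12 can be illustrated with a short sketch. It is written under stated assumptions and is not taken from the patent: labeled data is assumed to arrive as `(pattern, label)` pairs, and "high correspondence" is assumed to mean that the most common label accounts for at least a threshold share of a pattern's occurrences. The names `propose_rules`, `top_n`, and `min_agreement`, and the 0.8 default, are hypothetical.

```python
from collections import Counter, defaultdict


def propose_rules(labeled, top_n, min_agreement=0.8):
    """Propose (pattern, label) rules from crowd-labeled data.

    labeled: iterable of (pattern, label) pairs, one per labeled occurrence.
    Returns a list of (pattern, label_or_None) for the top_n most frequent
    patterns. None means the rule is left for an analyst to complete.
    """
    pattern_freq = Counter()
    labels_by_pattern = defaultdict(Counter)
    for pattern, label in labeled:
        pattern_freq[pattern] += 1
        labels_by_pattern[pattern][label] += 1

    rules = []
    # Keep only the top N patterns by usage frequency.
    for pattern, count in pattern_freq.most_common(top_n):
        label, hits = labels_by_pattern[pattern].most_common(1)[0]
        # Assumption: "high correspondence" = dominant label share >= threshold.
        proposed = label if hits / count >= min_agreement else None
        rules.append((pattern, proposed))
    return rules


labeled = (
    [("<num> GB RAM", "memory")] * 4
    + [("<num> GB RAM", "storage")]
    + [("<num> inch", "screen_size")] * 2
    + [("<num> inch", "length")] * 2
    + [("<num> year", "warranty")]
)
rules = propose_rules(labeled, top_n=2)
print(rules)  # [('<num> GB RAM', 'memory'), ('<num> inch', None)]
```

Here the first pattern gets a proposed label because one label dominates, while the second is emitted without a label, matching the "otherwise" branch of the claim, so that an analyst can complete it.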
Physics · mapped topic
Office automation; Time management · CPC title
Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title
using metadata automatically derived from the content · CPC title