US9582490B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9582490-B2 |
| Application number | US-201314075690-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 8, 2013 |
| Priority date | Jul 12, 2013 |
| Publication date | Feb 28, 2017 |
| Grant date | Feb 28, 2017 |
Official abstract text for this publication.
A collection of data that is extremely large can be difficult to search and/or analyze. Relevance may be dramatically improved by automatically classifying queries and web pages in useful categories, and using these classification scores as relevance features. A thorough approach may require building a large number of classifiers, corresponding to the various types of information, activities, and products. Creation of classifiers and schematizers is provided on large data sets. Exercising the classifiers and schematizers on hundreds of millions of items may expose value that is inherent to the data by adding usable meta-data. Some aspects include active labeling exploration, automatic regularization and cold start, scaling with the number of items and the number of classifiers, active featuring, and segmentation and schematization.
Opening claim text (preview).
The invention claimed is:

1. One or more hardware memory having embodied thereon computer-usable instructions that, when executed, facilitate a method of interactively labeling training data for machine learning, the method comprising: generating a first set of data items; alternating between repetitions of labeling sampled data items and repetitions of labeling searched data items; wherein labeling sampled data items includes A) providing a first set of data items, wherein each data item is scored with a probability of being a positive example of a particular class of data items, B) presenting a second set of one or more data items on a user interface, wherein the second set of data items is generated by sampling, from the first set of data items, one or more data items having scores lying within a range of scores selected to a) optimize precision of the classifier, wherein the scores selected to optimize precision of the classifier are greater than 0.5 on a scale of zero to one, or b) optimize recall of the classifier, C) receiving, via the user interface, one or more user-provided labels that identify one or more data items in the second set as positive or negative examples of the particular class of data items, and D) training the classifier based on the user-provided labels in the second set of data items; and wherein labeling searched data items includes A) presenting, on the user interface, a third set of one or more data items selected from the first set of data items based on a user-provided search query, B) receiving, via the user interface, one or more user-provided labels that identify one or more data items in the third set as positive or negative examples of the particular class of data items; and C) training the classifier based on the user-provided labels in the third set of data items.

2. The one or more hardware memory of claim 1, wherein training the classifier occurs automatically and is transparent to a user.

3. The one or more hardware memory of claim 1, wherein subsequent to training the classifier each data item in the first set of data items is rescored with a probability of being a positive example of the particular class of data items.

4. The one or more hardware memory of claim 1, wherein the scores selected to optimize precision of the classifier lie within a distribution around a probability of 0.75 on a scale of zero to one.

5. The one or more hardware memory of claim 1, wherein the scores selected to optimize recall of the classifier are less than 0.5 on a scale of zero to one.

6. The one or more hardware memory of claim 5, wherein the scores selected to optimize recall of the classifier lie within a distribution around a probability of 0.25 on a scale of zero to one.

7. The one or more hardware memory of claim 1, wherein the first set of data items includes data items randomly sampled from a larger set of data items.

8. The one or more hardware memory of claim 7, wherein prior to scoring the first set of data items with the classifier, the classifier is pre-trained based on data items that are results of a user query.

9. The one or more hardware memory of claim 7, wherein the data items randomly sampled from the larger set of data items are initially unlabeled.

10. One or more hardware memory having embodied thereon computer-usable instructions that, when executed, facilitate a method of interactively labeling training data for machine learning, the method comprising: providing a working set of data items; training a classifier using a predefined set of positive or negative examples of a particular class of data items, wherein the classifier is trained to score a data item with a probability of being an example of the particular class of data item; scoring the working set of data items via the classifier; generating a sampled set of data items by sampling the working set of data items within a selected range of data item scores, wherein the selected range of data item scores is selected to optimize either precision of the classifier or recall of the classifier; presenting the sampled set of data items within a user interface; receiving, via the user interface, one or more user-provided labels for one or more of the presented data items, wherein the one or more user-provided labels identify the one or more of the presented data items as positive or negative examples of the class of data items; retraining the classifier based on the one or more user-provided labels; and repeating steps of presenting data items, receiving user-provided labels, and retraining the classifier, wherein the presenting data items alternates between presenting sampled sets of data items and presenting data items based on results of a user-provided search query, and wherein the selected range of data item scores is alternated between the range of data item scores selected to optimize the precision of the classifier and the range of data item scores selected to optimize the recall of the classifier.

11. The one or more hardware memory of claim 10, wherein subsequent to retraining the classifier, the working set of data items is scored with the retrained classifier.

12. The one or more hardware memory of claim 10, wherein the predefined set of positive or negative examples is based on search results of a search engine.

13. The one or more hardware memory of claim 10, wherein the range of data item scores selected to optimize the precision of the classifier is within a range of scores having a probability of greater than 0.5 of being positive examples on a scale of zero to one.

14. The one or more hardware memory of claim 13, wherein the range of data item scores selected to optimize the recall of the classifier is within a range of scores having a probability of less than 0.5 of being positive examples on a scale of zero to one.

15. One or more hardware memory having embodied thereon computer-usable instructions that, when executed, facilitate a method of interactively labeling training data for machine learning, the method comprising: generating a first set of data items; identifying data items, from within the first set of data items, as belonging to a particular class of data items, based on a user search query; training a classifier based at least on the data items that were identified as belonging to the particular class of data items; scoring the first set of data items with the classifier, wherein the classifier scores each data item with a probability of being a positive example of the particular class of data items; selecting, from the first set of data items, a second set of one or more data items based on the scoring, wherein the one or more data items in the second set are selected A) to improve precision of the classifier, wherein the selected one or more data items have scores that lie within a distribution around a probability of 0.75 on a scale of zero to one of being a positive example of the particular class of data items, or B) to improve recall of the classifier, wherein the selected one or more data items have scores that lie within a distribution around a probability of 0.25 on a scale of zero to one of being a positive example of the particular class of data items; presenting the second set of one or more data items on a user interface; receiving, via the user interface, one or more user-provided labels that identify one or more of the data items in the second set as positive or negative examples of the particular class of data items, wherein the one or more user-provided labels are received by means of one or more of a text entry field, a checkbo
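
The selection and alternation steps recited in the claims can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`sample_by_score_band`, `search_items`, `labeling_schedule`), the uniform band of half-width 0.1, the substring search, and the strict one-to-one alternation are all assumptions chosen for illustration. The claims state only that sampled scores lie "within a distribution around" 0.75 (precision) or 0.25 (recall), and that sampled and searched labeling rounds alternate.

```python
import random
from typing import Dict, List

# Band centers taken from the claims; the half-width is an assumption.
PRECISION_CENTER = 0.75
RECALL_CENTER = 0.25
HALF_WIDTH = 0.1


def sample_by_score_band(scores: Dict[str, float], center: float,
                         half_width: float, k: int,
                         rng: random.Random) -> List[str]:
    """Return up to k item ids whose score lies within center +/- half_width.

    A uniform band stands in for the "distribution around" a probability
    that the claims mention (an assumption of this sketch).
    """
    lo, hi = center - half_width, center + half_width
    candidates = sorted(item for item, s in scores.items() if lo <= s <= hi)
    if len(candidates) <= k:
        return candidates
    return rng.sample(candidates, k)


def search_items(texts: Dict[str, str], query: str) -> List[str]:
    """Return ids of items whose text contains the query (hypothetical search)."""
    q = query.lower()
    return sorted(item for item, text in texts.items() if q in text.lower())


def labeling_schedule(rounds: int) -> List[str]:
    """Alternate sampled and searched rounds.

    Sampled rounds themselves alternate between the precision band and
    the recall band, as in claim 10.
    """
    modes: List[str] = []
    sampled_rounds = 0
    for i in range(rounds):
        if i % 2 == 0:
            modes.append("precision" if sampled_rounds % 2 == 0 else "recall")
            sampled_rounds += 1
        else:
            modes.append("search")
    return modes
```

In a full loop, each round would present the selected items to a labeler, retrain the classifier on the new labels, and rescore the working set before the next round. Those steps are omitted here because the excerpt does not specify a classifier or user interface.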
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Formats for control data (H04L1/16 takes precedence; training sequences H04L25/00 and H04L27/00) · CPC title
Dictionaries · CPC title
Indexing; Web crawling techniques · CPC title
Semantic analysis · CPC title