Indexing Electronic Documents
US-2016292296-A1 · Oct 6, 2016 · US
US12072935B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12072935-B2 |
| Application number | US-202117469751-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 8, 2021 |
| Priority date | Sep 8, 2021 |
| Publication date | Aug 27, 2024 |
| Grant date | Aug 27, 2024 |
Official abstract text for this publication.
Machine learning to predict a layout type that each of a plurality of portions of a document appears in. This is done even though the computer-readable representation of the document does not contain information at the granularity of the prediction to be made that identifies which layout type that each of the plurality of document portions belongs in. For each of a plurality of the portions, the machine-learning system predicts the layout type that the respective portion appears in, and indexes the document using the predictions so as to result in a computer-readable index. The index represents a predicted layout type associated with each of the plurality of portions of the document. Thus, the index can be used to search based on position of a searched term within the document.
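The indexing idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The predictor `toy_predict_layout`, the layout labels (`heading`, `table`, `body`), and the function names `build_layout_index` and `search` are all hypothetical, chosen only for illustration. The sketch assumes the document has already been split into portions, and it replaces the machine-learning model with a trivial heuristic so the example stays self-contained.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# (position of the portion in the document, portion text)
Entry = Tuple[int, str]


def toy_predict_layout(portion: str) -> str:
    """Hypothetical stand-in for the machine-learning predictor.

    A real system would use a trained model; this heuristic exists only
    so the sketch runs without external dependencies.
    """
    text = portion.strip()
    if text.isupper():
        return "heading"
    if "|" in text:
        return "table"
    return "body"


def build_layout_index(
    portions: List[str], predict: Callable[[str], str]
) -> Dict[str, List[Entry]]:
    """Group portions into one collection per predicted layout type."""
    index: Dict[str, List[Entry]] = defaultdict(list)
    for position, portion in enumerate(portions):
        index[predict(portion)].append((position, portion))
    return dict(index)


def search(
    index: Dict[str, List[Entry]], term: str, layout_type: Optional[str] = None
) -> List[Tuple[str, int, str]]:
    """Case-insensitive substring search, optionally limited to one layout type."""
    layouts = [layout_type] if layout_type is not None else list(index)
    hits = []
    for layout in layouts:
        for position, portion in index.get(layout, []):
            if term.lower() in portion.lower():
                hits.append((layout, position, portion))
    return hits


portions = [
    "QUARTERLY REVENUE",
    "Revenue grew in the third quarter.",
    "Region | Revenue",
    "Costs were flat.",
]
index = build_layout_index(portions, toy_predict_layout)
table_hits = search(index, "revenue", layout_type="table")
all_hits = search(index, "revenue")
```

Restricting the search to the `table` collection returns only the portion predicted to sit in a table, while the unrestricted search returns every portion containing the term. This mirrors the abstract's point that the index allows searching by where a term appears in the document.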
Opening claim text (preview).
What is claimed is:
1. A computing system comprising: one or more processors; and one or more computer-readable media having stored thereon computer-executable instructions that are structured such that, if executed by the one or more processors, would cause the computing system to perform a machine-learned prediction of a layout type that each of a plurality of portions of a document appears in by performing the following: accessing a computer-readable representation of a document that contains a plurality of portions and, for at least one portion of the document, a plurality of sub-portions, for which there is no layout information at a granularity of the layout prediction to be made identified within the computer-readable representation of the document; for each of the plurality of sub-portions, predicting a layout type that the particular sub-portion appears in; and for each of the plurality of the portions, predicting a layout type that each of the plurality of portions appears in, wherein if the particular portion of the plurality of portions contains the particular sub-portions, the predicting the layout type for the particular portion is done by using the layout predictions for the particular sub-portion; and indexing the document using the layout predictions for each of the plurality of portions and each of the plurality of sub-portions so as to result in a computer-readable index that is structured so as to be interpretable by a computing system to represent a predicted layout type associated with each of the plurality of portions and each of the plurality of sub-portions of the document.
2. The computing system in accordance with claim 1, the computer-executable instructions being further structured such that, if executed by the one or more processors, the predictions of the layout type of the sub-portions is performed using a neural network.
3. The computing system in accordance with claim 2, the computer-executable instructions being further structured such that, if executed by the one or more processors, the predictions of the layout type of the particular portions is performed using a rules-based prediction component.
4. The computing system in accordance with claim 1, the particular portion being a sentence.
5. The computing system in accordance with claim 4, the multiple sub-portions of the particular portion being words within the sentence.
6. The computing system in accordance with claim 4, the multiple sub-portions of the particular portion being characters within the sentence.
7. The computing system in accordance with claim 1, the computer-executable instructions being further structured such that, if executed by the one or more processors, the indexing of the document is performed by constructing multiple collections each associated with a respective layout type, and each containing one or more portions of the document that are predicted to appear in the respective layout type.
8. The computing system in accordance with claim 7, the computer-executable instructions being further structured such that, if executed by the one or more processors, the computing system is caused to instantiate a search component that is configured to interpret search requests that expressly contain one or more search terms and an identification of a layout type, and in response: selecting one or more of the multiple collections associated with the identified layout type; and performing a search based on the one or more search terms, the search being performed on only the selected one or more collections.
9. The computing system in accordance with claim 8, the search requests including searches for documents based at least in part upon similarity in layout with an identified document.
10. The computing system in accordance with claim 7, the computer-executable instructions being further structured such that, if executed by the one or more processors, the computing system is caused to instantiate a search component that is configured to interpret search results that contain one or more search terms, but not an identification of a layout type, and in response: determining a layout type associated with the search request; selecting one or more of the multiple collections associated with the determined layout type; and performing a search based on the one or more search terms, the search being performed on only the selected one or more collections.
11. The computing system in accordance with claim 7, the computer-executable instructions being further structured such that, if executed by the one or more processors, the computing system is caused to instantiate a search component that is configured to interpret search results that contain one or more search terms, but not an identification of a layout type, and in response: determining a layout type that is not to be associated with the search request; selecting one or more of the multiple collections not associated with the determined layout type; and performing a search based on the one or more search terms, the search being performed on only the selected one or more collections.
12. The computing system in accordance with claim 1, the computer-executable instructions being further structured such that, if executed by the one or more processors, the indexing of the document is performed by: constructing a first collection of portions of the document that are predicted to appear in a first layout type, and labelling the first collection with an identification of the first layout type; and constructing a second collection of portions of the document that are predicted to appear in a second layout type, and labelling the second collection with an identification of the second layout type.
13. A method performed by a computing system, the method for machine-learned prediction of a layout type that each of a plurality of portions of a document appears in, the method comprising: accessing a computer-readable representation of a document that contains a plurality of portions and, for at least one portion of the document, a plurality of sub-portions, for which there is no layout information at a granularity of the layout prediction to be made identified within the computer-readable representation of the document; for each of the plurality of sub-portions, predicting a layout type that the particular sub-portion appears in; and for each of the plurality of the portions, predicting a layout type that each of the plurality of portions appears in, wherein if the particular portion of the plurality of portions contains the particular sub-portions, the predicting the layout type for the particular portion is done by using the layout predictions for the particular sub-portion; and indexing the document using the layout predictions for each of the plurality of portions and each of the plurality of sub-portions so as to result in a computer-readable index that is structured so as to be interpretable by a computing system to represent a predicted layout type associated with each of the plurality of portions and each of the plurality of sub-portions of the document.
14. The method in accordance with claim 13, the predictions of the layout type of the sub-portions being performed using a neural network.
15. The method in accordance with claim 14, the predictions of the layout type of the particular portions is performed using a rules-based prediction component.
16. The method in accordance with claim 13, the particular portion being a sentence, the multiple sub-portions of the particular portion being words within the sentence.
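Claims 1 to 5 describe a two-level prediction: each sub-portion (for example a word) receives a layout prediction, and the layout of the containing portion (for example a sentence) is then derived from those sub-portion predictions, optionally by a rules-based component. The claims do not state which rule is used. The sketch below assumes a simple majority vote, and `toy_word_predictor` is a hypothetical lookup standing in for the neural network of claim 2. All function names and layout labels are illustrative only.

```python
from collections import Counter
from typing import Callable, List, Tuple


def toy_word_predictor(word: str) -> str:
    """Hypothetical stand-in for a neural word-level layout predictor.

    Assumption for illustration: purely numeric tokens are treated as
    table content and everything else as body text.
    """
    if word.rstrip(".,").isdigit():
        return "table"
    return "body"


def predict_sentence_layout(
    sentence: str, predict_word: Callable[[str], str]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Predict a sentence's layout from the predictions for its words.

    Assumed rule: majority vote over the word-level predictions. Ties go
    to the layout type seen first, since Counter.most_common keeps
    first-encountered order for equal counts.
    """
    words = sentence.split()
    word_layouts = [predict_word(word) for word in words]
    if not word_layouts:
        return "unknown", []
    layout = Counter(word_layouts).most_common(1)[0][0]
    # Return both levels so an index can record portion and sub-portion layouts.
    return layout, list(zip(words, word_layouts))


numeric_layout, numeric_words = predict_sentence_layout(
    "12 34 56 total", toy_word_predictor
)
prose_layout, prose_words = predict_sentence_layout(
    "The total was 12.", toy_word_predictor
)
```

Returning the word-level predictions alongside the sentence-level result reflects the claim language, in which the index represents a predicted layout type for both the portions and the sub-portions.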
Extracting rules from data · CPC title
Neural networks · CPC title
Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title
by using string matching techniques · CPC title
Machine learning · CPC title