Multiple channels of rasterized content for page decomposition using machine learning
US-11386685-B2 · Jul 12, 2022 · US
US11880435B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11880435-B2 |
| Application number | US-202117167316-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 4, 2021 |
| Priority date | Feb 12, 2020 |
| Publication date | Jan 23, 2024 |
| Grant date | Jan 23, 2024 |
Official abstract text for this publication.
A document is received. The document is analyzed to discover text and structures of content included in the document. A result of the analysis is used to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document. The intermediate text representations are used as an input to a machine learning model to extract information of interest in the document. One or more structured records of the extracted information of interest are created.
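As a rough illustration of the "added text encoding the discovered structure" idea in the abstract, the sketch below prefixes each table cell with its column label before the text would be handed to a model. The `Segment` class, its field names, the `label: value` format, and the sample row are all assumptions made for illustration; the patent text shown here does not specify a concrete format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One content segment discovered in a document (hypothetical structure)."""
    text: str          # text found in the segment, e.g. a table cell
    column_label: str  # discovered structure: the column the cell belongs to


def to_intermediate_text(segment: Segment) -> str:
    """Add text that encodes the segment's column to the segment's own text."""
    return f"{segment.column_label}: {segment.text}"


def build_model_input(segments: List[Segment]) -> str:
    """Join the intermediate representations of one table row into one string."""
    return " | ".join(to_intermediate_text(s) for s in segments)


# Illustrative row from a made-up licensing table.
row = [
    Segment("Acme Office Suite", "Product"),
    Segment("50", "Quantity"),
]
model_input = build_model_input(row)
print(model_input)
```

The point of the added label is that a text-only model sees which column a value came from, information that would otherwise be lost once the table layout is flattened.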
Opening claim text (preview).
What is claimed is:

1. A method, comprising: receiving a document; analyzing the document to discover text and structures of content included in the document; using a result of the analysis to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document; using the intermediate text representations as an input to a machine learning model to extract information of interest in the document; and creating one or more structured records of the extracted information of interest.

2. The method of claim 1, wherein the document is a legal document.

3. The method of claim 2, wherein the legal document is a contract for transfer of software rights.

4. The method of claim 1, wherein analyzing the document includes determining a document type and performing processing specific to the document type associated with preparing the document for discovery of text and table structures.

5. The method of claim 4, wherein the processing associated with preparing the document for discovery of text and table structures includes converting text in images to a format that is readable and searchable by a computer.

6. The method of claim 1, wherein analyzing the document includes utilizing an additional machine learning model to determine table boundary coordinates within the document.

7. The method of claim 6, wherein the additional machine learning model is a fast region-based convolutional neural network (Fast R-CNN).

8. The method of claim 1, wherein analyzing the document includes determining whether a discovered table includes relevant content.

9. The method of claim 8, wherein determining whether the discovered table includes relevant content includes detecting text associated with a specified list of words pertaining to software licensing.

10. The method of claim 1, wherein the intermediate text representations are converted from prior text representations that include content values that are separated by separator characters.

11. The method of claim 1, wherein the intermediate text representations are generated at least in part by combining label components and non-label components, extracted from prior text representations, into one or more natural language sentences.

12. The method of claim 1, wherein the added text encoding the discovered structure of the corresponding content segment within the structural layout of the document comprises a table column label.

13. The method of claim 1, wherein the machine learning model is a named-entity recognition (NER) model.

14. The method of claim 1, wherein the machine learning model utilizes feature vectors comprising natural language words derived from the intermediate text representations.

15. The method of claim 1, wherein the machine learning model is trained on datasets comprising a constrained set of objects associated with one or more prescribed entity types to which the extracted information of interest belongs.

16. The method of claim 1, wherein the extracted information of interest comprises a software product name.

17. The method of claim 1, wherein the document is in a file format that has captured elements of a printed document as an electronic image that a user can view, navigate, print, and send to another user.

18. The method of claim 1, wherein the one or more structured records are stored in a software asset management (SAM) database.

19. A system, comprising: one or more processors configured to: receive a document; analyze the document to discover text and structures of content included in the document; use a result of the analysis to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document; use the intermediate text representations as an input to a machine learning model to extract information of interest in the document; and create one or more structured records of the extracted information of interest; and a memory coupled to at least one of the one or more processors and configured to provide at least one of the one or more processors with instructions.

20. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: receiving a document; analyzing the document to discover text and structures of content included in the document; using a result of the analysis to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document; using the intermediate text representations as an input to a machine learning model to extract information of interest in the document; and creating one or more structured records of the extracted information of interest.
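Claims 9 through 11 can be read together as a small preprocessing step: check whether a table looks relevant using a word list, then turn separator-delimited values into natural language sentences by pairing each label with its value. The following is a minimal sketch under that reading, not the patented implementation. The keyword set `LICENSING_WORDS`, the function names, the sentence template, and the sample table are hypothetical.

```python
import csv
import io

# Hypothetical keyword list; the claims only refer to "a specified list of words".
LICENSING_WORDS = {"license", "licence", "product", "quantity", "subscription"}


def is_relevant(header):
    """Return True if any column label contains one of the listed words."""
    return any(word in label.lower() for label in header for word in LICENSING_WORDS)


def rows_to_sentences(delimited_text, delimiter=","):
    """Turn separator-delimited table text into one sentence per data row."""
    reader = csv.reader(io.StringIO(delimited_text), delimiter=delimiter)
    rows = [row for row in reader if row]  # drop blank lines
    if not rows:
        return []
    header, *body = rows
    if not is_relevant(header):
        return []  # skip tables that do not look like licensing content
    sentences = []
    for row in body:
        # Pair each label component with its non-label (value) component.
        parts = [
            f"the {label.strip()} is {value.strip()}"
            for label, value in zip(header, row)
        ]
        sentences.append("In this row, " + ", and ".join(parts) + ".")
    return sentences


table_text = "Product,Quantity\nAcme Office Suite,50\n"
sentences = rows_to_sentences(table_text)
print(sentences)
```

Sentences of this kind could then serve as input to a named-entity recognition model (claim 13); the relevance check here is a plain substring match and would likely need to be stricter in practice.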
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Arrangements for software license management or administration, e.g. for managing licenses at corporate level · CPC title
Data format conversion from or to a database · CPC title
Named entity recognition · CPC title