Determination of intermediate representations of discovered document structures

US11880435B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11880435-B2
Application number  US-202117167316-A
Country             US
Kind code           B2
Filing date         Feb 4, 2021
Priority date       Feb 12, 2020
Publication date    Jan 23, 2024
Grant date          Jan 23, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A document is received. The document is analyzed to discover text and structures of content included in the document. A result of the analysis is used to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document. The intermediate text representations are used as an input to a machine learning model to extract information of interest in the document. One or more structured records of the extracted information of interest are created.
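
The flow described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Segment` type, the `to_intermediate`, `extract`, and `process` helpers, and the "label: value" encoding. In particular, `extract` is a trivial rule standing in for the machine learning model, not a trained model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Segment:
    """Hypothetical content segment discovered during document analysis."""
    text: str                           # raw text of the segment
    column_label: Optional[str] = None  # discovered structure, if any


def to_intermediate(segment: Segment) -> str:
    """Build an intermediate text representation.

    When structure was discovered, text encoding that structure (here a
    table column label) is added in front of the segment's own text.
    """
    if segment.column_label:
        return f"{segment.column_label}: {segment.text}"
    return segment.text


def extract(intermediate: str) -> Optional[Dict[str, str]]:
    """Stand-in for the machine learning model (assumption: a simple rule).

    A real system would pass the intermediate text to a trained model.
    """
    label, sep, value = intermediate.partition(": ")
    if sep and label.lower() == "product name":
        return {"entity_type": "software_product_name", "value": value}
    return None


def process(segments: List[Segment]) -> List[Dict[str, str]]:
    """Create structured records from the extracted information."""
    records = []
    for segment in segments:
        record = extract(to_intermediate(segment))
        if record is not None:
            records.append(record)
    return records


segments = [
    Segment("Acme Office Suite", column_label="Product Name"),
    Segment("This agreement is made between the parties."),
]
records = process(segments)
```

The point of the sketch is the middle step: the extractor never sees the table layout itself, only text in which the layout has been written out as words.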

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: receiving a document; analyzing the document to discover text and structures of content included in the document; using a result of the analysis to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document; using the intermediate text representations as an input to a machine learning model to extract information of interest in the document; and creating one or more structured records of the extracted information of interest.

2. The method of claim 1, wherein the document is a legal document.

3. The method of claim 2, wherein the legal document is a contract for transfer of software rights.

4. The method of claim 1, wherein analyzing the document includes determining a document type and performing processing specific to the document type associated with preparing the document for discovery of text and table structures.

5. The method of claim 4, wherein the processing associated with preparing the document for discovery of text and table structures includes converting text in images to a format that is readable and searchable by a computer.

6. The method of claim 1, wherein analyzing the document includes utilizing an additional machine learning model to determine table boundary coordinates within the document.

7. The method of claim 6, wherein the additional machine learning model is a fast region-based convolutional neural network (Fast R-CNN).

8. The method of claim 1, wherein analyzing the document includes determining whether a discovered table includes relevant content.

9. The method of claim 8, wherein determining whether the discovered table includes relevant content includes detecting text associated with a specified list of words pertaining to software licensing.

10. The method of claim 1, wherein the intermediate text representations are converted from prior text representations that include content values that are separated by separator characters.

11. The method of claim 1, wherein the intermediate text representations are generated at least in part by combining label components and non-label components, extracted from prior text representations, into one or more natural language sentences.

12. The method of claim 1, wherein the added text encoding the discovered structure of the corresponding content segment within the structural layout of the document comprises a table column label.

13. The method of claim 1, wherein the machine learning model is a named-entity recognition (NER) model.

14. The method of claim 1, wherein the machine learning model utilizes feature vectors comprising natural language words derived from the intermediate text representations.

15. The method of claim 1, wherein the machine learning model is trained on datasets comprising a constrained set of objects associated with one or more prescribed entity types to which the extracted information of interest belongs.

16. The method of claim 1, wherein the extracted information of interest comprises a software product name.

17. The method of claim 1, wherein the document is in a file format that has captured elements of a printed document as an electronic image that a user can view, navigate, print, and send to another user.

18. The method of claim 1, wherein the one or more structured records are stored in a software asset management (SAM) database.

19. A system, comprising: one or more processors configured to: receive a document; analyze the document to discover text and structures of content included in the document; use a result of the analysis to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document; use the intermediate text representations as an input to a machine learning model to extract information of interest in the document; and create one or more structured records of the extracted information of interest; and a memory coupled to at least one of the one or more processors and configured to provide at least one of the one or more processors with instructions.

20. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: receiving a document; analyzing the document to discover text and structures of content included in the document; using a result of the analysis to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding content segment within a structural layout of the document; using the intermediate text representations as an input to a machine learning model to extract information of interest in the document; and creating one or more structured records of the extracted information of interest.
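
Claims 9 through 12 describe two concrete text operations: checking a table against a word list, and turning separator-delimited values plus column labels into natural language sentences. The Python sketch below shows one way these could look. The function names, the `|` separator, the keyword list, and the "label is value" sentence template are all illustrative assumptions; the claims do not specify any of them.

```python
import re
from typing import Iterable, List, Set

# Hypothetical word list pertaining to software licensing (assumption).
LICENSING_WORDS: Set[str] = {"license", "licenses", "product", "quantity", "subscription"}


def is_relevant(table_text: str, keywords: Set[str] = LICENSING_WORDS) -> bool:
    """Return True if any word in the table text appears in the keyword list."""
    words = set(re.findall(r"[a-z]+", table_text.lower()))
    return bool(words & keywords)


def rows_to_sentences(header_line: str, row_lines: Iterable[str], sep: str = "|") -> List[str]:
    """Convert separator-delimited rows into natural language sentences.

    Label components (column labels from the header) are combined with
    non-label components (cell values) so each sentence carries the
    table structure as plain words.
    """
    labels = [cell.strip() for cell in header_line.split(sep)]
    sentences = []
    for line in row_lines:
        values = [cell.strip() for cell in line.split(sep)]
        # Pair each label with its value; skip empty cells.
        parts = [f"{label} is {value}" for label, value in zip(labels, values) if value]
        if parts:
            sentences.append(", ".join(parts) + ".")
    return sentences


header = "Product|Quantity|Term"
rows = ["Acme Office Suite|25|12 months"]

if is_relevant(header):
    sentences = rows_to_sentences(header, rows)
else:
    sentences = []
```

Writing each cell next to its column label gives a sequence model such as an NER tagger the context it would otherwise lose when a table is flattened to a bare list of values.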

Assignees

Servicenow Inc

Inventors

Classifications

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G06F21/105 · Primary

    Arrangements for software license management or administration, e.g. for managing licenses at corporate level · CPC title

  • Data format conversion from or to a database · CPC title

  • Named entity recognition · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11880435B2 cover?
A document is received. The document is analyzed to discover text and structures of content included in the document. A result of the analysis is used to determine intermediate text representations of segments of the content included in the document, wherein at least one of the intermediate text representations includes an added text encoding the discovered structure of the corresponding conten…
Who is the assignee on this patent?
Servicenow Inc
What technology area does this patent fall under?
Primary CPC classification G06F21/105. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 23, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).