Systems and methods for data extraction from electronic documents using data patterns

US11625419B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11625419-B2
Application number  US-202017064150-A
Country             US
Kind code           B2
Filing date         Oct 6, 2020
Priority date       Oct 6, 2020
Publication date    Apr 11, 2023
Grant date          Apr 11, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for extracting data from electronic documents based on data patterns. The method includes receiving electronic template documents. Each template document corresponds to a type of electronic document. The method further includes, for each template document, processing the template document using a text extraction and data processing application. The method also includes, for each template document, determining a data extraction formula corresponding to the type of electronic document. The method further includes, storing the data extraction formula in a first database. The method also includes, receiving an electronic document including user data and a Unicode corresponding to the type of document. The method also includes, processing and classifying the electronic document into the type of document corresponding to the Unicode. The method also includes identifying data elements in the electronic document based on the data extraction formula and extracting data values for each of the identified data elements.
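
The flow described in the abstract (look up a stored extraction formula by document type, identify data elements, extract their values) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the publication text shown here does not say what form a "data extraction formula" takes, so the sketch models it as a set of regular expressions. The names `FORMULAS`, `ExtractedValue`, `extract`, the type code `"STMT-A"`, and the sample text are all hypothetical. The code the abstract calls a "Unicode" is treated as an opaque string key.

```python
import re
from dataclasses import dataclass


@dataclass
class ExtractedValue:
    element: str  # name of the data element
    value: str    # extracted data value
    page: int     # location of the element (1-based page number)


# Hypothetical stand-in for the "first database": one formula per document type.
# Assumption: a formula is a mapping from element name to a regex with one group.
FORMULAS = {
    "STMT-A": {
        "account_number": re.compile(r"Account Number:\s*(\S+)"),
        "balance": re.compile(r"Ending Balance:\s*\$?([\d,]+\.\d{2})"),
    },
}


def extract(doc_type_code, pages):
    """Classify by the type code, then apply that type's formula page by page."""
    formula = FORMULAS[doc_type_code]
    results = []
    for page_number, text in enumerate(pages, start=1):
        for element, pattern in formula.items():
            match = pattern.search(text)
            if match:
                results.append(ExtractedValue(element, match.group(1), page_number))
    return results


# Hypothetical two-page document received together with its type code.
pages = ["Account Number: 12345\nName: J. Doe", "Ending Balance: $1,234.56"]
rows = extract("STMT-A", pages)
```

Here `rows` plays the role of the output records (value plus location) that the first claim stores in a second database; a real system would persist them rather than keep them in a list.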

First claim

Opening claim text (preview).

What is claimed:

1. A computerized method for extracting data from electronic documents based on a plurality of data patterns, the method comprising: receiving, by a server computing device, a plurality of electronic template documents, wherein each electronic template document corresponds to a type of electronic document; for each of the plurality of electronic template documents, processing, by the server computing device, the electronic template document using a text extraction and data processing application; for each of the plurality of electronic template documents, determining, by the server computing device, a data extraction formula corresponding to the type of electronic document; storing, by the server computing device, the data extraction formula for each of the plurality of electronic template documents in a first database; receiving, by the server computing device, an electronic document comprising user data and a Unicode corresponding to the type of electronic document; processing, by the server computing device, the electronic document using the text extraction and data processing application; classifying, by the server computing device, the electronic document into the type of electronic document corresponding to the Unicode; identifying, by the server computing device, data elements in the electronic document based on the data extraction formula corresponding to the type of electronic document; extracting, by the server computing device, data values for each of the identified data elements in the electronic document; and generating, by the server computing device, a second database comprising the data values for each of the identified data elements in the electronic document and locations of the identified data elements.

2. The computerized method of claim 1, wherein processing the electronic template document comprises: identifying, by the server computing device, a header and a footer based on a similarity score; and removing, by the server computing device, the header and footer from the electronic template document.

3. The computerized method of claim 1, wherein the data extraction formula corresponds to locations of data elements in the electronic template document.

4. The computerized method of claim 1, wherein processing the electronic document comprises: identifying, by the server computing device, a header and a footer based on a similarity score; and removing, by the server computing device, the header and footer from the electronic document.

5. The computerized method of claim 1, wherein classifying the electronic document into the type of electronic document is further based on an organization corresponding to the type of electronic document.

6. The computerized method of claim 1, wherein identifying the data elements in the electronic document further comprises: calculating, by the server computing device, a cosine similarity score based on the electronic document and the electronic template document corresponding to the document type; and benchmarking, by the server computing device, the cosine similarity scores.

7. The computerized method of claim 1, wherein the locations of the identified data elements correspond to a page number of the electronic document.

8. The computerized method of claim 1, wherein the server computing device is further configured to receive the plurality of electronic template documents from a plurality of data sources.

9. A system for extracting data from electronic documents based on a plurality of data patterns, the system comprising: a server computing device communicatively coupled to a first database and a second database over a network, the server computing device configured to: receive a plurality of electronic template documents, wherein each electronic template document corresponds to a type of electronic document; for each of the plurality of electronic template documents, process the electronic template document using a text extraction and data processing application; for each of the plurality of electronic template documents, determine a data extraction formula corresponding to the type of electronic document; store the data extraction formula for each of the plurality of electronic template documents in the first database; receive an electronic document comprising user data and a Unicode corresponding to the type of electronic document; process the electronic document using the text extraction and data processing application; classify the electronic document into the type of electronic document corresponding to the Unicode; identify data elements in the electronic document based on the data extraction formula corresponding to the type of electronic document; extract data values for each of the identified data elements in the electronic document; and generate the second database comprising the data values for each of the identified data elements in the electronic document and locations of the identified data elements.

10. The system of claim 9, wherein the server computing device is further configured to process the electronic template document by: identifying a header and a footer based on a similarity score; and removing the header and footer from the electronic template document.

11. The system of claim 9, wherein the data extraction formula corresponds to locations of data elements in the electronic template document.

12. The system of claim 9, wherein the server computing device is further configured to process the electronic document by: identifying a header and a footer based on a similarity score; and removing the header and footer from the electronic document.

13. The system of claim 9, wherein classifying the electronic document into the type of electronic document is further based on an organization corresponding to the type of electronic document.

14. The system of claim 9, wherein the server computing device is further configured to identify the data elements in the electronic document by: calculating a cosine similarity score based on the electronic document and the electronic template document corresponding to the document type; and benchmarking the cosine similarity scores.

15. The system of claim 9, wherein the locations of the identified data elements correspond to a page number of the electronic document.

16. The system of claim 9, wherein the server computing device is further configured to receive the plurality of electronic template documents from a plurality of data sources.

17. A computerized method for extracting data from electronic documents based on a plurality of data patterns, the method comprising: receiving, by the server computing device, an electronic document comprising user data and a Unicode corresponding to a type of electronic document; identifying, by the server computing device, a header and a footer of the electronic document based on a similarity score; removing, by the server computing device, the identified header and footer from the electronic document based on the similarity score; classifying, by the server computing device, the electronic document into the type of electronic document corresponding to the Unicode; identifying, by the server computing device, data elements in the electronic document based on a data extraction formula corresponding to the type of electronic document; extracting, by the server computing device, data values for each of the identified data elements in the electronic document; and generate, by the server computing device, a database comprising the data values for
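
The claims use similarity scores in two places: to identify repeated headers and footers, and (as a cosine similarity) to compare a document with its template. The sketch below illustrates both, with several assumptions that are not in the claim text: the score is a cosine similarity over bag-of-words counts of alphabetic tokens, a header or footer is taken to be the first or last line of each page, and the threshold of 0.8 is arbitrary. The function names and sample pages are hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two texts using counts of lowercase alphabetic tokens."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def strip_header_footer(pages, threshold=0.8):
    """Drop the first/last line of every page when it repeats across all pages.

    Assumption: a header or footer is a single line, and "repeats" means its
    similarity to the same line on the first page is at least `threshold`.
    """
    split = [page.splitlines() for page in pages]
    if len(split) < 2 or any(len(lines) < 3 for lines in split):
        return list(pages)  # too little text to compare safely

    def repeated(index):
        reference = split[0][index]
        return all(cosine_similarity(reference, lines[index]) >= threshold
                   for lines in split[1:])

    start = 1 if repeated(0) else 0
    end = -1 if repeated(-1) else None
    return ["\n".join(lines[start:end]) for lines in split]


# Hypothetical two-page document with a shared header and a page-number footer.
pages = [
    "Acme Bank Statement\nAccount Number: 12345\nPage 1 of 2",
    "Acme Bank Statement\nEnding Balance: $10.00\nPage 2 of 2",
]
cleaned = strip_header_footer(pages)

# Document-versus-template score; how it is "benchmarked" is not specified here.
score = cosine_similarity("Account Number: Ending Balance:", "\n".join(cleaned))
```

Because digits are ignored when tokenizing, footers such as "Page 1 of 2" and "Page 2 of 2" score as identical, which is why the page-number line is removed along with the header. A production system would likely need a more careful definition of header and footer regions than single lines.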

Assignees

Fmr Llc

Inventors

Classifications

  • G06F40/186 (Primary): Templates

  • Document management systems

  • Creation or modification of classes or clusters

  • Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

  • G06F16/285 (Primary): Clustering or classification


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11625419B2 cover?
Systems and methods for extracting data from electronic documents based on data patterns. The method includes receiving electronic template documents. Each template document corresponds to a type of electronic document. The method further includes, for each template document, processing the template document using a text extraction and data processing application. The method also includes, for …
Who is the assignee on this patent?
Fmr Llc
What technology area does this patent fall under?
Primary CPC classification G06F40/186. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 11, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).