Systems and methods for determining document section types

US11494418B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11494418-B2
Application number: US-202117160712-A
Country: US
Kind code: B2
Filing date: Jan 28, 2021
Priority date: Jan 28, 2021
Publication date: Nov 8, 2022
Grant date: Nov 8, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for discovering and/or determining section types for a given document class in a data-driven manner are provided. A modified Bayesian model merging algorithm can be used, along with extending an Analogical Story Merging (ASM) algorithm. The systems and methods can learn the section structure of documents without a pre-existing ontology of sections or time-intensive annotation efforts.

First claim

Opening claim text (preview).

What is claimed is:

1. A system for determining section types of a given document class, the system comprising: a processor; a memory in operable communication with the processor; and a machine-readable medium in operable communication with the processor and the memory, the machine-readable medium having instructions stored thereon that, when executed by the processor, perform the following steps:
    receiving a corpus of documents of the given document class;
    using a modified Bayesian model merging algorithm on the corpus to determine the section types of the given document class; and
    storing the determined section types on the memory to be used for labeling a document of the given document class,
  the using of the modified Bayesian model merging algorithm on the corpus comprising:
    creating an initial Hidden Markov Model (HMM)-like model, where each document of the corpus is represented as a linear chain of states, with each state of the linear chain of states corresponding to a section of unknown type in a same order as found in the respective document of the corpus;
    performing a merge operation on the initial HMM-like model to merge states and generate an updated model;
    defining a prior probability distribution over the updated model;
    computing a posterior probability distribution based on the prior probability distribution; and
    searching a merge space of the updated model based on the posterior probability distribution to determine the section types of the given document class.

2. The system according to claim 1, the using of the modified Bayesian model merging algorithm on the corpus comprising extending an analogical story merging (ASM) approach with a Bayesian model merging algorithm.

3. The system according to claim 1, the searching of the merge space of the updated model comprising maximizing the posterior probability distribution to give a generalizable model that fits the corpus.

4. The system according to claim 1, the computing of the posterior probability distribution comprising computing $P(M)P(D|M)$, which is proportional to $P(M|D)$, where $P(M)$ is the prior probability distribution, $P(M|D)$ is the posterior probability distribution, $M$ represents the updated model, and $D$ represents a document of the corpus.

5. The system according to claim 1, the defining of the prior probability distribution comprising using the following equations

$$P(M) = N(\mu, \sigma^2) \prod_i G(S_i), \qquad G(S_i) = \begin{cases} 1 & \forall s_j, s_k \in S_i,\ \mathrm{Sim}(s_j, s_k) > T \\ 0 & \text{otherwise,} \end{cases}$$

where $P(M)$ is the prior probability distribution, $M$ represents the updated model, $N(\mu, \sigma^2)$ is a normal distribution of the updated model, $S_i$ is the $i$th state in the updated model, $s_j$ and $s_k$ are section contents that have been merged into state $S_i$, $\mathrm{Sim}$ is a similarity function that takes content of $s_j$ and $s_k$ and computes a cosine similarity of vector representations of $s_j$ and $s_k$, and $T$ is a similarity threshold.

6. The system according to claim 5, $T$ being set as 1.5 standard deviations from a mean similarity of the similarity function.

7. The system according to claim 5, where, if headers of all sections in the updated model are exactly the same, $G(S_i)$ is set to 1.

8. The system according to claim 1, the corpus of documents comprising at least 100 documents.

9. The system according to claim 1, the given document class being a psychiatric evaluation, a discharge summary, a radiology report, or a United States patent document.

10. A method for determining section types of a given document class, the method comprising: receiving, by a processor, a corpus of documents of the given document class; using, by the processor, a modified Bayesian model merging algorithm on the corpus to determine the section types of the given document class; and storing, by the processor, the determined section types on a memory in operable communication with the processor to be used for labeling a document of the given document class, the using of the modified Bayesian model merg
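The model-building, merge, and prior steps recited in claims 1 and 5 can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names, the toy two-dimensional section vectors, the threshold value 0.8, and the choice to evaluate the normal term on the number of states are all assumptions made for illustration. The likelihood P(D|M) and the search over the merge space are omitted.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def initial_model(corpus):
    """Build one state per section; each document becomes a linear chain of states."""
    states, transitions, next_id = {}, [], 0
    for doc in corpus:
        prev = None
        for section_vec in doc:
            states[next_id] = [section_vec]  # sections merged into this state
            if prev is not None:
                transitions.append((prev, next_id))
            prev = next_id
            next_id += 1
    return states, transitions

def merge(states, transitions, a, b):
    """Return a new model in which state b has been merged into state a."""
    new_states = {k: list(v) for k, v in states.items() if k != b}
    new_states[a] = states[a] + states[b]
    new_trans = [(a if s == b else s, a if t == b else t) for s, t in transitions]
    return new_states, new_trans

def gate(sections, threshold):
    """G(S_i): 1 if every pair of merged sections exceeds the threshold, else 0."""
    return 1 if all(cosine(u, v) > threshold
                    for u, v in combinations(sections, 2)) else 0

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prior(states, threshold, mu, sigma):
    """P(M) = N(mu, sigma^2) * prod_i G(S_i).
    Assumption: the normal term is evaluated on the number of states."""
    p = normal_pdf(len(states), mu, sigma)
    for sections in states.values():
        p *= gate(sections, threshold)
    return p

# Toy corpus: two documents, each with two sections (hypothetical vectors).
corpus = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.9, 0.1], [0.1, 0.9]]]
states, transitions = initial_model(corpus)

# Merging the two similar first sections keeps the prior positive.
good_states, good_trans = merge(states, transitions, 0, 2)
good_prior = prior(good_states, threshold=0.8, mu=3, sigma=1.0)

# Merging two dissimilar sections of the same document zeroes the prior.
bad_states, _ = merge(states, transitions, 0, 1)
bad_prior = prior(bad_states, threshold=0.8, mu=3, sigma=1.0)
```

In this sketch the gate acts as a hard constraint: any state that contains a dissimilar pair of sections drives the prior to zero, so a posterior-maximizing search would never select that merge. Claim 6 derives the threshold from the similarity distribution; the fixed value here is only a placeholder.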

Assignees

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Thesauruses; Synonyms · CPC title

  • using statistical methods · CPC title

  • Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces · CPC title

  • Semantic analysis · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11494418B2 cover?
Systems and methods for discovering and/or determining section types for a given document class in a data-driven manner are provided. A modified Bayesian model merging algorithm can be used, along with extending an Analogical Story Merging (ASM) algorithm. The systems and methods can learn the section structure of documents without a pre-existing ontology of sections or time-intensive annotatio…
Who is the assignee on this patent?
Banisakher Deya, Rishe Naphtali, Finlayson Mark, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F7/32. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 8, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).