Generating and applying event data extraction templates
US-9652530-B1 · May 16, 2017 · US
US9785705B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9785705-B1 |
| Application number | US-201414516122-A |
| Country | US |
| Kind code | B1 |
| Filing date | Oct 16, 2014 |
| Priority date | Oct 16, 2014 |
| Publication date | Oct 10, 2017 |
| Grant date | Oct 10, 2017 |
Official abstract text for this publication.
Methods, apparatus, systems, and computer-readable media are provided for generating and applying data extraction templates. In various implementations, a corpus of plain text communications such as emails may be grouped into clusters based on one or more similarities between the plain text communications. One or more segments of communications of a particular cluster may be classified as transient based on textual pattern matching. One or more other segments of the communications of the particular cluster may be classified as transient based on various criteria. One or more transient segments may be assigned a generic and/or specific semantic data type and/or a confidentiality designation based on various signals. A data extraction template may be generated to extract, from subsequent plain text communications, content associated with transient (and in some cases, non-confidential) segments.
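The pipeline in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the message dictionaries, the `PATTERNS` table, the digit-masking similarity rule in `cluster_key`, and the `min_fraction` threshold are all hypothetical choices made for illustration. The occurrence-count rule for fixed segments follows the wording of claim 1, with each line of a message treated as one segment.

```python
import re
from collections import Counter, defaultdict

# Hypothetical textual patterns mapped to generic semantic data types (assumptions).
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def cluster_key(msg):
    """Shared attributes: sender plus subject with digit runs masked (assumed rule)."""
    return (msg["sender"], re.sub(r"\d+", "#", msg["subject"]))


def group_into_clusters(corpus):
    """Group messages whose shared attributes are identical."""
    clusters = defaultdict(list)
    for msg in corpus:
        clusters[cluster_key(msg)].append(msg)
    return clusters


def generic_type(line):
    """Return the first matching generic semantic type, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def classify_segments(cluster, min_fraction=0.8):
    """Label each line 'fixed' or 'transient' for every message in a cluster.

    A line is transient if a textual pattern matches it, or if it appears in
    fewer than min_fraction of the cluster's messages; otherwise it is fixed.
    """
    counts = Counter()
    for msg in cluster:
        counts.update(set(msg["body"].splitlines()))
    threshold = min_fraction * len(cluster)
    labeled = []
    for msg in cluster:
        rows = []
        for line in msg["body"].splitlines():
            is_transient = generic_type(line) is not None or counts[line] < threshold
            rows.append(("transient" if is_transient else "fixed", line))
        labeled.append(rows)
    return labeled
```

In this sketch, the lines labeled `transient` are the ones a data extraction template would target, and `generic_type` stands in for the assignment of a generic semantic data type.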
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for generating and applying data extraction templates to extract transient content from plain text communications created automatically using templates, comprising: grouping a corpus of plain text communications into a plurality of clusters based on one or more shared attributes; classifying one or more plain text segments of each plain text communication of a particular cluster as fixed in response to a determination that a count of occurrences of the one or more plain text segments across the particular cluster satisfies a criterion; classifying one or more remaining plain text segments of each plain text communication of the particular cluster as transient; generating a tree to represent sequences of classified plain text segments associated with each plain text communication of the particular cluster, wherein the tree includes at least a first branch to represent a first sequence of classified plain text segments corresponding to a first plain text communication of the particular cluster and a second branch to represent at least part of a second sequence of classified plain text segments corresponding to a second plain text communication of the particular cluster, wherein the second sequence of classified plain text segments is different than the first sequence of classified plain text segments; generating, based on the tree, a data extraction template to extract, from one or more subsequent plain text communications, content associated with transient segments; extracting content associated with at least one transient segment from a given subsequent plain text communication addressed to a user by applying the data extraction template to the given subsequent plain text communication; and rating the extracting performed on the given subsequent plain text communication based on how closely a sequence of classified plain text segments generated for the given subsequent plain text communication traverses a branch of the tree.
2. The computer-implemented method of claim 1, further comprising identifying, in each plain text communication of the particular cluster based on one or more textual patterns, one or more transient plain text segments.
3. The computer-implemented method of claim 2, further comprising assigning generic semantic data types to one or more identified transient plain text segments in each plain text communication of the particular cluster based on the one or more textual patterns.
4. The computer-implemented method of claim 3, further comprising assigning specific semantic data types to one or more transient plain text segments in each plain text communication of the particular cluster based on a context of the plain text communications of the particular cluster or one or more heuristics.
5. The computer-implemented method of claim 1, wherein the one or more shared attributes comprise a subject or data indicative of a sending entity.
6. The computer-implemented method of claim 5, wherein the grouping comprises associating a plurality of different sender identifiers with a single sending entity based on one or more textual patterns shared among the plurality of different sender identifiers.
7. The computer-implemented method of claim 1, further comprising configuring the data extraction template so that content associated with plain text segments classified as fixed are ignored in one or more plain text communications.
8. The computer-implemented method of claim 1, further comprising classifying a particular plain text segment of each plain text communication of the particular cluster as fixed in response to a determination that the particular segment contains one or more patterns of plain text characters that are used to provide visual structure to each plain text communication.
9. The computer-implemented method of claim 1, wherein generating the data extraction template comprises generating the data extraction template to ignore, in one or more plain text communications, content associated with a particular transient plain text segment in response to a determination, based on one or more signals related to plain text communications of the particular cluster, that a semantic data type of the particular transient plain text segment is confidential.
10. A system including memory and one or more processors operable to execute instructions stored in the memory, comprising instructions to: group a corpus of plain text communications into a plurality of clusters based on one or more shared attributes; classify one or more plain text segments of each plain text communication of a particular cluster as fixed in response to a determination that a count of occurrences of the one or more plain text segments across the particular cluster satisfies a criterion; classify one or more remaining plain text segments of each plain text communication of the particular cluster as transient; generate a tree to represent sequences of classified plain text segments associated with each plain text communication of the particular cluster, wherein the tree includes at least a first branch to represent a first sequence of classified plain text segments corresponding to a first plain text communication of the particular cluster and a second branch to represent at least part of a second sequence of classified plain text segments corresponding to a second plain text communication of the particular cluster, wherein the second sequence of classified plain text segments is different than the first sequence of classified plain text segments; generate, based on the tree, a data extraction template to extract, from one or more subsequent plain text communications, content associated with transient segments; extract content associated with at least one transient segment from a given subsequent plain text communication addressed to a user by applying the data extraction template to the given subsequent plain text communication; and rate the extraction performed on the given subsequent plain text communication based on how closely a sequence of classified plain text segments generated for the given subsequent plain text communication traverses a branch of the tree.
11. The system of claim 10, further comprising instructions to identify, in each plain text communication of the particular cluster based on one or more textual patterns, one or more transient plain text segments.
12. The system of claim 11, further comprising instructions to assign generic semantic data types to one or more identified transient plain text segments in each plain text communication of the particular cluster based on the one or more textual patterns.
13. The system of claim 12, further comprising instructions to assign specific semantic data types to one or more transient plain text segments in each plain text communication of the particular cluster based on a context of the plain text communications of the particular cluster or one or more signals.
14. The system of claim 10, wherein the one or more shared attributes comprise a subject or data indicative of a sending entity.
15. The system of claim 14, further comprising instructions to associate a plurality of different sender identifiers with a single sending entity based on one or more textual patterns shared among the plurality of different sender identifiers.
16. At least one non-transitory computer-readable medium comprising instructions that, when execution by a computing system, cause the computing system to perform the following operations: grouping a corpus of plain text communications into a plurality of clusters bas
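
The tree and rating steps recited in claims 1 and 10 can be sketched as a prefix tree over sequences of classified segments. This is one possible reading, not the claimed implementation: the `Node` class, the placeholder token for transient segments, and the prefix-fraction score in `rate` are assumptions made for illustration, since the claim text does not define how "how closely" is measured.

```python
class Node:
    """One node of the tree; children are keyed by a classified-segment token."""

    def __init__(self):
        self.children = {}


def to_tokens(labeled_segments):
    """Fixed segments keep their text; transient ones collapse to a placeholder."""
    return [
        ("fixed", text) if kind == "fixed" else ("transient", None)
        for kind, text in labeled_segments
    ]


def build_tree(sequences):
    """Insert every communication's token sequence; shared prefixes share nodes."""
    root = Node()
    for seq in sequences:
        node = root
        for token in to_tokens(seq):
            node = node.children.setdefault(token, Node())
    return root


def rate(root, labeled_segments):
    """Assumed metric: fraction of the sequence that follows a branch from the root."""
    tokens = to_tokens(labeled_segments)
    if not tokens:
        return 0.0
    node, matched = root, 0
    for token in tokens:
        if token not in node.children:
            break
        node = node.children[token]
        matched += 1
    return matched / len(tokens)


def extract(labeled_segments):
    """Template application in miniature: keep transient content, skip fixed text."""
    return [text for kind, text in labeled_segments if kind == "transient"]
```

Two communications whose sequences differ, for example one with an extra transient line, produce two branches that split where the sequences diverge. A later communication that follows either branch end to end scores 1.0 under this assumed metric.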
Clustering; Classification · CPC title
Physics · mapped topic