Generating and applying data extraction templates

US10216838B1 · US · B1

Patent metadata
Publication number: US-10216838-B1
Application number: US-201615394610-A
Country: US
Kind code: B1
Filing date: Dec 29, 2016
Priority date: Aug 27, 2014
Publication date: Feb 26, 2019
Grant date: Feb 26, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, apparatus, and computer-readable media are provided for generating and applying data extraction templates. In various implementations, a corpus of structured communications such as emails may be grouped into clusters based on one or more similarities between the structured communications. A set of structural paths may be identified from structured communications of a particular cluster. One or more structural paths of the set may be classified as transient wherein a count of occurrences of one or more associated segments of text across the particular cluster satisfies a criterion. One or more transient paths may be assigned a semantic data type and/or a confidentiality designation based on various signals. A data extraction template may be generated to extract, from subsequent structured communications, segments of text associated with transient (and in some cases, non-confidential) structural paths.
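The pipeline in the abstract (cluster, identify structural paths, mark paths as transient, build a template, apply it) can be illustrated with a small sketch. This is not the patented implementation. It assumes HTML emails with well-formed markup, uses the set of structural paths as the clustering signature, and treats a path as transient when its most common text appears in at most half of the cluster. The names `PathCollector`, `cluster_by_structure`, `transient_paths`, `extract`, and the `max_share` threshold are all hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collects (structural path, text) pairs from one HTML message."""

    def __init__(self):
        super().__init__()
        self.path = []                   # e.g. ["html[1]", "body[1]", "p[2]"]
        self.child_counts = [Counter()]  # sibling counters, one per open depth
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        self.child_counts[-1][tag] += 1
        self.path.append(f"{tag}[{self.child_counts[-1][tag]}]")
        self.child_counts.append(Counter())

    def handle_endtag(self, tag):
        # Assumption: well-formed markup, every opened tag is closed in order.
        self.path.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/" + "/".join(self.path), text))


def extract_pairs(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


def cluster_by_structure(messages):
    """Group messages that share the same set of structural paths (assumed signal)."""
    clusters = defaultdict(list)
    for html in messages:
        pairs = extract_pairs(html)
        clusters[frozenset(path for path, _ in pairs)].append(pairs)
    return clusters


def transient_paths(cluster, max_share=0.5):
    """Paths whose most common text occurs in at most max_share of the cluster."""
    texts_by_path = defaultdict(Counter)
    for pairs in cluster:
        for path, text in pairs:
            texts_by_path[path][text] += 1
    return {
        path
        for path, counts in texts_by_path.items()
        if counts.most_common(1)[0][1] / len(cluster) <= max_share
    }


def extract(html, templates):
    """Match a new message to a cluster by signature, then apply its template."""
    pairs = extract_pairs(html)
    template = templates.get(frozenset(path for path, _ in pairs), set())
    return {path: text for path, text in pairs if path in template}


emails = [
    "<html><body><p>Your order has shipped</p><p>Order 1001</p></body></html>",
    "<html><body><p>Your order has shipped</p><p>Order 1002</p></body></html>",
    "<html><body><p>Your order has shipped</p><p>Order 1003</p></body></html>",
]
templates = {sig: transient_paths(c) for sig, c in cluster_by_structure(emails).items()}

new_email = "<html><body><p>Your order has shipped</p><p>Order 2000</p></body></html>"
print(extract(new_email, templates))
# {'/html[1]/body[1]/p[2]': 'Order 2000'}
```

The fixed sentence "Your order has shipped" appears in every message, so its path is treated as boilerplate, while the order number differs in each message and its path is kept for extraction.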

First claim

Claim text as published (claims 1 through 19).

What is claimed is:

1. A computer-implemented method for generating and applying data extraction templates to extract transient content from structured communications created automatically using templates, comprising: grouping a corpus of structured communications into a plurality of clusters based on one or more patterns shared among one or more structured communications within the corpus; identifying, from structured communications of a particular cluster, a set of structural paths; classifying a first structural path of the set of structural paths, associated with a first segment of text, as a first transient structural path in response to a determination that a count of occurrences of the first segment of text across the particular cluster satisfies a criterion; classifying the first transient structural path as a first semantic data type based on one or more signals related to the structured communications of the particular cluster; classifying a second structural path of the set of structural paths, associated with a second segment of text, as a second transient structural path in response to a determination that a count of occurrences of the second segment of text across the particular cluster satisfies the same criterion or a different criterion; classifying the second transient structural path as a second semantic data type based at least in part on the first semantic data type; generating a data extraction template to extract, from one or more subsequent structured communications, one or more segments of text associated with the first transient structural path; associating a subsequent structured communication with the particular cluster based on one or more patterns shared between the subsequent structured communication and one or more structured communications of the corpus; and applying the data extraction template associated with the particular cluster to the subsequent structured communication to extract one or more segments of text associated with the first transient structural path.

2. The computer-implemented method of claim 1, wherein the generating further comprises generating the data extraction template to ignore, in one or more subsequent structured communications, one or more segments of text associated with the second transient structural path.

3. The computer-implemented method of claim 2, wherein the second semantic data type is confidential.

4. The computer-implemented method of claim 3, wherein the first semantic data type comprises user-identifying information.

5. The computer-implemented method of claim 3, wherein the first semantic data type comprises a departure date.

6. The computer-implemented method of claim 3, wherein the first semantic data type comprises a user's address or telephone number.

7. The computer-implemented method of claim 3, wherein the first semantic data type comprises a position coordinate.

8. A computer-implemented method for generating and applying data extraction templates to extract transient content from structured communications created automatically using templates, comprising: identifying, from a corpus of structured communications, a set of structural paths; classifying a first structural path of the set of structural paths, associated with a first segment of text, as a transient structural path in response to a determination that a count of occurrences of the first segment of text across the corpus satisfies a criterion; determining a semantic data type of the transient structural path based on one or more patterns detected in the first segment of text; generating a data extraction template to: extract, from one or more structured communications, one or more segments of text associated with the transient structural path where the semantic data type of the transient structural path is non-confidential, or ignore, in one or more subsequent structured communications, one or more segments of text associated with the transient structural path where the semantic data type of the transient structural path is confidential; associating a subsequent structured communication with the data extraction template based on content of the subsequent structured communication; and applying the data extraction template to the subsequent structured communication to extract one or more segments of text associated with the transient structural path.

9. The computer-implemented method of claim 8, wherein the one or more patterns comprise a numeric pattern.

10. The computer-implemented method of claim 9, wherein the numeric pattern corresponds to a date.

11. The computer-implemented method of claim 9, wherein the numeric pattern corresponds to a credit card number.

12. The computer-implemented method of claim 9, wherein the numeric pattern corresponds to a telephone number.

13. The computer-implemented method of claim 9, wherein the numeric pattern corresponds to a social security number.

14. The computer-implemented method of claim 9, wherein the numeric pattern corresponds to an expiration date.

15. The computer-implemented method of claim 8, wherein the semantic data type of the transient structural path is further determined based at least in part on another semantic data type of another transient structural path in the set of structural paths.

16. The computer-implemented method of claim 15, wherein the another semantic data type comprises user-identifying information.

17. A computer-implemented method for generating and applying data extraction templates to extract transient content from structured communications created automatically using templates, comprising: identifying, from a corpus of structured communications, a set of structural paths; classifying a first structural path of the set of structural paths, associated with a first segment of text, as a first transient structural path in response to a determination that a count of occurrences of the first segment of text across the corpus of structured communications satisfies a criterion; classifying the first transient structural path as a first semantic data type based on one or more signals related to the corpus of structured communications; classifying a second structural path of the set of structural paths, associated with a second segment of text, as a second transient structural path in response to a determination that a count of occurrences of the second segment of text across the corpus of structured communications satisfies the same criterion or a different criterion; classifying the second transient structural path as a second semantic data type based at least in part on the first semantic data type; generating a data extraction template to extract, from one or more subsequent structured communications, one or more segments of text associated with the first transient structural path; associating a subsequent structured communication with the data extraction template based on content of the subsequent structured communication; and applying the data extraction template to the subsequent structured communication to extract one or more segments of text associated with the first transient structural path.

18. The computer-implemented method of claim 17, wherein the generating further comprises generating the data extraction template to ignore, in one or more subsequent structured communications, one or more segments of text associated with the second transient structural path.

19. The computer-implemented method of claim 18, wherein the second semantic data type is confidential.
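Claims 8 through 15 tie the semantic data type of a transient path to numeric patterns in its text, let one path's type inform another's, and have the template ignore confidential paths. The following is a minimal sketch of that idea, not the claimed method itself. The regular expressions, the type names, the confidentiality flags, and the rule that a bare MM/YY value is an expiration date when a card number is present are all illustrative assumptions, as are the function names.

```python
import re

# Hypothetical pattern table: (semantic type, regex, treated as confidential?).
PATTERNS = [
    ("credit_card_number", re.compile(r"\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}"), True),
    ("social_security_number", re.compile(r"\d{3}-\d{2}-\d{4}"), True),
    ("telephone_number", re.compile(r"\(\d{3}\) \d{3}-\d{4}"), False),
    ("date", re.compile(r"\d{1,2}/\d{1,2}/\d{4}"), False),
]
SHORT_DATE = re.compile(r"\d{2}/\d{2}")  # ambiguous on its own, e.g. "08/19"


def classify_segment(text):
    """Return (semantic type, confidential) from the numeric pattern alone."""
    for name, pattern, confidential in PATTERNS:
        if pattern.fullmatch(text):
            return name, confidential
    return "unknown", False  # assumption: unknown types are not flagged


def classify_paths(samples):
    """samples maps each transient path to one example text segment."""
    types = {path: classify_segment(text) for path, text in samples.items()}
    has_card = any(name == "credit_card_number" for name, _ in types.values())
    for path, text in samples.items():
        # Context signal: with a card number in the same cluster, a bare
        # MM/YY value is assumed to be that card's expiration date.
        if has_card and SHORT_DATE.fullmatch(text):
            types[path] = ("expiration_date", True)
    return types


def build_template(types):
    """Keep non-confidential paths for extraction; confidential ones are ignored."""
    return {path: name for path, (name, confidential) in types.items() if not confidential}


samples = {
    "/html[1]/body[1]/p[2]": "12/24/2016",
    "/html[1]/body[1]/p[3]": "4111-1111-1111-1111",
    "/html[1]/body[1]/p[4]": "08/19",
    "/html[1]/body[1]/p[5]": "(555) 010-0199",
}
types = classify_paths(samples)
template = build_template(types)
print(template)
# {'/html[1]/body[1]/p[2]': 'date', '/html[1]/body[1]/p[5]': 'telephone_number'}
```

In this sketch the card number and its expiration date never enter the template, so applying the template to later messages would skip those paths entirely.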

Assignees

Google Inc, Google LLC

Classifications

  • Semantic analysis · CPC title

  • Query execution (filtering based on additional data G06F16/335) · CPC title

  • G06F16/35 · Primary

    Clustering; Classification · CPC title

  • Protecting personal data, e.g. for financial or medical purposes · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Google Inc, Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/35. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 26, 2019 (B1). Legal status and post-grant events are not shown on this page.