Maintaining a custodian directory by analyzing documents

US10007894B2 · US · B2

Patent metadata
Publication number: US-10007894-B2
Application number: US-201514805631-A
Country: US
Kind code: B2
Filing date: Jul 22, 2015
Priority date: Jul 22, 2015
Publication date: Jun 26, 2018
Grant date: Jun 26, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer processor may extract identity information from a document. The identity information may include at least one custodian identity attribute. After extracting the identity information, the computer processor may determine that the identity information is associated with a specific custodian. The computer processor may then search for the custodian identity attribute in a custodian directory to determine whether the custodian directory contains an entry for the custodian. If the custodian is not in the custodian directory, the computer processor may create a new entry in the custodian directory for the custodian and store the extracted identity information in the new entry.
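
The flow in the abstract (extract, look up, then create or update) can be illustrated with a minimal sketch. This is not the patented implementation: the `Entry` class, the function names, the dictionary-shaped document, and the regex-based e-mail extraction are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately simple e-mail pattern used only for this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class Entry:
    # Maps an attribute type (e.g. "name", "email") to the set of values seen.
    attributes: dict = field(default_factory=dict)


def extract_identity(document):
    """Pull a toy set of identity attributes from metadata and body."""
    info = {}
    author = document.get("metadata", {}).get("author")
    if author:
        info["name"] = author
    match = EMAIL_RE.search(document.get("body", ""))
    if match:
        info["email"] = match.group(0).lower()
    return info


def find_entry(directory, info):
    """Return the first entry sharing any attribute value, or None."""
    for entry in directory:
        for key, value in info.items():
            if value in entry.attributes.get(key, set()):
                return entry
    return None


def upsert(directory, document):
    """Create a new entry if the custodian is unknown, else update it."""
    info = extract_identity(document)
    if not info:
        return None
    entry = find_entry(directory, info)
    if entry is None:
        entry = Entry()
        directory.append(entry)
    for key, value in info.items():
        entry.attributes.setdefault(key, set()).add(value)
    return entry
```

In this sketch an exact match on any single attribute is treated as identifying the custodian; the claims describe more careful matching, such as fuzzy comparison scores against a threshold.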

First claim

Opening claim text (preview).

What is claimed is:

1. A system for maintaining a custodian directory, the system comprising: a memory; and a processor communicatively coupled to the memory, where the processor is configured to perform a method comprising: extracting identity information from a document, the identity information including a custodian identity attribute; and determining that the identity information is associated with a first custodian; searching for the custodian identity attribute in the custodian directory; creating, in response to determining that the first custodian is not in the custodian directory, a new entry for the first custodian in the custodian directory, the new entry including the identity information; updating, in response to determining that the first custodian is in the custodian directory, an entry for the first custodian in the custodian directory using the extracted identity information; and carrying out a cleanup of the custodian directory by: identifying two or more entries in the custodian directory that have at least one matching custodian identity attribute; determining a weighting factor for each field in the custodian directory, wherein the weighting factor for each respective field is based on a likelihood that the custodian identity attribute for the respective field is unique to a single custodian; generating a relationship score for the two or more entries by comparing the identity information in the two or more entries and using the weighting factors, the relationship score being a numeric score that indicates a level of similarity between the two or more entries; determining that the relationship score exceeds a confidence threshold; determining, based on the relationship score exceeding the confidence threshold, that all of the two or more entries in the custodian directory relate to a particular custodian; and merging, in response to determining that all of the two or more entries relate to the particular custodian, the two or more entries in the custodian directory.

2. The system of claim 1, wherein the identity information is a name, and wherein the identifying two or more entries that relate to a particular custodian comprises: identifying a first name in a first entry in the custodian directory; identifying a second name in a second entry in the custodian directory; determining that the first name is an alternative name for the second name.

3. The system of claim 1, wherein the method performed by the processor further comprises: identifying a first entry in the custodian directory; determining, using information in the custodian directory, that the first entry corresponds to a customer; and transmitting, in response to determining that the first entry corresponds to the customer, the first entry to a customer relationship management (CRM) system.

4. The system of claim 1, wherein extracting the identity information includes extracting information from a body of the document using natural language processing and extracting information from metadata of the document, wherein the identity information includes a second custodian identity attribute extracted from the metadata and a third custodian identity attribute extracted from the body of the document, and wherein the method performed by the processor further comprises: determining that the second custodian identity attribute is associated with a second custodian; determining, based on a field of the metadata where the second custodian identity attribute was extracted from and a location in the body of the document that the third custodian identity attribute was extracted from, that the second custodian identity attribute and the third custodian identity attribute are associated with the same custodian; searching for the second custodian identity attribute in the custodian directory; determining, based on the searching for the second custodian identity attribute, that an existing entry exists for the second custodian in the custodian directory; determining a type of custodian identity attribute for the third custodian identity attribute; comparing the third custodian identity attribute to a corresponding field in the existing entry using the type of custodian identity attribute; determining, based on comparing the third custodian identity attribute to the corresponding field, that the third custodian identity attribute does not match a value stored in the corresponding field; and updating, in response to determining that the third custodian identity attribute does not match the value stored in the corresponding field, the existing entry for the second custodian by storing the third custodian identity attribute in the custodian directory.

5. The system of claim 1, wherein the identity information includes a plurality of custodian identity attributes, and wherein determining whether the first custodian is in the custodian directory comprises: comparing each custodian identity attribute of the plurality of custodian identity attributes to fields in the custodian directory; determining that at least one custodian identity attribute of the plurality of custodian identity attributes matches a first value in a first entry in the custodian directory; comparing each custodian identity attribute of the plurality of custodian identity attributes to corresponding fields in the first entry; generating a comparison score for the first entry using fuzzy logic matching; and comparing the comparison score to a threshold, the threshold being a minimum score that a potential match has to obtain to be considered a match, the threshold being automatically determined by the processor based on historical data relating to custodian directory matches.

6. The system of claim 1, wherein extracting the identity information from the document includes: extracting a plurality of custodian identity attributes, wherein a first custodian identity attribute is extracted from metadata of the document and a second custodian identity attribute is extracted from a body of the document; grouping the plurality of custodian identity attributes according to a location in the document from which each custodian identity attribute was extracted, wherein grouping the plurality of custodian identity attributes includes grouping the first and second custodian identity attributes together.

7. A computer program product for maintaining a custodian directory, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instruction executable by a processor to cause the processor to perform a method comprising: extracting identity information from a document, the identity information including a custodian identity attribute; determining that the identity information is associated with a first custodian; searching for the custodian identity attribute in the custodian directory; creating, in response to determining that the first custodian is not in the custodian directory, a new entry for the first custodian in the custodian directory, the new entry including the identity information: updating, in response to determining that the first custodian is in the custodian directory, an entry for the first custodian in the custodian directory using the extracted identity information; and carrying out a cleanup of the custodian directory by: identifying two or more entries in the custodian directory that have at least one matching custodian identity attribute; determining a weighting factor for each field in the custodian directory, wherein the weighting factor for each respective field is based on a likelihood that the custodian identity attribute for the
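
The cleanup step recited in claim 1 (per-field weighting factors, a numeric relationship score, a confidence threshold, and a merge) can be sketched as follows. The weights, the threshold value, the dictionary-of-sets entry layout, and all function names are hypothetical choices for illustration; the claim does not specify them.

```python
# Hypothetical weights: fields more likely to be unique to one custodian
# (such as an e-mail address) count for more than common ones (such as a name).
FIELD_WEIGHTS = {"email": 0.6, "phone": 0.3, "name": 0.1}
CONFIDENCE_THRESHOLD = 0.5  # hypothetical value


def relationship_score(a, b, weights=FIELD_WEIGHTS):
    """Sum the weights of the fields in which two entries share a value."""
    score = 0.0
    for field_name, weight in weights.items():
        if a.get(field_name, set()) & b.get(field_name, set()):
            score += weight
    return score


def merge(a, b):
    """Combine two entries by taking the union of values per field."""
    return {k: a.get(k, set()) | b.get(k, set()) for k in a.keys() | b.keys()}


def cleanup(directory, threshold=CONFIDENCE_THRESHOLD):
    """Repeatedly merge pairs whose relationship score exceeds the threshold."""
    entries = [dict(e) for e in directory]
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(entries)):
            for j in range(i + 1, len(entries)):
                if relationship_score(entries[i], entries[j]) > threshold:
                    entries[i] = merge(entries[i], entries[j])
                    del entries[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return entries
```

With these assumed weights, two entries that share only a name score 0.1 and stay separate, while two entries that share an e-mail address score at least 0.6 and are merged. The pairwise rescan after each merge is quadratic and is chosen here for clarity, not efficiency.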

Assignees

Inventors

Classifications

  • Computer-aided management of electronic mailing [e-mailing] · CPC title

  • File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06) · CPC title

  • Parsing · CPC title

  • Document management systems · CPC title

  • G06Q10/105 · Primary

    Human resources · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10007894B2 cover?
A computer processor may extract identity information from a document. The identity information may include at least one custodian identity attribute. After extracting the identity information, the computer processor may determine that the identity information is associated with a specific custodian. The computer processor may then search for the custodian identity attribute in a custodian dire…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06Q10/105. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 26, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).