Systems and methods for identifying anomalous data in large structured data sets and querying the data sets

US9965524B2 · US · B2

Patent metadata
Publication number: US-9965524-B2
Application number: US-201414244146-A
Country: US
Kind code: B2
Filing date: Apr 3, 2014
Priority date: Apr 3, 2013
Publication date: May 8, 2018
Grant date: May 8, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The technology disclosed relates to automatic generation of tuples from a record set for outlier analysis. Applying this new technology, a user need not specify which 1-tuples to combine into n-tuples. The tuples are generated from structured records organized into features (which could also be fields, objects, or attributes). Tuples are generated from combinations of feature values in the records. Thresholding is applied to manage the number of tuples generated. The technology disclosed further relates to indexing and searching high dimensional tuple spaces in a computer-implemented system.
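The tuple generation described in the abstract can be illustrated with a minimal sketch. This is not the patent's implementation: the function name `generate_tuples`, its parameters, and the level-by-level pruning strategy are assumptions chosen for illustration. The sketch counts 1-tuples of feature values, drops those seen fewer than `threshold` times, and then widens the surviving tuples one feature at a time, so that no one has to specify in advance which features to combine.

```python
from collections import Counter
from itertools import combinations


def generate_tuples(records, features, threshold=2, max_width=3):
    """Count feature-value combinations seen at least `threshold` times.

    Hypothetical helper: `records` is a list of dicts that all contain
    every name in `features`. Returns {tuple of (feature, value) pairs: count}.
    """
    # Width 1: count every individual (feature, value) pair.
    counts = Counter()
    for rec in records:
        for f in features:
            counts[((f, rec[f]),)] += 1
    level = {t: c for t, c in counts.items() if c >= threshold}
    kept = dict(level)

    width = 1
    while level and width < max_width:
        width += 1
        counts = Counter()
        for rec in records:
            items = [(f, rec[f]) for f in features]  # fixed feature order
            for combo in combinations(items, width):
                # Only count a combination whose narrower sub-tuples all
                # survived the threshold at the previous width.
                if all(sub in level for sub in combinations(combo, width - 1)):
                    counts[combo] += 1
        level = {t: c for t, c in counts.items() if c >= threshold}
        kept.update(level)
    return kept


if __name__ == "__main__":
    demo = [
        {"title": "VP", "city": "SF"},
        {"title": "VP", "city": "SF"},
        {"title": "CEO", "city": "NY"},
    ]
    for combo, count in generate_tuples(demo, ["title", "city"]).items():
        print(combo, count)
```

Pruning by threshold at each width keeps the number of candidate tuples manageable, because a combination cannot occur more often than any of its parts.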

First claim

Opening claim text (preview).

The invention claimed is:

1. A system that identifies anomalous data in a record set by comparing frequencies of unique elements obtained from the record set and frequencies of the unique elements in a reference data set, the system including: a computer including memory; and computer instructions causing the computer to implement: creating an expanded tuple set by automatically expanding an existing first tuple set of a first feature from the record set to include a second tuple set of a second feature from the record set, the existing first tuple set being expanded by (i) adding the second tuple set to the existing first tuple set and (ii) creating unique elements with elements from the first feature from the record set and the second feature from the record set, wherein the unique elements in the expanded tuple set enumerate permutations of unique values of the second feature from the record set that are combined with values of the first feature from the record set to form the expanded tuple set; identifying a count of how often each feature value combination of the unique elements is found in the expanded tuple set; limiting the unique elements in the expanded tuple set to inhabited feature value combinations by (i) applying a threshold count criterion of 2 or more to the identified counts of how often the feature value combinations of the unique elements are found in the expanded tuple set and (ii) not retaining unique elements in the expanded tuple set that do not satisfy the threshold count criterion; after expanding the existing first tuple set into the expanded tuple set and applying the threshold count criterion, comparing frequencies of the unique elements in the expanded tuple set to frequencies of the unique elements in the reference data set to identify anomalous frequencies of the unique elements in the expanded tuple set with respect to the frequencies of the unique elements in the reference data set; and spotting outliers from the expanded tuple set with respect to the reference data set based on the identified anomalous frequencies.

2. The system of claim 1, wherein the threshold count criterion is in a range of 2 to 20.

3. The system of claim 1, wherein a number of features in the expanded tuple set is in a range of 4 to 40.

4. The system of claim 1, wherein a number of features in the expanded tuple set is in a range of 5 to 20.

5. The system of claim 1, further including, before combining a unique value of the second feature from the record set with an element of or applying the threshold count criterion to a resulting expanded tuple set element, qualifying the unique value of the second feature as satisfying the threshold count criterion.

6. The system of claim 1, wherein: the record set includes elements of a first type that are being tested for frequency of anomalies; and the reference data set includes between 10 and one billion times as many elements of the first type as the record set.

7. The system of claim 6, applied repeatedly to distinct groups of elements the first type, wherein there are between 10 and one million of the distinct groups of the first type.

8. The system of claim 1, wherein the computer instructions further cause the computer to implement reporting the outliers for analysis.

9. The system of claim 1, applied to identifying valued sources of contacts, wherein: the record set and the reference data set both include sales of contact objects; the record set includes contact objects from identified sources that are being tested for frequency of contact resale; the record set and the reference data set both include or can be counted to produce a frequencies contact object sales; the comparing of the frequencies includes comparing the frequencies of the contact object sales for the expanded tuple set generated from the record set to tuples generated from the reference data set; and the outliers are the identified sources whose contact objects have been sold with an anomalous frequency.

10. The system of claim 9, wherein categories of the identified valued sources of contacts further comprise company name, contact title, and contact location.

11. The system of claim 1, applied to screening insurance claims, wherein: the record set and the reference data set both include insurance claims submitted from service providers; the record set includes objects from at least one identified service provider whose claims are being tested; and the comparing of the frequencies includes comparing frequencies of insurance claim feature tuples generated from the record set for an identified service provider to insurance claim feature tuples generated from the reference data set.

12. The system of claim 11, further including: submissions of insurance claims having object features that match the insurance claim feature tuples generated from the record set to the insurance claim feature tuples generated from the reference data set; and the outliers are identified sources whose insurance claims have been submitted with an anomalous frequency.

13. The system of claim 1, applied to customer service call center routing wherein: the record set and the reference data set both include completed call summaries submitted from incoming customer calls; the record set includes objects from at least one identified call center whose incoming customer calls are being evaluated; and the comparing of the frequencies includes comparing frequencies of customer complaint feature tuples generated from the record set for an identified call center to customer complaint feature tuples from the reference data set.

14. The system of claim 13, further including: completed call summaries having object features that match the expanded tuple set generated from the record set to the tuples generated from the reference data set; and the outliers are customer service agents whose completed call summaries have been resolved with an anomalous frequency.

15. The system of claim 14, wherein resolved customer complaints with anomalous frequency correlate to customer service agents who handled incoming service calls with high rates of success.

16. A non-transitory computer readable media, including instructions that, when executed on a processor, cause the processor to execute a method for identifying anomalous data in a record set by comparing frequencies of unique elements obtained from the record set and frequencies of the unique elements in a reference data set, the method comprising: creating an expanded tuple set by automatically expanding an existing first tuple set of a first feature from the record set to include a second tuple set of a second feature from the record set, the existing first tuple set being expanded by (i) adding the second tuple set to the existing first tuple set, and (ii) creating unique elements with elements from the first feature from the record set and the second feature from the record set, wherein the unique elements in the expanded tuple set enumerate permutations of unique values of the second feature from the record set that are combined with values of the first feature from the record set to form the expanded tuple set; identifying a count of how often each feature value combination of the unique elements is found in the expanded tuple set; limiting the unique elements in the expanded tuple set to inhabited feature value combinations by (i) applying a threshold count criterion of 2 or more to the identified counts of how often the feature value combinations of the unique elements are found in the expanded tuple set and (ii) not retain
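The comparison step recited in claim 1 (count feature value combinations, keep those meeting a threshold count of 2 or more, then compare their frequencies against a reference data set) might look roughly like the sketch below. The claim text shown here does not say how an "anomalous frequency" is scored, so the ratio of relative frequencies, the add-one smoothing, and the `min_ratio` cutoff are all assumptions made for illustration, as are the function and parameter names.

```python
from collections import Counter


def tuple_counts(records, features):
    """Count each combination of values for the given features."""
    return Counter(tuple(rec[f] for f in features) for rec in records)


def spot_outliers(record_set, reference_set, features, threshold=2, min_ratio=3.0):
    """Flag combinations that are over-represented in `record_set`.

    Hypothetical scoring: a combination is flagged when its relative
    frequency in the record set is at least `min_ratio` times its smoothed
    relative frequency in the reference set. Assumes both sets are non-empty.
    """
    rec_counts = tuple_counts(record_set, features)
    ref_counts = tuple_counts(reference_set, features)
    n_rec, n_ref = len(record_set), len(reference_set)

    outliers = {}
    for combo, count in rec_counts.items():
        if count < threshold:
            continue  # not retained: fails the threshold count criterion
        rec_freq = count / n_rec
        # Add-one smoothing avoids division by zero for combinations
        # that never occur in the reference set.
        ref_freq = (ref_counts.get(combo, 0) + 1) / (n_ref + 1)
        ratio = rec_freq / ref_freq
        if ratio >= min_ratio:
            outliers[combo] = ratio
    return outliers


if __name__ == "__main__":
    # Toy data loosely modeled on the insurance-claim use case in claim 11.
    provider = [{"dx": "knee", "proc": "MRI"}] * 3 + [{"dx": "flu", "proc": "visit"}]
    everyone = [{"dx": "knee", "proc": "MRI"}] * 5 + [{"dx": "flu", "proc": "visit"}] * 95
    print(spot_outliers(provider, everyone, ["dx", "proc"]))
```

In practice the scoring rule would be replaced by whatever statistical test suits the data; the sketch only shows where the threshold and the reference comparison fit in the flow.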

Assignees

Salesforce Com Inc

Inventors

Classifications

  • Physics · mapped topic

  • Monitoring; Testing (of line transmission systems H04B3/46; arrangements for monitoring or testing transmission systems employing electromagnetic waves other than radio waves H04B10/07) · CPC title

  • Query processing support for facilitating data mining operations in structured databases · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9965524B2 cover?
The technology disclosed relates to automatic generation of tuples from a record set for outlier analysis. Applying this new technology, a user need not specify which 1-tuples to combine into n-tuples. The tuples are generated from structured records organized into features (which could also be fields, objects, or attributes). Tuples are generated from combinations of feature values in the records.…
Who is the assignee on this patent?
Salesforce Com Inc
What technology area does this patent fall under?
Primary CPC classification G06F17/30539. Mapped technology areas include Physics.
When was this patent published?
Publication date May 8, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).