Outlier detection for streaming data

US2017199902A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2017199902-A1
Application numberUS-201614990175-A
CountryUS
Kind codeA1
Filing dateJan 7, 2016
Priority dateJan 7, 2016
Publication dateJul 13, 2017
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Random cut trees are generated with respective to respective samples of a baseline set of data records of a data set for which outlier detection is to be performed. To construct a particular random cut tree, an iterative splitting technique is used, in which the attribute along which a given set of data records is split is selected based on its value range. With respect to a newly-received data record of the stream, an outlier score is determined based at least partly on a potential insertion location of a node representing the data record in a particular random cut tree, without necessarily modifying the random cut tree.

First claim

Opening claim text (preview).

What is claimed is: 1 . A system, comprising: one or more computing devices of a network-accessible analytics service implemented at a provider network; wherein the one or more computing devices are configured to: receive a request, via a programmatic interface from a client, to perform outlier detection on observation records of a streaming data source, wherein individual ones of the observation records comprise values of a plurality of attributes; collect a baseline set of observation records of the streaming data source; generate a plurality of random cut trees corresponding to respective samples of the baseline set of observation records, wherein generation of a particular random cut tree corresponding to a particular sample comprises one or more iterations of: selecting a particular attribute of the plurality of attributes based at least in part on a value range of the particular attribute; and splitting at least a portion of the particular sample on a split value selected for the particular attribute; in response to obtaining a particular observation record which is not included in the baseline set, compute an outlier score for the particular observation record based at least in part on a potential insertion location identified for the particular observation record with respect to the particular random cut tree; and in response to a determination, made according to a probabilistic stream sample update algorithm, that the sample corresponding to the particular random cut tree is to include the particular observation record, insert a node representing the particular observation record into the particular random cut tree; and in response to determining that the outlier score meets a result reporting criterion, provide an indication of the outlier score to the client. 2 . The system as recited in claim 1 , wherein to insert the node representing the particular observation record into the particular random cut tree, the one or more computing devices are configured to: in response to (a) reaching a particular node of the particular random cut tree during a traversal starting from the root node of the particular random cut tree, wherein the particular node represents a bounding box of attribute values corresponding to an earlier random split computation performed with respect to a first attribute of the plurality of attributes and (b) determining that the particular observation record falls outside the bounding box, perform a random split computation with respect to a selected attribute of the plurality of attributes represented in the bounding box; and based at least in part on the result of the random split computation, determine whether (a) a node representing the particular observation record is to be added as a new child node of the particular node or (b) traversal of the particular random cut tree is to be continued along a path which includes a different child node of the particular node. 3 . The system as recited in claim 1 , wherein the one or more computing devices are configured to: subsequent to the determination that the sample corresponding to the particular random cut tree is to include the particular observation record, delete a leaf node representing a different observation record from the particular random cut tree; replace the parent node of the leaf node with a sibling node of the leaf node; and adjust respective bounding boxes of one or more nodes on the path between the new location of the sibling node and the root node of the particular random cut tree. 4 . The system as recited in claim 1 , wherein the one or more computing devices are configured to: compute the outlier score based at least in part on one or more of: (a) a sparsity metric corresponding to the potential insertion location, (b) respective distance metrics computed with respect to the potential insertion location and one or more other nodes of the particular random cut tree, or (c) a displacement metric corresponding to the potential insertion location. 5 . The system as recited in claim 1 , wherein the request to perform outlier detection is formulated in a selected query language, and wherein the request indicates a parameter indicating a time window with respect to which the outlier detection is to be performed. 6 . A method, comprising: performing, by one or more computing devices: determining that outlier detection is to be performed on data records of a stream, wherein individual ones of the data records comprise values of a plurality of attributes; generating one or more random cut trees corresponding to respective samples of a baseline set of data records of the stream, wherein generation of a particular random cut tree corresponding to a particular sample comprises one or more iterations of: selecting a particular attribute of the plurality of attributes based at least in part on a value range of the particular attribute; and splitting at least a portion of the particular sample on a split value selected for the particular attribute; in response to obtaining a particular data record which is not included in the baseline set, storing an outlier score computed for the particular data record, wherein the outlier score is based at least in part on a potential insertion location identified for the particular data record with respect to the particular random cut tree, and wherein the outlier score is computed without modifying the particular random cut tree. 7 . The method as recited in claim 6 , further comprising performing, by the one or more computing devices: computing the outlier score based at least in part on one or more of: (a) a sparsity metric corresponding to the potential insertion location, (b) respective distance metrics computed with respect to the potential insertion location and one or more nodes of the particular random cut tree, or (c) a displacement metric corresponding to the potential insertion location. 8 . The method as recited in claim 6 , further comprising performing, by the one or more computing devices: in response to obtaining a second data record which is not included in the baseline set, computing a second outlier score corresponding to the second data record based at least in part on a potential insertion location identified for the second data record with respect to the particular random cut tree; and in accordance with a sample update algorithm, inserting the second data record into the particular random cut tree, and deleting a selected data record from the particular random cut tree. 9 . The method as recited in claim 8 , wherein the sample update algorithm comprises one or more of: (a) a random sample update algorithm, or (b) a weighted sample update algorithm. 10 . The method as recited in claim 6 , wherein the one or more computing devices are configured to: assigning, to a second data record, a second outlier score based at least in part on a determination that the potential insertion location for the second data record within the particular random cut tree is at a depth greater than a threshold, without traversing the particular random cut tree to a depth greater than the threshold. 11 . The method as recited in claim 6 , wherein the one or more computing devices are configured to: determining that outlier detection is to be performed on data records of a second stream, wherein individual ones of the data records of the second stream comprise values of a second plurality of attributes; pre-processing, prior to generating one or more random cut trees corresponding to a second baseline set of data records of the second stream, at least a subset of the second baseline set, wherein said pre-proc

Assignees

Inventors

Classifications

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • G06F16/215Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Machine learning · CPC title

  • Physics · mapped topic

  • Query processing support for facilitating data mining operations in structured databases · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017199902A1 cover?
Random cut trees are generated with respective to respective samples of a baseline set of data records of a data set for which outlier detection is to be performed. To construct a particular random cut tree, an iterative splitting technique is used, in which the attribute along which a given set of data records is split is selected based on its value range. With respect to a newly-received data…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu Jul 13 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).