What technology area does this patent fall under?

Primary CPC classification G06F16/215. Mapped technology areas include Physics.

When was this patent published?

Publication date Thu Jul 13 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Outlier detection for streaming data

US2017199902A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2017199902-A1
Application number	US-201614990175-A
Country	US
Kind code	A1
Filing date	Jan 7, 2016
Priority date	Jan 7, 2016
Publication date	Jul 13, 2017
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Random cut trees are generated with respective to respective samples of a baseline set of data records of a data set for which outlier detection is to be performed. To construct a particular random cut tree, an iterative splitting technique is used, in which the attribute along which a given set of data records is split is selected based on its value range. With respect to a newly-received data record of the stream, an outlier score is determined based at least partly on a potential insertion location of a node representing the data record in a particular random cut tree, without necessarily modifying the random cut tree.

First claim

Opening claim text (preview).

What is claimed is: 1 . A system, comprising: one or more computing devices of a network-accessible analytics service implemented at a provider network; wherein the one or more computing devices are configured to: receive a request, via a programmatic interface from a client, to perform outlier detection on observation records of a streaming data source, wherein individual ones of the observation records comprise values of a plurality of attributes; collect a baseline set of observation records of the streaming data source; generate a plurality of random cut trees corresponding to respective samples of the baseline set of observation records, wherein generation of a particular random cut tree corresponding to a particular sample comprises one or more iterations of: selecting a particular attribute of the plurality of attributes based at least in part on a value range of the particular attribute; and splitting at least a portion of the particular sample on a split value selected for the particular attribute; in response to obtaining a particular observation record which is not included in the baseline set, compute an outlier score for the particular observation record based at least in part on a potential insertion location identified for the particular observation record with respect to the particular random cut tree; and in response to a determination, made according to a probabilistic stream sample update algorithm, that the sample corresponding to the particular random cut tree is to include the particular observation record, insert a node representing the particular observation record into the particular random cut tree; and in response to determining that the outlier score meets a result reporting criterion, provide an indication of the outlier score to the client. 2 . The system as recited in claim 1 , wherein to insert the node representing the particular observation record into the particular random cut tree, the one or more computing devices are configured to: in response to (a) reaching a particular node of the particular random cut tree during a traversal starting from the root node of the particular random cut tree, wherein the particular node represents a bounding box of attribute values corresponding to an earlier random split computation performed with respect to a first attribute of the plurality of attributes and (b) determining that the particular observation record falls outside the bounding box, perform a random split computation with respect to a selected attribute of the plurality of attributes represented in the bounding box; and based at least in part on the result of the random split computation, determine whether (a) a node representing the particular observation record is to be added as a new child node of the particular node or (b) traversal of the particular random cut tree is to be continued along a path which includes a different child node of the particular node. 3 . The system as recited in claim 1 , wherein the one or more computing devices are configured to: subsequent to the determination that the sample corresponding to the particular random cut tree is to include the particular observation record, delete a leaf node representing a different observation record from the particular random cut tree; replace the parent node of the leaf node with a sibling node of the leaf node; and adjust respective bounding boxes of one or more nodes on the path between the new location of the sibling node and the root node of the particular random cut tree. 4 . The system as recited in claim 1 , wherein the one or more computing devices are configured to: compute the outlier score based at least in part on one or more of: (a) a sparsity metric corresponding to the potential insertion location, (b) respective distance metrics computed with respect to the potential insertion location and one or more other nodes of the particular random cut tree, or (c) a displacement metric corresponding to the potential insertion location. 5 . The system as recited in claim 1 , wherein the request to perform outlier detection is formulated in a selected query language, and wherein the request indicates a parameter indicating a time window with respect to which the outlier detection is to be performed. 6 . A method, comprising: performing, by one or more computing devices: determining that outlier detection is to be performed on data records of a stream, wherein individual ones of the data records comprise values of a plurality of attributes; generating one or more random cut trees corresponding to respective samples of a baseline set of data records of the stream, wherein generation of a particular random cut tree corresponding to a particular sample comprises one or more iterations of: selecting a particular attribute of the plurality of attributes based at least in part on a value range of the particular attribute; and splitting at least a portion of the particular sample on a split value selected for the particular attribute; in response to obtaining a particular data record which is not included in the baseline set, storing an outlier score computed for the particular data record, wherein the outlier score is based at least in part on a potential insertion location identified for the particular data record with respect to the particular random cut tree, and wherein the outlier score is computed without modifying the particular random cut tree. 7 . The method as recited in claim 6 , further comprising performing, by the one or more computing devices: computing the outlier score based at least in part on one or more of: (a) a sparsity metric corresponding to the potential insertion location, (b) respective distance metrics computed with respect to the potential insertion location and one or more nodes of the particular random cut tree, or (c) a displacement metric corresponding to the potential insertion location. 8 . The method as recited in claim 6 , further comprising performing, by the one or more computing devices: in response to obtaining a second data record which is not included in the baseline set, computing a second outlier score corresponding to the second data record based at least in part on a potential insertion location identified for the second data record with respect to the particular random cut tree; and in accordance with a sample update algorithm, inserting the second data record into the particular random cut tree, and deleting a selected data record from the particular random cut tree. 9 . The method as recited in claim 8 , wherein the sample update algorithm comprises one or more of: (a) a random sample update algorithm, or (b) a weighted sample update algorithm. 10 . The method as recited in claim 6 , wherein the one or more computing devices are configured to: assigning, to a second data record, a second outlier score based at least in part on a determination that the potential insertion location for the second data record within the particular random cut tree is at a depth greater than a threshold, without traversing the particular random cut tree to a depth greater than the threshold. 11 . The method as recited in claim 6 , wherein the one or more computing devices are configured to: determining that outlier detection is to be performed on data records of a second stream, wherein individual ones of the data records of the second stream comprise values of a second plurality of attributes; pre-processing, prior to generating one or more random cut trees corresponding to a second baseline set of data records of the second stream, at least a subset of the second baseline set, wherein said pre-proc

Assignees

Amazon Tech Inc

Inventors

Classifications

G06N5/01
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
G06F16/215Primary
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
G06N20/00
Machine learning · CPC title
G06F17/30303Primary
Physics · mapped topic
G06F16/2465Primary
Query processing support for facilitating data mining operations in structured databases · CPC title

Patent family

Related publications grouped by family.

View patent family 57890928

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017199902A1 cover?: Random cut trees are generated with respective to respective samples of a baseline set of data records of a data set for which outlier detection is to be performed. To construct a particular random cut tree, an iterative splitting technique is used, in which the attribute along which a given set of data records is split is selected based on its value range. With respect to a newly-received data…
Who is the assignee on this patent?: Amazon Tech Inc
What technology area does this patent fall under?: Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?: Publication date Thu Jul 13 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).