Centroid for improving machine learning classification and info retrieval

US10417530B2 · US · B2

Patent metadata
Publication number: US-10417530-B2
Application number: US-201715720372-A
Country: US
Kind code: B2
Filing date: Sep 29, 2017
Priority date: Sep 30, 2016
Publication date: Sep 17, 2019
Grant date: Sep 17, 2019

Abstract

Official abstract text for this publication.

Centroids are used for improving machine learning classification and information retrieval. A plurality of files are classified as malicious or not malicious based on a function dividing a coordinate space into at least a first portion and a second portion such that the first portion includes a first subset of the plurality of files classified as malicious. One or more first geometric regions are defined in the first portion that classify files from the first subset as not malicious. A file is determined to be malicious based on whether the file is located within the one or more first geometric regions.
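The two-stage decision described in the abstract can be illustrated with a minimal sketch. It assumes a linear dividing function and circular regions; the names `Circle`, `base_is_malicious`, and `is_malicious`, as well as the weights and bias, are hypothetical and do not come from the patent, which leaves the dividing function open. The sketch covers only the first portion of the coordinate space; the second geometric regions of claim 2 are omitted.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Circle:
    """A circular (or hyperspherical) region: a center point plus a radius."""
    center: Sequence[float]
    radius: float

    def contains(self, point: Sequence[float]) -> bool:
        # A point is inside when its Euclidean distance to the center
        # is less than or equal to the radius.
        return math.dist(self.center, point) <= self.radius


def base_is_malicious(point: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> bool:
    """Assumed linear dividing function: a positive score is the 'first portion'."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return score > 0.0


def is_malicious(point: Sequence[float],
                 weights: Sequence[float],
                 bias: float,
                 benign_regions: Iterable[Circle]) -> bool:
    """Apply the dividing function, then let benign regions override it."""
    if not base_is_malicious(point, weights, bias):
        # Second portion: treated as not malicious in this sketch.
        return False
    # First portion: malicious unless the point falls inside a benign region.
    return not any(region.contains(point) for region in benign_regions)
```

With weights `(1.0, 1.0)`, bias `-1.0`, and a single benign region centered at `(2.0, 2.0)` with radius `0.5`, a file mapped to `(3.0, 3.0)` stays malicious, while one mapped to `(2.1, 2.0)` is reclassified as not malicious because it lies inside the region.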

First claim

Opening claim text (preview).

The invention claimed is:

1. A system, comprising: at least one processor; and at least one memory including program code which when executed by the at least one memory provides operations comprising: classifying at least a portion of a plurality of files as malicious based on a function dividing a coordinate space into at least a first portion and a second portion, wherein the first portion includes a first subset of the plurality of files classified as malicious; defining one or more first geometric regions in the first portion that classify files from the first subset as not malicious; identifying a plurality of clusters from the plurality of files; determining whether any of the plurality of clusters do not include known malicious files; defining individual geometric regions around at least one of the plurality of clusters which do not include known malicious files, wherein the one or more first geometric regions include the individual geometric regions; determining whether a file is malicious based on whether the file is located within the one or more first geometric regions; and preventing files determined to be malicious from such files from executing, opening, continuing to execute, writing, or being downloaded.

2. A system as in claim 1, wherein the second portion includes a second subset of the plurality of files classified as not malicious, and wherein the operations further comprise: defining one or more second geometric regions in the second portion that classify files from the second subset as malicious, wherein determining whether the file is malicious further comprises determining whether the file is located within a region of the second portion that does not include the one or more second geometric regions.

3. A system as in claim 1, wherein the operations further comprise: determining a plurality of attributes of the plurality of files; and mapping the plurality of files in a positive portion of the coordinate space defined by an intersection of at least two of the plurality of attributes.

4. A system as in claim 1, wherein the operations further comprise: determining whether any of the individual geometric regions include a radius greater than a threshold value; reducing the radius of the individual geometric regions which are greater than the threshold value such that the radius is less than or equal to the threshold value; and re-defining, after the reducing, the individual geometric regions which no longer include all files from a respective cluster of the plurality of clusters, wherein the re-defining includes defining multiple smaller geometric regions in place of the individual geometric regions.

5. A system as in claim 1, wherein the one or more first geometric regions include a circular geometry having a center point and a radius, and wherein the file is determined to be located within the one or more first geometric regions when a distance between the center point and a location of the file is less than or equal to the radius.

6. A system as in claim 5, wherein the center point is determined based on averaging locations for each of the plurality of files located within the one or more first geometric regions.

7. A system as in claim 5, wherein the center point is determined based on shared attributes for each of the plurality of files located within the one or more first geometric regions.

8. A system as in claim 5, wherein the radius is determined based on a maximum Euclidian distance between each of the plurality of files located within the one or more first geometric regions.

9. A system as in claim 1, wherein the classifying employs at least one machine learning model.

10. A system as in claim 1, wherein the classifying employs at least one of: a neural networks, a support vector machine, a logistic regression model, a Bayesian algorithm, or a decision tree.

11. A computer-implemented method, comprising: classifying at least a portion of a plurality of files as malicious based on a function dividing a coordinate space into at least a first portion and a second portion, wherein the first portion includes a first subset of the plurality of files classified as malicious; defining one or more first geometric regions in the first portion that classify files from the first subset as not malicious; identifying a plurality of clusters from the plurality of files; determining whether any of the plurality of clusters do not include known malicious files; defining individual geometric regions around at least one of the plurality of clusters which do not include known malicious files, wherein the one or more first geometric regions include the individual geometric regions; determining whether a file is malicious based on whether the file is located within the one or more first geometric regions; and preventing files determined to be malicious from such files from executing, opening, continuing to execute, writing, or being downloaded.

12. A computer-implemented method as in claim 11, wherein the second portion includes a second subset of the plurality of files classified as not malicious, wherein the method further comprises: defining one or more second geometric regions in the second portion that classify files from the second subset as malicious, and wherein determining whether the file is malicious further comprises determining whether the file is located within a region of the second portion that does not include the one or more second geometric regions.

13. A computer-implemented method as in claim 11, further comprising: determining a plurality of attributes of the plurality of files; and mapping the plurality of files in a positive portion of the coordinate space defined by an intersection of at least two of the plurality of attributes.

14. A computer-implemented method as in claim 11, further comprising: determining whether any of the individual geometric regions include a radius greater than a threshold value; reducing the radius of the individual geometric regions which are greater than the threshold value such that the radius is less than or equal to the threshold value; and re-defining, after the reducing, the individual geometric regions which no longer include all files from a respective cluster of the plurality of clusters, wherein the re-defining includes defining multiple smaller geometric regions in place of the individual geometric regions.

15. A computer-implemented method as in claim 11, wherein the one or more first geometric regions include a circular geometry having a center point and a radius, and wherein the file is determined to be located within the one or more first geometric regions when a distance between the center point and a location of the file is less than or equal to the radius.

16. A computer-implemented method as in claim 15, wherein the center point is determined based on averaging locations for each of the plurality of files located within the one or more first geometric regions.

17. A computer-implemented method as in claim 15, wherein the center point is determined based on shared attributes for each of the plurality of files located within the one or more first geometric regions.

18. A computer-implemented method as in claim 15, wherein the radius is determined based on a maximum Euclidian distance between each of the plurality of files located within the one or more first geometric regions.

19. A method as in claim 11, wherein the classifying employs at least one machine learning model.

20. A system
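The region-building steps recited in the claims (clusters free of known malicious files, an averaged center point, a maximum-distance radius, and a radius threshold) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the radius is taken as the largest distance from the averaged center to any cluster member, and an oversized region is replaced by recursively halving the cluster along its widest axis, a splitting strategy the claims do not specify. All function names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence, Set, Tuple

Point = Tuple[float, ...]
Region = Tuple[Point, float]  # (center point, radius)


def centroid(points: Sequence[Point]) -> Point:
    """Center point obtained by averaging member locations."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def bounding_circle(points: Sequence[Point]) -> Region:
    """Assumed radius: maximum Euclidean distance from the centroid to a member."""
    center = centroid(points)
    radius = max(math.dist(center, p) for p in points)
    return center, radius


def capped_circles(points: Sequence[Point], max_radius: float) -> List[Region]:
    """Cover a cluster with circles whose radius does not exceed max_radius."""
    center, radius = bounding_circle(points)
    if radius <= max_radius or len(points) == 1:
        return [(center, radius)]
    # Assumed splitting rule: sort along the axis with the largest spread
    # and recurse on each half, yielding several smaller regions.
    dims = range(len(points[0]))
    axis = max(dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return capped_circles(ordered[:mid], max_radius) + capped_circles(ordered[mid:], max_radius)


def regions_for_clean_clusters(clusters: Iterable[Sequence[Point]],
                               known_malicious: Set[Point],
                               max_radius: float) -> List[Region]:
    """Define regions only around clusters that contain no known malicious file."""
    regions: List[Region] = []
    for cluster in clusters:
        if any(p in known_malicious for p in cluster):
            continue  # skip clusters that include a known malicious file
        regions.extend(capped_circles(cluster, max_radius))
    return regions
```

For the four corners of a 2-by-2 square, the single bounding circle has center `(1.0, 1.0)` and radius equal to the square root of 2. With a threshold of `1.0`, the sketch replaces it with two circles of radius `1.0` centered at `(0.0, 1.0)` and `(2.0, 1.0)`.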

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10417530B2 cover?
Centroids are used for improving machine learning classification and information retrieval. A plurality of files are classified as malicious or not malicious based on a function dividing a coordinate space into at least a first portion and a second portion such that the first portion includes a first subset of the plurality of files classified as malicious. One or more first geometric regions a…
Who is the assignee on this patent?
Cylance Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 17, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).