Centroid for improving machine learning classification and info retrieval

US11568185B2 · US · B2

Patent metadata
Publication number: US-11568185-B2
Application number: US-202017024439-A
Country: US
Kind code: B2
Filing date: Sep 17, 2020
Priority date: Sep 30, 2016
Publication date: Jan 31, 2023
Grant date: Jan 31, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Centroids are used for improving machine learning classification and information retrieval. A plurality of files are classified as malicious or not malicious based on a function dividing a coordinate space into at least a first portion and a second portion such that the first portion includes a first subset of the plurality of files classified as malicious. One or more first centroids are defined in the first portion that classify files from the first subset as not malicious. A file is determined to be malicious based on whether the file is located within the one or more first centroids.
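The two-stage decision in the abstract can be illustrated with a minimal sketch. It assumes a two-dimensional coordinate space, a hypothetical linear dividing function (`boundary`), and circular centroids given as a center and a radius; none of these names or values come from the patent. A point in the first portion is treated as malicious unless it falls inside one of the first centroids, which act as not-malicious exceptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class Centroid:
    """Circular region: a center point plus a radius (illustrative type)."""
    center: Point
    radius: float

    def contains(self, location: Point) -> bool:
        # Inside when the Euclidean distance to the center is <= radius.
        return math.dist(self.center, location) <= self.radius


def is_malicious(
    location: Point,
    boundary: Callable[[Point], float],
    first_centroids: Sequence[Centroid],
) -> bool:
    """Apply the dividing function, then the centroid exceptions."""
    if boundary(location) <= 0:
        # Second portion: the dividing function says "not malicious".
        return False
    # First portion: malicious unless a first centroid covers the point.
    return not any(c.contains(location) for c in first_centroids)


def boundary(p: Point) -> float:
    # Hypothetical dividing function: positive means "first portion".
    return p[0] + p[1] - 10.0


# Hypothetical not-malicious exception inside the first portion.
first_centroids = [Centroid(center=(8.0, 8.0), radius=1.5)]

inside_exception = is_malicious((9.0, 9.0), boundary, first_centroids)
plain_malicious = is_malicious((12.0, 3.0), boundary, first_centroids)
second_portion = is_malicious((2.0, 2.0), boundary, first_centroids)
```

In this sketch the centroid lookup only runs for points the dividing function already flags, so the centroids correct the function's output in one region without changing the function itself.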

First claim

Opening claim text (preview).

The invention claimed is:

1. A system, comprising: at least one processor; and at least one memory including program code which when executed by the at least one memory provides operations comprising: receiving a file; determining whether the file is malicious based on whether the file is located within one or more first centroids; preventing the file determined to be malicious from executing, opening, continuing to execute, writing, or being downloaded; wherein the one or more first centroids are generated by: identifying a plurality of clusters from a plurality of files; determining whether any of the plurality of clusters do not include known malicious files; and defining individual centroids around each of the plurality of clusters which do not include known malicious files, wherein the one or more first centroids include the individual centroids.

2. A system as in claim 1, wherein the operations further comprise: determining a plurality of attributes of the plurality of files; and mapping the plurality of files in a positive portion of a coordinate space defined by an intersection of at least two of the plurality of attributes.

3. A system as in claim 1, wherein the operations further comprise: classifying at least a portion of the plurality of files as malicious based on a function dividing a coordinate space into at least a first portion and a second portion, wherein the first portion includes a first subset of the plurality of files classified as malicious; and defining one or more first centroids in the first portion that classify files from the first subset as not malicious.

4. A system as in claim 3, wherein the operations further comprise: determining whether any of the individual centroids include a radius greater than a threshold value; reducing the radius of the individual centroids which are greater than the threshold value such that the radius is less than or equal to the threshold value; and re-defining, after the reducing, the individual centroids which no longer include all files from a respective cluster of the plurality of clusters, wherein the re-defining includes defining multiple smaller centroids in place of the individual centroids.

5. A system as in claim 1, wherein the one or more first centroids include a circular geometry having a center point and a radius, and wherein the file is determined to be located within the one or more first centroids when a distance between the center point and a location of the file is less than or equal to the radius.

6. A system as in claim 5, wherein the center point is determined based on averaging locations for each of the plurality of files located within the one or more first centroids.

7. A system as in claim 5, wherein the center point is determined based on shared attributes for each of the plurality of files located within the one or more first centroids.

8. A system as in claim 5, wherein the radius is determined based on a maximum Euclidian distance between each of the plurality of files located within the one or more first centroids.

9. A computer-implemented method, comprising: classifying at least a portion of a plurality of files as malicious based on a function dividing a coordinate space into at least a first portion and a second portion, wherein the first portion includes a first subset of the plurality of files classified as malicious; defining one or more first centroids in the first portion that classify files from the first subset as not malicious; and determining whether a file is malicious based on whether the file is located within the one or more first centroids, wherein the one or more first centroids include a circular geometry having a center point and a radius, and wherein the file is determined to be located within the one or more first centroids when a distance between the center point and a location of the file is less than or equal to the radius.

10. A computer-implemented method as in claim 9, wherein the second portion includes a second subset of the plurality of files classified as not malicious, wherein the method further comprises: defining one or more second centroids in the second portion that classify files from the second subset as malicious, and wherein determining whether the file is malicious further comprises determining whether the file is located within a region of the second portion that does not include the one or more second centroids.

11. A computer-implemented method as in claim 9, further comprising: determining a plurality of attributes of the plurality of files; and mapping the plurality of files in a positive portion of the coordinate space defined by an intersection of at least two of the plurality of attributes.

12. A computer-implemented method as in claim 9, further comprising: identifying a plurality of clusters from the plurality of files; determining whether any of the plurality of clusters do not include known malicious files; and defining individual centroids around each of the plurality of clusters which do not include known malicious files, wherein the one or more first centroids includes the individual centroids.

13. A computer-implemented method as in claim 12, further comprising: determining whether any of the individual centroids include a radius greater than a threshold value; reducing the radius of the individual centroids which are greater than the threshold value such that the radius is less than or equal to the threshold value; and re-defining, after the reducing, the individual centroids which no longer include all files from a respective cluster of the plurality of clusters, wherein the re-defining includes defining multiple smaller centroids in place of the individual centroids.

14. A computer-implemented method as in claim 9, wherein the center point is determined based on averaging locations for each of the plurality of files located within the one or more first centroids.

15. A computer-implemented method as in claim 9, wherein the center point is determined based on shared attributes for each of the plurality of files located within the one or more first centroids.

16. A computer-implemented method as in claim 9, wherein the radius is determined based on a maximum Euclidian distance between each of the plurality of files located within the one or more first centroids.

17. A computer-implemented method comprising: receiving files for classification from each of a plurality of endpoint computer systems; classifying at least one of the files as belonging to a specific classification type indicating that the file is malicious when the file is located within one or more first centroids; and preventing the classified files from executing, opening, continuing to execute, writing, or being downloaded in response to the classification; wherein the one or more centroids are generated by: searching for one or more clusters among a plurality of training files in a coordinate space; defining one or more centroids around the one or more clusters, the one or more centroids classifying a set of training files within the one or more centroids as belonging to a specific classification type; and defining individual centroids around each of the plurality of clusters which have a classification type corresponding to such centroid not including known malicious files, wherein the defined one or more centroids include the individual centroids.

18. A method as in claim 17, wherein the specific classification type includes one or more of safe, suspect, benign, unsafe, ma
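The centroid-generation steps recited in the claims (skip clusters that contain known malicious files, average member locations for the center, use a maximum Euclidean distance for the radius, and replace an oversized centroid with several smaller ones) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the clusters are taken as already identified, the radius is read as the largest center-to-member distance, and the greedy `split_cluster` procedure, all type and function names, and the sample data are hypothetical. Because the radius here equals the distance to the farthest member, reducing an oversized radius always excludes at least one file, so the sketch goes straight to the re-defining step.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class LabeledFile:
    location: Point        # position of the file in the coordinate space
    known_malicious: bool  # label assumed to be available for training files


@dataclass(frozen=True)
class Centroid:
    center: Point
    radius: float

    def contains(self, location: Point) -> bool:
        return math.dist(self.center, location) <= self.radius


def fit_centroid(points: Sequence[Point]) -> Centroid:
    """Center = mean location; radius = distance to the farthest member."""
    dims = len(points[0])
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(dims))
    radius = max(math.dist(center, p) for p in points)
    return Centroid(center, radius)


def split_cluster(points: Sequence[Point], max_radius: float) -> List[Centroid]:
    """Hypothetical greedy split into centroids with radius <= max_radius."""
    remaining = list(points)
    result = []
    while remaining:
        seed = remaining[0]
        # Points within max_radius / 2 of the seed are pairwise within
        # max_radius, so their mean lies within max_radius of each of them.
        group = [p for p in remaining if math.dist(seed, p) <= max_radius / 2]
        remaining = [p for p in remaining if math.dist(seed, p) > max_radius / 2]
        result.append(fit_centroid(group))
    return result


def build_centroids(
    clusters: Sequence[Sequence[LabeledFile]], max_radius: float
) -> List[Centroid]:
    centroids = []
    for cluster in clusters:
        if any(f.known_malicious for f in cluster):
            continue  # only clusters without known malicious files qualify
        points = [f.location for f in cluster]
        centroid = fit_centroid(points)
        if centroid.radius > max_radius:
            centroids.extend(split_cluster(points, max_radius))
        else:
            centroids.append(centroid)
    return centroids


# Made-up sample data: one compact cluster, one cluster containing a known
# malicious file, and one cluster too spread out for a single centroid.
clusters = [
    [LabeledFile(p, False) for p in [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]],
    [LabeledFile((10.0, 10.0), False), LabeledFile((10.5, 10.0), True)],
    [LabeledFile(p, False) for p in [(20.0, 0.0), (21.0, 0.0), (30.0, 0.0), (31.0, 0.0)]],
]
centroids = build_centroids(clusters, max_radius=2.0)
```

With this sample data the second cluster is skipped and the third is replaced by two smaller centroids, so three centroids are returned in total. Other splitting strategies (for example, re-running a clustering algorithm on the oversized cluster) would fit the claim language equally well.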

Assignees

Cylance Inc

Inventors

Classifications

  • G06N3/08 · Primary

    Learning methods · CPC title

  • Computer malware detection or handling, e.g. anti-virus arrangements · CPC title

  • G06K9/6272 · Primary

    Physics · mapped topic

  • Physics · mapped topic

  • File meta data generation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11568185B2 cover?
Centroids are used for improving machine learning classification and information retrieval. A plurality of files are classified as malicious or not malicious based on a function dividing a coordinate space into at least a first portion and a second portion such that the first portion includes a first subset of the plurality of files classified as malicious. One or more first centroids are defin…
Who is the assignee on this patent?
Cylance Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 31, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).