Systems and methods for online clustering of content items

US11003692B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11003692-B2
Application numberUS-201514980572-A
CountryUS
Kind codeB2
Filing dateDec 28, 2015
Priority dateDec 28, 2015
Publication dateMay 11, 2021
Grant dateMay 11, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and non-transitory computer-readable media can obtain a first batch of content items to be clustered. A set of clusters can be generated by clustering respective binary hash codes for each content item in the first batch, wherein content items included in a cluster are visually similar to one another. A next batch of content items to be clustered can be obtained. One or more respective binary hash codes for the content items in the next batch can be assigned to a cluster in the set of clusters.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method comprising: obtaining, by a computing system, a first batch of content items to be clustered; selecting, by the computing system, a set of the first batch of content items to be a set of cluster centers; inserting, by the computing system, cluster center binary hash codes of the set of cluster centers into a multi-index hash table; generating, by the computing system, a set of clusters of visually similar content items in the first batch of content items based at least in part on first binary hash codes associated with the first batch of content items, wherein the generating is based on an assignment of a content item of the first batch of content items to a nearest cluster center based on a look up of a nearest cluster center binary hash code in the multi-index hash table; updating, by the computing system, the cluster center binary hash codes based on means of the first binary hash codes; obtaining, by the computing system, a second batch of content items to be clustered; assigning, by the computing system, at least a portion of second binary hash codes associated with the second batch of content items to clusters in the set of clusters based on distances between the second binary hash codes and cluster centers in the set of clusters; and generating, by the computing system, at least one new cluster based on unassigned second binary hash codes that are within a threshold distance to each other. 2. The computer-implemented method of claim 1 , wherein obtaining, by the computing system, the first batch of content items to be clustered further comprises at least one of: determining that a threshold number of content items have been received, or determining that a threshold period of time has elapsed. 3. The computer-implemented method of claim 1 , wherein generating, by the computing system, the set of clusters further comprises: generating the first binary hash codes associated with the first batch of content items; and determining, for each content item in the first batch of content items, a corresponding cluster from the set of clusters based on the look up of the nearest cluster center binary hash code in the multi-index hash table. 4. The computer-implemented method of claim 3 , wherein determining, for each content item in the first batch of content items, the corresponding cluster from the set of clusters further comprises: determining, for each content item in the first batch of content items, a respective distance between the first binary hash code for the content item and the nearest cluster center for the content item; and assigning each content item in the first batch of content items to the nearest cluster center for the content item based at least in part on the respective distance satisfying a threshold. 5. The computer-implemented method of claim 1 , wherein the set of cluster centers are randomly selected. 6. The computer-implemented method of claim 1 , wherein the set of clusters of visually similar content items is generated further based at least in part on a binary k-means clustering approach. 7. The computer-implemented method of claim 1 , the method further comprising: removing, from each cluster in the set of clusters, content items in the cluster that are at least a threshold distance from a cluster center corresponding to the cluster. 8. The computer-implemented method of claim 1 , the method further comprising: removing, from the set of clusters, clusters that are smaller than a threshold size. 9. The computer-implemented method of claim 1 , wherein assigning, by the computing system, the at least the portion of the second binary hash codes associated with the second batch of content items further comprises: determining a nearest cluster center for each content item in the second batch of content items; and determining, for each content item in the second batch of content items, that a respective distance between the second binary hash code for the content item and a binary hash code corresponding to the nearest cluster center for the content item satisfies a threshold. 10. The computer-implemented method of claim 1 , the method further comprising: determining that a content item is categorized as spam; determining a cluster in which the content item was clustered, the cluster including a plurality of other content items; and determining that the plurality of other content items in the cluster are spam. 11. A system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to perform: obtaining a first batch of content items to be clustered; selecting a set of the first batch of content items to be a set of cluster centers; inserting cluster center binary hash codes of the set of cluster centers into a multi-index hash table; generating a set of clusters of visually similar content items in the first batch of content items based at least in part on first binary hash codes associated with the first batch of content items, wherein the generating is based on an assignment of a content item of the first batch of content items to a nearest cluster center based on a look up of a nearest cluster center binary hash code in the multi-index hash table; updating the cluster center binary hash codes based on means of the first binary hash codes; obtaining a second batch of content items to be clustered; assigning at least a portion of secondary binary hash codes associated with the second batch of content items of clusters in the set of clusters based on distances between the second binary hash codes and cluster centers in the set of clusters; and generating at least one new cluster based on unassigned second binary hash codes that are within a threshold distance to each other. 12. The system of claim 11 , wherein obtaining the first batch of content items to be clustered further comprises at least one of: determining that a threshold number of content items have been received, or determining that a threshold period of time has elapsed. 13. The system of claim 11 , wherein generating the set of clusters further comprises: generating the first binary hash codes associated with the first batch of content items; and determining, for each content item in the first batch of content items, a corresponding cluster from the set of clusters based on the look up of the nearest cluster center binary hash code in the multi-index hash table. 14. The system of claim 13 , wherein determining, for each content item in the first batch of content items, the corresponding cluster from the set of clusters further comprises: determining, for each content item in the first batch of content items, a respective distance between the first binary hash code for the content item and the nearest cluster center for the content item; and assigning each content item in the first batch of content items to the nearest cluster center for the content item based at least in part on the respective distance satisfying a threshold. 15. The system of claim 11 , wherein the set of cluster centers are randomly selected. 16. A non-transitory computer-readable storage medium including instructions that, when executed by at least one processor of a computing system, cause the computing system to perform a method comprising: obtaining a first batch of content items to be clustered; selecting a set of the first batch of content items to be a set of cluster centers; inserting cluster center binary hash codes of the set of cluster centers

Assignees

Inventors

Classifications

  • G06N20/00Primary

    Machine learning · CPC title

  • for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS] · CPC title

  • Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title

  • G06F16/285Primary

    Clustering or classification · CPC title

  • in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11003692B2 cover?
Systems, methods, and non-transitory computer-readable media can obtain a first batch of content items to be clustered. A set of clusters can be generated by clustering respective binary hash codes for each content item in the first batch, wherein content items included in a cluster are visually similar to one another. A next batch of content items to be clustered can be obtained. One or more r…
Who is the assignee on this patent?
Facebook Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue May 11 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).