Automatic techniques for constructing an evolving interest taxonomy from user-generated content

US12437010B2 · US · B2

Patent metadata
Field                Value
Publication number   US-12437010-B2
Application number   US-202418436945-A
Country              US
Kind code            B2
Filing date          Feb 8, 2024
Priority date        Feb 8, 2024
Publication date     Oct 7, 2025
Grant date           Oct 7, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for creating an interest graph include obtaining content items from multiple content sources and applying tailored (e.g., source-specific) preprocessing to the content items based on their respective content source. Text is extracted and salient keywords and key phrases are identified using unsupervised machine learning models. The keywords and key phrases become nodes in an interest graph, each node comprising an embedding of a keyword or key phrase in a common embedding space, with edges representing semantic similarity based on embeddings or co-engagement patterns. The graph provides an expansive, granular, and dynamic taxonomy easily adaptable to emerging interests. The interest graph overcomes limitations of conventional taxonomies that lack depth, fail to capture niche interests, and cannot adapt to reflect evolving user preferences. The described techniques construct a rich interest graph from diverse content for improved content understanding.
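
The graph structure described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the 3-dimensional vectors, the 0.8 threshold, and the function names `cosine` and `build_interest_graph` are all hypothetical, and cosine similarity is only one plausible way to measure "semantic similarity based on embeddings". A real system would obtain the vectors from an embedding model rather than writing them by hand.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_interest_graph(embeddings, threshold=0.8):
    """Build a toy interest graph.

    Nodes map each keyword or key phrase to its embedding; edges map a pair
    of phrases to their similarity when it meets the (assumed) threshold.
    """
    edges = {}
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return dict(embeddings), edges


# Hypothetical embeddings in a shared 3-d space, for illustration only.
toy_embeddings = {
    "trail running": [0.9, 0.1, 0.0],
    "ultramarathon": [0.8, 0.2, 0.1],
    "sourdough baking": [0.0, 0.1, 0.9],
}

nodes, edges = build_interest_graph(toy_embeddings, threshold=0.8)
for (a, b), sim in edges.items():
    print(f"{a} <-> {b}: {sim:.3f}")
```

With these toy vectors only the two running-related phrases are linked, because their embeddings point in nearly the same direction. Adding a newly observed phrase is just another dictionary entry, which is the sense in which such a graph can adapt to emerging interests.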

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: obtaining a plurality of content items from a plurality of content delivery sources of an online platform, each content delivery source having respective content items in a different type of content format; applying distinct preprocessing steps to the respective content items from each content delivery source to extract text from the plurality of content items, wherein the distinct preprocessing steps are specific to the different content format of each content delivery source and include separate preprocessing pipelines for processing different types of content formats, wherein the preprocessing steps comprise source-specific preprocessing tailored to each type of content format and generalized source-agnostic preprocessing; identifying, from the text extracted from the plurality of content items, a plurality of keywords and key phrases; constructing an interest graph from the plurality of keywords and key phrases, the interest graph having as nodes, wherein each node is a vector representation of a keyword or key phrase, and having edges connecting the nodes, wherein each of the edges represents a measure of similarity between two connected nodes, wherein the constructing of the interest graph comprises providing the plurality of keywords and key phrases as input to an embedding model that outputs vector representations in a common embedding space based on co-engagement patterns determined from user behavior, wherein the co-engagement patterns are identified by detecting when a user interacts with two of the plurality of content items including one of the keywords or key phrases, such that keywords and key phrases that are similar to one another based on the co-engagement patterns have vector representations that are closer in distance to one another; preparing a training dataset for a pairwise machine learning model by pairing each image and video content item with one or more relevant keywords or key phrases from the constructed interest graph; and training a pairwise machine learning model on the prepared training dataset, wherein the training involves minimizing a contrastive loss function that brings closer together embeddings of corresponding image and text pairs from the training dataset, while pushing apart embeddings of non-corresponding image and text pairs.

2. The computer-implemented method of claim 1, wherein the distinct preprocessing steps applied to content items from a first content delivery source comprise: identifying one or more hashtags in the content items from the first content delivery source; separating each hashtag into individual words; and adding the separated words from the first content delivery source to the text extracted from the plurality of content items.

3. The computer-implemented method of claim 1, wherein the distinct preprocessing steps applied to content items from a second content delivery source comprise: identifying a heading or title associated with a content item from the second content delivery source; extracting text from the identified heading or title; and adding the extracted text from the second content delivery source to the text extracted from the plurality of content items.

4. The computer-implemented method of claim 1, further comprising: for each content item from the plurality of content delivery sources that includes a video clip, processing the video clip using speech-to-text transcription to derive a text-based transcript; providing the derived text-based transcript as input to a pretrained machine learning model, wherein the pretrained machine learning model processes the input transcript and outputs one or more keywords and key phrases; and adding the one or more outputted keywords and key phrases to the text extracted from the plurality of content items.

5. The computer-implemented method of claim 1, wherein the embedding model further determines similarity between the keywords and key phrases based on at least one of: co-occurrence of the keywords and key phrases together within individual content items; or contextual usage of the keywords and key phrases together within a textual context.

6. The computer-implemented method of claim 1, further comprising: generating the edges between the nodes in the interest graph based on co-occurrence of keywords and key phrases within the plurality of content items, wherein a higher frequency of co-occurrence results in a higher edge weight between respective nodes.

7. The computer-implemented method of claim 1, wherein the obtaining of the plurality of content items comprises retrieving previously submitted content items from a database, wherein the previously submitted content items had been previously posted to the online platform by users of the online platform.

8. The computer-implemented method of claim 1, wherein: the pairwise machine learning model learns a joint embedding space that brings closer together embeddings of image and text content that correspond to each other.

9. A system comprising: one or more hardware processors; a memory storage device storing instructions thereon, which, when executed by the one or more hardware processors, cause the system to perform operations comprising: obtaining a plurality of content items from a plurality of content delivery sources of an online platform, each content delivery source having respective content items in a different type of content format; applying distinct preprocessing steps to the respective content items from each content delivery source to extract text from the plurality of content items, wherein the distinct preprocessing steps are specific to the different content format of each content delivery source and include separate preprocessing pipelines for processing different types of content formats, wherein the preprocessing steps comprise source-specific preprocessing tailored to each type of content format and generalized source-agnostic preprocessing; identifying, from the text extracted from the plurality of content items, a plurality of keywords and key phrases; constructing an interest graph from the plurality of keywords and key phrases, the interest graph having as nodes, wherein each node is a vector representation of a keyword or key phrase, and having edges connecting the nodes, wherein each of the edges represents a measure of similarity between two connected nodes, wherein the constructing of the interest graph comprises providing the plurality of keywords and key phrases as input to an embedding model that outputs vector representations in a common embedding space based on co-engagement patterns determined from user behavior, wherein the co-engagement patterns are identified by detecting when a user interacts with two of the plurality of content items including one of the keywords or key phrases, such that keywords and key phrases that are similar to one another based on the co-engagement patterns have vector representations that are closer in distance to one another; preparing a training dataset for a pairwise machine learning model by pairing each image and video content item with one or more relevant keywords or key phrases from the constructed interest graph; and training a pairwise machine learning model on the prepared training dataset, wherein the training involves minimizing a contrastive loss function that brings closer together embeddings of corresponding image and text pairs from the training dataset, while pushing apart embeddings of non-corresponding image and text pairs.

10. The system of claim 9, wherein the distinct preprocessing steps applied to content items from a first content delivery source comprise: identifyin
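
The training step recited in claim 1 (minimizing a contrastive loss over image and text pairs) can be illustrated with a small sketch. The claim does not specify the form of the loss, so the symmetric InfoNCE-style formulation, the `temperature` parameter, the function names, and the 2-dimensional toy embeddings below are all assumptions made for illustration; they are not taken from the patent.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed form, not from the patent).

    image_embs[i] and text_embs[i] are treated as a corresponding pair;
    every other combination in the batch is a non-corresponding pair.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the diagonal as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Hypothetical embeddings: each image is paired with an interest-graph phrase.
images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(images, matching_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
print(loss_aligned < loss_swapped)
```

The loss is small when each image embedding lies closest to its own text embedding and large when the pairs are swapped, so minimizing it pulls corresponding pairs together and pushes non-corresponding pairs apart, which is the behavior the claim describes.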

Assignees

Snap Inc

Classifications

  • Search customisation based on user profiles and personalisation · CPC title

  • Query formulation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12437010B2 cover?
Techniques for creating an interest graph include obtaining content items from multiple content sources and applying tailored (e.g., source-specific) preprocessing to the content items based on their respective content source. Text is extracted and salient keywords and key phrases are identified using unsupervised machine learning models. The keywords and key phrases become nodes in an interest…
Who is the assignee on this patent?
Snap Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/9535. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 7, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).