Systems and methods for classifying token sequence embeddings

US2025265414A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2025265414-A1
Application number: US-202418581556-A
Country: US
Kind code: A1
Filing date: Feb 20, 2024
Priority date: Feb 20, 2024
Publication date: Aug 21, 2025
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for classifying token sequences using clustering. In some aspects, the system receives a first set of input sequences and generates a corresponding first plurality of embeddings using a first machine learning model. The system generates a plurality of clusters, each containing embeddings, using a clustering model. For each cluster in the plurality of clusters, the system generates a set of common components in the cluster. The system generates a set of universal components based on the plurality of clusters. For each cluster in the plurality of clusters, the system generates a set of unique components by removing the set of universal components from its associated set of common components. The system classifies a second plurality of embeddings using the sets of unique components.
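
The common, universal, and unique component steps in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation. The function names, the rounding used to make real-valued segments comparable, the choice to tag each segment with its start position, and the `top_k` and `max_len` parameters are all assumptions made for this example.

```python
from collections import Counter


def segments(embedding, max_len=2, ndigits=1):
    """Yield every contiguous segment of an embedding as (start, values).

    Assumption: values are rounded so that near-equal segments match.
    """
    rounded = [round(x, ndigits) for x in embedding]
    for length in range(1, max_len + 1):
        for start in range(len(rounded) - length + 1):
            yield (start, tuple(rounded[start:start + length]))


def common_components(cluster, top_k):
    """Rank segments by how many embeddings contain them; keep the top_k."""
    counts = Counter()
    for embedding in cluster:
        counts.update(set(segments(embedding)))  # count once per embedding
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {segment for segment, _ in ranked[:top_k]}


def universal_components(common_sets):
    """Components that appear in a majority of the clusters' common sets."""
    counts = Counter()
    for common in common_sets:
        counts.update(common)
    return {seg for seg, n in counts.items() if n > len(common_sets) / 2}


def unique_components(clusters, top_k=3):
    """Remove the universal components from each cluster's common set."""
    common_sets = [common_components(c, top_k) for c in clusters]
    universal = universal_components(common_sets)
    return [common - universal for common in common_sets], universal


# Hypothetical toy data: three clusters of 3-dimensional embeddings.
clusters = [
    [[0.5, 0.1, 0.9], [0.5, 0.1, 0.8]],
    [[0.5, 0.7, 0.2], [0.5, 0.7, 0.3]],
    [[0.5, 0.4, 0.6], [0.5, 0.4, 0.6]],
]
unique, universal = unique_components(clusters)
```

In this toy data every embedding starts with 0.5, so that segment ends up universal and is stripped from each cluster's set, leaving only the segments that distinguish one cluster from the others.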

First claim

Opening claim text (preview).

What is claimed is:

1. A system for generating explainability for text embeddings when processing user communications in real-time to generate real-time responses, comprising:
    one or more processors; and
    one or more non-transitory, computer-readable media comprising instructions that, when executed by the one or more processors, cause operations comprising:
    receiving a first user communication, wherein the first user communication comprises a request with a temporal element;
    detecting first data and second data in the first user communication, wherein the first data comprises background information relating to the first user communication, and wherein the second data is a query requiring a first response within the temporal element;
    using a first machine learning model, generating a first plurality of embeddings corresponding to a first set of input sequences in the first data, wherein the first machine learning model translates sequences of text tokens into embeddings, wherein embeddings are vectors of real values;
    using a clustering model, generating a plurality of clusters, each cluster comprising one or more embeddings in the first plurality of embeddings;
    for each cluster in the plurality of clusters, generating a set of common components in the cluster, wherein each common component in the set of common components comprises one or more real value segments that constitute embeddings in the cluster;
    generating a set of universal components based on the plurality of clusters, wherein each universal component in the set of universal components comprises one or more real value segments and wherein each universal component in the set of universal components occurs in a majority of the plurality of clusters;
    generating, for each cluster in the plurality of clusters, a set of unique components by modifying its associated set of common components, comprising removing the set of universal components from its associated set of common components;
    using the sets of unique components, classifying a second plurality of embeddings in the second data; and
    generating for display, in a user interface, the first response to the first user communication based on classifying the second plurality of embeddings, wherein the first response corresponds to the temporal element.

2. A method for generating explainability for text embeddings, comprising:
    receiving a first set of input sequences;
    using a first machine learning model, generating a first plurality of embeddings corresponding to the first set of input sequences, wherein the first machine learning model translates sequences of text tokens into embeddings, wherein embeddings are vectors of real values;
    using a clustering model, generating a plurality of clusters, each cluster comprising one or more embeddings in the first plurality of embeddings;
    for each cluster in the plurality of clusters, generating a set of common components in the cluster, wherein each common component in the set of common components comprises one or more real value segments that constitute embeddings in the cluster;
    generating a set of universal components based on the plurality of clusters, wherein each universal component in the set of universal components comprises one or more real value segments and wherein each universal component in the set of universal components occurs in a majority of the plurality of clusters;
    generating, for each cluster in the plurality of clusters, a set of unique components by modifying its associated set of common components, comprising removing the set of universal components from its associated set of common components; and
    using the sets of unique components, classifying a second plurality of embeddings.

3. The method of claim 2, wherein first machine learning model is a bidirectional encoder transformer representations model trained to produce textual predictions.

4. The method of claim 2, wherein generating the set of common components for a cluster in the plurality of clusters comprises:
    identifying a set of potential components based on the cluster, wherein the set of potential components comprises real value vectors of varying lengths, and wherein each potential component in the set of potential components is found in one or more embeddings in the cluster;
    ranking the set of potential components based on frequency of occurrence to generate a component ranking; and
    selecting the set of common components from the component ranking to be a fixed number of components.

5. The method of claim 2, wherein generating a set of universal components based on the plurality of clusters comprises:
    generating a full set of components, wherein the full set of components comprises all real-valued vectors of any length from the plurality of clusters;
    ranking the full set of components based on frequency of occurrence to generate a full component ranking; and
    selecting the set of universal components from the full component ranking to be a portion of components.

6. The method of claim 2, wherein generating the plurality of clusters comprises:
    training the clustering model to sort vectors of real values into clusters based on distances between any two vectors of real values; and
    using the clustering model, generating the plurality of clusters based on the first plurality of embeddings.

7. The method of claim 6, wherein the clustering model:
    selects a number of initial cluster centroids;
    generates a set of initial clusters by assigning each embedding to an initial cluster centroid that minimizes a distance between the embedding and the initial cluster centroid;
    calculates a number of final cluster centroids by, for each cluster in the set of initial clusters, selecting an embedding with minimal average distance to other embeddings in the cluster; and
    generates a set of final clusters by assigning each embedding to a final cluster centroid that minimizes the distance between the embedding and the final cluster centroid.

8. The method of claim 2, wherein generating a set of unique components for a cluster in the plurality of clusters comprises:
    for each embedding in the cluster, generating a similarity score, wherein the similarity score indicates a degree of similarity between the embedding and a closest embedding in the set of universal components; and
    removing all embeddings with similarity scores above a threshold from the cluster.

9. The method of claim 2, wherein each set of unique components comprises an archetype embedding, wherein the archetype embedding is a vector of real values most representative of the embeddings in the cluster corresponding to the set of unique components.

10. The method of claim 9, wherein classifying the second plurality of embeddings using the sets of unique components comprises:
    for each embedding in the second plurality of embeddings, generating a distance metric from the embedding to each archetype embedding; and
    based on the archetype embedding with a shortest distance metric, assigning classifications to the second plurality of embeddings.

11. The method of claim 2, further comprising:
    generating, for each cluster in the plurality of clusters, a component collection for the cluster, wherein the component collection comprises all subsections of real values found in embeddings in the cluster; and
    for each component collection, removing the set of universal components from the component collection to generate the set of unique components corresponding to the cluster.

12. The method of claim 2, wherein generating the first plurality of embeddings corresponding to the first set of input sequences comprises: extracting a
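
The clustering steps recited in claim 7 and the nearest-archetype classification in claim 10 can be sketched as follows. This is an illustrative reading only. Euclidean distance, caller-supplied initial centroids, all function names, and the reuse of each cluster's final centroid as its archetype embedding are assumptions that the claim text does not specify.

```python
import math


def nearest_index(embedding, centers):
    """Index of the center closest to the embedding (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: math.dist(embedding, centers[i]))


def assign(embeddings, centers):
    """Group embeddings by their nearest center."""
    clusters = [[] for _ in centers]
    for embedding in embeddings:
        clusters[nearest_index(embedding, centers)].append(embedding)
    return clusters


def medoid(cluster):
    """Member with the smallest total (hence average) distance to the rest."""
    return min(cluster,
               key=lambda c: sum(math.dist(c, other) for other in cluster))


def cluster_embeddings(embeddings, initial_centroids):
    """Single pass: initial assignment, medoid update, final assignment."""
    initial_clusters = assign(embeddings, initial_centroids)
    final_centroids = [medoid(c) for c in initial_clusters if c]
    return final_centroids, assign(embeddings, final_centroids)


def classify(embeddings, archetypes):
    """Label each embedding with the index of its nearest archetype."""
    return [nearest_index(e, archetypes) for e in embeddings]


# Hypothetical 2-dimensional embeddings forming two well-separated groups.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
final_centroids, final_clusters = cluster_embeddings(
    train, initial_centroids=[train[1], train[4]])

# Assumption: the final centroids double as archetype embeddings.
labels = classify([(0.5, 0.5), (9.0, 9.5)], final_centroids)
```

Because each final centroid is an actual member of its cluster, the result can be traced back to a concrete input sequence, which fits the explainability goal stated in the claims.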

Assignees

  • Capital One Services Llc

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Combinations of networks · CPC title

  • G06F40/284 · Primary

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025265414A1 cover?
Systems and methods for classifying token sequences using clustering. In some aspects, the system receives a first set of input sequences and generates a corresponding first plurality of embeddings using a first machine learning model. The system generates a plurality of clusters, each containing embeddings, using a clustering model. For each cluster in the plurality of clusters, the system gene…
Who is the assignee on this patent?
Capital One Services Llc
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 21, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).