Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G06F16/24578. Mapped technology areas include Physics.

When was this patent published?

Publication date Nov 28, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Document body vectorization and noise-contrastive training

US11829374B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11829374-B2
Application number	US-202117207103-A
Country	US
Kind code	B2
Filing date	Mar 19, 2021
Priority date	Dec 4, 2020
Publication date	Nov 28, 2023
Grant date	Nov 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Document embedding vectors for each document of a corpus may be generated by combining embedding vectors for document subparts, thereby yielding a final embedding vector for the document. A machine learning model is trained using a query corpus and the document corpus, where the model generates a ranking score for a given (query, document) pair. During training, ranking scores are generated using the model, such that the training dataset is further refined using the generated ranking scores. For example, top documents and a negative document may be determined for a given query and subsequently used as training data. Multiple negative documents may therefore be determined for a given query. A negative document for a given query may be determined from the negative documents using noise-contrastive estimation. Such determined negative documents may be evaluated using a loss function during model training, thereby yielding a more robust model for search processing.
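As a rough illustration of the first two sentences of the abstract, the sketch below combines subpart vectors into a single document vector and scores (query, document) pairs. Mean pooling and the use of cosine similarity as the ranking score are assumptions: the abstract says only that subpart vectors are combined, and cosine similarity appears in the claims as part of the loss function. The names `combine_subparts`, `cosine`, and `rank` are hypothetical.

```python
import math


def combine_subparts(subpart_vectors):
    """Average per-subpart embedding vectors into one document vector.

    Mean pooling is an assumption; the abstract only says the vectors are combined.
    """
    if not subpart_vectors:
        raise ValueError("a document needs at least one subpart vector")
    dim = len(subpart_vectors[0])
    count = len(subpart_vectors)
    return [sum(vec[i] for vec in subpart_vectors) / count for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vector, doc_vectors):
    """Return (doc_id, score) pairs sorted from highest to lowest score."""
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus: each document is a list of 2-dimensional subpart vectors.
corpus = {
    "doc_a": [[1.0, 0.0], [0.8, 0.2]],
    "doc_b": [[0.0, 1.0], [0.2, 0.8]],
}
doc_vectors = {doc_id: combine_subparts(parts) for doc_id, parts in corpus.items()}
ranking = rank([1.0, 0.1], doc_vectors)
```

In this toy corpus the query points almost entirely along the first axis, so `doc_a` ranks ahead of `doc_b`. A trained model would produce the vectors; here they are hard-coded.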

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, causes the system to perform a set of operations, the set of operations comprising: training a machine learning model based on a training dataset comprising a search query corpus and the document corpus, wherein training the machine learning model comprises: generating, for each document of the document corpus, a document embedding vector based on a plurality of embedding vectors that are each associated with a document subpart of a given document; generating, using the machine learning model, a set of ranking scores for documents of the document corpus based on a first search query of the search query corpus and a document embedding vector for each of the documents; refining the training dataset based on the generated set of ranking scores; determining a first negative document from a set of negative documents for the first search query; evaluating a loss function using the first negative document to train the machine learning model, thereby yielding an updated machine learning model; and in a subsequent training iteration: further refining the training dataset based on the updated machine learning model to generate a second training dataset; and further training the updated machine learning model based on the second data set; obtaining a request comprising a second search query; generating, using the trained machine learning model, a set of documents from the document corpus that is responsive to the second search query; and providing, in response to the request, the set of documents that is responsive to the second search query.
2. The system of claim 1, wherein refining the training dataset comprises: retaining, for the first search query, a subset of documents of the document corpus in the training dataset based on the set of ranking scores; and determining a second negative document for the first search query from the document corpus, wherein the second negative document is part of the set of negative documents for the first search query.
3. The system of claim 2, wherein the second negative document is randomly determined.
4. The system of claim 1, wherein the first negative document is determined from the set of negative documents for the first search query using noise-contrastive estimation.
5. The system of claim 1, wherein the loss function evaluates a first cosine similarity between a query embedding vector for the first search query and a first document embedding vector for the first negative document.
6. The system of claim 5, wherein the loss function further evaluates a second cosine similarity between the query embedding vector and a second document embedding vector for a positive document associated with the first search query.
7. The system of claim 1, wherein generating the set of documents that is responsive to the second search query comprises: performing an approximate nearest neighbor search using a query embedding vector for the second search query and document embedding vectors for documents of the document corpus to generate the set of documents; and ranking the set of documents according to associated ranking scores.
8. A method for generating a set of documents responsive to a search query, comprising: obtaining a request comprising a search query; generating a query embedding vector for the search query; generating, based on the query embedding vector and document embedding vectors for documents of a document corpus, a set of documents responsive to the search query, wherein a document embedding vector of the document embedding vectors is generated based on a plurality of embedding vectors that are each associated with a document subpart of a given document; evaluating, using a loss function, a first cosine similarity between the query embedding vector for a first search query and a first document embedding vector for a first negative document and a second cosine similarity between the query embedding vector for a second search query and a second document embedding vector for a positive document associated with the first search query; generating, the set of documents responsive to the second search query; and ranking the set of documents according to associated ranking scores; and providing, in response to the request, the ranked set of documents that is responsive to the search query.
9. The method of claim 8, wherein generating the set of documents responsive to the search query comprises processing the query embedding vector and the document embedding vectors using an approximate nearest neighbor search.
10. The method of claim 8, wherein a document embedding vector for a document of the document corpus is a pre-generated document embedding vector based on a plurality of embedding vectors, wherein each embedding vector of the plurality of embedding vectors is associated with a subpart of the document.
11. The method of claim 8, wherein a document embedding vector for a document of the document corpus is associated with a body of the document.
12. The method of claim 8, wherein providing the ranked set of documents comprises providing a subpart of a document in the ranked set of documents.
13. A method for machine learning model-based search processing, comprising: training a machine learning model based on a training dataset comprising a search query corpus and the document corpus, wherein training the machine learning model comprises: generating, using the machine learning model, a set of ranking scores for documents of the document corpus based on a first search query of the search query corpus; refining the training dataset based on the generated set of ranking scores; determining a first negative document from a set of negative documents for the first search query; evaluating a loss function using the first negative document to train the machine learning model, thereby yielding an updated machine learning model; and in a subsequent training iteration: further refining the training dataset based on the updated machine learning model to generate a second training data set; and further training the updated machine learning model based on the second data set; obtaining a request comprising a second search query; generating, using the trained machine learning model, a set of documents from the document corpus that is responsive to the second search query; and providing, in response to the request, the set of documents that is responsive to the second search query.
14. The method of claim 13, further comprising: generating, for each document of the document corpus, a document embedding vector based on an embedding vector of a plurality of subparts of the given document.
15. The method of claim 13, wherein refining the training dataset comprises: retaining, for the first search query, a subset of documents of the document corpus in the training dataset based on the set of ranking scores; and determining a second negative document for the first search query from the document corpus, wherein the second negative document is part of the set of negative documents for the first search query.
16. The method of claim 13, wherein the first negative document is determined from the set of negative documents for the first search query using noise-contrastive estimation.
17. The method of claim 13, wherein the loss function evaluates a first cosine similarity between a query embedding vector for the first search query and a first document emb
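The training steps recited in claims 1 to 6 (score documents, refine the training set, pick a negative, evaluate a loss on cosine similarities) can be sketched loosely as follows. Several parts are assumptions and not taken from the patent text shown here: the softmax-style form of the loss, the `temperature` parameter, treating the top-scored non-positive documents as the retained subset, and the uniform draw in `pick_negative`, which is only a stand-in for the noise-contrastive selection the claims mention. All function names are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def contrastive_loss(query_vec, positive_vec, negative_vec, temperature=0.1):
    """Softmax-style loss over one positive and one negative cosine similarity.

    Equals -log(e^pos / (e^pos + e^neg)), rewritten as log(1 + e^(neg - pos)).
    This particular form is an assumption, not the patent's stated formula.
    """
    pos = cosine(query_vec, positive_vec) / temperature
    neg = cosine(query_vec, negative_vec) / temperature
    return math.log1p(math.exp(neg - pos))


def refine_negatives(query_vec, doc_vectors, positive_id, top_k, rng):
    """Keep the top_k highest-scoring non-positive docs, then add one random other doc."""
    candidates = [doc_id for doc_id in doc_vectors if doc_id != positive_id]
    ranked = sorted(
        candidates,
        key=lambda doc_id: cosine(query_vec, doc_vectors[doc_id]),
        reverse=True,
    )
    negatives = ranked[:top_k]
    remaining = ranked[top_k:]
    if remaining:
        negatives.append(rng.choice(remaining))  # randomly determined negative
    return negatives


def pick_negative(negatives, rng):
    """Uniform draw; a placeholder for noise-contrastive selection."""
    return rng.choice(negatives)


doc_vectors = {
    "pos": [1.0, 0.0],
    "near": [0.9, 0.3],
    "far": [0.0, 1.0],
    "mid": [0.5, 0.5],
}
query = [1.0, 0.05]
rng = random.Random(0)

negatives = refine_negatives(query, doc_vectors, "pos", top_k=1, rng=rng)
chosen = pick_negative(negatives, rng)
loss_hard = contrastive_loss(query, doc_vectors["pos"], doc_vectors["near"])
loss_easy = contrastive_loss(query, doc_vectors["pos"], doc_vectors["far"])
```

A negative that sits close to the query (`near`) produces a larger loss than a distant one (`far`), which is the usual motivation for refining the negative set between training iterations. No gradient step is shown; the sketch stops at evaluating the loss.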

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G06F16/24578 · Primary
using ranking · CPC title
G06N5/04
Inference or reasoning models · CPC title
G06N20/00
Machine learning · CPC title
G06F16/3347 · Primary
using vector based model · CPC title
G06F16/3326
using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages · CPC title

Patent family

Related publications grouped by family.

View patent family 81848129

External sources

Related patents

Method for resource sorting, method for training sorting model and corresponding apparatuses

Techniques for identifying color profiles for textual queries

Machine learning retraining

Visually Guided Machine-learning Language Model

Method of and system for generating a training set for a machine learning algorithm
