Document body vectorization and noise-contrastive training

US11829374B2 · US · B2

Patent metadata
Field | Value
Publication number | US-11829374-B2
Application number | US-202117207103-A
Country | US
Kind code | B2
Filing date | Mar 19, 2021
Priority date | Dec 4, 2020
Publication date | Nov 28, 2023
Grant date | Nov 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Document embedding vectors for each document of a corpus may be generated by combining embedding vectors for document subparts, thereby yielding a final embedding vector for the document. A machine learning model is trained using a query corpus and the document corpus, where the model generates a ranking score for a given (query, document) pair. During training, ranking scores are generated using the model, such that the training dataset is further refined using the generated ranking scores. For example, top documents and a negative document may be determined for a given query and subsequently used as training data. Multiple negative documents may therefore be determined for a given query. A negative document for a given query may be determined from the negative documents using noise-contrastive estimation. Such determined negative documents may be evaluated using a loss function during model training, thereby yielding a more robust model for search processing.
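The first two sentences of the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: subpart vectors are combined by simple averaging, and the ranking score for a (query, document) pair is a cosine similarity. The function names `combine_subparts`, `cosine`, and `rank_documents` and the toy vectors are hypothetical and are not taken from the patent.

```python
import math

def combine_subparts(subpart_vectors):
    """Combine subpart embedding vectors into one document vector.

    Assumption: mean pooling. The abstract only says the vectors are combined.
    """
    if not subpart_vectors:
        raise ValueError("need at least one subpart vector")
    dim = len(subpart_vectors[0])
    count = len(subpart_vectors)
    return [sum(v[i] for v in subpart_vectors) / count for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending score.

    Assumption: the ranking score is the cosine similarity itself.
    """
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy corpus: each document has two subpart vectors (hypothetical values).
docs = {
    "doc_a": combine_subparts([[1.0, 0.0], [0.8, 0.2]]),
    "doc_b": combine_subparts([[0.0, 1.0], [0.2, 0.8]]),
}
ranking = rank_documents([1.0, 0.1], docs)
```

In this toy corpus the query vector points mostly along the first axis, so `doc_a` ranks ahead of `doc_b`. A trained model would replace the fixed vectors, and other pooling choices (weighted sums, max pooling) would fit the same outline.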

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, causes the system to perform a set of operations, the set of operations comprising: training a machine learning model based on a training dataset comprising a search query corpus and the document corpus, wherein training the machine learning model comprises: generating, for each document of the document corpus, a document embedding vector based on a plurality of embedding vectors that are each associated with a document subpart of a given document; generating, using the machine learning model, a set of ranking scores for documents of the document corpus based on a first search query of the search query corpus and a document embedding vector for each of the documents; refining the training dataset based on the generated set of ranking scores; determining a first negative document from a set of negative documents for the first search query; evaluating a loss function using the first negative document to train the machine learning model, thereby yielding an updated machine learning model; and in a subsequent training iteration: further refining the training dataset based on the updated machine learning model to generate a second training dataset; and further training the updated machine learning model based on the second data set; obtaining a request comprising a second search query; generating, using the trained machine learning model, a set of documents from the document corpus that is responsive to the second search query; and providing, in response to the request, the set of documents that is responsive to the second search query.

2. The system of claim 1, wherein refining the training dataset comprises: retaining, for the first search query, a subset of documents of the document corpus in the training dataset based on the set of ranking scores; and determining a second negative document for the first search query from the document corpus, wherein the second negative document is part of the set of negative documents for the first search query.

3. The system of claim 2, wherein the second negative document is randomly determined.

4. The system of claim 1, wherein the first negative document is determined from the set of negative documents for the first search query using noise-contrastive estimation.

5. The system of claim 1, wherein the loss function evaluates a first cosine similarity between a query embedding vector for the first search query and a first document embedding vector for the first negative document.

6. The system of claim 5, wherein the loss function further evaluates a second cosine similarity between the query embedding vector and a second document embedding vector for a positive document associated with the first search query.

7. The system of claim 1, wherein generating the set of documents that is responsive to the second search query comprises: performing an approximate nearest neighbor search using a query embedding vector for the second search query and document embedding vectors for documents of the document corpus to generate the set of documents; and ranking the set of documents according to associated ranking scores.

8. A method for generating a set of documents responsive to a search query, comprising: obtaining a request comprising a search query; generating a query embedding vector for the search query; generating, based on the query embedding vector and document embedding vectors for documents of a document corpus, a set of documents responsive to the search query, wherein a document embedding vector of the document embedding vectors is generated based on a plurality of embedding vectors that are each associated with a document subpart of a given document; evaluating, using a loss function, a first cosine similarity between the query embedding vector for a first search query and a first document embedding vector for a first negative document and a second cosine similarity between the query embedding vector for a second search query and a second document embedding vector for a positive document associated with the first search query; generating, the set of documents responsive to the second search query; and ranking the set of documents according to associated ranking scores; and providing, in response to the request, the ranked set of documents that is responsive to the search query.

9. The method of claim 8, wherein generating the set of documents responsive to the search query comprises processing the query embedding vector and the document embedding vectors using an approximate nearest neighbor search.

10. The method of claim 8, wherein a document embedding vector for a document of the document corpus is a pre-generated document embedding vector based on a plurality of embedding vectors, wherein each embedding vector of the plurality of embedding vectors is associated with a subpart of the document.

11. The method of claim 8, wherein a document embedding vector for a document of the document corpus is associated with a body of the document.

12. The method of claim 8, wherein providing the ranked set of documents comprises providing a subpart of a document in the ranked set of documents.

13. A method for machine learning model-based search processing, comprising: training a machine learning model based on a training dataset comprising a search query corpus and the document corpus, wherein training the machine learning model comprises: generating, using the machine learning model, a set of ranking scores for documents of the document corpus based on a first search query of the search query corpus; refining the training dataset based on the generated set of ranking scores; determining a first negative document from a set of negative documents for the first search query; evaluating a loss function using the first negative document to train the machine learning model, thereby yielding an updated machine learning model; and in a subsequent training iteration: further refining the training dataset based on the updated machine learning model to generate a second training data set; and further training the updated machine learning model based on the second data set; obtaining a request comprising a second search query; generating, using the trained machine learning model, a set of documents from the document corpus that is responsive to the second search query; and providing, in response to the request, the set of documents that is responsive to the second search query.

14. The method of claim 13, further comprising: generating, for each document of the document corpus, a document embedding vector based on an embedding vector of a plurality of subparts of the given document.

15. The method of claim 13, wherein refining the training dataset comprises: retaining, for the first search query, a subset of documents of the document corpus in the training dataset based on the set of ranking scores; and determining a second negative document for the first search query from the document corpus, wherein the second negative document is part of the set of negative documents for the first search query.

16. The method of claim 13, wherein the first negative document is determined from the set of negative documents for the first search query using noise-contrastive estimation.

17. The method of claim 13, wherein the loss function evaluates a first cosine similarity between a query embedding vector for the first search query and a first document emb
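The training-side claims (retaining top-scored documents, adding a randomly determined negative, choosing one negative from the set, and evaluating a cosine-similarity loss) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: retained top-scored non-positive documents are treated as candidate negatives, the negative is drawn by softmax-weighted sampling as a simple stand-in for the noise-contrastive estimation step, and the loss is a two-way softmax cross-entropy over temperature-scaled cosine similarities. All function names, the `temperature` parameter, and the toy vectors are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def refine_negatives(query_vec, doc_vecs, positive_id, top_k, rng):
    """Keep the top_k highest-scoring non-positive docs plus one random other doc.

    Assumption: retained top-scored documents serve as candidate negatives.
    """
    scored = sorted(
        ((cosine(query_vec, vec), doc_id)
         for doc_id, vec in doc_vecs.items() if doc_id != positive_id),
        reverse=True,
    )
    retained = [doc_id for _, doc_id in scored[:top_k]]
    rest = [doc_id for _, doc_id in scored[top_k:]]
    if rest:
        retained.append(rng.choice(rest))  # randomly determined negative
    return retained

def sample_negative(query_vec, doc_vecs, negative_ids, rng, temperature=0.1):
    """Draw one negative with probability proportional to exp(score / temperature).

    Assumption: a stand-in for the noise-contrastive selection named in the claims.
    """
    weights = [math.exp(cosine(query_vec, doc_vecs[d]) / temperature)
               for d in negative_ids]
    return rng.choices(negative_ids, weights=weights, k=1)[0]

def contrastive_loss(query_vec, pos_vec, neg_vec, temperature=0.1):
    """Negative log-softmax of the positive similarity against one negative."""
    s_pos = cosine(query_vec, pos_vec) / temperature
    s_neg = cosine(query_vec, neg_vec) / temperature
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos

# Toy data (hypothetical): one positive and three other documents.
rng = random.Random(0)
query = [1.0, 0.0]
docs = {"pos": [0.9, 0.1], "hard": [0.7, 0.7], "easy": [0.0, 1.0], "far": [-1.0, 0.0]}
negatives = refine_negatives(query, docs, "pos", top_k=1, rng=rng)
chosen = sample_negative(query, docs, negatives, rng)
loss = contrastive_loss(query, docs["pos"], docs[chosen])
```

A negative that sits close to the query (`hard`) produces a larger loss than a distant one (`far`), which is why refining the negative set with the model's own ranking scores can give the next training iteration a more informative signal.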

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • using ranking

  • Inference or reasoning models

  • Machine learning

  • using vector based model

  • using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11829374B2 cover?
Document embedding vectors for each document of a corpus may be generated by combining embedding vectors for document subparts, thereby yielding a final embedding vector for the document. A machine learning model is trained using a query corpus and the document corpus, where the model generates a ranking score for a given (query, document) pair. During training, ranking scores are generated us…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/24578. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 28, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).