Handling queries in document systems using segment differential based document text-index modelling

US11157477B2 · US · B2

Patent metadata
Field · Value
Publication number · US-11157477-B2
Application number · US-201816202215-A
Country · US
Kind code · B2
Filing date · Nov 28, 2018
Priority date · Nov 28, 2018
Publication date · Oct 26, 2021
Grant date · Oct 26, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, computer system, and computer program product for segment differential-based document text-index modeling are provided. The embodiment may include receiving, by a processor, a document with a valid document ID and version ID tuple. The embodiment may also include determining the received document is a new version of a previously stored document and consequently multiplexing versions of the document into a single indexed document. The embodiment may further include segmenting the received document and building a token vector. The embodiment may also include calculating a difference between the received new version of the document and the previously stored document using information obtained from the segmentation. The embodiment may further include in response to the calculated difference being below a pre-configured threshold value, discarding the received new version.
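The ingest flow in the abstract (segment, build a token vector, measure the difference, discard small changes) can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything concrete here is an assumption made for illustration: one segment per non-empty line, whitespace tokenization, a difference score defined as the fraction of changed tokens accumulated over changed segments, and the hypothetical names `version_difference` and `ingest`. The abstract does not say what happens when the difference equals the threshold; this sketch keeps the version in that case.

```python
from difflib import SequenceMatcher


def token_vector(text):
    """Segment the text (assumption: one segment per non-empty line) and
    return one tuple of lowercased whitespace tokens per segment."""
    return [tuple(line.lower().split()) for line in text.splitlines() if line.strip()]


def version_difference(old_text, new_text):
    """Return a score in [0, 1]: 0.0 for identical token vectors, 1.0 when
    no tokens are shared. Changed tokens are accumulated segment by segment."""
    old_vec, new_vec = token_vector(old_text), token_vector(new_text)
    total = sum(len(s) for s in old_vec) + sum(len(s) for s in new_vec)
    if total == 0:
        return 0.0
    changed = 0
    # Align whole segments first, so unchanged segments cost nothing.
    seg_matcher = SequenceMatcher(None, old_vec, new_vec, autojunk=False)
    for tag, i1, i2, j1, j2 in seg_matcher.get_opcodes():
        if tag == "equal":
            continue
        old_tokens = [t for seg in old_vec[i1:i2] for t in seg]
        new_tokens = [t for seg in new_vec[j1:j2] for t in seg]
        tok_matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
        matched = sum(block.size for block in tok_matcher.get_matching_blocks())
        # Tokens present on either side that have no counterpart on the other.
        changed += len(old_tokens) + len(new_tokens) - 2 * matched
    return changed / total


def ingest(store, doc_id, version_id, text, threshold=0.2):
    """Store a version unless it differs too little from the latest stored
    version. Returns True if stored, False if discarded."""
    versions = store.setdefault(doc_id, [])
    if versions:
        _, latest_text = versions[-1]
        if version_difference(latest_text, text) < threshold:
            return False  # below threshold: discard the new version
    versions.append((version_id, text))
    return True


if __name__ == "__main__":
    store = {}
    v1 = "the quick brown fox\njumps over the lazy dog"
    v2 = "the quick brown fox\njumps over the lazy cat"
    print(ingest(store, "doc-1", 1, v1))
    print(ingest(store, "doc-1", 2, v2))
```

The two-level alignment (segments first, tokens only inside changed segments) is one way to use "information obtained from the segmentation"; a production system would likely use a real analyzer and a tuned threshold.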

First claim

Opening claim text (preview).

What is claimed is:

1. A processor-implemented method for storing and processing a query in a document corpus utilizing a segment level differential document text-index model, the method comprising: receiving, by a processor, a document with a valid document ID and version ID tuple; determining the received document is a new version of a previously stored document and, consequently, multiplexing multiple versions of the document into a single indexed document; segmenting the received document and building a token vector; calculating a difference between the received new version of the document and the previously stored document using information obtained from the segmentation; in response to the calculated difference being below a pre-configured threshold value, discarding the received new version; in response to the calculated difference exceeding the pre-configured threshold value, updating the token vector and an index with a token stream of the received new version of the document; executing a query with terms to receive hits across all versions of the document, wherein posting lists for a particular term and payloads of each position from the all versions of the documents are loaded, wherein token positions that are shared among the all versions of the document are analyzed to search for the particular term; and displaying a result set to a user indicating that the particular term was found among the all versions of the document.

2. The method of claim 1, wherein the difference is calculated based on cumulative differentials between segments of the current version of the document and the previous version of the document.

3. The method of claim 1, wherein the difference is calculated based on semantic differences between segments of the current version and the previous version of the document.

4. The method of claim 1, wherein the difference is calculated based on counting a total number of words in each version of the document.

5. The method of claim 1, wherein the difference is measured based on a comparison of each token vector of each document.

6. The method of claim 1, further comprising: generating a version-specific payload for each token which indicates when a new term is detected in the new version.

7. The method of claim 1, further comprising: modifying a term string when adding the token stream to an index by placing special characters or marks at the end of a token to indicate the term string is valid after a preconfigured number of document versions.

8. The method of claim 1, further comprising: storing token vectors as separate data when a difference between two token streams exceeds a pre-configured threshold value.

9. The method of claim 1, further comprising: loading a posting list of a term and a term with special marks or characters when a user searches for a term in specified versions of a document; loading a payload of each term indicating each position and version; and using version information to filter a requisite query result.

10. A computer system for avoiding a high object version explosion in processing a query in a document utilizing a segment differential-based document text-index modeling, the computer system comprising: one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage media, and program instructions stored on at least one of the one or more tangible storage media for execution by at least one of the one or more processors via at least one of the one or more memories, wherein the computer system is capable of performing a method comprising: receiving, by a processor, a document with a valid document ID and version ID tuple; determining the received document is a new version of a previously stored document and, consequently, multiplexing multiple versions of the document into a single indexed document; segmenting the received document and building a token vector; calculating a difference between the received new version of the document and the previously stored document using information obtained from the segmentation; in response to the calculated difference being below a pre-configured threshold value, discarding the received new version; in response to the calculated difference exceeding the pre-configured threshold value, updating the token vector and an index with a token stream of the received new version of the document; executing a query with terms to receive hits across all versions of the document, wherein posting lists for a particular term and payloads of each position from the all versions of the documents are loaded, wherein token positions that are shared among the all versions of the document are analyzed to search for the particular term; and displaying a result set to a user indicating that the particular term was found among the all versions of the document.

11. The computer system of claim 10, wherein the difference is calculated based on cumulative differentials between segments of the current version of the document and the previous version of the document.

12. The computer system of claim 10, wherein the difference is calculated based on semantic differences between segments of the current version and the previous version of the document.

13. The computer system of claim 10, wherein the difference is calculated based on counting a total number of words in each version of the document.

14. The computer system of claim 10, wherein the difference is measured based on a comparison of each token vector of each document.

15. The computer system of claim 10, further comprising: generating a version-specific payload for each token which indicates when a new term is detected in the new version.

16. The computer system of claim 10, further comprising: modifying a term string when adding the token stream to an index by placing special characters or marks at the end of a token to indicate the term string is valid after a preconfigured number of document versions.

17. The computer system of claim 10, further comprising: storing token vectors as separate data when a difference between two token streams exceeds a pre-configured threshold value.

18. The computer system of claim 10, further comprising: loading a posting list of a term and a term with special marks or characters when a user searches for a term in specified versions of a document; loading a payload of each term indicating each position and version; and using version information to filter a requisite query result.

19. A computer program product for avoiding a high object version explosion in processing a query in a document utilizing a segment differential-based document text-index modeling, the computer program product comprising: one or more computer-readable tangible storage media and program instructions stored on at least one of the one or more tangible storage media, the program instructions executable by a processor of a computer to perform a method, the method comprising: receiving, by a processor, a document with a valid document ID and version ID tuple; determining the received document is a new version of a previously stored document and, consequently, multiplexing multiple versions of the document into a single indexed document; segmenting the received document and building a token vector; calculating a difference between the received new version of the document and the previously stored document using information obtained from the segmentation; in response to the calculated difference being below
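The query side of the claims (multiplexing versions into a single indexed document, loading posting lists with per-position payloads, and filtering by version) can be sketched in Python as follows. This is an illustrative toy, not the claimed system. The class name `MultiVersionIndex`, its methods, and the payload layout (a set of version IDs per token position) are hypothetical. It also assumes a token position is "shared" when the same term sits at the same offset in several versions, which ignores the shifting caused by insertions that a real segment-aligned index would have to handle.

```python
from collections import defaultdict


class MultiVersionIndex:
    """One indexed document holding every version.

    postings[term][position] is the payload for that position: the set of
    version IDs in which `term` occurs at `position`.
    """

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(set))
        self.versions = set()

    def add_version(self, version_id, tokens):
        """Merge one version's token stream into the shared postings."""
        self.versions.add(version_id)
        for position, term in enumerate(tokens):
            # A position shared by several versions is stored once and
            # simply gains another version ID in its payload.
            self.postings[term][position].add(version_id)

    def search(self, term, versions=None):
        """Return (position, [version IDs]) hits for `term`.

        With versions=None, hits across all versions are returned; otherwise
        the payloads are used to filter down to the requested versions.
        """
        wanted = self.versions if versions is None else set(versions)
        hits = []
        for position, payload in sorted(self.postings.get(term, {}).items()):
            matched = payload & wanted
            if matched:
                hits.append((position, sorted(matched)))
        return hits


if __name__ == "__main__":
    index = MultiVersionIndex()
    index.add_version(1, "the lazy dog sleeps".split())
    index.add_version(2, "the lazy cat sleeps".split())
    print(index.search("lazy"))
    print(index.search("dog", versions=[2]))
```

Storing the version set as a payload on each position means unchanged tokens are indexed once regardless of how many versions contain them, which is the storage saving the claims attribute to multiplexing.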

Assignees

Inventors

Classifications

  • G06F16/313 · Primary

    Selection or weighting of terms for indexing · CPC title

  • Managing data history or versioning (querying versioned data G06F16/2474; querying temporal data G06F16/2477) · CPC title

  • Management thereof · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11157477B2 cover?
A method, computer system, and computer program product for segment differential-based document text-index modeling are provided. The embodiment may include receiving, by a processor, a document with a valid document ID and version ID tuple. The embodiment may also include determining the received document is a new version of a previously stored document and consequently multiplexing versions o…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/313. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 26, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).