Term selection from a document to find similar content

US2016140231A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2016140231-A1
Application number  US-201414546340-A
Country             US
Kind code           A1
Filing date         Nov 18, 2014
Priority date       Nov 18, 2014
Publication date    May 19, 2016
Grant date          (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, devices, and systems are described for creating and implementing search query vectors for knowledge base articles or other formal articles, the query vectors automatically created from informal correspondence such as a service request email to an information technology (IT) department. Term frequency-inverse document frequency (TF-IDF) scores are calculated for rare words in the correspondence with respect to a corpus of other service requests. High scoring terms with the same neighbors as those in the corpus of formal articles are added to the search query vector, while high scoring terms that do not share the same neighbors are thrown out. The query vector is then used to run a search of the knowledge base for relevant articles.
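The scoring step in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the function names (`tokenize`, `tfidf_scores`, `top_quartile`), the plain `tf * log(N / df)` weighting, the lack of stemming, and the sample texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic words (no stemming; a stand-in for root-word extraction)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(document, corpus):
    """Score each word of `document` against `corpus`, a list of other documents.

    Assumed weighting: (count / document length) * log(N / df), where the
    document itself is counted in both N and df so that df is never zero.
    """
    words = tokenize(document)
    counts = Counter(words)
    corpus_sets = [set(tokenize(other)) for other in corpus]
    n_docs = len(corpus) + 1
    scores = {}
    for word, count in counts.items():
        df = 1 + sum(1 for words_in_doc in corpus_sets if word in words_in_doc)
        scores[word] = (count / len(words)) * math.log(n_docs / df)
    return scores


def top_quartile(scores):
    """Return the highest-scoring quarter of the words (at least one), ties broken alphabetically."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[: max(1, len(ranked) // 4)]


# Hypothetical service requests, invented for this example.
other_requests = [
    "please reset my password thanks",
    "my laptop is slow please help",
    "thanks for the help with my password",
]
request = "hello my vpn client drops the vpn tunnel please help"
scores = tfidf_scores(request, other_requests)
selected = top_quartile(scores)
```

In this toy data a greeting such as "hello" scores as highly as "tunnel", because it is equally rare among the other requests. That is the kind of high-scoring but non-technical term the abstract says is thrown out later by comparing neighbors against the formal articles.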

First claim

Opening claim text (preview).

What is claimed is:

1. A method for searching using term selection from a document to find similar content, the method comprising: providing formally written articles; selecting one or more tokens in each article by: identifying candidate root words; calculating, using a processor operatively coupled with a memory, a term frequency-inverse document frequency (TF-IDF) score for each of the candidate root words; and selecting the candidate root words as tokens based on the TF-IDF scores; cataloging neighboring tokens for each selected token into a data structure for each article, where neighboring tokens include tokens that are within a threshold number of words to the selected token in an article; merging the data structures for the articles into a merged data structure; providing a written correspondence; selecting one or more tokens in the correspondence by: identifying candidate root words from the correspondence; computing a TF-IDF score for each of the candidate root words in the correspondence with respect to a corpus of other correspondence; and selecting the candidate root words as tokens based on the TF-IDF scores; ascertaining neighboring tokens for each selected token in the correspondence; finding a match between a token in the correspondence and in the merged data structure; for the matched token, counting how many neighboring tokens in the merged data structure are also neighboring tokens in the correspondence; and adding the matched token to a query vector based on the counting; and performing a search of the formally written articles using the query vector.

2. The method of claim 1 wherein the matched token is added to the query vector based on having a minimum threshold number of neighboring tokens in the merged data structure also being neighboring tokens in the correspondence, thereby excluding from the query vector high scoring terms in the correspondence that are specific to correspondence but not correlated among substantive, technical terms in formal written articles.

3. The method of claim 1 further comprising: inserting a neighboring token from the merged data structure that is not a token in the correspondence, thereby expanding terms in the query vector beyond those that are in the correspondence.

4. The method of claim 1 further comprising: returning search results based on the search.

5. The method of claim 1 further comprising: building a data structure for the neighboring tokens in the correspondence, wherein the data structure for the neighboring tokens in the correspondence is of a same data type as the merged data structure.

6. The method of claim 1 further comprising: tracking a minimum number of words between two tokens as a weight; and merging the data structures using the minimum number of words.

7. The method of claim 1 further comprising: retaining a minimum number of words between two tokens when merging as a weight.

8. The method of claim 1 wherein the selecting of tokens in each article, cataloging, and merging are performed before the written correspondence is provided.

9. The method of claim 1 further comprising: calculating a logarithm of how many neighboring tokens in the data structure are also neighboring tokens in the correspondence; and adding the matched token to the query vector only if the logarithm is above a threshold value.

10. The method of claim 1 wherein the neighboring tokens include tokens that are within 50 to 100 words of the selected token in an article.

11. The method of claim 1 wherein the candidate root words are selected as tokens if they are above a transition point.

12. The method of claim 1 wherein the candidate root words are selected as tokens if they are in a fourth quartile of scores.

13. The method of claim 1 wherein the data structure includes an inverted index.

14. The method of claim 1 wherein the correspondence includes an informal email.

15. The method of claim 14 wherein the correspondence includes a service request for technical assistance.

16. The method of claim 1 wherein the formally written articles include a knowledge base article.

17. A machine-readable non-transitory medium embodying information indicative of instructions for causing one or more machines to perform operations for searching using term selection from a document to find similar content, the operations comprising: providing formally written articles; selecting one or more tokens in each article by: identifying candidate root words; calculating a term frequency-inverse document frequency (TF-IDF) score for each of the candidate root words; and selecting the candidate root words as tokens based on the TF-IDF scores; cataloging neighboring tokens for each selected token into a data structure for each article, where neighboring tokens include tokens that are within a threshold number of words to the selected token in an article; merging the data structures for the articles into a merged data structure; providing a written correspondence; selecting one or more tokens in the correspondence by: identifying candidate root words from the correspondence; computing a TF-IDF score for each of the candidate root words in the correspondence with respect to a corpus of other correspondence; and selecting the candidate root words as tokens based on the TF-IDF scores; ascertaining neighboring tokens for each selected token in the correspondence; finding a match between a token in the correspondence and in the merged data structure; for the matched token, counting how many neighboring tokens in the merged data structure are also neighboring tokens in the correspondence; and adding the matched token to a query vector based on the counting; and performing a search of the formally written articles using the query vector.

18. The medium of claim 17 wherein the matched token is added to the query vector based on having a minimum threshold number of neighboring tokens in the merged data structure also being neighboring tokens in the correspondence, thereby excluding from the query vector high scoring terms in the correspondence that are specific to correspondence but not correlated among substantive, technical terms in formal written articles.

19. A computer system executing instructions in a computer program for searching using term selection from a document to find similar content, the system comprising: a processor; and a memory operatively coupled with the processor, the processor executing instructions stored in the memory including: program code for providing formally written articles; program code for selecting one or more tokens in each article by: program code for identifying candidate root words; program code for calculating a term frequency-inverse document frequency (TF-IDF) score for each of the candidate root words; and program code for selecting the candidate root words as tokens based on the TF-IDF scores; program code for cataloging neighboring tokens for each selected token into a data structure for each article, where neighboring tokens include tokens that are within a threshold number of words to the selected token in an article; program code for merging the data structures for the articles into a common data structure; program code for providing a written correspondence; program code for selecting one or more tokens in the correspondence by: program code for identifying candidate root words from the correspondence; prog
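The cataloging, merging, and matching steps recited in the claims can be sketched as follows. This is an illustrative reading only, not the claimed implementation: the nested-dictionary layout, the function names (`catalog_neighbors`, `merge_catalogs`, `build_query`), the `min_shared` parameter, and the sample sentences are assumptions. The sketch uses a plain count threshold as in claim 2; claim 9 instead thresholds a logarithm of that count.

```python
import math


def catalog_neighbors(words, tokens, window):
    """Map each selected token to {neighboring token: minimum word distance}.

    `words` is the document as a list of words, `tokens` the set of selected
    tokens, and `window` the threshold number of words between neighbors.
    """
    positions = [(i, w) for i, w in enumerate(words) if w in tokens]
    catalog = {}
    for i, a in positions:
        for j, b in positions:
            distance = abs(i - j)
            if a != b and distance <= window:
                nbrs = catalog.setdefault(a, {})
                nbrs[b] = min(distance, nbrs.get(b, math.inf))
    return catalog


def merge_catalogs(catalogs):
    """Merge per-article catalogs, retaining the minimum distance seen for each pair."""
    merged = {}
    for catalog in catalogs:
        for token, nbrs in catalog.items():
            target = merged.setdefault(token, {})
            for nbr, distance in nbrs.items():
                target[nbr] = min(distance, target.get(nbr, distance))
    return merged


def build_query(corr_catalog, merged, min_shared=1):
    """Keep correspondence tokens that match the merged catalog and share
    at least `min_shared` neighbors with it (assumed threshold)."""
    query = []
    for token, corr_nbrs in corr_catalog.items():
        if token not in merged:
            continue  # no match in the formal articles
        shared = set(corr_nbrs) & set(merged[token])
        if len(shared) >= min_shared:
            query.append(token)
    return sorted(query)


# Hypothetical articles and correspondence, invented for this example.
article_tokens = {"vpn", "tunnel", "client", "certificate"}
articles = [
    "the vpn client opens a tunnel".split(),
    "renew the certificate before the vpn tunnel expires".split(),
]
merged = merge_catalogs(
    [catalog_neighbors(words, article_tokens, window=3) for words in articles]
)

email = "hello my vpn tunnel keeps dropping".split()
email_catalog = catalog_neighbors(email, {"hello", "vpn", "tunnel"}, window=3)
query = build_query(email_catalog, merged)
```

Here "hello" is a selected token in the email but never appears among the article tokens, so it is left out of the query, while "vpn" and "tunnel" are kept because each has a neighbor in the email that is also its neighbor in the merged catalog. The window of 3 words is only for the toy data; claim 10 describes windows of 50 to 100 words.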

Assignees

Oracle Int Corp

Inventors

Classifications

  • Physics · mapped topic

  • G06F16/334 · Primary

    Query execution (filtering based on additional data G06F16/335) · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016140231A1 cover?
Methods, devices, and systems are described for creating and implementing search query vectors for knowledge base articles or other formal articles, the query vectors automatically created from informal correspondence such as a service request email to an information technology (IT) department. Term frequency-inverse document frequency (TF-IDF) scores are calculated for rare words in the corresp…
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F17/30864. Mapped technology areas include Physics.
When was this patent published?
Publication date May 19, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).