Disambiguation in mention detection

US10176165B2 · US · B2

Patent metadata

Publication number: US-10176165-B2
Application number: US-201514926260-A
Country: US
Kind code: B2
Filing date: Oct 29, 2015
Priority date: Oct 31, 2014
Publication date: Jan 8, 2019
Grant date: Jan 8, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disambiguation in mention detection. The method includes: determining at least one location in a text at which a target surface form in the text appears; obtaining an overall word-bag context of the target surface form in the text, the word-bag context at each of the at least one location including words within a predetermined neighborhood of the location; obtaining an overall resource context of the target surface form in the text, the resource context at each of the at least one location including resources corresponding to a further surface form within a predetermined neighborhood of the location; and determining a similarity between the target surface form and a candidate resource for the target surface form based on the overall word-bag context and the overall resource context. A system for disambiguation in mention detection is also provided.
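The context-gathering steps in the abstract can be sketched in a few lines of Python. This is only an illustrative reading, not the patented implementation: the whitespace tokenization, the token-count window used as the "predetermined neighborhood", the `resources_by_form` lookup table, and all function names are assumptions made for the example.

```python
from collections import Counter


def find_locations(tokens, surface_form):
    """Return the token indices at which the target surface form appears."""
    return [i for i, tok in enumerate(tokens) if tok == surface_form]


def word_bag_context(tokens, location, window):
    """Words within `window` tokens of `location` (assumed neighborhood)."""
    start = max(0, location - window)
    end = min(len(tokens), location + window + 1)
    return [tokens[i] for i in range(start, end) if i != location]


def resource_context(tokens, location, window, resources_by_form):
    """Resources linked to other surface forms found in the same neighborhood.

    `resources_by_form` is a hypothetical lookup: surface form -> resources.
    """
    context = []
    for word in word_bag_context(tokens, location, window):
        context.extend(resources_by_form.get(word, []))
    return context


def overall_context(per_location_contexts, min_count=0):
    """Merge per-location contexts, keeping items seen more than `min_count` times."""
    counts = Counter()
    for context in per_location_contexts:
        counts.update(context)
    return {item: n for item, n in counts.items() if n > min_count}


# Toy text and a toy surface-form-to-resource table (both invented).
tokens = "apple released a new iphone while apple stock rose on nasdaq".split()
resources_by_form = {"iphone": ["IPhone"], "stock": ["Stock_market"]}

locations = find_locations(tokens, "apple")
bags = [word_bag_context(tokens, loc, 3) for loc in locations]
links = [resource_context(tokens, loc, 3, resources_by_form) for loc in locations]

print(overall_context(bags, min_count=1))
print(overall_context(links))
```

With `min_count=0` the merge keeps everything, which corresponds to the plain merging described in claim 2; a positive `min_count` mimics the pre-defined thresholds recited in claim 1.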

First claim

Opening claim text (preview).

We claim:

1. A method for disambiguation in mention detection, comprising:
    determining at least one location in a text at which a target surface form in the text appears, wherein the text is located in a first web resource;
    training a word-bag context and a resource context, wherein the training comprises: obtaining a training corpus, wherein the training corpus includes a plurality of articles, and wherein each of the plurality of articles includes at least one of the target surface form; and generating the word-bag context and the resource context based on the training corpus;
    obtaining the word-bag context, in the text, for the target surface form at each of the at least one location, the word-bag context for the target surface form at each of the at least one location including words within a first predetermined neighborhood of the at least one location;
    obtaining an overall word-bag context of the target surface form, wherein the overall word-bag context includes words that are present a first number of times exceeding a first pre-defined threshold in the word-bag context of the target surface form at each of the at least one location;
    obtaining the resource context, in the text, for the target surface form at each of the at least one location, the resource context for the target surface form at each of the at least one location including resources corresponding to an ancillary surface form within a second predetermined neighborhood of the at least one location;
    obtaining an overall resource context of the target surface form, wherein the overall resource context includes words that are present a second number of times exceeding a second pre-defined threshold in the resource context of the target surface form at each of the at least one location; and
    determining a similarity between the target surface form and a candidate resource for the target surface form based on the overall word-bag context and the overall resource context, wherein determining the similarity between the target surface form and a candidate resource for the target surface form based on the overall word-bag context and the overall resource context comprises: constructing a surface form context vector of the target surface form based on the overall word-bag context and the overall resource context; and obtaining a candidate resource context vector of the candidate resource, the candidate resource context vector including the overall word-bag context and the overall resource context of the candidate resource;
    determining the similarity between the target surface form and the candidate resource based on the surface form context vector and the candidate resource context vector;
    mapping the target surface form to the candidate resource based at least in part on the similarity between the target surface form and the candidate resource being above a threshold similarity, wherein the candidate resource comprises a second web resource; and
    outputting, via a user interface, the target surface form and candidate resource.

2. The method according to claim 1, wherein obtaining an overall word-bag context of the target surface form in the text comprises merging the word-bag contexts of the target surface form at each of the at least one location; and obtaining an overall resource context of the target surface form in the text comprises merging the resource contexts of the target surface form at each of the at least one location.

3. The method according to claim 1, wherein determining the similarity between the target surface form and the candidate resource based on the surface form context vector and the candidate resource context vector comprises:
    obtaining a first set of weights of elements in the surface form context vector, the first set of weights indicating importance of the elements in the surface form context vector;
    obtaining a second set of weights of elements in the candidate resource context vector, the second set of weights indicating importance of the elements in the candidate resource context vector; and
    calculating an inner product of the surface form context vector and the candidate resource context vector based on the first set of weights and the second set of weights, to determine the similarity between the target surface form and the candidate resource.

4. The method according to claim 3, wherein obtaining the first set of weights of elements in the surface form context vector comprises: calculating the weights based on at least one of a term frequency (TF) and an inverse document frequency (IDF) of the elements in the surface form context vector.

5. The method according to claim 1, wherein obtaining the candidate resource context vector of the candidate resource comprises: obtaining the candidate resource context vector from an index associated with the target surface form.

6. The method according to claim 1, wherein determining at least one location in the text at which the target surface form in the text appears comprises:
    obtaining an overall word-bag context of each surface form in a plurality of surface forms in the text;
    determining, based on the overall word-bag context of each surface form in the plurality of surface forms, a coarse similarity between each surface form in the plurality of surface forms and a respective candidate resource; and
    selecting the target surface form from among the plurality of surface forms, such that the coarse similarity of the target surface form is lower than a first threshold, and the coarse similarity of a further surface form within a predetermined neighborhood of the target surface form is higher than a second threshold.

7. A system for disambiguation in mention detection, comprising: a processor communicatively coupled to a memory, the processor configured to:
    determine at least one location in a text at which a target surface form in the text appears, wherein the text is located in a first web resource;
    train a word-bag context and a resource context, wherein the training comprises: obtain a training corpus, wherein the training corpus includes a plurality of articles, and wherein each of the plurality of articles includes at least one of the target surface form; and generate the word-bag context and the resource context based on the training corpus;
    obtain the word-bag context, in the text, for the target surface form at each of the at least one location, the word-bag context for the target surface form at each of the at least one location including words within a first predetermined neighborhood of the at least one location;
    obtain an overall word-bag context of the target surface form, wherein the overall word-bag context includes words that are present a first number of times exceeding a first pre-defined threshold in the word-bag context of the target surface form at each of the at least one location;
    obtain the resource context of the target surface form, in the text, for the target surface form at each of the at least one location, the resource context for the target surface form at each of the at least one location including resources corresponding to an ancillary surface form within a second predetermined neighborhood of the at least one location;
    obtain an overall resource context of the target surface form, wherein the overall resource context includes words that are present a second number of times exceeding a second pre-defined threshold in the resource context of the target surface form at each of the at least one location;
    determine, based on the overall word-bag context and the overall resource context, a similarity between the target surface form and a candidate resource for the target surface form, wherein determining the similarity between the target surface form and a candidate resource for th…
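The similarity and mapping steps in claims 1, 3, and 4 can be illustrated with a small sketch. It is a hedged example rather than the claimed implementation: the smoothed IDF formula, the `w:` and `r:` prefixes used to keep word and resource elements apart in one vector, the `doc_freq` table, and the function names are all assumptions introduced here.

```python
import math


def tf_idf_weights(context_counts, doc_freq, n_docs):
    """Weight each context element by TF times a smoothed IDF (assumed formula)."""
    total = sum(context_counts.values())
    if total == 0:
        return {}
    weights = {}
    for element, count in context_counts.items():
        tf = count / total
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(element, 0))) + 1.0
        weights[element] = tf * idf
    return weights


def weighted_inner_product(form_weights, candidate_weights):
    """Inner product over the elements the two weighted vectors share."""
    shared = form_weights.keys() & candidate_weights.keys()
    return sum(form_weights[e] * candidate_weights[e] for e in shared)


def map_surface_form(form_weights, candidates, threshold):
    """Return the best candidate whose similarity exceeds `threshold`, else None."""
    best_resource, best_score = None, threshold
    for resource, candidate_weights in candidates.items():
        score = weighted_inner_product(form_weights, candidate_weights)
        if score > best_score:
            best_resource, best_score = resource, score
    return best_resource


# Surface form context vector: word-bag ("w:") and resource ("r:") elements
# combined into one mapping. Counts and candidate weights are invented.
form_counts = {"w:iphone": 2, "w:stock": 1, "r:IPhone": 1}
form_weights = tf_idf_weights(form_counts, doc_freq={"w:stock": 4}, n_docs=10)

candidates = {
    "Apple_Inc.": {"w:iphone": 1.0, "r:IPhone": 1.0},
    "Apple_(fruit)": {"w:orchard": 1.0},
}

print(map_surface_form(form_weights, candidates, threshold=0.0))
```

In this toy run only the first candidate shares elements with the surface form vector, so it is the one returned; raising `threshold` above its score makes the function return `None`, which mirrors the "above a threshold similarity" condition in claim 1. In a fuller system the candidate vectors would presumably come from a precomputed index, as claim 5 suggests.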

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10176165B2 cover?
Disambiguation in mention detection. The method includes: determining at least one location in a text at which a target surface form in the text appears; obtaining an overall word-bag context of the target surface form in the text, the word-bag context at each of the at least one location including words within a predetermined neighborhood of the location; obtaining an overall resource context …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/295. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 8, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).