Identification of textual similarity

US2018137090A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2018137090-A1
Application number: US-201615350355-A
Country: US
Kind code: A1
Filing date: Nov 14, 2016
Priority date: Nov 14, 2016
Publication date: May 17, 2018
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for determining a similarity between text segments within a document comprising textual references are described. According to an example, a system comprises a memory that stores computer executable components; and a processor that executes the computer executable components stored in the memory. The computer executable components can comprise: an identification component that identifies a reference associated with a set of text and an extraction component that extracts the reference from the set of text. The computer executable components can also comprise an embedding component that replaces the reference with a corresponding vector.
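
The three components named in the abstract (identification, extraction, embedding) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify what a reference looks like or how its vector is produced, so the `Section N.N` pattern, the function names, and the hashed bag-of-words `toy_vector` are hypothetical stand-ins, not the patent's method.

```python
import hashlib
import re

# Assumption: references look like "Section 2.1"; the patent does not fix a format.
REFERENCE_PATTERN = re.compile(r"Section \d+(?:\.\d+)*")


def toy_vector(text, dims=8):
    """Hash each word into one of `dims` buckets (stand-in for a real embedding model)."""
    vec = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    return vec


def identify_references(text):
    """Identification step: find reference strings in the text."""
    return REFERENCE_PATTERN.findall(text)


def embed_references(text, referenced_sections, dims=8):
    """Extraction and embedding steps: split the text into parts and
    swap each known reference for a vector built from the text it points to."""
    parts = []
    pos = 0
    for match in REFERENCE_PATTERN.finditer(text):
        if match.start() > pos:
            parts.append(text[pos:match.start()])
        ref = match.group(0)
        if ref in referenced_sections:
            parts.append(toy_vector(referenced_sections[ref], dims))
        else:
            # Unknown references are kept as plain text rather than guessed at.
            parts.append(ref)
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


# Hypothetical sample data.
sections = {"Section 2.1": "The supplier shall deliver goods within thirty days."}
clause = "Subject to Section 2.1, payment is due on delivery."
embedded = embed_references(clause, sections)
```

Here `embedded` is a list that mixes plain-text fragments with one vector in the position where "Section 2.1" appeared. A real system would presumably use a learned embedding model in place of `toy_vector`.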

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a memory that stores computer executable components; a processor that executes the computer executable components stored in the memory, wherein the computer executable components comprise: an identification component that identifies a reference associated with a set of text; an extraction component that extracts the reference from the set of text; and an embedding component that replaces the reference with a corresponding vector.

2. The system of claim 1, further comprising a first determination component that determines a similarity between an identified first language associated with an embedded reference and an identified second language associated with another embedded reference, wherein the similarity is based on a group of operations consisting of a cosine similarity operation and a machine learning operation, and wherein the embedded reference and the another embedded reference comprises a vector and another vector respectively that are capable of being analyzed by the group of operations.

3. The system of claim 1, wherein the identification component comprises components from the group consisting of a hyperlink identification component that identifies whether the reference is linked by a hyperlink to the set of text and a contextualization component that identifies an organizational framework of the reference within the set of text.

4. The system of claim 3, wherein the extraction component comprises a template extraction component that extracts a reference template from the organizational framework, wherein the reference template facilitates access to a set of data corresponding to the reference.

5. The system of claim 4, further comprising a template matching component that matches the reference to a location within the reference template.

6. The system of claim 1, further comprising a rule matching component that organizes one or more clause within the set of text according to one or more clause rules.

7. The system of claim 1, further comprising an annotation component that annotates a version of one or more clauses of the set of text based on a structural rule representing grammatical requirements for a set of clauses.

8. The system of claim 1, wherein the extraction component employs a defined term extraction component that extracts a defined term from the set of text, wherein an extraction of the defined term is based on performance of a semantic parsing operation on the set of text.

9. The system of claim 8, further comprising an ontological matching component that ontologically matches the defined term to a reference term within the set of text.

10. The system of claim 9, further comprising a second determination component that determines a similarity score based on a comparison of the defined term and the reference term, wherein the similarity score represents a degree of similarity between the defined term and the reference term.

11. The system of claim 10, wherein the embedding component embeds a first version of the set of text based on the similarity score being greater than a threshold score, wherein the first version of the set of text comprises the defined term and a reference vector.

12. The system of claim 11, wherein the embedding component comprises a construction component that embeds the defined term with the reference vector based on a neural sentence embedding model, wherein the defined term is represented by a common language term based on a neural sentence embedding model.

13. A computer-implemented method, comprising: identifying, by a system operatively coupled to a processor, a reference associated with a set of text; extracting, by the system, the reference from the set of text; and embedding, by the system, a vector corresponding to the reference as a replacement for the reference.

14. The computer-implemented method of claim 13, further comprising determining, by the system, a similarity between a first language associated with an embedded reference and a second language associated with another embedded reference, wherein the similarity is based on a group of operations consisting of a cosine similarity operation and a machine learning algorithm, and wherein the embedded reference and the another embedded reference comprises a vector and another vector respectively that are capable of being analyzed by the group of operations.

15. The computer-implemented method of claim 13, further comprising extracting, by the system, a reference template from an organizational framework, wherein the reference template facilitates access to a set of data corresponding to the reference.

16. The computer-implemented method of claim 15, further comprising annotating, by the system, a version of one or more clauses of the set of text based on a structural rule representing grammatical requirements for a set of clauses.

17. The computer-implemented method of claim 13, further comprising extracting, by the system, a defined term from the set of text, wherein an extraction of the defined term is based on performance of a semantic parsing operation on the set of text.

18. A computer program product for efficiently determining textual similarities, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to: identify a reference associated with a set of text; extract the reference from the set of text; and embed a vector corresponding to the reference as a replacement for the reference.

19. The computer program product of claim 18, wherein the program instructions are further executable by the processor to cause the processor to: determine a similarity between a first language associated with an embedded reference and a second language associated with another embedded reference, wherein the similarity is based on a group of operations consisting of a cosine similarity operation and a machine learning algorithm, and wherein the embedded reference and the another embedded reference comprises a vector and another vector respectively that are capable of being analyzed by the group of operations.

20. The computer program product of claim 18, wherein the program instructions are further executable by the processor to cause the processor to: extract a reference template from an organizational framework, wherein the reference template facilitates access to a set of data corresponding to the reference.
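
The claims name a cosine similarity operation (claims 2, 14 and 19) and a comparison of a similarity score against a threshold (claims 10 and 11). The sketch below shows the standard cosine similarity formula and a simple threshold check in Python. The function names, the sample vectors, and the threshold value of 0.8 are hypothetical; the claims do not state a threshold or say how the score is computed for defined terms.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def exceeds_threshold(score, threshold=0.8):
    """True when the score is strictly greater than the (assumed) threshold."""
    return score > threshold


# Hypothetical vectors standing in for two embedded references.
ref_a = [1.0, 2.0, 0.0]
ref_b = [2.0, 4.0, 0.0]   # same direction as ref_a
ref_c = [0.0, 0.0, 3.0]   # orthogonal to ref_a

same_direction = cosine_similarity(ref_a, ref_b)
orthogonal = cosine_similarity(ref_a, ref_c)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, so under the assumed threshold only the first pair would pass `exceeds_threshold`.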

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018137090A1 cover?
Techniques for determining a similarity between text segments within a document comprising textual references are described. According to an example, a system comprises a memory that stores computer executable components; and a processor that executes the computer executable components stored in the memory. The computer executable components can comprise: an identification component that identi…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date May 17, 2018 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).