Domain-specific stopword removal from unstructured computer text using a neural network

US2018225568A1 · US · A1

Patent metadata
Publication number: US-2018225568-A1
Application number: US-201715426958-A
Country: US
Kind code: A1
Filing date: Feb 7, 2017
Priority date: Feb 7, 2017
Publication date: Aug 9, 2018
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and apparatuses are described for analyzing unstructured computer text for domain-specific stopword identification and removal. A computer data store stores unstructured text. A server computing device splits the unstructured text into phrases and generates tokens from the phrases. The server computing device generates a set of bootstrap keywords using the tokens. An artificial intelligence neural network executing on the server computing device generates a stopword training model. The server computing device generates a first set of candidate stopwords using the bootstrap keywords and the stopword training model. The server computing device generates regular expressions using the bootstrap keywords, and generates a second set of candidate stopwords using the regular expressions. The server computing device stores the candidate stopwords in the data store, and removes stopwords from the unstructured text using the data store.
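The first and last stages the abstract describes (phrase splitting, tokenization, bootstrap keyword selection, and stopword removal) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the terminator characters, the word pattern, the `min_count` frequency threshold, and every function name are assumptions introduced here, since the abstract does not specify them.

```python
import re
from collections import Counter

# Assumption: phrases end at these characters; the source does not list terminators.
TERMINATORS = re.compile(r"[.!?;\n]+")
# Assumption: a token is a run of letters, digits, or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")


def split_phrases(text):
    """Split raw text into phrases at terminator characters."""
    return [p.strip() for p in TERMINATORS.split(text) if p.strip()]


def tokenize(phrases):
    """Turn each phrase into a list of lowercase word tokens."""
    return [[w.lower() for w in WORD.findall(p)] for p in phrases]


def bootstrap_keywords(token_lists, min_count=3):
    """Select tokens that appear at least min_count times as seed stopwords.

    Hypothetical heuristic: it covers only the 'appear frequently' criterion,
    not the 'determined to be uninformative' one.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, n in counts.items() if n >= min_count}


def remove_stopwords(token_lists, stopwords):
    """Drop every token that is in the stopword set."""
    return [[t for t in toks if t not in stopwords] for toks in token_lists]


text = ("Customer called about the account. Customer asked about fees! "
        "Customer wants the account closed.")
phrases = split_phrases(text)
tokens = tokenize(phrases)
seeds = bootstrap_keywords(tokens, min_count=3)
cleaned = remove_stopwords(tokens, seeds)
print(seeds)
print(cleaned)
```

In this toy input, "customer" occurs in every phrase and carries little information for a customer-service corpus, which is the kind of domain-specific stopword a generic stopword list would miss.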

First claim

Opening claim text (preview).

What is claimed is:

1. A system used in a computing environment in which unstructured computer text is analyzed for domain-specific stopword identification and removal, the system comprising:
  a computer data store including domain-specific unstructured text, the unstructured text being input via a web page, input directly into the computer data store via a first computer file, or any combination thereof, and
  a server computing device in communication with the computer data store and programmed to:
  split the unstructured text into one or more phrases;
  generate tokens from the phrases, wherein each token comprises a word;
  generate a set of bootstrap keywords by: a) selecting a subset of the tokens that (i) appear frequently in the unstructured text or (ii) are determined to be uninformative, and b) inserting each token in the subset of the tokens into the computer data store as a bootstrap keyword;
  generate, using an artificial intelligence neural network executing on the server computing device, a stopword training model by creating a word vector for each token and inserting the word vector in a high-dimensional space that comprises the training model, wherein a position of each word vector in the high-dimensional space is based upon a semantic relationship between the corresponding word and surrounding words in the unstructured text,
  generate a first set of candidate stopwords by executing the stopword training model against the set of bootstrap keywords to determine one or more words in the unstructured text that are similar to the bootstrap keywords based upon the word vector for the bootstrap keywords;
  generate regular expressions based upon the set of bootstrap keywords by: a) identifying a sequence of characters in one or more of the bootstrap keywords, b) identifying a syntactical pattern in the sequence of characters, and c) generating a regular expression using the syntactical pattern;
  generate a second set of candidate stopwords using the regular expressions and insert the first set of candidate stopwords and the second set of candidate stopwords into the computer data store as a set of final stopwords; and
  remove stopwords from the unstructured text that match the set of final stopwords.

2. The system of claim 1, wherein the artificial intelligence neural network positions each word vector in the high-dimensional space to be in proximity to similar word vectors.

3. The system of claim 1, wherein the server computing device splits the unstructured text into one or more phrases by locating a terminator in the unstructured text and separating the unstructured text on either side of the terminator into a phrase.

4. The system of claim 1, wherein the server computing device is further programmed to determine one or more themes in the unstructured text after the stopwords are removed.

5. The system of claim 1, wherein the high-dimensional space comprises hundreds of dimensions.

6. The system of claim 1, wherein the second set of candidate stopwords comprises a plurality of stopwords with a syntax that matches at least one regular expression.

7. The system of claim 6, wherein the second set of candidate stopwords includes one or more stopwords that are not present in the unstructured text.

8. A computerized method in which unstructured computer text is analyzed for domain-specific stopword identification and removal, the method comprising:
  storing, in a computer data store, domain-specific unstructured text, the unstructured text being input via a web page, input directly into the computer data store via a first computer file, or any combination thereof;
  splitting, by a server computing device in communication with the computer data store, the unstructured text into one or more phrases;
  generating, by the server computing device, tokens from the phrases, wherein each token comprises a word;
  generating, by the server computing device, a set of bootstrap keywords by: a) selecting a subset of the tokens that (i) appear frequently in the unstructured text or (ii) are determined to be uninformative, and b) inserting each token in the subset of the tokens into the computer data store as a bootstrap keyword;
  generating, by an artificial intelligence neural network executing on the server computing device, a stopword training model by creating a word vector for each token and inserting the word vector in a high-dimensional space that comprises the training model, wherein a position of each word vector in the high-dimensional space is based upon a semantic relationship between the corresponding word and surrounding words in the unstructured text;
  generating, by the server computing device, a first set of candidate stopwords by executing the stopword training model against the set of bootstrap keywords to determine one or more words in the unstructured text that are similar to the bootstrap keywords based upon the word vector for the bootstrap keywords;
  generating, by the server computing device regular expressions based upon the set of bootstrap keywords by: a) identifying a sequence of characters in one or more of the bootstrap keywords, b) identifying a syntactical pattern in the sequence of characters, and c) generating a regular expression using the syntactical pattern;
  generating, by the server computing device, a second set of candidate stopwords using the regular expressions and insert the first set of candidate stopwords and the second set of candidate stopwords into the computer data store as a set of final stopwords; and
  removing, by the server computing device, stopwords from the unstructured text that match the set of final stopwords.

9. The method of claim 8, wherein the artificial intelligence neural network positions each word vector in the high-dimensional space to be in proximity to similar word vectors.

10. The method of claim 8, wherein the server computing device splits the unstructured text into one or more phrases by locating a terminator in the unstructured text and separating the unstructured text on either side of the terminator into a phrase.

11. The method of claim 8, wherein the server computing device is further programmed to determine one or more themes in the unstructured text after the stopwords are removed.

12. The method of claim 8, wherein the high-dimensional space comprises hundreds of dimensions.

13. The method of claim 8, wherein the second set of candidate stopwords comprises a plurality of stopwords with a syntax that matches at least one regular expression.

14. The method of claim 13, wherein the second set of candidate stopwords includes one or more stopwords that are not present in the unstructured text.

15. A computer readable storage medium comprising programmatic instructions for operation of a computing environment in which unstructured computer text is analyzed for domain-specific stopword identification and removal, the instructions operable to cause
  a computer data store to store domain-specific unstructured text, the unstructured text being input via a web page, input directly into the computer data store via a first computer file, or any combination thereof; and
  a server computing device in communication with the computer data store, and including an executable artificial intelligence neural network executing on the server computing device, to:
  split the unstructured text into one or more phrases;
  generate tokens from the phrases, wherein each token comprises a word;
  generate a set of bootstrap keywords by a) selecting a subset of the tokens that (i) appear frequently
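The two candidate-generation steps in the claims (similarity to bootstrap keywords in a word-vector space, and regular expressions derived from syntactic patterns in the keywords) might look something like the sketch below. Several things here are assumptions rather than anything the claims state: the word vectors are tiny hand-written stand-ins for what a trained neural embedding model would produce, cosine similarity with a 0.9 threshold is a stand-in similarity measure, and the "syntactic pattern" heuristic (keep non-digit characters literal, generalize each run of digits) is only one plausible reading of the claim language. All function names are hypothetical.

```python
import math
import re
from itertools import groupby


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(vectors, seeds, threshold=0.9):
    """First candidate set: words whose vector lies close to any seed vector."""
    seed_vecs = [vectors[s] for s in seeds if s in vectors]
    return {
        word
        for word, vec in vectors.items()
        if word not in seeds
        and any(cosine(vec, sv) >= threshold for sv in seed_vecs)
    }


def keyword_to_regex(keyword):
    """Generalize a keyword: literal non-digit runs, \\d{n} for each digit run."""
    parts = []
    for is_digit, run in groupby(keyword, key=str.isdigit):
        chunk = "".join(run)
        parts.append(rf"\d{{{len(chunk)}}}" if is_digit else re.escape(chunk))
    return re.compile("".join(parts))


def regex_candidates(tokens, patterns):
    """Second candidate set: tokens that fully match at least one pattern."""
    return {t for t in tokens if any(p.fullmatch(t) for p in patterns)}


# Toy 2-dimensional vectors; a real model would use hundreds of dimensions.
vectors = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "mortgage": [0.0, 1.0]}
first_set = similar_words(vectors, {"hello"})

patterns = [keyword_to_regex("ref12345")]
second_set = regex_candidates(["ref99999", "refund", "ref1"], patterns)

# Final stopwords: the union of both candidate sets.
final_stopwords = first_set | second_set
print(sorted(final_stopwords))
```

Keeping letters literal avoids a pattern so broad that it matches ordinary vocabulary; a looser generalization would trade precision for recall.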

Assignees

FMR LLC

Inventors

Classifications

  • Semantic analysis

  • Selection or weighting of terms for indexing

  • Using natural language analysis

  • Using statistical methods

  • Lexical analysis, e.g. tokenisation or collocates

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018225568A1 cover?
Methods and apparatuses for analyzing unstructured computer text for domain-specific stopword identification and removal. The full abstract appears in the Abstract section above.
Who is the assignee on this patent?
FMR LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/3344. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 9, 2018 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).