Normalizing electronic communications using feature sets

US9280747B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9280747-B1
Application number  US-201514928784-A
Country             US
Kind code           B1
Filing date         Oct 30, 2015
Priority date       May 27, 2015
Publication date    Mar 8, 2016
Grant date          Mar 8, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Electronic communications can be normalized using feature sets. For example, an electronic representation of a noncanonical communication can be received, and multiple candidate canonical versions of the noncanonical communication can be determined. A first feature set representative of the noncanonical communication can be determined by splitting the noncanonical communication into at least one n-gram and at least one k-skip-n-gram. Multiple comparison feature sets can be determined by splitting multiple terms in training data into respective comparison feature sets. Multiple Jaccard index values can be determined using the first feature set and the multiple comparison feature sets. A subset of the multiple terms in the training data in which an associated Jaccard index value exceeds a threshold can be selected. The subset of the multiple terms can be included in the multiple candidate canonical versions. A normalized version of the noncanonical communication can be selected from the multiple candidate canonical versions.
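
The candidate-generation steps in the abstract can be sketched in Python. This is a minimal illustration, not the patented implementation: the function names, the `|` separator used to mark skipped characters, the uniform skip distance, and the sample terms are all assumptions made for the example. The defaults (n = 2, k = 1) follow the values recited in claim 2.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Sequences of n adjacent characters."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def skip_ngrams(text: str, n: int = 2, k: int = 1) -> set[str]:
    """Sequences of n nonadjacent characters with 1..k skipped characters
    between neighbours. Assumption: the same skip distance is used between
    every pair of characters, and '|' marks the skipped position(s)."""
    features = set()
    for skip in range(1, k + 1):
        step = skip + 1
        span = (n - 1) * step + 1  # characters covered by one feature
        for i in range(len(text) - span + 1):
            features.add("|".join(text[i:i + span:step]))
    return features


def feature_set(text: str, n: int = 2, k: int = 1) -> set[str]:
    """Union of the n-gram and k-skip-n-gram features of a token."""
    return char_ngrams(text, n) | skip_ngrams(text, n, k)


def jaccard_index(a: set[str], b: set[str]) -> float:
    """|A intersect B| / |A union B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_candidates(noncanonical: str, terms: list[str],
                      threshold: float) -> list[tuple[str, float]]:
    """Training terms whose Jaccard index against the input exceeds the threshold."""
    first = feature_set(noncanonical)
    scored = [(term, jaccard_index(first, feature_set(term))) for term in terms]
    kept = [(term, score) for term, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical training terms, chosen only for illustration.
    print(select_candidates("luv", ["love", "live", "glove", "move"], 0.12))
```

With these hypothetical terms, "luv" shares only the skip feature `l|v` with "love" and "live" (Jaccard index 1/7 each), so both pass a 0.12 threshold, while "glove" (1/9) and "move" (0) do not. Choosing the final normalized version among the surviving candidates is a separate step.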

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer readable medium comprising program code executable by a processor for causing the processor to: receive an electronic representation of a noncanonical communication; determine a plurality of candidate canonical versions of the noncanonical communication using a database generated using training data; determine a first feature set representative of the noncanonical communication by splitting the noncanonical communication into at least one n-gram and at least one k-skip-n-gram, wherein an n-gram comprises a sequence of a predefined number of adjacent characters, and wherein a k-skip-n-gram comprises a sequence of nonadjacent characters in a communication selected such that a maximum number of skipped characters are positioned between each of the nonadjacent characters in the communication; determine a plurality of comparison feature sets by splitting each term in a plurality of terms in the training data into a respective comparison feature set comprising at least one n-gram and at least one k-skip-n-gram; determine a plurality of Jaccard index values using the first feature set and the plurality of comparison feature sets, each Jaccard index value of the plurality of Jaccard index values being representative of a similarity between the noncanonical communication and a term of the plurality of terms in the training data; select a subset of the plurality of terms in the training data in which an associated Jaccard index value exceeds a threshold; include the subset of the plurality of terms in the plurality of candidate canonical versions of the noncanonical communication; and select a normalized version of the noncanonical communication from the plurality of candidate canonical versions.

2. The non-transitory computer readable medium of claim 1, wherein the predefined number of adjacent characters is two and the maximum number of skipped characters is one.

3. The non-transitory computer readable medium of claim 2, further comprising program code executable by the processor for causing the processor to: select the normalized version of the noncanonical communication from the plurality of candidate canonical versions by: determining a confidence score for each candidate canonical version of the plurality of candidate canonical versions using a classifier; and selecting a candidate from the plurality of candidate canonical versions associated with a highest confidence score as the normalized version of the noncanonical communication.

4. The non-transitory computer readable medium of claim 3, wherein the classifier is configured to use a Jaccard index value associated with a respective candidate canonical version to determine the confidence score for the respective candidate canonical version.

5. The non-transitory computer readable medium of claim 3, wherein the classifier is configured to use a support value and a confidence value associated with a respective candidate canonical version to determine the confidence score for the respective candidate canonical version, wherein the support value comprises a number of times the respective candidate canonical version occurs in the training data used for generating the plurality of candidate canonical versions of the noncanonical communication, and wherein the confidence value comprises a ratio of an amount of times the respective candidate canonical version was selected as the normalized version of the noncanonical communication divided by another amount of times the noncanonical communication is present in the training data.

6. The non-transitory computer readable medium of claim 3, wherein the classifier is configured to use a difference between a first number of characters in a respective candidate canonical version and a second number of characters in the noncanonical communication to determine the confidence score for the respective candidate canonical version.

7. The non-transitory computer readable medium of claim 3, wherein the classifier is configured to use a confidence difference between a first part of speech (POS) tag confidence associated with a respective candidate canonical version and a second POS tag confidence associated with the noncanonical communication to determine the confidence score for the respective candidate canonical version, the first POS tag confidence and the second POS tag confidence being determinable by a POS tagger.

8. The non-transitory computer readable medium of claim 1, further comprising program code executable by the processor for causing the processor to: receive the electronic representation of the noncanonical communication from a text message, an e-mail, an electronic document, a social media post, a tweet, a blog post, a forum post, media content, or streaming content.

9. The non-transitory computer readable medium of claim 1, further comprising program code executable by the processor for causing the processor to: include the normalized version of the noncanonical communication in a data set for use in textual analysis; and perform textual analysis on the data set to determine one or more trends indicated by the data set.

10. The non-transitory computer readable medium of claim 1, further comprising program code executable by the processor for causing the processor to: determine at least one Jaccard index value of the plurality of Jaccard Index values by weighting at least one feature of a comparison feature set.

11. A method comprising: receiving an electronic representation of a noncanonical communication; determining a plurality of candidate canonical versions of the noncanonical communication using a database generated using training data; determining a first feature set representative of the noncanonical communication by splitting the noncanonical communication into at least one n-gram and at least one k-skip-n-gram, wherein an n-gram comprises a sequence of a predefined number of adjacent characters in a communication, and wherein a k-skip-n-gram comprises a sequence of nonadjacent characters in the communication selected such that a maximum number of skipped characters are positioned between each of the nonadjacent characters in the communication; determining a plurality of comparison feature sets by splitting each term in a plurality of terms in the training data into a respective comparison feature set comprising at least one n-gram and at least one k-skip-n-gram; determining a plurality of Jaccard index values using the first feature set and the plurality of comparison feature sets, each Jaccard index value of the plurality of Jaccard index values being representative of a similarity between the noncanonical communication and a term of the plurality of terms in the training data; selecting a subset of the plurality of terms in the training data in which an associated Jaccard index value exceeds a threshold; including the subset of the plurality of terms in the plurality of candidate canonical versions of the noncanonical communication; and selecting a normalized version of the noncanonical communication from the plurality of candidate canonical versions.

12. The method of claim 11, wherein the predefined number of adjacent characters is two and the maximum number of skipped characters is one.

13. The method of claim 12, further comprising: selecting the normalized version of the noncanonical communication from the subset of the plurality of candidate canonical versions by: determining a confidence score for each candidate canonical version of the subset using a classifier; and selecting a candidate from the subset of the plurality of candidate canonical versi
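
Claims 3 through 6 describe scoring each candidate with a classifier and keeping the highest-scoring one. The Python sketch below illustrates that selection step under stated assumptions: the claims do not name a classifier type, so a hand-weighted linear score stands in for a trained model, and the weights, function names, and training pairs are hypothetical. The support and confidence values follow the definitions in claim 5, and the length difference follows claim 6 (used here as an absolute value, which is also an assumption).

```python
from __future__ import annotations

# Hypothetical training data: (noncanonical token, chosen canonical token) pairs.
TRAINING_PAIRS = [
    ("luv", "love"), ("luv", "love"), ("luv", "live"),
    ("u", "you"), ("love", "love"),
]

# Placeholder weights standing in for a trained classifier (assumption).
WEIGHTS = {"jaccard": 1.0, "support": 0.1, "confidence": 1.0, "length_diff": -0.05}


def support_value(candidate: str, pairs: list[tuple[str, str]]) -> int:
    """Number of times the candidate occurs as a canonical form in the training data."""
    return sum(1 for _, canonical in pairs if canonical == candidate)


def confidence_value(noncanonical: str, candidate: str,
                     pairs: list[tuple[str, str]]) -> float:
    """Times the candidate was chosen for this input, divided by the number of
    times the input appears in the training data."""
    present = sum(1 for raw, _ in pairs if raw == noncanonical)
    if present == 0:
        return 0.0
    selected = sum(1 for raw, canonical in pairs
                   if raw == noncanonical and canonical == candidate)
    return selected / present


def confidence_score(noncanonical: str, candidate: str, jaccard: float,
                     pairs: list[tuple[str, str]]) -> float:
    """Linear stand-in for the classifier's confidence score."""
    features = {
        "jaccard": jaccard,
        "support": support_value(candidate, pairs),
        "confidence": confidence_value(noncanonical, candidate, pairs),
        "length_diff": abs(len(candidate) - len(noncanonical)),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


def select_normalized(noncanonical: str, jaccard_by_candidate: dict[str, float],
                      pairs: list[tuple[str, str]]) -> str:
    """Return the candidate with the highest confidence score."""
    return max(
        jaccard_by_candidate,
        key=lambda cand: confidence_score(
            noncanonical, cand, jaccard_by_candidate[cand], pairs),
    )


if __name__ == "__main__":
    print(select_normalized("luv", {"love": 1 / 7, "live": 1 / 7}, TRAINING_PAIRS))
```

In this toy data both candidates have the same Jaccard index, so the support and confidence values decide the outcome in favour of "love". A real system would presumably learn how to combine these features from labelled data rather than use fixed weights.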

Assignees

Inventors

Classifications

  • G06N7/01 · Primary

    Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Physics · mapped topic

  • Real-time or near real-time messaging, e.g. instant messaging [IM] · CPC title

  • G06N7/005 · Primary

    Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Sas Inst Inc
What technology area does this patent fall under?
Primary CPC classification G06N7/01. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 8, 2016 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).