Systems and methods for multilingual document filtering

US9984068B2 · US · B2

Patent metadata
Publication number: US-9984068-B2
Application number: US-201514858413-A
Country: US
Kind code: B2
Filing date: Sep 18, 2015
Priority date: Sep 18, 2015
Publication date: May 29, 2018
Grant date: May 29, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, apparatus, computer-readable media, and methods to provide filtering and/or search based at least in part on semantic representations of words in a document subject to the filtering and/or search are disclosed. Furthermore key words for conducting the filtering and/or search, such as taboo words and/or search terms, may be semantically compared to the semantic representation of the words in the document. A common semantic vector space, such as a base language semantic vector space, may be used to compare the key word semantic vectors and the semantic vectors of the words of the document, regardless of the native language in which the document is written or the language in which the key words are provided.
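
The comparison in a common semantic vector space can be illustrated with a minimal sketch. It assumes, as claims 3 and 8 recite, that a native language-to-base language translation matrix maps a word's native-language vector into the base language space, where it is compared with the key word vector. The 2-D vectors, the matrix values, and the helper names `mat_vec` and `cosine_distance` are hypothetical and chosen only for illustration; real semantic vectors would have many more dimensions and the matrix would be learned, not hand-written.

```python
import math

def mat_vec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cosine_distance(a, b):
    """Return 1 - cosine similarity; values near 0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm

# Hypothetical toy data (assumption, not taken from the patent):
# the translation matrix simply swaps the two axes.
NATIVE_TO_BASE = [[0.0, 1.0],
                  [1.0, 0.0]]
native_word_vec = [0.2, 0.9]    # word vector in its native language space
keyword_base_vec = [1.0, 0.2]   # key word vector already in the base space

# Map the document word into the base language space, then compare there.
base_word_vec = mat_vec(NATIVE_TO_BASE, native_word_vec)
distance_in_base_space = cosine_distance(base_word_vec, keyword_base_vec)

# Comparing without the transformation mixes two unrelated coordinate systems.
distance_without_mapping = cosine_distance(native_word_vec, keyword_base_vec)
```

In this toy setup the word and the key word end up close together only after the word vector has been mapped into the base space, which is the reason for comparing in a single shared space.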

First claim

Opening claim text (preview).

The claimed invention is:

1. One or more non-transitory computer-readable medium comprising computer-executable instruction that, when executed by one or more processors, cause the one or more processors to at least: in response to receiving electronic content to be delivered to a destination address, identify a first word in the electronic content and a second word in the electronic content; determine a first base language semantic vector of the first word; determine a second base language semantic vector of the second word; determine, for a keyword, a key word base language semantic vector, the keyword being a taboo word; determine a first distance between the first base language semantic vector and the key word base language semantic vector; determine a second distance between the second base language semantic vector and the key word base language semantic vector; determine that the first distance is less than a threshold distance; determine that the second distance is less than the threshold distance; determine a sum of the first distance and the second distance; determine a score of the electronic content based at least in part on the sum, wherein the score indicates a relevance of the electronic content to the key word; determine that the electronic content is not to be delivered to the destination address based at least in part on the score of the electronic content; and prevent the electronic content from being delivered to the destination address.

2. The one or more non-transitory computer-readable medium of claim 1, wherein the computer-executable instructions further cause the one or more processors to sequester the electronic content, when the electronic content is not to be delivered to the destination address.

3. The one or more non-transitory computer-readable medium of claim 1, wherein the determining of the first base language semantic vector includes: determining a native language semantic vector corresponding to the first word; and transforming, based at least in part on a native language-to-base language translation matrix, the native language semantic vector to the first base language semantic vector.

4. The one or more non-transitory computer-readable medium of claim 1, wherein the determining of the key word base language semantic vector includes: determining a key word native language semantic vector corresponding to the key word; and transforming, based at least in part on a native language-to-base language translation matrix, the key word native language semantic vector to the key word base language semantic vector.

5. The one or more non-transitory computer-readable medium of claim 1, wherein the determining of the first distance includes determining at least one of: a cosine distance between the first base language semantic vector and the key word base language semantic vector, or an Euclidean distance between the first base language semantic vector and the key word base language semantic vector.

6. The one or more non-transitory computer-readable medium of claim 1, wherein the computer-executable instructions further cause the one or more processors to: determine a first relevance between a first training document and the key word, the first training document having a first known filtering status; determine a second relevance between a second training document and the key word, the second training document having a second known filtering status; determine, with a filtering model, a filtering status for a plurality of training documents based at least in part on the first relevance and the second relevance; compare the filtering status to the first known filtering status; compare the filtering status to the second known filtering status; and train the filtering model based at least in part on a result of the comparing of the filtering status to the first known filtering status and the comparing of the filtering status to the second known filtering status.

7. A system, comprising: at least one memory that stores computer-executable instructions; and at least one processor to access the at least one memory, the computer-executable instructions, when executed, to cause the at least one processor to at least: in response to receiving electronic content to be delivered to a destination address, determine a first base language semantic vector corresponding to a first word in the electronic content; determine a second base language semantic vector corresponding to a second word in the electronic content; determine, for a key word, a key word base language semantic vector, the key word being a taboo word; determine a set of distance data including a first distance between the key word base language semantic vector and the first base language semantic vector, and a second distance between the key word base language semantic vector and the second base language semantic vector; determine whether the first distance and the second distance are less than a threshold distance; when the first and second distances are less than the threshold distance, add the first distance and the second distance to obtain a sum; determine a score of the electronic content based at least in part on the sum, wherein the score indicates a relevance of the electronic content to the key word; and determine that the electronic content is not to be delivered to the destination address based at least in part on the score of the electronic content; and prevent the electronic content from being delivered to the destination address.

8. The system of claim 7, wherein the computer-executable instructions further cause the at least one processor to determine the first base language semantic vector by: determining a first native language semantic vector corresponding to the first word, wherein the first word is in a native language and the first native language semantic vector is defined in a native language semantic vector space corresponding to a native language of the first word; identifying a native language-to-base language translation matrix corresponding to the native language; and transforming, based at least in part on the native language-to-base language translation matrix, the first native language semantic vector to the first base language semantic vector.

9. The system of claim 7, wherein the key word is associated with at least one of: pornography; sexually explicit content; violent content; adult content; gambling related content; gaming related content; or violent content.

10. The system of claim 7, wherein the computer-executable instructions further cause the at least one processor to determine a key word base language semantic vector by identifying that the key word is received in a base language corresponding to the key word base language semantic vector.

11. The system of claim 7, wherein the electronic content is first electronic content, the first electronic content includes a first document, the first document includes a first plurality of words, the set of distance data is a first set of distance data, and the computer-executable instructions further cause the at least one processor to: determine a third base language semantic vector corresponding to a third word included in a second document included in second electronic content; determine a fourth base language semantic vector corresponding to a fourth word included in the second document; determine a second set of distance data corresponding to the second document, wherein the second set of distance data includes a third distance between the third base language semantic vector and the key word base language semantic vector, and a fourth distance between the fourth base language s
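
The filtering steps recited in claims 1 and 7 (per-word distance to the taboo key word, threshold test, sum of the qualifying distances, score, delivery decision) can be sketched roughly as follows. The claims say only that the score is based at least in part on the sum, so the scoring rule used here (number of near words times the threshold, minus the sum of their distances) is an assumption made for illustration, as are the function names, the `block_score` cutoff, and the toy vectors. Euclidean distance is used because claim 5 names it as one option.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def score_content(word_vectors, keyword_vector, threshold):
    """Score a document against one key word.

    Only words closer to the key word than `threshold` contribute.
    Assumed rule (not specified by the claims): each near word adds
    (threshold - distance), i.e. len(near) * threshold - sum(near).
    """
    distances = (euclidean_distance(v, keyword_vector) for v in word_vectors)
    near = [d for d in distances if d < threshold]
    distance_sum = sum(near)
    return len(near) * threshold - distance_sum

def should_block(word_vectors, keyword_vector, threshold, block_score):
    """Return True when the content should not be delivered."""
    return score_content(word_vectors, keyword_vector, threshold) >= block_score

# Hypothetical base language vectors for illustration only.
taboo_keyword = [1.0, 0.0]
suspicious_doc = [[1.0, 0.0], [1.0, 0.5], [-1.0, 0.0]]  # two words near the key word
benign_doc = [[-1.0, 0.0], [0.0, 3.0]]                  # no word near the key word

suspicious_score = score_content(suspicious_doc, taboo_keyword, threshold=1.0)
benign_score = score_content(benign_doc, taboo_keyword, threshold=1.0)
block_suspicious = should_block(suspicious_doc, taboo_keyword, 1.0, block_score=1.0)
block_benign = should_block(benign_doc, taboo_keyword, 1.0, block_score=1.0)
```

Under this assumed rule a higher score means more words lie close to the taboo key word, so a document with no near words scores zero and is delivered. A different monotone function of the sum would fit the claim language equally well.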

Assignees

McAfee Inc, McAfee LLC

Inventors

Classifications

  • Search customisation based on user profiles and personalisation

  • Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

  • Language identification

  • using vector based model

  • Translation of the query language, e.g. Chinese to English

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9984068B2 cover?
Systems, apparatus, computer-readable media, and methods to provide filtering and/or search based at least in part on semantic representations of words in a document subject to the filtering and/or search are disclosed. Furthermore key words for conducting the filtering and/or search, such as taboo words and/or search terms, may be semantically compared to the semantic representation of the wor…
Who is the assignee on this patent?
McAfee Inc, McAfee LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/3337. Mapped technology areas include Physics.
When was this patent published?
Publication date May 29, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).