Regular expression generation and screening of textual items

US10956522B1 · US · B1

Patent metadata
Publication number: US-10956522-B1
Application number: US-201816003770-A
Country: US
Kind code: B1
Filing date: Jun 8, 2018
Priority date: Jun 8, 2018
Publication date: Mar 23, 2021
Grant date: Mar 23, 2021

Abstract

Official abstract text for this publication.

An online system enforces policies to content items that are distributed on its platform and blocks content items that violate one or more of those policies. To identify content items that are slightly varied from each other, the online system generates an embedding for each of the known content items that have already been determined to be noncompliant with one or more policies. The online system then groups the known noncompliant content items that are clustered together in the embedding space. The texts of the group of known noncompliant content items are converted to finite state automata and are merged to generate a common automaton. The common automaton is used to generate a common regular expression that is used to screen new content items. When a new content item matches the textual pattern defined by the common regular expression, the system may block the new content item.
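The embedding and grouping steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the toy vectors in `WORD_VECTORS`, the `0.1` distance threshold, and the greedy single-pass grouping are all assumptions made for the example. A real system would load trained word vectors and would likely use a more robust clustering method.

```python
import math

# Hypothetical toy word vectors (assumption); a real system would load
# vectors trained on word-to-word co-occurrence statistics.
WORD_VECTORS = {
    "cheap": [0.9, 0.1, 0.0],
    "pills": [0.8, 0.2, 0.1],
    "discount": [0.85, 0.15, 0.05],
    "meds": [0.75, 0.25, 0.1],
    "win": [0.1, 0.9, 0.2],
    "lottery": [0.05, 0.85, 0.25],
    "prize": [0.0, 0.8, 0.3],
}
DIM = len(next(iter(WORD_VECTORS.values())))


def embed(text):
    """Average the vectors of the known words in the text."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0] * DIM  # no known words: fall back to the zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster(texts, threshold=0.1):
    """Greedy grouping: join the first cluster whose seed is close enough."""
    clusters = []  # list of (seed_embedding, member_texts)
    for text in texts:
        vector = embed(text)
        for seed, members in clusters:
            if cosine_distance(seed, vector) <= threshold:
                members.append(text)
                break
        else:
            clusters.append((vector, [text]))
    return [members for _, members in clusters]


known_noncompliant = ["cheap pills", "discount meds", "cheap meds", "win lottery prize"]
groups = cluster(known_noncompliant)
print(groups)
```

With these toy vectors, the three medication-themed texts land in one group and the lottery text in another, which is the kind of grouping the later regular-expression step would operate on.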

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: accessing a data store containing a plurality of content items determined to be noncompliant with one or more policies of an online system; generating an embedding for each of the plurality of content items using text of the content item, each embedding generated being a mathematical vector that represents semantic characteristics of textual content of the content item relative to other content items in an embedding space, the semantic characteristics determined from word-to-word co-occurrence statistics of the text; grouping two or more of the content items into a cluster based on distances among the embeddings corresponding to the two or more content items in the embedding space, the distances determined from the semantic characteristics of each of the content items according to the word-to-word co-occurrence statistics of the text of each of the content items; extracting one or more strings of words from the textual content of one or more of the content items in the cluster; creating a common regular expression that represents the one or more strings of words; receiving a new content item for distribution by the online system; screening the new content item by applying the common regular expression to determine whether the new content item matches the common regular expression; and responsive to determining that the new content item matches the common regular expression, withholding the content item from users of the online system.

2. The method of claim 1, further comprising: further responsive to determining that the new content item matches the common regular expression, adding the new content item to the data store that contains the plurality of content items determined to be noncompliant.

3. The method of claim 1, further comprising: applying the common regular expression to a plurality of known compliant content items that are determined to be compliant with the one or more policies of the online system; determining a match rate of the common regular expression with respect to the known compliant content items; and responsive to the match rate being higher than a predetermined threshold, removing the common expression from screening the new content item.

4. The method of claim 1, wherein generating the embedding for each of the plurality of content items comprises taking an average of word vectors corresponding to the textual content of each of the content item, the average being the mathematical vector.

5. The method of claim 1, wherein generating the embedding for each of the plurality of content items comprises: providing the textual content as input to a deep neural network; and determining the embedding representing the textual content based on an output of the deep neural network.

6. The method of claim 1, wherein the one or more strings of words of each of the content items comprises the entire textual content of the content item.

7. The method of claim 1, wherein creating the common regular expression comprises: determining a regular expression for the one or more strings of words for each of the content items; generating a plurality of automata, each of the plurality of automata corresponding to each of the regular expression; merging the plurality of automata into a common automaton; and generating the common regular expression based on the common automaton.

8. The method of claim 1, wherein the new content item comprises a landing page of a third party web site.

9. The method of claim 1, wherein the new content item comprises an advertisement.

10. A non-transitory computer readable storage medium configured to store program code, the program code comprising instructions that, when executed by a processor, cause the processor to: access a data store containing a plurality of content items determined to be noncompliant with one or more policies of an online system; generate an embedding for each of the plurality of content items using text of the content item, each embedding generated being a mathematical vector that represents semantic characteristics of textual content of the content item relative to other content items in an embedding space, the semantic characteristics determined from word-to-word co-occurrence statistics of the text; group two or more of the content items into a cluster based on distances among the embeddings corresponding to the two or more content items in the embedding space, the distances determined from the semantic characteristics of each of the content items according to the word-to-word co-occurrence statistics of the text of each of the content items; extract one or more strings of words from the textual content of one or more of the content items in the cluster; create a common regular expression that represents the one or more strings of words; receive a new content item for distribution by the online system; screen the new content item by applying the common regular expression to determine whether the new content item matches the common regular expression; and responsive to determining that the new content item matches the common regular expression, withhold the content item from users of the online system.

11. The non-transitory computer readable storage medium of claim 10, wherein the program code further causes the processor to, further responsive to determining that the new content item matches the common regular expression, add the new content item to the data store that contains the plurality of content items determined to be noncompliant.

12. The non-transitory computer readable storage medium of claim 10, wherein the program code further causes the processor to: apply the common regular expression to a plurality of known compliant content items that are determined to be compliant with the one or more policies of the online system; determine a match rate of the common regular expression with respect to the known compliant content items; and responsive to the match rate being higher than a predetermined threshold, remove the common expression from screening the new content item.

13. The non-transitory computer readable storage medium of claim 10, wherein generating the embedding for each of the plurality of content items comprises taking an average of word vectors corresponding to the textual content of each of the content item, the average being the mathematical vector.

14. The non-transitory computer readable storage medium of claim 10, wherein generating the embedding for each of the plurality of content items comprises: providing the textual content as input to a deep neural network; and determining the embedding representing the textual content based on an output of the deep neural network.

15. The non-transitory computer readable storage medium of claim 10, wherein the one or more strings of words of each of the content items comprises the entire textual content of the content item.

16. The non-transitory computer readable storage medium of claim 10, wherein creating the common regular expression comprises: determining a regular expression for the one or more strings of words for each of the content items; generating a plurality of automata, each of the plurality of automata corresponding to each of the regular expression; merging the plurality of automata into a common automaton; and generating the common regular expression based on the common automaton.

17. The non-transitory computer readable storage medium of claim 10, wherein the new content item comprises a landing page of a third party website.
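
The regular-expression steps in the claims (merging automata into a common automaton, generating a common regular expression, screening new items, and checking the match rate against known compliant items) might look something like the sketch below. It is an illustration under stated assumptions, not the claimed method itself: it uses a word-level prefix tree as a simple stand-in for the merged automaton, so the resulting pattern matches only the exact word sequences seen in the cluster. The sample texts, the helper names, and the `MAX_COMPLIANT_MATCH_RATE` value are hypothetical.

```python
import re


def build_trie(strings):
    """Merge word sequences into a prefix tree (a simple deterministic automaton)."""
    root = {}
    for text in strings:
        node = root
        for word in text.lower().split():
            node = node.setdefault(word, {})
        node[None] = {}  # the None key marks an accepting state
    return root


def trie_to_regex(node, first=True):
    """Convert the automaton rooted at `node` into a regex fragment."""
    separator = "" if first else r"\s+"
    words = sorted(w for w in node if w is not None)
    branches = [
        separator + re.escape(w) + trie_to_regex(node[w], first=False)
        for w in words
    ]
    if not branches:
        return ""
    body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
    if None in node:  # a string may also end here, so the remainder is optional
        body = "(?:" + body + ")?"
    return body


def match_rate(pattern, items):
    """Fraction of items in which the pattern is found."""
    if not items:
        return 0.0
    return sum(1 for item in items if pattern.search(item)) / len(items)


# Hypothetical cluster of known noncompliant texts (assumption).
cluster_texts = ["buy cheap pills now", "buy cheap meds now", "buy discount pills"]
common_regex = re.compile(
    r"(?<!\w)" + trie_to_regex(build_trie(cluster_texts)) + r"(?!\w)",
    re.IGNORECASE,
)


def screen(text):
    """Return True if the new item matches and should be withheld."""
    return common_regex.search(text) is not None


# Hypothetical known compliant texts and threshold (assumptions).
known_compliant = ["fresh bread baked daily", "join our book club",
                   "new shoes in stock", "visit the city museum"]
MAX_COMPLIANT_MATCH_RATE = 0.25
keep_regex = match_rate(common_regex, known_compliant) <= MAX_COMPLIANT_MATCH_RATE

print(common_regex.pattern)
print(screen("Limited offer: BUY   cheap meds now!!!"), keep_regex)
```

Here a new item with different casing, extra whitespace, and surrounding text is still caught, while the pattern is kept because it matches none of the compliant samples. A pattern whose compliant match rate exceeded the threshold would be dropped from screening, mirroring claims 3 and 12.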

Assignees

Facebook Inc

Inventors

Classifications

  • Feedforward networks · CPC title

  • Supervised learning · CPC title

  • Creation or modification of classes or clusters · CPC title

  • for managing network security; network security policies in general (filtering policies H04L63/0227) · CPC title

  • G06N20/00 (Primary): Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10956522B1 cover?
An online system enforces policies to content items that are distributed on its platform and blocks content items that violate one or more of those policies. To identify content items that are slightly varied from each other, the online system generates an embedding for each of the known content items that have already been determined to be noncompliant with one or more policies. The online sys…
Who is the assignee on this patent?
Facebook Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 23, 2021 (B1). Legal status and post-grant events are not shown on this page.