Spam classification system based on network flow data
US-2017359362-A1 · Dec 14, 2017 · US
US11048769B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11048769-B2 |
| Application number | US-201916430292-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 3, 2019 |
| Priority date | Aug 19, 2016 |
| Publication date | Jun 29, 2021 |
| Grant date | Jun 29, 2021 |
Official abstract text for this publication.
A digital magazine server displays content items from various sources to users of client devices. Each source of a content item is identified by a domain, and content items for different sources have different domain-level quality. To differentiate sources of content items, the domains identifying the sources are ranked based on domain scores of the domains generated by an aggregate of multiple trained domain classifiers. A domain score of a domain indicates a domain-level quality of content items provided by a source identified by the domain. Each of the trained domain classifiers (e.g., a naïve Bayes classifier, a random forest classifier, and a logistic regression classifier) generates a prediction of whether a domain is a spam domain based on the domain features and domains with known labels. Based on the domain scores of domains, the domain ranking module may adaptively select content items from the sources with corresponding domains scores.
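The flow in the abstract (identify each source by its domain, combine several classifiers' spam predictions, then rank the domains) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the example URLs, and the per-classifier probabilities are all hypothetical, and using a plain mean as the aggregate is an assumption, since the abstract does not say how the predictions are combined.

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host name that identifies a content item's source."""
    return urlparse(url).hostname or ""


def group_by_domain(item_urls):
    """Map each domain to the content-item URLs it supplied."""
    groups = defaultdict(list)
    for url in item_urls:
        groups[domain_of(url)].append(url)
    return dict(groups)


def rank_domains(spam_predictions):
    """Rank domains from least to most spam-like.

    spam_predictions maps a domain to the spam probabilities returned by
    the individual trained classifiers (made-up values in this sketch).
    Averaging them is an assumed aggregation rule.
    """
    aggregate = {d: mean(p) for d, p in spam_predictions.items()}
    return sorted(aggregate, key=aggregate.get)


# Hypothetical content items and classifier outputs.
urls = [
    "https://news.example.com/a",
    "https://news.example.com/b",
    "https://clickbait.example.net/win",
]
groups = group_by_domain(urls)

predictions = {
    "news.example.com": [0.05, 0.10, 0.15],
    "clickbait.example.net": [0.90, 0.80, 0.70],
}
ranking = rank_domains(predictions)
```

In a real system the three probabilities per domain would come from trained models (the abstract names naïve Bayes, random forest, and logistic regression as examples) rather than from literals.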
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving a plurality of content items of digital magazines and user interactions with the plurality of content items; obtaining domain features from the plurality of content items, a domain feature of a content item related to a domain identifying a source of the content item; identifying one or more domains based on the obtained domain features; for each identified domain: applying a first trained domain classifier to an identified domain to generate a first prediction of the identified domain being a spam domain; applying a second trained domain classifier to the identified domain to generate a second prediction of the identified domain being the spam domain; applying a third trained domain classifier to the identified domain to generate a third prediction of the identified domain being the spam domain; generating an aggregate prediction based on the first prediction, the second prediction and the third prediction of the identified domain being the spam domain; generating a confusion score of the identified domain based on standard deviation of the first prediction, the second prediction and the third prediction of the identified domain being the spam domain; and generating a domain score for the identified domain based on the aggregate prediction of the identified domain being the spam domain and the confusion score of the identified domain, the domain score for the identified domain indicating a domain-level quality of content items provided by a source identified by the identified domain; and ranking the identified one or more domains based on domains scores associated with the identified domains.
2. The method of claim 1, wherein the domain features from the plurality of content items comprise at least one of: average click through rate of the plurality of content items; average length of the plurality of content items; a ratio of likes to dislikes of the content items expressed by users of the plurality of content items; a percentage of content items that are tagged Not Safe For Work; and a percentage of views with less than a threshold percentage of completion.
3. The method of claim 1, wherein the user interactions with the plurality of content items include at least one of: commenting on one or more content items of the plurality of content items by users of the digital magazines; sharing universal resources locators (URLs) of one or more content items of the plurality of content items among the users of the digital magazines; accessing one or more content items of the plurality of content items; flipping, dragging or resizing a page presenting one or more content items of the plurality of content items in a digital magazine; and expressing a preference for a content item of the plurality of the content items.
4. The method of claim 1, wherein the first trained domain classifier is a classifier trained using random forest technique to generate the first prediction of the identified domain being a spam domain, the second trained domain classifier is a classifier trained using logistic regression technique to generate the second prediction of the identified domain being a spam domain, and the third trained domain classifier is a classifier trained using naïve Bayes technique to generate the third prediction of the identified domain being a spam domain.
5. The method of claim 1, wherein the first trained domain classifier, the second trained domain classifier, and the third trained domain classifier are each trained using a plurality of domain training data including the domain features extracted from the plurality of content items of the digital magazines and domain names having known labels indicating a type of the domain.
6. The method of claim 5, wherein a type of the domain is selected from a group of domain types consisting of: spam domain, unlabeled domain, major partner domain, minor partner domain, and whitelisted domain.
7. The method of claim 1, further comprising: adaptively selecting content items for publishing on the digital magazines from sources identified by corresponding domains based on domain scores of the corresponding domains; and presenting the selected content items on the digital magazines.
8. The method of claim 7, wherein adaptively selecting content items from sources identified by corresponding domains comprises: responsive to a domain having a domain score lower than a first threshold, blocking content items provided by a source identified by the domain; and responsive to a domain having a domain score higher than a second threshold, increasing number of content items for publishing from a source identified by the domain.
9. A non-transitory computer-readable storage medium storing executable computer program instructions, the computer program instructions when executed by a computer processor cause the computer processor to: receive a plurality of content items of digital magazines and user interactions with the plurality of content items; obtain domain features from the plurality of content items, a domain feature of a content item related to a domain identifying a source of the content item; identify one or more domains based on the obtained domain features; for each identified domain: apply a first trained domain classifier to an identified domain to generate a first prediction of the identified domain being a spam domain; apply a second trained domain classifier to the identified domain to generate a second prediction of the identified domain being the spam domain; apply a third trained domain classifier to the identified domain to generate a third prediction of the identified domain being the spam domain; generate an aggregate prediction based on the first prediction, the second prediction and the third prediction of the identified domain being the spam domain; generate a confusion score of the identified domain based on standard deviation of the first prediction, the second prediction and the third prediction of the identified domain being the spam domain; and generate a domain score for the identified domain based on the aggregate prediction of the identified domain being the spam domain and the confusion score of the identified domain, the domain score for the identified domain indicating a domain-level quality of content items provided by a source identified by the identified domain; and rank the identified one or more domains based on domains scores associated with the identified domains.
10. The computer-readable storage medium of claim 9, wherein the domain features from the plurality of content items comprise at least one of: average click through rate of the plurality of content items; average length of the plurality of content items; a ratio of likes to dislikes of the content items expressed by users of the plurality of content items; a percentage of content items that are tagged Not Safe For Work; and a percentage of views with less than a threshold percentage of completion.
11. The computer-readable storage medium of claim 9, wherein the user interactions with the plurality of content items include at least one of: commenting on one or more content items of the plurality of content items by users of the digital magazines; sharing universal resources locators (URLs) of one or more content items of the plurality of content items among the users of the digital magazines; accessing one or more content items of the plurality of content items; flipping, dragging or resizing a page presenting one or more content items of the plurality of content items in a digital magazine; and expressing a preference for a conten
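The scoring steps recited in claims 1 and 8 (a confusion score from the standard deviation of the three predictions, a domain score built from the aggregate prediction and that confusion score, and two thresholds that drive blocking or boosting) might look something like the sketch below. The claims do not give formulas, so several things here are assumptions made only for illustration: the aggregate is a mean, the standard deviation is the population form, the domain score is `(1 - aggregate) * (1 - confusion)` so that higher means better quality, and the threshold values 0.3 and 0.7 are arbitrary. All function names are hypothetical.

```python
from statistics import mean, pstdev


def score_domain(predictions):
    """Return (aggregate, confusion, domain_score) for one domain.

    predictions holds the spam probabilities from the three classifiers.
    """
    aggregate = mean(predictions)      # assumed aggregation rule
    confusion = pstdev(predictions)    # disagreement among classifiers
    # Assumed combination: quality is the non-spam probability, discounted
    # when the classifiers disagree. The claims leave this unspecified.
    domain_score = (1.0 - aggregate) * (1.0 - confusion)
    return aggregate, confusion, domain_score


def select_action(domain_score, block_below=0.3, boost_above=0.7):
    """Map a domain score to an action using two assumed thresholds."""
    if domain_score < block_below:
        return "block"   # stop publishing items from this source
    if domain_score > boost_above:
        return "boost"   # publish more items from this source
    return "keep"


# Hypothetical classifier outputs for three domains.
examples = {
    "spammy.example": [0.9, 0.9, 0.9],
    "trusted.example": [0.1, 0.1, 0.1],
    "unclear.example": [0.2, 0.5, 0.8],
}
scores = {d: score_domain(p)[2] for d, p in examples.items()}
actions = {d: select_action(s) for d, s in scores.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # best quality first
```

With these made-up inputs, the domain on which the classifiers disagree lands between the two thresholds, which is the case the confusion score is meant to surface.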
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Ensemble learning · CPC title
Search customisation based on user profiles and personalisation · CPC title
Navigation, e.g. using categorised browsing · CPC title
using probabilistic model · CPC title