Spam classification system based on network flow data

US10397256B2 · US · B2

Patent metadata

Publication number: US-10397256-B2
Application number: US-201615365008-A
Country: US
Kind code: B2
Filing date: Nov 30, 2016
Priority date: Jun 13, 2016
Publication date: Aug 27, 2019
Grant date: Aug 27, 2019

Abstract

Official abstract text for this publication.

In an example embodiment, a computer-implemented method comprises obtaining labels from messages associated with an email service provider, wherein the labels indicate for each message IP how many spam and non-spam messages have been received; obtaining network data features from a cloud service provider; providing the labels and network data features to a machine learning application; generating a prediction model representing an algorithm for determining whether a particular set of network data features are spam or not; applying the prediction model to network data features for an unlabeled message; and generating an output of the prediction model indicating a likelihood that the unlabeled message is spam.
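
The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the IP addresses, counts, flow fields, the majority-vote rule for turning counts into a binary label, and the choice of a small logistic regression are all assumptions made for illustration. The sketch joins per-IP labels to flow features on the source IP, trains a model, and scores an unlabeled flow without reading any message content.

```python
import math
from collections import defaultdict

# Hypothetical labels from the email service provider:
# source IP -> (spam messages received, non-spam messages received).
labels = {
    "10.0.0.1": (90, 2),
    "10.0.0.2": (0, 40),
    "10.0.0.3": (55, 5),
    "10.0.0.4": (1, 70),
}

# Hypothetical network data features from the cloud service provider,
# one record per connection (field names are illustrative).
flows = [
    {"src_ip": "10.0.0.1", "dst_port": 25, "protocol": "tcp", "tcp_flags": "S"},
    {"src_ip": "10.0.0.2", "dst_port": 443, "protocol": "tcp", "tcp_flags": "SAF"},
    {"src_ip": "10.0.0.1", "dst_port": 25, "protocol": "tcp", "tcp_flags": "SA"},
    {"src_ip": "10.0.0.4", "dst_port": 443, "protocol": "tcp", "tcp_flags": "SAF"},
    {"src_ip": "10.0.0.3", "dst_port": 25, "protocol": "tcp", "tcp_flags": "S"},
    {"src_ip": "10.0.0.4", "dst_port": 53, "protocol": "udp", "tcp_flags": ""},
]


def featurize(flow):
    """One-hot encode a flow as a sparse {feature_name: 1.0} mapping."""
    return {
        f"dst_port={flow['dst_port']}": 1.0,
        f"protocol={flow['protocol']}": 1.0,
        f"tcp_flags={flow['tcp_flags']}": 1.0,
    }


def build_training_set(labels, flows):
    """Join flows to labels on source IP.

    Assumption: an IP counts as a spam source when most of its mail was spam.
    """
    rows = []
    for flow in flows:
        counts = labels.get(flow["src_ip"])
        if counts is None:
            continue  # no label for this IP, so it cannot be used for training
        spam, non_spam = counts
        rows.append((featurize(flow), 1.0 if spam > non_spam else 0.0))
    return rows


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, epochs=200, lr=0.5):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for features, y in rows:
            z = bias + sum(weights[k] * v for k, v in features.items())
            err = sigmoid(z) - y
            bias -= lr * err
            for k, v in features.items():
                weights[k] -= lr * err * v
    return dict(weights), bias


def predict(model, flow):
    """Return the model's estimated likelihood that the flow is spam."""
    weights, bias = model
    z = bias + sum(weights.get(k, 0.0) * v for k, v in featurize(flow).items())
    return sigmoid(z)


model = train(build_training_set(labels, flows))
unlabeled = {"src_ip": "10.0.0.9", "dst_port": 25, "protocol": "tcp", "tcp_flags": "S"}
print(f"spam likelihood: {predict(model, unlabeled):.2f}")
```

The sparse dictionary encoding stands in for the sparse matrix mentioned in claims 5 and 13; a production system would likely use a dedicated machine learning library and far richer features.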

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for sharing data between at least an email service provider and a cloud service provider in order to identify network spamming message patterns without accessing spamming message content, the method comprising: obtaining labels from messages associated with an email service provider, wherein the labels indicate for each message IP address how many spam and non-spam messages have been received; obtaining network data features from a cloud service provider; providing the labels and the network data features to a machine learning application, wherein the machine learning application identifies correlations between IP addresses associated with the labels and IP addresses associated with the network data features, the correlations being used to facilitate the machine learning application in generating a prediction model to detect spamming hosts that generate spamming messages; generating the prediction model representing an algorithm for determining whether a particular set of network data features are spam or not; and after an unlabeled message, which has not yet been characterized as spam or not as spam, is generated by a computing device of the cloud service provider and after the unlabeled message is received at a router of the cloud service provider in preparation for transmittal to a recipient computing device, applying the prediction model to the unlabeled message to determine whether the unlabeled message is spam or is not spam, wherein the network data features from the cloud service provider include descriptors of connections between the computing device that generated the unlabeled message and the recipient computing device, the descriptors including information describing a source and destination IP address, source and destination ports, a protocol type, and a union of TCP flags.

2. The computer-implemented method of claim 1, further comprising: generating an output of the prediction model indicating a likelihood that the unlabeled message is spam.

3. The computer-implemented method of claim 1, further comprising: obtain an updated set of labels from messages associated with the email service provider; and retrain the prediction models based upon the updated set of labels.

4. The computer-implemented method of claim 1, wherein a virtual machine residing on the computing device generated the unlabeled message, and wherein the method further comprises: when the unlabeled message is identified as being spam, labeling the virtual machine as spamming.

5. The computer-implemented method of claim 1, wherein the machine learning application is a trained learner having a classification algorithm that is used to predict spam from a sparse matrix created from the network data features.

6. The computer-implemented method of claim 1, wherein the network data features correspond to IPFIX data.

7. The computer-implemented method of claim 1, wherein the network data features comprise email metadata.

8. The computer-implemented method of claim 1, wherein the labels from messages associated with an email service provider are stored as a reputation dataset.

9. A machine-learning server comprising: one or more processor(s); and one or more computer-readable hardware storage device(s) having stored thereon computer-executable instructions that are executable by the one or more processor(s) to cause the machine-learning server to: obtain labels from messages associated with an email service provider, wherein the labels indicate for each message IP address how many spam and non-spam messages have been received; obtain network data features from a cloud service provider; provide the labels and the network data features to a machine learning application, wherein the machine learning application identifies correlations between IP addresses associated with the labels and IP addresses associated with the network data features, the correlations being used to facilitate the machine learning application in generating a prediction model to detect spamming hosts that generate spamming messages; generate the prediction model representing an algorithm for determining whether a particular set of network data features are spam or not; and after an unlabeled message, which has not yet been characterized as spam or not as spam, is generated by a computing device of the cloud service provider and after the unlabeled message is received at a router of the cloud service provider in preparation for transmittal to a recipient computing device, apply the prediction model to the unlabeled message to determine whether the unlabeled message is spam or is not spam, wherein the network data features from the cloud service provider include descriptors of connections between the computing device that generated the unlabeled message and the recipient computing device, the descriptors including information describing a source and destination IP address, source and destination ports, a protocol type, and a union of TCP flags.

10. The machine-learning server of claim 9, wherein execution of the computer-executable instructions further causes the machine-learning server to: generate an output of the prediction model indicating a likelihood that the unlabeled message is spam.

11. The machine-learning server of claim 9, wherein execution of the computer-executable instructions further causes the machine-learning server to: obtain an updated set of labels from messages associated with the email service provider; and retrain the prediction models based upon the updated set of labels.

12. The machine-learning server of claim 9, wherein execution of the computer-executable instructions further causes the machine-learning server to: forward the prediction model to a cloud management application for use in identifying spamming machines on a cloud service.

13. The machine-learning server of claim 9, wherein the machine learning application is a trained learner having a classification algorithm that is used to predict spam from a sparse matrix created from the network data features.

14. The machine-learning server of claim 9, wherein the network data features correspond to IPFIX data.

15. The machine-learning server of claim 9, wherein the network data features comprise email metadata.

16. The machine-learning server of claim 9, wherein the labels from the messages associated with the email service provider are stored as a reputation dataset.

17. The machine-learning server of claim 9, wherein the descriptors are included as flow-based metadata.

18. The machine-learning server of claim 9, wherein the descriptors are included as flow-based metadata, and wherein execution of the computer-executable instructions further causes the machine-learning server to: condense the flow-based metadata into flow records that capture data about the messages.

19. A computer-implemented method for sharing data between different services to identify network spamming patterns, the method comprising: receiving a prediction model representing an algorithm for determining whether a particular set of network data features are spam or not, wherein: the prediction model is generated from labels from messages associated with an email service provider and from network data features from a cloud service provider, the prediction model is generated by a machine learning application that identifies correlations between IP addresses associated with the labels and IP addresses associated with the network data features, and the co
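
The connection descriptors named in the claims (source and destination IP address, source and destination ports, protocol type, and a union of TCP flags) can be sketched as a flow record keyed by the 5-tuple. This is an illustrative sketch only: the class names, the `condense` helper, and the sample packets are hypothetical, and the patent does not specify this data layout. The flag bit values are the standard TCP header bits, and "union" is interpreted here as a bitwise OR over every packet in the flow.

```python
from dataclasses import dataclass

# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical 5-tuple identifying one direction of a connection."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass
class FlowRecord:
    key: FlowKey
    packets: int = 0
    tcp_flags: int = 0  # bitwise OR of the flags seen on every packet


def condense(packets):
    """Condense (FlowKey, flags) packet observations into flow records."""
    records = {}
    for key, flags in packets:
        record = records.setdefault(key, FlowRecord(key))
        record.packets += 1
        record.tcp_flags |= flags
    return list(records.values())


def flag_names(bits):
    """Expand a flag bitmask into readable names, lowest bit first."""
    table = [("FIN", FIN), ("SYN", SYN), ("RST", RST),
             ("PSH", PSH), ("ACK", ACK), ("URG", URG)]
    return [name for name, bit in table if bits & bit]


# Hypothetical packets from a cloud VM to a mail server on port 25.
smtp = FlowKey("10.0.0.1", "203.0.113.5", 49152, 25, "tcp")
observed = [(smtp, SYN), (smtp, ACK), (smtp, PSH | ACK), (smtp, FIN | ACK)]

for record in condense(observed):
    print(record.key.dst_port, record.packets, flag_names(record.tcp_flags))
```

Because only headers are aggregated, a record like this describes how a host communicates without exposing what any message says, which matches the claims' stated aim of detecting spam patterns without accessing message content.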

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title

  • Traffic logging, e.g. anomaly detection · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10397256B2 cover?
In an example embodiment, a computer-implemented method comprises obtaining labels from messages associated with an email service provider, wherein the labels indicate for each message IP how many spam and non-spam messages have been received; obtaining network data features from a cloud service provider; providing the labels and network data features to a machine learning application; generati…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification H04L63/1425. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Aug 27, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).