Method and system for training and validating machine learning in network environments

US11301778B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11301778-B2
Application number: US-201916359336-A
Country: US
Kind code: B2
Filing date: Mar 20, 2019
Priority date: Mar 21, 2018
Publication date: Apr 12, 2022
Grant date: Apr 12, 2022

Abstract

Official abstract text for this publication.

A system and method for training and validating ML algorithms in real networks, including: generating synthetic traffic and receiving it along with real traffic; aggregating the received traffic into network flows by using metadata and transforming them to generate a first dataset readable by the ML algorithm, comprising features defined by the metadata; labelling the traffic and selecting a subset of the features from the labelled dataset used in an iterative training to generate a trained model; filtering out a part of real traffic to obtain a second labelled dataset; and selecting a subset of features from the second labelled dataset used for validating the trained model by comparing predicted results for the trained model and the labels; repeating the steps with a different subset of features to generate another trained model until results are positive in terms of precision or accuracy.
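The train-validate-repeat loop described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the nearest-centroid classifier, the feature names `pkts` and `avg_bytes`, the labels, the 0.9 accuracy threshold, and the order in which feature subsets are tried are all assumptions made for illustration.

```python
from itertools import combinations


def train_centroids(rows, labels, features):
    """Train a toy nearest-centroid model on the chosen feature subset."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, name in enumerate(features):
            acc[i] += row[name]
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(model, row, features):
    """Return the label whose centroid is closest to the row."""
    def dist(centroid):
        return sum((row[f] - c) ** 2 for f, c in zip(features, centroid))
    return min(model, key=lambda lab: dist(model[lab]))


def accuracy(model, rows, labels, features):
    """Compare predicted results with the labels of the validation set."""
    hits = sum(predict(model, r, features) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def train_and_validate(train, validation, all_features, threshold=0.9):
    """Try feature subsets until validation accuracy reaches the threshold."""
    train_rows, train_labels = train
    val_rows, val_labels = validation
    for size in range(1, len(all_features) + 1):
        for subset in combinations(all_features, size):
            model = train_centroids(train_rows, train_labels, subset)
            score = accuracy(model, val_rows, val_labels, subset)
            if score >= threshold:  # "positive" comparison: stop iterating
                return subset, model, score
    return None  # no subset met the (assumed) threshold


# Hypothetical labelled flows: "pkts" is uninformative, "avg_bytes" separates the classes.
train = (
    [{"pkts": 10, "avg_bytes": 100}, {"pkts": 12, "avg_bytes": 900},
     {"pkts": 11, "avg_bytes": 120}, {"pkts": 9, "avg_bytes": 950}],
    ["benign", "malicious", "benign", "malicious"],
)
validation = (
    [{"pkts": 10, "avg_bytes": 110}, {"pkts": 11, "avg_bytes": 880}],
    ["benign", "malicious"],
)
subset, model, score = train_and_validate(train, validation, ["pkts", "avg_bytes"])
print(subset, score)
```

In this toy data the first subset tried (`pkts` alone) fails validation, so the loop moves on to a different subset, which mirrors the repetition step in the abstract.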

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for training and validating machine learning algorithms in real network environments wherein real traffic is moving between an internal network and an external network, the method comprising:
    (i) generating synthetic traffic;
    (ii) receiving packets of traffic comprising the real traffic and the generated synthetic traffic;
    (iii) aggregating the packets of the received traffic into network flows by using metadata;
    (iv) transforming each network flow to generate a first dataset with a format readable by a machine learning algorithm, the first dataset comprising features defined by the metadata for each network flow;
    (v) labelling the real traffic and the synthetic traffic to obtain a labelled dataset which comprises the first dataset and additional features defined as labels, wherein the labelling is done by matching features from the first dataset and features collected from the metadata, and assigning an unknown label if there is no match;
    (vi) selecting a subset of the features from the labelled dataset to be used in an iterative training to generate a trained model;
    (vii) filtering out a part of real traffic of the labelled dataset to obtain a second labelled dataset;
    (viii) selecting a subset of features used in the trained model from the second labelled dataset to be used for validating the trained model;
    (ix) comparing predicted results for the trained model and the labels obtained from labelling; and
    (x) if the comparison is negative, repeating steps (v)-(ix) with a different subset of features to generate another trained model.

2. The method according to claim 1, wherein the packets of the received traffic are aggregated by metadata extracted from the packets and selected from the group consisting of source IP address, destination IP address, source port, destination port, TCP sequence number, start date, end date, and any other value of the network packet.

3. The method according to claim 1, wherein the packets of the received traffic are aggregated by metadata calculated and selected from the group consisting of counters of packets per flow, list of protocol flags, average size of bytes, and any other statistical traffic information derived from the aggregation of the packets belonging to the same network flow.

4. The method according to claim 1, further comprising storing in a historic cache the aggregated network flow and including additional metadata from other previous related flows in a defined time window, the additional metadata being selected from the group consisting of number of identical flows, average number of packets per flow in the identical flows, time between identical flows, and any other features mathematically derived from the historic cache for a period of time in the past.

5. The method according to claim 1, further comprising: if the machine learning algorithm requires real traffic for training process, including a part of real traffic with the unknown label into the first dataset before selecting the features from the first dataset to be used in the iterative training; and otherwise, removing the real traffic labelled with the unknown label from the first dataset.

6. The method according to claim 1, further comprising removing the labels obtained from labelling which are not needed in training process if the machine learning algorithm is unsupervised, before selecting the features from the first dataset to be used in the iterative training.

7. The method according to claim 1, wherein the step of comparing is performed in terms of precision, accuracy, or any other criteria defined by configurable parameters.

8. A system for training and validating machine learning algorithms in real network environments comprising an internal network and an external network providing real traffic through a network device, wherein the system comprises at least one processor and memory configured to perform functions via appropriately encoded instructions in interaction with the network device, by implementing functional components that include:
    at least one generator of synthetic traffic;
    a probe configured for:
        receiving packets of traffic comprising the real traffic and the generated synthetic traffic through the network device;
        aggregating the packets of the received traffic into network flows by using metadata; and
        transforming each network flow to generate a first dataset with a format readable by a machine learning algorithm, the first dataset comprising features defined by the metadata for each network flow;
    a train and validation module comprising:
        a labeller configured for:
            (i) labelling the real traffic and the synthetic traffic to obtain a labelled dataset which comprises the first dataset and additional features defined as labels, wherein the labelling is done by matching features from the first dataset and features collected from the metadata, and assigning an unknown label where there is no match; and
        a feature extractor configured for:
            (ii) selecting a subset of the features from the labelled dataset to be used in an iterative training to generate a trained model;
            (iii) filtering out a part of real traffic of the labelled dataset to obtain a second labelled dataset;
            (iv) selecting a subset of features used in the trained model from the second labelled dataset to be used for validating the trained model; and
        the train and validation module being further configured for:
            (v) comparing predicted results for the trained model and the labels obtained from labelling; and, if the comparison is negative, repeating steps (i)-(v) with a different subset of features to generate another trained model.

9. The system according to claim 8, wherein the generator of synthetic traffic is selected from a synthetic user module and a synthetic server.

10. The system according to claim 8, wherein the probe aggregates packets of the received traffic by metadata extracted from the packets and selected from source IP address, destination IP address, source port, destination port, TCP sequence number, start date and end date or any other value of the network packet.

11. The system according to claim 8, wherein the probe aggregates packets of the received traffic by metadata calculated and selected from counters of packets per flow, list of protocol flags, average size of bytes and any other statistical traffic information derived from the aggregation of the packets belonging to the same network flow.

12. The system according to claim 8, further comprising a historic cache for storing the aggregated network flow and including additional metadata from other previous related flows in a defined time window, the additional metadata being selected from number of identical flows, average number of packets per flow in the identical flows, time between identical flows, and any other features mathematically derived from the historic cache for a period of time in the past.

13. The system according to claim 8, wherein the train and validation module is further configured for: if the machine learning algorithm requires real traffic for training process, including a part of real traffic with the unknown label into the first dataset before selecting the features from the first dataset to be used in the iterative training; otherwise, removing the real traffic labelled with the unknown label from the first dataset.

14. The system according to claim 8, wherein the train and validation module is further configured for removing the labels obtained from labelling which are not needed in training process if the machine learning algorithm is unsupervised, before selecting the
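As an illustration of the aggregation and labelling steps recited in the claims (packets grouped into flows by extracted metadata, per-flow statistics calculated, and an unknown label assigned where nothing matches), here is a minimal sketch. The packet dictionary layout, the choice of flow key, and the `known_sources` lookup keyed by source IP are hypothetical; the claims do not prescribe any of them.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Group packets into flows keyed by (src IP, dst IP, src port, dst port)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "flags": set(),
                                 "start": None, "end": None})
    for p in packets:
        key = (p["src_ip"], p["dst_ip"], p["src_port"], p["dst_port"])
        f = flows[key]
        f["packets"] += 1                      # counter of packets per flow
        f["bytes"] += p["size"]
        f["flags"].add(p["flag"])              # protocol flags seen in the flow
        f["start"] = p["ts"] if f["start"] is None else min(f["start"], p["ts"])
        f["end"] = p["ts"] if f["end"] is None else max(f["end"], p["ts"])

    # Transform each flow into one flat row that a learning algorithm can read.
    rows = []
    for (src_ip, dst_ip, src_port, dst_port), f in flows.items():
        rows.append({
            "src_ip": src_ip, "dst_ip": dst_ip,
            "src_port": src_port, "dst_port": dst_port,
            "packets": f["packets"],
            "avg_bytes": f["bytes"] / f["packets"],
            "flags": sorted(f["flags"]),
            "start": f["start"], "end": f["end"],
        })
    return rows


def label_flows(rows, known_sources):
    """Attach a label by matching the source IP; fall back to 'unknown'."""
    return [dict(row, label=known_sources.get(row["src_ip"], "unknown"))
            for row in rows]


# Hypothetical capture: two packets from a synthetic generator, one from real traffic.
packets = [
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.7", "src_port": 40000,
     "dst_port": 443, "size": 100, "flag": "SYN", "ts": 1.0},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.7", "src_port": 40000,
     "dst_port": 443, "size": 300, "flag": "ACK", "ts": 1.5},
    {"src_ip": "192.0.2.99", "dst_ip": "198.51.100.7", "src_port": 50000,
     "dst_port": 80, "size": 60, "flag": "SYN", "ts": 2.0},
]
labelled = label_flows(aggregate_flows(packets), {"192.0.2.10": "synthetic_user"})
for row in labelled:
    print(row["src_ip"], row["packets"], row["avg_bytes"], row["label"])
```

Matching on source IP works in this sketch only because the synthetic generator's address is assumed to be known in advance; a real labeller could match on any combination of flow features and collected metadata.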

Assignees

Inventors

Classifications

  • Network analysis or design · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • H04L41/16 · Primary

    using machine learning or artificial intelligence · CPC title

  • using flow identification · CPC title

  • in wire-line communication networks, e.g. low power modes or reduced link rate · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11301778B2 cover?
A system and method for training and validating ML algorithms in real networks, including: generating synthetic traffic and receiving it along with real traffic; aggregating the received traffic into network flows by using metadata and transforming them to generate a first dataset readable by the ML algorithm, comprising features defined by the metadata; labelling the traffic and selecting a su…
Who is the assignee on this patent?
Telefonica SA
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 12, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).