Searching of data structures in pre-processing data for a machine learning classifier

US11899669B2 · US · B2

Patent metadata
Publication number: US-11899669-B2
Application number: US-201815926790-A
Country: US
Kind code: B2
Filing date: Mar 20, 2018
Priority date: Mar 20, 2017
Publication date: Feb 13, 2024
Grant date: Feb 13, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data processing system is configured to pre-process data for a machine learning classifier. The data processing system includes an input port that receives one or more data items, an extraction engine that extracts a plurality of data signatures and structure data, a logical rule set generation engine configured to generate a data structure, select a particular data signature of the data structure, identify each instance of the particular data signature in the data structure, segment the data structure around instances of the particular data signature, identify one or more sequences of data signatures connected to the particular data signature, and generate a logical ruleset. A classification engine executes one or more classifiers against the logical ruleset to classify the one or more data items received by the input port.
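The steps named in the abstract (build a data structure, select a signature, find its instances, segment around them, collect connected sequences, emit rules) can be illustrated with a minimal sketch. This is not the patented implementation: the names `build_graph`, `sequences_from`, and `generate_ruleset`, the `max_len` cap, and the choice to treat each instance as the root of a segment and walk outward along directed edges are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Adjacency map: node id -> list of successor node ids."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph


def sequences_from(graph, labels, start, max_len=3):
    """Collect signature sequences along directed paths that begin at `start`.

    Paths are capped at `max_len` nodes (a hypothetical limit), which also
    guarantees termination on cyclic graphs.
    """
    found = set()
    stack = [(start, (labels[start],))]
    while stack:
        node, seq = stack.pop()
        if len(seq) > 1:
            found.add(seq)
        if len(seq) < max_len:
            for nxt in graph.get(node, []):
                stack.append((nxt, seq + (labels[nxt],)))
    return found


def generate_ruleset(labels, edges, target, max_len=3):
    """Treat every node labelled `target` as a segment root and gather
    the distinct sequences connected to it. Each sequence is one rule."""
    graph = build_graph(edges)
    instances = [node for node, label in labels.items() if label == target]
    rules = set()
    for node in instances:
        rules |= sequences_from(graph, labels, node, max_len)
    return rules


# Hypothetical data item: five nodes, each carrying a signature label.
labels = {0: "A", 1: "B", 2: "A", 3: "C", 4: "B"}
edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
ruleset = generate_ruleset(labels, edges, target="A")
```

Because the rules are collected in a set, the sequence `("A", "B")` that is reachable from both instances of `"A"` is kept only once, which loosely mirrors the requirement that each identified sequence differ from the others.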

First claim

Opening claim text (preview).

What is claimed is:

1. A data processing system configured to pre-process data for a machine learning classifier, the data processing system comprising:
    an input port that receives one or more data items;
    a shared memory data store that stores the one or more data items, with each of the one or more data items being written to the shared memory data store; and
    at least one processor configured to perform operations comprising:
        extracting, from a data item of the one or more data items written to the shared memory data store, a plurality of data signatures and structure data representing relationships among the data signatures, wherein a type of the data signatures is based on a domain of the one or more data items;
        generating, based on the type of the data signatures, a data structure from the plurality of data signatures, wherein the data structure includes a plurality of nodes connected with edges, each node in the data structure represents a data signature, and wherein each edge specifies a relationship between a first node and a second node, with the specified relationship corresponding to a relationship represented in the structure data for data signatures represented by those first and second nodes;
        selecting a particular data signature of the data structure;
        for the particular data signature of the data structure that is selected, identifying each instance of the particular data signature in the data structure;
        segmenting, based on the type of the data signatures, the data structure around instances of the particular data signature; and
        identifying, based on the segmenting, one or more sequences of data signatures connected to the particular data signature, each of the one or more sequences being different from one or more other identified sequences of data signatures connected to the particular data signature in the data structure, each of the one or more sequences including a connected set of data signatures in the data structure that are connected to the particular data signature, wherein each connection represents a relationship between a first data signature and a second data signature of the set of data signatures;
        generating, based on the type of the data signatures and independent of precoding any sequence of data signatures into a rule, a logical ruleset, wherein each logical rule of the logical ruleset is defined by a sequence of data signatures of the one or more sequences of data signatures that are identified, and wherein a logical rule of the logical ruleset comprises a shape rule specifying one or a sequence of more shapes that are either permitted to be adjacent in the sequence of data signatures, restricted from being adjacent in the sequence of data signatures, or replacing one or more prior shapes of the sequence of data signatures;
        executing one or more classifiers against the logical ruleset to classify the one or more data items received by the input port; and
        generating, based on the executing, one or more additional logical rules for the logical ruleset, the one or more additional logical rules specifying an additional relationship between at least two shapes of the plurality of shapes.

2. The data processing system of claim 1, wherein generation of the logical ruleset enables classification of the one or more data items with a reduced amount of data, relative to an amount of data required to classify the one or more data items independent of the generation of the logical ruleset.

3. The data processing system of claim 2, wherein classification of the one or more data items with a reduced amount of data increases a processing speed of the data processing system in classifying the one or more data items, relative to a processing speed of the data processing system in classifying the one or more data items independent of the generation of the logical ruleset.

4. The data processing system of claim 1, the operations further comprising: determining a frequency for which each logical rule of the logical ruleset appears in the data structure; and generating a vector representing the data item, the vector defined by the frequency for each logical rule of the logical ruleset.

5. The data processing system of claim 4, the operations further comprising: comparing the vector with another vector generated for another data item of the one or more data items, wherein comparing includes computing a distance between the vector and the other vector in a vector space.

6. The data processing system of claim 1, the operations further comprising: determining which logical rules of the logical ruleset occur in another data item of the one or more data items; and representing the other data item as a vector of the logical rules that occur in the other data item.

7. The data processing system of claim 1, the operations further comprising: ranking the plurality of data signatures; and selecting a higher ranked data signature to be the particular data signature, the higher ranked data signature being ranked higher than another data signature that is lower ranked.

8. The data processing system of claim 7, wherein data signatures above a threshold ranking are iteratively selected to be the particular data signature, and wherein the logical ruleset comprises logical rules generated for each of the data signatures selected to be the particular data signature.

9. The data processing system of claim 7, wherein the ranking for a data signature is proportional to a frequency in which that data signature appears in the plurality of data signatures.

10. The data processing system of claim 7, the operations further comprising: weighting a data signature with a predetermined weight value, and wherein ranking is based on the predetermined weight value of the data signature.

11. The data processing system of claim 1, the operations further comprising: determining, for a logical rule, a frequency for which a sequence that defines the logical rule appears in the data structure; determining that that frequency is less than a threshold frequency; and removing the logical rule from the logical ruleset.

12. The data processing system of claim 1, wherein the one or more sequences comprise a plurality of sequences, and the operations further comprising: determining that a first sequence of the plurality of sequences includes a second sequence of the plurality of sequences; and removing, from the logical ruleset, a logical rule defined by the first sequence.

13. The data processing system of claim 1, the operations further comprising: comparing a portion of the data item to a library of specified data signatures, and wherein a data signature is extracted from the data item when the portion of the data item matches a specified data signature of the library.

14. The data processing system of claim 13, wherein a specified data signature of the library is assigned one or more parameter values, and wherein the operations further comprise extracting the data signature from the data item when the portion of the data item satisfies the one or more parameter values assigned to the data signature, wherein the data item satisfies the one or more parameter values assigned to the data signature when the data item is within a threshold of the one or more parameter values in accordance with fuzzy matching.

15. The data processing system of claim 1, the operations further comprising: receiving data indicating a threshold number of sequences; determining that a number of identified sequences for the data signature exceeds the threshold number of sequences; segmenting the data signature into sub-data s
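Claims 4, 5, and 11 describe counting how often each rule appears, representing a data item as a vector of those counts, comparing vectors by distance, and dropping rules that fall below a threshold frequency. The sketch below illustrates the idea under simplifying assumptions that are not in the claims: each data item is flattened to a linear list of signatures, a rule "appears" when it occurs as a contiguous run, and the distance is Euclidean. The function names are hypothetical.

```python
import math


def count_occurrences(signatures, rule):
    """Count contiguous occurrences of `rule` inside `signatures`."""
    n = len(rule)
    return sum(
        1
        for i in range(len(signatures) - n + 1)
        if tuple(signatures[i:i + n]) == tuple(rule)
    )


def rule_vector(signatures, ruleset):
    """One frequency per rule, in ruleset order (claim 4, simplified)."""
    return [count_occurrences(signatures, rule) for rule in ruleset]


def distance(u, v):
    """Euclidean distance between two rule vectors (one possible metric)."""
    return math.dist(u, v)  # requires Python 3.8+


def prune(ruleset, signatures, min_count):
    """Keep only rules whose frequency meets the threshold (claim 11, simplified)."""
    return [r for r in ruleset if count_occurrences(signatures, r) >= min_count]


# Hypothetical ruleset and two hypothetical data items.
ruleset = [("A", "B"), ("B", "C"), ("A", "C")]
item_1 = list("ABABC")
item_2 = list("ACAC")

vec_1 = rule_vector(item_1, ruleset)   # [2, 1, 0]
vec_2 = rule_vector(item_2, ruleset)   # [0, 0, 2]
gap = distance(vec_1, vec_2)           # 3.0
kept = prune(ruleset, item_1, min_count=1)
```

A fixed rule order matters here: the vectors are only comparable when every item is scored against the same ruleset in the same order.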

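Claims 7 through 9 describe ranking data signatures, with the rank proportional to how often a signature appears, and iteratively selecting the signatures above a threshold ranking. A minimal sketch follows, assuming the threshold is expressed as a hypothetical `top_k` cutoff and that ties are broken alphabetically; neither detail comes from the claims.

```python
from collections import Counter


def rank_signatures(signatures):
    """Order distinct signatures by descending frequency.

    Ties are broken alphabetically so the result is deterministic
    (an assumption made for this sketch).
    """
    counts = Counter(signatures)
    return sorted(counts, key=lambda s: (-counts[s], s))


def select_targets(signatures, top_k):
    """Return the `top_k` highest-ranked signatures; each would be used
    in turn as the particular data signature for rule generation."""
    return rank_signatures(signatures)[:top_k]


# Hypothetical extracted signatures for one data item.
extracted = list("ABABCA")
ranking = rank_signatures(extracted)      # ['A', 'B', 'C']
targets = select_targets(extracted, 2)    # ['A', 'B']
```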
Assignees

Inventors

Classifications

  • Applying rules; Deductive queries · CPC title

  • Ontology · CPC title

  • Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title

  • Forward inferencing; Production systems · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11899669B2 cover?
A data processing system is configured to pre-process data for a machine learning classifier. The data processing system includes an input port that receives one or more data items, an extraction engine that extracts a plurality of data signatures and structure data, a logical rule set generation engine configured to generate a data structure, select a particular data signature of the data stru…
Who is the assignee on this patent?
Carnegie Mellon University (recorded as Univ Carnegie Mellon)
What technology area does this patent fall under?
Primary CPC classification G06F16/24564. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 13, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).