Methods and Systems for Transforming Training Data to Improve Data Classification

US2017344617A1 · US · A1

Patent metadata
Publication number: US-2017344617-A1
Application number: US-201615217282-A
Country: US
Kind code: A1
Filing date: Jul 22, 2016
Priority date: May 31, 2016
Publication date: Nov 30, 2017
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method for transforming training data to improve data classification is disclosed. The method comprises extracting concepts from a training data set. The method comprises computing frequency of occurrence of each concept in each category and removing concepts from the data records when the frequency of occurrence of a concept in a category is less than a threshold frequency value. Further, the method comprises computing a percentage contribution of each concept of remaining concepts in each category upon removing the concepts and eliminating concepts, from the remaining concepts, contributing equally to each category based on the percentage contribution of each concept to provide a reformed training data set. Further, the method comprises appending a category name to a corresponding data record in the reformed training data set based on a normalized frequency of occurrence of the concept in a category to improve data classification.
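The first steps of the abstract (counting how often each concept occurs in each category, dropping low-frequency concepts, and computing each remaining concept's percentage contribution) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patent's implementation: a "concept" is taken to be a lowercase whitespace-separated token, a count is dropped per concept and category when it falls below the threshold, and "percentage contribution" is read as a concept's count in one category divided by its total count across categories. The function names and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def concept_frequencies(records):
    """Count how often each concept occurs in each category.

    `records` is a list of (text, category) pairs. Assumption: a concept
    is a lowercase whitespace-separated token.
    """
    freq = defaultdict(Counter)  # freq[concept][category] -> count
    for text, category in records:
        for concept in text.lower().split():
            freq[concept][category] += 1
    return freq

def drop_rare(freq, threshold):
    """Keep only (concept, category) counts that reach the threshold."""
    kept = {}
    for concept, per_cat in freq.items():
        surviving = {c: n for c, n in per_cat.items() if n >= threshold}
        if surviving:  # a concept with no surviving count disappears entirely
            kept[concept] = surviving
    return kept

def percentage_contribution(freq, categories):
    """Share (0-100) of each concept's total count that falls in each category."""
    table = {}
    for concept, per_cat in freq.items():
        total = sum(per_cat.values())
        table[concept] = [100.0 * per_cat.get(c, 0) / total for c in categories]
    return table

# Hypothetical toy training set of (record text, category) pairs.
records = [
    ("printer jam paper", "hardware"),
    ("printer offline paper", "hardware"),
    ("password reset login", "access"),
    ("login password expired", "access"),
    ("printer login issue", "access"),
    ("printer password", "access"),
]
categories = ["hardware", "access"]

freq = drop_rare(concept_frequencies(records), threshold=2)
contrib = percentage_contribution(freq, categories)
```

In this toy data, "printer" ends up split evenly between the two categories while "paper" belongs entirely to one, which is the kind of difference the later elimination step looks for.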

First claim

Opening claim text (preview).

What is claimed is:

1. A method for transforming training data to improve data classification, the method comprising: extracting, by a data transforming system, concepts from a training data set, wherein the training data set comprises data records corresponding to one or more categories; computing, by the data transforming system, frequency of occurrence of each concept in each category of the one or more categories; removing, by the data transforming system, one or more concepts from the data records when the frequency of occurrence of a concept in a category is less than a threshold frequency value; computing, by the data transforming system, a percentage contribution of each concept of remaining concepts in each category upon removing the one or more concepts; eliminating, by the data transforming system, concepts, from the remaining concepts, contributing equally to each category based on the percentage contribution of each concept to provide a reformed training data set; and appending, by the data transforming system, a category name to a corresponding data record in the reformed training data set based on a normalized frequency of occurrence of the concept in a category to improve data classification.

2. The method of claim 1, wherein computing the percentage contribution of each concept of the remaining concepts in each category further comprises creating a relative contribution matrix based on the percentage contribution of each concept in each category.

3. The method of claim 2, wherein eliminating concepts contributing equally to each category comprises: computing, by the data transformation system, a maximum percentage contribution and a standard deviation for each row in the relative contribution matrix; creating, by the data transformation system, an asymmetry matrix, wherein each cell comprises a distance of each cell value in the relative contribution matrix from the maximum percentage contribution of each concept; and eliminating, by the data transformation system, concepts corresponding to a row in the asymmetry matrix from the training data set when a maximum distance of distances in the row of the asymmetry matrix is less than a pre-defined contribution value.

4. The method of claim 3, wherein the distance of each cell value in the relative contribution matrix from the maximum percentage contribution of each concept is computed using the standard deviation of each row in the relative contribution matrix.

5. The method of claim 1, wherein appending the category name to the corresponding data record comprises: creating, by the data transformation system, a domain concept frequency matrix comprising concepts in the reformed training data set and the frequency of occurrence of each concept in each category; computing, by the data transformation system, the normalized frequency of occurrence based on a minimum frequency of occurrence and a maximum frequency of occurrence; and appending, by the data transformation system, the category name corresponding to a maximum normalized frequency of occurrence to the corresponding data record the maximum normalized frequency of occurrence times to improve data classification.

6. The method of claim 1, wherein appending the category name to the corresponding data record biases a classifier to classify a data record in the category corresponding to the category name.

7. A data transforming system for transforming training data to improve data classification, the data transforming system comprising: a processor; and a memory communicatively coupled to the processor, wherein the memory stores processor instructions, which, on execution, causes the processor to: extract concepts from a training data set, wherein the training data set comprises data records corresponding to one or more categories; compute frequency of occurrence of each concept in each category of the one or more categories; remove one or more concepts from the data records when the frequency of occurrence of a concept in a category is less than a threshold frequency value; compute a percentage contribution of each concept of remaining concepts in each category upon removing the one or more concepts; eliminate concepts, from the remaining concepts, contributing equally to each category based on the percentage contribution of each concept to provide a reformed training data set; and append a category name to a corresponding data record in the reformed training data set based on a normalized frequency of occurrence of the concept in a category to improve data classification.

8. The data transforming system of claim 7, wherein the processor is caused to create a relative contribution matrix based on a percentage contribution of each concept in each category.

9. The data transforming system of claim 8, wherein the processor is caused to: compute a maximum percentage contribution and a standard deviation for each row in the relative contribution matrix; create an asymmetry matrix, wherein each cell comprises a distance of each cell value in the relative contribution matrix from the maximum percentage contribution of each concept; and eliminate concepts corresponding to a row in the asymmetry matrix from the training data set when a maximum distance of distances in the row of the asymmetry matrix is less than a predefined contribution value.

10. The data transforming system of claim 9, wherein the distance of each cell value in the relative contribution matrix from the maximum percentage contribution of each concept is computed using the standard deviation of each row in the relative contribution matrix.

11. The data transforming system of claim 7, wherein the processor is caused to: create a domain concept frequency matrix comprising concepts in the reformed training data set and the frequency of occurrence of each concept in each category; compute the normalized frequency of occurrence based on a minimum frequency of occurrence and a maximum frequency of occurrence; and append the category name corresponding to a maximum normalized frequency of occurrence to the corresponding data record the maximum normalized frequency of occurrence times to improve data classification.

12. The data transforming system of claim 7, wherein appending the category name to the corresponding data record biases a classifier to classify a data record in the category corresponding to the category name.

13. A non-transitory computer-readable medium storing computer-executable instructions for: extracting concepts from a training data set, wherein the training data set comprises data records corresponding to one or more categories; computing frequency of occurrence of each concept in each category of the one or more categories; removing one or more concepts from the data records when the frequency of occurrence of a concept in a category is less than a threshold frequency value; computing a percentage contribution of each concept of remaining concepts in each category upon removing the one or more concepts; eliminating concepts, from the remaining concepts, contributing equally to each category based on the percentage contribution of each concept to provide a reformed training data set; and appending a category name to a corresponding data record in the reformed training data set based on a normalized frequency of occurrence of the concept in a category to improve data classification.

14. The non-transitory computer-readable medium of claim 13, wherein computing the percentage contribution of each concept of the remaining concepts in each category further compr
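
The elimination step of claim 3 and the appending step of claim 5 can be sketched as below. Several details are assumptions, because the claim text shown here does not define them: the distance from the row maximum is taken as a plain difference (claim 4 says the row's standard deviation is used, but gives no formula, so that part is not reproduced); the normalized frequency is a min-max scaling over the whole matrix; and the repeat count is that value scaled to 1..`max_repeats`. All function names, the `max_repeats` parameter, and the numbers are hypothetical.

```python
def asymmetry_rows(contrib):
    """Distance of every cell from its row maximum.

    Assumption: plain difference (row max minus cell value), standing in
    for the standard-deviation-based distance of claim 4.
    """
    return {c: [max(row) - v for v in row] for c, row in contrib.items()}

def discriminative_concepts(contrib, min_spread):
    """Drop concepts whose largest distance is below `min_spread`,
    i.e. concepts that contribute almost equally to every category."""
    asym = asymmetry_rows(contrib)
    return {c: contrib[c] for c, dists in asym.items() if max(dists) >= min_spread}

def normalized_frequency(freq_matrix):
    """Min-max normalize every count to [0, 1] over the whole matrix."""
    values = [n for row in freq_matrix.values() for n in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {c: [(n - lo) / span for n in row] for c, row in freq_matrix.items()}

def append_category(record, categories, norm, max_repeats=3):
    """Append, for each known concept in the record, the name of the category
    with the highest normalized frequency, repeated 1..max_repeats times."""
    tokens = record.lower().split()
    extra = []
    for concept in tokens:
        if concept not in norm:
            continue  # concept was removed earlier or never seen
        row = norm[concept]
        best = max(range(len(categories)), key=row.__getitem__)
        repeats = 1 + round(row[best] * (max_repeats - 1))
        extra.extend([categories[best]] * repeats)
    return " ".join(tokens + extra)

categories = ["hardware", "access", "network"]

# Hypothetical relative contribution matrix (percent per category).
contrib = {
    "printer": [80.0, 10.0, 10.0],
    "issue": [34.0, 33.0, 33.0],   # nearly uniform, so it should be eliminated
    "login": [5.0, 90.0, 5.0],
}
kept = discriminative_concepts(contrib, min_spread=20.0)

# Hypothetical domain concept frequency matrix for the kept concepts.
freq_matrix = {"printer": [8, 1, 1], "login": [1, 18, 1]}
norm = normalized_frequency(freq_matrix)
augmented = append_category("Printer login failed", categories, norm)
```

Repeating the category name inside the record is what claim 6 describes as biasing a classifier toward that category: the appended tokens become extra evidence for the label during training.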

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017344617A1 cover?
In one embodiment, a method for transforming training data to improve data classification is disclosed. The method comprises extracting concepts from a training data set. The method comprises computing frequency of occurrence of each concept in each category and removing concepts from the data records when the frequency of occurrence of a concept in a category is less than a threshold frequency…
Who is the assignee on this patent?
Wipro Ltd
What technology area does this patent fall under?
Primary CPC classification G06N5/022. Mapped technology areas include Physics.
When was this patent published?
Published Nov 30, 2017 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).