Rejecting Biased Data Using a Machine Learning Model
US-2020081865-A1 · Mar 12, 2020 · US
US11854532B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11854532-B2 |
| Application number | US-202217567493-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 3, 2022 |
| Priority date | Oct 30, 2018 |
| Publication date | Dec 26, 2023 |
| Grant date | Dec 26, 2023 |
Official abstract text for this publication.
Disclosed is a system and method for detecting and addressing bias in training data prior to building language models based on the training data. Accordingly, the system and method detect bias in training data for Intelligent Virtual Assistant (IVA) understanding and highlight any bias found. Suggestions for reducing or eliminating it may be provided. This detection may be done for each model within the Natural Language Understanding (NLU) component. For example, the language model, as well as any sentiment or other metadata models used by the NLU, can introduce understanding bias. For each model deployed, training data is automatically analyzed for bias and corrections are suggested.
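The detection step can be illustrated with a minimal sketch, assuming that "bias" means a class label whose share of the training set falls outside a configured range. The function names, the `max_share` and `min_share` parameters, and the intent labels are hypothetical and do not come from the patent text; this is one possible reading, not the patented implementation.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def class_shares(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return each class label's share of the training set (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def find_population_bias(
    labels: Iterable[Hashable],
    max_share: float,
    min_share: float = 0.0,
) -> List[Tuple[Hashable, float, str]]:
    """List classes whose share is above max_share or below min_share.

    Assumption: a single pair of thresholds applies to every class. A
    per-label threshold table would be a straightforward extension.
    """
    findings = []
    shares = class_shares(labels)
    # Sort by label text so the report order is stable between runs.
    for label, share in sorted(shares.items(), key=lambda kv: str(kv[0])):
        if share > max_share:
            findings.append((label, share, "overrepresented"))
        elif share < min_share:
            findings.append((label, share, "underrepresented"))
    return findings


# Hypothetical intent labels for an IVA training set.
labels = ["billing"] * 8 + ["refund"] + ["greeting"]
report = find_population_bias(labels, max_share=0.5, min_share=0.15)
for label, share, verdict in report:
    print(f"{label}: {share:.0%} of samples, {verdict}")
```

The report is only a list of findings. Whether a flagged imbalance is artificial or reflects real usage is left to a later step, such as a human reviewer.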
Opening claim text (preview).
What is claimed is:
1. A non-transitory computer readable medium comprising instructions that, when executed by a processor of a processing system, cause the processing system to: digitally process training data for a language model from among multi-class training data to identify if the training data comprises a class population bias by comparing a distribution of each given value for a plurality of given class labels to a representation threshold value associated with the class label; adjust the training data to compensate for the bias identified upon determination that the training data comprises class population bias, based upon the given values by adding examples of an underrepresented class or removing examples of an overrepresented class; and randomly select a percentage of samples of training data of the overrepresented class for removal.
2. The non-transitory computer readable medium of claim 1, wherein the instructions further cause the processing system to refer detected bias to a human reviewer for further determination of bias.
3. The non-transitory computer readable medium of claim 1, wherein the determination includes one of deeming the class population bias being artificial requiring repair and deeming the class population bias being accurate allowing disregarding.
4. The non-transitory computer readable medium of claim 1, wherein instructions further cause the processing system to digitally process the training data by scanning the training data with a bias scoring system.
5. The non-transitory computer readable medium of claim 1, wherein instructions further cause the processing system to adjust the training data to compensate for the bias identified by deleting examples of class label combinations for entry values above a predetermined threshold until normalized entries of all class values are below the predetermined threshold.
6. The non-transitory computer readable medium of claim 1, wherein the removal of the percentage of samples is automatic without human intervention.
7. The non-transitory computer readable medium of claim 1, wherein instructions further cause the processing system to report identified bias to a user before compensating for the identified bias.
8. A method of automatically detecting bias in training data for training a language model, comprising: digitally processing training data for a language model from among multi-class training data to identify if the training data comprises a class population bias by comparing a distribution of each given value for a plurality of given class labels to a representation threshold value associated with the class label; adjusting the training data to compensate for the bias identified upon determining that the training data comprises class population bias, based upon the given values by adding examples of an underrepresented class or removing examples of an overrepresented class; and randomly selecting a percentage of samples of training data of the overrepresented class for removal.
9. The method of claim 8, further comprising referring detected bias to a human reviewer for further determination of bias.
10. The method of claim 8, wherein the determination includes one of deeming the class population bias being artificial requiring repair and deeming the class population bias being accurate allowing disregarding.
11. The method of claim 8, wherein digitally processing the training data comprises scanning the training data with a bias scoring system.
12. The method of claim 8, further comprising adjusting the training data to compensate for the bias identified by deleting examples of class label combinations for entry values above the predetermined threshold until the normalized entries of all class values are below the predetermined threshold.
13. The method of claim 8, wherein the removing of the percentage of samples is automatic without human intervention.
14. The method of claim 8, the method further comprising reporting identified bias to a user before compensating for the identified bias.
15. A system for automatically detecting bias in training data for training a language model, comprising: a memory comprising computer readable instructions; and a processor configured to execute the computer readable instructions, that cause the system to: digitally process training data for a language model from among multi-class training data to identify if the training data comprises a class population bias by comparing a distribution of each given value for a plurality of given class labels to a representation threshold value associated with the class label; adjust the training data to compensate for the bias identified upon determination that the training data comprises class population bias, based upon the given values by adding examples of an underrepresented class or removing examples of an overrepresented class; and randomly select a percentage of samples of training data of the overrepresented class for removal.
16. The system of claim 15, wherein the instructions further cause the system to refer detected bias to a human reviewer for further determination of bias.
17. The system of claim 15, wherein the determination includes one of deeming the class population bias being artificial requiring repair and deeming the class population bias being accurate allowing disregarding.
18. The system of claim 15, wherein the instructions further cause the system to digitally process the training data by scanning the training data with a bias scoring system.
19. The system of claim 15, wherein the instructions further cause the system to adjust the training data to compensate for the bias identified by deleting examples of class label combinations for entry values above the predetermined threshold until the normalized entries of all class values are below the predetermined threshold.
20. The system of claim 15, wherein the removal of the percentage of samples is automatic without human intervention.
21. The system of claim 15, wherein the instructions further cause the system to report identified bias to a user before compensating for the identified bias.
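The removal step recited in the independent claims (randomly selecting a percentage of samples of an overrepresented class for removal) might look roughly like the following sketch. The `undersample_class` name, the `(text, label)` sample layout, the `fraction` and `seed` parameters, and the example data are all assumptions made for illustration; the claims do not specify how the percentage is chosen or how the random selection is implemented.

```python
import random
from typing import Hashable, List, Optional, Sequence, Tuple

# Assumed sample layout: (utterance text, class label).
Sample = Tuple[str, Hashable]


def undersample_class(
    samples: Sequence[Sample],
    label: Hashable,
    fraction: float,
    seed: Optional[int] = None,
) -> List[Sample]:
    """Randomly drop `fraction` of the samples carrying `label`.

    Samples of every other class are kept, and the original order of the
    surviving samples is preserved. The number of dropped samples is
    rounded down.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a seed makes the selection reproducible
    candidates = [i for i, (_, lab) in enumerate(samples) if lab == label]
    n_drop = int(len(candidates) * fraction)
    dropped = set(rng.sample(candidates, n_drop))
    return [s for i, s in enumerate(samples) if i not in dropped]


# Hypothetical training set: "billing" dominates with 8 of 10 samples.
samples = [(f"billing utterance {i}", "billing") for i in range(8)]
samples += [("refund utterance", "refund"), ("hello", "greeting")]

balanced = undersample_class(samples, "billing", fraction=0.5, seed=0)
print(len(samples), "->", len(balanced))
```

Removing half of the eight "billing" samples leaves four, so that class's share drops from 80% to about 67% of the remaining six samples. Reaching a given threshold may therefore take a larger fraction or repeated passes, which is closer to the "until normalized entries of all class values are below the predetermined threshold" wording of the dependent claims.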
Supervised learning · CPC title
Training · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
Natural language analysis (semantic analysis of natural language G06F40/30) · CPC title
using natural language modelling · CPC title