Training mechanism of verbal harassment detection systems

US11670286B2 · US · B2

Patent metadata
Publication number: US-11670286-B2
Application number: US-202017135651-A
Country: US
Kind code: B2
Filing date: Dec 28, 2020
Priority date: Dec 31, 2019
Publication date: Jun 6, 2023
Grant date: Jun 6, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In some cases, lower quality, large scale training data can be automatically generated by automatic labeling. The generated training data can be used to pre-train a machine learning model. For instance, the model can be a model for detection of verbal harassment. Parameters of the pre-trained model can be refined or updated using another one or more higher-quality sets of training data, with which the model can be subsequently trained.
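
The two-stage idea in the abstract (pre-train on a large, automatically labeled set, then refine the parameters on a smaller, higher-quality set) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the bag-of-words logistic model, the function names `predict_proba` and `train`, the toy sentences, and the hyperparameters are not taken from the patent.

```python
import math
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def predict_proba(weights, text):
    """Logistic score over per-token weights plus a bias term."""
    z = weights["<bias>"] + sum(weights[t] for t in tokenize(text))
    return 1.0 / (1.0 + math.exp(-z))

def train(weights, examples, epochs, lr):
    """Plain SGD on log loss; updates `weights` in place and returns it."""
    for _ in range(epochs):
        for text, label in examples:
            error = predict_proba(weights, text) - label
            weights["<bias>"] -= lr * error
            for t in tokenize(text):
                weights[t] -= lr * error
    return weights

# Stage 1 data: automatically labeled, so some labels are wrong (toy data).
noisy_auto_labeled = [
    ("shut up you idiot", 1),
    ("you are an idiot", 1),
    ("turn left at the light", 0),
    ("thank you have a nice day", 0),
    ("please shut the door", 1),  # deliberately mislabeled
]

# Stage 2 data: smaller, higher-quality labels (toy data).
clean_labeled = [
    ("please shut the door", 0),
    ("you stupid idiot", 1),
    ("please turn left", 0),
]

weights = defaultdict(float)

# Pre-training on the noisy set.
train(weights, noisy_auto_labeled, epochs=20, lr=0.5)
p_before = predict_proba(weights, "please shut the door")

# Subsequent update of the same parameters on the cleaner set.
train(weights, clean_labeled, epochs=20, lr=0.5)
p_after = predict_proba(weights, "please shut the door")
```

In this toy run the second stage starts from the pre-trained weights instead of from zero, and it corrects the score of the segment that the automatic labels got wrong.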

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of training a machine learning model for detection of verbal harassment, the method comprising: by one or more hardware processors:
determining a plurality of verbal harassment heuristics using a first plurality of segments, the segments of the first plurality of segments previously labeled with an occurrence of verbal harassment or a non-occurrence of verbal harassment;
determining a plurality of labels for a second plurality of segments by applying the plurality of verbal harassment heuristics and a plurality of verbal harassment patterns, the segments of the second plurality of segments not previously labeled with the occurrence or the non-occurrence of verbal harassment;
aggregating the plurality of labels into a plurality of likelihoods for the occurrence of verbal harassment;
selecting a subset of segments from the second plurality of segments based on comparing the plurality of likelihoods to at least one threshold;
pre-training a machine learning model for verbal harassment detection using the subset of segments from the second plurality of segments and a plurality of randomly selected segments, wherein the subset of segments from the second plurality of segments represents training data indicative of the occurrence of verbal harassment and the plurality of randomly selected segments represents training data indicative of the non-occurrence of verbal harassment; and
subsequent to the pre-training, updating one or more parameters of the machine learning model using a third plurality of segments.

2. The method of claim 1, wherein the third plurality of segments comprises at least some segments previously labeled with the occurrence or the non-occurrence of verbal harassment.

3. The method of claim 1, wherein the third plurality of segments comprises a number of segments that is larger than a number of segments in at least one of the second plurality of segments or the plurality of randomly selected segments.

4. The method of claim 1, wherein at least one of the first plurality of segments, the second plurality of segments, third plurality of segments, or the plurality of randomly selected segments comprise text data.

5. The method of claim 4, wherein text data has been obtained by applying automatic speech recognition to audio data.

6. The method of claim 1, wherein a number of segments in the second plurality of segments is larger than a number of segments in the first plurality of segments.

7. The method of claim 1, wherein determining the plurality of labels for the second plurality of segments comprises determining more than one label for at least one segment of the second plurality of segments.

8. The method of claim 7, wherein aggregating the plurality of labels comprises selecting a single label for the at least one segment of the second plurality of segments.

9. The method of claim 1, wherein the plurality of randomly selected segments comprise training data indicative of the non-occurrence of verbal harassment.

10. The method of claim 9, wherein the subset of segments from the second plurality of segments comprises training data indicative of the occurrence of verbal harassment.

11. The method of claim 1, wherein the at least one threshold is equal to or greater than 0.9.

12. The method of claim 1, wherein the segments of the first plurality of segments comprise manually-generated labels.

13. The method of claim 1, wherein the machine learning model for verbal harassment detection comprises a text classification machine learning model.

14. The method of claim 13, wherein the text classification machine learning model comprises at least one of hierarchical attention model, a fastText model, or a convolutional neural network model.

15. A computer-implemented method of training a machine learning model for detection of verbal harassment, the method comprising: by one or more hardware processors:
generating a first set of training data comprising a first plurality of segments by labeling at least some of the segments of the first plurality of segments;
pre-training a machine learning model for verbal harassment detection using a subset of segments from the first set of training data and a plurality of randomly selected segments from the first set of training data, wherein the subset of segments from the first set of training data represents training data indicative of a occurrence of verbal harassment and the plurality of randomly selected segments represents training data indicative of a non-occurrence of verbal harassment; and
subsequent to completion of pre-training the machine learning model using the subset of segments from the first set of training data, updating one or more parameters of the machine learning model using a second set of training data comprising a second plurality of segments.

16. The method of claim 15, wherein labeling at least some of the segments of the first plurality of segments comprises labeling the segments of the first plurality of segments as comprising an occurrence or a non-occurrence of verbal harassment.

17. The method of claim 16, wherein labeling at least some of the segments of the first plurality of segments incorrectly labels at least one segment of the first plurality of segments.

18. The method of claim 15, wherein the second set of training data comprises at least some segments previously labeled with an occurrence or a non-occurrence of verbal harassment.

19. The method of claim 15, wherein pre-training a machine learning model for verbal harassment detection comprises further using a plurality of randomly selected segments.

20. The method of claim 15, wherein at least some segments of the first plurality of segments are labelled with manually-generated labels.
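
The labeling steps of claim 1 (apply heuristics and patterns to unlabeled segments, aggregate the resulting labels into likelihoods, keep the segments above a threshold as positives, and add randomly selected segments as negatives) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the three labeling functions are hand-written placeholders, aggregation by averaging the non-abstaining votes is one simple choice among many, and drawing the random negatives from segments not selected as positives is an assumption the claims do not state. The 0.9 threshold follows claim 11.

```python
import random
import re

# Hypothetical labeling functions. Each returns 1 (harassment),
# 0 (no harassment), or None (abstain).
def lf_insult_keyword(text):
    tokens = text.lower().split()
    return 1 if any(w in tokens for w in ("idiot", "stupid")) else None

def lf_threat_pattern(text):
    return 1 if re.search(r"\bi will (hurt|hit) you\b", text.lower()) else None

def lf_polite(text):
    return 0 if re.search(r"\b(please|thank you)\b", text.lower()) else None

LABELING_FUNCTIONS = [lf_insult_keyword, lf_threat_pattern, lf_polite]

def likelihood(text):
    """Aggregate votes into a likelihood: mean of the non-abstaining votes."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v is not None]
    return sum(votes) / len(votes) if votes else 0.0

def build_pretraining_set(unlabeled, threshold=0.9, n_negatives=2, seed=0):
    """Return (segment, label) pairs for pre-training."""
    positives = [s for s in unlabeled if likelihood(s) >= threshold]
    # Assumption: negatives are sampled from segments not chosen as positives.
    pool = [s for s in unlabeled if s not in positives]
    rng = random.Random(seed)
    negatives = rng.sample(pool, min(n_negatives, len(pool)))
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]

unlabeled = [
    "you stupid idiot",
    "please turn left",
    "thank you",
    "i will hurt you",
    "stupid traffic, thank you for waiting",
    "drop me at the corner",
]

pretraining_set = build_pretraining_set(unlabeled)
```

The segment "stupid traffic, thank you for waiting" receives conflicting votes, so its aggregated likelihood falls below the threshold and it is not used as a positive example.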

Assignees

Beijing Didi Infinity Technology & Dev Co Ltd

Inventors

Classifications

  • G10L15/063 · Primary

    Training · CPC title

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title

  • Combinations of networks · CPC title

  • Semantic analysis · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11670286B2 cover?
In some cases, lower quality, large scale training data can be automatically generated by automatic labeling. The generated training data can be used to pre-train a machine learning model. For instance, the model can be a model for detection of verbal harassment. Parameters of the pre-trained model can be refined or updated using another one or more higher-quality sets of training data, with wh…
Who is the assignee on this patent?
Beijing Didi Infinity Technology & Dev Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 6, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).