Machine learning techniques for analyzing textual content
US-2021182496-A1 · Jun 17, 2021 · US
US11663258B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11663258-B2 |
| Application number | US-202017133869-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 24, 2020 |
| Priority date | May 20, 2020 |
| Publication date | May 30, 2023 |
| Grant date | May 30, 2023 |
Official abstract text for this publication.
The present disclosure discloses a method and apparatus for processing a dataset. The method includes: obtaining a first text set meeting a preset similarity matching condition with a target text from multiple text blocks provided by a target user; obtaining a second text set from the first text set, in which each text in the second text set does not belong to a same text block as the target text; generating a negative sample set of the target text based on content of a candidate text block to which each text in the second text set belongs; generating a positive sample set of the target text based on content of a target text block to which the target text belongs; and generating a dataset of the target user based on the negative sample set and the positive sample set, and training a matching model based on the dataset.
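The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: `token_overlap`, `build_samples`, the sample blocks, and the 0.3 threshold are hypothetical, and a simple token-overlap score stands in for the similarity matching condition. Pairing the target text with every text of a candidate block is one possible reading of "based on content of a candidate text block".

```python
from itertools import combinations


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (stand-in similarity, an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_samples(blocks, target_block_id, target_text, threshold=0.3):
    """Return (text_a, text_b, label) triples; label 1 = positive, 0 = negative."""
    # First text set: texts whose similarity to the target exceeds the threshold.
    first = [(bid, t) for bid, texts in blocks.items() for t in texts
             if t != target_text and token_overlap(target_text, t) > threshold]
    # Second text set: drop texts that share the target's block.
    second = [(bid, t) for bid, t in first if bid != target_block_id]
    candidate_ids = sorted({bid for bid, _ in second})
    # Negatives: the target paired with every text of each candidate block.
    negatives = [(target_text, t, 0) for bid in candidate_ids for t in blocks[bid]]
    # Positives: all pairs of texts inside the target's own block.
    positives = [(a, b, 1) for a, b in combinations(blocks[target_block_id], 2)]
    return positives + negatives


# Hypothetical text blocks: each block groups texts with similar semantics.
blocks = {
    "credit": ["I want to apply for a credit card",
               "How can I apply for a credit card"],
    "debit": ["I want to apply for a debit card",
              "How do I open a debit card"],
    "weather": ["What is the weather today"],
}
target = "I want to apply for a credit card"
dataset = build_samples(blocks, "credit", target)
for row in dataset:
    print(row)
```

Note that the whole "debit" block contributes negatives, including a text that did not itself pass the similarity filter, while the unrelated "weather" block contributes nothing. The negatives are therefore texts that look close to the target but belong to a different semantic group.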
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for processing a dataset, comprising: obtaining a plurality of text blocks provided by a target user, each text block comprising a plurality of texts with similar semantics, and obtaining a first text set meeting a preset similarity matching condition with a target text from the plurality of text blocks; obtaining a second text set from the first text set, wherein each text in the second text set does not belong to a same text block as the target text; generating a negative sample set of the target text based on content of a candidate text block to which each text in the second text set belongs; generating a positive sample set of the target text based on content of a target text block to which the target text belongs; and generating a dataset of the target user based on the negative sample set and the positive sample set, and training a matching model based on the dataset of the target user for recognizing a text similarity; wherein the obtaining the first text set meeting the preset similarity matching condition with the target text from the plurality of text blocks comprises: obtaining a sub-vector, a text vector and a position vector corresponding to the target text, and inputting the sub-vector, the text vector and the position vector corresponding to the target text into a preset training language representation model to obtain a target statement vector; obtaining a sub-vector, a text vector and a position vector corresponding to each text in the plurality of text blocks, and inputting the sub-vector, the text vector and the position vector corresponding to each text into the preset training language representation model to obtain a statement vector corresponding to each text; calculating a cosine similarity between the target statement vector and the statement vector corresponding to each text; and comparing the cosine similarity with a preset cosine threshold, and generating the first text set based on texts corresponding to cosine similarities greater than the preset cosine threshold.
2. The method of claim 1, wherein obtaining the first text set meeting the preset similarity matching condition with the target text from the plurality of text blocks comprises: performing word segmentation on the target text to generate a first word segmentation set, and performing word segmentation on respective texts in the plurality of text blocks to generate a plurality of second word segmentation sets; comparing the first word segmentation set with each set of the plurality of second word segmentation sets to obtain a word segmentation duplication between the first word segmentation set and each set of the plurality of second word segmentation sets; and comparing the word segmentation duplication between the first word segmentation set and each set of the plurality of second word segmentation sets with a preset threshold, and generating the first text set based on second word segmentation sets corresponding to word segmentation duplications greater than the preset threshold.
3. The method of claim 1, wherein before obtaining the second text set from the first text set, the method further comprises: obtaining a number of texts in the first text set, and determining whether the number of texts is greater than a preset number threshold; and in a case that the number of texts is greater than the preset number threshold, deleting one or more texts in the first text set based on the preset number threshold, such that the number of texts is equal to the preset number threshold.
4. The method of claim 1, wherein obtaining the second text set from the first text set comprises: obtaining a block identifier corresponding to the target text; obtaining a block identifier corresponding to each text in the first text set; and comparing the block identifier corresponding to each text with the block identifier corresponding to the target text, and generating the second text set based on one or more texts having inconsistent block identifiers with the target text.
5. The method of claim 1, wherein generating the negative sample set of the target text based on the content of the candidate text block to which each text in the second text set belongs comprises: obtaining the content of the candidate text block to which each text in the second text set belongs; performing text combination on a plurality of texts in each candidate text block to generate a first negative sample set; performing text combination on a plurality of texts in different candidate text blocks to generate a second negative sample set; and generating the negative sample set of the target text based on the first negative sample set and the second negative sample set.
6. The method of claim 1, wherein generating the positive sample set of the target text based on the content of the target text block to which the target text belongs comprises: obtaining the content of the target text block to which the target text belongs; and performing text combination on a plurality of texts in the target text block to generate the positive sample set of the target text.
7. The method of claim 1, wherein the target user comprises a plurality of sub-users, obtaining the plurality of text blocks provided by the target user comprises: obtaining a plurality of text blocks provided by each sub-user; and generating the dataset of the target user based on the negative sample set and the positive sample set comprises: generating a sub-dataset corresponding to each sub-user based on the negative sample set and the positive sample set; combining sub-datasets corresponding to respective sub-users to generate a candidate dataset; and performing deduplication processing on the candidate dataset based on a preset deduplication strategy to generate the dataset of the target user.
8. The method of claim 1, wherein training the matching model based on the dataset for recognizing the text similarity comprises: obtaining a first query statement and a second query statement; encoding the first query statement to generate a first query vector; encoding the second query statement to generate a second query vector; and inputting the first query vector and the second query vector to the matching model to obtain a matching type outputted, and determining the text similarity between the first query statement and the second query statement based on the matching type.
9. The method of claim 1, wherein training the matching model based on the dataset for recognizing the text similarity comprises: obtaining a first query statement and a second query statement; inputting the first query statement and the second query statement to the matching model for statement alignment to obtain an alignment result; and determining the text similarity between the first query statement and the second query statement based on the alignment result.
10. An apparatus for processing a dataset, comprising: a processor; and a memory, configured to store instructions executable by the processor, wherein the processor is configured to execute the instructions stored in the memory, so as to: obtain a plurality of text blocks provided by a target user, each text block comprising a plurality of texts with similar semantics; obtain a first text set meeting a preset similarity matching condition with a target text from the plurality of text blocks; obtain a second text set from the first text set, wherein each text in the second text set does not belong to a same text block as the target text; generate a negative sample set of the target text based on content of a candidate text block to which each text in the second text set belongs; generate a positive sample set of the target text ba
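The candidate selection recited in claims 1, 3, and 4 (cosine threshold, cap on the number of texts, block-identifier filter) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the vectors are hand-written toy values rather than outputs of a language representation model, `cosine` and `select_candidates` are hypothetical names, and keeping the highest-scoring texts when trimming is a choice the claims do not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(target_vec, target_block, items,
                      cos_threshold=0.8, max_texts=2):
    """items is a list of (block_id, text, vector) triples."""
    scored = [(cosine(target_vec, vec), bid, text) for bid, text, vec in items]
    # Claim 1: keep texts whose cosine similarity exceeds the preset threshold.
    first = sorted((s for s in scored if s[0] > cos_threshold), reverse=True)
    # Claim 3: trim the first text set to the preset number threshold.
    # Assumption: the lowest-scoring texts are the ones deleted.
    first = first[:max_texts]
    # Claim 4: keep texts whose block identifier differs from the target's.
    return [(text, bid) for _, bid, text in first if bid != target_block]


# Toy statement vectors (hypothetical values).
target_vec = [1.0, 0.0]
items = [
    ("A", "same-block paraphrase", [0.9, 0.1]),
    ("B", "near miss from another block", [0.8, 0.3]),
    ("C", "weaker match", [0.7, 0.5]),
    ("D", "unrelated", [0.0, 1.0]),
]
second_text_set = select_candidates(target_vec, "A", items)
print(second_text_set)
```

In this toy run three texts clear the 0.8 threshold, the cap of two removes the weakest of them, and the block filter then removes the text that shares the target's block, leaving a single candidate from block "B".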
Lexical analysis, e.g. tokenisation or collocates · CPC title
Knowledge representation; Symbolic representation · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Matching criteria, e.g. proximity measures · CPC title
Aggregation; Duplicate elimination · CPC title