Who is the assignee on this patent?

Beijing Baidu Netcom Sci & Tech Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06F40/30. Mapped technology areas include Physics.

When was this patent published?

Publication date May 30, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus for processing dataset

US11663258B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11663258-B2
Application number	US-202017133869-A
Country	US
Kind code	B2
Filing date	Dec 24, 2020
Priority date	May 20, 2020
Publication date	May 30, 2023
Grant date	May 30, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure discloses a method and apparatus for processing a dataset. The method includes: obtaining a first text set meeting a preset similarity matching condition with a target text from multiple text blocks provided by a target user; obtaining a second text set from the first text set, in which each text in the second text set does not belong to a same text block as the target text; generating a negative sample set of the target text based on content of a candidate text block to which each text in the second text set belongs; generating a positive sample set of the target text based on content of a target text block to which the target text belongs; and generating a dataset of the target user based on the negative sample set and the positive sample set, and training a matching model based on the dataset.
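The sample-generation steps in the abstract can be sketched roughly as follows. This is an illustrative reading, not the patented implementation. The function name `build_samples`, the dict-of-lists layout for text blocks, the 1/0 labels, and the choice to pair the target text with every text of a candidate block are all assumptions made for the example. The first text set is taken as an input here because the similarity search is a separate step, and texts are assumed to be unique across blocks.

```python
from itertools import combinations


def build_samples(blocks, target_text, first_text_set):
    """Build labelled text pairs for one target text (hypothetical helper).

    blocks: dict mapping a block id to a list of texts with similar semantics.
    first_text_set: texts already judged similar to target_text.
    Returns a list of (text_a, text_b, label) tuples; label 1 = match, 0 = no match.
    """
    # Look up the block each text belongs to (assumes texts are unique).
    block_of = {text: block_id
                for block_id, texts in blocks.items()
                for text in texts}
    target_block = block_of[target_text]

    # Second text set: similar-looking texts that live in a different block.
    second_text_set = [t for t in first_text_set
                       if block_of[t] != target_block]

    # Candidate blocks are the blocks those texts belong to.
    candidate_blocks = sorted({block_of[t] for t in second_text_set})

    # Positive samples: pairs of texts inside the target block.
    positives = [(a, b, 1) for a, b in combinations(blocks[target_block], 2)]

    # Negative samples (assumed scheme): the target text paired with
    # every text of every candidate block.
    negatives = [(target_text, other, 0)
                 for block_id in candidate_blocks
                 for other in blocks[block_id]]

    return positives + negatives
```

The point of drawing negatives from blocks that contain a superficially similar text is that the resulting pairs are hard negatives: they look alike but carry different meanings, which is what the matching model needs to learn to separate.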

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for processing a dataset, comprising: obtaining a plurality of text blocks provided by a target user, each text block comprising a plurality of texts with similar semantics, and obtaining a first text set meeting a preset similarity matching condition with a target text from the plurality of text blocks; obtaining a second text set from the first text set, wherein each text in the second text set does not belong to a same text block as the target text; generating a negative sample set of the target text based on content of a candidate text block to which each text in the second text set belongs; generating a positive sample set of the target text based on content of a target text block to which the target text belongs; and generating a dataset of the target user based on the negative sample set and the positive sample set, and training a matching model based on the dataset of the target user for recognizing a text similarity; wherein the obtaining the first text set meeting the preset similarity matching condition with the target text from the plurality of text blocks comprises: obtaining a sub-vector, a text vector and a position vector corresponding to the target text, and inputting the sub-vector, the text vector and the position vector corresponding to the target text into a preset training language representation model to obtain a target statement vector; obtaining a sub-vector, a text vector and a position vector corresponding to each text in the plurality of text blocks, and inputting the sub-vector, the text vector and the position vector corresponding to each text into the preset training language representation model to obtain a statement vector corresponding to each text; calculating a cosine similarity between the target statement vector and the statement vector corresponding to each text; and comparing the cosine similarity with a preset cosine threshold, and generating the first text set based on texts corresponding to cosine similarities greater than the preset cosine threshold.

2. The method of claim 1, wherein obtaining the first text set meeting the preset similarity matching condition with the target text from the plurality of text blocks comprises: performing word segmentation on the target text to generate a first word segmentation set, and performing word segmentation on respective texts in the plurality of text blocks to generate a plurality of second word segmentation sets; comparing the first word segmentation set with each set of the plurality of second word segmentation sets to obtain a word segmentation duplication between the first word segmentation set and each set of the plurality of second word segmentation sets; and comparing the word segmentation duplication between the first word segmentation set and each set of the plurality of second word segmentation sets with a preset threshold, and generating the first text set based on second word segmentation sets corresponding to word segmentation duplications greater than the preset threshold.

3. The method of claim 1, wherein before obtaining the second text set from the first text set, the method further comprises: obtaining a number of texts in the first text set, and determining whether the number of texts is greater than a preset number threshold; and in a case that the number of texts is greater than the preset number threshold, deleting one or more texts in the first text set based on the preset number threshold, such that the number of texts is equal to the preset number threshold.

4. The method of claim 1, wherein obtaining the second text set from the first text set comprises: obtaining a block identifier corresponding to the target text; obtaining a block identifier corresponding to each text in the first text set; and comparing the block identifier corresponding to each text with the block identifier corresponding to the target text, and generating the second text set based on one or more texts having inconsistent block identifiers with the target text.

5. The method of claim 1, wherein generating the negative sample set of the target text based on the content of the candidate text block to which each text in the second text set belongs comprises: obtaining the content of the candidate text block to which each text in the second text set belongs; performing text combination on a plurality of texts in each candidate text block to generate a first negative sample set; performing text combination on a plurality of texts in different candidate text blocks to generate a second negative sample set; and generating the negative sample set of the target text based on the first negative sample set and the second negative sample set.

6. The method of claim 1, wherein generating the positive sample set of the target text based on the content of the target text block to which the target text belongs comprises: obtaining the content of the target text block to which the target text belongs; and performing text combination on a plurality of texts in the target text block to generate the positive sample set of the target text.

7. The method of claim 1, wherein the target user comprises a plurality of sub-users, obtaining the plurality of text blocks provided by the target user comprises: obtaining a plurality of text blocks provided by each sub-user; and generating the dataset of the target user based on the negative sample set and the positive sample set comprises: generating a sub-dataset corresponding to each sub-user based on the negative sample set and the positive sample set; combining sub-datasets corresponding to respective sub-users to generate a candidate dataset; and performing deduplication processing on the candidate dataset based on a preset deduplication strategy to generate the dataset of the target user.

8. The method of claim 1, wherein training the matching model based on the dataset for recognizing the text similarity comprises: obtaining a first query statement and a second query statement; encoding the first query statement to generate a first query vector; encoding the second query statement to generate a second query vector; and inputting the first query vector and the second query vector to the matching model to obtain a matching type outputted, and determining the text similarity between the first query statement and the second query statement based on the matching type.

9. The method of claim 1, wherein training the matching model based on the dataset for recognizing the text similarity comprises: obtaining a first query statement and a second query statement; inputting the first query statement and the second query statement to the matching model for statement alignment to obtain an alignment result; and determining the text similarity between the first query statement and the second query statement based on the alignment result.

10. An apparatus for processing a dataset, comprising: a processor; and a memory, configured to store instructions executable by the processor, wherein the processor is configured to execute the instructions stored in the memory, so as to: obtain a plurality of text blocks provided by a target user, each text block comprising a plurality of texts with similar semantics; obtain a first text set meeting a preset similarity matching condition with a target text from the plurality of text blocks; obtain a second text set from the first text set, wherein each text in the second text set does not belong to a same text block as the target text; generate a negative sample set of the target text based on content of a candidate text block to which each text in the second text set belongs; generate a positive sample set of the target text ba…
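The cosine-threshold step of claim 1, together with the size cap of claim 3, might look like the sketch below. It assumes the statement vectors have already been produced by some language representation model and are passed in as plain lists of floats; the names `cosine_similarity`, `select_first_text_set`, and `max_texts` are hypothetical. The claims do not say which texts are deleted when the set is too large, so dropping the lowest-scoring ones is an assumption of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_first_text_set(target_vec, text_vecs, cosine_threshold, max_texts=None):
    """Pick texts whose statement vector is close to the target's.

    text_vecs: dict mapping each text to its precomputed statement vector.
    cosine_threshold: texts are kept only if similarity is strictly greater.
    max_texts: optional cap on the size of the result (claim 3);
               the lowest-scoring texts are dropped first (assumption).
    """
    scored = [(cosine_similarity(target_vec, vec), text)
              for text, vec in text_vecs.items()]
    kept = [(score, text) for score, text in scored if score > cosine_threshold]
    # Highest similarity first, so truncation removes the weakest matches.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    if max_texts is not None and len(kept) > max_texts:
        kept = kept[:max_texts]
    return [text for _, text in kept]
```

The caller is expected to leave the target text itself out of `text_vecs`; otherwise it will always score 1.0 and occupy a slot in the result.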

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06N5/02
Knowledge representation; Symbolic representation · CPC title
G06F18/214
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
G06F18/22
Matching criteria, e.g. proximity measures · CPC title
G06F16/24556
Aggregation; Duplicate elimination · CPC title

Patent family

Related publications grouped by family.

View patent family 72537652

Related patents

Machine learning techniques for analyzing textual content

Systems and methods for measuring goals based on matching electronic activities to record objects

Adaptive sampling scheme for imbalanced large scale data