Method and apparatus for performing word segmentation on text, device, and medium

US11468236B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11468236-B2
Application number: US-202017020166-A
Country: US
Kind code: B2
Filing date: Sep 14, 2020
Priority date: Jan 14, 2020
Publication date: Oct 11, 2022
Grant date: Oct 11, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the present disclosure provide a method and apparatus for performing word segmentation on a text, a device and a medium, which relate to the field of data processing technology and particularly to a smart search technology. The method may include: dividing a to-be-segmented text into at least two layers of character fragment combinations, any layer of character fragments being child character fragments of a previous layer of character fragments and/or parent character fragments of a next layer of character fragments; and segmenting the to-be-segmented text according to a target word granularity based on the at least two layers of character fragment combinations.
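
The layered parent/child structure described in the abstract can be pictured as a small tree. The following is a minimal sketch, not the patented implementation: the `Fragment` class, the `segment` function, the sample text, and the idea of mapping the "target word granularity" to a tree depth are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fragment:
    """A character fragment; its children form the next (finer) layer."""
    text: str
    children: List["Fragment"] = field(default_factory=list)


def segment(root: Fragment, depth: int) -> List[str]:
    """Collect the fragments found `depth` layers below `root`.

    A fragment with no children is kept as-is, even if the requested
    depth has not been reached yet.
    """
    if depth == 0 or not root.children:
        return [root.text]
    pieces: List[str] = []
    for child in root.children:
        pieces.extend(segment(child, depth - 1))
    return pieces


# Hypothetical three-layer division of one text (layers chosen by hand).
tree = Fragment("newyorkcity", [
    Fragment("newyork", [Fragment("new"), Fragment("york")]),
    Fragment("city"),
])

print(segment(tree, 1))  # ['newyork', 'city']
print(segment(tree, 2))  # ['new', 'york', 'city']
```

In this sketch a coarser granularity corresponds to a smaller depth and a finer granularity to a larger one; the abstract itself does not state how granularity is encoded.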

First claim

Opening claim text (preview).

What is claimed is:

1. A method for performing word segmentation on a text, comprising: dividing a to-be-segmented text into at least two layers of character fragment combinations, any layer of character fragments being child character fragments of a previous layer of character fragments and/or parent character fragments of a next layer of character fragments; and segmenting the to-be-segmented text according to a target word granularity based on the at least two layers of character fragment combinations; wherein the dividing the to-be-segmented text into at least two layers of character fragment combinations comprises: extracting candidate character fragments of at least one kind of length from the previous layer of character fragments, the previous layer of character fragments belonging to a previous layer of character fragment combination; combining the extracted candidate character fragments to obtain candidate character fragment combinations; and determining a current layer of character fragment combination from the candidate character fragment combinations according to an overlapping relationship between the candidate character fragments and historical usage information of the candidate character fragments, the current layer of character fragment combination including at least one character fragment of the current layer.

2. The method according to claim 1, wherein the determining the current layer of character fragment combination from the candidate character fragment combinations according to the overlapping relationship between the candidate character fragments and historical usage information of the candidate character fragments comprises: filtering a candidate character fragment combination having an overlap from the candidate character fragment combinations, to obtain target character fragment combinations; and determining the current layer of character fragment combination from the target character fragment combinations according to a number of candidate character fragments included in the target character fragment combinations and historical usage information of the candidate character fragments.

3. The method according to claim 2, wherein the determining the current layer of character fragment combination from the target character fragment combinations according to the number of candidate character fragments included in the target character fragment combinations and historical usage information of the candidate character fragments comprises: calculating an information entropy of the candidate character fragments according to historical adjacent character information of the candidate character fragments; determining weights of the target character fragment combinations according to the calculated information entropy; and determining the current layer of character fragment combination from the target character fragment combinations according to the number of the candidate character fragments included in the target character fragment combinations and the weights of the target character fragment combinations.

4. The method according to claim 1, wherein the segmenting the to-be-segmented text according to the target word granularity based on the at least two layers of character fragment combinations comprises: determining target segmentation fragments from character fragments of the character fragment combinations according to historical usage information of character fragments in the character fragment combinations and a parent-child relationship between character fragments in different character fragment combinations; and combining the target segmentation fragments, and segmenting the to-be-segmented text according to the target word granularity based on the combination of target segmentation fragments.

5. The method according to claim 4, wherein the determining target segmentation fragments from character fragments of the character fragment combinations according to historical usage information of character fragments in the character fragment combination and a parent-child relationship between character fragments in different character fragment combinations comprises: determining, according to historical usage information of a parent character fragment in the character fragment combinations, a weight of the parent character fragment; determining, according to historical usage information of a child character fragment associated with the parent character fragment, a comprehensive weight of the child character fragment; and comparing the weight of the parent character fragment with the comprehensive weight of the child character fragment; and terminating a traversal for a branch to which the parent character fragment belongs and using the child character fragment associated with the parent character fragment as the target segmentation fragment, in response to the weight of the parent character fragment is greater than the comprehensive weight of the child character fragment.

6. The method according to claim 1, wherein after segmenting the to-be-segmented text, the method further comprises: comparing a target segmentation word obtained through the segmentation with an existing segmentation word, the existing segmentation word being obtained by segmenting the to-be-segmented text based on an existing word segmentation logic; and determining a to-be-mined word from the target segmentation word according to a comparison result.

7. An electronic device, comprising: at least one processor; and a memory, communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions when executed by the at least one processor, cause the at least one processor to perform operations, the operations comprising: dividing a to-be-segmented text into at least two layers of character fragment combinations, any layer of character fragments being child character fragments of a previous layer of character fragments and/or parent character fragments of a next layer of character fragments; and segmenting the to-be-segmented text according to a target word granularity based on the at least two layers of character fragment combinations; wherein the dividing the to-be-segmented text into at least two layers of character fragment combinations comprises: extracting candidate character fragments of at least one kind of length from the previous layer of character fragments, the previous layer of character fragments belonging to a previous layer of character fragment combination; combining the extracted candidate character fragments to obtain candidate character fragment combinations; and determining a current layer of character fragment combination from the candidate character fragment combinations according to an overlapping relationship between the candidate character fragments and historical usage information of the candidate character fragments, the current layer of character fragment combination including at least one character fragment of the current layer.

8. The electronic device according to claim 7, wherein the determining the current layer of character fragment combination from the candidate character fragment combinations according to the overlapping relationship between the candidate character fragments and historical usage information of the candidate character fragments comprises: filtering a candidate character fragment combination having an overlap from the candidate character fragment combinations, to obtain target character fragment combinations; and determining the current layer of character fragment combination from the target character fragment combinations according to a number of candidate character fragments included in the target
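
The selection step recited in claims 2 and 3 (filter out overlapping combinations, weight the rest by the information entropy of historically adjacent characters, then choose using fragment count and weight) might be sketched as follows. This is an illustrative reading only: the claims do not give a formula, so the use of Shannon entropy in bits, the summing of entropies into a combination weight, the ranking rule (higher weight first, fewer fragments on ties), and every name and data value below are assumptions.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end) offsets into the parent fragment


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy (bits) of characters historically seen next to a fragment."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in neighbors.values())


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one character position."""
    return a[0] < b[1] and b[0] < a[1]


def pick_layer(candidates: Dict[Span, Counter]) -> List[Span]:
    """Brute-force choice of one non-overlapping combination (small inputs only).

    Assumed ranking: larger summed entropy wins; ties go to fewer fragments.
    """
    spans = sorted(candidates)
    best: List[Span] = []
    best_key: Optional[Tuple[float, int]] = None
    for size in range(1, len(spans) + 1):
        for combo in combinations(spans, size):
            # Filtering step: discard any combination containing an overlap.
            if any(overlaps(a, b) for a, b in combinations(combo, 2)):
                continue
            weight = sum(neighbor_entropy(candidates[s]) for s in combo)
            key = (weight, -len(combo))
            if best_key is None or key > best_key:
                best, best_key = list(combo), key
    return best


# Hypothetical history: each candidate span maps to counts of adjacent characters.
candidates = {
    (0, 3): Counter("abcd"),  # four equally frequent neighbours -> 2 bits
    (2, 5): Counter("ab"),    # two equally frequent neighbours -> 1 bit
    (3, 5): Counter("ab"),
}
print(pick_layer(candidates))  # [(0, 3), (3, 5)]
```

A fragment followed by many different characters in past usage has high entropy, which is a common signal that it ends at a word boundary; that intuition is why the sketch treats higher entropy as a higher weight.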

Assignees

Baidu Online Network Technology (Beijing) Co., Ltd.

Classifications

  • G06F40/279 (Primary)

    Recognition of textual entities · CPC title

  • using natural language analysis · CPC title

  • Energy efficient computing, e.g. low power processors, power management or thermal management · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11468236B2 cover?
Embodiments of the present disclosure provide a method and apparatus for performing word segmentation on a text, a device and a medium, which relate to the field of data processing technology and particularly to a smart search technology. The method may include: dividing a to-be-segmented text into at least two layers of character fragment combinations, any layer of character fragments being ch…
Who is the assignee on this patent?
Baidu Online Network Technology (Beijing) Co., Ltd.
What technology area does this patent fall under?
Primary CPC classification G06F40/279. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 11, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).