Method and device for Chinese concept embedding generation based on Wikipedia link structure
US-2021073307-A1 · Mar 11, 2021 · US
US11416684B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11416684-B2 |
| Application number | US-202016784145-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 6, 2020 |
| Priority date | Feb 6, 2020 |
| Publication date | Aug 16, 2022 |
| Grant date | Aug 16, 2022 |
Official abstract text for this publication.
Techniques are described for intelligently identifying concept labels for a set of multiple documents where the identified concept labels are representative of and semantically relevant to the information contained by the set of documents. The technique includes extracting semantic units (e.g., paragraphs) from the set of documents and determining concept labels applicable to the semantic units based on relevance scores computed for the concept labels. The technique includes determining an initial set of concept labels for the set of documents based on the applicable concept labels. The technique further includes obtaining a reference hierarchy associated with the reference set of concept labels and determining a final set of concept labels for the set of documents using a reference hierarchy, the initial set of concept labels, and the relevance scores. The technique includes outputting information identifying the final set of concept labels for the set of documents.
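The first half of the pipeline in the abstract (per-unit concept labels chosen by relevance score, then an initial label set for the whole document set) can be sketched as follows. The claims add that semantic units are ordered by an entropy value, lowest first. This is a minimal illustration, not the patented implementation: the function names, the top-k selection rule, the use of Shannon entropy over normalized relevance scores, and the sample scores are all assumptions introduced here, since the source does not give formulas.

```python
import math


def applicable_labels(scores, top_k=3):
    """Keep the top_k labels with a positive relevance score (assumed selection rule)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {label: s for label, s in ranked[:top_k] if s > 0}


def entropy(label_scores):
    """Shannon entropy in bits of the normalized scores (assumed entropy definition).

    A low value means the unit is dominated by few labels, i.e. the labels are specific.
    """
    total = sum(label_scores.values())
    if total <= 0:
        return 0.0
    return -sum((s / total) * math.log2(s / total) for s in label_scores.values() if s > 0)


def initial_label_set(unit_scores, max_labels, top_k=3):
    """Build the initial label set from semantic units ordered by ascending entropy."""
    per_unit = {u: applicable_labels(s, top_k) for u, s in unit_scores.items()}
    ordered = sorted(per_unit, key=lambda u: (entropy(per_unit[u]), u))

    initial, unprocessed = [], []
    for i, unit in enumerate(ordered):
        if len(initial) >= max_labels:      # threshold criterion reached
            unprocessed = ordered[i:]
            break
        for label in per_unit[unit]:        # add labels not already present
            if label not in initial:
                initial.append(label)

    # Coverage pass: every remaining unit should share at least one label with the set.
    for unit in unprocessed:
        labels = per_unit[unit]
        if labels and not any(label in initial for label in labels):
            initial.append(max(labels, key=labels.get))
    return initial


# Hypothetical relevance scores for three paragraphs.
unit_scores = {
    "p1": {"Machine learning": 0.9, "Statistics": 0.1},
    "p2": {"Machine learning": 0.5, "Databases": 0.5},
    "p3": {"Networking": 1.0},
}
print(initial_label_set(unit_scores, max_labels=3))
```

Because the threshold is checked before each unit is processed, the set can end up slightly larger than `max_labels`; the coverage pass can also add labels beyond it.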
Opening claim text (preview).
What is claimed is:
1. A method comprising: extracting, by a computer system, a plurality of semantic units from a plurality of documents; for each semantic unit in the plurality of semantic units, determining, by the computer system, from a reference set of concept labels, one or more concept labels applicable to the semantic unit; based on the concept labels determined for the plurality of semantic units, determining, by the computer system, an initial set of concept labels for the plurality of documents, wherein determining the initial set of concept labels comprises: for each semantic unit in the plurality of semantic units, computing an entropy value for the semantic unit based on the one or more concept labels determined to be applicable to the semantic unit, wherein the entropy value for the semantic unit indicates a degree of specificity of the one or more concept labels to the semantic unit; ordering the plurality of semantic units based on the computed entropy values; and using the ordered plurality of semantic units to determine the initial set of concept labels for the plurality of documents; obtaining, by the computer system, a reference hierarchy associated with the reference set of concept labels, the reference hierarchy identifying hierarchical relationships between two or more concept labels in the reference set of concept labels; determining, by the computer system, a final set of concept labels for the plurality of documents using the reference hierarchy and the initial set of concept labels; and outputting, by the computer system, information identifying the final set of concept labels for the plurality of documents.
2. The method of claim 1, wherein the reference set of concept labels comprise titles of a plurality of reference documents, and wherein the plurality of reference documents comprise Wikipedia articles.
3. The method of claim 1, wherein the plurality of semantic units comprise a plurality of paragraphs in the plurality of documents.
4. The method of claim 1, wherein determining, from the reference set of concept labels, for each semantic unit in the plurality of semantic units, the one or more concept labels applicable to the semantic unit comprises: for each semantic unit in the plurality of semantic units: for each concept label in the reference set of concept labels, computing, by the computer system, a relevance score for the concept label for the semantic unit, the relevance score for the concept label indicative of a degree of relevance of the concept label to contents of the semantic unit; and based on the relevance scores computed for the concept labels in the reference set of concept labels for the semantic unit, selecting, by the computer system, the one or more concept labels applicable to the semantic unit from the reference set of concept labels.
5. The method of claim 1, wherein: a semantic unit in the plurality of semantic units with a higher computed entropy value is placed lower in the ordered plurality of semantic units than a semantic unit in the plurality of semantic units having a lower computed entropy value.
6. The method of claim 1, wherein determining the initial set of concept labels for the plurality of documents further comprises: (a) selecting an unprocessed semantic unit in the ordered plurality of semantic units with the lowest entropy value; (b) adding to the initial set of concept labels, any concept label associated with the semantic unit that is not already in the initial set of concept labels; and (c) marking the selected semantic unit as processed.
7. The method of claim 6 further comprising repeating (a), (b), and (c) until all the semantic units in the ordered plurality of semantic units have been processed or until a first threshold criterion is satisfied, wherein the first threshold criterion is satisfied when a preconfigured threshold number of concept labels are included in the initial set of concept labels.
8. The method of claim 7 further comprising: determining that the first threshold criterion is satisfied; and adding additional one or more concept labels to the initial set of concept labels to ensure that each semantic unit in the plurality of semantic units is associated with at least one concept label in the initial set of concept labels.
9. The method of claim 8, wherein adding the additional one or more concept labels to the initial set of concept labels comprises: for at least one unprocessed semantic unit in the ordered plurality of semantic units: identifying that a first concept label associated with the at least one unprocessed semantic unit is not included in the initial set of concept labels; and adding the first concept label to the initial set of concept labels.
10. The method of claim 1, wherein determining the final set of concept labels comprises: identifying, based upon the reference hierarchy, hierarchical relationships between concept labels in the initial set of concept labels; generating a Directed Acyclic Graph (DAG) of nodes for representing the hierarchical relationships, each node in the DAG of nodes representing a concept label in the initial set of concept labels and, wherein connections between the nodes in the DAG of nodes represent the hierarchical relationships; identifying, based upon the reference hierarchy, a set of ancestor concept labels for the concept labels in the initial set of concept labels, wherein, for at least a first concept label in the initial set of concept labels, the set of ancestor concept labels comprises multiple concept labels that are ancestors of the first concept label in the reference hierarchy and the multiple concept labels are not in the initial set of concept labels; and updating the DAG of nodes to add nodes corresponding to the set of ancestor concept labels to the DAG of nodes, wherein the updating comprises adding connections to the DAG of nodes to represent hierarchical relationships between the nodes representing the set of ancestor concept labels and the nodes representing the concept labels in the initial set of concept labels.
11. The method of claim 10, wherein determining the final set of concept labels further comprises: assigning a weight to each node in the DAG of nodes based on relevance scores associated with the concept labels represented by the DAG of nodes; computing a usefulness score for each node in the DAG of nodes based on the weight of the node, wherein the usefulness score for each node in the DAG of nodes is computed based on a weighted relevance score computed for the node and a weighted relevance score computed for one or more descendant nodes of the node in the DAG of nodes; selecting a node from the DAG of nodes with the highest usefulness score; and adding a concept label represented by the node selected from the DAG of nodes to the final set of concept labels.
12. The method of claim 11 further comprising: (a) removing the selected node from the DAG of nodes to generate an updated DAG of nodes; (b) re-computing a weight for each node remaining in the updated DAG of nodes; (c) re-computing a usefulness score for each node in the updated DAG of nodes; (d) selecting a node from the updated DAG of nodes with the highest usefulness score; and (e) adding a concept label represented by the node selected from the updated DAG of nodes to the final set of concept labels.
13. The method of claim 12 further comprising: repeating (a), (b), (c), (d), and (e) until a number of concept labels included in the final set of concept labels equals or is higher than a pre-configured threshold number of concept labels.
14.
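The hierarchy step in claims 10 to 13 (build a DAG from the initial labels plus their ancestors, score each node, then repeatedly pick and remove the best node) could look roughly like the sketch below. The claims do not state formulas, so several choices here are assumptions: the reference hierarchy is a hypothetical `parents_of` mapping from a label to its parent labels, a node's weight is its given relevance score (zero for ancestors added from the hierarchy), and the usefulness score is the node's own weight plus the weights of all its descendants in the current DAG.

```python
def build_dag(initial, parents_of):
    """Return {node: set(parents)} for the initial labels and all of their ancestors."""
    edges, stack = {}, list(initial)
    while stack:
        node = stack.pop()
        if node not in edges:
            edges[node] = set(parents_of.get(node, ()))
            stack.extend(edges[node])
    return edges


def descendants(node, edges):
    """Collect every node reachable from `node` by following parent-to-child links."""
    children = {}
    for child, parents in edges.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def select_final_labels(initial, parents_of, relevance, k):
    """Greedily pick up to k labels by usefulness, removing each pick from the DAG."""
    edges = build_dag(initial, parents_of)
    final = []
    while edges and len(final) < k:
        # Assumed usefulness: own weight plus the weights of all current descendants.
        usefulness = {
            n: relevance.get(n, 0.0)
            + sum(relevance.get(d, 0.0) for d in descendants(n, edges))
            for n in edges
        }
        best = max(sorted(usefulness), key=usefulness.get)  # ties broken alphabetically
        final.append(best)
        del edges[best]                     # remove the node, then rescore on the next pass
        for parents in edges.values():
            parents.discard(best)
    return final


# Hypothetical hierarchy and relevance scores.
parents_of = {
    "Neural networks": ["Machine learning"],
    "Decision trees": ["Machine learning"],
    "Machine learning": ["Computer science"],
    "Databases": ["Computer science"],
}
relevance = {"Neural networks": 0.6, "Decision trees": 0.5, "Databases": 0.3}
initial = ["Neural networks", "Decision trees", "Databases"]
print(select_final_labels(initial, parents_of, relevance, k=2))
```

With an unweighted sum like this one, the most general ancestor tends to be selected first. A scoring rule that discounts distant descendants or penalizes generality would behave differently, and the claims leave that choice open.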
Thesauruses; Synonyms · CPC title
Semantic analysis · CPC title
Graphical models, e.g. Bayesian networks · CPC title
Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors · CPC title
using statistical methods · CPC title