System and method for storing and searching data extracted from text documents
US-2016275180-A1 · Sep 22, 2016 · US
US11250204B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11250204-B2 |
| Application number | US-201715831412-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 5, 2017 |
| Priority date | Dec 5, 2017 |
| Publication date | Feb 15, 2022 |
| Grant date | Feb 15, 2022 |
Official abstract text for this publication.
A method for generating a context-aware knowledge base is provided. The method may include extracting document object model (DOM) tag elements associated with one or more webpages. The method may further include identifying and extracting webpage data associated with the extracted DOM tags. The method may further include determining a context associated with the identified and extracted webpage data by detecting and extracting resource description framework (RDF) triplets in candidate DOM tag elements. The method may further include ranking the extracted RDF triplets. The method may also include validating one or more RDF triplets associated with the ranked RDF triplets. The method may further include connecting the validated RDF triplets to a knowledge graph associated with a knowledge base of the one or more webpages.
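The first two steps of the abstract, extracting DOM tag elements and the webpage data attached to them, can be sketched with Python's standard-library `html.parser`. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Node`, `DomExtractor`, `extract_dom`, and `iter_nodes` are hypothetical, and the sketch assumes well-nested markup.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """One DOM tag element with its direct text and its tree links."""
    tag: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)


class DomExtractor(HTMLParser):
    """Hypothetical helper that builds a small element tree from HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")  # synthetic root, not a real tag
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self._current)
        self._current.children.append(node)
        self._current = node

    def handle_endtag(self, tag):
        # Assumption: markup is well nested. Unclosed void tags such as
        # <br> would need special handling that this sketch omits.
        if self._current.parent is not None:
            self._current = self._current.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            joined = (self._current.text + " " + stripped).strip()
            self._current.text = joined


def extract_dom(html: str) -> Node:
    """Parse an HTML string and return the synthetic root node."""
    parser = DomExtractor()
    parser.feed(html)
    parser.close()
    return parser.root


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first walk over every element below the given node."""
    for child in node.children:
        yield child
        yield from iter_nodes(child)


if __name__ == "__main__":
    root = extract_dom("<div><h2>Price</h2><p>42 USD</p></div>")
    for element in iter_nodes(root):
        print(element.tag, repr(element.text))
```

Keeping the parent and child links on each node is what later makes it possible to look at neighbouring elements when determining the context of a piece of webpage data.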
Opening claim text (preview).
What is claimed is:
1. A method for generating a context-aware knowledge base, the method comprising: extracting document object model (DOM) tag elements associated with a webpage; identifying and extracting webpage data associated with a first DOM tag element from the extracted DOM tags; determining a context associated with the identified and extracted webpage data for the first DOM tag element, wherein determining the context comprises, detecting and extracting resource description framework (RDF) triplets in candidate DOM tag elements, wherein the candidate DOM tag elements are based on a determined relationship to the first DOM tag element and include parent and sibling DOM tag elements, and wherein detecting and extracting the RDF triplets comprises detecting and extracting the RDF triplets from the candidate DOM tag elements nearest the first DOM tag element and based on an order associated with the determined relationship until text is identified, and ranking the extracted RDF triplets based on a connection between the RDF triplets and the webpage data associated with the first DOM tag element; validating one or more RDF triplets associated with the ranked RDF triplets; and connecting the validated RDF triplets to a knowledge graph associated with a knowledge base of the webpage.
2. The method of claim 1, wherein extracting the DOM tag elements associated with the webpage further comprises: determining a relationship between the extracted DOM tag elements.
3. The method of claim 1, wherein identifying and extracting the webpage data associated with the first DOM tag element from the extracted DOM tags further comprises: extracting text associated with the first DOM tag element.
4. The method of claim 1, wherein ranking the extracted RDF triplets further comprises: determining a confidence score for the extracted RDF triplets, wherein the confidence score represents a level of connection between an extracted subject and an extracted object associated with the extracted RDF triplets.
5. The method of claim 1, wherein validating the one or more RDF triplets associated with the ranked RDF triplets further comprises: generating and setting one or more threshold confidence scores; and enabling a user to edit and validate the one or more RDF triplets associated with the ranked RDF triplets.
6. The method of claim 1, further comprising: tracking changes to the validated RDF triplets.
7. A computer system for generating a context-aware knowledge base, comprising: one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage devices, and program instructions stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, wherein the computer system is capable of performing a method comprising: extracting document object model (DOM) tag elements associated with a webpage; identifying and extracting webpage data associated with a first DOM tag element from the extracted DOM tags; determining a context associated with the identified and extracted webpage data for the first DOM tag element, wherein determining the context comprises, detecting and extracting resource description framework (RDF) triplets in candidate DOM tag elements, wherein the candidate DOM tag elements are based on a determined relationship to the first DOM tag element and include parent and sibling DOM tag elements, and wherein detecting and extracting the RDF triplets comprises detecting and extracting the RDF triplets from the candidate DOM tag elements nearest the first DOM tag element and based on an order associated with the determined relationship until text is identified, and ranking the extracted RDF triplets based on a connection between the RDF triplets and the webpage data associated with the first DOM tag element; validating one or more RDF triplets associated with the ranked RDF triplets; and connecting the validated RDF triplets to a knowledge graph associated with a knowledge base of the webpage.
8. The computer system of claim 7, wherein extracting the DOM tag elements associated with the webpage further comprises: determining a relationship between the extracted DOM tag elements.
9. The computer system of claim 7, wherein identifying and extracting the webpage data associated with the first DOM tag element from the extracted DOM tags further comprises: extracting text associated with the first DOM tag element.
10. The computer system of claim 7, wherein ranking the extracted RDF triplets further comprises: determining a confidence score for the extracted RDF triplets, wherein the confidence score represents a level of connection between an extracted subject and an extracted object associated with the extracted RDF triplets.
11. The computer system of claim 7, wherein validating the one or more RDF triplets associated with the ranked RDF triplets further comprises: generating and setting one or more threshold confidence scores; and enabling a user to edit and validate the one or more RDF triplets associated with the ranked RDF triplets.
12. The computer system of claim 7, further comprising: tracking changes to the validated RDF triplets.
13. A computer program product for generating a context-aware knowledge base, comprising: one or more computer-readable storage devices and program instructions stored on at least one of the one or more tangible storage devices, the program instructions executable by a processor, the program instructions comprising: program instructions to extract document object model (DOM) tag elements associated with a webpage; program instructions to identify and extract webpage data associated with a first DOM tag element from the extracted DOM tags; program instructions to determine a context associated with the identified and extracted webpage data for the first DOM tag element, wherein determining the context comprises, program instructions to detect and extract resource description framework (RDF) triplets in candidate DOM tag elements, wherein the candidate DOM tag elements are based on a determined relationship to the first DOM tag element and include parent and sibling DOM tag elements, and wherein detecting and extracting the RDF triplets comprises detecting and extracting the RDF triplets from the candidate DOM tag elements nearest the first DOM tag element and based on an order associated with the determined relationship until text is identified, and program instructions to rank the extracted RDF triplets based on a connection between the RDF triplets and the webpage data associated with the first DOM tag element; program instructions to validate one or more RDF triplets associated with the ranked RDF triplets; and program instructions to connect the validated RDF triplets to a knowledge graph associated with a knowledge base of the webpage.
14. The computer program product of claim 13, wherein the program instructions to extract the DOM tag elements associated with the webpage further comprises: program instructions to determine a relationship between the extracted DOM tag elements.
15. The computer program product of claim 13, wherein the program instructions to rank the extracted RDF triplets further comprises: program instructions to determine a confidence score for the extracted RDF triplets, wherein the confidence score represents a level of connection between an extracted subject and an extracted object associated with the extracted RDF triplets.
16. The compute
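The context search in claim 1 (candidate parent and sibling elements, visited nearest first until text is identified) together with the confidence score and threshold of claims 4 and 5 can be illustrated roughly as follows. Everything here is an assumption made for illustration: the claims do not fix whether siblings or the parent come first, how the confidence score is computed, or which predicate is used, so the siblings-then-parent order, the `1 / distance` score, the `hasValue` predicate, and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a DOM tag element."""
    tag: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, tag: str, text: str = "") -> "Node":
        child = Node(tag, text, parent=self)
        self.children.append(child)
        return child


def candidates(first: Node) -> Iterator[Tuple[int, Node]]:
    """Yield (distance, node): siblings, then the parent, moving outward.

    Assumption: the ordering is illustrative, not taken from the claims.
    """
    distance, current = 1, first
    while current.parent is not None:
        for sibling in current.parent.children:
            if sibling is not current:
                yield distance, sibling
        yield distance, current.parent
        current = current.parent
        distance += 1


def nearest_context(first: Node) -> Optional[Tuple[int, Node]]:
    """Stop at the first candidate that carries text, or return None."""
    for distance, node in candidates(first):
        if node.text:
            return distance, node
    return None


def score(distance: int) -> float:
    """Hypothetical confidence score: closer context scores higher."""
    return 1.0 / distance


def build_graph(
    scored: List[Tuple[Triple, float]], threshold: float
) -> Dict[str, Set[Tuple[str, str]]]:
    """Rank triples by score and keep those at or above the threshold."""
    graph: Dict[str, Set[Tuple[str, str]]] = {}
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    for (subject, predicate, obj), confidence in ranked:
        if confidence >= threshold:
            graph.setdefault(subject, set()).add((predicate, obj))
    return graph


if __name__ == "__main__":
    page = Node("body")
    section = page.add("div")
    section.add("h2", "Price")
    value = section.add("span", "42 USD")
    found = nearest_context(value)
    if found is not None:
        distance, context = found
        triple = (context.text, "hasValue", value.text)
        print(build_graph([(triple, score(distance))], threshold=0.5))
```

The user edit-and-validate step of claim 5 and the change tracking of claim 6 are left out; in this sketch the threshold alone decides which triples reach the graph.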
Knowledge engineering; Knowledge acquisition · CPC title
Indexing; Web crawling techniques · CPC title
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title
Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD] · CPC title