Business data lake search engine

US10795895B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10795895-B1
Application number: US-201715794387-A
Country: US
Kind code: B1
Filing date: Oct 26, 2017
Priority date: Oct 26, 2017
Publication date: Oct 6, 2020
Grant date: Oct 6, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Business Data Lake searching techniques are provided. A method comprises obtaining a graph representing tables of the Business Data Lake, where each node represents one table and edges between nodes represent foreign key connections; applying a node rank algorithm to determine a relevancy score of the tables based on a number of links to/from other tables; and, in response to a query: ranking a relevancy of query items based on a term frequency-based score to generate candidate results; extracting a candidate sub-graph based on the following: a top-L tables based on the term frequency-based score, and/or a top-M tables based on a topic model distance score for the given query and candidate items; enriching the extracted candidate sub-graph by adding new tables using an item-to-item collaborative filter where a similarity between two tables is measured based on a number of interactions; and ordering the tables in the enriched sub-graph based on the relevancy score and/or a user-to-item collaborative filter that evaluates past user interactions with prior results.
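
As an illustration of the node rank step in the abstract, here is a minimal sketch that scores tables by their foreign-key links. It assumes a PageRank-style power iteration, which the abstract does not name; the table names, the `node_rank` function, and the damping factor are hypothetical choices made for this example only.

```python
from collections import defaultdict

def node_rank(edges, damping=0.85, iterations=50):
    """PageRank-style power iteration over a directed foreign-key graph.

    edges: (referencing_table, referenced_table) pairs, one per foreign key.
    Returns a dict mapping table name -> score; the scores sum to 1.
    """
    edges = list(edges)
    nodes = sorted({table for edge in edges for table in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {table: 1.0 / n for table in nodes}
    for _ in range(iterations):
        # Tables with no outgoing foreign keys spread their rank evenly.
        dangling = sum(rank[t] for t in nodes if not out_links[t])
        new_rank = {t: (1 - damping) / n + damping * dangling / n for t in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical schema: orders and invoices both reference customers.
fk_edges = [
    ("orders", "customers"),
    ("invoices", "customers"),
    ("invoices", "orders"),
]
scores = node_rank(fk_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy schema the most-referenced table, `customers`, ends up first in `ranking`, which matches the idea of scoring tables by the number of links to or from other tables.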

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining a directed graphical structure representing a plurality of tables of a Business Data Lake, where each node in the directed graphical structure represents one of said tables and edges between the nodes represent connections established by foreign keys in the tables, wherein a foreign key in a first table identifies a row of one or more of another table and the first table; applying a node rank algorithm to the directed graphical structure to determine a relevancy score of the tables based on a number of links to or from other tables; in response to a query, performing the following steps: ranking a relevancy of one or more items in the query based on a term frequency-based score to generate candidate results; extracting a candidate sub-graph from the directed graphical structure based on one or more of the following: a top-L tables based on the term frequency-based score, and a top-M tables based on a topic model distance score for the given query and items in candidate results; enriching the extracted candidate sub-graph by adding one or more tables not previously in the extracted candidate sub-graph using an item-to-item collaborative filter where a similarity value between two tables is measured based on a number of interactions with the two tables by a plurality of users of the Business Data Lake that have interacted with the two tables; and ordering the tables in the enriched extracted candidate sub-graph based on one or more of the relevancy score generated by the node rank algorithm and a user-to-item collaborative filter that evaluates past interactions of the users with prior search results.

2. The method of claim 1, wherein the extracting the candidate sub-graph from the directed graphical structure is further based on a top-N tables having a term frequency-inverse document frequency (TF-IDF) vector having a lowest cosine distance from the term frequency-inverse document frequency (TF-IDF) vector of the given query.

3. The method of claim 1, wherein the enriching the extracted candidate sub-graph by adding one or more tables not previously in the extracted candidate sub-graph further comprises adding a predefined number of additional layers of neighbor nodes in the directed graphical structure based on a foreign key relation to one or more nodes in the extracted candidate sub-graph.

4. The method of claim 1, wherein the relevancy of the one or more items in the query based on an Okapi score and the topic model distance score for the given query and items in the candidate results comprises a Kullback-Leibler divergence (KLD) distance.

5. The method of claim 1, further comprising the step of extracting the foreign keys and additional metadata from the tables to model relationships between the tables.

6. The method of claim 1, further comprising the steps of indexing past user queries and creating one or more of user profiles and table profiles related to one or more of said past user queries and user interactions with the tables.

7. The method of claim 1, wherein the step of enriching the extracted candidate sub-graph by adding one or more tables not previously in the extracted candidate sub-graph further comprises adding one or more tables to the extracted candidate sub-graph based on an item-to-item collaborative filter value between each table in the extracted candidate sub-graph and additional tables in the Business Data Lake that exceeds a predefined threshold.

8. The method of claim 1, wherein the node rank algorithm identifies one or more of popular and important tables.

9. A system, comprising: a memory; and at least one processing device, coupled to the memory, operative to implement the following steps: obtaining a directed graphical structure representing a plurality of tables of a Business Data Lake, where each node in the directed graphical structure represents one of said tables and edges between the nodes represent connections established by foreign keys in the tables, wherein a foreign key in a first table identifies a row of one or more of another table and the first table; applying a node rank algorithm to the directed graphical structure to determine a relevancy score of the tables based on a number of links to or from other tables; in response to a query, performing the following steps: ranking a relevancy of one or more items in the query based on a term frequency-based score to generate candidate results; extracting a candidate sub-graph from the directed graphical structure based on one or more of the following: a top-L tables based on the term frequency-based score, and a top-M tables based on a topic model distance score for the given query and items in candidate results; enriching the extracted candidate sub-graph by adding one or more tables not previously in the extracted candidate sub-graph using an item-to-item collaborative filter where a similarity value between two tables is measured based on a number of interactions with the two tables by a plurality of users of the Business Data Lake that have interacted with the two tables; and ordering the tables in the enriched extracted candidate sub-graph based on one or more of the relevancy score generated by the node rank algorithm and a user-to-item collaborative filter that evaluates past interactions of the users with prior search results.

10. The system of claim 9, wherein the extracting the candidate sub-graph from the directed graphical structure is further based on a top-N tables having a term frequency-inverse document frequency (TF-IDF) vector having a lowest cosine distance from the term frequency-inverse document frequency (TF-IDF) vector of the given query.

11. The system of claim 9, wherein the enriching the extracted candidate sub-graph by adding one or more tables not previously in the extracted candidate sub-graph further comprises adding a predefined number of additional layers of neighbor nodes in the directed graphical structure based on a foreign key relation to one or more nodes in the extracted candidate sub-graph.

12. The system of claim 9, wherein the relevancy of the one or more items in the query based on an Okapi score and the topic model distance score for the given query and items in the candidate results comprises a Kullback-Leibler divergence (KLD) distance.

13. The system of claim 9, further comprising the step of extracting the foreign keys and additional metadata from the tables to model relationships between the tables.

14. The system of claim 9, further comprising the steps of indexing past user queries and creating one or more of user profiles and table profiles related to one or more of said past user queries and user interactions with the tables.

15. The system of claim 9, wherein the step of enriching the extracted candidate sub-graph by adding one or more tables not previously in the extracted candidate sub-graph further comprises adding one or more tables to the extracted candidate sub-graph based on an item-to-item collaborative filter value between each table in the extracted candidate sub-graph and additional tables in the Business Data Lake that exceeds a predefined threshold.

16. The system of claim 9, wherein the node rank algorithm identifies one or more of popular and important tables.

17. A computer program product, comprising a tangible machine-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by at least one processing device perform the following steps: obtaining a dir
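
The enrichment step in claims 1 and 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: similarity is taken here to be the cosine similarity between the sets of users who interacted with each table, and the function names, the interaction log, and the 0.5 threshold are all hypothetical.

```python
from math import sqrt

def item_similarity(users_a, users_b):
    """Cosine similarity between two tables' sets of interacting users."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))

def enrich(candidates, interactions, threshold=0.5):
    """Return the candidates plus any other table similar enough to one of them.

    candidates: set of table names already in the extracted sub-graph.
    interactions: dict mapping table name -> set of user ids that used it.
    threshold: assumed cut-off; a table is added only if its best
        similarity to some candidate exceeds this value.
    """
    enriched = set(candidates)
    for table, users in interactions.items():
        if table in candidates:
            continue
        best = max(
            (item_similarity(users, interactions.get(c, set())) for c in candidates),
            default=0.0,
        )
        if best > threshold:
            enriched.add(table)
    return enriched

# Hypothetical interaction log: which users touched which tables.
interactions = {
    "orders": {"ana", "ben", "cy"},
    "shipments": {"ana", "ben"},
    "hr_reviews": {"dee"},
}
result = enrich({"orders"}, interactions)
```

Here `shipments` shares two of its two users with `orders`, so it clears the threshold and joins the sub-graph, while `hr_reviews` shares none and is left out.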

Assignees

Inventors

Classifications

  • Filtering based on additional data, e.g. user or group profiles (filtering in web context G06F16/9535, G06F16/9536)

  • Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

  • Graphs; Linked lists (G06F16/9027 takes precedence)

  • using ranking

  • Query formulation

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10795895B1 cover?
Business Data Lake searching techniques are provided. A method comprises obtaining a graph representing tables of the Business Data Lake, where each node represents one table and edges between nodes represent foreign key connections; applying a node rank algorithm to determine a relevancy score of the tables based on a number of links to/from other tables; and, in response to a query: ranking a…
Who is the assignee on this patent?
EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/9024. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 6, 2020 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).