Search engine using self-supervised learning and predictive models for searches based on partial information

US12346340B2 · US · B2

Patent metadata
Publication number: US-12346340-B2
Application number: US-202217871843-A
Country: US
Kind code: B2
Filing date: Jul 22, 2022
Priority date: Mar 29, 2016
Publication date: Jul 1, 2025
Grant date: Jul 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A search engine responding to a user query to find relevant data assets in a federation business data lake (FBDL) system by monitoring and recording interactions of known users interacting with data assets in the FBDL system. Predicted data usage for unknown or new users is derived by training a generative model that uses reconstructive self-supervised learning (SSL) techniques to generate possible values for missing data usage features of the unknown users. The predicted usage is then used to generate similarity scores that are combined for those of the known users to help inform the search engine processing to return relevant results to a target user.
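
The combination step described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the patented method: it assumes cosine similarity over per-asset interaction counts and a similarity-weighted sum as the relevance score, neither of which the abstract specifies. All function names, user names, asset names, and counts are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (dict: asset -> count)."""
    dot = sum(count * b.get(asset, 0.0) for asset, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a user with no recorded usage is similar to nobody
    return dot / (norm_a * norm_b)


def relevance_scores(target_usage, other_usage):
    """Score each asset by the similarity-weighted usage of the other users.

    other_usage maps a user name to that user's usage vector. For an unknown
    user the vector would be the *predicted* usage (assumed to be given here).
    """
    scores = {}
    for usage in other_usage.values():
        weight = cosine_similarity(target_usage, usage)
        for asset, count in usage.items():
            scores[asset] = scores.get(asset, 0.0) + weight * count
    return scores


# Hypothetical interaction counts.
target = {"sales_db": 4, "hr_files": 1}
others = {
    "known_user": {"sales_db": 5, "forecast_table": 3},
    "unknown_user_predicted": {"hr_files": 2, "org_chart": 1},
}

scores = relevance_scores(target, others)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this sketch the known user and the unknown user contribute through the same weighted sum, so a predicted usage vector affects ranking only as much as it resembles the target user's own usage.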

First claim

Opening claim text (preview).

What is claimed is:

1. A server computer-implemented method of processing queries input to a data retrieval system storing data assets for users in an enterprise, comprising: storing, in a federation business data lake (FBDL) storage maintained for a large-scale data processing system, data assets retrievable by a user; providing a search engine for entry of queries by users looking for data in the FBDL; monitoring and recording, by a monitoring component of the server, all interactions of a plurality of known users, including a first user and a target user, each interaction comprising an activity that triggers a read/write cycle to the FBDL storage; first deriving a similarity of each of the plurality of known users to the target user based on respective past and current data retrieval patterns of each of known users for data queried in the search engine; identifying an unknown user for whom there are no known interactions with the plurality of known users or the data assets to constitute missing features; generating a graph for the unknown user representing data asset interactions for the unknown user; training a generative model that uses reconstructive self-supervised learning (SSL) techniques for the graph to generate possible values for the missing features; second deriving a similarity of the unknown user to the target user based on the trained model; and returning a result to a query input to the search engine by the target user based on the similarity of the known users to the target user and the similarity of the unknown user to the target user.

2. The method of claim 1 wherein the graph comprises nodes that constitute actions, assets and actors of the interactions, and wherein an actor has an edge to all actions, actions have edges between them in sequence, and each action also has an edge to the data assets it affects.

3. The method of claim 2 wherein the actor comprises the unknown user.

4. The method of claim 2 wherein the training step comprises taking the graph and masking the user data access feature that is to be generated later; letting the generative model learn to reconstruct this feature; and replacing the masked user data access feature with the reconstructed feature.

5. The method of claim 4 further comprising, for the known users: measuring a number of interactions of the first user with the data assets; calculating an average number of assets accessed by the first user; receiving the query in the search engine from the target user to access a desired data asset; calculating an amount of interaction of the target user with the data assets based on their respective number of interactions; and comparing a similarity of the first user to the target user based on respective past and current data retrieval patterns of each of the first user and target user for data queried in the search engine.

6. The method of claim 5 further comprising factoring a user profile into the similarity by: building a respective user profile of each of the first user and the target user based on their respective organizational roles, informal social associations, gender, and age; and calculating a similarity ranking between the target user and the first user based on the compared similarity and the respective profiles of the target user and the first user.

7. The method of claim 6 further comprising calculating a relevance score of the desired data item relative to other data items for the target user based on the amount of interaction of the first user and a weighted sum quantifying the similarity ranking between the first user and the target user to identify one or more relevant data assets responsive to the query input by the target user.

8. The method of claim 7 wherein the relevance score represents a predicted relevance that comprises the target user's past interactions with the data assets and the cumulative interactions of other users including the first user with the data assets, such that if one or more of the other users has similar interaction behavior to the target user, then knowledge of the one or more other users can impact the relevance of the information with regard to one or more new data assets predicted to be useful to the target user.

9. The method of claim 8 wherein the relevance score is modified by the similarity of the unknown user to the target user and the similarity of the first user to the target user.

10. The method of claim 1 wherein the user interaction of the known users with the data processing system, and predicted interaction of the unknown user comprises querying data, making data requests, applying parsers, and running analytics on data elements making up the data assets.

11. The method of claim 10 wherein the data processing system is maintained by a large scale enterprise, and wherein the data assets comprise Big Data-scale data sets, and wherein the data assets comprise databases, stacks of databases, file systems, and enterprise services, and wherein the data assets are accessed through a Hadoop layer storing open source software components to control storing, processing, and analyzing the data.

12. A method of processing queries input to a data retrieval system storing data assets for users in an enterprise, comprising: storing, in a federation business data lake (FBDL) storage maintained for a large-scale data processing system, data assets retrievable by users and related to products searched for by the users; deriving a similarity of a target user to one or more known users based on respective past and current data retrieval patterns of each of known users for the products searched for by the users in a search engine query, and respective user profiles for each of the target user and one or more known users; receiving input for an unknown user comprising only partial features for either search activity or profile, for processing using a generative model and resulting in missing features; training a generative model using reconstructive self-supervised learning (SSL) techniques to generate possible values for the missing features; predicting, using the trained generative model a likely pattern of search engine access of the unknown user with the products; calculating, from a set of user interaction counts, a predicted similarity of the unknown user to the one or more known users based on the predicted likely pattern of search engine access; and returning a result to a query input to the search engine by the target user based on the similarity of a target user to one or more known users and the predicted similarity of the unknown user to the one or more known users.

13. The method of claim 12 further comprising: generating a graph for the unknown user representing data asset interactions for the unknown user.

14. The method of claim 13 wherein the training step comprises: taking the graph and masking the user data access feature that is to be generated later; letting the generative model learn to reconstruct this feature; and replacing the masked user data access feature with the reconstructed feature.

15. The method of claim 12 wherein the predicted interaction of the unknown user comprises querying data, making data requests, applying parsers, and running analytics on data elements making up the data assets.

16. The method of claim 15 wherein the data assets are maintained in a processing system maintained by a large scale enterprise, and wherein the data assets comprise Big Data-scale data sets, and wherein the data assets comprise databases, stacks of databases, file systems, and enterprise servic
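
The graph layout of claim 2 and the mask, reconstruct, replace sequence of claims 4 and 14 can be illustrated with a short Python sketch. It rests on loud assumptions: the claims do not specify a data structure or a model, so the edge list, the feature dictionaries, and every name below are hypothetical, and a simple mean over known users' values stands in for the trained generative model purely as a placeholder.

```python
from statistics import mean

MASK = None  # sentinel marking a feature that must be reconstructed


def build_interaction_graph(actor, interactions):
    """Build an edge list with the structure described in claim 2.

    interactions: ordered list of (action_id, [asset, ...]) tuples.
    """
    edges = []
    previous_action = None
    for action, assets in interactions:
        edges.append((actor, action))  # the actor has an edge to every action
        if previous_action is not None:
            edges.append((previous_action, action))  # actions linked in sequence
        for asset in assets:
            edges.append((action, asset))  # action linked to each asset it affects
        previous_action = action
    return edges


def reconstruct_masked(features, reference_rows):
    """Replace each masked feature with a reconstructed value.

    Placeholder only: the reconstruction is the mean of the same feature
    across reference_rows, NOT a trained generative model.
    """
    rebuilt = dict(features)
    for name, value in features.items():
        if value is MASK:
            observed = [
                row[name] for row in reference_rows if row.get(name) is not MASK
            ]
            rebuilt[name] = mean(observed) if observed else 0.0
    return rebuilt


# Hypothetical interactions for an unknown user.
edges = build_interaction_graph(
    "unknown_user",
    [("query_1", ["sales_db"]), ("parse_2", ["sales_db", "logs_fs"])],
)

# Hypothetical usage features; the unknown user's read rate is masked.
known_rows = [
    {"reads_per_day": 10, "assets_touched": 4},
    {"reads_per_day": 20, "assets_touched": 6},
]
unknown_row = {"reads_per_day": MASK, "assets_touched": 5}
filled = reconstruct_masked(unknown_row, known_rows)
print(edges)
print(filled)
```

A learned model would replace `reconstruct_masked`; the surrounding flow (mask a feature, reconstruct it, write the reconstructed value back) is the part that mirrors the claim language.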

Assignees

Dell Products LP, EMC IP Holding Co LLC

Inventors

Classifications

  • Search customisation based on user profiles and personalisation · CPC title

  • Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title

  • Search customisation based on social or collaborative filtering · CPC title

  • G06F16/256 · Primary

    in federated or virtual databases · CPC title

  • G06F16/254 · Primary

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Dell Products LP, EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/256. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 1, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).