Data facet generation and recommendation
US-2023401457-A1 · Dec 14, 2023 · US
US12450003B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12450003-B2 |
| Application number | US-202217821513-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 23, 2022 |
| Priority date | Aug 23, 2022 |
| Publication date | Oct 21, 2025 |
| Grant date | Oct 21, 2025 |
Official abstract text for this publication.
Examples described herein relate to preparing datasets in a storage device for machine learning (ML) applications. Examples include maintaining ML facet mappings between ML facets and dataset preparation tags, deriving ML facets of a dataset stored in the storage device, and generating filtered datasets from the datasets using the ML facets and ML facet mappings. The filtered dataset is associated with improved dataset quality compared to the unfiltered dataset. The storage device transmits the filtered dataset to ML applications requesting the dataset. Some examples include recommending, by the storage device, ML facets to the ML application based on performance metrics.
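The flow in the abstract (facet mappings, facet derivation, filtering, and a quality comparison before transmission) can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the mapping tables, the `drop_incomplete_rows` operation, the `derive_facets` heuristic, and the toy `quality_score` (fraction of non-null cells) are hypothetical and are not taken from the patent, which derives its quality scores from dataset, storage, and application performance metrics.

```python
from typing import Callable, Optional

Row = dict[str, Optional[float]]  # one sample; None marks a missing value

# Hypothetical ML facet -> dataset preparation tag mapping (the "repository").
FACET_TO_TAG: dict[str, str] = {"missing values": "drop_incomplete_rows"}


def drop_incomplete_rows(rows: list[Row]) -> list[Row]:
    """Preparation operation: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


# Hypothetical preparation tag -> operation lookup.
TAG_TO_OPERATION: dict[str, Callable[[list[Row]], list[Row]]] = {
    "drop_incomplete_rows": drop_incomplete_rows,
}


def derive_facets(rows: list[Row]) -> list[str]:
    """Toy facet detection: report 'missing values' if any cell is None."""
    has_missing = any(v is None for r in rows for v in r.values())
    return ["missing values"] if has_missing else []


def quality_score(rows: list[Row]) -> float:
    """Toy stand-in for 'amount of relevant information': non-null cell ratio."""
    cells = [v for r in rows for v in r.values()]
    if not cells:
        return 0.0
    return sum(v is not None for v in cells) / len(cells)


def dataset_to_transmit(rows: list[Row]) -> list[Row]:
    """Return the filtered dataset only if it scores higher than the original."""
    first_score = quality_score(rows)
    filtered = rows
    for facet in derive_facets(rows):
        operation = TAG_TO_OPERATION[FACET_TO_TAG[facet]]
        filtered = operation(filtered)
    second_score = quality_score(filtered)
    return filtered if second_score > first_score else rows


sample = [{"x": 1.0, "y": 2.0}, {"x": None, "y": 3.0}, {"x": 4.0, "y": 5.0}]
result = dataset_to_transmit(sample)
```

In this sketch the incomplete row is dropped because the filtered copy scores higher; a dataset with no detected facets is returned unchanged, since its second score would not exceed the first.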
Opening claim text (preview).
What is claimed is:

1. A storage device comprising: a processing resource; and a non-transitory machine-readable storage medium comprising instructions executable by the processing resource to: store machine learning (ML) facet mappings between ML facets and dataset preparation tags in a repository, wherein the ML facets are properties of datasets or ML models for optimizing quality of the datasets; identify a ML facet of a dataset stored in the storage device; determine, based on at least one of dataset metrics of the dataset, storage performance metrics of the storage device, and application performance metrics, a first quality score for the dataset, wherein the first quality score indicates an amount of relevant information in the dataset; identify a dataset preparation tag mapped to the identified ML facet as indicated in the ML facet mappings; generate a filtered dataset from the dataset based on the dataset preparation tag and determine, based on at least one of dataset metrics of the filtered dataset, the storage performance metrics of the storage device, and the application performance metrics, a second quality score that indicates an amount of relevant information in the filtered dataset; and in response to a request for the dataset from an ML application and determining that the second quality score is greater than the first quality score, transmit the filtered dataset to the ML application across a bandwidth-limited communication link.

2. The storage device of claim 1, wherein to identify the ML facet, the processing resource executes one or more of the instructions to: input the dataset to analytics workflow, wherein the analytics workflow determines the ML facet of the dataset and a dataset portion associated with the ML facet.

3. The storage device of claim 2, wherein to generate the filtered dataset, the processing resource executes one or more of the instructions to: Identify a dataset preparation operation indicated in the dataset preparation tag; and prepare the dataset based on the dataset preparation operation and the dataset portion.

4. The storage device of claim 1, further comprising: an ML facets store to store ML facets of each dataset in the storage device and an identifier of the respective dataset.

5. The storage device of claim 1, wherein the processing resource executes one or more of the instructions to: store an ML facet mapping between ML facets, application type, and dataset type.

6. The storage device of claim 5, wherein the processing resource executes one or more of the instructions to: in response to receiving the request for the dataset, recommend one or more of the ML facets to the ML application for selection based on the mapping between the ML facets, the application type, and the dataset type.

7. The storage device of claim 6, wherein to recommend the ML facets, the processing resource executes one or more of the instructions to: identify one or more of the ML facets based on the dataset type of the dataset and the application type of the ML application; and transmit one or more the ML facets to the ML application as a recommendation.

8. The storage device of claim 7, further comprising a user interface to: present one or more of the ML facets to the ML application for selection.

9. The storage device of claim 1, wherein the processing resource executes one or more of the instructions to: receive, from a test application, the application performance metrics, wherein the application performance metrics include one or more of time-to-insights, accuracy, precision, or recall; and determine the storage performance metrics, the dataset metrics of the dataset, and the dataset metrics of the filtered dataset, wherein: the storage performance metrics include one or more of samples per IO operation or throughput; and the dataset metrics of the dataset and the dataset metrics of the filtered dataset include at least a dataset size.

10. The storage device of claim 9, wherein the processing resource executes one or more of the instructions to: determine a rank for each of the ML facets based on the storage performance metrics, the dataset metrics, and the application performance metrics; and recommend the ML facets to the ML application based on the rank.

11. The storage device of claim 1, wherein the processing resource executes one or more of the instructions to: store the filtered dataset in persistent storage of the storage device; create a volume containing the filtered dataset; and display the volume to the ML application.

12. The storage device of claim 1, wherein the ML facets include one or more of correlated features, non-correlated features, hyperparameters, bias, seasonality, balanced dataset, mean, quadrant, private data, variance, missing values, data completeness, anomalous dataset, quantization, high frequency filtering, and null datasets.

13. A method comprising: storing, by a storage device, machine learning (ML) facet mappings between ML facets and dataset preparation tags in a repository, wherein the ML facets are properties of datasets or ML models for optimizing quality of the datasets; identifying, by the storage device, one or more ML facets of a dataset stored in the storage device; determining, based on at least one of dataset metrics of the dataset, storage performance metrics of the storage device, and application performance metrics, a first quality score for the dataset, wherein the first quality score indicates an amount of relevant information in the dataset; receiving, by the storage device, a request for the dataset from an ML application executing on a computing device; recommending, by the storage device, the one or more ML facets to the ML application for selection; generating, by the storage device, a filtered dataset from the dataset based on dataset preparation tags mapped to the selected ML facets and determining, based on at least one of dataset metrics of the filtered dataset, the storage performance metrics of the storage device, and the application performance metrics, a second quality score that indicates an amount of relevant information in the filtered dataset; and in response to determining that the second quality score is greater than the first quality score, transmitting, by the storage device, the filtered dataset to the ML application across a bandwidth-limited communication link.

14. The method of claim 13, further comprising: in response to generating the filtered dataset, applying, by the storage device, a dataset management policy for the filtered dataset based on the ML facets, wherein the dataset management policy includes rules to perform one or more of data protection, data backup, or data tiering.

15. The method of claim 14, wherein: the application performance metrics include one or more of time-to-insights, accuracy, precision, or recall; the storage performance metrics include one or more of samples per IO operation and throughput; and the dataset metrics of the dataset and the dataset metrics of the filtered dataset include at least a dataset size.

16. The method of claim 15, further comprising: based on the quality scores, storing the filtered dataset in a first storage component and the dataset in a second storage component, wherein the first storage component allows faster data retrieval.

17. The method of claim 14, further comprising: in response to determining that the ML facets include sensitive data, apply the dataset management policy to encrypt the filtered dataset.

18. The method of
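Claims 9 and 10 describe ranking ML facets from storage performance metrics, dataset metrics, and application performance metrics, then recommending facets by rank. The claims do not state a ranking formula, so the following Python sketch is purely illustrative: the `FacetMetrics` record, the `rank_facets` function, the choice of one metric per category, and the 0.5/0.3/0.2 weights are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetMetrics:
    """Hypothetical per-facet measurements for a filtered dataset."""
    facet: str
    accuracy: float          # application performance metric, in [0, 1]
    throughput_mb_s: float   # storage performance metric
    dataset_size_mb: float   # dataset metric of the filtered dataset


def rank_facets(metrics: list[FacetMetrics], top_n: int = 3) -> list[str]:
    """Rank facets by an assumed weighted score and return the top names."""
    if not metrics:
        return []
    # Normalize against the largest observed value; guard against zero.
    max_throughput = max(m.throughput_mb_s for m in metrics) or 1.0
    max_size = max(m.dataset_size_mb for m in metrics) or 1.0

    def score(m: FacetMetrics) -> float:
        # Assumed weights: favor accuracy, then throughput, then smaller size.
        return (
            0.5 * m.accuracy
            + 0.3 * (m.throughput_mb_s / max_throughput)
            + 0.2 * (1.0 - m.dataset_size_mb / max_size)
        )

    ranked = sorted(metrics, key=score, reverse=True)
    return [m.facet for m in ranked[:top_n]]


# Facet names come from claim 12; the numbers are made up for illustration.
measurements = [
    FacetMetrics("missing values", 0.90, 400.0, 800.0),
    FacetMetrics("correlated features", 0.92, 500.0, 600.0),
    FacetMetrics("private data", 0.85, 300.0, 1000.0),
]
recommended = rank_facets(measurements, top_n=2)
```

A smaller filtered dataset scores higher here on the assumption that less data crosses the bandwidth-limited link named in claims 1 and 13; a real system could weight the metrics quite differently.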
Improving or facilitating administration, e.g. storage management · CPC title
Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP] · CPC title
Machine learning · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
In-line storage system · CPC title