What technology area does this patent fall under?

Primary CPC classification: G06Q30/04 (Billing or invoicing). Mapped technology areas include Physics.

When was this patent published?

Publication date: Sep 24, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Archived data crawling

US12099557B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12099557-B2
Application number	US-202117464731-A
Country	US
Kind code	B2
Filing date	Sep 2, 2021
Priority date	Sep 2, 2021
Publication date	Sep 24, 2024
Grant date	Sep 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described is a content mining system comprising a crawler configured to retrieve a plurality of files from a data storage system. The content mining system further comprises a plurality of converters configured to extract data from the plurality of files retrieved by the crawler from the data storage system, where each of the plurality of converters is configured to process a respective type of data. The content mining system further comprises a plurality of queues interposed between the crawler and the plurality of converters, where each queue is associated with a single converter.
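The arrangement in the abstract (one crawler, one queue per converter, one converter per data type) can be illustrated with a minimal sketch. This is not the patented implementation: the converter functions, the `run_pipeline` name, the `(name, file_type, text)` tuple shape, and the use of threads are all assumptions made for illustration, and an in-memory list stands in for the data storage system.

```python
import json
import queue
import threading

# Hypothetical converters: each one handles exactly one type of data.
def convert_csv(text):
    rows = [line.split(",") for line in text.strip().splitlines()]
    return {"rows": rows}

def convert_json(text):
    return {"object": json.loads(text)}

CONVERTERS = {"csv": convert_csv, "json": convert_json}
_STOP = object()  # sentinel that tells a converter worker to exit

def run_pipeline(files, queue_size=2):
    """files: iterable of (name, file_type, text). Returns {name: extracted data}."""
    # One bounded queue per converter, sitting between crawler and converter.
    queues = {t: queue.Queue(maxsize=queue_size) for t in CONVERTERS}
    results = {}
    lock = threading.Lock()

    def worker(file_type):
        convert = CONVERTERS[file_type]
        q = queues[file_type]
        while True:
            item = q.get()
            if item is _STOP:
                break
            name, text = item
            extracted = convert(text)
            with lock:
                results[name] = extracted

    threads = [threading.Thread(target=worker, args=(t,)) for t in CONVERTERS]
    for th in threads:
        th.start()

    # The "crawler": route each file to the queue of its converter.
    # put() blocks while a queue is full, so a slow converter throttles the crawl.
    for name, file_type, text in files:
        queues[file_type].put((name, text))

    # Signal every worker to stop once its queue has drained.
    for q in queues.values():
        q.put(_STOP)
    for th in threads:
        th.join()
    return results
```

Calling `run_pipeline([("a.csv", "csv", "x,y\n1,2\n"), ("b.json", "json", '{"k": 1}')])` returns one extracted record per file name. A file type with no registered converter raises `KeyError` in this sketch; a real system would need a policy for that case.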

First claim

Opening claim text (preview).

What is claimed is:
1. A content mining system comprising: a crawler configured to retrieve a plurality of files from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; a plurality of converters configured to extract data from the plurality of files retrieved by the crawler from the tape drive storage system, wherein each of the plurality of converters is configured to process a respective type of data; and a plurality of queues interposed between the crawler and the plurality of converters, wherein each queue is associated with a single converter.
2. The content mining system of claim 1, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.
3. The content mining system of claim 2, wherein queue sizes are inversely related to converter processing speeds.
4. The content mining system of claim 2, wherein a first queue corresponding to a first converter with a first processing speed for a first file type is sized to store a first number of files of the first file type, wherein a second queue corresponding to a second converter with a second processing speed for a second file type is sized to store a second number of files of the second file type, wherein the first processing speed is greater than the second processing speed, and wherein the first number of files is less than the second number of files.
5. The content mining system of claim 1, wherein the plurality of converters include at least a Comma Separated Value (CSV) converter, a JavaScript® Object Notation (JSON) converter, a HyperText Markup Language (HTML) converter, and a Smart Document Understanding (SDU) converter.
6. A computer-implemented method comprising: aggregating, by a crawler, a plurality of files with a plurality of file types from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; providing the plurality of files to a plurality of queues interposed between the crawler and a plurality of converters, wherein files with a respective file type are sent to respective queues corresponding to a converter configured to process the files with the respective file type; and extracting data from the plurality of files by the plurality of converters, wherein the plurality of converters incrementally process the plurality of files using the plurality of queues.
7. The method of claim 6, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.
8. The method of claim 7, wherein queue sizes are inversely related to converter processing speeds.
9. The method of claim 7, wherein a first queue corresponding to a first converter with a first processing speed for a first file type is sized to store a first number of files of the first file type, wherein a second queue corresponding to a second converter with a second processing speed for a second file type is sized to store a second number of files of the second file type, wherein the first processing speed is greater than the second processing speed, and wherein the first number of files is less than the second number of files.
10. The method of claim 6, wherein the plurality of converters include at least a Comma Separated Value (CSV) converter, a JavaScript® Object Notation (JSON) converter, and a HyperText Markup Language (HTML) converter.
11. The method of claim 6, wherein the method is performed by one or more computers according to software that is downloaded to the one or more computers from a remote data processing system.
12. The method of claim 11, wherein the method further comprises: metering a usage of the software; and generating an invoice based on metering the usage.
13. A computer program product comprising one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising instructions configured to cause one or more processors to perform a method comprising: aggregating, by a crawler, a plurality of files with a plurality of file types from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; providing the plurality of files to a plurality of queues interposed between the crawler and a plurality of converters, wherein files with a respective file type are sent to respective queues corresponding to a converter configured to process the files with the respective file type; and extracting data from the plurality of files by the plurality of converters, wherein the plurality of converters incrementally process the plurality of files using the plurality of queues.
14. The computer program product of claim 13, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.
15. The computer program product of claim 14, wherein queue sizes are inversely related to converter processing speeds.
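Two ideas in the claims lend themselves to a short illustration: visiting tape files in order of their starting offsets, and giving faster converters smaller queues. The sketch below is a hedged reading of those claims, not the patent's method. The catalog entry shape (`name`, `offset`), the function names, and the `budget / speed` sizing rule are assumptions; the claims only require that queue size be inversely related to converter speed and do not state a formula.

```python
import math

def order_by_offset(catalog):
    """Sort tape catalog entries by starting offset.

    Reading in ascending offset order lets a sequential medium be
    traversed in a single forward pass instead of seeking back and forth.
    """
    return sorted(catalog, key=lambda entry: entry["offset"])

def size_queues(speeds, budget=100.0, minimum=1):
    """Assign each converter a queue capacity inversely related to its speed.

    speeds: {file_type: files processed per second} (hypothetical figures).
    budget: tuning constant, an assumption of this sketch.
    """
    return {
        file_type: max(minimum, math.ceil(budget / speed))
        for file_type, speed in speeds.items()
    }

catalog = [
    {"name": "b.json", "offset": 4096},
    {"name": "a.csv", "offset": 0},
    {"name": "c.html", "offset": 1024},
]
ordered_names = [entry["name"] for entry in order_by_offset(catalog)]

# The faster converter ("csv") ends up with the smaller queue, as in claim 4.
sizes = size_queues({"csv": 50.0, "sdu": 2.0})
```

With these inputs the slower converter receives the larger queue, so files can accumulate in front of it while the crawler keeps moving forward along the tape.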

Assignees

IBM

Inventors

Classifications

G06F16/2465
Query processing support for facilitating data mining operations in structured databases · CPC title
G06F16/116
Details of conversion of file system types or formats · CPC title
G06Q30/04 (Primary)
Billing or invoicing · CPC title
G06F2216/03
Data mining · CPC title
G06F16/113
Details of archiving (lifecycle management in storage systems G06F3/0649; point-in-time backing up or restoration of persistent data G06F11/1446) · CPC title

Patent family

Related publications grouped by family.

View patent family 85386672

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12099557B2 cover?: Described is a content mining system comprising a crawler configured to retrieve a plurality of files from a data storage system. The content mining system further comprises a plurality of converters configured to extract data from the plurality of files retrieved by the crawler from the data storage system, where each of the plurality of converters is configured to process a respective type of…
Who is the assignee on this patent?: IBM