Archived data crawling

US12099557B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12099557-B2
Application number  US-202117464731-A
Country             US
Kind code           B2
Filing date         Sep 2, 2021
Priority date       Sep 2, 2021
Publication date    Sep 24, 2024
Grant date          Sep 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described is a content mining system comprising a crawler configured to retrieve a plurality of files from a data storage system. The content mining system further comprises a plurality of converters configured to extract data from the plurality of files retrieved by the crawler from the data storage system, where each of the plurality of converters is configured to process a respective type of data. The content mining system further comprises a plurality of queues interposed between the crawler and the plurality of converters, where each queue is associated with a single converter.
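The arrangement in the abstract (one crawler, one queue per converter, one converter per data type) can be illustrated with a minimal sketch. This is not the patented implementation: the converter functions, the `CONVERTERS` table, `run_pipeline`, the routing by file extension, and the use of threads are all assumptions made for illustration, and the "crawler" is just an iterable of `(name, text)` pairs.

```python
import json
import queue
import threading
from pathlib import PurePath


# Hypothetical converters: each one handles exactly one type of data.
def convert_csv(payload):
    return [row.split(",") for row in payload.splitlines()]


def convert_json(payload):
    return json.loads(payload)


# Assumption: the file type is derived from the file extension.
CONVERTERS = {".csv": convert_csv, ".json": convert_json}


def run_pipeline(files, queue_size=4):
    """Route crawled files through per-converter queues.

    files: iterable of (name, text) pairs standing in for crawler output.
    """
    # One bounded queue per converter, sitting between crawler and converter.
    queues = {ext: queue.Queue(maxsize=queue_size) for ext in CONVERTERS}
    results = {}
    lock = threading.Lock()

    def worker(ext):
        convert = CONVERTERS[ext]
        while True:
            item = queues[ext].get()
            if item is None:  # sentinel: the crawler has finished
                break
            name, payload = item
            data = convert(payload)
            with lock:
                results[name] = data

    threads = [threading.Thread(target=worker, args=(ext,)) for ext in CONVERTERS]
    for t in threads:
        t.start()

    # Crawler side: put() blocks while a queue is full, so a slow
    # converter holds back the crawler instead of exhausting memory.
    for name, payload in files:
        ext = PurePath(name).suffix.lower()
        if ext in queues:  # files with no matching converter are skipped
            queues[ext].put((name, payload))

    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
    return results
```

Because each queue feeds a single converter, a backlog of one file type does not stall converters for other types until the crawler itself blocks on the full queue.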

First claim

Opening claim text (preview).

What is claimed is:

1. A content mining system comprising: a crawler configured to retrieve a plurality of files from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; a plurality of converters configured to extract data from the plurality of files retrieved by the crawler from the tape drive storage system, wherein each of the plurality of converters is configured to process a respective type of data; and a plurality of queues interposed between the crawler and the plurality of converters, wherein each queue is associated with a single converter.

2. The content mining system of claim 1, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.

3. The content mining system of claim 2, wherein queue sizes are inversely related to converter processing speeds.

4. The content mining system of claim 2, wherein a first queue corresponding to a first converter with a first processing speed for a first file type is sized to store a first number of files of the first file type, wherein a second queue corresponding to a second converter with a second processing speed for a second file type is sized to store a second number of files of the second file type, wherein the first processing speed is greater than the second processing speed, and wherein the first number of files is less than the second number of files.

5. The content mining system of claim 1, wherein the plurality of converters include at least a Comma Separated Value (CSV) converter, a JavaScript® Object Notation (JSON) converter, a HyperText Markup Language (HTML) converter, and a Smart Document Understanding (SDU) converter.

6. A computer-implemented method comprising: aggregating, by a crawler, a plurality of files with a plurality of file types from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; providing the plurality of files to a plurality of queues interposed between the crawler and a plurality of converters, wherein files with a respective file type are sent to respective queues corresponding to a converter configured to process the files with the respective file type; and extracting data from the plurality of files by the plurality of converters, wherein the plurality of converters incrementally process the plurality of files using the plurality of queues.

7. The method of claim 6, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.

8. The method of claim 7, wherein queue sizes are inversely related to converter processing speeds.

9. The method of claim 7, wherein a first queue corresponding to a first converter with a first processing speed for a first file type is sized to store a first number of files of the first file type, wherein a second queue corresponding to a second converter with a second processing speed for a second file type is sized to store a second number of files of the second file type, wherein the first processing speed is greater than the second processing speed, and wherein the first number of files is less than the second number of files.

10. The method of claim 6, wherein the plurality of converters include at least a Comma Separated Value (CSV) converter, a JavaScript® Object Notation (JSON) converter, and a HyperText Markup Language (HTML) converter.

11. The method of claim 6, wherein the method is performed by one or more computers according to software that is downloaded to the one or more computers from a remote data processing system.

12. The method of claim 11, wherein the method further comprises: metering a usage of the software; and generating an invoice based on metering the usage.

13. A computer program product comprising one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising instructions configured to cause one or more processors to perform a method comprising: aggregating, by a crawler, a plurality of files with a plurality of file types from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; providing the plurality of files to a plurality of queues interposed between the crawler and a plurality of converters, wherein files with a respective file type are sent to respective queues corresponding to a converter configured to process the files with the respective file type; and extracting data from the plurality of files by the plurality of converters, wherein the plurality of converters incrementally process the plurality of files using the plurality of queues.

14. The computer program product of claim 13, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.

15. The computer program product of claim 14, wherein queue sizes are inversely related to converter processing speeds.
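Two ideas in the claims lend themselves to a short sketch: visiting files in order of their starting offsets on tape, and sizing each queue inversely to its converter's processing speed, so the faster converter gets the smaller queue. The code below is only an illustration under stated assumptions. `TapeFile`, `start_offset`, `iterate_by_offset`, `size_queues`, and the `scale` constant are hypothetical names, and the `scale / speed` formula is just one simple way to obtain an inverse relationship; the claims do not prescribe a formula.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TapeFile:
    name: str
    start_offset: int  # hypothetical: position of the file's first block on tape


def iterate_by_offset(files):
    """Return files in ascending start-offset order.

    Tape is a sequential-access medium, so reading in offset order lets
    the drive move forward steadily instead of seeking back and forth.
    """
    return sorted(files, key=lambda f: f.start_offset)


def size_queues(files_per_second, scale=20.0):
    """Pick a queue capacity for each converter.

    files_per_second: mapping of file type to converter speed (must be > 0).
    Capacity is scale / speed, rounded up, with a minimum of 1, so a
    faster converter is given a smaller queue than a slower one.
    """
    return {
        ftype: max(1, math.ceil(scale / speed))
        for ftype, speed in files_per_second.items()
    }
```

For example, with an assumed CSV converter handling 10 files per second and an assumed SDU converter handling 2 files per second, `size_queues({"csv": 10.0, "sdu": 2.0})` gives the CSV queue room for 2 files and the SDU queue room for 10, which matches the ordering described in claims 4 and 9.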

Assignees

IBM

Inventors

Classifications

  • Query processing support for facilitating data mining operations in structured databases · CPC title

  • Details of conversion of file system types or formats · CPC title

  • G06Q30/04 · Primary

    Billing or invoicing · CPC title

  • Data mining · CPC title

  • Details of archiving (lifecycle management in storage systems G06F3/0649; point-in-time backing up or restoration of persistent data G06F11/1446) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06Q30/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 24, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).