Keyword based data crawling
US-2017351763-A1 · Dec 7, 2017 · US
US12099557B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12099557-B2 |
| Application number | US-202117464731-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 2, 2021 |
| Priority date | Sep 2, 2021 |
| Publication date | Sep 24, 2024 |
| Grant date | Sep 24, 2024 |
Official abstract text for this publication.
Described is a content mining system comprising a crawler configured to retrieve a plurality of files from a data storage system. The content mining system further comprises a plurality of converters configured to extract data from the plurality of files retrieved by the crawler from the data storage system, where each of the plurality of converters is configured to process a respective type of data. The content mining system further comprises a plurality of queues interposed between the crawler and the plurality of converters, where each queue is associated with a single converter.
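The arrangement in the abstract (one crawler, one queue per converter, one converter per data type) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`run_pipeline`, `convert_csv`, `convert_json`), the routing by file extension, the use of threads, and the fixed queue capacity of 4 are all assumptions made for the example.

```python
import json
import queue
import threading
from pathlib import PurePath

# Hypothetical converters; each one handles exactly one type of data.
def convert_csv(name, payload):
    return ("csv", name, payload.split(","))

def convert_json(name, payload):
    return ("json", name, json.loads(payload))

# Assumption: file type is inferred from the file extension.
CONVERTERS = {".csv": convert_csv, ".json": convert_json}
_DONE = object()  # sentinel that tells a converter thread to stop


def run_pipeline(files, capacity=4):
    """Route crawled (name, text) pairs to one bounded queue per converter."""
    queues = {ext: queue.Queue(maxsize=capacity) for ext in CONVERTERS}
    results, lock = [], threading.Lock()

    def converter_worker(ext):
        convert = CONVERTERS[ext]
        while True:
            item = queues[ext].get()
            if item is _DONE:
                break
            extracted = convert(*item)
            with lock:
                results.append(extracted)

    workers = [threading.Thread(target=converter_worker, args=(ext,))
               for ext in CONVERTERS]
    for w in workers:
        w.start()

    # The "crawler": put() blocks while a queue is full, so a slow
    # converter throttles retrieval for its own file type only.
    for name, payload in files:
        ext = PurePath(name).suffix.lower()
        if ext in queues:  # files with no matching converter are skipped
            queues[ext].put((name, payload))

    for q in queues.values():
        q.put(_DONE)
    for w in workers:
        w.join()
    return results
```

Calling `run_pipeline([("a.csv", "1,2,3"), ("b.json", '{"k": 1}')])` returns one extracted record per file; the order of records depends on thread scheduling.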
Opening claim text (preview).
What is claimed is:
1. A content mining system comprising: a crawler configured to retrieve a plurality of files from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; a plurality of converters configured to extract data from the plurality of files retrieved by the crawler from the tape drive storage system, wherein each of the plurality of converters is configured to process a respective type of data; and a plurality of queues interposed between the crawler and the plurality of converters, wherein each queue is associated with a single converter.
2. The content mining system of claim 1, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.
3. The content mining system of claim 2, wherein queue sizes are inversely related to converter processing speeds.
4. The content mining system of claim 2, wherein a first queue corresponding to a first converter with a first processing speed for a first file type is sized to store a first number of files of the first file type, wherein a second queue corresponding to a second converter with a second processing speed for a second file type is sized to store a second number of files of the second file type, wherein the first processing speed is greater than the second processing speed, and wherein the first number of files is less than the second number of files.
5. The content mining system of claim 1, wherein the plurality of converters include at least a Comma Separated Value (CSV) converter, a JavaScript® Object Notation (JSON) converter, a HyperText Markup Language (HTML) converter, and a Smart Document Understanding (SDU) converter.
6. A computer-implemented method comprising: aggregating, by a crawler, a plurality of files with a plurality of file types from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; providing the plurality of files to a plurality of queues interposed between the crawler and a plurality of converters, wherein files with a respective file type are sent to respective queues corresponding to a converter configured to process the files with the respective file type; and extracting data from the plurality of files by the plurality of converters, wherein the plurality of converters incrementally process the plurality of files using the plurality of queues.
7. The method of claim 6, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.
8. The method of claim 7, wherein queue sizes are inversely related to converter processing speeds.
9. The method of claim 7, wherein a first queue corresponding to a first converter with a first processing speed for a first file type is sized to store a first number of files of the first file type, wherein a second queue corresponding to a second converter with a second processing speed for a second file type is sized to store a second number of files of the second file type, wherein the first processing speed is greater than the second processing speed, and wherein the first number of files is less than the second number of files.
10. The method of claim 6, wherein the plurality of converters include at least a Comma Separated Value (CSV) converter, a JavaScript® Object Notation (JSON) converter, and a HyperText Markup Language (HTML) converter.
11. The method of claim 6, wherein the method is performed by one or more computers according to software that is downloaded to the one or more computers from a remote data processing system.
12. The method of claim 11, wherein the method further comprises: metering a usage of the software; and generating an invoice based on metering the usage.
13. A computer program product comprising one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising instructions configured to cause one or more processors to perform a method comprising: aggregating, by a crawler, a plurality of files with a plurality of file types from a tape drive storage system by iterating through the plurality of files on the tape drive storage system based on sequential starting offsets of the files; providing the plurality of files to a plurality of queues interposed between the crawler and a plurality of converters, wherein files with a respective file type are sent to respective queues corresponding to a converter configured to process the files with the respective file type; and extracting data from the plurality of files by the plurality of converters, wherein the plurality of converters incrementally process the plurality of files using the plurality of queues.
14. The computer program product of claim 13, wherein each queue of the plurality of queues is respectively sized based on a processing speed of a corresponding converter.
15. The computer program product of claim 14, wherein queue sizes are inversely related to converter processing speeds.
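Two details of the claims lend themselves to a small sketch: visiting files in order of their sequential starting offsets, and sizing each queue inversely to the speed of its converter. The claims state only that sizes are "inversely related" to processing speeds; the specific rule below (capacity proportional to 1/speed, with a floor of one slot) is an assumption, as are the names `crawl_order`, `size_queues`, and `total_slots`.

```python
def crawl_order(entries):
    """Return file names sorted by ascending starting offset.

    entries: iterable of (name, start_offset) pairs. Reading in offset
    order lets a sequential medium such as tape be traversed in one pass.
    """
    return [name for name, _offset in sorted(entries, key=lambda e: e[1])]


def size_queues(speeds, total_slots):
    """Split total_slots across queues in proportion to 1/speed.

    speeds: dict mapping converter name to files processed per second.
    Assumed rule: a slower converter gets a larger queue, and every
    queue gets at least one slot.
    """
    inverse = {name: 1.0 / speed for name, speed in speeds.items()}
    total = sum(inverse.values())
    return {name: max(1, round(total_slots * weight / total))
            for name, weight in inverse.items()}
```

For example, `size_queues({"csv": 100.0, "sdu": 10.0}, total_slots=22)` gives the slower `sdu` converter a larger queue than the faster `csv` converter, which matches the relationship described in claims 4 and 9.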
Query processing support for facilitating data mining operations in structured databases · CPC title
Details of conversion of file system types or formats · CPC title
Billing or invoicing · CPC title
Data mining · CPC title
Details of archiving (lifecycle management in storage systems G06F3/0649; point-in-time backing up or restoration of persistent data G06F11/1446) · CPC title