Managing record format information

US10445309B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10445309-B2
Application number: US-94509410-A
Country: US
Kind code: B2
Filing date: Nov 12, 2010
Priority date: Nov 13, 2009
Publication date: Oct 15, 2019
Grant date: Oct 15, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Data is prepared for processing in a data processing system using format information. Data is received that includes records that have values for fields over an input device or port. A target record format for processing the data is determined. Multiple records are analyzed according to validation tests to determine whether the data matches candidate record formats. Each candidate record format specifies a format for each field, and each validation test corresponds to at least one candidate record format. In response to receiving results of the validation tests, the target record format is associated with the data based on at least one of: a candidate record format for which at least a partial match was determined according to at least one validation test, a parsed record format selected according to a data type associated with the data, and a constructed record format generated from an analysis of data characteristics.
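The selection order described in the abstract (a matching candidate format first, then a format chosen by data type, then a constructed format) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `CandidateFormat`, `fraction_valid`, `choose_target_format`, and `construct_format`, the delimiter-based splitting, and the `threshold` parameter are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FieldTest = Callable[[str], bool]


@dataclass
class CandidateFormat:
    """Hypothetical candidate record format: one validation test per field."""
    name: str
    delimiter: str
    field_tests: list[FieldTest]


def fraction_valid(records: list[str], fmt: CandidateFormat) -> float:
    """Fraction of expected field values that pass their field test."""
    passed = total = 0
    for line in records:
        values = line.split(fmt.delimiter)
        total += len(fmt.field_tests)
        if len(values) != len(fmt.field_tests):
            continue  # wrong field count: nothing in this record counts as valid
        passed += sum(test(v) for test, v in zip(fmt.field_tests, values))
    return passed / total if total else 0.0


def construct_format(records: list[str]) -> str:
    """Very rough constructed format: guess a delimiter from the first record."""
    first = records[0] if records else ""
    delimiter = max(",|\t;", key=first.count)
    return f"{first.count(delimiter) + 1} fields delimited by {delimiter!r}"


def choose_target_format(
    records: list[str],
    candidates: list[CandidateFormat],
    formats_by_type: dict[str, str],
    data_type: Optional[str],
    threshold: float = 0.9,  # assumed cutoff for "at least a partial match"
) -> tuple[str, str]:
    # 1. Prefer a candidate format whose validation tests pass often enough.
    scores = {fmt.name: fraction_valid(records, fmt) for fmt in candidates}
    if scores:
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            return ("candidate", best)
    # 2. Otherwise fall back to a format selected by a known data type.
    if data_type in formats_by_type:
        return ("parsed", formats_by_type[data_type])
    # 3. Otherwise construct a format from characteristics of the data.
    return ("constructed", construct_format(records))
```

With three sample records such as `"CA,90210"`, `"NY,10001"`, and `"TX,7330X"`, and a candidate that tests the first field against a set of state codes and the second for five digits, five of six values pass, so the candidate is accepted at a threshold of 0.8 but not at 0.9.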

First claim

Opening claim text (preview).

What is claimed is:

1. A method for discovery of record formats of data records for processing in a data processing system, the method including: receiving, by the data processing system from a data source, a data stream including plural distinct records that have a record format, with the records having fields that have data values; and selecting by the data processing system a record format that corresponds to the format of the data source, with the record format being one of a plurality of distinct candidate record formats, by: accessing from storage the distinct candidate record formats, with each particular one of the distinct candidate record formats specifying a data type for each field of a group of one or more fields of that particular one of the distinct candidate record formats; for each of two or more particular candidate record formats accessed, parsing data in each of multiple ones of the received, distinct records with a parser that applies, to the data, a data type for a field that is specified by the particular candidate record format; for at least one of the two or more particular candidate record formats accessed, determining that the parser identifies one or more errors when attempting to parse data in at least one of the multiple ones of the received, distinct records; and responsive to that determination, storing results data that specifies the data type or the field that was not parsed; and for each of the two or more particular candidate record formats accessed, determining a measure of correspondence for the particular candidate record format based on an amount of data in each of the multiple ones of the received, distinct records that is successfully parsed by data types for those fields specified by the particular candidate record format, which measure of correspondence is based on an extent to which the particular candidate record format corresponds to the format of each of the multiple ones of the received, distinct records; wherein the determined measure of correspondence for the at least one of the two or more particular candidate record formats accessed is further based on a number of one or more errors that the parser identifies for one or more data types of one or more corresponding fields specified by the at least one of the two or more particular candidate record formats, as specified by the stored results data for the at least one of the two or more particular candidate record formats; and wherein the selected record format has a higher or equivalent measure of correspondence, relative to one or more other measures of correspondence for one or more other distinct candidate record formats.

2. The method of claim 1, wherein a record is associated with a known file type.

3. The method of claim 2, wherein the file type corresponds to a file extension.

4. The method of claim 1, wherein the received, plural distinct records include first records, and the particular candidate record format includes a first candidate record format, and wherein the method further includes: receiving second records that each have one or more values for respective fields over the input device or port; and associating a second candidate record format with the second records based on: applying validation tests to the second records and none of the validation tests applied to the second records determining at least a partial match to one or more of the candidate record formats, and not having a known data type associated with the second records.

5. The method of claim 1, further including: recognizing tags in the data stream; parsing the data stream to determine the multiple ones of the received, distinct records based on the recognized tags; and generating a constructed record format based on the recognized tags in the data stream.

6. The method of claim 1, further including: recognizing delimiters in the data stream; parsing the data stream to determine the multiple ones of the received, distinct records based on the recognized delimiters; and determining the record format from the parsed data stream.

7. The method of claim 1, further including: generating a constructed record format from an analysis of characteristics of the data by recognizing that the data is in a binary form without tags or delimiters indicating values of multiple records; and receiving one or more field identifiers from a user interface.

8. The method of claim 1, further including: applying the particular candidate record format that corresponds to the format of the multiple ones of the received, distinct records to the received data stream to determine values for each of a plurality of fields in the multiple ones of the received, distinct records.

9. The method of claim 1, further including: determining whether the received, distinct records match the particular candidate record format by: analyzing values for the received, distinct records according to a validation test to determine whether a number of valid values is larger than a predetermined threshold.

10. The method of claim 9, wherein analyzing determined values for a first record of the received, distinct records according to the validation test includes performing a corresponding field test on each determined value for each field.

11. The method of claim 10, wherein performing a first field test on a determined value for a first field includes matching a number of characters in the determined value to a predetermined number of characters.

12. The method of claim 10, wherein performing a first field test on a determined value for a first field includes matching the determined value to one of multiple predetermined valid values for the first field.

13. The method of claim 10, wherein the number of valid values is based on a number of records for which the determined value for a given field passes the field test corresponding to the given field.

14. The method of claim 1 wherein for at least a first one of the candidate record formats accessed, determining that the parser was unable to parse data in at least one of the multiple ones of the received, distinct records; and responsive to that determination, logging in an error log an error entry that specifies the data type for the field that was not parsed.

15. The method of claim 14 wherein determining a measure of correspondence includes: determining the measure of correspondence for the particular candidate record format based on a number of errors that are logged for the data types of corresponding fields specified by the particular candidate record format, as specified by the error log for the at least first one of the candidate record formats.

16. A computer-readable storage medium storing a computer program for discovery of record formats of data records for processing in a data processing system, the computer program including instructions for causing a computer to: receive, from a data source, a data stream including plural distinct records that have a record format, with the records having fields that have data values; and select a record format that corresponds to the format of the data source, with the record format being one of a plurality of distinct candidate record formats, by: accessing from storage the distinct candidate record formats, with each particular one of the distinct candidate record formats specifying a data type for each field of a group of one or more fields of that particular one of the distinct candidate record formats; for each of two or more particular candidate record formats accessed, parsing data in each of multiple ones o
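The scoring step recited in claim 1 (parse each record with the data types a candidate format specifies, store which data type and field failed, then compare measures of correspondence) might look something like the sketch below. The claim does not give a formula, so the one used here (share of characters parsed minus an error-rate penalty) is purely an assumption, as are the names `PARSERS`, `score_format`, and `select_format`, the comma delimiter, and the `%Y-%m-%d` date layout.

```python
from datetime import datetime

# Hypothetical data-type parsers; each raises ValueError on bad input.
PARSERS = {
    "int": int,
    "decimal": float,
    "date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
    "string": str,
}


def score_format(records, field_types, delimiter=","):
    """Return (measure, results) for one candidate record format.

    `results` plays the role of the stored results data: one entry per
    parse error, naming the record, the field index, and the data type.
    """
    parsed_chars = total_chars = 0
    results = []
    for row, line in enumerate(records):
        values = line.split(delimiter)
        total_chars += sum(len(v) for v in values)
        # zip stops at the shorter sequence, so extra values are simply
        # left unparsed and lower the coverage term below.
        for idx, (type_name, value) in enumerate(zip(field_types, values)):
            try:
                PARSERS[type_name](value)
                parsed_chars += len(value)
            except ValueError:
                results.append({"record": row, "field": idx, "data_type": type_name})
    # Assumed formula: amount of data parsed, reduced by the error rate.
    coverage = parsed_chars / total_chars if total_chars else 0.0
    penalty = len(results) / max(1, len(records) * len(field_types))
    return coverage - penalty, results


def select_format(records, candidates):
    """Pick the candidate with the highest measure of correspondence."""
    measures = {name: score_format(records, types)[0] for name, types in candidates.items()}
    return max(measures, key=measures.get), measures
```

One design caveat with this particular formula: a candidate made up entirely of `string` fields parses everything without error, so a practical scorer would likely need a tie-break that favors more specific data types.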

Classifications

  • Ensuring data consistency and integrity · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10445309B2 cover?
Data is prepared for processing in a data processing system using format information. Data is received that includes records that have values for fields over an input device or port. A target record format for processing the data is determined. Multiple records are analyzed according to validation tests to determine whether the data matches candidate record formats. Each candidate record format…
Who is the assignee on this patent?
Parmenter David W, Gould Joel, Farver Jennifer M, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F16/2365. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 15, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).