Finding partition boundaries for parallel processing of markup language documents

US9477651B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9477651-B2
Application number: US-89324810-A
Country: US
Kind code: B2
Filing date: Sep 29, 2010
Priority date: Sep 29, 2010
Publication date: Oct 25, 2016
Grant date: Oct 25, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, a computer program product and a system identify partition locations within an extended markup language (XML) document without parsing so as to process portions of said document in parallel. The XML document includes sections required to remain continuous. The document is scanned for continuous sections without parsing, and boundaries of the initial partitions are adjusted to reside outside the continuous sections to determine resulting partitions for the document. The resulting partitions may be processed in parallel to provide the document information for storage.
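The boundary adjustment described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the equal-size initial split, and the restriction to comments and CDATA sections are all assumptions made for illustration. It uses plain substring search rather than an XML parser, scans only from the previous settled point to the next candidate point, and always moves a point forward past the section. It does not protect element tags or nested node definitions, so a boundary may still land inside a tag.

```python
# Hypothetical sketch: names and the equal-size initial split are assumptions.
CONTINUOUS = {"<!--": "-->", "<![CDATA[": "]]>"}  # opener -> closer


def settle_point(text, prev, point):
    """Scan text[prev:point] and return a boundary outside any continuous section.

    `prev` must itself lie outside every continuous section.
    """
    pos = prev
    while pos < point:
        hits = []
        for opener in CONTINUOUS:
            # Also catch an opener that starts before `point` but straddles it.
            i = text.find(opener, pos, point + len(opener) - 1)
            if i != -1:
                hits.append((i, opener))
        if not hits:
            return point
        start, opener = min(hits)  # earliest section start wins
        closer = CONTINUOUS[opener]
        end = text.find(closer, start + len(opener))
        end = len(text) if end == -1 else end + len(closer)
        if end > point:
            return end  # point fell inside the section: move it past the close
        pos = end
    return point


def partition_points(text, parts):
    """Split into `parts` roughly equal pieces, then adjust each boundary."""
    size = len(text)
    initial = [size * k // parts for k in range(1, parts)]
    settled, prev = [], 0
    for point in initial:
        if point <= prev:
            continue  # already covered by an earlier adjustment
        prev = settle_point(text, prev, point)
        if prev < size:
            settled.append(prev)
    return settled
```

Calling `partition_points(document_text, 4)` returns up to three character offsets; slicing the text at those offsets yields pieces in which no comment or CDATA section is split.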

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of identifying partition locations within an XML document and performing parallel processing of the XML document, said method comprising: determining, by a processor, a partition node XPath in the XML based upon extract, transfer and load (ETL) job requirements and a schema of the XML document wherein the partition node XPath is a path occurring multiple times within the XML document at a main body portion; identifying, by the processor, a header context of the XML document by parsing the XML document from a start of the XML document to a point in the XML document before a first occurrence of a partition node in the partition node XPath; marking, by the processor, said XML document at a location prior to the first occurrence of said partition node with an indication of an end point of said header context; identifying, by the processor, a footer context of the XML document by reverse parsing of the XML document from an end of the XML document until a first occurrence of a close of a partition in the partition node XPath; marking, by the processor, said XML document at a location after the first occurrence of the close of said partition node with an indication of a start point of said footer context; and merging, by the processor, the header context and the footer context within said XML document, wherein the merging comprises moving values of the footer context to a marked location at an end of the header context while maintaining sequencing of level information within the header context and the footer context, and each resulting partition is processed with said merged header and said footer context; before parsing the main body portion of the XML document: determining, by the processor, locations within said XML document to form initial partitions, scanning without parsing, by the processor, said XML document to identify sections required to remain continuous based on the ETL job requirements and the schema of the XML document, adjusting, without parsing by the processor, boundaries of said initial partitions to reside outside said continuous sections to determine resulting partitions for said XML document; and performing parsing via parallel processing of the XML document, by a plurality of processors, using the adjusted boundaries of the resulting partitions.

2. The method of claim 1, further comprising: processing said resulting partitions in parallel to provide document information for storage.

3. The method of claim 1, wherein the adjusting boundaries of said initial partitions is performed to maintain at least one of a character data section, a comment section, and a nested node definition within a single continuous section.

4. The method of claim 1, wherein scanning said document for said continuous sections without parsing and adjusting, without parsing, boundaries of said initial partitions to reside outside said continuous sections to determine resulting partitions for said document comprises: a) scanning said document from a start point of said XML document to a first partition point to determine whether the first partition point is located within a continuous section; b) in response to a determination that the first partition point is within a continuous section, moving the first partition point to a location within said XML document that is prior or subsequent to an occurrence of the continuous section; c) repeating steps a) and b) with subsequent partition points until reaching an end of said document, wherein said scanning occurs from an immediate prior partition point to a next partition point in said document.

5. The method of claim 1, wherein said XML document has a memory size of at least 1 GB.

6. A computer program product for identifying partition locations within an XML document and performing parallel processing of the XML document, the computer program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to: determine, by a processor, a partition node XPath in the XML based upon extract, transfer and load (ETL) job requirements and a schema of the XML document wherein the partition node XPath is a path occurring multiple times within the XML document at a main body portion; identify, by the processor, a header context of the XML document by parsing the XML document from a start of the XML document to a point in the XML document before a first occurrence of a partition node in the partition node XPath; mark, by the processor, said XML document at a location prior to the first occurrence of said partition node with an indication of an end point of said header context; identify, by the processor, a footer context of the XML document by reverse parsing of the XML document from an end of the XML document until a first occurrence of a close of a partition in the partition node XPath; mark, by the processor, said XML document at a location after the first occurrence of the close of said partition node with an indication of a start point of said footer context; and merge, by the processor, the header context and the footer context within said XML document, wherein the merging comprises moving values of the footer context to a marked location at an end of the header context while maintaining sequencing of level information within the header context and the footer context, and each resulting partition is processed with said merged header and said footer context; before parsing the main body portion of the XML document, the computer readable program code is further configured to: determine, by the processor, locations within said XML document to form initial partitions, scan, without parsing, said XML document to identify sections required to remain continuous based on the ETL job requirements and the schema of the XML document, and adjust, without parsing, boundaries of said initial partitions to reside outside said continuous sections to determine resulting partitions for said document; and perform parsing via parallel processing of the XML document, by a plurality of processors, using the adjusted boundaries of the resulting partitions.

7. The computer program product of claim 6, wherein said computer readable program code is further configured to: process said resulting partitions in parallel to provide document information for storage.

8. The computer program product of claim 6, wherein the computer readable program code is further configured to adjust boundaries of said initial partitions so as to maintain at least one of a character data section, a comment section, and a nested node definition within a single continuous section.

9. The computer program product of claim 6, wherein said computer readable program code is configured to scan said XML document for said continuous sections without parsing and adjust boundaries of said initial partitions to reside outside said continuous sections to determine resulting partitions for said document by: a) scanning said XML document from a start point of said document to a first partition point to determine whether the first partition point is located within a continuous section; b) in response to a determination that the first partition point is within a continuous section, moving the first partition point to a location within said document that is prior or subsequent to an occurrence of the continuous section; c) repeating steps a) and b) with subsequent partition points until reaching an end of said XML document, wherein said scanning occurs from an immediate prior partition point to a next partition point in said document.

10. A system for identifying part
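The header and footer contexts recited in claims 1 and 6 can be pictured with a simplified Python sketch. Several things here are assumptions and not taken from the patent: the partition node is given as a bare tag name, not an XPath, the contexts are located by text search where the claims recite forward and reverse parsing, and the function names are hypothetical. The claimed merge step (moving footer values to a marked location at the end of the header) is not reproduced; the sketch simply wraps each fragment in the header and footer text so that it can be parsed on its own.

```python
import re
import xml.etree.ElementTree as ET


def split_contexts(text, tag):
    """Return (header, body, footer) around the repeated `tag` elements.

    Assumption: `tag` is a plain element name and every partition node
    has an explicit closing tag (no self-closing form).
    """
    first_open = re.search(r"<%s[\s>/]" % re.escape(tag), text)
    close_tag = "</%s>" % tag
    last_close = text.rfind(close_tag)  # stands in for the reverse scan
    if first_open is None or last_close == -1:
        raise ValueError("partition node %r not found" % tag)
    body_end = last_close + len(close_tag)
    header = text[:first_open.start()]
    body = text[first_open.start():body_end]
    footer = text[body_end:]
    return header, body, footer


def parse_partition(header, fragment, footer):
    """Parse one body fragment inside the shared header/footer context."""
    return ET.fromstring(header + fragment + footer)
```

For a document such as `<catalog><meta>v1</meta><item>a</item><item>b</item><note>end</note></catalog>` with `item` as the partition node, each `item` fragment can be handed to a separate worker together with the same header and footer strings, and every worker still sees the enclosing `catalog` element.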

Assignees

Inventors

Classifications

  • Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces · CPC title

  • Parallelism detection · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9477651B2 cover?
A method, a computer program product and a system identify partition locations within an extended markup language (XML) document without parsing so as to process portions of said document in parallel. The XML document includes sections required to remain continuous. The document is scanned for continuous sections without parsing, and boundaries of the initial partitions are adjusted to reside o…
Who is the assignee on this patent?
Agarwal Manoj K, Bar-Or Amir, Bhide Manish Anand, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F17/2705. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 25, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).