Query generation using structural similarity between documents

US9436747B1 · US · B1

Patent metadata
Field                Value
Publication number   US-9436747-B1
Application number   US-201514750483-A
Country              US
Kind code            B1
Filing date          Jun 25, 2015
Priority date        Nov 9, 2010
Publication date     Sep 6, 2016
Grant date           Sep 6, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer program products, for generating synthetic queries using seed queries and structural similarity between documents are described. In one aspect, a method includes identifying embedded coding fragments (e.g., HTML tag) from a structured document and a seed query; generating one or more query templates, each query template corresponding to at least one coding fragment, the query template including a generative rule to be used in generating candidate synthetic queries; generating the candidate synthetic queries by applying the query templates to other documents that are hosted on the same web site as the document; identifying terms that match structure of the query templates as candidate synthetic queries; measuring a performance for each of the candidate synthetic queries; and designating as synthetic queries the candidate synthetic queries that have performance measurements exceeding a performance threshold.
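
The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`make_template`, `apply_template`, `select_synthetic`), the regex-based matching of simple non-nested tag pairs, the sample documents, and the made-up performance scores are all hypothetical. The abstract does not say how performance is measured, so the sketch takes a scoring function as a parameter.

```python
import re
from typing import Callable, Optional

# Assumption: a "coding fragment" is a simple, non-nested tag pair such as <h1>...</h1>.
TAG_PAIR = re.compile(r"<(\w+)\b[^>]*>([^<]*)</\1>")


def make_template(seed_doc: str, seed_query: str) -> Optional[str]:
    """Return the name of the first tag whose text contains the seed query."""
    for tag, text in TAG_PAIR.findall(seed_doc):
        if seed_query.lower() in text.lower():
            return tag
    return None


def apply_template(tag: str, doc: str) -> list[str]:
    """Extract the text enclosed by the template's tag pair in another document."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}\b[^>]*>([^<]*)</{name}>")
    return [text.strip() for text in pattern.findall(doc) if text.strip()]


def select_synthetic(
    candidates: list[str],
    performance: Callable[[str], float],
    threshold: float,
) -> list[str]:
    """Keep only candidates whose measured performance exceeds the threshold."""
    return [q for q in candidates if performance(q) > threshold]


# Hypothetical documents hosted on the same web site.
seed_doc = "<html><body><h1>Blue Widget</h1><p>In stock</p></body></html>"
other_docs = [
    "<html><body><h1>Red Gadget</h1><p>Sold out</p></body></html>",
    "<html><body><h1>Green Gizmo</h1></body></html>",
]

template = make_template(seed_doc, "blue widget")
candidates = [q for doc in other_docs for q in apply_template(template, doc)]

# Made-up scores standing in for a real performance measurement.
scores = {"Red Gadget": 0.8, "Green Gizmo": 0.2}
synthetic = select_synthetic(candidates, lambda q: scores.get(q, 0.0), 0.5)
```

A regular expression is used here only to keep the sketch short; real HTML with nested or malformed markup would need a proper parser.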

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: identifying, by one or more computers, a seed query for a structured document based on a performance of the seed query with respect to the structured document; identifying, by the one or more computers, a structure of a portion of the structured document that includes at least one term of the seed query; generating, by the one or more computers, a query template that specifies the structure and a portion of the structure from which text should be extracted; generating, by the one or more computers, one or more synthetic queries using the query template and one or more other structured documents, the generating comprising: identifying a portion of a particular structured document that includes the structure specified by the query template; and generating a synthetic query using text contained in the portion of the structure of the particular structured document specified by the query template; and storing, by the one or more computers, the one or more synthetic queries in a query store.

2. The method of claim 1, wherein the query template includes a generative rule that specifies the portion of the structure from which text should be extracted.

3. The method of claim 1, further comprising: receiving a query; identifying a stored synthetic query that includes a term that matches a term of the received query; identifying a search result to provide in response to the query based on the identified stored synthetic query; and providing data that initiates presentation of the search result.

4. The method of claim 1, further comprising: receiving a query; identifying a stored synthetic query that includes a term that matches a term of the received query; and providing data that initiates presentation of the identified stored synthetic query as a potential query refinement for the query.

5. The method of claim 1, wherein the structure comprises a pair of markup tags.

6. The method of claim 1, further comprising identifying the one or more other structured documents based on the one or more other structured documents being hosted on a same domain as the structured document.

7. The method of claim 1, wherein generating the query template comprises: identifying a number of structured documents that include the structure; and generating the query template in response to the number of structured documents that include the structure satisfying a template qualification value.

8. A system, comprising: a data processing apparatus; and a memory storage apparatus in data communication with the data processing apparatus, the memory storage apparatus storing instructions executable by the data processing apparatus and that upon such execution cause the data processing apparatus to perform operations comprising: identifying, by one or more computers, a seed query for a structured document based on a performance of the seed query with respect to the structured document; identifying, by the one or more computers, a structure of a portion of the structured document that includes at least one term of the seed query; generating, by the one or more computers, a query template that specifies the structure and a portion of the structure from which text should be extracted; generating, by the one or more computers, one or more synthetic queries using the query template and one or more other structured documents, the generating comprising: identifying a portion of a particular structured document that includes the structure specified by the query template; and generating a synthetic query using text contained in the portion of the structure of the particular structured document specified by the query template; and storing, by the one or more computers, the one or more synthetic queries in a query store.

9. The system of claim 8, wherein the query template includes a generative rule that specifies the portion of the structure from which text should be extracted.

10. The system of claim 8, wherein the operations further comprise: receiving a query; identifying a stored synthetic query that includes a term that matches a term of the received query; identifying a search result to provide in response to the query based on the identified stored synthetic query; and providing data that initiates presentation of the search result.

11. The system of claim 8, wherein the operations further comprise: receiving a query; identifying a stored synthetic query that includes a term that matches a term of the received query; and providing data that initiates presentation of the identified stored synthetic query as a potential query refinement for the query.

12. The system of claim 8, wherein the structure comprises a pair of markup tags.

13. The system of claim 8, wherein the operations further comprise identifying the one or more other structured documents based on the one or more other structured documents being hosted on a same domain as the structured document.

14. The system of claim 8, wherein generating the query template comprises: identifying a number of structured documents that include the structure; and generating the query template in response to the number of structured documents that include the structure satisfying a template qualification value.

15. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations comprising: identifying, by one or more computers, a seed query for a structured document based on a performance of the seed query with respect to the structured document; identifying, by the one or more computers, a structure of a portion of the structured document that includes at least one term of the seed query; generating, by the one or more computers, a query template that specifies the structure and a portion of the structure from which text should be extracted; generating, by the one or more computers, one or more synthetic queries using the query template and one or more other structured documents, the generating comprising: identifying a portion of a particular structured document that includes the structure specified by the query template; and generating a synthetic query using text contained in the portion of the structure of the particular structured document specified by the query template; and storing, by the one or more computers, the one or more synthetic queries in a query store.

16. The non-transitory computer storage medium of claim 15, wherein the query template includes a generative rule that specifies the portion of the structure from which text should be extracted.

17. The non-transitory computer storage medium of claim 15, wherein the operations further comprise: receiving a query; identifying a stored synthetic query that includes a term that matches a term of the received query; identifying a search result to provide in response to the query based on the identified stored synthetic query; and providing data that initiates presentation of the search result.

18. The non-transitory computer storage medium of claim 15, wherein the operations further comprise: receiving a query; identifying a stored synthetic query that includes a term that matches a term of the received query; and providing data that initiates presentation of the identified stored synthetic query as a potential query refinement for the query.

19. The non-transitory computer storage med
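
Two of the dependent claims lend themselves to a short illustration: the template qualification check (claims 7 and 14) and the lookup of stored synthetic queries by a shared term (claims 3, 4, 10, 11, 17 and 18). The sketch below is an assumption-laden reading, not the claimed implementation. It assumes that "satisfying a template qualification value" means the document count is at least that value, that a term is a whitespace-separated, case-insensitive token, and that the query store is a plain list. All function names and sample data are hypothetical.

```python
import re


def structure_count(tag: str, documents: list[str]) -> int:
    """Count the documents that contain an opening tag with the given name."""
    opening = re.compile(rf"<{re.escape(tag)}\b[^>]*>")
    return sum(1 for doc in documents if opening.search(doc))


def template_qualifies(tag: str, documents: list[str], qualification_value: int) -> bool:
    """Assumption: the value is satisfied when enough documents share the structure."""
    return structure_count(tag, documents) >= qualification_value


def matching_stored_queries(received: str, query_store: list[str]) -> list[str]:
    """Return stored synthetic queries that share at least one term with the received query."""
    terms = set(received.lower().split())
    return [q for q in query_store if terms & set(q.lower().split())]


# Hypothetical documents from one domain and a hypothetical query store.
site_docs = ["<h1>Red Gadget</h1>", "<h1>Green Gizmo</h1>", "<p>About us</p>"]
h1_ok = template_qualifies("h1", site_docs, 2)
p_ok = template_qualifies("p", site_docs, 2)

query_store = ["red gadget", "green gizmo"]
matches = matching_stored_queries("cheap gadget", query_store)
```

The matched queries could then drive either result selection or a query refinement suggestion, which is the distinction the claims draw between the two uses.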

Assignees

  • Google Inc

Inventors

Classifications

  • G06F16/254 · Primary

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

  • Query formulation · CPC title

  • using system suggestions (G06F16/3325 takes precedence) · CPC title

  • G06F16/951 · Primary

    Indexing; Web crawling techniques · CPC title

  • Templates · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9436747B1 cover?
Methods, systems, and apparatus, including computer program products, for generating synthetic queries using seed queries and structural similarity between documents are described. In one aspect, a method includes identifying embedded coding fragments (e.g., HTML tag) from a structured document and a seed query; generating one or more query templates, each query template corresponding to at lea…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/254. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 6, 2016 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).