Statistical machine translation based search query spelling correction

US10176168B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10176168-B2
Application numberUS-201113296640-A
CountryUS
Kind codeB2
Filing dateNov 15, 2011
Priority dateNov 15, 2011
Publication dateJan 8, 2019
Grant dateJan 8, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Statistical Machine Translation (SMT) based search query spelling correction techniques are described herein. In one or more implementations, search data regarding searches performed by clients may be logged. The logged data includes query correction pairs that may be used to ascertain error patterns indicating how misspelled substrings may be translated to corrected substrings. The error patterns may be used to determine suggestions for an input query and to develop query correction models used to translate the input query to a corrected query. In one or more implementations, probabilistic features from multiple query correction models are combined to score different correction candidates. One or more top scoring correction candidates may then be exposed as suggestions for selection by a user and/or provided to a search engine to conduct a corresponding search using the corrected query version(s).

First claim

Opening claim text (preview).

What is claimed is: 1. A method implemented by one or more computing devices, the method comprising: receiving, at the one or more computing devices via a communications network, a plurality of search queries performed by a plurality of client devices; logging search data regarding the plurality of search queries performed by the plurality of client devices by storing the search data in a data storage device accessible to the one or more computing devices; ascertaining, at the one or more computing devices, error patterns based on query correction pairs described by the logged search data for both words and substrings that are contained by a word; developing, at the one or more computing devices based at least in part upon the ascertained error patterns, multiple query correction models reflecting how frequently the ascertained error patterns occur in the logged search data, including: at least a first query correction model that encodes probabilities for translations of substrings that are contained by a word based on character alignment, each probability for translation determined based on a frequency of a particular input substring being changed to a particular correct substring, and at least a second query correction model that encodes probabilities for n-gram models for both words and characters based on preceding words and characters; receiving, at the one or more computing devices via the communications network, an input search query performed by a subject client device; translating, at the one or more computing devices, the input search query performed by the subject client device to a corrected search query by selecting a top ranking candidate from a group of candidates, each candidate ranked using a combination of probabilities from each of the multiple query correction models; providing the corrected search query to a search engine to generate a set of search results based, at least in part, on the corrected search query; and providing the set of search results to the subject client device via the communications network. 2. A method as described in claim 1 , wherein the combination of probabilities includes multiple probabilities from one or more of the multiple query correction models. 3. A method as described in claim 1 , wherein the combination of probabilities further includes a feature descriptive of translations used for a given candidate. 4. A method as described in claim 1 , wherein the at least a second query correction model comprises a trigram word-based language model configured to encode probabilities for a next word in a sequence given two preceding words in the sequence. 5. A method as described in claim 1 , wherein the at least a second query correction model includes a nine-gram character-based language model configured to encode probabilities for a next character in a sequence given eight preceding characters in the sequence. 6. A method as described in claim 1 , wherein each candidate is ranked according to a weighted log-linear model that combines the probabilities from each of the multiple query correction models. 7. A method as described in claim 6 , wherein the log-linear model has the form of: P ⁡ ( C | Q ) = 1 Z ⁡ ( Q , C ) ⁢ exp ⁢ ∑ i ⁢ λ i ⁢ h i ⁡ ( Q , C ) ⁢ ⁢ where ⁢ ⁢ 1 Z ⁡ ( Q , C ) is a normalization factor, h i (Q,C) represents the probabilities that are combined and λ i represents weight factors applied to the probabilities. 8. A method as described in claim 6 , wherein the top ranking candidate is the candidate that maximizes the weighted log-linear model. 9. One or more computer-readable storage media storing instructions that, when executed via the one or more processors of one or more computing devices, implement a correction module to perform operations comprising: receiving, at the one or more computing devices via a communications network, a plurality of search queries performed by a plurality of client devices; logging search data regarding the plurality of search queries performed by the plurality of client devices by storing the search data in a data storage device accessible to the one or more computing devices; ascertaining, at the one or more computing devices, error patterns based on query correction pairs described by the logged search data for both words and substrings that are contained by a word; developing, at the one or more computing devices based at least in part upon the ascertained error patterns, multiple query correction models reflecting how frequently the ascertained error patterns occur in the logged search data, including: at least a first query correction model that encodes probabilities for translations of substrings that are contained by a word based on character alignment, each probability for translation determined based on a frequency of a particular input substring being changed to a particular correct substring, and at least a second query correction model that encodes probabilities for n-gram models for both words and characters based on preceding words and characters; receiving, at the one or more computing devices via the communications network, an input search query performed by a subject client device; parsing the input search query to obtain individual search terms and substrings contained in the individual search terms; generating a group of candidates corresponding to the input search query using the ascertained error patterns; ranking the group of candidates one to another in accordance with scores computed using the multiple query correction models developed based on the ascertained error patterns; tra

Assignees

Inventors

Classifications

  • G06F40/44Primary

    Statistical methods, e.g. probability models · CPC title

  • G06F40/232Primary

    Orthographic correction, e.g. spell checking or vowelisation · CPC title

  • Indexing; Web crawling techniques · CPC title

  • Query formulation · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10176168B2 cover?
Statistical Machine Translation (SMT) based search query spelling correction techniques are described herein. In one or more implementations, search data regarding searches performed by clients may be logged. The logged data includes query correction pairs that may be used to ascertain error patterns indicating how misspelled substrings may be translated to corrected substrings. The error patte…
Who is the assignee on this patent?
Gao Jianfeng, Hwang Mei Yuh, Huang Xuedong D, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F40/44. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 08 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).