Language model adaptation based on filtered data

US9564122B2 · US · B2

Patent metadata
Publication number: US-9564122-B2
Application number: US-201414224086-A
Country: US
Kind code: B2
Filing date: Mar 25, 2014
Priority date: Mar 25, 2014
Publication date: Feb 7, 2017
Grant date: Feb 7, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for adapting a language model for a context of a domain, comprising obtaining textual contents from a large source by a request directed to the context of the domain, discarding at least a part of the textual contents that contain textual terms determined as irrelevant to the context of the domain, thereby retaining, as retained data, at least a part of the textual contents that contain textual terms determined as relevant to the context of the domain, and adapting the language model by incorporating therein at least a part of the textual terms of the retained data, wherein the method is performed on an at least one computerized apparatus configured to perform the method and equipped for communication with the large source, and an apparatus for performing the same.
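
The three steps in the abstract (obtain, discard or retain, adapt) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the language model is reduced to unigram counts, relevance is decided by membership in a hypothetical set of domain terms, and the function names `tokenize`, `filter_contents`, and `adapt` are invented.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for "textual terms".
    return text.lower().split()


def filter_contents(contents, relevant_terms):
    """Retain texts that share at least one term with relevant_terms; discard the rest."""
    retained = []
    for text in contents:
        if set(tokenize(text)) & relevant_terms:
            retained.append(text)
    return retained


def adapt(baseline_counts, retained):
    """Return a new unigram count model that incorporates the retained terms."""
    adapted = Counter(baseline_counts)  # copy, so the baseline stays untouched
    for text in retained:
        adapted.update(tokenize(text))
    return adapted


# Hypothetical texts standing in for contents obtained from a large source.
contents = ["reset my router password", "best pasta recipes tonight"]
retained = filter_contents(contents, {"router", "password"})
adapted = adapt(Counter({"router": 1}), retained)
```

Returning a new `Counter` rather than mutating the baseline keeps both models available, which matters if the adapted model is later compared against the baseline.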

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for adapting a language model for a context of a domain, comprising: from a source having textual information with a variety of phrases related to the context of the domain obtaining textual contents as data directed to the context of the domain by querying the source with phrases representative of the subject matter of the domain regardless and irrespective of any language model; responsive to a state of a provided selector, determining in one state semantic relevancy or in another state semantic relevancy and lexical relevancy of the textual contents to the context of the domain; discarding at least a part of the textual contents that contain textual terms determined as irrelevant to the context of the domain, thereby retaining, as retained data, at least a part of the textual contents that contain textual terms determined as relevant to the context of the domain; and adapting the language model by incorporating therein at least a part of the textual terms of the retained data, wherein the method is performed on an at least one computerized apparatus configured to perform the method and equipped for communication with the source.

2. The method according to claim 1, wherein at least a part of the retained data having at least one textual term which do not match any element in a provided set of at least one textual term is discarded prior to adapting the language model.

3. The method according to claim 1, wherein the at least a part of the textual contents that contains textual terms determined as irrelevant to the context of the domain comprises a plurality of parts of the textual contents that contains textual terms determined as irrelevant to the context of the domain.

4. The method according to claim 1, wherein the at least a part of the textual content that contains textual terms determined as relevant to the context of the domain comprises a plurality of parts of the textual contents that contain textual terms determined as relevant to the context of the domain.

5. The method according to claim 1, wherein the textual contents are further obtained via a provided at least one link to the source.

6. The method according to claim 1, wherein the source is at least a part of the Web.

7. The method according to claim 1, wherein determining semantic relevancy of the textual contents directed to the context of the domain comprises determining whether the contents of the textual contents directed to the context of the domain—referred to also as a page content—are sufficiently semantically close to an at least one textual phrase pertaining to a domain—referred to also as a seed—wherein the page content that is sufficiently semantically close to the seed is referred to also as semantically filtered data.

8. The method according to claim 7, wherein the determination of semantically closeness of the page content and the seed comprises mapping the page content and the seed to vectors V[page] and V[seed], respectively, and measuring the degree of relatedness between said two vectors by a similarity function mathematically defined as a mapping of said two vectors to a real number.

9. The method according to claim 8, wherein the page content is determined to have passed a semantic test when the value of the similarity function applied on V[page] and V[seed] is larger than a threshold that is a positive value smaller than 1.

10. The method according to claim 8, wherein the threshold is preset.

11. The method according to claim 8, wherein the threshold is determined by a user.

12. The method according to claim 8, wherein the similarity function similarity is calculated by normalizing each of the vectors by its norm value, denoted respectively as Vn[page] and Vn[seed], and subsequently calculating an inner product of Vn[page] and Vn[seed].

13. The method according to claim 12, wherein the norm of a vector is the square root of the inner product of the vector with itself.

14. The method according to claim 12, wherein the inner product of a vector is a sum of the values of the elements in a vector constructed by an element-wise multiplication of the vector.

15. The method according to claim 7, wherein determining lexical relevancy of the textual contents to the context of the domain comprises determining whether utterances in the semantically filtered data contains at least one phrase—without precluding a word—from provided phrases related to the domain.

16. A method for adapting a baseline language model for a context of a domain by data of the Web, comprising: obtaining, from the domain, textual data as data representative of the context of the domain; based on the data representative of the context of the domain and regardless and irrespective of any language model, forming a query that is provided to an at least one search engine of the Web, thereby acquiring an at least one result comprising textual contents; responsive to a state of a provided selector, determining in one state semantic relevancy or in another state semantic relevancy and lexical relevancy of the at least one result to the context of the domain; discarding at least a part of the at least one result in which the textual contents includes at least one textual term that does not pertain to the data representative of the context of the domain; adapting the baseline language model to an adapted language model by incorporating therein textual terms of the at least one result that pertain to the data representative of the context of the domain, wherein the method is performed on an at least one computerized apparatus configured to perform the method and equipped for communication with at least one computerized server linkable to the Web.

17. The method according to claim 16, wherein the discarding of the at least a part of the at least one result comprises discarding all parts of the at least one result in which the textual contents comprises at least one textual term that does not pertain to the data representative of the context of the domain.

18. The method according to claim 16, wherein the discarding of the at least a part of the at least one result comprises discarding the at least one results in which the textual content comprises at least one textual term that does not pertain to the data representative of the context of the domain.

19. The method according to claim 16, further discarding at least a part of the at least one result having textual terms which do not match any element in a provided set of at least one textual term.

20. The method according to claim 16, wherein the acquiring at least one result further comprises acquiring further textual contents via a provided at least one link to a Web site.

21. The method according to claim 16, wherein the at least one result comprises a plurality of results.

22. The method according to claim 16, wherein the textual terms that pertain to the data representative of the context of the domain are determined based on a semantic relationship between the textual terms and the data representative of the context of the domain.

23. The method according to claim 16, further comprising evaluating the adapted language model by comparing the performance of the adapted language model with the performance of the baseline language model in recognizing textual terms in a provided speech data comprising coded textual terms related to the domain.

24. The method according to claim 23, wherein the p
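
The relevancy tests in claims 1, 7 to 9, and 12 to 15 can be illustrated with a short sketch: map page and seed to vectors, normalize each by its norm, take the inner product, compare against a threshold between 0 and 1, and optionally apply a lexical phrase check depending on a selector. The claims do not say how text is mapped to vectors, so the bag-of-words mapping below is an assumption, as are the function names, the selector values `"semantic"` and `"semantic+lexical"`, the default threshold of 0.3, and the treatment of the whole page as one utterance in the lexical check.

```python
import math
from collections import Counter


def to_vector(text):
    # Assumption: word-count vector; the claims leave the mapping unspecified.
    return Counter(text.lower().split())


def inner_product(u, v):
    # Sum of element-wise products over the shared coordinates (claim 14).
    return sum(x * v[k] for k, x in u.items() if k in v)


def norm(v):
    # Square root of the inner product of the vector with itself (claim 13).
    return math.sqrt(inner_product(v, v))


def normalize(v):
    n = norm(v)
    return {k: x / n for k, x in v.items()} if n else {}


def similarity(page, seed):
    # Inner product of the two normalized vectors (claim 12).
    return inner_product(normalize(to_vector(page)), normalize(to_vector(seed)))


def is_relevant(page, seed, phrases, selector, threshold=0.3):
    """Semantic test always; lexical test only when the selector asks for it."""
    if similarity(page, seed) <= threshold:
        return False
    if selector == "semantic":
        return True
    # Lexical test: the page must contain at least one provided phrase or word.
    return any(p.lower() in page.lower() for p in phrases)


page = "how to reset a router password"
seed = "router password reset"
score = similarity(page, seed)
```

With these definitions the similarity is the cosine of the angle between the two count vectors, so it falls between 0 and 1 for non-negative counts, which is consistent with a threshold that is "a positive value smaller than 1".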

Assignees

Nice-Systems Ltd, Nice Ltd

Inventors

Classifications

  • Semantic analysis · CPC title

  • Adaptation · CPC title

  • G10L15/00 · Primary

    Speech recognition (G10L17/00 takes precedence) · CPC title

  • using context dependencies, e.g. language models · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Nice-Systems Ltd, Nice Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 7, 2017 (B2). Legal status and post-grant events are not shown on this page.