Multilingual content based recommendation system

US9898773B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9898773-B2
Application number  US-201414546719-A
Country             US
Kind code           B2
Filing date         Nov 18, 2014
Priority date       Nov 18, 2014
Publication date    Feb 20, 2018
Grant date          Feb 20, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Example apparatus and methods access multiple sources of information concerning features for applications, clean the data from the multiple sources, extract features from the cleaned data, selectively weight the sources, data or extracted features and produce a feature vector. The feature vector may then be used in a single language feature space or in a multi-language feature space. Feature spaces may then be used to find similarities between applications to facilitate recommending applications. In one embodiment, different feature spaces may be connected using a graph where nodes represent items and edges represent similarity relationships between items based on related feature spaces. Traversing the graph may allow similarities to be found that might not otherwise be possible. For example, while there may be no direct English to Hebrew similarity relationship, there may be English to French and French to Hebrew relationships that can be followed in the graph.
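
The graph traversal described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the item names, the edge scores, and the rule of multiplying scores along a path and keeping the best path are all hypothetical choices made for the example. The abstract only says that indirect relationships (for example English to French, then French to Hebrew) can be followed in the graph.

```python
import heapq

# Hypothetical similarity edges between items in different language feature spaces.
# Scores lie in (0, 1]; every name and value here is made up for illustration.
EDGES = {
    ("app_en", "app_fr"): 0.8,   # English-to-French relationship
    ("app_fr", "app_he"): 0.5,   # French-to-Hebrew relationship
    ("app_en", "game_fr"): 0.9,
    ("game_fr", "app_he"): 0.3,
}


def build_graph(edges):
    """Turn undirected scored edges into an adjacency map."""
    graph = {}
    for (a, b), score in edges.items():
        graph.setdefault(a, {})[b] = score
        graph.setdefault(b, {})[a] = score
    return graph


def indirect_similarity(graph, source, target):
    """Best product of edge scores over any path from source to target.

    Assumed combination rule: multiply scores along a path, keep the maximum.
    Returns 0.0 when no path exists.
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap emulated with negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if node == target:
            return score
        if score < best.get(node, 0.0):
            continue  # stale heap entry, a better path was already found
        for neighbor, edge_score in graph.get(node, {}).items():
            candidate = score * edge_score
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return 0.0


graph = build_graph(EDGES)
score_en_he = indirect_similarity(graph, "app_en", "app_he")
```

Here `app_en` and `app_he` share no direct edge, so the score is derived from two other edge scores, which matches the shape of the final claim element. Other combination rules (minimum along the path, averaging over several paths) would fit the same description equally well.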

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: accessing electronic data from multiple different sources, where the electronic data represents unstructured text in two or more different languages, where the unstructured text represents titles or descriptions for applications, books, movies, or video games available in an electronic marketplace; extracting one or more features from the data; producing a plurality of feature vectors from the one or more features where a feature vector comprises one or more elements; producing from the plurality of feature vectors, one or more feature spaces from which a content-based similarity recommendation can be made; producing a graph with nodes and edges where the nodes represent computer applications, books, movies, or video games available in an electronic marketplace and the edges represent content-based similarity relationships, where the content-based similarity relationships are based, at least in part, on the one or more feature spaces, wherein the graph is represented using a latent vector space model that provides a distance function between nodes in the graph; and producing a content-based similarity score for two nodes that are not directly connected by an edge in the graph, where the content-based similarity score is a function of two or more other content-based similarity scores computed for other pairs of nodes in the graph.

2. The method of claim 1, where cleaning the electronic data comprises performing one or more processes independently or interdependently, where the one or more processes change the capitalization of a word in the electronic data, separate concatenated words in the electronic data, merge synonyms in the electronic data, remove a non-Unicode symbol in the electronic data, remove a banned word in the electronic data, remove an uninformative word in the electronic data, or translate a word in the electronic data.

3. The method of claim 1, comprising producing weights for members of the multiple different data sources, for different types of data, for different types of features, or for different features, and where producing the feature space from the plurality of features depends, at least in part, on the weights.

4. The method of claim 3, the one or more elements being single words, types of nouns, n-grams, short phrases, symbols, acronyms, or abbreviations.

5. The method of claim 4, where the one or more feature spaces are associated with multiple languages.

6. The method of claim 4, cleaning the electronic data to make clean data from which feature vectors can be produced.

7. A computer-readable storage medium storing computer-executable instructions that when executed by a computer control the computer to perform a method, the method comprising: accessing electronic data from multiple different sources, where the electronic data represents unstructured text in two or more different languages, where the unstructured text represents titles or descriptions for applications available in an electronic marketplace; cleaning the electronic data to make clean data from which feature vectors can be produced, where cleaning the electronic data comprises performing one or more processes independently or interdependently, where the one or more processes change the capitalization of a word in the electronic data, separate concatenated words in the electronic data, merge synonyms in the electronic data, remove a non-Unicode symbol in the electronic data, remove a banned word in the electronic data, remove an uninformative word in the electronic data, or translate a word in the electronic data; extracting one or more features from the cleaned data using tokenization, n-gram extraction, proper noun detection, lemmatization, or stemming; producing weights for members of the multiple different data sources, for different types of data, for different types of features, or for different features; producing scores for the one or more features based, at least in part, on the weights and on term frequency-inverse document frequency (TF-IDF) or latent semantic indexing; producing a plurality of feature vectors from the one or more features based, at least in part, on the scores, where a feature vector comprises one or more elements, the one or more elements being single words, types of nouns, n-grams, short phrases, symbols, acronyms, or abbreviations; producing from the plurality of feature vectors, one or more feature spaces from which a content-based similarity recommendation can be made, where the one or more feature spaces are associated with single languages, where the one or more feature spaces depend, at least in part, on the weights; producing a graph whose nodes represents the applications available in the electronic marketplace and whose edges represent content-based similarity relationships, where the content-based similarity relationships are based, at least in part, on the one or more feature spaces, and producing a content-based similarity score for two nodes that are not directly connected by an edge in the graph, where the content-based similarity score is a function of two or more other content-based similarity scores computed for other pairs of nodes in the graph.

8. The media of claim 7, where the graph is represented using a latent vector space model, where the graph or latent vector space model provides a distance function between items and items or between items and users.

9. The media of claim 7, further comprising generating a feature vector of the plurality of feature vectors by: cleaning electronic data from one or more sources to produce cleaned data; extracting one or more features from the cleaned data; determining weights for the one or more sources, for the cleaned data, or for the one or more features, and producing a feature vector from the one or more features.

10. The media of claim 9, further comprising scoring a feature $f$ in the feature vector for an item $a$ according to: $a[f] = L[f] \cdot \sum_{T \mid f \in T} W_T[f] \cdot \mathrm{SCORE}_T[f]$ where: $\mathrm{SCORE}_T[f]$ is the score of the feature $f$ in treatment $T$, $W_T[f]$ are treatment weights, and $L[f]$ are preferred words weight for the feature $f$.

11. A method for content recommendation, comprising: accessing electronic data from multiple different sources, where the electronic data represents unstructured text in two or more different languages, where the unstructured text represents titles or descriptions for items available in an electronic marketplace; extracting one or more features from the data; producing a plurality of feature vectors from the one or more features where a feature vector comprises one or more elements; producing from the plurality of feature vectors, one or more feature spaces from which a content-based similarity recommendation can be made; producing a graph with nodes and edges where the nodes represent items available in an electronic marketplace and the edges represent content-based similarity relationships, where the content-based similarity relationships are based, at least in part, on the one or more feature spaces, wherein the graph is represented using a latent vector space model that provides a distance function between nodes in the graph; and producing a content-based similarity score for two nodes that are not directly connected by an edge in the graph, where the content-based similarity score is a function of two or more other content-based similarity scores computed for other pairs of nodes in the graph.

12. The method of claim 11, where cleaning the electronic data comprises performing one or more processes independently or interdependently, where the one or more processes change the capi
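
The scoring expression in claim 10 can be read as a weighted sum over treatments, scaled by a per-feature preferred-word weight. The sketch below is one possible reading, not the patentee's implementation: the function name, the dictionary layout, the treatment names `title` and `description`, the sample numbers, and the default weight of 1.0 for missing entries are all assumptions introduced for illustration.

```python
def score_feature(feature, treatment_scores, treatment_weights, preferred_weights):
    """Compute a[f] = L[f] * sum over treatments T containing f of W_T[f] * SCORE_T[f].

    treatment_scores:  {treatment: {feature: SCORE_T[f]}}
    treatment_weights: {treatment: {feature: W_T[f]}}  (assumed default 1.0)
    preferred_weights: {feature: L[f]}                  (assumed default 1.0)
    """
    total = 0.0
    for treatment, scores in treatment_scores.items():
        if feature not in scores:
            continue  # the sum only ranges over treatments T that contain f
        weight = treatment_weights.get(treatment, {}).get(feature, 1.0)
        total += weight * scores[feature]
    return preferred_weights.get(feature, 1.0) * total


# Hypothetical inputs: two treatments, here taken to be an item's title and description.
treatment_scores = {
    "title": {"chess": 0.5, "puzzle": 0.25},
    "description": {"chess": 0.25},
}
treatment_weights = {
    "title": {"chess": 2.0, "puzzle": 2.0},
    "description": {"chess": 1.0},
}
preferred_weights = {"chess": 1.5}

chess_score = score_feature("chess", treatment_scores, treatment_weights, preferred_weights)
puzzle_score = score_feature("puzzle", treatment_scores, treatment_weights, preferred_weights)
```

With these sample values, `chess` appears in both treatments and gets 1.5 × (2.0 × 0.5 + 1.0 × 0.25) = 1.875, while `puzzle` appears only in the title and gets 1.0 × (2.0 × 0.25) = 0.5.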

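The cleaning and feature-extraction steps named in claims 2 and 7 can be illustrated with a small sketch. It covers only some of the listed processes (changing capitalization, separating concatenated words, merging synonyms, removing banned and uninformative words, tokenization, and n-gram extraction). The word lists, function names, and the ASCII camel-case splitting rule are assumptions made for the example and do not come from the patent.

```python
import re

# Hypothetical word lists; a real system would presumably maintain these per language.
SYNONYMS = {"pics": "photo", "photos": "photo", "pictures": "photo"}
BANNED = {"spamword"}
UNINFORMATIVE = {"the", "a", "an", "for", "and", "of", "your"}


def clean(text):
    """Return cleaned tokens from a raw title or description."""
    # Separate concatenated words at lower-to-upper boundaries, e.g. "PhotoEditor".
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Change capitalization, then tokenize on runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Merge synonyms onto one canonical form.
    tokens = [SYNONYMS.get(tok, tok) for tok in tokens]
    # Remove banned and uninformative words.
    return [tok for tok in tokens if tok not in BANNED and tok not in UNINFORMATIVE]


def ngrams(tokens, n):
    """Return contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = clean("The PhotoEditor for your Pics")
features = tokens + ngrams(tokens, 2)
```

The resulting `features` list mixes single words and bigrams, which are two of the element types the claims allow in a feature vector. Steps such as translation, proper noun detection, lemmatization, and stemming are left out because they would need language-specific resources.
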

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9898773B2 cover?
Example apparatus and methods access multiple sources of information concerning features for applications, clean the data from the multiple sources, extract features from the cleaned data, selectively weight the sources, data or extracted features and produce a feature vector. The feature vector may then be used in a single language feature space or in a multi-language feature space. Feature sp…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06Q30/0631. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 20, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).