Who is the assignee on this patent?

Baidu USA LLC, Baidu Com Times Tech Beijing Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06N3/042. Mapped technology areas include Physics.

When was this patent published?

Publication date Mar 5, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Video recommendation with multi-gate mixture of experts soft actor critic

US11922287B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11922287-B2
Application number	US-202017040039-A
Country	US
Kind code	B2
Filing date	Jul 15, 2020
Priority date	Jul 15, 2020
Publication date	Mar 5, 2024
Grant date	Mar 5, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are embodiments of a reinforcement learning based large-scale multi-objective ranking system. Embodiments of the system may be used for optimizing short-video recommendation on a video sharing platform. Multiple competing ranking objective and implicit selection bias in user feedback are the main challenges in real-world platform. In order to address those challenges, multi-gate mixture of experts (MMoE) and soft actor critic (SAC) are integrated together into a MMoE_SAC system. Experiment results demonstrate that embodiments of the MMoE_SAC system may greatly reduce a loss function compared to systems only based on single strategies.
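The MMoE layer named in the abstract can be pictured with a small forward pass. The following is a minimal sketch, not the patent's implementation: it assumes the standard MMoE layout of one softmax gate per task over a shared pool of experts, and it stands in a single ReLU layer for each expert network. All names (`dense_relu`, `mmoe_forward`, `task_gates`) and all dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def dense_relu(x, weights, bias):
    """One fully connected layer with ReLU, a stand-in for an expert DNN."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax, so the gate weights sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical random initialisation for the sketch."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def mmoe_forward(embedding, experts, task_gates):
    """Return one (gate weights, mixed output) pair per task.

    Every expert sees the same embedding; each task's gate produces one
    weight per expert, and the task output is the weighted sum of the
    expert outputs.
    """
    expert_outputs = [dense_relu(embedding, w, b) for w, b in experts]
    out_dim = len(expert_outputs[0])
    results = []
    for gate_matrix in task_gates:
        logits = [sum(w * e for w, e in zip(row, embedding))
                  for row in gate_matrix]
        gate = softmax(logits)
        mixed = [sum(g * out[j] for g, out in zip(gate, expert_outputs))
                 for j in range(out_dim)]
        results.append((gate, mixed))
    return results


# Assumed toy sizes: 4-dim embedding, 3-dim expert output, 2 experts, 2 tasks.
rng = random.Random(0)
EMB_DIM, HIDDEN, N_EXPERTS, N_TASKS = 4, 3, 2, 2
experts = [(random_matrix(HIDDEN, EMB_DIM, rng), [0.0] * HIDDEN)
           for _ in range(N_EXPERTS)]
task_gates = [random_matrix(N_EXPERTS, EMB_DIM, rng) for _ in range(N_TASKS)]
embedding = [0.2, -0.1, 0.4, 0.3]

for task_id, (gate, mixed) in enumerate(mmoe_forward(embedding, experts, task_gates)):
    print(f"task {task_id}: gate={gate} output={mixed}")
```

In a full system each task's mixed output would feed a task-specific head; the sketch stops at the weighted sum because that is the step the page describes.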

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for multi-objective ranking comprising: receiving, at a multi-gate mixture of experts (MMoE) layer comprising multiple experts and a gating network, embeddings corresponding to one or more states and one or more actions, wherein each expert is a deep neural network (DNN); generating, by each of multiple experts using soft actor critic (SAC), an expert output based on the embeddings, each expert output related to one or more prediction parameters corresponding to one or more actions; obtaining a weighted sum of the expert outputs by the multiple experts, in accordance with weights generated by the gating network for the experts, in which each expert has a corresponding weight obtained from the gating network; and generating a prediction output based on the weighted sum, wherein a training process comprises: regarding each action as a task; adding an entropy-regularized term to a policy function; and learning policy parameters by minimizing a Kullback-Leibler divergence between the policy function and a quotient obtained by dividing an exponential of a soft-Q function and a partition function.

2. The computer-implemented method of claim 1 wherein the embeddings are generated by steps of: dividing a plurality of features for the one or more states and the one or more actions into categorical features and numerical features; and defining a universal dynamic feature embedding dictionary to map or project the plurality of features into a unified embedding space for the embedding.

3. The computer-implemented method of claim 2 wherein defining a universal dynamic feature embedding dictionary to map or project the plurality of features into a unified embedding space comprising: using a one-hot or multi-hot vector for each embedding lookup for categorical features; and transforming, using a transformation weight matrix, the categorical features from sparse features to dense features.

4. The computer-implemented method of claim 1 wherein loss calculation for each of the one of more actions is independent from each other during a training process.

5. The computer-implemented method of claim 1 wherein the training process further comprises steps of: implementing a soft policy iteration by repeating soft policy evaluation and soft policy improvement alternately; and training soft-Q function parameters by minimizing a soft Bellman residual.

6. The computer-implemented method of claim 5 wherein during the soft policy iteration, a soft Q-function with a minimum Q-value among multiple soft Q-functions is taken for each policy improvement step.

7. A system for multi-objective ranking comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: converting features from one or more data sources into embeddings; receiving, at a multi-gate mixture of experts (MMoE) layer comprising multiple experts and a gating network, the embeddings corresponding to one or more states and one or more actions, wherein each expert is a neural network; generating, by each of multiple experts using soft actor critic (SAC), a prediction based on an input, each expert output related to one or more prediction parameters corresponding to one or more actions; obtaining a weighted sum of the expert outputs by the multiple experts, in accordance with weights generated by the gating network for the experts, in which each expert has a corresponding weight obtained from the gating network; and generating a prediction output based on the weighted sum, wherein a training process comprises: regarding each action as a task; adding an entropy-regularized term to a policy function; and learning policy parameters by minimizing a Kullback-Leibler divergence between the policy function and a quotient obtained by dividing an exponential of a soft-Q function and a partition function.

8. The system of claim 7 wherein converting features from one or more data sources into embeddings comprises the steps of: dividing the features into categorical features and numerical features; and defining a universal dynamic feature embedding dictionary to map or project the features into a unified embedding space for the embedding.

9. The system of claim 8 wherein defining a universal dynamic feature embedding dictionary to map or project input features into a unified embedding space comprises the steps of: using a one-hot or multi-hot vector for each embedding lookup for categorical features; and transforming, using a transformation weight matrix, the categorical features from sparse features to dense features.

10. The system of claim 7 wherein the non-transitory computer-readable medium or media further comprises one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps of training to be performed comprising: implementing a soft policy iteration by repeating soft policy evaluation and soft policy improvement alternately; and training soft-Q function parameters by minimizing a soft Bellman residual.

11. The system of claim 10 wherein during the soft policy iteration, a Q-function with a minimum Q-value among multiple Q-functions is taken for each policy improvement step.

12. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps for multi-objective ranking comprising: converting features from one or more data sources into embeddings; receiving, at a multi-gate mixture of experts (MMoE) layer comprising multiple experts and a gating network, the embeddings corresponding to one or more states and one or more actions, wherein each expert is a neural network; generating, by each of multiple experts using soft actor critic (SAC), a prediction based on an input, each expert output related to one or more prediction parameters corresponding to one or more actions; obtaining a weighted sum of the expert outputs by the multiple experts, in accordance with weights generated by the gating network for the experts, in which each expert has a corresponding weight obtained from the gating network; and generating a prediction output based on the weighted sum, wherein a training process comprises: regarding each action as a task; adding an entropy-regularized term to a policy function; and learning policy parameters by minimizing a Kullback-Leibler divergence between the policy function and a quotient obtained by dividing an exponential of a soft-Q function and a partition function.

13. The non-transitory computer-readable medium or media of claim 12 wherein converting features from one or more data sources into embeddings comprises steps of: dividing a plurality of features for one or more states and the one or more actions into categorical features and numerical features; and defining a universal dynamic feature embedding dictionary to map or project the plurality of features into a unified embedding space for the embedding.

14. The non-transitory computer-readable medium or media of claim 13 wherein defining a universal dynamic feature embedding dictionary to map or project input features into a unified embedding space comprises the steps of: using a one-hot or multi-hot vector for each embedding lookup for categorical features; and transforming, using a transformation weight matrix, the categorical features from sparse features to dense features.
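The training steps recited in the claims (an entropy-regularized policy, a Kullback-Leibler policy update, a soft Bellman residual, and a minimum over several soft Q-functions) can be written out in the notation of the standard soft actor-critic formulation. This page does not reproduce the patent's own equations, so the symbols below (temperature \alpha, discount \gamma, replay distribution \mathcal{D}, partition function Z) are assumptions borrowed from that standard formulation, not the patent's notation.

```latex
% Entropy-regularized objective: expected reward plus policy entropy
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% Soft policy improvement: KL divergence between the policy and the
% exponential of the soft Q-function divided by the partition function
\pi_{\text{new}} = \arg\min_{\pi' \in \Pi}
  D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\middle\|\,
  \frac{\exp\!\big(\tfrac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\big)}
       {Z^{\pi_{\text{old}}}(s_t)} \right)

% Soft policy evaluation: soft Bellman residual for Q-function parameters
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}
  \left[ \tfrac{1}{2} \Big( Q_\theta(s_t, a_t)
  - \big( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}[ V(s_{t+1}) ] \big) \Big)^{2} \right],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi}
  \left[ Q_\theta(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]

% Minimum over several soft Q-functions, used in the policy improvement step
Q_{\min}(s_t, a_t) = \min_{i} Q_{\theta_i}(s_t, a_t)
```

Setting \alpha = 1 recovers the claim wording, in which the quotient is the plain exponential of the soft-Q function over the partition function. Taking the minimum over several Q-functions is the usual guard against overestimated values during policy improvement.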

Assignees

Inventors

Classifications

G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/092
Reinforcement learning · CPC title
G06N3/042 (Primary)
Knowledge-based neural networks; Logical representations of neural networks · CPC title
G06N3/08 (Primary)
Learning methods · CPC title

Patent family

Related publications grouped by family.

View patent family 79291538

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11922287B2 cover?: Described herein are embodiments of a reinforcement learning based large-scale multi-objective ranking system. Embodiments of the system may be used for optimizing short-video recommendation on a video sharing platform. Multiple competing ranking objective and implicit selection bias in user feedback are the main challenges in real-world platform. In order to address those challenges, multi-gat…

Related patents

Online hyperparameter tuning in distributed machine learning

Addressing a loss-metric mismatch with adaptive loss alignment

Generation of video recommendations using connection networks

Generation of Video Recommendations Using Connection Networks

Recommender system
