Video recommendation with multi-gate mixture of experts soft actor critic

US11922287B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11922287-B2
Application number: US-202017040039-A
Country: US
Kind code: B2
Filing date: Jul 15, 2020
Priority date: Jul 15, 2020
Publication date: Mar 5, 2024
Grant date: Mar 5, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are embodiments of a reinforcement learning based large-scale multi-objective ranking system. Embodiments of the system may be used for optimizing short-video recommendation on a video sharing platform. Multiple competing ranking objectives and implicit selection bias in user feedback are the main challenges on real-world platforms. In order to address those challenges, multi-gate mixture of experts (MMoE) and soft actor critic (SAC) are integrated together into an MMoE_SAC system. Experiment results demonstrate that embodiments of the MMoE_SAC system may greatly reduce a loss function compared to systems only based on single strategies.
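
The gated mixture named in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration, not the patented implementation: `TinyMMoE`, its dimensions, the single-layer ReLU experts, and the random weights are all hypothetical, and SAC training is omitted entirely. It assumes the usual MMoE layout of shared experts with one softmax gate per task, where a task stands for one action as in the claims.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyMMoE:
    """Hypothetical toy MMoE layer: shared experts, one softmax gate per task."""

    def __init__(self, in_dim, expert_dim, n_experts, n_tasks, seed=0):
        rng = random.Random(seed)
        # Assumption: each expert is a single ReLU layer; the claims call for a DNN.
        self.experts = [random_matrix(expert_dim, in_dim, rng) for _ in range(n_experts)]
        # One gating matrix per task, producing one logit per expert.
        self.gates = [random_matrix(n_experts, in_dim, rng) for _ in range(n_tasks)]

    def gate_weights(self, task, embedding):
        """Per-expert weights for one task; they are positive and sum to 1."""
        return softmax(matvec(self.gates[task], embedding))

    def forward(self, embedding):
        """Return one mixed vector per task: the gate-weighted sum of expert outputs."""
        expert_outs = [relu(matvec(w, embedding)) for w in self.experts]
        outputs = []
        for task in range(len(self.gates)):
            weights = self.gate_weights(task, embedding)
            mixed = [
                sum(g * out[j] for g, out in zip(weights, expert_outs))
                for j in range(len(expert_outs[0]))
            ]
            outputs.append(mixed)
        return outputs


layer = TinyMMoE(in_dim=4, expert_dim=3, n_experts=2, n_tasks=2)
embedding = [0.2, -0.1, 0.4, 0.3]  # stand-in for a state/action embedding
per_task = layer.forward(embedding)
```

In a fuller system each entry of `per_task` would feed a task-specific head that produces the prediction output; that head is left out here.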

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for multi-objective ranking comprising: receiving, at a multi-gate mixture of experts (MMoE) layer comprising multiple experts and a gating network, embeddings corresponding to one or more states and one or more actions, wherein each expert is a deep neural network (DNN); generating, by each of multiple experts using soft actor critic (SAC), an expert output based on the embeddings, each expert output related to one or more prediction parameters corresponding to one or more actions; obtaining a weighted sum of the expert outputs by the multiple experts, in accordance with weights generated by the gating network for the experts, in which each expert has a corresponding weight obtained from the gating network; and generating a prediction output based on the weighted sum, wherein a training process comprises: regarding each action as a task; adding an entropy-regularized term to a policy function; and learning policy parameters by minimizing a Kullback-Leibler divergence between the policy function and a quotient obtained by dividing an exponential of a soft-Q function and a partition function.

2. The computer-implemented method of claim 1 wherein the embeddings are generated by steps of: dividing a plurality of features for the one or more states and the one or more actions into categorical features and numerical features; and defining a universal dynamic feature embedding dictionary to map or project the plurality of features into a unified embedding space for the embedding.

3. The computer-implemented method of claim 2 wherein defining a universal dynamic feature embedding dictionary to map or project the plurality of features into a unified embedding space comprising: using a one-hot or multi-hot vector for each embedding lookup for categorical features; and transforming, using a transformation weight matrix, the categorical features from sparse features to dense features.

4. The computer-implemented method of claim 1 wherein loss calculation for each of the one of more actions is independent from each other during a training process.

5. The computer-implemented method of claim 1 wherein the training process further comprises steps of: implementing a soft policy iteration by repeating soft policy evaluation and soft policy improvement alternately; and training soft-Q function parameters by minimizing a soft Bellman residual.

6. The computer-implemented method of claim 5 wherein during the soft policy iteration, a soft Q-function with a minimum Q-value among multiple soft Q-functions is taken for each policy improvement step.

7. A system for multi-objective ranking comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: converting features from one or more data sources into embeddings; receiving, at a multi-gate mixture of experts (MMoE) layer comprising multiple experts and a gating network, the embeddings corresponding to one or more states and one or more actions, wherein each expert is a neural network; generating, by each of multiple experts using soft actor critic (SAC), a prediction based on an input, each expert output related to one or more prediction parameters corresponding to one or more actions; obtaining a weighted sum of the expert outputs by the multiple experts, in accordance with weights generated by the gating network for the experts, in which each expert has a corresponding weight obtained from the gating network; and generating a prediction output based on the weighted sum, wherein a training process comprises: regarding each action as a task; adding an entropy-regularized term to a policy function; and learning policy parameters by minimizing a Kullback-Leibler divergence between the policy function and a quotient obtained by dividing an exponential of a soft-Q function and a partition function.

8. The system of claim 7 wherein converting features from one or more data sources into embeddings comprises the steps of: dividing the features into categorical features and numerical features; and defining a universal dynamic feature embedding dictionary to map or project the features into a unified embedding space for the embedding.

9. The system of claim 8 wherein defining a universal dynamic feature embedding dictionary to map or project input features into a unified embedding space comprises the steps of: using a one-hot or multi-hot vector for each embedding lookup for categorical features; and transforming, using a transformation weight matrix, the categorical features from sparse features to dense features.

10. The system of claim 7 wherein the non-transitory computer-readable medium or media further comprises one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps of training to be performed comprising: implementing a soft policy iteration by repeating soft policy evaluation and soft policy improvement alternately; and training soft-Q function parameters by minimizing a soft Bellman residual.

11. The system of claim 10 wherein during the soft policy iteration, a Q-function with a minimum Q-value among multiple Q-functions is taken for each policy improvement step.

12. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps for multi-objective ranking comprising: converting features from one or more data sources into embeddings; receiving, at a multi-gate mixture of experts (MMoE) layer comprising multiple experts and a gating network, the embeddings corresponding to one or more states and one or more actions, wherein each expert is a neural network; generating, by each of multiple experts using soft actor critic (SAC), a prediction based on an input, each expert output related to one or more prediction parameters corresponding to one or more actions; obtaining a weighted sum of the expert outputs by the multiple experts, in accordance with weights generated by the gating network for the experts, in which each expert has a corresponding weight obtained from the gating network; and generating a prediction output based on the weighted sum, wherein a training process comprises: regarding each action as a task; adding an entropy-regularized term to a policy function; and learning policy parameters by minimizing a Kullback-Leibler divergence between the policy function and a quotient obtained by dividing an exponential of a soft-Q function and a partition function.

13. The non-transitory computer-readable medium or media of claim 12 wherein converting features from one or more data sources into embeddings comprises steps of: dividing a plurality of features for one or more states and the one or more actions into categorical features and numerical features; and defining a universal dynamic feature embedding dictionary to map or project the plurality of features into a unified embedding space for the embedding.

14. The non-transitory computer-readable medium or media of claim 13 wherein defining a universal dynamic feature embedding dictionary to map or project input features into a unified embedding space comprises the steps of: using a one-hot or multi-hot vector for each embedding lookup for categorical features; and transforming, using a transformation weight matrix, the categorical features from sparse features to dense features.
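
The training clauses in claims 1, 5 and 6 read like the standard soft actor critic formulation, so it may help to see them written out. The block below is a sketch under assumed notation that does not appear in the claim text shown here: policy \pi, soft Q-function Q_\theta with parameters \theta, temperature \alpha, discount \gamma, replay buffer \mathcal{D}, and partition function Z. The claims mention no temperature; setting \alpha = 1 recovers their wording of "an exponential of a soft-Q function".

```latex
\begin{align}
% Entropy-regularized objective (claim 1: entropy-regularized term)
J(\pi) &= \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big] \\
% Minimum over several soft Q-functions (claim 6)
Q_{\min}(s_t, a_t) &= \min_{i} Q_{\theta_i}(s_t, a_t) \\
% Soft policy improvement: KL divergence to exp(Q)/Z (claim 1)
\pi_{\text{new}} &= \arg\min_{\pi'} \,
  D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\Big\|\,
  \frac{\exp\!\big( \tfrac{1}{\alpha} Q_{\min}(s_t, \cdot) \big)}{Z(s_t)} \right) \\
% Soft policy evaluation: soft Bellman residual (claim 5)
J_Q(\theta) &= \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}
  \Big[ \tfrac{1}{2} \Big( Q_\theta(s_t, a_t)
  - \big( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}[ V(s_{t+1}) ] \big) \Big)^{2} \Big] \\
% Soft state value used in the Bellman target
V(s_t) &= \mathbb{E}_{a_t \sim \pi}
  \big[ Q_\theta(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big]
\end{align}
```

Under this reading, soft policy iteration alternates between minimizing J_Q (evaluation) and the KL step (improvement). Z(s_t) normalizes the target distribution and does not depend on the policy parameters, so it drops out of the gradient.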

Assignees

Baidu Usa Llc, Baidu Com Times Tech Beijing Co Ltd

Inventors

Classifications

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Reinforcement learning · CPC title

  • G06N3/042 · Primary

    Knowledge-based neural networks; Logical representations of neural networks · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11922287B2 cover?
Described herein are embodiments of a reinforcement learning based large-scale multi-objective ranking system. Embodiments of the system may be used for optimizing short-video recommendation on a video sharing platform. Multiple competing ranking objectives and implicit selection bias in user feedback are the main challenges on real-world platforms. In order to address those challenges, multi-gat…
Who is the assignee on this patent?
Baidu Usa Llc, Baidu Com Times Tech Beijing Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/042. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 5, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).