Robust reinforcement learning in personalized content prediction

US11645580B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11645580-B2
Application number: US-202016748313-A
Country: US
Kind code: B2
Filing date: Jan 21, 2020
Priority date: Jan 21, 2020
Publication date: May 9, 2023
Grant date: May 9, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for content selection and presentation is disclosed. A system receives a plurality of content elements configured for presentation in at least one content container and selects one of the plurality of content elements for presentation in the at least one content container. The one of the plurality of content elements is selected by a trained selection model configured to use Thompson sampling. An interface including the selected one of the plurality of content elements is generated.
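
As a rough illustration of the selection step described in the abstract, the sketch below applies Thompson sampling to a set of candidate content elements. It assumes a Beta-Bernoulli click model, and the names `select_content`, `record_feedback`, and the `(clicks, impressions)` bookkeeping are hypothetical; the patent does not prescribe this particular reward model.

```python
import random


def select_content(stats, rng=random):
    """Pick one content element by Thompson sampling.

    stats maps element id -> (clicks, impressions). Each element gets a
    Beta(1 + clicks, 1 + misses) posterior over its click rate. This
    Beta-Bernoulli model is an assumption made for illustration only.
    """
    best_id, best_draw = None, -1.0
    for element_id, (clicks, impressions) in stats.items():
        # Draw one plausible click rate from this element's posterior.
        draw = rng.betavariate(1 + clicks, 1 + impressions - clicks)
        if draw > best_draw:
            best_id, best_draw = element_id, draw
    return best_id


def record_feedback(stats, element_id, clicked):
    """Fold one observed impression back into the element's counts."""
    clicks, impressions = stats[element_id]
    stats[element_id] = (clicks + int(clicked), impressions + 1)
```

Because each call draws from the posterior rather than taking its mean, elements with few impressions still get shown occasionally, which is how the method balances exploration against exploitation.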

First claim

Opening claim text (preview).

What is claimed is:

1. A system for generating item recommendations, comprising: a memory having instructions stored thereon, and a processor that reads the instructions to: receive a plurality of content elements for presentation in at least one content container; select one of the plurality of content elements for presentation in the at least one content container, wherein the one of the plurality of content elements is selected by a trained selection model using Thompson sampling, wherein the trained selection model calculates one or more posterior distribution parameters of a total reward value Q, and wherein the Thompson sampling is applied to the one or more posterior distribution parameters calculated using a short-term reward value, r, and a long term reward value, R, wherein the short-term reward value, r, is an expected reward from a user in a current session and the long-term reward value, R, is an expected discounted rewards from the user in future sessions; generate an interface including the selected one of the plurality of content elements; provide the interface for display; monitor a state and an action taken through the interface within a time period T; determine a posterior distribution parameter of the state and action within the time period T; update long-term reward information based on the posterior distribution parameter of the state and action; and generate an updated selection model based on the updated long-term reward information.

2. The system of claim 1, wherein the plurality of content elements are selected based on a received persona.

3. The system of claim 1, wherein the trained selection model is trained using a plurality of prior impressions.

4. The system of claim 1, wherein the trained selection model implements a state-action-reward-state-action (SARSA) process modified to use Thompson sampling.

5. The system of claim 1, wherein the Thompson sampling is implemented by a machine learning model trained using reinforcement learning.

6. The system of claim 5, wherein the machine learning model is a neural network.

7. A non-transitory computer readable medium having instructions stored thereon, wherein the instructions, when executed by a processor cause a device to perform operations comprising: receiving a request for an interface, wherein the request includes a user persona; selecting at least one of a plurality of content elements for inclusion in the interface, wherein the at least one of the plurality of content elements is selected using Thompson sampling, wherein a trained selection model calculates one or more posterior distribution parameters of a total reward value Q, and wherein the Thompson sampling is applied to the one or more posterior distribution parameters calculated using a short-term reward value, r, and a long term reward value, R, wherein the short-term reward value, r, is an expected reward from a user in a current session and the long-term reward value, R, is an expected discounted rewards from the user in future sessions; generating an interface including the selected at least one of the plurality of content elements; providing the interface for display; monitoring a state and an action taken through the interface within a time period T; determining a posterior distribution parameter of the state and action within the time period T; updating long-term reward information based on the posterior distribution parameter; and generating an updated selection model based on the updated long-term reward information.

8. The non-transitory computer readable medium of claim 7, wherein the Thompson sampling is implemented by a machine learning model trained using reinforcement learning.

9. The non-transitory computer readable medium of claim 8, wherein the machine learning model is trained using a plurality of prior impressions.

10. The non-transitory computer readable medium of claim 8, wherein the machine learning model implements a state-action-reward-state-action (SARSA) process modified to use Thompson sampling.

11. The non-transitory computer readable medium of claim 8, wherein the machine learning model is a neural network.

12. A computer-implemented method, comprising: receiving a plurality of content elements configured for presentation in at least one content container; selecting one of the plurality of content elements for presentation in the at least one content container, wherein the one of the plurality of content elements is selected by a trained selection model configured to use Thompson sampling, wherein the trained selection model calculates one or more posterior distribution parameters of a total reward value Q, and wherein the Thompson sampling is applied to the one or more posterior distribution parameters calculated using a short-term reward value, r, and a long term reward value, R, wherein the short-term reward value, r, is an expected reward from a user in a current session and the long-term reward value, R, is an expected discounted rewards from the user in future sessions; generating an interface including the selected one of the plurality of content elements; providing the interface for display; monitoring a state and an action taken through the interface within a time period T; determining a posterior distribution parameter of the state and action within the time period T; updating long-term reward information based on the posterior distribution parameter; and generating an updated selection model based on the updated long-term reward information.

13. The computer-implemented method of claim 12, wherein the plurality of content elements are selected based on a received persona.

14. The computer-implemented method of claim 12, wherein the trained selection model is trained using a plurality of prior impressions.

15. The computer-implemented method of claim 12, wherein the trained selection model is configured to implement a state-action-reward-state-action (SARSA) process modified to use Thompson sampling.

16. The computer-implemented method of claim 12, wherein the Thompson sampling is implemented by a machine learning model trained using reinforcement learning.

17. The computer-implemented method of claim 16, wherein the machine learning model is a neural network.
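
The claims combine a posterior over a total reward value Q (built from a short-term reward r and a long-term reward R) with a SARSA-style update and Thompson sampling. The sketch below is one possible reading, not the patented implementation. It assumes a Gaussian posterior per (state, action) pair, a discounted target of the form Q = r + gamma * R, and a known observation-noise variance. The class name `ThompsonSarsa` and the parameters `gamma`, `prior_var`, and `noise_var` are all hypothetical.

```python
import math
import random
from collections import defaultdict


class ThompsonSarsa:
    """Tabular SARSA whose action choice is a Thompson sample of Q.

    Assumptions (not from the patent): Gaussian posterior per
    (state, action), known noise variance, target Q = r + gamma * R.
    """

    def __init__(self, actions, gamma=0.9, prior_var=1.0, noise_var=1.0, seed=None):
        self.actions = list(actions)
        self.gamma = gamma
        self.noise_var = noise_var
        self.rng = random.Random(seed)
        # Posterior parameters (mean, variance) of Q for each (state, action).
        self.posterior = defaultdict(lambda: (0.0, prior_var))

    def select(self, state):
        """Sample Q for every action and play the largest sample."""
        def draw(action):
            mean, var = self.posterior[(state, action)]
            return self.rng.gauss(mean, math.sqrt(var))
        return max(self.actions, key=draw)

    def update(self, state, action, r, next_state, next_action):
        """SARSA step: refresh the posterior of Q(state, action).

        r is the short-term reward observed in the current session; the
        long-term reward R is read from the posterior mean of the next
        (state, action) pair.
        """
        long_term, _ = self.posterior[(next_state, next_action)]
        target = r + self.gamma * long_term
        mean, var = self.posterior[(state, action)]
        # Conjugate normal update with known noise variance.
        new_var = 1.0 / (1.0 / var + 1.0 / self.noise_var)
        new_mean = new_var * (mean / var + target / self.noise_var)
        self.posterior[(state, action)] = (new_mean, new_var)
```

In a loop mirroring the claimed steps, a caller would `select` an element for the current state, display it, observe the state and action over the monitoring window, and then call `update` so the next selection uses the refreshed posterior.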

Assignees

Walmart Apollo, LLC

Inventors

Classifications

  • Reinforcement learning · CPC title

  • Learning methods · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • G06N3/006 · Primary

    based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11645580B2 cover?
A system and method for content selection and presentation is disclosed. A system receives a plurality of content elements configured for presentation in at least one content container and selects one of the plurality of content elements for presentation in the at least one content container. The one of the plurality of content elements is selected by a trained selection model configured to use…
Who is the assignee on this patent?
Walmart Apollo, LLC
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 9, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).