Method and apparatus for constructing informative outcomes to guide multi-policy decision making

US11087200B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11087200-B2
Application numberUS-201815923577-A
CountryUS
Kind codeB2
Filing dateMar 16, 2018
Priority dateMar 17, 2017
Publication dateAug 10, 2021
Grant dateAug 10, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In Multi-Policy Decision-Making (MPDM), many computationally-expensive forward simulations are performed in order to predict the performance of a set of candidate policies. In risk-aware formulations of MPDM, only the worst outcomes affect the decision making process, and efficiently finding these influential outcomes becomes the core challenge. Recently, stochastic gradient optimization algorithms, using a heuristic function, were shown to be significantly superior to random sampling. In this disclosure, it was shown that accurate gradients can be computed-even through a complex forward simulation—using approaches similar to those in dep networks. The proposed approach finds influential outcomes more reliably, and is faster than earlier methods, allowing one to evaluate more policies while simultaneously eliminating the need to design an easily-differentiable heuristic function.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method for issuing a command to a controlled object in an environment, comprising: for each policy from a set of policies, wherein the policy specifies the command for the controlled object to implement and the command directly affects motion of the controlled object: receiving a state estimate for each of one or more monitored objects and the controlled object, wherein each state estimate includes state elements, and wherein the state elements are indicative of a position of the respective object and a velocity of the respective object; generating seed states for each of the one or more monitored objects and the controlled object; simulating movement of the one or more monitored objects and the controlled object using the seed states, wherein the simulation includes the controlled object executing the respective policy; assigning a cost to outcome of the simulation with the seed states; determining a probability associated with the seed states; quantifying an outcome of the simulation with the seed states based on the cost and the probability; simulating movement of the one or more monitored objects and the controlled object using perturbed inputs by: perturbing at least one state element of the seed states to determine perturbed seed states; simulating movement of the one or more monitored objects and the controlled object using the perturbed seed states; assigning a perturbed cost to outcome of the simulation with the perturbed seed states; determining a perturbed probability associated with the perturbed seed states; quantifying a perturbed outcome of the simulation with the perturbed seed states based on the perturbed cost and the perturbed probability; and repeating the simulating movement of the one or more monitored objects and the controlled objects with perturbed inputs until a predetermined condition is met, thereby generating a plurality of perturbed outcomes; determining a policy score for the respective policy, wherein the policy score correlates to the perturbed outcome having highest value amongst the plurality of perturbed outcomes for the respective policy; selecting a given policy from the set of policies, wherein the given policy has most benign outcome amongst the policies in the set of policies; and issuing the command to the controlled object in accordance with the given policy. 2. The method of claim 1 wherein the perturbing at least one state element of the seed states maximizes a product of the perturbed cost and the perturbed probability. 3. The method of claim 1 further comprises perturbing at least one state element using backpropagation. 4. The method of claim 1 wherein simulating movement of the one or more monitored objects further comprises representing trajectory of an object using a differentiable function. 5. The method of claim 4 further comprises representing trajectory of an object by recursively applying a transition function over a series of time steps, where the transition function is defined such that objects are repelled by other agents and attracted towards a goal in accordance with a social force model. 6. The method claim 4 wherein simulating movement of the one or more monitored objects includes determining perturbed seed states by iteratively computing gradient for each time step in the series of time steps with respect to the perturbed seed states. 7. The method of claim 5 wherein the given policy avoids a set of undesired outcomes, and wherein the set of undesired outcomes includes at least one of (i) a collision between the controlled object and one of the one or more monitored objects; and (ii) the controlled object being within a predetermined distance of the one or more monitored objects. 8. The method of claim 1 wherein the state elements are represented by a probability distribution. 9. The method of claim 8 wherein one of the state elements is indicative of a goal of the controlled object. 10. The method of claim 9 wherein the cost is determined using a blame metric, wherein the blame metric is a function of a distance between the controlled object and one of the one or more monitored objects and the velocity of the controlled object. 11. The method of claim 10 wherein the cost is determined using a progress toward the goal of the controlled object. 12. The method of claim 1 wherein the received state estimate for each of the one or more monitored objects is a probabilistic estimate of the one or more monitored objects based on a perception module. 13. The method of claim 1 wherein the set of policies include at least one of the following commands for the controlled object to: (i) change a trajectory to follow one of the one or more monitored objects; (ii) remain in the same position; (iii) move forward; (iv) decelerate; and (v) accelerate. 14. A controlled object in an environment comprising: a controller configured to, for each policy from a set of policies, determine a best policy, wherein the policy specifies a command for the controlled object to implement, and wherein the command indicates a direction of the controlled object; a non-transitory computer-readable medium that stores instructions that, when executed by a processor, cause the processor to: receive, from a perception module, a state estimate for each of one or more monitored objects and the controlled object, wherein each state estimate includes state elements, and wherein the state elements are indicative of a position of the respective object and a velocity of the respective object; generate, via a seed state generator, seed states for each of the one or more monitored objects and the controlled object; simulate, via a simulator, movement of the one or more monitored objects and the controlled object using the seed states, wherein the simulation includes the controlled object executing the respective policy; quantify, via an outcome quantifier, an outcome for the seed states by assigning a cost to the outcome and determining a probability associated with the seed states; simulate, via the simulator, movement of the one or more monitored objects and the controlled objects using perturbed inputs by: perturbing, via a perturbing module, at least one state element of the seed states using backpropagation to determine perturbed seed states, wherein the perturbing maximizes the product of a perturbed cost and a perturbed probability; simulating, via the simulator, movement of the one or more monitored objects and the controlled object using the perturbed seed states to calculate the perturbed cost associated with the perturbed seed states; quantifying, via the outcome quantifier, a perturbed outcome for the perturbed seed states as a perturbed product of the perturbed cost and the perturbed probability associated with the perturbed seed states; and repeating, via the controller, the simulating movement of the one or more monitored objects and the controlled objects with perturbed inputs until a predetermined condition is met, thereby generating a plurality of perturbed products; determine, via the controller, a policy score for the respective policy, wherein the policy score correlates to the perturbed outcome having highest value amongst the policies in the set of policies; select, via the controller, a given policy from the set of policies, wherein the given policy has most benign outcome amongst the policies in the set of policies; and issue, via the controller, the command to the controlled object in accordance with the given policy. 15. The controlled object of claim 14 further comprise simulating

Assignees

Inventors

Classifications

  • G06N3/008Primary

    based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Diagnosis, testing or measuring; Detecting, analysing or monitoring not otherwise provided for (error detection, error correction or monitoring in digital computers or digital computer components G06F11/00) · CPC title

  • G06N3/02Primary

    Neural networks · CPC title

  • Backpropagation, e.g. using gradient descent · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11087200B2 cover?
In Multi-Policy Decision-Making (MPDM), many computationally-expensive forward simulations are performed in order to predict the performance of a set of candidate policies. In risk-aware formulations of MPDM, only the worst outcomes affect the decision making process, and efficiently finding these influential outcomes becomes the core challenge. Recently, stochastic gradient optimization algori…
Who is the assignee on this patent?
Univ Michigan Regents
What technology area does this patent fall under?
Primary CPC classification G06N3/008. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 10 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).