Detecting events by actors using dynamically cropped images

US12573237B1 · US · B1

Patent metadata
Field	Value
Publication number	US-12573237-B1
Application number	US-202318344519-A
Country	US
Kind code	B1
Filing date	Jun 29, 2023
Priority date	Jun 29, 2023
Publication date	Mar 10, 2026
Grant date	Mar 10, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems within materials handling facilities or retail establishments are programmed to receive images from cameras, process clips of the images to generate sets of features representing product spaces and actors depicted within such images, and to classify the clips as depicting or not depicting a shopping event. The images are dynamically cropped to reduce amounts of data that must be processed in order to determine whether the images depict shopping events. The images are cropped by calculating a center point based on positions of points on product spaces and detected overlaps of hands and the product spaces. Features of clips determined to depict a shopping event are combined into a sequence and transferred, along with classifications of such clips, to a multi-camera system that generates a shopping hypothesis based on such features.
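The cropping step in the abstract can be illustrated with a minimal sketch. It assumes the center point is a score-weighted average of the product-space points, where each weight is the hand overlap score summed over the frames of a clip. The abstract does not state the actual formula, so the weighting, the fallback to a plain mean when no overlap is detected, the clamping to the image bounds, and the names `crop_center` and `crop_box` are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def crop_center(points: List[Point], scores: List[List[float]]) -> Point:
    """Score-weighted average of product-space points (assumed formula).

    points[i] is the image-plane point of product space i.
    scores[i][t] is the hand-overlap score of product space i in frame t.
    """
    if not points or len(points) != len(scores):
        raise ValueError("need one score list per product-space point")
    # Weight each product space by its total overlap over the clip.
    weights = [sum(frame_scores) for frame_scores in scores]
    total = sum(weights)
    if total <= 0:
        # No overlap anywhere: fall back to the unweighted mean (assumption).
        weights = [1.0] * len(points)
        total = float(len(points))
    cx = sum(w * p[0] for w, p in zip(weights, points)) / total
    cy = sum(w * p[1] for w, p in zip(weights, points)) / total
    return (cx, cy)


def crop_box(center: Point, window: Tuple[int, int],
             image: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a window about center,
    shifted as needed so that it stays inside the image."""
    win_w, win_h = window
    img_w, img_h = image
    if win_w > img_w or win_h > img_h:
        raise ValueError("crop window must fit inside the image")
    left = int(round(center[0] - win_w / 2))
    top = int(round(center[1] - win_h / 2))
    left = max(0, min(left, img_w - win_w))
    top = max(0, min(top, img_h - win_h))
    return (left, top, left + win_w, top + win_h)
```

In this sketch, a clip in which the hand only overlaps the second product space yields a center at that product space's point, while equal overlap with both product spaces yields the midpoint between them. The same crop box would then be applied to every image of the clip.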

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a camera comprising at least one processor and at least one memory component, wherein the camera has a field of view including at least a portion of a fixture having a plurality of product spaces within the field of view; and a computer system in communication with at least the camera, wherein the computer system has positions of at least one point corresponding to each of the plurality of product spaces stored thereon, and wherein the computer system is programmed with one or more sets of instructions that, when executed by the computer system, cause the computer system to execute a method comprising: receiving a plurality of images from the camera, wherein each of the plurality of images was captured over a period of time; determining positions of at least one hand of an actor over at least a portion of the period of time; identifying a first point of an image plane of the camera corresponding to a portion of a first product space; identifying a second point of the image plane corresponding to a portion of a second product space; calculating, for each of the plurality of images, a first score representative of an overlap between the positions of the at least one hand and the first point; calculating, for each of the plurality of images, a second score representative of an overlap between the positions of the at least one hand and the second point; selecting a crop center for a first clip of images based at least in part on the first scores calculated for the images of the first clip, the second scores calculated for the images of the first clip, the first point and the second point, wherein each of the images of the first clip is one of the plurality of images; cropping each of the first clip of images by a crop window about the crop center; providing the cropped first clip of images as inputs to a model, wherein the model comprises: a feature encoder having a convolutional neural network backbone, an action encoder, a region encoder and a hand encoder; a feature queue; and a sequence head comprising an action head, an item head, a hand head and a quantity head; receiving outputs from the model in response to the inputs; determining that the actor executed at least one of a taking event, a return event or an event that is neither the taking event nor the return event with an item associated with the product space based at least in part on the outputs received from the model in response to the inputs; and storing an indication that the actor executed the at least one of the taking event, the return event or the event that is neither the taking event nor the return event in association with the actor to an external system in communication with the camera over one or more networks.

2. The system of claim 1, wherein each of the first clip of images comprises a plurality of pixels, and wherein the method further comprises: prior to providing the cropped first clip of images as inputs to the model, stacking each of the cropped first clip of images with channels representing hands and items, wherein each of the plurality of pixels is represented by: a plurality of color channels representing a color of one of the plurality of pixels; a channel indicating whether the one of the plurality of pixels depicts a portion of a hand; and a channel indicating whether the one of the plurality of pixels depicts an item within the portion of the hand.

3. The system of claim 1, wherein an area of each of the plurality of images is approximately twelve times greater than an area of the crop window.

4. A method comprising: capturing at least a first plurality of images by a first camera having a first field of view, wherein at least a portion of a fixture comprising a first product space and a second product space is within the first field of view; determining positions of at least one hand of a first actor over at least a first period of time; selecting a first center point for cropping at least some of the first plurality of images captured over the first period of time, wherein the first center point is selected based at least in part on a first position corresponding to the first product space, a second position corresponding to the second product space and the positions of the at least one hand of the first actor over the first period of time; generating a first clip of images, wherein each of the images of the first clip is one of the first plurality of images captured over the first period of time cropped by a window about the first center point; generating a first hypothesis based at least in part on the first clip of images, wherein the first hypothesis identifies: a first event, wherein the first event is one of a taking event, a return event or neither a taking event nor a return event; the first actor; and a first item associated with one of the first product space or the second product space; and associating at least a first quantity of the first item with the first actor based at least in part on the first hypothesis.

5. The method of claim 4, wherein generating the first hypothesis comprises: generating a first set of features from the first clip of images, wherein each of the first set of features is one of an action feature, a region feature or a hand feature; determining a first classification of an event type depicted in the first clip of images; and generating the first hypothesis based at least in part on the first set of features and the first classification.

6. The method of claim 5, further comprising: determining positions of the at least one hand of the first actor over at least a second period of time; generating a second clip of images, wherein each of the images of the second clip is one of the first plurality of images captured over the second period of time cropped by the window about the first center point; generating a second set of features from the second clip of images, wherein each of the second set of features is one of an action feature, a region feature or a hand feature; determining a second classification of an event type depicted in the second clip of images; determining that the first classification is consistent with the second classification; and in response to determining that the first classification is consistent with the second classification, generating a sequence of features comprising the first set of features and the second set of features, wherein the first hypothesis is generated based at least in part on the sequence of features, the first classification and the second classification.

7. The method of claim 6, wherein determining the first classification of at least the first clip of images comprises: determining a first plurality of scores based at least in part on the first clip of images, wherein each of the first plurality of scores represents one of a probability that the first clip of images depicts a taking event, a probability that the first clip of images depicts a return event, or a probability that the first clip of images does not depict a taking event or a return event; and determining the first classification based at least in part on a greatest one of the first plurality of scores, wherein determining the second classification of at least the second clip of images comprises: determining a second plurality of scores based at least in part on the second clip of images, wherein each of the second plurality of scores represents one of a probability that the second clip of images depicts a taking event, a probability that the second clip of images depicts a return event, or a probability that the second clip of images does not depict a taking event or a return event; and determining the secon…
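Claims 6 and 7 in the preview above describe classifying each clip by the greatest of three probabilities and building a feature sequence only when two clips' classifications are consistent. The following is a minimal sketch of that logic, not the patented implementation. It assumes "consistent" means the two labels are equal, which the claims do not define; the label strings and the function names `classify_clip` and `build_sequence` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical label names for the three event types named in the claims.
LABELS = ("take", "return", "neither")


def classify_clip(scores: Dict[str, float]) -> str:
    """Return the event type with the greatest score.

    Ties resolve to the label listed first in LABELS (an arbitrary choice).
    """
    missing = [label for label in LABELS if label not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return max(LABELS, key=lambda label: scores[label])


def build_sequence(
    first_features: Sequence[float],
    first_scores: Dict[str, float],
    second_features: Sequence[float],
    second_scores: Dict[str, float],
) -> Optional[List[List[float]]]:
    """Combine two clips' feature sets into a sequence when their
    classifications agree; return None otherwise.

    Equality of labels is an assumed reading of "consistent".
    """
    if classify_clip(first_scores) != classify_clip(second_scores):
        return None
    return [list(first_features), list(second_features)]
```

For example, two consecutive clips that both score highest for a taking event would produce a two-element sequence, while a taking clip followed by a return clip would produce `None` under this reading.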

Assignees

Amazon Tech Inc

Classifications

G06V10/764
using classification, e.g. of video objects
G06V20/52
Surveillance or monitoring of activities, e.g. for recognising suspicious objects (recognising microscopic objects G06V20/69)
G06V10/25
Determination of region of interest [ROI] or a volume of interest [VOI]
G06V10/82
using neural networks
G06T2207/20132
Image cropping

Patent family

Related publications grouped by family.

View patent family 99012652

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12573237B1 cover?: Systems within materials handling facilities or retail establishments are programmed to receive images from cameras, process clips of the images to generate sets of features representing product spaces and actors depicted within such images, and to classify the clips as depicting or not depicting a shopping event. The images are dynamically cropped to reduce amounts of data that must be process…
Who is the assignee on this patent?: Amazon Tech Inc
What technology area does this patent fall under?: Primary CPC classification G06V40/28. Mapped technology areas include Physics.
When was this patent published?: Publication date Mar 10, 2026 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).