User-agent anomaly detection using sentence embedding

US11907658B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11907658-B2
Application number  US-202117308931-A
Country             US
Kind code           B2
Filing date         May 5, 2021
Priority date       May 5, 2021
Publication date    Feb 20, 2024
Grant date          Feb 20, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for user-agent anomaly detection are disclosed. In one embodiment, a user-agent string may be embedded into a numerical data vector representation using a sentence embedding algorithm (e.g., FastText). A predictive score may be calculated based on the numerical data vector representation and using a probability distribution function model that models a likelihood of occurrence of the observed user-agent based on patterns learned from historic payload data (e.g., a Gaussian Mixture Model). The predictive score may be compared to a threshold and, based on the comparison, it may be determined whether the user-agent is fraudulent.
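
The scoring and threshold steps summarized above can be sketched as follows. This is an illustration only, not the patented implementation: it assumes the predictive score is the log-likelihood of the embedded vector under a Gaussian mixture with diagonal covariances, and that a score below the threshold marks the user-agent as anomalous. The function names, the toy two-dimensional model, and the threshold value are all hypothetical; in practice the mixture parameters would be fitted to vectors derived from historic payload data.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density at x of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def is_anomalous(x, model, threshold):
    """Assumed decision rule: flag x when its log-likelihood is below threshold."""
    return gmm_log_likelihood(x, *model) < threshold


# Hypothetical two-component model over 2-D vectors (toy values, not learned).
MODEL = (
    [0.6, 0.4],                # mixture weights, summing to 1
    [[0.0, 0.0], [3.0, 3.0]],  # component means
    [[1.0, 1.0], [0.5, 0.5]],  # per-dimension variances
)
THRESHOLD = -10.0              # hypothetical cut-off on the log-likelihood

typical = [0.1, -0.2]          # close to the first component
outlier = [10.0, -10.0]        # far from both components

print(is_anomalous(typical, MODEL, THRESHOLD))
print(is_anomalous(outlier, MODEL, THRESHOLD))
```

A vector that sits near one of the learned subpopulations receives a high likelihood and passes, while a vector far from every component falls below the cut-off. Where the threshold is placed trades false positives against missed anomalies, and the abstract does not specify a value.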

First claim

Opening claim text (preview).

What is claimed is:

1. A computer system comprising: a non-transitory memory storing instructions; and one or more hardware processors configured to read the instructions and cause the computer system to perform operations comprising: receiving, from a user-agent computer application of a device, a request to access a resource of the computer system; determining, based on the request, a character string that identifies the user-agent computer application; generating, from the character string, a plurality of character n-grams based on a plurality of word sizes, wherein the plurality of character n-grams comprises at least a first set of character n-grams corresponding to a first word size and a second set of character n-grams corresponding to a second word size; determining a plurality of hash values based on performing one or more hash functions on the plurality of character n-grams; embedding, using a sentence embedding algorithm, the character string into a numerical data vector representation of the user-agent client application based on the plurality of hash values, wherein the embedding comprises transforming each of the plurality of hash values into a numerical value within the numerical data vector representation of the user-agent computer application; calculating, for the user-agent client application, a predictive score based on the numerical data vector representation, wherein the predictive score indicates whether the character string that identifies the user-agent computer application corresponds to an anomaly based on a probability distribution function that models patterns learned from historic data associated with a plurality of user-agent computer applications; comparing the predictive score to a threshold; and determining, based on the comparing, whether the user-agent computer application corresponds to a fraudulent application.

2. The computer system of claim 1, wherein the probability distribution function comprises a Gaussian Mixture Model, having a weighted sum of M-component Gaussian densities, generated based on the patterns learned from the historic data associated with the plurality of user-agent computer applications, and wherein the M-component Gaussian densities correspond to normal distributions of subpopulations of the plurality of user-agent computer applications.

3. The computer system of claim 1, wherein the operations further comprise: aggregating the historic data associated with the plurality of user-agent computer applications; based on the historic data, extracting a plurality of character strings for the plurality of user-agent computer applications; embedding the plurality of character strings into a plurality of respective numerical data vector representations for the plurality of user-agent computer applications; and generating the probability distribution function based on the plurality of respective numerical data vector representations.

4. The computer system of claim 1, wherein the character string represents two or more of an operating system type of an operating system of the device, an application model associated with the user-agent computer application, a device type associated with the device, or a device manufacturer associated with the device.

5. The computer system of claim 1, wherein the numerical data vector representation has a dimensionality that corresponds to a parameter of the probability distribution function.

6. The computer system of claim 1, wherein the request is a Hypertext Transfer Protocol (HTTP) request, and wherein the character string is extracted from the HTTP request.

7. The computer system of claim 1, wherein the operations further comprise denying the user-agent computer application from accessing the resource in response to determining that the user-agent computer application corresponds to the fraudulent application.

8. A method comprising: receiving, by a computer system, a request from a user-agent application of a device to access at least one resource associated with a service provider system; extracting, from the request, an identifier of the user-agent application, wherein the identifier comprises a character string; generating, from the character string, a plurality of character n-grams based on a plurality of word sizes, wherein the plurality of character n-grams comprises a first set of character n-grams corresponding to a first word size and a second set of character n-grams corresponding to a second word size; determining a plurality of hash values based on performing one or more hash functions on the plurality of character n-grams; converting, by the computer system, the character string into a numerical data vector representation of the user-agent application based on the plurality of hash values, wherein the converting comprises transforming each of the plurality of hash values into a numerical value within the numerical data vector representation of the user-agent application; calculating, by the computer system and for the user-agent application, a predictive score based on the numerical data vector representation, wherein the predictive score indicates whether the identifier of the user-agent application corresponds to an anomaly based on a probability distribution function that models patterns learned from historic data associated with a plurality of user-agent applications that have requested access to the at least one resource associated with the service provider system; comparing, by the computer system, the predictive score to a threshold; and based on the comparing, classifying, by the computer system, the user-agent application as non-fraudulent or fraudulent.

9. The method of claim 8, wherein the probability distribution function comprises a Gaussian Mixture Model, having a weighted sum of M-component Gaussian densities, generated based on the patterns learned from the historic data associated with the plurality of user-agent applications that have requested access to the at least one resource associated with the service provider system, and wherein the M-component Gaussian densities correspond to normal distributions of subpopulations of the plurality of user-agent applications.

10. The method of claim 8, further comprising: aggregating, by the computer system, the historic data associated with the plurality of user-agent applications that have requested access to the at least one resource associated with the computer system; based on the historic data, extracting, by the computer system, a plurality of character strings for the plurality of user-agent applications; converting, by the computer system, the plurality of character strings into a plurality of respective numerical data vector representations of the plurality of user-agent applications; and generating, by the computer system, the probability distribution function based on the plurality of respective numerical data vector representations.

11. The method of claim 10, wherein the converting the plurality of character strings into the plurality of respective numerical data vector representations is performed using a FastText algorithm.

12. The method of claim 8, wherein the numerical data vector representation has at least 300 dimensions.

13. The method of claim 8, further comprising: classifying the user-agent application as fraudulent based on the comparing; and storing the character string in a blacklist database that prevents the user-agent application from accessing the at least one resource.

14. The method of claim 13, further comprising blocking an Internet Protocol address associated with the device.

15. A
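
The n-gram, hashing, and embedding steps recited in claims 1 and 8 can be sketched roughly as follows. This is a hedged illustration, not the claimed implementation or the FastText library: the function names, the choice of CRC32 as the hash function, the bucket count, the 8-dimensional output, and the averaging step are all assumptions. In particular, a trained model would look up learned vectors for each hash bucket, whereas `bucket_value` below only derives a deterministic pseudo-random stand-in.

```python
import zlib


def char_ngrams(text, sizes=(3, 4)):
    """Return the character n-grams of every requested size, size by size."""
    grams = []
    for n in sizes:
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def hash_ngrams(grams, buckets=1 << 20):
    """Map each n-gram to a bucket index using CRC32 (an assumed hash choice)."""
    return [zlib.crc32(g.encode("utf-8")) % buckets for g in grams]


def bucket_value(bucket, dim_index):
    """Stand-in for a learned weight: a deterministic value in [-1, 1)."""
    raw = zlib.crc32(f"{bucket}:{dim_index}".encode("utf-8"))
    return raw / 2**31 - 1.0


def embed(text, dim=8, sizes=(3, 4)):
    """Average the per-bucket vectors into one fixed-length vector."""
    hashes = hash_ngrams(char_ngrams(text, sizes))
    if not hashes:
        # Strings shorter than every n-gram size yield no n-grams.
        return [0.0] * dim
    return [
        sum(bucket_value(h, d) for h in hashes) / len(hashes)
        for d in range(dim)
    ]


# Hypothetical user-agent string, used only to exercise the sketch.
vector = embed("Mozilla/5.0 (X11; Linux x86_64)")
print(len(vector))
```

Using two n-gram sizes mirrors the claimed "first word size" and "second word size". Hashing keeps the vocabulary bounded, so a previously unseen user-agent string still maps to a vector of the same length as every other string.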

Assignees

Paypal Inc

Inventors

Classifications

  • G06F40/284 (Primary)

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • using statistics or function optimisation, e.g. modelling of probability density functions · CPC title

  • based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate · CPC title

  • Semantic analysis · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11907658B2 cover?
Systems and methods for user-agent anomaly detection are disclosed. In one embodiment, a user-agent string may be embedded into a numerical data vector representation using a sentence embedding algorithm (e.g., FastText). A predictive score may be calculated based on the numerical data vector representation and using a probability distribution function model that models a likelihood of occurren…
Who is the assignee on this patent?
Paypal Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 20, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).