What is claimed is:
1. A computer system comprising:
a non-transitory memory storing instructions; and
one or more hardware processors configured to read the instructions and cause the computer system to perform operations comprising:
receiving, from a user-agent computer application of a device, a request to access a resource of the computer system;
determining, based on the request, a character string that identifies the user-agent computer application;
generating, from the character string, a plurality of character n-grams based on a plurality of word sizes, wherein the plurality of character n-grams comprises at least a first set of character n-grams corresponding to a first word size and a second set of character n-grams corresponding to a second word size;
determining a plurality of hash values based on performing one or more hash functions on the plurality of character n-grams;
embedding, using a sentence embedding algorithm, the character string into a numerical data vector representation of the user-agent computer application based on the plurality of hash values, wherein the embedding comprises transforming each of the plurality of hash values into a numerical value within the numerical data vector representation of the user-agent computer application;
calculating, for the user-agent computer application, a predictive score based on the numerical data vector representation, wherein the predictive score indicates whether the character string that identifies the user-agent computer application corresponds to an anomaly based on a probability distribution function that models patterns learned from historic data associated with a plurality of user-agent computer applications;
comparing the predictive score to a threshold; and
determining, based on the comparing, whether the user-agent computer application corresponds to a fraudulent application.
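As a non-limiting illustration of the n-gram generation, hashing, and embedding operations recited in claim 1, a minimal Python sketch follows. It is a simplified feature-hashing stand-in, not the sentence embedding algorithm of the claims. The function names, the word sizes (3 and 4), the MD5-based hash, and the 64-bucket vector length are all assumptions chosen for the example.

```python
import hashlib
import math


def char_ngrams(text, sizes=(3, 4)):
    """Return the character n-grams of `text` for every word size in `sizes`."""
    grams = []
    for n in sizes:
        # range() is empty when the string is shorter than n.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def hash_ngram(gram, buckets):
    """Map an n-gram to a bucket index with a deterministic hash (assumed: MD5)."""
    digest = hashlib.md5(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def embed(text, sizes=(3, 4), buckets=64):
    """Turn a user-agent string into a fixed-length, L2-normalized vector."""
    vec = [0.0] * buckets
    for gram in char_ngrams(text, sizes):
        # Each hash value selects the vector position that is incremented.
        vec[hash_ngram(gram, buckets)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

In this sketch, two user-agent strings that share many character n-grams produce vectors with many of the same non-zero positions, which is the property a downstream density model would rely on.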
2. The computer system of claim 1, wherein the probability distribution function comprises a Gaussian Mixture Model, having a weighted sum of M-component Gaussian densities, generated based on the patterns learned from the historic data associated with the plurality of user-agent computer applications, and wherein the M-component Gaussian densities correspond to normal distributions of subpopulations of the plurality of user-agent computer applications.
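As a non-limiting illustration of scoring against a weighted sum of M Gaussian densities, a minimal Python sketch follows. It assumes diagonal covariances, already-fitted parameters, and that a low log-density indicates an anomaly. None of these choices is stated in the claims, and all function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_density(x, weights, means, variances):
    """Log of the weighted sum of M component densities, via log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # subtract the largest term for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def is_anomalous(x, weights, means, variances, threshold):
    """Flag the vector when its log-density falls below the threshold (assumed direction)."""
    return gmm_log_density(x, weights, means, variances) < threshold
```

With two equally weighted components centered at (0, 0) and (5, 5), a point at (0, 0) scores far higher than a point at (10, 10), so only the latter would fall below a threshold such as -10.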
3. The computer system of claim 1, wherein the operations further comprise:
aggregating the historic data associated with the plurality of user-agent computer applications;
based on the historic data, extracting a plurality of character strings for the plurality of user-agent computer applications;
embedding the plurality of character strings into a plurality of respective numerical data vector representations for the plurality of user-agent computer applications; and
generating the probability distribution function based on the plurality of respective numerical data vector representations.
4. The computer system of claim 1, wherein the character string represents two or more of an operating system type of an operating system of the device, an application model associated with the user-agent computer application, a device type associated with the device, or a device manufacturer associated with the device.
5. The computer system of claim 1, wherein the numerical data vector representation has a dimensionality that corresponds to a parameter of the probability distribution function.
6. The computer system of claim 1, wherein the request is a Hypertext Transfer Protocol (HTTP) request, and wherein the character string is extracted from the HTTP request.
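As a non-limiting illustration of extracting the character string from an HTTP request, a minimal Python sketch follows. It assumes the character string is carried in the User-Agent header of a raw HTTP/1.1 request, which the claims do not state, and the function name is hypothetical.

```python
from email.parser import Parser


def extract_user_agent(raw_request):
    """Return the User-Agent header value of a raw HTTP/1.1 request, or None."""
    # Everything before the first blank line is the request line plus headers.
    head, _, _ = raw_request.partition("\r\n\r\n")
    # Drop the request line (e.g. "GET /account HTTP/1.1").
    _, _, header_block = head.partition("\r\n")
    # HTTP headers share the RFC 822 "Name: value" layout, so the stdlib
    # email parser can read them; header lookup is case-insensitive.
    headers = Parser().parsestr(header_block, headersonly=True)
    return headers.get("User-Agent")
```

A request with no User-Agent header yields `None`, which a caller could treat as its own signal before any embedding step.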
7. The computer system of claim 1, wherein the operations further comprise denying the user-agent computer application access to the resource in response to determining that the user-agent computer application corresponds to the fraudulent application.
8. A method comprising:
receiving, by a computer system, a request from a user-agent application of a device to access at least one resource associated with a service provider system;
extracting, from the request, an identifier of the user-agent application, wherein the identifier comprises a character string;
generating, from the character string, a plurality of character n-grams based on a plurality of word sizes, wherein the plurality of character n-grams comprises a first set of character n-grams corresponding to a first word size and a second set of character n-grams corresponding to a second word size;
determining a plurality of hash values based on performing one or more hash functions on the plurality of character n-grams;
converting, by the computer system, the character string into a numerical data vector representation of the user-agent application based on the plurality of hash values, wherein the converting comprises transforming each of the plurality of hash values into a numerical value within the numerical data vector representation of the user-agent application;
calculating, by the computer system and for the user-agent application, a predictive score based on the numerical data vector representation, wherein the predictive score indicates whether the identifier of the user-agent application corresponds to an anomaly based on a probability distribution function that models patterns learned from historic data associated with a plurality of user-agent applications that have requested access to the at least one resource associated with the service provider system;
comparing, by the computer system, the predictive score to a threshold; and
based on the comparing, classifying, by the computer system, the user-agent application as non-fraudulent or fraudulent.
9. The method of claim 8, wherein the probability distribution function comprises a Gaussian Mixture Model, having a weighted sum of M-component Gaussian densities, generated based on the patterns learned from the historic data associated with the plurality of user-agent applications that have requested access to the at least one resource associated with the service provider system, and wherein the M-component Gaussian densities correspond to normal distributions of subpopulations of the plurality of user-agent applications.
10. The method of claim 8, further comprising:
aggregating, by the computer system, the historic data associated with the plurality of user-agent applications that have requested access to the at least one resource associated with the service provider system;
based on the historic data, extracting, by the computer system, a plurality of character strings for the plurality of user-agent applications;
converting, by the computer system, the plurality of character strings into a plurality of respective numerical data vector representations of the plurality of user-agent applications; and
generating, by the computer system, the probability distribution function based on the plurality of respective numerical data vector representations.
11. The method of claim 10, wherein the converting the plurality of character strings into the plurality of respective numerical data vector representations is performed using a FastText algorithm.
12. The method of claim 8, wherein the numerical data vector representation has at least 300 dimensions.
13. The method of claim 8, further comprising:
classifying the user-agent application as fraudulent based on the comparing; and
storing the character string in a blacklist database that prevents the user-agent application from accessing the at least one resource.
14. The method of claim 13, further comprising blocking an Internet Protocol address associated with the device.
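As a non-limiting illustration of the blacklist storage and Internet Protocol address blocking of claims 13 and 14, a minimal Python sketch follows. In-memory sets stand in for the blacklist database, and the class and method names are hypothetical.

```python
class AccessControl:
    """Tracks blocked user-agent strings and IP addresses (in-memory stand-in)."""

    def __init__(self):
        self.blocked_user_agents = set()
        self.blocked_ips = set()

    def record_fraud(self, user_agent, ip_address):
        """Store the character string and block the device's IP address."""
        self.blocked_user_agents.add(user_agent)
        self.blocked_ips.add(ip_address)

    def is_allowed(self, user_agent, ip_address):
        """Permit access only when neither the string nor the IP is blocked."""
        return (
            user_agent not in self.blocked_user_agents
            and ip_address not in self.blocked_ips
        )
```

Once a user-agent application has been classified as fraudulent, a later request from the same string or the same address is refused without rescoring.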
15. A