Fraud detection using Machine Learning

Almost everyone has received a phone call or a text message from their financial institution after attempting a financial transaction that is out of the ordinary. Most of the time no bad actor is involved and the transaction was attempted by you, so other than a minor annoyance and a delay in completing your transaction, all goes well.

Ever wondered what goes on behind the scenes? Typically it is a rule-based way of working, similar to a risk underwriter who takes a whole lot of variables into consideration. But this rule-based approach has a set of drawbacks:

Rigid definition
Once a rule has been defined, it stays the same unless a human periodically reviews it.

Maintainability and relevance
The very fact that we need humans to keep the rules up to date makes the system maintenance-intensive and susceptible to being overrun by bad actors, who can adapt faster than the rules are revised.

More false than true alerts
From practical experience you must have realised that, for the one alert about real fraud you may have received (or none at all), all the other alerts were for genuine transactions which you were attempting yourself.
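To make these drawbacks concrete, here is a minimal sketch of what a rule-based check might look like. The `Transaction` fields, the `MAX_AMOUNT` threshold and the two rules are all hypothetical, chosen only for illustration; real systems use many more rules.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float       # transaction amount in the account currency
    country: str        # country the transaction originated from
    home_country: str   # country the customer usually transacts in


# Hypothetical fixed threshold: it only changes when a human edits it.
MAX_AMOUNT = 1000.0


def rule_based_alert(tx: Transaction) -> bool:
    """Return True when any hard-coded rule fires."""
    if tx.amount > MAX_AMOUNT:
        return True
    if tx.country != tx.home_country:
        return True
    return False


# A legitimate customer buying a coffee while travelling still triggers an alert.
holiday_purchase = Transaction(amount=4.50, country="FR", home_country="US")
print(rule_based_alert(holiday_purchase))
```

Every rule here is static, and a fraudster who learns the threshold can simply stay below it, which is the rigidity and maintenance problem described above.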

Machine learning could be a good candidate for this use case.

Why machine learning?

We can take every shortcoming of the rule-based approach and turn it on its head.

Fluid definition
Machine learning models learn patterns from data rather than from hand-written rules, and they can be retrained to adapt quickly as those patterns change. The huge amounts of transaction data these models have at their disposal definitely work in their favor.

Easier to maintain
Because humans no longer have to write and update every rule by hand, the entire system is easier to maintain, although the model itself still needs to be monitored and retrained.

More true alerts
Because a machine learning model has more training data and better statistical algorithms to draw on, it has a leg up on hand-written rules, and the result is typically that a larger share of the alerts it raises are true ones.
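One way to put a number on "more true alerts" is to measure precision (the share of raised alerts that were real fraud) and recall (the share of real fraud that was caught). The sketch below uses made-up alert and label lists; the function names are my own and not from any particular library.

```python
def alert_precision(alerts, is_fraud):
    """Fraction of raised alerts that turned out to be real fraud."""
    raised = [fraud for alerted, fraud in zip(alerts, is_fraud) if alerted]
    if not raised:
        return 0.0
    return sum(raised) / len(raised)


def alert_recall(alerts, is_fraud):
    """Fraction of real fraud cases for which an alert was raised."""
    total_fraud = sum(is_fraud)
    if total_fraud == 0:
        return 0.0
    caught = sum(1 for alerted, fraud in zip(alerts, is_fraud) if alerted and fraud)
    return caught / total_fraud


# Made-up example: four alerts raised, only one of them was real fraud.
alerts = [True, True, True, True, False, False]
is_fraud = [True, False, False, False, False, False]
print(alert_precision(alerts, is_fraud), alert_recall(alerts, is_fraud))
```

A system with many false alerts has low precision even when its recall is high, which matches the experience of being called about your own transactions.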

Well then, how would we apply machine learning in this case?

Well, it turns out, no differently from any conventional ML project:

Gather data
Collect a good amount of appropriately labeled data, meaning a set of financial transactions labeled as legitimate and a set of transactions labeled as fraudulent.

Feature Engineering
Next is the feature engineering step, where we decide which features are of real importance. We can group them into two kinds: features pertaining to properties of the transaction, such as identity and location, and features pertaining to customer behavior, such as frequency and type of orders. To define these a little better:

  • Identity: any variable which uniquely identifies a consumer, say a cell phone number
  • Orders: nature and type of orders in terms of kind and quantity of products bought
  • Payment methods: whether it is credit, debit, PayPal, Venmo or another digital wallet
  • Locations: usually derived from the IP address of the machine, which can be masked through a VPN connection, so care must be taken here
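As a rough illustration of this step, the sketch below turns one transaction plus the same customer's earlier transactions into a numeric feature vector. The dictionary keys (`amount`, `payment_method`, `ip_country`), the list of payment methods and the specific features are all assumptions made for the example. The history is assumed to be already grouped by an identity key such as the phone number.

```python
import math

# Hypothetical set of payment methods used for one-hot encoding.
PAYMENT_METHODS = ["credit", "debit", "paypal", "venmo", "wallet"]


def extract_features(tx, history):
    """Build a feature vector for `tx` given the customer's past transactions."""
    past_amounts = [h["amount"] for h in history]
    average = sum(past_amounts) / len(past_amounts) if past_amounts else 0.0
    known_countries = {h["ip_country"] for h in history}

    features = [
        math.log1p(tx["amount"]),                         # order size, log-scaled
        tx["amount"] / average if average > 0 else 1.0,   # size relative to usual orders
        float(len(history)),                              # how many past orders we know of
        0.0 if tx["ip_country"] in known_countries else 1.0,  # 1.0 = location not seen before
    ]
    # One-hot encode the payment method.
    features += [1.0 if tx["payment_method"] == m else 0.0 for m in PAYMENT_METHODS]
    return features


history = [
    {"amount": 50.0, "payment_method": "credit", "ip_country": "US"},
    {"amount": 150.0, "payment_method": "credit", "ip_country": "US"},
]
new_tx = {"amount": 200.0, "payment_method": "paypal", "ip_country": "FR"}
print(extract_features(new_tx, history))
```

Since an IP-derived location can be masked by a VPN, the "location not seen before" flag is best treated as one weak signal among several rather than as proof of anything.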

Train algorithm
The next logical step is training an algorithm, that is, finding the mathematical function that minimizes a cost (the error between predicted and actual labels) on the training data.
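To show what "minimizing a cost" means in practice, here is a minimal sketch that trains a logistic regression classifier by gradient descent on the log loss. Logistic regression is my choice for illustration, not necessarily what a production fraud system would use, and the toy data, learning rate and epoch count are made up.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x):
    """Predicted probability that feature vector x is fraudulent."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def log_loss(weights, bias, X, y):
    """Average cross-entropy cost over the dataset."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(weights, bias, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def train(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the log loss."""
    n = len(X[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            err = predict(weights, bias, xi) - yi
            for j in range(n):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Toy features: [amount relative to usual, new-location flag]; label 1 = fraud.
X = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [2.5, 1.0], [3.0, 1.0], [2.8, 1.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train(X, y)
print(log_loss([0.0, 0.0], 0.0, X, y), log_loss(weights, bias, X, y))
```

The two printed numbers are the cost before and after training; the second should be lower, which is all that "training" means at this level.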

To get a little more technical, the machine learning techniques used here are of a few kinds:

  • Tree-based: models such as Random Forest and XGBoost, which perform well on many kinds of transaction datasets
  • Neural networks: slowly gaining in popularity, and often used in conjunction with anomaly detection, which brings us to the next techniques
  • Clustering: wherein data points lying close to the centroid of a cluster are said to be safe, while outliers are labeled fraudulent
  • Nearest neighbor: normal data points occur in close proximity to one another, while anomalous data points are far from their nearest neighbors
  • Classification: where we use supervised learning on labeled data to distinguish legitimate from fraudulent transactions
  • Statistical and deep learning models: where a well-defined stochastic model (which may itself be a deep network) is fitted to the data, separating normal data points occurring in high-probability regions from abnormal data points occurring in low-probability regions
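The clustering and nearest-neighbor ideas in the list above can be sketched as two simple anomaly scores. This is a simplified illustration: it assumes a single cluster of known-normal transactions (a real clustering algorithm such as k-means would find several centroids), and the points and `k` value are made up.

```python
import math


def centroid(points):
    """Mean point of a cluster."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def centroid_score(point, center):
    """Clustering view: distance from the cluster centroid."""
    return math.dist(point, center)


def knn_score(point, reference, k=3):
    """Nearest-neighbor view: mean distance to the k closest known-normal points."""
    distances = sorted(math.dist(point, r) for r in reference)
    return sum(distances[:k]) / k


# Toy feature vectors for transactions assumed to be normal.
normal = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [0.9, 0.95]]
center = centroid(normal)

typical = [1.0, 1.05]
unusual = [6.0, 5.0]
for name, point in [("typical", typical), ("unusual", unusual)]:
    print(name, centroid_score(point, center), knn_score(point, normal))
```

In both views a higher score means "more anomalous". Where to put the cut-off between safe and suspicious is a separate decision that trades false alerts against missed fraud.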