
Anomaly detection is about identifying events or observations that occur rarely and do not fit the larger pattern of the data. They deviate significantly from what is considered normal, and are also called outliers. So consider this: if you stand out in the crowd, you are an anomaly and will trigger an instant alert with the IT police.
There are a lot of scenarios where anomaly detection comes in handy. Here are some examples,
Financial Fraud
One of the primary use cases of anomaly detection is financial transactions. We can monitor bank deposits and withdrawals for large changes, or watch for unusual credit card activity (for example, a card suddenly being used in a different country than normal). Banks regularly use anomaly detection to monitor for such changes.
Network Security
Traffic over a corporate network is normally consistent. A sudden change in the volume or source of traffic means something is not normal, and should trigger an alarm that the network may be compromised.
Email Variations
Email traffic is another thing that normally remains consistent. If you start getting emails from senders who have never emailed you before, that is a matter of concern.
eCommerce
eCommerce vendors also use anomaly detection to catch sudden changes in sales or lead conversions. Another use case is a sudden increase in negative reviews.
Medical
Doctors can also use anomaly detection on medical records. For example, if a patient's blood sugar level suddenly jumps well above normal, that should trigger some concern.
Types of Anomalies
Now let's talk about the types of anomalies. Broadly speaking, anomalies fall into two distinct categories.
Univariate Anomaly
In the case of a univariate anomaly, the outlier depends on only one factor. Consider an example where we consistently get around 100 network requests every 15 minutes. Suddenly the trend changes to 2 requests per 15 minutes. That is a change from the norm.
Multivariate Anomaly
For a multivariate anomaly, the outlier depends on more than one factor. Consider the same example as above, but add one more factor to the equation: time of day. Between 9:00 AM and 4:00 PM we do average 100 requests per 15 minutes, but at night, between 11:00 PM and 3:00 AM, the rate drops to an average of 5. Those 2 requests may well have arrived during that night window. What looked like an outlier before is no longer an outlier; it is normal.
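To make that concrete, here is a tiny toy sketch (my own illustration, not from any library or from the original example) where the same count of 2 requests is judged against a per-period baseline:

```python
# Assumed baselines: average requests per 15 minutes for each period
baseline = {'day': 100, 'night': 5}

def looks_anomalous(requests, period):
    # Flag anything far below or far above the baseline for that period
    expected = baseline[period]
    return requests < expected * 0.2 or requests > expected * 5

print(looks_anomalous(2, 'day'))    # True:  2 requests is an outlier during business hours
print(looks_anomalous(2, 'night'))  # False: 2 requests is normal at night
```

The same observation flips from anomalous to normal once the second factor is taken into account.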
Isolation Forest
In this blog we will use Isolation Forest to build an anomaly detection program. We will create our own dataset to use for this.
Isolation Forest is an algorithm for detecting anomalies in a data series. It uses binary trees to detect them, exploiting the fact that anomalies are few and different, and so can be isolated easily. The algorithm builds its binary trees using the following steps:
- Given a dataset, a random feature is selected, and a random threshold is chosen between the minimum and maximum of that feature.
- A branch is created for this feature: anything below the threshold goes to the left, anything above goes to the right. This forms a binary tree.
- This step repeats recursively until every point is isolated or the maximum depth is reached.
We can easily identify anomalies from these isolation trees. Based on our assumption, anomalies are few and lie far from the bulk of the data, so we should be able to reach them in a minimum number of steps. Anything within the normal range of values takes many more splits to isolate. This short-average-path-length property is the core of the algorithm; the sketch below makes it concrete.
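Here is a minimal sketch of that idea for a single feature (a simplified toy of my own, not the scikit-learn implementation): repeatedly split on random thresholds and count how many splits it takes to isolate a point.

```python
import numpy as np

def isolation_path_length(data, point, max_depth=15, depth=0):
    # Recursively split on a random threshold until the point is isolated
    if depth >= max_depth or len(data) <= 1 or data.min() == data.max():
        return depth
    threshold = np.random.uniform(data.min(), data.max())
    side = data[data < threshold] if point < threshold else data[data >= threshold]
    return isolation_path_length(side, point, max_depth, depth + 1)

values = np.append(np.random.randint(1000, 5000, 999), 19000)  # one obvious outlier
normal = np.mean([isolation_path_length(values, 2500) for _ in range(200)])
outlier = np.mean([isolation_path_length(values, 19000) for _ in range(200)])
print(f"average path length - normal point: {normal:.1f}, outlier: {outlier:.1f}")
```

Averaged over many such random trees, the outlier is isolated in far fewer splits, and that is exactly the signal the forest aggregates.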
To write this program, we will use the following Python libraries.
- scikit-learn
- pandas
- matplotlib (not needed unless you want to plot)
- seaborn (not needed unless you want to plot)
So, the first part will take care of generating some random data for us to use.
Generating Random Data
Think about a banking system. Say you mostly make US transactions, and they normally fall in the range of $1,000 to $5,000. We will consider that the normal, and create data based on this assumption. To add some anomalies, however, we will add a few transactions that are not in the US, and also create transactions that fall beyond the established norms. Given below is the class that generates the dummy data.
```python
import numpy as np
import pandas as pd

class GenerateData:
    def __init__(self):
        pass

    def gen_clean_data(self, cnt, cntanom):
        df = pd.DataFrame(np.random.randint(1000, 5000, size=(cnt, 1)), columns=['Amount'])
        df['Country'] = 'USA'

        # Add Anomalies
        dfanom = df.sample(n=cntanom)
        for index, row in dfanom.iterrows():
            # Replace these amounts with out-of-range values
            df.loc[index, 'Amount'] = np.random.randint(10000, 20000)

        dfanom = df.sample(n=cntanom)
        for index, row in dfanom.iterrows():
            df.loc[index, 'Country'] = 'IND'

        # Now convert the Country to Category so we can assign a unique number to it
        df['Country'] = df['Country'].astype('category')
        # Assign the numbers to a new column
        df['CountryCode'] = df['Country'].cat.codes

        print(df[df['Amount'] > 10000].head(2))
        print(df[df['Country'] == 'IND'].head(2))
        return df
```
Here we create a pandas dataframe on line #9, using random values between 1,000 and 5,000. These we treat as transaction amounts. We then add a second column for the country and set all of them to USA.
After this we start creating the anomalies, beginning on line #13. First we select some random rows and replace their amounts with random values between 10,000 and 20,000. Next, starting on line #18, we again select some random rows and set their country to IND. These will be the anomalies in our data. Let's print some of the generated data.
```
-- SAMPLE data --
   Amount Country  CountryCode
0    4172     USA            1
1    3971     USA            1
2    1666     USA            1
3    1499     USA            1
4    2509     USA            1

-- Anomalous Amount --
     Amount Country  CountryCode
100   19946     USA            1
326   13315     USA            1
327   19765     USA            1
563   19956     USA            1
625   12403     USA            1

-- Anomalous Country --
     Amount Country  CountryCode
114    1769     IND            0
139    1586     IND            0
389    3550     IND            0
395    4499     IND            0
461    1444     IND            0
```
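For reference, a small driver along these lines (my assumption; the original post does not show one) reproduces output like the above:

```python
# Hypothetical driver for the GenerateData class above
gd = GenerateData()
df = gd.gen_clean_data(1000, 10)

print('-- SAMPLE data --')
print(df.head(5))
```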
So, if we plot a density diagram of the amounts with the helper below, we end up with a single peak over the normal range:
```python
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data(self):
    # Kernel density estimate of the transaction amounts
    sns.kdeplot(self.df['Amount'], bw_method=.3, color='crimson', fill=True)
    plt.show()
```
Of course, there are so few data points in the 10,000 to 20,000 range that they don't even produce a blip in the diagram.
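If you do want the anomalous band to show up, one option (my own tweak, assuming seaborn 0.11 or newer) is to plot the density on a log scale:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data_log(self):
    # A log-scaled x-axis makes the sparse 10,000-20,000 band visible
    sns.kdeplot(self.df['Amount'], bw_method=.3, color='crimson', fill=True, log_scale=True)
    plt.show()
```

Now that we have the data, let's start building the isolation forest.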
Creating the Isolation Forest
We start by generating some dummy data: 10,000 records with 10 anomalies. We keep the dataframe as an instance variable.
```python
class App:
    def __init__(self):
        self.total = 10000
        self.anomalies = 10
        gd = GenerateData()
        self.df = gd.gen_clean_data(self.total, self.anomalies)
```
Next let’s create the model.
```python
from sklearn.ensemble import IsolationForest

def define_model(self):
    model = IsolationForest(
        n_estimators=1000,
        max_samples='auto',
        max_features=1,
        contamination=self.anomalies / self.total,  # approximately how many outliers we expect
        verbose=0,
        random_state=np.random.randint(10, 99)
    )
    return model
```
Let’s see what that all means,
- n_estimators: This is the number of base estimators (trees) in the ensemble.
- max_samples: This is the number of samples used to train each estimator. We use 'auto', which draws up to 256 samples by default.
- max_features: This is the number of features drawn by the algorithm for each base estimator. We use 1, which is also the default, because we will look for anomalies based on Amount first and then on Country separately.
- contamination: This is one of the more important parameters: the proportion of anomalies we expect in the dataset. Too high a value, and we get wrong results (a quick standalone check follows this list).
- random_state: This controls the pseudo-randomness of the selection.
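As that quick standalone sanity check of these settings (a sketch of my own, relying on scikit-learn's documented score conventions), you can fit the same configuration on random data and score one normal and one abnormal amount:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

model = IsolationForest(n_estimators=1000, max_samples='auto', max_features=1,
                        contamination=0.001, random_state=42)
X = np.random.randint(1000, 5000, size=(1000, 1))
model.fit(X)

# decision_function: lower (more negative) means more anomalous
print(model.decision_function([[2500], [15000]]))
# predict: -1 flags an anomaly, 1 a normal point
print(model.predict([[2500], [15000]]))
```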
Amount Anomaly
```python
def get_anomalies_amount(self, model):
    # Fit on Amount only
    model.fit(self.df[['Amount']])
    # decision_function: the lower the score, the more anomalous the row
    self.df['Amount_Score'] = model.decision_function(self.df[['Amount']])
    # predict: -1 marks an anomaly, 1 marks a normal row
    self.df['Amount_Anomaly_Score'] = model.predict(self.df[['Amount']])
    print(self.df[self.df['Amount_Anomaly_Score'] == -1].head(20))
```
This gives us the amount anomalies. Let's see what that looks like.
```
      Amount Country  CountryCode  Amount_Score  Amount_Anomaly_Score
3288   13390     USA            1     -0.039717                    -1
3759   17180     USA            1     -0.054328                    -1
3972   11165     USA            1     -0.029847                    -1
5654   13134     USA            1     -0.037543                    -1
6077   16812     USA            1     -0.053482                    -1
6459   18546     USA            1     -0.057434                    -1
6975   19639     USA            1     -0.059098                    -1
7324   18282     USA            1     -0.057099                    -1
8038   14660     USA            1     -0.045660                    -1
8093   15883     USA            1     -0.050889                    -1
```
That actually looks good. Now let’s try to find the Country anomaly.
Country Anomaly
```python
def get_anomalies_country(self, model):
    # Fit on CountryCode only
    model.fit(self.df[['CountryCode']])
    self.df['CountryCode_Score'] = model.decision_function(self.df[['CountryCode']])
    self.df['CountryCode_Anomaly_Score'] = model.predict(self.df[['CountryCode']])
    print(self.df[self.df['CountryCode_Anomaly_Score'] == -1].head(20))
```
Now that gives the following results,
```
      Amount Country  ...  CountryCode_Score  CountryCode_Anomaly_Score
634     1192     IND  ...          -0.094795                         -1
1838    2056     IND  ...          -0.094795                         -1
2251    3917     IND  ...          -0.094795                         -1
2814    4854     IND  ...          -0.094795                         -1
3491    3241     IND  ...          -0.094795                         -1
5209    1388     IND  ...          -0.094795                         -1
5464    1494     IND  ...          -0.094795                         -1
6833    4549     IND  ...          -0.094795                         -1
7490    3000     IND  ...          -0.094795                         -1
8371    3703     IND  ...          -0.094795                         -1
```
All those results look accurate. Notice that every IND row gets the identical score: with only two distinct CountryCode values, every anomalous row is isolated in exactly the same way. If you are following along, try changing the value of the contamination parameter and see how it affects the results. Too high a value, and you will see normal rows wrongly selected as anomalies.
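A quick way to run that experiment (a hedged sketch; df here is the generated dataframe from earlier) is to sweep a few contamination values and count how many rows get flagged:

```python
from sklearn.ensemble import IsolationForest

# Sweep contamination and count flagged rows; with 10 true anomalies in 10,000
# rows, values far above 0.001 will start flagging normal transactions too
for c in [0.001, 0.01, 0.05, 0.1]:
    m = IsolationForest(n_estimators=100, contamination=c, random_state=42)
    m.fit(df[['Amount']])
    flagged = (m.predict(df[['Amount']]) == -1).sum()
    print(f"contamination={c}: {flagged} rows flagged")
```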
Conclusion
That's a quick introduction to Isolation Forest for anomaly detection. If you look it up on Wikipedia, you will see the algorithm was introduced by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. For our scenario we simply used the implementation already provided in scikit-learn; the history is worth knowing, but from an implementation perspective we can just use the library. Hope you found this useful. Ciao for now!