
Anomaly detection is about identifying events or observations that occur rarely and do not fit the larger pattern of the data. They deviate significantly from what is considered normal, and are also called outliers. So consider this: if you stand out in the crowd, you are an anomaly and will trigger an instant alert with the IT police.
There are a lot of scenarios where anomaly detection comes in handy. Here are some examples,
Financial Fraud
One of the primary use cases of anomaly detection is financial transactions. We can monitor bank deposits and withdrawals for large changes, or watch for unusual credit card activity (for example, a card suddenly being used in a different country than normal). Banks regularly use anomaly detection to monitor for such changes.
Network Security
Traffic over a corporate network is normally consistent. A sudden change in the volume or source of traffic means something is not normal, and should trigger an alarm that the network may be compromised.
Email Variations
Email traffic is another thing that normally remains consistent. If you start getting emails from senders who have never emailed you before, that is a matter of concern.
eCommerce
eCommerce vendors also use anomaly detection to catch sudden changes in sales or lead conversions. Another use case is a sudden increase in negative reviews.
Medical
Doctors can also use anomaly detection on medical records. For example, if a patient's blood sugar level suddenly jumps well above normal, that should trigger some concern.
Types of Anomalies
Now let's talk about the types of anomalies. Broadly speaking, anomalies fall into two distinct categories.
Univariate Anomaly
In the case of a univariate anomaly, the outlier depends on only one factor. Consider an example where we consistently get around 100 network requests every 15 minutes. Suddenly the trend changes to 2 requests per 15 minutes. That is a change from the norm.
Multivariate Anomaly
For a multivariate anomaly, the outlier depends on more than one factor. Consider the same example as above, but add one more factor to the equation: time of day. Between 9:00 AM and 4:00 PM we do average 100 requests per 15 minutes, but at night, between 11:00 PM and 3:00 AM, the rate drops to an average of 5. Those 2 requests may well have arrived during that night window. What looked like an outlier before is no longer an outlier; it is normal.
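To make that concrete, here is a tiny toy sketch (my own illustration, not from any library or from the original example) where the same count of 2 requests is judged against a per-period baseline:

```python
# Assumed baselines: average requests per 15 minutes for each period
baseline = {'day': 100, 'night': 5}

def looks_anomalous(requests, period):
    # Flag anything far below or far above the baseline for that period
    expected = baseline[period]
    return requests < expected * 0.2 or requests > expected * 5

print(looks_anomalous(2, 'day'))    # True:  2 requests is an outlier during business hours
print(looks_anomalous(2, 'night'))  # False: 2 requests is normal at night
```

The same observation flips from anomalous to normal once the second factor is taken into account.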
Isolation Forest
In this blog we will use Isolation Forest to build an anomaly detection program. We will create our own dataset to use for this.
Isolation Forest is an algorithm for detecting anomalies in a data series. It uses binary trees to detect them, exploiting the fact that anomalies are few and different, and so can be isolated easily. The algorithm builds its binary trees using the following steps:
- Given a dataset, a random feature is selected, and a random threshold is chosen between the minimum and maximum of that feature.
- A branch is created for this feature: anything below the threshold goes to the left, anything above goes to the right. This forms a binary tree.
- This step repeats recursively until every point is isolated or the maximum depth is reached.
We can easily identify anomalies from these isolation trees. Based on our assumption, anomalies are few and lie far from the bulk of the data, so we should be able to reach them in a minimum number of steps. Anything within the normal range of values takes many more splits to isolate. This short-average-path-length property is the core of the algorithm; the sketch below makes it concrete.
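Here is a minimal sketch of that idea for a single feature (a simplified toy of my own, not the scikit-learn implementation): repeatedly split on random thresholds and count how many splits it takes to isolate a point.

```python
import numpy as np

def isolation_path_length(data, point, max_depth=15, depth=0):
    # Recursively split on a random threshold until the point is isolated
    if depth >= max_depth or len(data) <= 1 or data.min() == data.max():
        return depth
    threshold = np.random.uniform(data.min(), data.max())
    side = data[data < threshold] if point < threshold else data[data >= threshold]
    return isolation_path_length(side, point, max_depth, depth + 1)

values = np.append(np.random.randint(1000, 5000, 999), 19000)  # one obvious outlier
normal = np.mean([isolation_path_length(values, 2500) for _ in range(200)])
outlier = np.mean([isolation_path_length(values, 19000) for _ in range(200)])
print(f"average path length - normal point: {normal:.1f}, outlier: {outlier:.1f}")
```

Averaged over many such random trees, the outlier is isolated in far fewer splits, and that is exactly the signal the forest aggregates.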
To write this program, we will use the following Python libraries.
- scikit-learn
- pandas
- matplotlib (not needed unless you want to plot)
- seaborn (not needed unless you want to plot)
So, the first part will take care of generating some random data for us to use.
Generating Random Data
Think about a banking system. Say you mostly make US transactions, and they normally fall in the range of $1,000 to $5,000. We will consider that the normal, and create data based on this assumption. To add some anomalies, however, we will add a few transactions that are not in the US, and also create transactions that fall beyond the established norms. Given below is the class that generates the dummy data.
```python
import numpy as np
import pandas as pd

class GenerateData:
    def __init__(self):
        pass

    def gen_clean_data(self, cnt, cntanom):
        df = pd.DataFrame(np.random.randint(1000, 5000, size=(cnt, 1)), columns=['Amount'])
        df['Country'] = 'USA'

        # Add Anomalies
        dfanom = df.sample(n=cntanom)
        for index, row in dfanom.iterrows():
            # Replace these amounts with out-of-range values
            df.loc[index, 'Amount'] = np.random.randint(10000, 20000)

        dfanom = df.sample(n=cntanom)
        for index, row in dfanom.iterrows():
            df.loc[index, 'Country'] = 'IND'

        # Now convert the Country to Category so we can assign a unique number to it
        df['Country'] = df['Country'].astype('category')
        # Assign the numbers to a new column
        df['CountryCode'] = df['Country'].cat.codes

        print(df[df['Amount'] > 10000].head(2))
        print(df[df['Country'] == 'IND'].head(2))
        return df
```
Here we create a pandas dataframe on line #9, using random values between 1,000 and 5,000. These we treat as transaction amounts. We then add a second column for the country and set all of them to USA.
After this we start creating the anomalies, beginning on line #13. First we select some random rows and replace their amounts with random values between 10,000 and 20,000. Next, starting on line #18, we again select some random rows and set their country to IND. These will be the anomalies in our data. Let's print some of the generated data.
```
-- SAMPLE data --
   Amount Country  CountryCode
0    4172     USA            1
1    3971     USA            1
2    1666     USA            1
3    1499     USA            1
4    2509     USA            1

-- Anomalous Amount --
     Amount Country  CountryCode
100   19946     USA            1
326   13315     USA            1
327   19765     USA            1
563   19956     USA            1
625   12403     USA            1

-- Anomalous Country --
     Amount Country  CountryCode
114    1769     IND            0
139    1586     IND            0
389    3550     IND            0
395    4499     IND            0
461    1444     IND            0
```
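For reference, a small driver along these lines (my assumption; the original post does not show one) reproduces output like the above:

```python
# Hypothetical driver for the GenerateData class above
gd = GenerateData()
df = gd.gen_clean_data(1000, 10)

print('-- SAMPLE data --')
print(df.head(5))
```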
So, if we plot a density diagram of the amounts with the helper below, we end up with a single peak over the normal range:
```python
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data(self):
    # Kernel density estimate of the transaction amounts
    sns.kdeplot(self.df['Amount'], bw_method=.3, color='crimson', fill=True)
    plt.show()
```
Of course, there are so few data points in the 10,000 to 20,000 range that they don't even produce a blip in the diagram.
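If you do want the anomalous band to show up, one option (my own tweak, assuming seaborn 0.11 or newer) is to plot the density on a log scale:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data_log(self):
    # A log-scaled x-axis makes the sparse 10,000-20,000 band visible
    sns.kdeplot(self.df['Amount'], bw_method=.3, color='crimson', fill=True, log_scale=True)
    plt.show()
```

Now that we have the data, let's start building the isolation forest.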
Creating the Isolation Forest
We start by generating some dummy data: 10,000 records with 10 anomalies. We keep the dataframe as an instance variable.
```python
class App:
    def __init__(self):
        self.total = 10000
        self.anomalies = 10
        gd = GenerateData()
        self.df = gd.gen_clean_data(self.total, self.anomalies)
```
Next let’s create the model.
```python
from sklearn.ensemble import IsolationForest

def define_model(self):
    model = IsolationForest(
        n_estimators=1000,
        max_samples='auto',
        max_features=1,
        contamination=self.anomalies / self.total,  # approximately how many outliers we expect
        verbose=0,
        random_state=np.random.randint(10, 99)
    )
    return model
```
Let’s see what that all means,
- n_estimators: This is the number of base estimators (trees) in the ensemble.
- max_samples: This is the number of samples used to train each estimator. We use 'auto', which draws up to 256 samples by default.
- max_features: This is the number of features drawn by the algorithm for each base estimator. We use 1, which is also the default, because we will look for anomalies based on Amount first and then on Country separately.
- contamination: This is one of the more important parameters: the proportion of anomalies we expect in the dataset. Too high a value, and we get wrong results (a quick standalone check follows this list).
- random_state: This controls the pseudo-randomness of the selection.
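As that quick standalone sanity check of these settings (a sketch of my own, relying on scikit-learn's documented score conventions), you can fit the same configuration on random data and score one normal and one abnormal amount:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

model = IsolationForest(n_estimators=1000, max_samples='auto', max_features=1,
                        contamination=0.001, random_state=42)
X = np.random.randint(1000, 5000, size=(1000, 1))
model.fit(X)

# decision_function: lower (more negative) means more anomalous
print(model.decision_function([[2500], [15000]]))
# predict: -1 flags an anomaly, 1 a normal point
print(model.predict([[2500], [15000]]))
```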
Amount Anomaly
```python
def get_anomalies_amount(self, model):
    # Fit on Amount only
    model.fit(self.df[['Amount']])
    # decision_function: the lower the score, the more anomalous the row
    self.df['Amount_Score'] = model.decision_function(self.df[['Amount']])
    # predict: -1 marks an anomaly, 1 marks a normal row
    self.df['Amount_Anomaly_Score'] = model.predict(self.df[['Amount']])
    print(self.df[self.df['Amount_Anomaly_Score'] == -1].head(20))
```
This gives us the amount anomalies. Let's see what that looks like.
```
      Amount Country  CountryCode  Amount_Score  Amount_Anomaly_Score
3288   13390     USA            1     -0.039717                    -1
3759   17180     USA            1     -0.054328                    -1
3972   11165     USA            1     -0.029847                    -1
5654   13134     USA            1     -0.037543                    -1
6077   16812     USA            1     -0.053482                    -1
6459   18546     USA            1     -0.057434                    -1
6975   19639     USA            1     -0.059098                    -1
7324   18282     USA            1     -0.057099                    -1
8038   14660     USA            1     -0.045660                    -1
8093   15883     USA            1     -0.050889                    -1
```
That actually looks good. Now let’s try to find the Country anomaly.
Country Anomaly
```python
def get_anomalies_country(self, model):
    # Fit on CountryCode only
    model.fit(self.df[['CountryCode']])
    self.df['CountryCode_Score'] = model.decision_function(self.df[['CountryCode']])
    self.df['CountryCode_Anomaly_Score'] = model.predict(self.df[['CountryCode']])
    print(self.df[self.df['CountryCode_Anomaly_Score'] == -1].head(20))
```
Now that gives the following results,
```
      Amount Country  ...  CountryCode_Score  CountryCode_Anomaly_Score
634     1192     IND  ...          -0.094795                         -1
1838    2056     IND  ...          -0.094795                         -1
2251    3917     IND  ...          -0.094795                         -1
2814    4854     IND  ...          -0.094795                         -1
3491    3241     IND  ...          -0.094795                         -1
5209    1388     IND  ...          -0.094795                         -1
5464    1494     IND  ...          -0.094795                         -1
6833    4549     IND  ...          -0.094795                         -1
7490    3000     IND  ...          -0.094795                         -1
8371    3703     IND  ...          -0.094795                         -1
```
All those results look accurate. Notice that every IND row gets the identical score: with only two distinct CountryCode values, every anomalous row is isolated in exactly the same way. If you are following along, try changing the value of the contamination parameter and see how it affects the results. Too high a value, and you will see normal rows wrongly selected as anomalies.
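A quick way to run that experiment (a hedged sketch; df here is the generated dataframe from earlier) is to sweep a few contamination values and count how many rows get flagged:

```python
from sklearn.ensemble import IsolationForest

# Sweep contamination and count flagged rows; with 10 true anomalies in 10,000
# rows, values far above 0.001 will start flagging normal transactions too
for c in [0.001, 0.01, 0.05, 0.1]:
    m = IsolationForest(n_estimators=100, contamination=c, random_state=42)
    m.fit(df[['Amount']])
    flagged = (m.predict(df[['Amount']]) == -1).sum()
    print(f"contamination={c}: {flagged} rows flagged")
```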
Conclusion
That's a quick introduction to Isolation Forest for anomaly detection. If you look it up on Wikipedia, you will see the algorithm was introduced by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. For our scenario we simply used the implementation already provided in scikit-learn; the history is worth knowing, but from an implementation perspective we can just use the library. Hope you found this useful. Ciao for now!