Supervised Learning(Part-8)

 Ensemble Model



Ensemble models in machine learning combine the decisions from multiple models to improve overall performance. The idea is similar to asking several people for their opinion before making a decision, as the example below illustrates.

1. Introduction to Ensemble Learning


Let’s understand the concept of ensemble learning with an example. Suppose you are a movie director and you have created a short movie on a very important and interesting topic. Now, you want to take preliminary feedback (ratings) on the movie before making it public. What are the possible ways by which you can do that?

A: You may ask one of your friends to rate the movie for you.
Now it’s entirely possible that the person you have chosen loves you very much and doesn’t want to break your heart by providing a 1-star rating to the horrible work you have created.

B: Another way could be by asking 5 colleagues of yours to rate the movie.
This should provide a better idea of the movie. This method may provide honest ratings for your movie. But a problem still exists. These 5 people may not be “Subject Matter Experts” on the topic of your movie. Sure, they might understand the cinematography, the shots, or the audio, but at the same time may not be the best judges of dark humour.

C: How about asking 50 people to rate the movie?
Some of which can be your friends, some of them can be your colleagues and some may even be total strangers.

The responses, in this case, would be more generalized and diversified, since you now have people with different skill sets. As it turns out, this is a better approach for getting honest ratings than the previous two options.

From these examples, you can infer that a diverse group of people is likely to make better decisions than an individual. The same is true for a diverse set of models compared with a single model. In machine learning, this diversification is achieved by a technique called Ensemble Learning.
2. Simple Ensemble Techniques



In this section, we will look at a few simple but powerful techniques, namely:
  1. Max Voting
  2. Averaging
  3. Weighted Averaging
1 Max Voting

The max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point, and each model's prediction is treated as a 'vote'. The prediction made by the majority of the models is used as the final prediction.

For example, when you asked 5 of your colleagues to rate your movie (out of 5), suppose three of them rated it 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating is taken as 4. You can think of this as taking the mode of all the predictions.

The result of max voting would be something like this:

Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating
     5      |      4      |      5      |      4      |      4      |      4
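
As a quick illustration, here is a minimal Python sketch of max voting over the ratings in the table above. With real classifiers, scikit-learn's VotingClassifier with voting='hard' applies the same idea to model predictions.

```python
# Max voting: the most frequent prediction ("vote") wins.
# Ratings below are the illustrative colleague ratings from the table.
from statistics import mode

ratings = [5, 4, 5, 4, 4]
final_rating = mode(ratings)   # mode of the votes
print(final_rating)            # -> 4
```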

2 Averaging

Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction. Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.

For example, in the below case, the averaging method would take the average of all the values.

i.e. (5+4+5+4+4)/5 = 4.4

Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating
     5      |      4      |      5      |      4      |      4      |     4.4
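
A minimal sketch reproducing the arithmetic above; with real models, the same idea is applied to each model's predicted value (regression) or predicted probability (classification).

```python
# Averaging: the final prediction is the mean of all predictions.
ratings = [5, 4, 5, 4, 4]
final_rating = sum(ratings) / len(ratings)
print(final_rating)   # -> 4.4
```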



3 Weighted Average

This is an extension of the averaging method. Each model is assigned a different weight defining its importance for the prediction. For instance, if two of your colleagues are critics while the others have no prior experience in this field, then the answers from these two colleagues are given more importance than those of the other people.

The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.


       | Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating
weight |    0.23     |    0.23     |    0.18     |    0.18     |    0.18     |
rating |     5       |     4       |     5       |     4       |     4       |    4.41
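
A minimal sketch of the weighted-average calculation, using the illustrative weights from the table (they sum to 1).

```python
# Weighted averaging: each prediction is scaled by its model's weight.
ratings = [5, 4, 5, 4, 4]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]
final_rating = sum(w * r for w, r in zip(weights, ratings))
print(round(final_rating, 2))   # -> 4.41
```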



3. More Ensemble Techniques


1. RandomForestClassifier-

Random Forest has been among the most widely used ensemble models and follows the concept of Bagging. Here we consider a large number of decision trees, often hundreds or thousands, all independent of each other, each trained on the entire training dataset or a random part of it, and each producing its own prediction. These results are then aggregated, by majority vote or by averaging, and taken as the final prediction of the model. This helps ensure the model does not overfit.


Example: suppose we have 100 decision trees for a binary classification problem, of which 60 predict 1 and 40 predict 0. Since more trees predict 1, the final result is 1.
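
A minimal sketch of a Random Forest classifier in scikit-learn; the toy dataset and parameter values are illustrative, not taken from the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy binary-classification dataset (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 independent trees, each fit on a bootstrap sample of the training
# data; the final class is the majority vote of the trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```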

2. Gradient Boosting Machine-

Unlike the Random Forest Classifier, which works on the concept of Bagging, GBM uses Boosting. Here, too, we take tens of decision trees, but they are not independent: the trees work in sequential order, and the output of one tree is used by the next to focus on the errors and fit the residuals. The common problem is that GBM overfits very quickly, so the number of trees is kept low compared to a Random Forest.


Example: suppose we have 5 decision trees. The first one, say F1, takes the training data X and produces output Y1. The second tree, say H1, also takes X as input but uses Y − Y1 (the residual of tree F1's prediction) as its target. The combined output of F1 and H1 is the final output. If there are more trees, the same chain continues.

Y2 = F1(X) + H1(X), where F1 is trained with target Y and H1 is trained with target Y − Y1, and:

X = input/training data
Y = target value
F1 = a weak learner
H1 = booster for F1, the new decision tree model
Y1 = output of F1(X)
Y2 = improved result

Now for the next boosting round, we use

Y3 = Y2 + H2(X), where H2 is trained with target Y − Y2

Here, all notations remain the same, except that H2 is the new booster and Y3 is an improved version of Y2.

The same step can be repeated for the chosen number of trees to further improve the results. The remaining models described below also use the boosting technique.
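
The chain above can be sketched directly with shallow regression trees as the weak learners; the names mirror the text (F1, H1, Y1, Y2, ...) and the data is illustrative. In practice you would use a library implementation such as scikit-learn's GradientBoostingRegressor rather than chaining trees by hand.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
Y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

F1 = DecisionTreeRegressor(max_depth=2).fit(X, Y)        # weak learner, target Y
Y1 = F1.predict(X)

H1 = DecisionTreeRegressor(max_depth=2).fit(X, Y - Y1)   # booster, target Y - Y1
Y2 = Y1 + H1.predict(X)                                  # improved result

H2 = DecisionTreeRegressor(max_depth=2).fit(X, Y - Y2)   # next boosting round
Y3 = Y2 + H2.predict(X)

for pred in (Y1, Y2, Y3):
    print(np.mean((Y - pred) ** 2))   # training error shrinks each round
```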

3. eXtreme Gradient Boosting Machine-

It is the most popular model when it comes to Kaggle competitions. It is an upgraded version of GBM and is therefore faster and uses less space, because it does not try every possible split but only a subset of useful ones (using a pre-sorted splitting algorithm): if 1000 split points are possible, it may consider only the 100 best, saving both space and time. It is often described as regularized GBM, since a term lambda (call it L for now) is multiplied with the function used for boosting in the example above (H1), so the equation becomes L*H1() instead of H1().
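
A minimal sketch using the xgboost package (assumed to be installed); parameter values are illustrative, and reg_lambda is XGBoost's L2 regularization term, loosely corresponding to the lambda mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy dataset (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, reg_lambda=1.0)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))
```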

4. Light Gradient Boosting Machine-

LGBM has also been among the emerging models gaining popularity in the data science domain. Although the accuracy of the XGB and LGBM models is usually quite close, their implementations differ slightly. To find the best splits among all possible ones (the "100 out of 1000 split points" idea, i.e. avoiding the extra work), LGBM uses Gradient-based One-Side Sampling (GOSS), while XGB uses a pre-sorted algorithm for splitting.
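
A minimal sketch using the lightgbm package (assumed to be installed); parameter values are illustrative. GOSS can be enabled via a boosting/sampling option whose name differs across LightGBM versions, so it is left at the default here.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lgbm = LGBMClassifier(n_estimators=100, learning_rate=0.1)
lgbm.fit(X_train, y_train)
print(lgbm.score(X_test, y_test))
```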

5. Catboost-

Though not as popular, CatBoost is comparatively slower than LGBM and XGB, but it has an unbeatable advantage: it can take categorical data in text form directly (you only need to specify which columns are categorical) and train on it, hence the name Categorical Boosting. The point is that it understands categorical data, while other models only accept it once it has been converted to numbers. No preprocessing step with OneHotEncoder or LabelEncoder is required, which often leads to better results.
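
A minimal sketch using the catboost package (assumed to be installed); the toy DataFrame and column names are illustrative. Note that the categorical column is passed as raw text via cat_features, with no encoding step.

```python
import pandas as pd
from catboost import CatBoostClassifier

# Toy dataset with one categorical and one numeric feature (illustrative).
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "green"],
    "size":   [1.0,   2.5,    1.2,   3.1,     2.2,    3.0],
    "label":  [0,     1,      0,     1,       1,      1],
})

model = CatBoostClassifier(iterations=50, verbose=0)
model.fit(df[["colour", "size"]], df["label"], cat_features=["colour"])

# Predict on new raw (unencoded) data.
new_row = pd.DataFrame({"colour": ["red"], "size": [1.1]})
print(model.predict(new_row))
```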
