Unsupervised Learning (Part-2)

Scenario of overfitting and underfitting in Gaussian distributions :

As seen in the above figure, if we fit a separate narrow Gaussian to nearly every data point, so the probability density collapses onto the individual points, we reach the overfit extreme. Whereas, if there is one large Gaussian distribution covering all the data, it leads to the underfit extreme, or underfitting of the data.
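To make the two extremes concrete, here is a minimal sketch in pure Python (the 1-D data points and the variance values are made up for illustration): the underfit extreme is one broad Gaussian fit to all the data, while the overfit extreme puts a narrow Gaussian on each point, so the density at the training points grows without bound as the variance shrinks.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of a 1-D Gaussian N(mu, sigma^2) at x.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

data = [1.0, 2.0, 8.0, 9.0]

# Underfit extreme: a single broad Gaussian over all the points.
mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
print("broad fit:", [round(gaussian_pdf(x, mu, sigma), 4) for x in data])

# Overfit extreme: one narrow Gaussian centred on each point.
# As sigma shrinks, the density at the training point blows up.
for sigma_k in (1.0, 0.1, 0.01):
    print("sigma =", sigma_k, "density at point:", gaussian_pdf(data[0], data[0], sigma_k))
```

The printout makes the trade-off visible: the broad fit assigns moderate density everywhere, while the per-point fit achieves arbitrarily high density on the training data yet generalizes to nothing in between.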

Cross-Validation : Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model. It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
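The splitting procedure described above can be sketched in pure Python (the function name `k_fold_splits` is an illustrative choice, not a library API): the sample indices are partitioned into k folds, and each fold in turn serves as the test set while the remaining folds form the training set.

```python
def k_fold_splits(n_samples, k):
    # Partition sample indices 0..n_samples-1 into k folds and
    # yield (train_indices, test_indices) pairs, one per fold.
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    current = 0
    for size in fold_sizes:
        test = indices[current:current + size]
        train = indices[:current] + indices[current + size:]
        splits.append((train, test))
        current += size
    return splits

for train, test in k_fold_splits(10, 5):
    print("test fold:", test)
```

With k=5 and 10 samples, each of the 5 iterations holds out 2 samples for testing and trains on the other 8, so every sample is used for evaluation exactly once.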

K-Means : The k-means algorithm searches for a pre-specified number of clusters in an unlabeled multidimensional dataset. It does this using a simple notion of what an optimal cluster looks like. Suppose we have data points as represented in the figure. To understand it more visually, we partition them with a decision boundary as shown.

Let us take two random centroids, θ1 and θ2, which are n-dimensional. The decision boundary separates the space so that data points on one side are closer to one centroid and points on the other side are closer to the other. We build the perpendicular bisector of the segment joining the centroids: points in cluster C1 are closer to θ1, and points in cluster C2 are closer to θ2.

According to the algorithm, given the centroids and the data points, we measure the distance from each data point to every centroid and assign each point to the centroid it is closest to.
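The assign-then-update loop described above can be sketched as a minimal pure-Python implementation of Lloyd's algorithm (the 2-D points and the two starting centroids are made up for illustration): each point is assigned to its nearest centroid, then each centroid moves to the mean of its assigned points, and the two steps repeat.

```python
def euclidean(p, q):
    # Euclidean distance between two points of any dimension.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def k_means(points, centroids, n_iters=10):
    # Lloyd's algorithm: repeatedly assign points to their nearest
    # centroid, then move each centroid to the mean of its cluster.
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # the two centroids settle at the cluster means
```

Starting from the centroids (0, 0) and (10, 10), the perpendicular-bisector rule sends the two lower-left points to the first centroid and the two upper-right points to the second, and the centroids settle at (1.25, 1.5) and (8.5, 8.5).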


How do we calculate the distance?

It can be done using the Euclidean distance, i.e. sqrt((x2-x1)^2 + (y2-y1)^2).
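As a quick worked example of that formula (the points are chosen to form a 3-4-5 right triangle):

```python
import math

def euclidean_distance(p1, p2):
    # Straight-line distance between two 2-D points.
    (x1, y1), (x2, y2) = p1, p2
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

print(euclidean_distance((1, 1), (4, 5)))  # 3-4-5 triangle -> 5.0
```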

