Posts

Showing posts from July, 2020

Model Evaluation and Selection (Part-3)

Hypothesis Testing: given a hypothesis, test whether it is true with a particular confidence.
• Hypothesis: error_D(h1) > error_D(h2)
• What fraction of the probability mass is associated with error_D(h1) − error_D(h2) > 0?
• Example: let the error rates measured for the two hypotheses h1 and h2 on samples of size 100 be 0.3 and 0.2 respectively, so the observed difference is d̂ = error_S1(h1) − error_S2(h2) = 0.1
• The standard deviation of the normal distribution defined on d̂ is σ = √(0.3·0.7/100 + 0.2·0.8/100) ≈ 0.061, and 1.64 × 0.061 ≈ 0.1, so the observed difference is about 1.64 standard deviations above zero
• 1.64 standard deviations corresponds to the two-sided 90% confidence interval, hence the probability mass above zero is 95%
• Result: accept with 95% confidence that h2 is a more accurate hypothesis than h1 on D (the underlying population)
Comparing Two Algorithms: given two learning algorithms, L1 and L2, which one is better, on average, at learning a particular target function?
• Estimating the relative performance: calculate the expected value of the difference in errors
• For all samples of …
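A minimal Python sketch of the arithmetic in this example (the σ formula for the difference of two measured error rates and the variable names are my reconstruction, not spelled out in the post):

```python
import math

# Values from the example: error rates of h1 and h2 on samples of size 100.
e1, n1 = 0.3, 100
e2, n2 = 0.2, 100

d_hat = e1 - e2                                             # observed difference ~0.1
sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)  # ~0.061
z = d_hat / sigma                                           # ~1.64 standard deviations

print(f"d_hat={d_hat:.3f}, sigma={sigma:.3f}, z={z:.2f}")
```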

Model Evaluation and Selection

Model Process: in the decision-making process we model a function f : x⃗ → y, and the real-world process (RWP) corresponds to the population. After building a model, we need to know how accurate the model is expected to be on the population. For that, here comes the Confusion Matrix. Here we are getting 88 out of 100 correct, therefore
Accuracy = 88/100 = 0.88
We need to check whether this accuracy will carry over to the population, and for that we check the Expected Accuracy:
E[X] = Σₓ x · P(x)
The model predicts h(x⃗) = y. The goal to be achieved is h(x⃗) = f(x⃗) for each x⃗ belonging to the population, so we look at E[h(x⃗) = f(x⃗)], which says how much correctness we can expect on unseen data. When we have data, we split it into two categories: Training Data and Test Data. As we don't know the distribution, functional form and …
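As a quick sketch, accuracy can be read off a confusion matrix like this (the 50/38 split of the 88 correct predictions is hypothetical, only the 88/100 total comes from the post):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix with 88/100 correct, matching the example.
# Rows are actual classes, columns are predicted classes.
cm = np.array([[50,  4],    # actual class 0: 50 correct, 4 misclassified
               [ 8, 38]])   # actual class 1: 38 correct, 8 misclassified

accuracy = np.trace(cm) / cm.sum()   # correct predictions / all predictions
print(accuracy)                      # 0.88
```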

Model Selection and Evaluation (Part-2)

Heuristic Search: since the hypothesis space is infinite, it must be searched heuristically. Hypothesis testing leads to the evaluation of the model, which can be done by two methods. Chi-Square Test: here χ² = Σ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ, where Oᵢⱼ is the observed value and Eᵢⱼ is the expected value; the statistic is compared against a chi-square distribution with the appropriate degrees of freedom. t-test: here we find the confidence interval. A machine learning model is a representation of an estimate of the population. SMOTE: it creates synthetic samples. Draw a line from a particular instance to its closest neighbor within the class that is under-represented in the data, and generate a new instance on that line; see the sketch below. This makes the model learn a decision boundary connecting the two instances and gives better generalization. Transformation can be done after SMOTE. Oversampling and under-sampling can only be done on training data, never on test data, because on training data we learn the function that predicts the low-frequency class, while on test data we estimate or assess the accuracy of the model …
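A minimal sketch of SMOTE's interpolation idea, under the assumption of a single neighbor; a full implementation (e.g. imbalanced-learn's SMOTE) picks randomly among the k nearest minority-class neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x, neighbor):
    """One synthetic point: a random point on the line segment between a
    minority-class instance and one of its nearest same-class neighbors."""
    lam = rng.random()                 # interpolation factor in [0, 1)
    return x + lam * (neighbor - x)

# Hypothetical minority-class instance and its closest minority-class neighbor.
x = np.array([1.0, 2.0])
neighbor = np.array([1.5, 2.6])
print(smote_sample(x, neighbor))       # new synthetic instance on the segment
```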

Introduction to Machine Learning (Part-9)

Regularization: regularization avoids overfitting in the model. It adds a penalty term to the cost function based on the parameters:
θ* = arg maxθ log P(S; θ) = Σⱼ₌₁ᵐ log P(yⱼ | x⃗ⱼ; θ) − λ Σᵢ₌₁ⁿ βᵢ²
λ is a constant (hyper-parameter) that determines the strength of the penalty term. For linear regression:
θ* = arg minθ Σᵢ₌₁ᵐ (yᵢ − yᵢ′)² + λ Σⱼ₌₁ⁿ βⱼ²
One way to minimize this is to minimize the individual terms; the penalty term alone is minimized when βⱼ = 0 for all j. In linear regression, using the L2 penalty term Σⱼ₌₁ⁿ βⱼ² results in Ridge regression, and using the L1 penalty term Σⱼ₌₁ⁿ |βⱼ| results in Lasso regression; a sketch of both penalties follows below. In linear regression, remove correlated independent variables. Overfitting: try to keep the model simple by shrinking β to avoid complexity. Samples and Estimation: a sample is a subset of the population; training data is used to estimate the parameters of the model.
L(θ) = P(D|θ) = Π_{xᵢ∈D} P(xᵢ|θ) …
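A minimal sketch of the two penalized costs named above, with made-up data and parameter values (the function names and the example numbers are mine):

```python
import numpy as np

def ridge_cost(beta, X, y, lam):
    """Sum of squared errors plus the L2 penalty lam * sum(beta_j ** 2)."""
    residuals = y - X @ beta
    return residuals @ residuals + lam * (beta @ beta)

def lasso_penalty(beta, lam):
    """The L1 penalty used by Lasso: lam * sum(|beta_j|)."""
    return lam * np.abs(beta).sum()

# Hypothetical data: 5 samples, 2 features.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 2.0], [5.0, 2.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
beta = np.array([0.9, 0.1])

print(ridge_cost(beta, X, y, lam=0.5))
print(lasso_penalty(beta, lam=0.5))
```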

Probability (Part-3)

Probabilistic Inference: the computation, from observed evidence, of posterior probabilities for query propositions. It builds on the joint probability distribution. In Bayes' rule
P(A|B) = P(B|A) · P(A) / P(B)
P(B|A) is the likelihood, P(A) is the prior probability, P(A|B) is the posterior probability, and P(B) is the evidence (a normalizing constant). General Inference Procedure: let X be the query variable, let E be the evidence variables with e their observed values, and let Y be the unobserved variables. Then
P(X|e) = α P(X, e) = α Σ_y P(X, e, y)
where α is the normalization constant; a sketch of this enumeration follows below. Bayesian Belief Network: it is a probabilistic graphical model. It follows causality, meaning there is a reason behind what we observe. A way to reduce its parameters is to exploit independence among some of the variables. Product Rule: applicable when there are two variables present:
P(A₁A₂) = P(A₂|A₁) P(A₁)
Chain Rule: it is the generalized form of the product rule …
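A minimal sketch of the P(X|e) = α Σ_y P(X, e, y) enumeration over a toy joint distribution (the table of probabilities is made up for illustration; it sums to 1):

```python
from itertools import product

# Toy joint distribution P(X, E, Y) over three binary variables.
vals = [0, 1]
probs = [0.05, 0.10, 0.05, 0.20, 0.10, 0.15, 0.05, 0.30]
P = {assign: p for assign, p in zip(product(vals, vals, vals), probs)}

def posterior_x(e_obs):
    """P(X | e) = alpha * sum_y P(X, e, y): sum out Y, then normalize."""
    unnorm = [sum(P[(x, e_obs, y)] for y in vals) for x in vals]
    alpha = 1.0 / sum(unnorm)          # the normalization constant
    return [alpha * u for u in unnorm]

print(posterior_x(e_obs=1))            # posterior distribution over X
```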

Introduction to Machine Learning (Part-7)

Linear Regression: in linear regression we are bound to a straight line, which is our decision boundary. The value of y can be the same for different values of x. According to the Gaussian:
P(y | x = x₁) = (1 / (σ√(2π))) · e^(−(y − μ)² / (2σ²))
Through the linear regression equation we define yᵢ′ = β₁xᵢ + β₀, and the actual value is modeled as yᵢ = β₁xᵢ + β₀ + e, where e is the error or noise. Let us assume σ is the same for every value of x and that β₁ = 0. Then
β₀* = arg min_{β₀} C, where C = Σᵢ₌₁ⁿ (yᵢ − β₀)²
Differentiating and setting the derivative to zero:
dC/dβ₀ = Σᵢ₌₁ⁿ (2β₀ − 2yᵢ) = 0
2nβ₀ − 2Σᵢ₌₁ⁿ yᵢ = 0
β₀ = Σᵢ yᵢ / n
So the best value for β₀ comes out to be the mean of the observed y values …
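A quick numerical check of this derivation, on hypothetical observations of my choosing: the grid minimum of C lands on the sample mean.

```python
import numpy as np

# Check that beta_0 = mean(y) minimizes C = sum((y_i - beta_0)^2).
y = np.array([2.0, 3.0, 5.0, 10.0])             # hypothetical observations

candidates = np.linspace(0.0, 12.0, 1201)       # grid of beta_0 values
costs = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)

print(candidates[costs.argmin()], y.mean())     # both ~5.0
```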

Introduction to Machine Learning (Part-6)

A real-world process is generated from a population, and a sample is a subset of that population. Models can be of two types. Generative: if there are n variables x₁, x₂, …, xₙ and we need the probability of these n variables, we apply the joint probability distribution: P(x⃗) = P(x₁, x₂, …, xₙ). Discriminative: it learns the conditional probability, defined as P(y|x⃗) for f : x⃗ → y, and it learns with fewer parameters. In the hypothesis set, each Hᵢ differs from each Hⱼ because each has different parameters. Here h* = arg min C(y, y′(x⃗)), where y is the actual value and y′ is the predicted value. The constraints from which we get the ability to learn are: the hypothesis set chosen and the search algorithm. Parameters grow exponentially with the number of variables (see the sketch below): for binary variables, 2ⁿ − 1; for non-binary, kⁿ − 1. Logistic regression is used when y …
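A tiny sketch of the parameter counts quoted above (the helper name is mine):

```python
# Parameters needed to specify a full joint distribution over n variables:
# k**n outcomes, minus 1 because the probabilities must sum to 1.
def joint_params(n, k=2):
    return k ** n - 1

print(joint_params(3))         # 3 binary variables     -> 2**3 - 1 = 7
print(joint_params(3, k=4))    # 3 four-valued variables -> 4**3 - 1 = 63
```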

Introduction to Machine Learning (Part-5)

Data Mining: defined as a process used to extract usable data from a larger set of raw data; it implies analyzing data patterns in large batches of data using one or more pieces of software. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The aim is to develop a tool- and application-neutral process for conducting data mining, and to define its tasks, the outputs from those tasks, terminology, and a characterization of mining problem types. It has four levels of abstraction:
• Phases — Example: Data Preparation
• Generic Tasks — a stable, general and complete set of tasks — Example: Data Cleaning
• Specialized Tasks — how the generic task is carried out — Example: Missing Value Handling
• Process Instances — Example: filling in the mean value for numeric attributes and the mode for categorical attributes …
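A minimal sketch of that last process instance in pandas, on a made-up table (column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical table with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    "income": [40_000, None, 55_000, 61_000],    # numeric attribute
    "city":   ["Pune", "Delhi", None, "Delhi"],  # categorical attribute
})

# One process instance of "Missing Value Handling":
# the mean for numeric attributes, the mode for categorical ones.
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```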

Introduction to Machine Learning (Part-4)

Training data is a subset of the population, where the population can have n variables x₁, x₂, …, xₙ. Let us discuss it with an example: assume we are taking two factors to differentiate individuals, eye color and hair color. Eye color can be blue, green or brown; hair color can be black, red, blond or grey. So the total number of combinations we will be working with is 3 × 4 = 12. Here we can make predictions of the function via the joint probability distribution. We assume that eye color and hair color alone do not tell us much about the individuals. Let us add an attribute, milk allergy, which has binary values Yes (Y) or No (N); now the number of combinations we can work with is 3 × 4 × 2 = 24. If we add another attribute, income, which is continuous, then the number of combinations to learn is 3 × 4 × 2 × ∞ = ∞. So there are two ways by which we can handle this problem (a bucketing sketch follows below). Discretize / Bucket: divide the income attribute into intervals like [1, 100k], [100k, 500k], … Probability Density Function: here we …
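A minimal sketch of the discretize/bucket option, with hypothetical incomes and bin edges of my choosing:

```python
import pandas as pd

# Hypothetical continuous incomes, bucketed into intervals so the attribute
# becomes discrete and the joint distribution stays finite.
income = pd.Series([45_000, 120_000, 80_000, 650_000])
buckets = pd.cut(income,
                 bins=[0, 100_000, 500_000, 10_000_000],
                 labels=["low", "mid", "high"])
print(buckets.tolist())        # ['low', 'mid', 'low', 'high']

# Combinations after bucketing income into 3 intervals:
print(3 * 4 * 2 * 3)           # eye * hair * allergy * income = 72
```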

Introduction to Machine Learning (Part-3)

Probability is a way to model uncertain events, and machine learning tries to build a process similar to the real-world process. This is a basic model using machine learning: if we have some missing data, then through the learned function we can recover the missing data. Example: applying for a bank loan is a real-world process; it can be based on age, income, education, marital status, whether the applicant has defaulted, etc. Here we have to apply the joint probability distribution, and from the function we predict a probability that approximates the actual probability. Principal Components: these are the factors of the process, on which the process is based. Logistic Regression: it is used in the scenario where the input is a continuous variable and the output is a categorical variable. It can be represented as:
P(Y=1|X) = 1 / (1 + e^(−Σᵢ₌₁ⁿ βᵢxᵢ))
where the βᵢ are the parameters. We have to learn the function which minimizes the error.
P(Y=0|X) = 1 − P(Y=1|X)
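A minimal sketch of that sigmoid, using the standard sign convention above (the parameter and input values are made up; some texts fold the minus sign into the βᵢ instead):

```python
import numpy as np

def p_y1(x, beta):
    """P(Y=1 | x) using the standard sigmoid 1 / (1 + e^(-beta . x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

# Hypothetical learned parameters for two continuous inputs.
beta = np.array([0.8, -0.3])
x = np.array([1.2, 0.5])

p1 = p_y1(x, beta)
print(p1, 1.0 - p1)            # P(Y=1|X) and P(Y=0|X) = 1 - P(Y=1|X)
```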