Saturday, February 15, 2020

Machine Learning - Linear Regression


What is a Linear Regression?

Linear Equation:

Y = a + bX + e

Y = Dependent Variable

X = Independent Variable

a = y-intercept

b = slope

e = error term / residual

Interpretation of b: a one-unit change in X changes the average (expected) value of Y by b units.

Interpretation of a: the y-intercept often has no practical meaning, because X = 0 may be beyond the scope of the model.

R-square: an R-square of 0.8 means that 80% of the variation in the dependent variable (e.g., sales) can be explained by the independent variables (e.g., advertising expenditure).

ANOVA (F-test): H0: all regression coefficients in the population are zero.
                Ha: at least one of the regression coefficients is non-zero.

T-test: H0: an individual regression coefficient in the population is zero.
        Ha: that individual regression coefficient is non-zero.
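
A minimal sketch of these pieces in Python (the statsmodels library and the synthetic data below are my own illustrative choices, not part of the original post): fitting a simple linear regression and reading off a, b, R-square, the ANOVA F-test, and the coefficient t-tests.

import numpy as np
import statsmodels.api as sm

# Synthetic example: y = 2 + 0.5*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)           # independent variable X
y = 2 + 0.5 * x + rng.normal(0, 10, 200)    # dependent variable Y plus error term

X = sm.add_constant(x)                      # adds the intercept column (a)
model = sm.OLS(y, X).fit()

print(model.params)                  # [a, b]: intercept and slope
print(model.rsquared)                # R-square: variation in Y explained by X
print(model.fvalue, model.f_pvalue)  # ANOVA F-test: are all coefficients zero?
print(model.pvalues)                 # t-tests: is each individual coefficient zero?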

Assumptions of linear regression

  • The error term is normally distributed with zero mean and finite variance: Error ~ N(0, σ²).
  • For each fixed value of X, the distribution of Y is normal. The means of all these normal distributions of Y, given X, lie on the fitted regression line (plane).
  • The variance of the error term is constant (homoscedasticity). This variance does not depend on the values assumed by X.
  • Error terms are uncorrelated. In other words, the observations are drawn independently.
  • The error term is uncorrelated with X.
  • The relationship between Y and X is linear.
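
Some of these assumptions can be checked from the residuals of the fitted model. A small sketch, reusing the statsmodels model from the sketch above (the specific tests chosen here are common choices, not the only possible ones):

from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

resid = model.resid                                # residuals from the fitted OLS model above

print(stats.shapiro(resid))                        # normality of the error term
print(het_breuschpagan(resid, model.model.exog))   # constant variance (homoscedasticity)
print(durbin_watson(resid))                        # values near 2 suggest uncorrelated errors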
Unusual and Influential data

A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis.
  • Outliers: In the linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.
  • Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable.
  • Influence: An observation is said to be influential if removing the observation substantially changes the estimates of coefficients. Influence can be thought of as the product of leverage and outlierness.
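
These diagnostics can also be computed from the fitted statsmodels model above; a short sketch (the cutoffs below are common rules of thumb, not hard thresholds):

# Continuing from the fitted OLS model above
influence = model.get_influence()

studentized = influence.resid_studentized_external  # large |value| -> outlier (large residual)
leverage = influence.hat_matrix_diag                # far from the mean of X -> high leverage
cooks_d, _ = influence.cooks_distance               # influence ~ leverage x outlierness

n, p = int(model.nobs), int(model.df_model) + 1
print("possible outliers:", (abs(studentized) > 2).sum())
print("high-leverage points:", (leverage > 2 * p / n).sum())
print("influential points (Cook's D > 4/n):", (cooks_d > 4 / n).sum())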








Machine Learning - Regression

Regression

Let's understand what regression is in machine learning.

A case for regression

The Advertising data displays sales for a particular product as a function of advertising budgets for TV, radio and newspaper media. In our role as statistical consultants, we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales.

Here are a few important questions that we might seek to address:


  1. Is there a relationship between the advertising budget and sales?
  2. How strong is the relationship between the advertising budget and sales?
  3. Which media contribute to sales?
  4. How accurately can we estimate the effects of each medium on sales?
  5. Is the relationship linear?
If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, it may still be possible to transform the predictors or the response so that linear regression can be used.
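
A sketch of how such an analysis could start in Python (the file name Advertising.csv and the column names TV, radio, newspaper, and sales are assumptions about how the data are stored):

import pandas as pd
import statsmodels.formula.api as smf

# Assumed file and column names for the Advertising data
ads = pd.read_csv("Advertising.csv")

# Multiple linear regression of sales on all three media budgets
fit = smf.ols("sales ~ TV + radio + newspaper", data=ads).fit()

print(fit.summary())   # coefficients and t-tests: which media contribute to sales?
                       # R-squared: how strong is the relationship?
                       # F-statistic: is there any relationship at all?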






Machine Learning - Hypothesis Testing

Important terminologies -- Hypothesis testing

  • Population = all possible values
  • Sample = a portion of the population
  • Parameter = a characteristic of a population, e.g., the population mean μ
  • Statistic = calculated from data from the sample, e.g., sample mean

Hypothesis Testing

Hypothesis testing tests a claim about a population parameter (characteristic) using evidence from sample data.

Steps of hypothesis testing:

A) State the null and alternative hypotheses
B) Calculate the test statistic
C) Decide the level of significance
D) Compute the p-value and make a decision

Null and Alternative Hypotheses

  • Convert the research question into null and alternative hypotheses
  • The null hypothesis (H0) is a claim of "no difference in the population"
  • The alternative hypothesis (Ha) claims "H0 is false"
  • Collect data and seek evidence against H0 as a way of bolstering Ha (deduction)

Example: "Body Weight"
The problem: in the 1970s, 20-24-year-old men joining the army had an average body weight of 65 kg. The standard deviation of body weight was 10 kg. We test whether the average body weight now differs.

The null hypothesis is H0: μ = 65 (no difference).
The alternative hypothesis can be either Ha: μ > 65 (one-sided test) or Ha: μ ≠ 65 (two-sided test).

Sampling Distributions of Mean

For a sample of size n drawn from a population with mean μ and known standard deviation σ, the sample mean x̄ has a sampling distribution with mean μ and standard error σ/√n, and this distribution is approximately normal.

Errors in hypothesis testing

A Type I error is rejecting H0 when H0 is true; its probability is α. A Type II error is failing to reject H0 when H0 is false; its probability is β.

Test Statistic

This is an example of a one-sample test of a mean when sigma is known. The test statistic is z = (x̄ − μ0) / (σ/√n), where x̄ is the sample mean, μ0 the hypothesized mean, σ the known population standard deviation, and n the sample size.

P-value
The p-value is the probability of getting a test statistic as extreme as, or more extreme than, the observed value when H0 is true.

Example: the one-sided p-value for a z statistic of 0.6 is P(Z ≥ 0.6) ≈ 0.27.

Interpretation 

The p-value answers the question: what is the probability of getting the observed test statistic, or one more extreme, when H0 is true?

Thus, smaller and smaller p-values provide stronger and stronger evidence against H0.

Small p-value => strong evidence that H0 is false and Ha is true.

Decision Rule

alpha (α) = probability of rejecting H0 when it is true

Set the alpha threshold (e.g., 0.01, 0.05, or 0.10).

Reject H0 and retain Ha when the p-value is less than or equal to alpha.
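
A small sketch of the whole procedure in Python for the body-weight example (the sample mean of 66.5 kg and the sample size of 64 are invented numbers, used only to show the calculation):

import numpy as np
from scipy import stats

# H0: mu = 65 kg, sigma known to be 10 kg
mu0, sigma = 65, 10
n, xbar = 64, 66.5                             # hypothetical sample size and sample mean

z = (xbar - mu0) / (sigma / np.sqrt(n))        # one-sample z statistic
p_one_sided = stats.norm.sf(z)                 # P(Z >= z) for Ha: mu > 65
p_two_sided = 2 * stats.norm.sf(abs(z))        # for Ha: mu != 65

alpha = 0.05
print(f"z = {z:.2f}, one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
print("reject H0" if p_two_sided <= alpha else "fail to reject H0")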




Saturday, February 8, 2020

Machine Learning - K-fold cross validation Python

K-fold cross-validation:

In machine learning, we cannot simply fit a model on the training data and claim that it will work accurately on real (unseen) data. We must make sure that our model has captured the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique.

Cross-validation is a technique in which we train our model using the subset of the data-set and then evaluate using the complementary subset of the data-set.

The three steps involved in cross-validation are as follows:

Reserve some portion of the sample data-set.
Train the model using the rest of the data-set.
Test the model using the reserved portion of the data-set.

Methods of Cross Validation

Validation

In this method, we train on 50% of the given data-set and use the remaining 50% for testing. The major drawback is that, because we train on only 50% of the data, the remaining 50% may contain important information that the model never sees during training, i.e., higher bias.

LOOCV (Leave One Out Cross Validation)
In this method, we train on the whole data-set except for a single data point, which is held out for testing, and we repeat this for each data point. It has both advantages and disadvantages.
An advantage of this method is that we make use of all data points, and hence the bias is low.
The major drawback is that it leads to higher variation in the test estimate, because we test against a single data point each time; if that data point is an outlier, the variation can be even higher. Another drawback is that it takes a lot of execution time, since it iterates as many times as there are data points.

K-Fold Cross Validation
In this method, we split the data-set into k subsets (known as folds), train on k−1 of them, and leave one fold out for evaluating the trained model. We iterate k times, with a different fold reserved for testing each time.
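
A minimal sketch of k-fold cross-validation with scikit-learn (the diabetes data set and the plain linear regression model are arbitrary choices for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat 5 times
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

print(scores)          # one score per fold
print(scores.mean())   # average performance across the k folds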








ML and AI - Regularization Python

What is Regularization in Machine Learning?

Regularization - Regularization in machine learning is a process of introducing additional information in order to prevent overfitting.


  • Consider two candidate functions, say a simpler green one and a more wiggly blue one, that both incur zero loss on the given data points.
  • Regularization will induce the model to prefer the simpler green function, which may generalize better to unseen data.



Use of  regularization in classification:

One particular use of regularization is in the field of classification. Empirical learning of classifiers (learning from a finite data set) is always an underdetermined problem, because, in general, we are trying to infer a function of any x given only some examples

x1, x2, x3, ..., xn.

A regularization term (or regularizer) R(f) is added to the loss function:

minimize over f: Σ V(f(xi), yi) + λR(f)

where V is an underlying loss function that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss, and λ (lambda) is a parameter that controls the importance of the regularization term. R(f) is typically chosen to impose a penalty on the complexity of f.
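
A brief sketch of the same idea with scikit-learn (the synthetic data and the chosen alpha values are illustrative assumptions; alpha plays the role of lambda in the formula above):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 3 of 20 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)    # R(f) = sum of squared coefficients (L2 penalty)
lasso = Lasso(alpha=0.1).fit(X, y)    # R(f) = sum of absolute coefficients (L1 penalty)

print(np.abs(ridge.coef_).round(2))   # L2 shrinks coefficients toward zero
print(np.abs(lasso.coef_).round(2))   # L1 drives many coefficients exactly to zero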




Wednesday, February 5, 2020

Data Science, Machine Learning and Artificial Intelligence using Python part-3

Overfitting/underfitting

A linear function is not sufficient to fit the training samples; this is called underfitting.

A polynomial of degree 2 approximates the true function almost perfectly.
However, for higher polynomial degrees, the model will overfit the training data.
We evaluate overfitting/underfitting quantitatively using cross-validation: we calculate the error on the test set and compare it with the error on the training set.
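
A hedged sketch of this comparison (modelled on scikit-learn's underfitting/overfitting example; the true function and the degrees tried are illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples from a non-linear true function
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.1, size=30)
X = x.reshape(-1, 1)

for degree in (1, 2, 15):    # 1 underfits, 2 is about right, 15 overfits
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    train_mse = np.mean((model.fit(X, y).predict(X) - y) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, CV MSE = {cv_mse:.3f}")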



High Bias versus High Variance

High Bias - Both training and test errors are high, and the two errors are roughly the same.
High Variance - The training error is low, but the test error is very high compared to the training error.








Data Science, Machine Learning and Artificial Intelligence using Python part-2

The bias-variance tradeoff

In machine learning, the bias-variance tradeoff is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training data.

High bias (underfitting) can cause an algorithm to miss the relevant relations between features and target outputs.

High variance (overfitting) can cause an algorithm to model the random noise in the training data, rather than the intended outputs.

Variance

Variance refers to the amount by which "f" would change if we estimated it using different training data sets.

Since the training data are used to fit the statistical learning method, different training data sets will result in a different "f". But ideally, the estimate for "f" should not vary too much between training sets.

However, if a method has high variance, then small changes in the training data can result in large changes in "f". In general, more flexible statistical methods have higher variance.

Bias 

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

For example, linear regression assumes that there is a linear relationship between Y and X1, X2, X3, ..., Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, so performing linear regression will undoubtedly result in some bias in the estimate of "f".

Bias Variance trade-off 

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease.

The relative rate of change of these two quantities determines whether the test error increases or decreases.

As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test error declines.

However, at some point, increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens, the test error increases.