In this blog I am capturing some quick reference points about the linear regression machine learning algorithm. It can be referred to at any time to refresh your understanding of the algorithm.
In general, there are two types of linear regression: simple and multiple. Let's go through the key points for each of them.
Simple Linear Regression
• Models the relationship between a dependent variable and one independent variable using a straight line.
• The standard equation of the regression line is:
  Y = β₀ + β₁X, where β₀ is the intercept and β₁ is the slope.
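As a quick illustration, here is a minimal sketch of fitting such a line with scikit-learn; the data points are made up for demonstration.

```python
# A minimal sketch of fitting Y = β₀ + β₁X with scikit-learn;
# the data points below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

model = LinearRegression().fit(X, y)
print("intercept (β₀):", model.intercept_)
print("slope     (β₁):", model.coef_[0])
```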
• The best-fit line is found by minimizing the RSS (Residual Sum of Squares).
• RSS is the sum of the squared residuals over all data points in the plot.
• The residual for a data point is the actual value of the dependent variable minus its predicted value.
• The strength of the linear regression model can be assessed by R², the Coefficient of Determination.
• R² or Coefficient of Determination:
o R² expresses what portion of the variation in the data is explained by the developed model.
o It always takes a value between 0 and 1.
o In general terms, it measures how well actual outcomes are replicated by the model, based on the proportion of the total variation in outcomes that the model explains.
o Overall, the higher the R-squared, the better the model fits your data.
o Mathematically: R² = 1 − (RSS / TSS)
• RSS (Residual Sum of Squares):
o It is the total sum of squared errors across the whole sample.
o It measures the difference between the expected and the actual output.
o A small RSS indicates a tight fit of the model to the data.
o Mathematically: RSS = Σ(yᵢ − ŷᵢ)²
• TSS (Total Sum of Squares):
o It is the sum of squared deviations of the data points from the mean of the response variable.
o TSS gives the total deviation of all the points from the mean line.
o Mathematically: TSS = Σ(yᵢ − ȳ)²
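Putting the three definitions together, here is a minimal sketch of computing RSS, TSS, and R² by hand with numpy; the actual and predicted values are made up for illustration.

```python
# A minimal sketch of computing RSS, TSS, and R² directly from their
# definitions; `y` (actual) and `y_pred` (predicted) are made-up values.
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # actual values
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # model predictions

residuals = y - y_pred                # actual minus predicted
rss = np.sum(residuals ** 2)          # Residual Sum of Squares
tss = np.sum((y - y.mean()) ** 2)     # Total Sum of Squares
r2 = 1 - rss / tss                    # Coefficient of Determination
print(f"RSS={rss:.3f}  TSS={tss:.3f}  R²={r2:.3f}")
```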
Multiple Linear Regression
• Models the relationship between one dependent variable and several independent variables (explanatory variables).
• Steps to follow (a code sketch appears after this list):
o Prepare the data for analysis and build a model containing all the variables.
o Check the VIFs and the model summary. Remove variables that have a high VIF (> 2, as a general rule) and are insignificant (p > 0.05), one by one.
o If the model has variables that have a high VIF but are significant, check and remove the other insignificant variables first.
o At this point, all remaining variables should be significant.
o If the number of variables is still high, remove them in order of insignificance until you arrive at a limited set of variables that explains the model well.
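Here is a minimal sketch of that elimination loop using statsmodels; the predictor DataFrame `X` and response Series `y` are assumed to already exist, and the cutoffs are the rules of thumb from the list above.

```python
# A minimal sketch of the elimination loop described above, assuming a
# predictor DataFrame `X` and a response Series `y` already exist.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def eliminate(X, y, p_cutoff=0.05, vif_cutoff=2.0):
    X = X.copy()
    while True:
        Xc = sm.add_constant(X)
        model = sm.OLS(y, Xc).fit()
        pvals = model.pvalues.drop("const")
        vifs = pd.Series(
            [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
            index=Xc.columns,
        ).drop("const")
        # First drop variables that are both high-VIF and insignificant,
        # then any remaining insignificant ones, one at a time.
        bad = pvals[(pvals > p_cutoff) & (vifs > vif_cutoff)]
        if bad.empty:
            bad = pvals[pvals > p_cutoff]
        if bad.empty:
            return model  # all remaining variables are significant
        X = X.drop(columns=bad.idxmax())  # drop the least significant one
```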
• The p-value of a coefficient is the probability of observing an estimate at least as extreme as the one obtained, assuming the null hypothesis (that the coefficient is zero) is true.
• Independent variables that have a low p-value are likely to be a meaningful addition to the model.
• Dummy Variables: Categorical variables need to be converted to numeric form (dummy variables) before they can be used in regression modeling, as sketched below.
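A minimal sketch of dummy-variable encoding with pandas; the `fuel_type` column and its categories are made up for illustration.

```python
# A minimal sketch of dummy-variable encoding; the `fuel_type` column
# and its categories are hypothetical.
import pandas as pd

df = pd.DataFrame({"fuel_type": ["petrol", "diesel", "cng", "petrol"]})

# drop_first=True keeps k-1 dummy columns for k categories, which avoids
# perfect multicollinearity with the model's intercept.
dummies = pd.get_dummies(df["fuel_type"], drop_first=True)
df = pd.concat([df.drop(columns="fuel_type"), dummies], axis=1)
print(df)
```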
• R-squared: always increases (or stays the same) as the number of variables increases.
• Adjusted R-squared: imposes a penalty when the increase in R-squared from adding a variable is small. Mathematically: Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors.
• Adjusted R-squared is therefore a better metric than R-squared for assessing how well the model fits the data.
• Multicollinearity is a phenomenon where two or more independent variables in a multiple regression model are correlated with each other. It makes it difficult to assess the effect of the individual predictors. A better way to assess multicollinearity is to compute the variance inflation factor (VIF): VIFᵢ = 1 / (1 − Rᵢ²), where Rᵢ² is the R² obtained by regressing the i-th predictor on all the other predictors.
• To summarize: the higher the VIF, the higher the multicollinearity.
• Model Validation: The R-squared between the predicted values and the actual values on the test set should be high, as sketched below.
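A minimal sketch of validating the model on a held-out test set with scikit-learn; `X` and `y` are assumed to already exist.

```python
# A minimal sketch of test-set validation, assuming `X` and `y` exist.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# R-squared between predicted and actual values on unseen data
print("test R²:", r2_score(y_test, model.predict(X_test)))
```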
• Variable Selection Methods (see the sketch after this list):
o Backward selection
o Forward selection
o Stepwise selection
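As one concrete option, scikit-learn's SequentialFeatureSelector implements greedy forward and backward selection; a minimal sketch, assuming `X` and `y` already exist and that we want to keep 5 features:

```python
# A minimal sketch of forward selection with scikit-learn's
# SequentialFeatureSelector; `X` and `y` are assumed to exist.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# direction="forward" adds features one at a time;
# direction="backward" removes them one at a time instead.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward"
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```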