Linear Regression
When you have only two or a few data points, it's easy to see the trend just by connecting them. But when you have so many data points that the trend is tangled or hard to see, we can use linear regression. There are three different ways to approach it: plain OLS, OLS using the SVD, and the statsmodels library.
1. OLS:
To fit a line to a couple of points, we usually use y = mx + b. But, as mentioned, we are dealing with many data points, so let the data set be D = {(x1, y1), ..., (xn, yn)}. Each point then satisfies yi = m·xi + b + εi, which can be written in vector form as y = XB + ε, where B collects the slope and intercept. Here εi is the residual, or error: the vertical distance from (xi, yi) to the line y = mx + b. As usual, we want to solve for the least squares estimator of B, which is (XᵀX)⁻¹Xᵀy.
Then I coded the equation; let's see how it works on an example below:
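The original example isn't reproduced here, so the following is a minimal sketch of the normal-equation estimator with made-up data (the x and y values are my own placeholders):

```python
import numpy as np

# Hypothetical example data, roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Design matrix X: one column for x, one column of ones for the intercept b.
X = np.column_stack([x, np.ones_like(x)])

# Least squares estimator: B = (X^T X)^{-1} X^T y
B = np.linalg.inv(X.T @ X) @ X.T @ y
m, b = B  # fitted slope and intercept
```

In practice `np.linalg.solve(X.T @ X, X.T @ y)` is preferred over forming the explicit inverse, but the line above mirrors the equation as written.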
2. OLS using SVD:
I wish, and maybe everyone does, that we could use the same equation above to get the least squares estimator all the time, but the world always drags you down. Many times we won't be able to compute (XᵀX)⁻¹, because X doesn't always have full rank. But it's okay, because we can use the SVD. As you know, the SVD form of X is X = UΣVᴴ. When you replace X with that form, the least squares estimator becomes VΣ⁻¹Uᵀy, where Σ⁻¹ is taken as the pseudoinverse Σ⁺ (inverting only the nonzero singular values) when X is rank deficient.
I coded the new least squares estimator and applied it to another example below:
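Again, the original example is not shown here, so this is a sketch under my own assumptions: a deliberately rank-deficient X (the second column duplicates the first), where the normal equations would fail but the SVD estimator still works:

```python
import numpy as np

# Hypothetical rank-deficient data: column 2 duplicates column 1,
# so X^T X is singular and (X^T X)^{-1} does not exist.
X = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 1.0],
              [3.0, 3.0, 1.0],
              [4.0, 4.0, 1.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 2x + 1, so an exact fit exists

# Thin SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Pseudoinverse of Sigma: invert only singular values above a tolerance.
tol = max(X.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)

# Least squares estimator: B = V Sigma^+ U^T y (minimum-norm solution)
B = Vt.T @ (s_inv * (U.T @ y))
```

This is essentially what `np.linalg.pinv(X) @ y` or `np.linalg.lstsq(X, y)` compute internally.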
3. statsmodels:
Luckily, we do not have to code those equations ourselves to do linear regression; we have a library called statsmodels. From the data, you set the y value, which is the dependent variable, the value you want to focus on. Next, you set temp_X to the other variables whose relationship with that dependent variable you want to examine. I used two for loops, but you could use itertools instead, to get the OLS fit for each variable. In the code below, as I looped through the variables, I also kept track of the highest R-squared. R-squared, the coefficient of determination, indicates how well the model fits the data: the larger the value, the better the fit. However, a high R-squared doesn't automatically mean the model is flawlessly accurate: it can take negative values, reward overfitting, and punish under-fit models.
So, the R-squared I got is not that great, given that it normally falls between 0 and 1. On the other hand, when you check the speed, the mathematical approaches from 1 and 2 are faster than statsmodels, which is interesting.




