import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
An Extension for Linear Regression Models
Jesus LM
Mar, 2024
Oftentimes we’ll encounter data where the relationship between the feature(s) and the response variable can’t be described well by a straight line. In these cases, we should use polynomial regression.
An example of a polynomial could be:

$3x^4 - 7x^3 + 2x^2 + 11$
Terminology
Degree of a polynomial: The highest power in the polynomial. In our example, 4
Coefficient: Each constant (3, -7, 2, 11) in the polynomial is a coefficient. In polynomial regression, these coefficients will be estimated
Leading term: The term with the highest power ($3x^4$). It determines the behavior of the polynomial’s graph
Leading coefficient: The coefficient of the leading term (3)
Constant term: The term without $x$ (here, 11). It is the y-intercept: no matter what the value of $x$ is, the constant term remains the same
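To make the terminology concrete, here is a small sketch (using NumPy, which we already imported) that evaluates our example polynomial; the test value x = 2 is just an assumption for illustration:

```python
import numpy as np

# Coefficients of 3x^4 - 7x^3 + 2x^2 + 0x + 11 in standard form
# (highest degree first); note the 0 for the missing x term
coeffs = [3, -7, 2, 0, 11]

# np.polyval evaluates the polynomial at a given x
value = np.polyval(coeffs, 2)    # 3*16 - 7*8 + 2*4 + 11 = 11
at_zero = np.polyval(coeffs, 0)  # at x = 0 only the constant term survives: 11
```

Evaluating at x = 0 shows why the constant term is the y-intercept: every other term vanishes.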
Let’s return to $3x^4 - 7x^3 + 2x^2 + 11$. If we write a polynomial’s terms from the highest degree term to the lowest degree term, it’s called the polynomial’s standard form.
In the context of machine learning, you’ll often see it reversed, written from the lowest degree term to the highest:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$

where:
$\beta_0$ is the constant term (the intercept). The other βs are the coefficients/parameters we’d like to find when we train our model on the available x and y values
It’s not a coincidence: polynomial regression is a linear model used for describing non-linear relationships
How is this possible? The magic lies in creating new features by raising the original features to a power
Linear regression is just a first-degree polynomial. Polynomial regression uses higher-degree polynomials. Both of them are linear models, but the former results in a straight line, while the latter gives you a curved line.
Degree = 2 means that we want to work with a 2nd degree polynomial.
It may seem confusing: why are we importing the LinearRegression module? 😮
We just have to remember that polynomial regression is a linear model; that’s why we import LinearRegression. 🙂
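The fitting cell itself isn’t reproduced in this export. A minimal sketch of the idea, assuming a single feature x and an invented quadratic target (the actual data isn’t shown here), could look like this:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A single feature and an invented quadratic target (assumed for illustration)
x = np.arange(1, 31).reshape(-1, 1)
y = 0.2 * x.ravel() ** 2 + 0.5 * x.ravel() + 1.0

# degree=2 expands x into [x, x^2]; include_bias=False lets
# LinearRegression handle the intercept itself
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(x)

poly_reg_model = LinearRegression()
poly_reg_model.fit(poly_features, y)
predictions = poly_reg_model.predict(poly_features)
```

The model is still linear in its parameters; the curvature comes entirely from the new x² column.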
LinearRegression()
array([ 1.70806452, 3.04187987, 4.70292388, 6.69119657,
9.00669792, 11.64942794, 14.61938662, 17.91657397,
21.54098999, 25.49263467, 29.77150802, 34.37761004,
39.31094073, 44.57150008, 50.1592881 , 56.07430478,
62.31655014, 68.88602415, 75.78272684, 83.00665819,
90.55781821, 98.4362069 , 106.64182425, 115.17467027,
124.03474495, 133.22204831, 142.73658033, 152.57834101,
162.74733037, 173.24354839])
| | x_1 | x_2 | y |
|---|---|---|---|
| 0 | 16.243454 | 13.413857 | 570.412369 |
| 1 | 6.117564 | 36.735231 | 111.681987 |
| 2 | 5.281718 | 12.104749 | 62.392124 |
| 3 | 10.729686 | 17.807356 | 303.538953 |
| 4 | 8.654076 | 32.847355 | 151.109269 |
| ... | ... | ... | ... |
| 95 | 0.773401 | 48.823150 | -0.430738 |
| 96 | 3.438537 | 18.069578 | 44.308720 |
| 97 | 0.435969 | 12.608466 | 19.383456 |
| 98 | 6.200008 | 24.328550 | 78.371729 |
| 99 | 6.980320 | 31.333263 | 132.108914 |
100 rows × 3 columns
LinearRegression()
The formula for Root Mean Square Error is:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

The smaller the RMSE metric, the better the model.
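As a quick check of the formula, RMSE can be computed directly with NumPy or via scikit-learn’s mean_squared_error; the y values here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5])  # illustrative observed values
y_pred = np.array([2.5, 5.0, 8.0])  # illustrative predictions

# Direct implementation of the formula
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Equivalent via scikit-learn's mean_squared_error
rmse_sk = np.sqrt(mean_squared_error(y_true, y_pred))
```

Both routes give the same number, which is why you’ll see np.sqrt(mean_squared_error(...)) used in the cells below.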
Polynomial regression
RMSE: 20.94
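The cell that produced the polynomial regression RMSE above isn’t included in this export. A sketch of the same steps on synthetic stand-in data (the real dataset, and therefore the 20.94 figure, is not reproduced here) might look like:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)  # synthetic stand-in data
X = rng.uniform(0, 20, size=(100, 2))
y = 2 * X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1] + rng.normal(0, 5, 100)

# Expand the two features into [x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(X)

# Split the *expanded* features, then fit and score a linear model
X_train, X_test, y_train, y_test = train_test_split(
    poly_features, y, test_size=0.3, random_state=42)

poly_reg_model = LinearRegression()
poly_reg_model.fit(X_train, y_train)
poly_reg_y_predicted = poly_reg_model.predict(X_test)
poly_reg_rmse = np.sqrt(mean_squared_error(y_test, poly_reg_y_predicted))
print(f'Polynomial regression\nRMSE: {poly_reg_rmse:,.2f}')
```

Note that here the split is done on poly_features; the linear regression cell below deliberately splits the raw X instead, which is the comparison the next section discusses.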
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create linear regression instance
lin_reg_model = LinearRegression()

# Fit regression model
lin_reg_model.fit(X_train, y_train)

# Create predictions
lin_reg_y_predicted = lin_reg_model.predict(X_test)

# Calculate RMSE
lin_reg_rmse = np.sqrt(mean_squared_error(y_test, lin_reg_y_predicted))

# Print results
print(f'Linear regression\nRMSE: {lin_reg_rmse:,.2f}')
Linear regression
RMSE: 62.30
In the train_test_split method we use X instead of poly_features, and it’s for a good reason.
X contains our two original features (x_1 and x_2), so our linear regression model takes the form of:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
On the other hand, poly_features contains new features as well, created out of x_1 and x_2, so our polynomial regression model (based on a 2nd degree polynomial with two features) looks like this:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$

This is because poly.fit_transform(X) added three new features to the original two ($x_1$ and $x_2$): $x_1^2$, $x_1 x_2$ and $x_2^2$.
$x_1^2$ and $x_2^2$ need no explanation, as we’ve already covered how they are created in the “Coding a polynomial regression model with scikit-learn” section.
What’s more interesting is $x_1 x_2$ – when two features are multiplied by each other, it’s called an interaction term. An interaction term accounts for the fact that one variable’s value may depend on another variable’s value. poly.fit_transform() automatically created this interaction term for us, isn’t that cool? 🙂
The RMSE for the polynomial regression model is 20.94, while the RMSE for the linear regression model is 62.3. The polynomial regression model performs almost 3 times better than the linear regression model!
In this notebook, we have covered the basics of polynomial regression, and we have used the RMSE metric to compare the performance of different ML models.
We used a 2nd degree polynomial for this example. Naturally, we should always test and experiment before deploying a model, to find what degree of polynomial performs best.
This notebook is mainly based on: Ujhelyi T., Polynomial Regression.
Jesus LM
Economist & Data Scientist