import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
An Extension for Linear Regression Models
Jesus LM
Mar, 2024
Oftentimes we’ll encounter data where the relationship between the feature(s) and the response variable can’t be described well by a straight line. In these cases, we should use polynomial regression.
An example of a polynomial could be:

$3x^4 - 7x^3 + 2x^2 + 11$
Terminology
Degree of a polynomial: The highest power in the polynomial. In our example, 4
Coefficient: Each constant (3, -7, 2, 11) in the polynomial is a coefficient. In polynomial regression, these coefficients will be estimated
Leading term: The term with the highest power ($3x^4$). It determines the behavior of the polynomial’s graph
Leading coefficient: The coefficient of the leading term (3)
Constant term: The term without $x$ (here, 11). It is the y-intercept: no matter what the value of $x$ is, the constant term remains the same
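To make the terminology concrete, here is a small sketch (using NumPy, which we already imported) that evaluates our example polynomial; the test value x = 2 is just an assumption for illustration:

```python
import numpy as np

# Coefficients of 3x^4 - 7x^3 + 2x^2 + 0x + 11 in standard form
# (highest degree first); note the 0 for the missing x term
coeffs = [3, -7, 2, 0, 11]

# np.polyval evaluates the polynomial at a given x
value = np.polyval(coeffs, 2)    # 3*16 - 7*8 + 2*4 + 11 = 11
at_zero = np.polyval(coeffs, 0)  # at x = 0 only the constant term survives: 11
```

Evaluating at x = 0 shows why the constant term is the y-intercept: every other term vanishes.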
Let’s return to $3x^4 - 7x^3 + 2x^2 + 11$. If we write a polynomial’s terms from the highest degree term to the lowest degree term, it’s called the polynomial’s standard form.
In the context of machine learning, you’ll often see it reversed, written from the lowest degree term to the highest:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$

where:
$\beta_0$ is the constant term (the intercept). The other βs are the coefficients/parameters we’d like to find when we train our model on the available x and y values
It’s not a coincidence: polynomial regression is a linear model used for describing non-linear relationships
How is this possible? The magic lies in creating new features by raising the original features to a power
Linear regression is just a first-degree polynomial. Polynomial regression uses higher-degree polynomials. Both of them are linear models, but the former results in a straight line, while the latter gives you a curved line.
Degree = 2 means that we want to work with a 2nd degree polynomial.
It may seem confusing: why are we importing the LinearRegression module? 😮
We just have to remember that polynomial regression is a linear model; that’s why we import LinearRegression. 🙂
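The fitting cell itself isn’t reproduced in this export. A minimal sketch of the idea, assuming a single feature x and an invented quadratic target (the actual data isn’t shown here), could look like this:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A single feature and an invented quadratic target (assumed for illustration)
x = np.arange(1, 31).reshape(-1, 1)
y = 0.2 * x.ravel() ** 2 + 0.5 * x.ravel() + 1.0

# degree=2 expands x into [x, x^2]; include_bias=False lets
# LinearRegression handle the intercept itself
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(x)

poly_reg_model = LinearRegression()
poly_reg_model.fit(poly_features, y)
predictions = poly_reg_model.predict(poly_features)
```

The model is still linear in its parameters; the curvature comes entirely from the new x² column.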
LinearRegression()
array([ 1.70806452, 3.04187987, 4.70292388, 6.69119657,
9.00669792, 11.64942794, 14.61938662, 17.91657397,
21.54098999, 25.49263467, 29.77150802, 34.37761004,
39.31094073, 44.57150008, 50.1592881 , 56.07430478,
62.31655014, 68.88602415, 75.78272684, 83.00665819,
90.55781821, 98.4362069 , 106.64182425, 115.17467027,
124.03474495, 133.22204831, 142.73658033, 152.57834101,
162.74733037, 173.24354839])
| | x_1 | x_2 | y |
|---|---|---|---|
| 0 | 16.243454 | 13.413857 | 570.412369 |
| 1 | 6.117564 | 36.735231 | 111.681987 |
| 2 | 5.281718 | 12.104749 | 62.392124 |
| 3 | 10.729686 | 17.807356 | 303.538953 |
| 4 | 8.654076 | 32.847355 | 151.109269 |
| ... | ... | ... | ... |
| 95 | 0.773401 | 48.823150 | -0.430738 |
| 96 | 3.438537 | 18.069578 | 44.308720 |
| 97 | 0.435969 | 12.608466 | 19.383456 |
| 98 | 6.200008 | 24.328550 | 78.371729 |
| 99 | 6.980320 | 31.333263 | 132.108914 |
100 rows × 3 columns
LinearRegression()
The formula for Root Mean Square Error is:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

The smaller the RMSE metric, the better the model.
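As a quick check of the formula, RMSE can be computed directly with NumPy or via scikit-learn’s mean_squared_error; the y values here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5])  # illustrative observed values
y_pred = np.array([2.5, 5.0, 8.0])  # illustrative predictions

# Direct implementation of the formula
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Equivalent via scikit-learn's mean_squared_error
rmse_sk = np.sqrt(mean_squared_error(y_true, y_pred))
```

Both routes give the same number, which is why you’ll see np.sqrt(mean_squared_error(...)) used in the cells below.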
Polynomial regression
RMSE: 20.94
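The cell that produced the polynomial regression RMSE above isn’t included in this export. A sketch of the same steps on synthetic stand-in data (the real dataset, and therefore the 20.94 figure, is not reproduced here) might look like:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)  # synthetic stand-in data
X = rng.uniform(0, 20, size=(100, 2))
y = 2 * X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1] + rng.normal(0, 5, 100)

# Expand the two features into [x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(X)

# Split the *expanded* features, then fit and score a linear model
X_train, X_test, y_train, y_test = train_test_split(
    poly_features, y, test_size=0.3, random_state=42)

poly_reg_model = LinearRegression()
poly_reg_model.fit(X_train, y_train)
poly_reg_y_predicted = poly_reg_model.predict(X_test)
poly_reg_rmse = np.sqrt(mean_squared_error(y_test, poly_reg_y_predicted))
print(f'Polynomial regression\nRMSE: {poly_reg_rmse:,.2f}')
```

Note that here the split is done on poly_features; the linear regression cell below deliberately splits the raw X instead, which is the comparison the next section discusses.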
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create linear regression instance
lin_reg_model = LinearRegression()

# Fit regression model
lin_reg_model.fit(X_train, y_train)

# Create predictions
lin_reg_y_predicted = lin_reg_model.predict(X_test)

# Calculate RMSE
lin_reg_rmse = np.sqrt(mean_squared_error(y_test, lin_reg_y_predicted))

# Print results
print(f'Linear regression\nRMSE: {lin_reg_rmse:,.2f}')
Linear regression
RMSE: 62.30
In the train_test_split method we use X instead of poly_features, and it’s for a good reason.
X contains our two original features (x_1 and x_2), so our linear regression model takes the form of:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
On the other hand, poly_features contains new features as well, created out of x_1 and x_2, so our polynomial regression model (based on a 2nd degree polynomial with two features) looks like this:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$

This is because poly.fit_transform(X) added three new features to the original two ($x_1$ and $x_2$): $x_1^2$, $x_1 x_2$ and $x_2^2$.
$x_1^2$ and $x_2^2$ need no explanation, as we’ve already covered how they are created in the “Coding a polynomial regression model with scikit-learn” section.
What’s more interesting is $x_1 x_2$ – when two features are multiplied by each other, it’s called an interaction term. An interaction term accounts for the fact that one variable’s value may depend on another variable’s value. poly.fit_transform() automatically created this interaction term for us, isn’t that cool? 🙂
The RMSE for the polynomial regression model is 20.94, while the RMSE for the linear regression model is 62.3. The polynomial regression model performs almost 3 times better than the linear regression model!
In this notebook, we have covered the basics of polynomial regression, and we have used the RMSE metric to compare the performance of different ML models.
We used a 2nd degree polynomial for this example. Naturally, we should always test and experiment before deploying a model, to find what degree of polynomial performs best.
This notebook is mainly based on: Ujhelyi T., Polynomial Regression.
Jesus LM
Economist & Data Scientist