2.6 - Luis's Notes - First predictive model with scikit-learn¶
In this notebook, we will build our first predictive model using Python's powerful scikit-learn library. We will focus on Linear Regression, a fundamental algorithm in machine learning, and explore how scikit-learn simplifies the process of training, evaluating, and using models.
Throughout this notebook, we will cover the following steps:
- Import the required libraries: We will load the scikit-learn tools we need.
- Load and prepare the data: We will generate synthetic data for our example.
- Split the data into training and test sets: We will hold out part of the data so the model can be evaluated on examples it did not see during training.
- Train the model: We will fit the linear regression model to our training data.
- Evaluate the model: We will measure the model's performance using standard metrics.
- Visualize the results: We will plot the data and the regression line for visual understanding.
- Use the model to make predictions: We will make predictions with the trained model.
Step 1: Import the required libraries¶
In this section, we will import the essential scikit-learn functions and classes we need to build, train, evaluate, and use our linear regression model.
# Import LinearRegression to build the model
from sklearn.linear_model import LinearRegression
# Import train_test_split to split the data into training and test sets
from sklearn.model_selection import train_test_split
# Import mean_squared_error and r2_score to evaluate the model
from sklearn.metrics import mean_squared_error, r2_score
Step 2: Data loading and preparation¶
For this example, we will generate a synthetic dataset that simulates a linear relationship between an independent variable (X) and a dependent variable (y), adding a bit of noise to make the example more realistic. This allows us to control the characteristics of the data and better understand how the model works.
import numpy as np
# Generate synthetic data
m = 50 # Number of samples
np.random.seed(0) # For reproducibility
X = 10 * np.random.rand(m, 1) # m data points between 0 and 10
y = 4 + 3 * X + np.random.randn(m, 1) # Generate data from the line y = 4 + 3x, adding Gaussian noise
Step 3: Split the data into training and test sets¶
It is standard practice in machine learning to split the data into two sets:
- Training set: to train the model.
- Test set: to evaluate its performance on unseen data.
Common splits are 70/30, 75/25, and 80/20. We will use an 80% training and 20% test split.
This helps avoid overfitting and obtain a more realistic estimate of the model's generalization ability. We will use scikit-learn's train_test_split function to easily perform this split.
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
Training set size: 40
Test set size: 10
Step 4: Model training¶
Here is where we use scikit-learn's LinearRegression class. We will create an instance of this model and "train" it using the training set (X_train and y_train). The fit() method finds the coefficients of the regression line that best fits the training data.
# Create a Linear Regression model instance
model = LinearRegression()
# Train the model with the training data
model.fit(X_train, y_train)
print("Model trained successfully.")
print(f"Coefficient (weight): {model.coef_[0][0]}")
print(f"Intercept (bias): {model.intercept_[0]}")
Model trained successfully.
Coefficient (weight): 2.9820321495223094
Intercept (bias): 4.037097174962346
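As an aside not in the original notebook, the coefficients found by fit() can be cross-checked against an ordinary least-squares solve done directly with NumPy (here via np.linalg.lstsq on the data with a bias column of ones) — a sketch assuming the same synthetic data and split as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the synthetic data and split from Steps 2-3
np.random.seed(0)
X = 10 * np.random.rand(50, 1)
y = 4 + 3 * X + np.random.randn(50, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Least-squares solve with an explicit bias column: theta = argmin ||X_b @ theta - y||
X_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]
theta, *_ = np.linalg.lstsq(X_b, y_train, rcond=None)

model = LinearRegression().fit(X_train, y_train)
print(theta[0][0], model.intercept_[0])  # intercepts agree to numerical precision
print(theta[1][0], model.coef_[0][0])    # slopes agree to numerical precision
```

Both approaches solve the same least-squares problem, so the numbers should match up to floating-point precision.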
Step 5: Model evaluation¶
Once the model has been trained, we need to evaluate how well it performs. For regression, common metrics such as Mean Squared Error (MSE) and the Coefficient of Determination (R²) give us a quantitative measure of the accuracy of the model's predictions on the test set. A lower MSE and an R² closer to 1 indicate a better fit.
# Make predictions on the test set
y_pred = model.predict(X_test)
# Compute Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
# Compute Coefficient of Determination (R²)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Coefficient of Determination (R²): {r2}")
Mean Squared Error (MSE): 0.8375952851062938
Coefficient of Determination (R²): 0.9756924413920042
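As a sanity check (an addition to the notebook, not part of the original), both metrics can also be computed by hand with NumPy from their definitions — MSE as the mean squared residual, and R² as one minus the ratio of residual to total sum of squares — and compared against scikit-learn's functions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Recreate the data, split, and trained model from the previous steps
np.random.seed(0)
X = 10 * np.random.rand(50, 1)
y = 4 + 3 * X + np.random.randn(50, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE: mean of squared residuals
mse_manual = np.mean((y_test - y_pred) ** 2)
# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_test, y_pred)))  # True
print(np.isclose(r2_manual, r2_score(y_test, y_pred)))             # True
```

Seeing the formulas written out makes it clear why a lower MSE and an R² closer to 1 mean a better fit.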
Step 6: Visualization of the results¶
Visualization is a powerful tool to understand the model fit. We will plot the original data points (training and test) and the regression line learned by the model. This will allow us to visually see how well the line represents the relationship between X and y, and how it compares to the test data.
import matplotlib.pyplot as plt
# Visualize the training/test data and the regression line
plt.scatter(X_train, y_train, color='blue', label='Training Data')
plt.scatter(X_test, y_test, color='red', label='Test Data')
plt.plot(X, model.predict(X), color='green', label='Regression Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression with scikit-learn')
plt.legend()
plt.show()
Step 7: Use the model to make predictions¶
Finally, we use the trained model to make predictions on new values of the independent variable (X) that the model has not seen before. This simulates how the model would be used in a real-world scenario to predict future outcomes.
# Use the model to predict on new data
X_new = np.array([[2.5],[15]]) # New X values
y_new_pred = model.predict(X_new)
print(f"Prediction for X = {X_new[0][0]} is {y_new_pred[0][0]}")
print(f"Prediction for X = {X_new[1][0]} is {y_new_pred[1][0]}")
print("(note that 15 lies outside the range of the training data, since X was generated between 0 and 10, so this prediction is an extrapolation)")
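Because the fitted model is just a line, each prediction is simply intercept + coefficient × x. A small check (added here as an illustration, assuming the same data and split as above) confirms that predict() matches this formula:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the data, split, and trained model from the previous steps
np.random.seed(0)
X = 10 * np.random.rand(50, 1)
y = 4 + 3 * X + np.random.randn(50, 1)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

X_new = np.array([[2.5], [15]])
# predict() evaluates y = intercept + coef * x for each new input
manual = model.intercept_[0] + model.coef_[0][0] * X_new.ravel()
print(np.allclose(manual, model.predict(X_new).ravel()))  # True
```

This also makes the extrapolation caveat concrete: the line extends to any x, but its reliability outside the 0-10 training range is not guaranteed.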
Conclusion¶
In this notebook, we have implemented a linear regression model using the scikit-learn library. Unlike a manual implementation, scikit-learn greatly simplifies the process, providing efficient and optimized tools for:
- Data splitting: train_test_split makes it easy to separate the data into training and test sets, which is essential for good training and realistic evaluation.
- Model training: LinearRegression abstracts the mathematical and computational details of training, allowing us to fit the model with a simple call to fit().
- Model evaluation: Metrics such as mean_squared_error and r2_score give us a quantitative measure of the model's performance on unseen data.
Visualizing the results allows us to verify that the regression line fits the data reasonably well, and both the MSE and R² confirm the good fit.
In summary, scikit-learn is a powerful and easy-to-use tool for building and evaluating machine learning models, which speeds up development and allows us to focus on interpreting results and improving the model.