#@title
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Linear regressor

In this tutorial, we’ll use a NeoML linear regressor to process the Boston house prices dataset from scikit-learn. We’ll look for the best parameter configuration by trying out every combination over a fixed parameter grid.

The tutorial includes the following steps:

- Download the dataset
- Look for optimal parameters
- Evaluate the best model

## Download the dataset

Note: This section doesn’t have any NeoML-specific code. It just downloads the dataset from the internet. If you are not running this notebook, you may skip this section.

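# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston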

# Get data
X, y = load_boston(return_X_y=True)

# Split into train/test
test_size = 50
X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]


## Look for optimal parameters

We’ll take a brute-force approach and just check all possible combinations of parameters over a fixed parameter grid.
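
As a rough illustration of what "every combination over a fixed grid" means, the sketch below enumerates a deliberately tiny grid with `itertools.product`. The name `toy_grid` and its values are made up for this example; only the parameter names are borrowed from the tutorial.

```python
import itertools

# Hypothetical, deliberately small grid (values chosen only for illustration)
toy_grid = {
    'error_weight': [0.1, 1.0],
    'l1_reg': [0.0, 0.01],
}

# Split the grid into parameter names and their candidate value lists
names, value_lists = zip(*toy_grid.items())

# The Cartesian product yields one tuple of values per combination
combos = [dict(zip(names, values))
          for values in itertools.product(*value_lists)]

for combo in combos:
    print(combo)
```

Each printed dictionary could be passed as keyword arguments to a model constructor. A grid with 2 values for each of 2 parameters yields 2 × 2 = 4 combinations.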

To evaluate each combination, we’ll use custom cross-validation. First, we need an error function. In this tutorial, we’ll use mean squared error (MSE).

def mse(a, b):
    """Mean squared error of 2 arrays"""
    return ((a - b) ** 2).mean()
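
To make the error function concrete, here is a small worked example on made-up arrays. `y_true` and `y_hat` are hypothetical values, not taken from the dataset, and the function is repeated so the snippet runs on its own.

```python
import numpy as np

def mse(a, b):
    """Mean squared error of 2 arrays"""
    return ((a - b) ** 2).mean()

y_true = np.array([1.0, 2.0, 3.0])  # made-up targets
y_hat = np.array([1.0, 2.0, 5.0])   # made-up predictions

# Squared errors are [0, 0, 4], so the mean is 4 / 3
print(mse(y_true, y_hat))
```

Because the errors are squared before averaging, a single large miss dominates the result. Here one prediction that is off by 2 accounts for the entire error.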


Now let’s write a cross-validation data iterator and a grid search function.

import neoml
import itertools

def cv_iterator(X, y, n_folds):
    """Yields X_train, y_train, X_test, y_test for each of the folds"""
    data_size = len(y)
    test_size = data_size // n_folds
    for i in range(n_folds):
        train = list(itertools.chain(range(i*test_size),
                                     range((i+1)*test_size, data_size)))
        test = range(i*test_size, (i+1)*test_size)
        yield X[train], y[train], X[test], y[test]

def grid_search(X, y, param_grid, n_folds=5):
    """Searches for the optimal parameters in the grid

    Returns trained model and optimal parameters
    """
    best_params = {}

    if param_grid:  # Avoid corner case when param_grid is empty
        param_names, param_values_lists = zip(*param_grid.items())
        best_mse = float('inf')
        for param_values in itertools.product(*param_values_lists):
            kwargs = dict(zip(param_names, param_values))
            linear = neoml.Linear.LinearRegressor(**kwargs)
            avg_mse = 0.
            # Calculate average MSE for K-folds
            for X_train, y_train, X_test, y_test in cv_iterator(X, y, n_folds):
                model = linear.train(X_train, y_train)
                avg_mse += mse(y_test, model.predict(X_test))
            avg_mse /= n_folds
            # Update params if MSE has improved
            if avg_mse < best_mse:
                best_mse = avg_mse
                best_params = kwargs

    # Retrain on all the data passed in, using the best parameters found
    best_linear = neoml.Linear.LinearRegressor(**best_params)
    return best_linear.train(X, y), best_params
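
To see how the fold iterator partitions the data, the sketch below runs a copy of `cv_iterator` on a hypothetical 10-sample array and records which targets land in the training and validation parts of each fold. `X_toy` and `y_toy` are made up for illustration, and no NeoML code is involved.

```python
import itertools
import numpy as np

def cv_iterator(X, y, n_folds):
    """Yields X_train, y_train, X_test, y_test for each of the folds"""
    data_size = len(y)
    test_size = data_size // n_folds
    for i in range(n_folds):
        train = list(itertools.chain(range(i*test_size),
                                     range((i+1)*test_size, data_size)))
        test = range(i*test_size, (i+1)*test_size)
        yield X[train], y[train], X[test], y[test]

# Toy data: the target of each sample equals its index
X_toy = np.arange(10).reshape(10, 1)
y_toy = np.arange(10)

folds = [(y_tr.tolist(), y_te.tolist())
         for _, y_tr, _, y_te in cv_iterator(X_toy, y_toy, n_folds=3)]

for train_part, test_part in folds:
    print('train:', train_part, 'validate:', test_part)
```

With 10 samples and 3 folds, each validation part holds `10 // 3 = 3` samples, so the last sample (index 9) is used only for training and never for validation. The same rounding applies whenever the data size is not a multiple of `n_folds`.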


Now we can search for optimal parameters. Because the dataset is very small, we use a large number of cross-validation folds (20), which leaves most of the data available for training in each fold.

Once the optimal parameters are found, we’ll train the regressor with these parameters on the whole training set.

%%time
param_grid = {
    'error_weight': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4],  # single value; it appears in the printed best params
}

# Search for optimal parameters
model, params = grid_search(X_train, y_train, param_grid, n_folds=20)

print('Best params: ', params)

Best params:  {'error_weight': 0.01, 'l1_reg': 0.0, 'thread_count': 4}
Wall time: 22 s
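
For a sense of how much work the search does, the sketch below counts the training runs implied by the two multi-valued parameters and the 20 folds. It copies those two value lists from the cell above and assumes one extra fit for the final retraining step.

```python
import math

# The two multi-valued parameters of the grid, copied from the cell above
param_grid = {
    'error_weight': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
}
n_folds = 20

# Number of combinations is the product of the list lengths: 8 * 7
n_combinations = math.prod(len(values) for values in param_grid.values())

# One fit per combination per fold, plus the final fit on all training data
n_fits = n_combinations * n_folds + 1

print(n_combinations, n_fits)
```

The count grows multiplicatively with every added parameter, so brute-force search stays practical only for small grids like this one.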


## Evaluate the best model

Let’s take a look at the results of the trained regression model.

y_pred = model.predict(X_test)

print(type(y_pred))
print(y_pred.shape)
print(y_pred.dtype)

<class 'numpy.ndarray'>
(50,)
float64


The model returns the prediction for each object as a 1-dimensional NumPy array. Here’s the mean squared error of the model’s predictions on the test set:

print(f'Test MSE: {mse(y_test, y_pred):.3f}')

Test MSE: 13.520
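
MSE is expressed in squared target units, so its square root (RMSE) can be easier to interpret. The sketch below takes the value printed above; the variable names are illustrative.

```python
import math

test_mse = 13.520  # value printed by the cell above

# RMSE is in the same units as the target variable
rmse = math.sqrt(test_mse)

print(f'Test RMSE: {rmse:.3f}')
```

The Boston dataset's target is the median home value in thousands of dollars, so an RMSE of about 3.677 suggests a typical prediction error of roughly $3,700 on this test split.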