Copyright © 2017-2021 ABBYY Production LLC
[1]:
#@title
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Linear regressor
In this tutorial, we’ll use a NeoML linear regressor to process the Boston house prices dataset from scikit-learn. We’ll look for the best parameter configuration by trying out every combination over a fixed parameter grid.
The tutorial includes the following steps: downloading the dataset, looking for optimal parameters, and evaluating the best model.
Download the dataset
Note: This section doesn’t have any NeoML-specific code. It just downloads the dataset from the internet. If you are not running this notebook, you may skip this section.
[2]:
from sklearn.datasets import load_boston
# Get data
X, y = load_boston(return_X_y=True)
# Split into train/test
test_size = 50
X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]
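The split above is positional: the last 50 rows become the test set and nothing is shuffled. A minimal sketch of the same slicing on stand-in arrays follows; the 506 x 13 shape is an assumption based on the shape scikit-learn documents for this dataset, and the zero-filled arrays are placeholders rather than real data.

```python
import numpy as np

# Stand-in arrays (assumed shape: 506 samples, 13 features); not real data
X = np.zeros((506, 13))
y = np.arange(506, dtype=float)

# Same slicing as in the cell above: hold out the last 50 rows
test_size = 50
X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]

print(X_train.shape, X_test.shape)
print(y_test[0], y_test[-1])  # positions of the first and last held-out rows
```

Because the rows are taken in their original order, the test set is only representative if the dataset itself is not sorted in some meaningful way.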
Look for optimal parameters
We’ll take a brute-force approach and just check all possible combinations of parameters over a fixed parameter grid.
To evaluate each combination, we’ll use custom cross-validation. First of all we need to write an error function. In this tutorial we’re going to use mean squared error.
[3]:
def mse(a, b):
    """Mean squared error of 2 arrays"""
    return ((a - b) ** 2).mean()
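As a quick sanity check, the function can be tried on two small arrays whose error is easy to compute by hand. The arrays below are made up for illustration and are not part of the dataset.

```python
import numpy as np

def mse(a, b):
    """Mean squared error of 2 arrays"""
    return ((a - b) ** 2).mean()

# Hypothetical values, chosen so the result is easy to verify by hand
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])

# Squared differences are 0, 0 and 4, so the mean is 4 / 3
example_error = mse(y_true, y_hat)
print(example_error)
```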
Now let’s write a cross-validation data iterator and a grid search function.
[4]:
import itertools

import neoml


def cv_iterator(X, y, n_folds):
    """Returns X_train, y_train, X_test, y_test for each of the folds"""
    data_size = len(y)
    test_size = data_size // n_folds
    for i in range(n_folds):
        # Everything outside the i-th block of test_size samples is used for training
        train = list(itertools.chain(range(i*test_size),
                                     range((i+1)*test_size, data_size)))
        test = range(i*test_size, (i+1)*test_size)
        yield X[train], y[train], X[test], y[test]


def grid_search(X, y, param_grid, n_folds=5):
    """Searches for the best parameters in the grid
    Returns trained model and optimal parameters
    """
    best_params = {}
    if param_grid:  # Avoid corner case when param_grid is empty
        param_names, param_values_lists = zip(*param_grid.items())
        best_mse = float('inf')
        for param_values in itertools.product(*param_values_lists):
            kwargs = dict(zip(param_names, param_values))
            linear = neoml.Linear.LinearRegressor(**kwargs)
            avg_mse = 0.
            # Calculate average MSE for K-folds
            for X_train, y_train, X_test, y_test in cv_iterator(X, y, n_folds):
                model = linear.train(X_train, y_train)
                avg_mse += mse(y_test, model.predict(X_test))
            avg_mse /= n_folds
            # Update params if MSE has improved
            if avg_mse < best_mse:
                best_mse = avg_mse
                best_params = kwargs
    # Retrain on all of the given data with the best parameters found
    best_linear = neoml.Linear.LinearRegressor(**best_params)
    return best_linear.train(X, y), best_params
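To see how the folds are laid out, here is a small sketch that reproduces only the index arithmetic of `cv_iterator`, without NeoML or real data. The helper name `fold_indices` and the toy sizes (10 samples, 3 folds) are hypothetical and chosen for illustration.

```python
import itertools

def fold_indices(data_size, n_folds):
    """Yields (train, test) index lists, mirroring the slicing in cv_iterator"""
    test_size = data_size // n_folds
    for i in range(n_folds):
        train = list(itertools.chain(range(i * test_size),
                                     range((i + 1) * test_size, data_size)))
        test = list(range(i * test_size, (i + 1) * test_size))
        yield train, test

# Hypothetical toy sizes: 10 samples, 3 folds, so each test fold has 3 samples
folds = list(fold_indices(10, 3))
for train, test in folds:
    print('test:', test, 'train:', train)
```

Note that when the data size is not divisible by the number of folds, the trailing samples (index 9 in this toy case) always stay in the training part and are never used for testing.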
Now we can search for optimal parameters. The dataset is very small, which is why we use a large number of folds in cross-validation: with 20 folds, each model is still trained on most of the available data.
Once the optimal parameters are found, we’ll train the regressor with these parameters on the whole training set.
[5]:
%%time
param_grid = {
    'error_weight': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4],
}
# Search for optimal parameters
model, params = grid_search(X_train, y_train, param_grid, n_folds=20)
print('Best params: ', params)
Best params: {'error_weight': 0.01, 'l1_reg': 0.0, 'thread_count': 4}
Wall time: 22 s
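To get a feel for how much work the search does, the sketch below expands the same parameter grid with `itertools.product`, the way `grid_search` does, and counts the combinations. It does not train anything; the variable name `combinations` is introduced here for illustration only.

```python
import itertools

param_grid = {
    'error_weight': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4],
}

# Cartesian product of all value lists, one dict of keyword arguments per combination
param_names, param_values_lists = zip(*param_grid.items())
combinations = [dict(zip(param_names, values))
                for values in itertools.product(*param_values_lists)]

n_folds = 20
print(len(combinations))            # number of parameter combinations
print(len(combinations) * n_folds)  # number of cross-validation training runs
print(combinations[0])              # first combination that gets tried
```

The number of training runs grows multiplicatively with every parameter added to the grid, so brute-force search is only practical for small grids and small datasets like this one.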
Evaluate the best model
Let’s take a look at the results of the trained regression model.
[6]:
y_pred = model.predict(X_test)
print(type(y_pred))
print(y_pred.shape)
print(y_pred.dtype)
<class 'numpy.ndarray'>
(50,)
float64
The model returns the predictions as a one-dimensional NumPy array with one value per object. Here’s the mean squared error for the model’s predictions on the testing set:
[7]:
print(f'Test MSE: {mse(y_test, y_pred):.3f}')
Test MSE: 13.520
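MSE is expressed in squared units of the target, which makes it hard to read directly. A common follow-up, not part of the original notebook, is to take the square root and get an error in the target's own units (for this dataset, the target is the median home value in thousands of dollars). The sketch below reuses the MSE value printed above instead of recomputing it.

```python
import math

# Value copied from the output of the previous cell
test_mse = 13.520

# Root mean squared error: same units as the target variable
test_rmse = math.sqrt(test_mse)
print(f'Test RMSE: {test_rmse:.3f}')
```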