Copyright © 2017-2021 ABBYY Production LLC

[1]:
#@title
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Linear regressor

In this tutorial, we’ll use a NeoML linear regressor to process the Boston house prices dataset from scikit-learn. We’ll look for the best parameter configuration by trying out every combination over a fixed parameter grid.

The tutorial includes the following steps:

- Download the dataset
- Look for optimal parameters
- Evaluate the best model

Download the dataset

Note: This section doesn't contain any NeoML-specific code; it just downloads the dataset from the internet. If you are not running this notebook, you may skip this section. Also note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older scikit-learn (an alternative loader is sketched after the code).

[2]:
from sklearn.datasets import load_boston


# Get data
X, y = load_boston(return_X_y=True)

# Split into train/test
test_size = 50
X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]
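
The load_boston loader used above was removed in scikit-learn 1.2. If you are on a newer version, you can load the same data from the original source, as scikit-learn's removal notice suggests (a minimal sketch; it assumes network access and that pandas is installed):

import numpy as np
import pandas as pd

# Load the Boston housing data directly from the original CMU source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each sample spans two physical rows in the file; stitch them back together
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]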

Look for optimal parameters

We’ll take a brute-force approach and just check all possible combinations of parameters over a fixed parameter grid.
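
To make the enumeration concrete, here's a minimal sketch (with made-up values) of what "every combination over a grid" means; it uses itertools.product, the same primitive the grid search below relies on:

import itertools

grid = {'error_weight': [0.01, 0.1], 'l1_reg': [0.0, 1e-4]}
names, values = zip(*grid.items())
combos = [dict(zip(names, vals)) for vals in itertools.product(*values)]
print(len(combos))  # 2 * 2 = 4 parameter combinations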

To evaluate each combination, we'll use custom cross-validation. First of all, we need an error function; in this tutorial we'll use the mean squared error (MSE).

[3]:
def mse(a, b):
    """Mean squared error of 2 arrays"""
    return ((a - b) ** 2).mean()
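
As a quick sanity check (not part of the original notebook): the MSE of [1, 2] against [1, 4] should be (0^2 + 2^2) / 2 = 2.

import numpy as np

assert mse(np.array([1., 2.]), np.array([1., 4.])) == 2.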

Now let's write a cross-validation data iterator and a grid-search function.

[4]:
import neoml
import itertools


def cv_iterator(X, y, n_folds):
    """Yields X_train, y_train, X_test, y_test for each of the folds"""
    data_size = len(y)
    test_size = data_size // n_folds
    # Any leftover samples (data_size % n_folds) always stay in the train part
    for i in range(n_folds):
        train = list(itertools.chain(range(i*test_size),
                                     range((i+1)*test_size, data_size)))
        test = range(i*test_size, (i+1)*test_size)
        yield X[train], y[train], X[test], y[test]


def grid_search(X, y, param_grid, n_folds=5):
    """Searches for the best parameters in the grid
    Returns trained model and best parameters
    """
    best_params = {}

    if param_grid:  # Avoid corner case when param_grid is empty
        param_names, param_values_lists = zip(*param_grid.items())
        best_mse = float('inf')
        for param_values in itertools.product(*param_values_lists):
            kwargs = dict(zip(param_names, param_values))
            linear = neoml.Linear.LinearRegressor(**kwargs)
            avg_mse = 0.
            # Average the MSE over the K folds
            for X_train, y_train, X_test, y_test in cv_iterator(X, y, n_folds):
                model = linear.train(X_train, y_train)
                avg_mse += mse(y_test, model.predict(X_test))
            avg_mse /= n_folds
            # Update the best parameters if the average MSE has improved
            if avg_mse < best_mse:
                best_mse = avg_mse
                best_params = kwargs

    # Retrain with the best parameters on all the given data
    best_linear = neoml.Linear.LinearRegressor(**best_params)
    return best_linear.train(X, y), best_params
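
To see what the iterator produces (a sketch, assuming the X_train/y_train split from above): the Boston data has 506 samples, so the training part holds 456; with 20 folds, each test part gets 456 // 20 = 22 samples, and the 16 leftover samples always stay in the training part.

for i, (X_tr, y_tr, X_te, y_te) in enumerate(cv_iterator(X_train, y_train, 20)):
    if i < 2:  # Show just the first two folds
        print(i, X_tr.shape, X_te.shape)  # expect (434, 13) train, (22, 13) test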

Now we can search for the optimal parameters. The dataset is very small, which is why the number of cross-validation folds has to be relatively large: each model still trains on almost all of the data, and averaging the error over more test splits makes the estimate more stable.

Once the optimal parameters are found, we’ll train the regressor with these parameters on the whole training set.

[5]:
%%time
param_grid = {
    'error_weight': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2],
    'l1_reg': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'thread_count': [4],
}

# Search for optimal parameters
model, params = grid_search(X_train, y_train, param_grid, n_folds=20)

print('Best params: ', params)
Best params:  {'error_weight': 0.01, 'l1_reg': 0.0, 'thread_count': 4}
Wall time: 22 s

Evaluate the best model

Let’s take a look at the results of the trained regression model.

[6]:
y_pred = model.predict(X_test)

print(type(y_pred))
print(y_pred.shape)
print(y_pred.dtype)
<class 'numpy.ndarray'>
(50,)
float64

The model returns the predictions as a one-dimensional numpy array, one value per object. Here's the mean squared error of the model's predictions on the test set:

[7]:
print(f'Test MSE: {mse(y_test, y_pred):.3f}')
Test MSE: 13.520
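
To put this number in context (a sketch, not part of the original notebook), we can compare it against a naive baseline that always predicts the mean of the training targets; the trained regressor should beat it by a wide margin.

import numpy as np

# Predict the training-set mean for every test sample
baseline_pred = np.full_like(y_test, y_train.mean())
print(f'Baseline MSE: {mse(y_test, baseline_pred):.3f}')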