Copyright © 2017-2021 ABBYY Production LLC

[1]:
#@title
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Gradient tree boosting classifier

Download the tutorial as a Jupyter notebook

In this tutorial, we’ll use a NeoML gradient boosting classifier to process the 20newsgroups dataset. We’ll compare different modes for building the decision trees, looking at the time it takes to train each one and the accuracy of its performance on the testing set.

The tutorial includes the following steps:

  • Download the dataset

  • Compare different builder types

Download the dataset

Note: This section doesn’t have any NeoML-specific code. It just downloads the dataset from the internet. If you are not running this notebook, you may skip this section.

The 20newsgroups vectorized dataset can be downloaded from scikit-learn, already divided into training and testing subsets.

[2]:
from sklearn.datasets import fetch_20newsgroups_vectorized

train_data = fetch_20newsgroups_vectorized(subset='train')
test_data = fetch_20newsgroups_vectorized(subset='test')
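
If you'd like to check what was downloaded, the quick inspection below (not part of the original tutorial) prints the sizes of the sparse feature matrices and the number of classes:

# Optional sanity check: fetch_20newsgroups_vectorized returns sparse
# feature matrices plus an integer class label per document.
print('train:', train_data.data.shape, 'test:', test_data.data.shape)
print('classes:', len(train_data.target_names))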

Compare different builder types

To compare different boosting builder types, we first need to:

  • Prepare an accuracy function for evaluating the trained models

  • Set up the boosting parameters that will be shared by every builder

[3]:
import numpy as np

def accuracy(model, X, y):
    """Returns the accuracy of model on the given data"""
    correct = sum(1 for label, probs in zip(y, model.classify(X))
                  if label == np.argmax(probs))
    return float(correct)/len(y)


# These arguments will be used for every builder_type
shared_kwargs = {
    'loss' : 'binomial',
    'iteration_count' : 100,
    'learning_rate' : 0.1,
    'subsample' : 1.,
    'subfeature' : 0.25,
    'random_seed' : 1234,
    'max_depth' : 6,
    'max_node_count' : -1,
    'l1_reg' : 0.,
    'l2_reg' : 1.,
    'prune' : 0.,
    'thread_count' : 1,
}
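
Before running the full comparison, you can optionally smoke-test the accuracy helper on a small slice of the data. The snippet below is a hypothetical check, not part of the original tutorial; it uses only a few iterations and the first 1000 training documents so that it finishes quickly:

import neoml

# Hypothetical quick check: train a tiny booster and make sure the
# accuracy helper returns a sensible value before the long runs below.
quick_kwargs = {**shared_kwargs, 'iteration_count': 5, 'builder_type': 'full'}
quick_model = neoml.GradientBoost.GradientBoostClassifier(**quick_kwargs).train(
    train_data.data[:1000], train_data.target[:1000])
print('smoke-test accuracy:', accuracy(quick_model, test_data.data, test_data.target))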

Now we’ll compare the training speed and accuracy of the different decision tree builders.

NeoML has several builder types for gradient boosting:

  • full - the classic algorithm. For datasets with multiple classes it uses the one-versus-all approach.

  • hist - uses histograms of feature values when choosing how to split nodes.

  • multi_full - the classic algorithm with one modification: for multiple classes it stores multiple values in the leaf nodes of a single tree ensemble, instead of training multiple one-versus-all ensembles.

[4]:
import time
import neoml

# Train and test gradient boosting for every builder type
for builder in ['full', 'hist', 'multi_full']:
    start = time.time()
    boost_kwargs = { **shared_kwargs, 'builder_type' : builder}
    classifier = neoml.GradientBoost.GradientBoostClassifier(**boost_kwargs)
    model = classifier.train(train_data.data, train_data.target)
    run_time = time.time() - start
    acc = accuracy(model, test_data.data, test_data.target)
    print(f'{builder}  Accuracy: {acc:.4f}  Time: {run_time:.2f} sec.')
full  Accuracy: 0.7868  Time: 111.54 sec.
hist  Accuracy: 0.7926  Time: 198.95 sec.
multi_full  Accuracy: 0.6609  Time: 209.08 sec.
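
After the loop, model holds the classifier trained last (multi_full in this run). As a hypothetical follow-up, classify returns a probability vector per sample, so its argmax gives the predicted newsgroup:

import numpy as np

# Hypothetical usage sketch: classify the first five test documents with the
# last trained model and print the predicted vs. actual newsgroup names.
for probs, actual in zip(model.classify(test_data.data[:5]), test_data.target[:5]):
    predicted = np.argmax(probs)
    print('predicted:', test_data.target_names[predicted],
          '| actual:', test_data.target_names[actual])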