[1]:

#@title
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In this tutorial, we’ll use a NeoML gradient boosting classifier to process the 20newsgroups dataset. We’ll compare different modes for building the decision trees, looking at the time it takes to train each one and the accuracy of its performance on the testing set.

The tutorial includes the following steps:

• Download the dataset

• Compare different builder types

## Download the dataset

Note: This section doesn’t contain any NeoML-specific code; it just downloads the dataset from the internet. If you are not running this notebook, you may skip this section.

[2]:

from sklearn.datasets import fetch_20newsgroups_vectorized

train_data = fetch_20newsgroups_vectorized(subset='train')
test_data = fetch_20newsgroups_vectorized(subset='test')
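Before training, it can help to check what the loader returns: a scipy CSR sparse matrix of tf-idf features plus one integer label per document. The tiny matrix below is a stand-in sketch so the shapes are easy to see without downloading anything:

```python
from scipy.sparse import csr_matrix
import numpy as np

# Toy stand-in for the structure fetch_20newsgroups_vectorized returns:
# a sparse (documents x features) matrix and an integer label per document.
X_toy = csr_matrix(np.array([[0.0, 0.5, 0.0],
                             [0.7, 0.0, 0.3]]))
y_toy = np.array([0, 1])

print(X_toy.shape)  # (2, 3): documents x features
print(X_toy.nnz)    # 3 non-zero entries stored
print(y_toy.shape)  # (2,): one label per document
```

The real matrix is much larger, but the access pattern (`.shape`, `.nnz`, one label per row) is the same.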


## Compare different builder types

To compare the different boosting builder types, we first need to:

• Prepare an accuracy function for evaluating trained models

• Set the remaining boosting parameters, which will be the same for every builder

[3]:

import numpy as np

def accuracy(model, X, y):
    """Returns the accuracy of model on the given data"""
    correct = sum(1 for label, probs in zip(y, model.classify(X))
                  if label == np.argmax(probs))
    return float(correct) / len(y)

# These arguments will be used for every builder_type
shared_kwargs = {
    'loss' : 'binomial',
    'iteration_count' : 100,
    'learning_rate' : 0.1,
    'subsample' : 1.,
    'subfeature' : 0.25,
    'random_seed' : 1234,
    'max_depth' : 6,
    'max_node_count' : -1,
    'l1_reg' : 0.,
    'l2_reg' : 1.,
    'prune' : 0.,
}
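To see what the accuracy helper above expects from `model.classify`, here is a quick check against a stub model whose `classify` method returns fixed probability rows. The stub and its numbers are made up purely for illustration, and the helper is repeated so the snippet is self-contained:

```python
import numpy as np

def accuracy(model, X, y):
    """Returns the accuracy of model on the given data"""
    correct = sum(1 for label, probs in zip(y, model.classify(X))
                  if label == np.argmax(probs))
    return float(correct) / len(y)

class StubModel:
    """Stand-in for a trained classifier: classify returns one
    probability row per input sample (made-up numbers)."""
    def classify(self, X):
        return [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

# argmax of each row gives predictions [0, 1, 0]; against the true
# labels [0, 1, 1] that's 2 correct out of 3.
print(accuracy(StubModel(), X=None, y=[0, 1, 1]))  # 0.666...
```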


Now we’ll compare the training speed and accuracy of the different decision tree builders.

NeoML has several builder types for gradient boosting:

• full - the classic algorithm. If the dataset has multiple classes, it uses the one-versus-all approach, building a separate tree ensemble for each class.

• hist - uses histograms of feature values when searching for node splits, which reduces the number of split candidates to evaluate.

• multi_full - the classic algorithm with one modification: for multiple classes it stores multiple values (one per class) in the leaf nodes of a single tree ensemble, instead of building multiple one-versus-all ensembles.
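The difference between full and multi_full can be sketched in plain numpy. With one-versus-all, each class has its own ensemble whose trees produce scalar scores; with multi-valued leaves, a single ensemble's trees each emit a vector with one value per class. All the numbers below are invented solely to show that both layouts yield the same per-class scores:

```python
import numpy as np

# 'full' (one-versus-all): one binary ensemble per class; each
# ensemble sums the scalar leaf values of its own trees.
per_class_ensembles = [
    [0.3, -0.1],   # trees of the class-0 ensemble
    [0.8,  0.7],   # trees of the class-1 ensemble
    [-0.2, 0.1],   # trees of the class-2 ensemble
]
ova_scores = np.array([sum(trees) for trees in per_class_ensembles])

# 'multi_full': a single ensemble; each tree's leaf stores a vector
# with one value per class, summed across trees.
multi_leaves = [np.array([0.3, 0.8, -0.2]),
                np.array([-0.1, 0.7, 0.1])]
multi_scores = np.sum(multi_leaves, axis=0)

# Both layouts produce the same per-class scores here, so the
# argmax prediction is identical.
print(np.argmax(ova_scores), np.argmax(multi_scores))
```

With multi-valued leaves a sample is routed through one set of trees instead of one set per class, which is the trade-off the multi_full builder explores.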

[4]:

import time
import neoml

# Train and test gradient boosting for every builder type
for builder in ['full', 'hist', 'multi_full']:
    start = time.time()
    boost_kwargs = { **shared_kwargs, 'builder_type' : builder }
    classifier = neoml.GradientBoost.GradientBoostClassifier(**boost_kwargs)
    model = classifier.train(train_data.data, train_data.target)
    run_time = time.time() - start
    acc = accuracy(model, test_data.data, test_data.target)
    print(f'{builder}  Accuracy: {acc:.4f}  Time: {run_time:.2f} sec.')
full  Accuracy: 0.7868  Time: 111.54 sec.
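Once all three builders have been run, it is convenient to gather the measurements into a list and print a small comparison table. The sketch below uses a placeholder list: only the full row reproduces the run shown above, and the table-printing helper is purely illustrative:

```python
# Only the 'full' row comes from the run shown above; append the
# other builders' rows as you measure them.
results = [
    ('full', 0.7868, 111.54),
]

print(f"{'builder':<12}{'accuracy':>10}{'time, sec':>12}")
for name, acc, sec in results:
    print(f"{name:<12}{acc:>10.4f}{sec:>12.2f}")
```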