Why Do RandomForestClassifier On CPU (Using SKLearn) And On GPU (Using RAPIDS) Get Very Different Scores?
I am using RandomForestClassifier on CPU with SKLearn and on GPU with RAPIDS. I am benchmarking the two libraries for speedup and scoring on the Iris dataset (it
Solution 1:
This is caused by a known issue in our predict code. It was corrected in 0.13, which now warns and falls back to the CPU on multi-class classifications. Version 0.12 had neither the warning nor the fallback, so if you didn't know to pass predict_model="CPU" on a multi-class classification, you'd get a much lower prediction score than you should with the model you just fit.
See issue here: https://github.com/rapidsai/cuml/issues/1623
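If the same script has to run on both 0.12 and later releases, the rule above can be written down as a small guard. This is only a sketch: `choose_predict_model` and `parse_version` are hypothetical helpers, not part of cuML, and the sketch assumes that GPU prediction is safe for binary problems on 0.12 and that "GPU" is acceptable on 0.13+ because of the fallback described above.

```python
def parse_version(version):
    """Turn a version string such as '0.12.0' into a comparable tuple (0, 12)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def choose_predict_model(cuml_version, n_classes):
    """Hypothetical helper: pick the predict_model argument for predict().

    Assumption: before 0.13, multi-class prediction must be forced onto
    the CPU; from 0.13 on, the library warns and falls back by itself.
    """
    multi_class = n_classes > 2
    if multi_class and parse_version(cuml_version) < (0, 13):
        return "CPU"
    return "GPU"


# Iris has 3 classes, so on 0.12 this selects the CPU path.
print(choose_predict_model("0.12.0", 3))
```

In a real script you would pass the installed version string (for example `cuml.__version__`) and the number of distinct labels, then hand the result to `predict(..., predict_model=...)`.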
Here's some code to help you and others. It's been modified so it is a bit easier for others in the future. I get ~ 0.9333 on a GV100 and RAPIDS 0.12 stable.
import cudf as cu
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split
import numpy as np
# data link: https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv
# Read data
df = cu.read_csv('./iris.csv', header = 0, delimiter = ',') # Get complete CSV
# Prep data
X = df.iloc[:, [0, 1, 2, 3]].astype(np.float32) # Get data columns. Must be float32 for our Classifier
y = df.iloc[:, 4].astype('category').cat.codes # Get labels column. Will convert to int32
cu_s_random_forest = cusRandomForestClassifier(
    n_bins=16,
    n_estimators=40,
    max_depth=16,
    max_features=1.0,
    n_streams=1)
train_data, test_data, train_label, test_label = cu_train_test_split(X, y, train_size=0.8)
# Fit data in RandomForest
cu_s_random_forest.fit(train_data,train_label)
# Predict data
predict = cu_s_random_forest.predict(test_data, predict_model="CPU") # use CPU to do multi-class classifications
print(predict)
# Check score
print('accuracy_score: ', cu_accuracy_score(test_label, predict))
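A quick way to check whether you are hitting this kind of problem is to see which classes never show up in the predictions. The sketch below is plain Python with made-up labels (not real Iris output), and `missing_classes` and `accuracy` are hypothetical helper names; it assumes the labels and predictions have already been copied to host lists.

```python
def missing_classes(true_labels, predicted_labels):
    """Return the classes present in the truth but absent from the predictions."""
    return sorted(set(true_labels) - set(predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Illustrative labels only: a 3-class problem where class 2 is never predicted.
test_label = [0, 1, 2, 2, 1, 0]
bad_predict = [0, 1, 1, 1, 1, 0]

print("missing classes:", missing_classes(test_label, bad_predict))
print("accuracy:", accuracy(test_label, bad_predict))
```

If a whole class is missing from the predictions on a balanced dataset like Iris, the score is capped well below what the fitted model should reach, which is a hint to retry with predict_model="CPU".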
Solution 2:
I tried this with your example above, converted everything to NumPy arrays, and it worked.
import numpy as np
from sklearn.metrics import accuracy_score as sk_accuracy_score
# host_s_* are the train/test splits from the question's code
train_label_np = host_s_labels_train.as_matrix().astype(np.int32)
train_data_np = host_s_data_train.as_matrix().astype(np.float32)
test_label_np = host_s_labels_test.as_matrix().astype(np.int32)
test_data_np = host_s_data_test.as_matrix().astype(np.float32)
cu_s_random_forest = cusRandomForestClassifier(
    n_estimators=40,
    max_depth=16,
    n_bins=16,
    max_features=1.0,
    n_streams=1)
# Fit data in RandomForest
cu_s_random_forest.fit(train_data_np,train_label_np)
# Predict data (GPU does not predict for multi-class at the moment. Fixed in 0.13)
predict_np = cu_s_random_forest.predict(test_data_np, predict_model='CPU')
# Check score
print('accuracy_score: ', sk_accuracy_score(test_label_np, predict_np))