XGBoost Code Tested

A project log for Ai Honey Bee Swarm Detector / Predictor System

Local CO2 sensor and weather data scraped off the interwebs predict swarming

Capt. Flatus O'Flaherty ☠ 04/29/2023 at 14:36

I found some nice code for testing XGBoost classification. It was a bit out of date, but with some judicious tweaking I managed to bring it up to date and get it working nicely.

#!/usr/bin/env python
# cd /home/flatus/Documents/solar_harvester/swarm_predicter && python3

# Data:

import warnings
# Raw strings keep "\." as a regex escape instead of an invalid Python string escape
warnings.filterwarnings( "ignore", module = r"matplotlib\..*" )
warnings.filterwarnings( "ignore", module = r"xgboost\..*" )

import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from termcolor import colored

print(colored("Start", 'blue'))
# print(colored('The Accuracy on Test Set is: ', 'blue')

df = pd.read_csv('/home/flatus/Documents/solar_harvester/swarm_predicter/HR_comma_sep.csv')

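#data process
# .copy() gives an independent frame, so the category conversion below does not raise SettingWithCopyWarning
X = df[['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'promotion_last_5years', 'sales', 'salary']].copy()
y = df['left']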

# Extract text features
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert to Pandas category
for col in cats:
   X[col] = X[col].astype('category')

# Splitting the dataset into the Training set and Validation set
Xt, Xv, yt, yv = train_test_split(X, y, test_size = 0.25, random_state = 0)
# dt = xgb.DMatrix(Xt.as_matrix(),label=yt.as_matrix())
# dv = xgb.DMatrix(Xv.as_matrix(),label=yv.as_matrix())

dt = xgb.DMatrix(Xt, yt, enable_categorical=True)
dv = xgb.DMatrix(Xv, yv, enable_categorical=True)

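#Build the model
params = {
    "eta": 0.2,
    "max_depth": 4,
    "objective": "binary:logistic",
    "verbosity": 0,  # replaces the deprecated "silent" flag
    "base_score": np.mean(yt),
    "eval_metric": "logloss"
}
# 'n_estimators' is a scikit-learn wrapper parameter and is not used by xgb.train();
# the number of boosting rounds (3000) is passed directly instead.
model = xgb.train(params, dt, 3000, evals=[(dt, "train"), (dv, "valid")], verbose_eval=200)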

#Prediction on validation set
y_pred = model.predict(dv)

# Making the Confusion Matrix
cm = confusion_matrix(yv, (y_pred>0.5))
print(colored('The Confusion Matrix is: ', 'red'),'\n', cm)

# Calculate the accuracy on the validation set (the held-out 25%)
accuracy_on_validation_set = (cm[0,0] + cm[1,1])/(cm[0,0] + cm[1,1]+cm[1,0] + cm[0,1])
print(colored('The Accuracy on Validation Set is: ', 'blue'), colored(accuracy_on_validation_set, 'blue'))

# Based on the model we made, we can predict whether an employee may leave or not after inputting their information:
print("Now make a single prediction: ")

satisfaction_level = 0.38
last_evaluation = 0.53
number_project = 2
average_montly_hours = 137
time_spend_company = 3
Work_accident = 0
left = 1   # Don't actually need this one as this is what we're trying to predict.
promotion_last_5years = 0
sales = "sales"
salary = "low"

# Make prediction:

# Use a dictionary type:
dict_ = {'satisfaction_level': satisfaction_level, 'last_evaluation': last_evaluation, 'number_project': number_project, 'average_montly_hours': average_montly_hours, 'time_spend_company': time_spend_company, 'Work_accident': Work_accident, 'left': left, 'promotion_last_5years': promotion_last_5years, 'sales': sales, 'salary': salary}
my_HR_test = pd.DataFrame([dict_])
# Get rid of the 'left' column:
my_HR_test = my_HR_test.drop('left', axis=1)
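# Extract text features
cats = my_HR_test.select_dtypes(exclude=np.number).columns.tolist()

# Convert to Pandas category, reusing the training categories from X.
# A one-row frame converted on its own would give every text value code 0,
# which would not match the codes the model was trained on.
for col in cats:
   my_HR_test[col] = my_HR_test[col].astype(X[col].dtype)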

my_HR_test = xgb.DMatrix(my_HR_test, enable_categorical=True)

# Make prediction
new_prediction = model.predict(my_HR_test)

print("new prediction: ",new_prediction)

# model.predict() returns an array with one probability per row; take the first (only) one
if new_prediction[0] > 0.5:
    print(colored("This employee will leave!", 'blue'))
else:
    print(colored("This employee will not leave!", 'green'))
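
The accuracy line in the script adds up the four cells of the confusion matrix by hand. As a small cross-check, here is a sketch of the same arithmetic with NumPy, where the diagonal holds the correct predictions and the sum of all cells is the number of validation rows. The counts in `cm` are made up for illustration and are not results from the HR data.

```python
import numpy as np

# Made-up 2x2 confusion matrix: rows are true classes, columns are predicted classes.
cm = np.array([[50, 5],
               [10, 35]])

# Correct predictions sit on the diagonal, so accuracy = trace / total.
accuracy = np.trace(cm) / cm.sum()

# The long-hand form used in the script gives the same number.
long_form = (cm[0, 0] + cm[1, 1]) / (cm[0, 0] + cm[1, 1] + cm[1, 0] + cm[0, 1])

print(accuracy, long_form)
```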

The key to success was bringing the 'one shot' prediction data into Pandas as a dictionary. I've not seen this done before, but it keeps the code clear and makes it easy to see what needs to happen in deployment.
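
One thing to watch with the dictionary approach: Pandas assigns category codes from whatever values it sees, so a one-row frame converted on its own gives every text value code 0. The sketch below shows one way to line a single record up with the training frame's columns and dtypes. The tiny `train` frame and the helper name `one_shot_frame` are made up for illustration and are not part of the script above.

```python
import pandas as pd

# Hypothetical miniature training frame standing in for the real CSV.
train = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "sales": ["sales", "technical", "hr"],
    "salary": ["low", "medium", "high"],
})
for col in ["sales", "salary"]:
    train[col] = train[col].astype("category")


def one_shot_frame(record, reference):
    """Build a one-row DataFrame with the same columns and dtypes as `reference`."""
    row = pd.DataFrame([record], columns=list(reference.columns))
    # Reusing the reference dtypes carries over the full category lists,
    # so the integer codes match the ones seen during training.
    return row.astype(reference.dtypes.to_dict())


record = {"satisfaction_level": 0.38, "sales": "sales", "salary": "low"}
row = one_shot_frame(record, train)

# Code assigned when the single row is converted on its own.
naive_code = pd.DataFrame([record])["sales"].astype("category").cat.codes.iloc[0]
# Code assigned when the training categories are reused.
aligned_code = row["sales"].cat.codes.iloc[0]
print(naive_code, aligned_code)
```

A frame built this way can then be passed to `xgb.DMatrix(..., enable_categorical=True)` in the same way as the training data.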

I can't test it much on the bee swarming data yet as I've not actually seen any swarms, so there are no boolean 'True' values for 'swarm seen in apiary'.
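
A binary classifier needs examples of both outcomes before training means anything, so a quick label count is a cheap guard to run first. This is only a sketch: the `swarm_seen` list and the helper names are hypothetical, and the minimum count per class is an arbitrary choice.

```python
import pandas as pd


def label_counts(labels):
    """Return (number of True labels, number of False labels)."""
    labels = pd.Series(labels).astype(bool)
    positives = int(labels.sum())
    return positives, int(len(labels) - positives)


def ready_to_train(labels, min_per_class=1):
    """True only when both classes have at least `min_per_class` examples."""
    positives, negatives = label_counts(labels)
    return positives >= min_per_class and negatives >= min_per_class


# Hypothetical apiary log so far: no swarms observed yet.
swarm_seen = [False, False, False, False]
print(label_counts(swarm_seen), ready_to_train(swarm_seen))
```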