4 Classification trees
In this chapter we will use the following packages:
library(tidyverse) # data handling, graphics and glimpse()
library(rsample) # for splitting data into training and testing
library(rpart) # functions for CART
library(rpart.plot) # function for plotting CART
library(caret) # contains a function for the confusion matrix
Classification trees (CART: Classification and Regression Trees) take a completely different approach from regression. Instead of estimating coefficients in an equation, the algorithm splits the data into ever smaller groups based on the values of the predictors. The result is a "tree" that can be read as a series of if-else rules. The advantage is that trees are easy to interpret and can pick up non-linear relationships and interactions automatically. The drawback is that single trees can be unstable and tend to overfit.
Here we use the Attrition dataset from the chapter on logistic regression. The outcome variable is Attrition, which indicates whether an employee leaves the job ("Yes") or not ("No"). We start by fitting a logistic regression model as a basis for comparison, and then build classification trees.
4.1 Reading the data
attrition <- readRDS("data/attrition.rds") %>%
  select(-EmployeeNumber)
glimpse(attrition)
Rows: 1,470
Columns: 31
$ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
$ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,…
$ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare…
$ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
$ Department <fct> Sales, Research & Development, Research & Dev…
$ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
$ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
$ EducationField <fct> Life Sciences, Life Sciences, Other, Life Sci…
$ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
$ Gender <fct> Female, Male, Male, Female, Male, Male, Femal…
$ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
$ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
$ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
$ JobRole <fct> Sales Executive, Research Scientist, Laborato…
$ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
$ MaritalStatus <fct> Single, Married, Single, Married, Married, Si…
$ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
$ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
$ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
$ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
$ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
$ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
$ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
$ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
$ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
$ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
$ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
$ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
$ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
$ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
$ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
We split the dataset into training and testing:
set.seed(426)
attrition_split <- initial_split(attrition, prop = .7)
training <- training(attrition_split)
testing <- testing(attrition_split)
4.2 Logistic regression as a baseline
Before building classification trees, it is useful to have a basis for comparison. We use logistic regression with all variables:
est.glm <- glm(Attrition ~ ., data = training, family = "binomial")
summary(est.glm)
Call:
glm(formula = Attrition ~ ., family = "binomial", data = training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.171e+01 7.504e+02 -0.016 0.987554
Age -3.647e-02 1.688e-02 -2.161 0.030667 *
BusinessTravelTravel_Frequently 1.648e+00 4.983e-01 3.308 0.000940 ***
BusinessTravelTravel_Rarely 9.891e-01 4.541e-01 2.178 0.029387 *
DailyRate -3.576e-04 2.753e-04 -1.299 0.193941
DepartmentResearch & Development 1.298e+01 7.504e+02 0.017 0.986196
DepartmentSales 1.357e+01 7.504e+02 0.018 0.985571
DistanceFromHome 4.746e-02 1.361e-02 3.488 0.000486 ***
Education -1.807e-02 1.084e-01 -0.167 0.867654
EducationFieldLife Sciences -3.399e-01 1.129e+00 -0.301 0.763441
EducationFieldMarketing 3.270e-01 1.179e+00 0.277 0.781525
EducationFieldMedical -5.647e-01 1.133e+00 -0.498 0.618260
EducationFieldOther -5.878e-01 1.207e+00 -0.487 0.626217
EducationFieldTechnical Degree 5.805e-01 1.154e+00 0.503 0.615076
EnvironmentSatisfaction -5.414e-01 1.053e-01 -5.140 2.74e-07 ***
GenderMale 2.498e-01 2.252e-01 1.109 0.267218
HourlyRate -2.457e-03 5.598e-03 -0.439 0.660717
JobInvolvement -5.900e-01 1.538e-01 -3.835 0.000126 ***
JobLevel 1.022e-01 3.942e-01 0.259 0.795375
JobRoleHuman Resources 1.399e+01 7.504e+02 0.019 0.985126
JobRoleLaboratory Technician 1.562e+00 5.660e-01 2.760 0.005777 **
JobRoleManager -1.454e+00 1.269e+00 -1.145 0.252018
JobRoleManufacturing Director -3.797e-01 6.172e-01 -0.615 0.538366
JobRoleResearch Director -2.476e+00 1.256e+00 -1.971 0.048692 *
JobRoleResearch Scientist 2.632e-01 5.858e-01 0.449 0.653273
JobRoleSales Executive -1.863e-01 1.594e+00 -0.117 0.906944
JobRoleSales Representative 1.378e+00 1.648e+00 0.836 0.403020
JobSatisfaction -3.369e-01 1.022e-01 -3.298 0.000975 ***
MaritalStatusMarried 2.514e-01 3.242e-01 0.775 0.438093
MaritalStatusSingle 9.638e-01 4.225e-01 2.281 0.022524 *
MonthlyIncome 8.287e-05 1.030e-04 0.804 0.421245
MonthlyRate 6.922e-06 1.558e-05 0.444 0.656792
NumCompaniesWorked 1.638e-01 4.800e-02 3.413 0.000642 ***
OverTimeYes 2.007e+00 2.437e-01 8.236 < 2e-16 ***
PercentSalaryHike -6.251e-02 4.976e-02 -1.256 0.209020
PerformanceRating 6.650e-01 5.094e-01 1.305 0.191744
RelationshipSatisfaction -3.787e-01 1.031e-01 -3.674 0.000239 ***
StockOptionLevel -2.105e-01 1.875e-01 -1.123 0.261579
TotalWorkingYears -5.930e-02 3.680e-02 -1.611 0.107130
TrainingTimesLastYear -1.856e-01 9.002e-02 -2.061 0.039275 *
WorkLifeBalance -1.886e-01 1.525e-01 -1.236 0.216394
YearsAtCompany 4.602e-02 5.114e-02 0.900 0.368242
YearsInCurrentRole -1.046e-01 5.771e-02 -1.812 0.069940 .
YearsSinceLastPromotion 2.017e-01 5.441e-02 3.707 0.000210 ***
YearsWithCurrManager -1.745e-01 6.343e-02 -2.752 0.005927 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 875.64 on 1028 degrees of freedom
Residual deviance: 566.77 on 984 degrees of freedom
AIC: 656.77
Number of Fisher Scoring iterations: 15
We predict on the testing data and classify with a cut-off of 0.5:
testing_glm <- testing %>%
  mutate(prob_glm = predict(est.glm, newdata = testing, type = "response")) %>%
  mutate(klassifiser = factor(ifelse(prob_glm < .5, "No", "Yes")))
confusionMatrix(reference = testing_glm$Attrition,
                testing_glm$klassifiser,
                positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 345 52
Yes 15 29
Accuracy : 0.8481
95% CI : (0.8111, 0.8803)
No Information Rate : 0.8163
P-Value [Acc > NIR] : 0.04596
Kappa : 0.3844
Mcnemar's Test P-Value : 1.092e-05
Sensitivity : 0.35802
Specificity : 0.95833
Pos Pred Value : 0.65909
Neg Pred Value : 0.86902
Prevalence : 0.18367
Detection Rate : 0.06576
Detection Prevalence : 0.09977
Balanced Accuracy : 0.65818
'Positive' Class : Yes
This confusion matrix gives us a reference point for evaluating the classification trees.
4.3 Classification tree
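The key figures in the output can be recomputed by hand from the four cell counts, which makes it clearer what each one measures. The sketch below copies the counts from the confusion matrix above; the variable names are our own and not part of caret.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
tn <- 345  # predicted No,  actually No
fn <- 52   # predicted No,  actually Yes
fp <- 15   # predicted Yes, actually No
tp <- 29   # predicted Yes, actually Yes
n  <- tn + fn + fp + tp

metrics <- c(
  accuracy    = (tp + tn) / n,   # share of all cases classified correctly
  sensitivity = tp / (tp + fn),  # share of actual leavers that are caught
  specificity = tn / (tn + fp),  # share of stayers classified correctly
  ppv         = tp / (tp + fp),  # Pos Pred Value
  nir         = (tn + fp) / n    # accuracy of always predicting "No"
)
round(metrics, 4)
```

The no-information rate (nir) is the accuracy we would get by predicting "No" for everyone, so accuracy should always be judged against it.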
The outcome variable and the predictors are specified as a formula, in the same way as for regression. Since this is a classification problem, we should specify method = "class". Otherwise rpart() guesses the model type from the outcome variable. The guess may well be right, but if it is not, you get different results than you expected.
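What rpart() picks when method is left out can be checked on a small example. The sketch below uses the kyphosis data that ships with rpart as a stand-in for our own data; the object names are made up for the illustration.

```r
library(rpart)

# Kyphosis is a factor ("absent"/"present"), so rpart guesses classification
fit_factor <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# The same outcome recoded as 0/1 numbers: rpart now guesses a regression tree
kyphosis_num <- transform(kyphosis,
                          Kyphosis = as.integer(Kyphosis == "present"))
fit_numeric <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis_num)

fit_factor$method   # "class"
fit_numeric$method  # "anova"
```

In our data Attrition is a factor, so the guess would be right here, but stating method = "class" protects against a 0/1-coded outcome silently producing a regression tree.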
klass_tre <- rpart(Attrition ~ .,
                   data = training, method = "class")
The result can be shown graphically with the function rpart.plot():
rpart.plot(klass_tre)
The tree is read from the top down. With the default settings for a binary outcome, each node shows the predicted class, the estimated probability of the second class (here "Yes"), and the percentage of all observations that fall in the node. Each split states a criterion: observations that meet the criterion go to the left, the rest to the right. The leaf nodes (at the bottom) give the final classification.
We can also print the result as rules in a table with rpart.rules():
rpart.rules(klass_tre, extra = 4)
Attrition No Yes
No [.94 .06] when TotalWorkingYears >= 3 & OverTime is Yes & JobRole is Healthcare Representative or Laboratory Technician or Manager or Manufacturing Director or Research Director or Research Scientist or Sales Representative & MonthlyIncome >= 3752
No [.93 .07] when TotalWorkingYears >= 3 & OverTime is No
No [.88 .12] when TotalWorkingYears < 3 & JobRole is Human Resources or Laboratory Technician or Sales Representative & EnvironmentSatisfaction >= 4 & MonthlyRate < 19749
No [.86 .14] when TotalWorkingYears >= 3 & OverTime is Yes & JobRole is Human Resources or Sales Executive & MonthlyIncome >= 3752 & DistanceFromHome < 11
No [.83 .17] when TotalWorkingYears < 3 & JobRole is Research Scientist
No [.77 .23] when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 3752 & DistanceFromHome < 17 & EducationField is Life Sciences or Medical & HourlyRate < 84
No [.70 .30] when TotalWorkingYears < 3 & JobRole is Human Resources or Laboratory Technician or Sales Representative & EnvironmentSatisfaction < 4 & Age >= 32
No [.62 .38] when TotalWorkingYears >= 3 & OverTime is Yes & JobRole is Human Resources or Sales Executive & MonthlyIncome >= 3752 & DistanceFromHome >= 11 & EducationField is Life Sciences or Medical or Technical Degree
Yes [.38 .62] when TotalWorkingYears < 3 & JobRole is Human Resources or Laboratory Technician or Sales Representative & EnvironmentSatisfaction >= 4 & MonthlyRate >= 19749
Yes [.33 .67] when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 3752 & DistanceFromHome < 17 & EducationField is Life Sciences or Medical & HourlyRate >= 84
Yes [.24 .76] when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 3752 & EducationField is Human Resources or Marketing or Other or Technical Degree
Yes [.16 .84] when TotalWorkingYears < 3 & JobRole is Human Resources or Laboratory Technician or Sales Representative & EnvironmentSatisfaction < 4 & Age < 32
Yes [.15 .85] when TotalWorkingYears >= 3 & OverTime is Yes & JobRole is Human Resources or Sales Executive & MonthlyIncome >= 3752 & DistanceFromHome >= 11 & EducationField is Human Resources or Marketing or Other
Yes [.12 .88] when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 3752 & DistanceFromHome >= 17 & EducationField is Life Sciences or Medical
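Each row in the table is in effect one branch of an if-else statement. As an illustration only, the sketch below hand-codes two of the fourteen rules above in a hypothetical helper function (p_yes_two_rules is our own name, not something rpart provides) that returns the estimated probability of "Yes".

```r
# Hypothetical helper: hand-codes two of the fourteen rules printed above.
# Returns the estimated probability of "Yes", or NA if neither rule applies.
p_yes_two_rules <- function(TotalWorkingYears, OverTime, JobRole) {
  if (TotalWorkingYears >= 3 && OverTime == "No") {
    0.07   # rule: No [.93 .07]
  } else if (TotalWorkingYears < 3 && JobRole == "Research Scientist") {
    0.17   # rule: No [.83 .17]
  } else {
    NA_real_   # covered by one of the remaining twelve rules
  }
}

p_yes_two_rules(10, "No", "Sales Executive")      # 0.07
p_yes_two_rules(1, "Yes", "Research Scientist")   # 0.17
```

In practice predict() does this lookup for us; the point is only that a fitted tree contains nothing more than such rules.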
Now we can compare the predictions with the observed outcome in the testing data. In predict(), type = "class" must be given to get the predicted class rather than class probabilities.
testing_pred <- testing %>%
  mutate(Attrition_pred = predict(klass_tre, newdata = testing, type = "class"))
confusionMatrix(reference = testing_pred$Attrition, testing_pred$Attrition_pred, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 342 67
Yes 18 14
Accuracy : 0.8073
95% CI : (0.7673, 0.843)
No Information Rate : 0.8163
P-Value [Acc > NIR] : 0.7131
Kappa : 0.1605
Mcnemar's Test P-Value : 1.926e-07
Sensitivity : 0.17284
Specificity : 0.95000
Pos Pred Value : 0.43750
Neg Pred Value : 0.83619
Prevalence : 0.18367
Detection Rate : 0.03175
Detection Prevalence : 0.07256
Balanced Accuracy : 0.56142
'Positive' Class : Yes
Compare this result with the logistic regression baseline above. Of particular interest are accuracy, sensitivity (the share of those who actually leave that we catch) and specificity (the share of those who stay that we classify correctly). Here the tree with default parameters does worse than the baseline: accuracy is 0.807 against 0.848, which is below the no-information rate of 0.816, and sensitivity is 0.17 against 0.36. We can try to improve the tree with tuning.
4.4 Tuning/pruning
We have touched on earlier that we can control how the algorithm works. A few parameters govern the process, and we can adjust them. Here we cover the most important ones.
In short, these parameters control how complex the trees can become. Recall from an earlier chapter: a more complex model fits the training data better, but may fit the testing data worse. The goal is therefore to find a balanced level of complexity.
The workflow is to adjust these parameters a little, check the result, adjust again, check again, and so on. It is important to use the training data for this! Do not use the testing data until you are reasonably satisfied with the result.
4.4.1 Using cp = ...
cp is the complexity parameter, which sets a requirement for how much each split must improve the model's fit to the data. The default is 0.01. A lower value allows more complex trees. The example below also sets maxdepth and minbucket, which are covered in the following sections:
klass_tre2 <- rpart(Attrition ~ .,
data = training, method = "class",
cp = 0.0000001, maxdepth = 20, minbucket = 3)
rpart.plot(klass_tre2)
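A fitted rpart object also stores a table showing how the error changes with cp, which is helpful when deciding on a value. The sketch below uses the kyphosis data bundled with rpart as a stand-in, and the seed is arbitrary; the same fit$cptable lookup should work for klass_tre2.

```r
library(rpart)

set.seed(1)  # the xerror column comes from random cross-validation folds
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0, minbucket = 3)

# One row per candidate subtree: the cp threshold, the number of splits,
# the error on the training data (rel error) and the cross-validated
# error (xerror) with its standard error (xstd)
fit$cptable
```

The rel error column keeps falling as splits are added, while xerror typically stops improving at some point. That is the overfitting trade-off described above, seen without touching the testing data.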
4.4.2 Using maxdepth = ...
The maxdepth parameter sets a limit on how many splits can be made along any branch, that is, the maximum depth of the tree:
klass_tre3 <- rpart(Attrition ~ .,
data = training, method = "class",
maxdepth = 4)
rpart.plot(klass_tre3)
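One way to confirm that the limit is respected is to compute the depth of every node from its node number. This is a sketch on the kyphosis data bundled with rpart, relying on rpart's numbering scheme where the root is node 1 and the children of node k are 2k and 2k + 1.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0, minbucket = 3, maxdepth = 2)

# Row names of fit$frame are the node numbers, so the depth of node k
# is floor(log2(k)), with the root at depth 0
node_id <- as.numeric(rownames(fit$frame))
depth   <- floor(log2(node_id))

max(depth)  # never exceeds maxdepth
```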
4.4.3 Using minbucket = ...
The minbucket = ... parameter controls the minimum number of observations in a terminal node (leaf). The default is 1/3 of minsplit, that is 7 if you have not changed minsplit (which defaults to 20).
klass_tre4 <- rpart(Attrition ~ .,
data = training, method = "class",
minbucket = 15)
rpart.plot(klass_tre4)
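The effect of minbucket can be verified by looking at the leaf sizes stored in the fitted object. The sketch below again uses the kyphosis data bundled with rpart, with minbucket = 10 chosen arbitrarily for the illustration.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0, minbucket = 10)

# fit$frame has one row per node; terminal nodes are marked "<leaf>"
leaves <- fit$frame[fit$frame$var == "<leaf>", ]

leaves$n       # number of training observations in each leaf
min(leaves$n)  # at least minbucket
```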
4.4.4 Putting the pieces together
You can combine these parameters. All of them have default values, so you use them either way. The only difference is whether you have made an explicit choice or leave it all to the software.
Here is an example where we grow a fairly complex tree with adjusted parameters:
klass_tre5 <- rpart(Attrition ~ .,
data = training, method = "class",
cp = 0, maxdepth = 25, minbucket = 5)
rpart.plot(klass_tre5)
rpart.rules(klass_tre5)
Attrition
0.00 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction < 2 & JobRole is Human Resources or Manager or Manufacturing Director or Research Director or Research Scientist or Sales Representative & YearsAtCompany >= 2 & NumCompaniesWorked >= 4
0.00 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction < 2 & JobRole is Healthcare Representative or Laboratory Technician or Sales Executive & YearsAtCompany >= 2 & NumCompaniesWorked >= 4 & DailyRate >= 1034
0.00 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction < 2 & YearsAtCompany < 2 & Age >= 42
0.00 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome >= 3752 & JobRole is Healthcare Representative or Laboratory Technician or Manager or Manufacturing Director or Research Director or Research Scientist or Sales Representative & NumCompaniesWorked < 7 & RelationshipSatisfaction >= 2
0.00 when TotalWorkingYears >= 11 & OverTime is Yes & MonthlyIncome >= 3752 & JobRole is Healthcare Representative or Laboratory Technician or Manager or Manufacturing Director or Research Director or Research Scientist or Sales Representative & NumCompaniesWorked < 7 & RelationshipSatisfaction < 2
0.00 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 2434 & DistanceFromHome < 17 & EducationField is Life Sciences or Medical & YearsAtCompany >= 5 & HourlyRate < 84
0.02 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & MonthlyIncome >= 2087 & JobRole is Healthcare Representative or Human Resources or Manager or Research Director or Sales Executive or Sales Representative & DistanceFromHome < 21 & RelationshipSatisfaction < 2 & YearsSinceLastPromotion < 14
0.02 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & DistanceFromHome >= 21 & DailyRate >= 424
0.02 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & MonthlyIncome >= 2087 & DistanceFromHome < 21 & RelationshipSatisfaction >= 2 & YearsSinceLastPromotion < 14
0.05 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction < 2 & YearsAtCompany >= 2 & NumCompaniesWorked < 4
0.07 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & MonthlyIncome >= 2087 & JobRole is Laboratory Technician or Manufacturing Director or Research Scientist & DistanceFromHome < 21 & RelationshipSatisfaction < 2 & YearsSinceLastPromotion < 14 & StockOptionLevel >= 1
0.08 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome is 2434 to 3752 & DistanceFromHome < 17 & EducationField is Life Sciences or Medical & HourlyRate < 84
0.10 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & MonthlyIncome >= 2087 & JobRole is Laboratory Technician or Manufacturing Director or Research Scientist & DistanceFromHome is 6 to 21 & RelationshipSatisfaction < 2 & YearsSinceLastPromotion < 14 & StockOptionLevel < 1
0.12 when TotalWorkingYears < 3 & EnvironmentSatisfaction >= 4 & JobRole is Human Resources or Laboratory Technician or Sales Representative & MonthlyRate < 19749
0.14 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome >= 3752 & JobRole is Human Resources or Sales Executive & DistanceFromHome < 11
0.17 when TotalWorkingYears < 3 & JobRole is Research Scientist
0.23 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & DistanceFromHome >= 21 & Age >= 32 & DailyRate < 424
0.25 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & MonthlyIncome >= 2087 & DistanceFromHome < 21 & YearsSinceLastPromotion >= 14
0.25 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction < 2 & JobRole is Healthcare Representative or Laboratory Technician or Sales Executive & EducationField is Medical or Other or Technical Degree & YearsAtCompany >= 2 & NumCompaniesWorked >= 4 & DailyRate < 1034
0.30 when TotalWorkingYears < 3 & EnvironmentSatisfaction < 4 & JobRole is Human Resources or Laboratory Technician or Sales Representative & Age >= 32
0.33 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & MonthlyIncome < 2087 & DistanceFromHome < 21
0.33 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome >= 3752 & JobRole is Healthcare Representative or Laboratory Technician or Manager or Manufacturing Director or Research Director or Research Scientist or Sales Representative & NumCompaniesWorked >= 7
0.38 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome >= 3752 & JobRole is Human Resources or Sales Executive & DistanceFromHome >= 11 & EducationField is Life Sciences or Medical or Technical Degree
0.60 when TotalWorkingYears is 3 to 11 & OverTime is Yes & MonthlyIncome >= 3752 & JobRole is Healthcare Representative or Laboratory Technician or Manager or Manufacturing Director or Research Director or Research Scientist or Sales Representative & NumCompaniesWorked < 7 & RelationshipSatisfaction < 2
0.62 when TotalWorkingYears < 3 & EnvironmentSatisfaction >= 4 & JobRole is Human Resources or Laboratory Technician or Sales Representative & MonthlyRate >= 19749
0.67 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction < 2 & YearsAtCompany < 2 & Age < 42
0.67 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 3752 & DistanceFromHome < 17 & EducationField is Life Sciences or Medical & HourlyRate >= 84
0.70 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 2434 & DistanceFromHome < 17 & EducationField is Life Sciences or Medical & YearsAtCompany < 5 & HourlyRate < 84
0.76 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 3752 & EducationField is Human Resources or Marketing or Other or Technical Degree
0.80 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & MonthlyIncome >= 2087 & JobRole is Laboratory Technician or Manufacturing Director or Research Scientist & DistanceFromHome < 6 & RelationshipSatisfaction < 2 & YearsSinceLastPromotion < 14 & StockOptionLevel < 1
0.84 when TotalWorkingYears < 3 & EnvironmentSatisfaction < 4 & JobRole is Human Resources or Laboratory Technician or Sales Representative & Age < 32
0.85 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome >= 3752 & JobRole is Human Resources or Sales Executive & DistanceFromHome >= 11 & EducationField is Human Resources or Marketing or Other
0.88 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction < 2 & JobRole is Healthcare Representative or Laboratory Technician or Sales Executive & EducationField is Life Sciences or Marketing & YearsAtCompany >= 2 & NumCompaniesWorked >= 4 & DailyRate < 1034
0.88 when TotalWorkingYears >= 3 & OverTime is Yes & MonthlyIncome < 3752 & DistanceFromHome >= 17 & EducationField is Life Sciences or Medical
1.00 when TotalWorkingYears >= 3 & OverTime is No & EnvironmentSatisfaction >= 2 & DistanceFromHome >= 21 & Age < 32 & DailyRate < 424
We can check the result with a confusion matrix on the testing data:
testing_pred <- testing %>%
  mutate(Attrition_pred = predict(klass_tre5, newdata = testing, type = "class"))
confusionMatrix(reference = testing_pred$Attrition,
                testing_pred$Attrition_pred,
                positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 328 64
Yes 32 17
Accuracy : 0.7823
95% CI : (0.7408, 0.82)
No Information Rate : 0.8163
P-Value [Acc > NIR] : 0.969618
Kappa : 0.1429
Mcnemar's Test P-Value : 0.001557
Sensitivity : 0.20988
Specificity : 0.91111
Pos Pred Value : 0.34694
Neg Pred Value : 0.83673
Prevalence : 0.18367
Detection Rate : 0.03855
Detection Prevalence : 0.11111
Balanced Accuracy : 0.56049
'Positive' Class : Yes
Note that a very complex tree does not necessarily give better results on the testing data. Here accuracy falls from 0.807 for the default tree to 0.782. The tree may have been fitted too closely to the training data (overfitting). One way to counter this is to prune the tree afterwards.
4.4.5 Pruning
A related technique is to prune the tree based on cp. That is, once you have built a tree that you think is too complex, you can cut it back so that the least important branches are removed. The function prune() does the job:
pruned_tre <- prune(klass_tre5, cp = .01)
rpart.plot(pruned_tre)
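Instead of picking the cp value for pruning by eye, a common approach is to take the value with the lowest cross-validated error from the tree's own cp table. The sketch below shows this on the kyphosis data bundled with rpart; the seed and the object names are arbitrary choices for the illustration, and other selection rules are also in use.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random
big_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", cp = 0, minbucket = 3)

# Pick the cp value with the lowest cross-validated error (xerror)
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

pruned <- prune(big_tree, cp = best_cp)

# Number of leaves before and after pruning
n_leaves <- c(before = sum(big_tree$frame$var == "<leaf>"),
              after  = sum(pruned$frame$var == "<leaf>"))
n_leaves
```

Since the cross-validation only uses the training data, this fits the workflow of leaving the testing data alone until the end.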
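4.5 Asymmetric costs with a loss matrix (extra material)
A loss matrix makes it possible to weight different types of errors differently. rpart() reads the rows as the actual class and the columns as the predicted class, in the order of the factor levels (here No, Yes), which gives the following layout:
\[ loss = \begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix} \]
We always set the weights for true positives and true negatives to 0. The errors are weighted. Note that matrix() fills column by column, so the example below puts the weight 4 in the upper right cell and 1 in the lower left. With the layout above, false positives (flagging someone who actually stays) then weigh 4 times as much as false negatives (missing someone who actually leaves). To weight false negatives more heavily instead, the two values must swap places, as in matrix(c(0, 4, 1, 0), ncol = 2):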
lossm <- matrix(c(0, 1, 4, 0), ncol=2)
lossm
[,1] [,2]
[1,] 0 4
[2,] 1 0
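Because it is easy to mix up the cells, it can help to label the matrix before using it. The sketch below assumes that the factor levels are ordered No, Yes and that rpart reads rows as the actual class and columns as the predicted class, as its documentation describes; the labels and the name lossm_fn are added here for readability only.

```r
# matrix() fills column by column, so c(0, 1, 4, 0) puts 1 in row 2,
# column 1 and 4 in row 1, column 2
lossm <- matrix(c(0, 1, 4, 0), ncol = 2)

# Labels assume level order No, Yes with rows = actual, columns = predicted
dimnames(lossm) <- list(actual    = c("No", "Yes"),
                        predicted = c("No", "Yes"))
lossm

lossm["No", "Yes"]  # cost of predicting Yes for someone who stays: 4
lossm["Yes", "No"]  # cost of predicting No for someone who leaves: 1

# The mirrored matrix puts the weight 4 on missed leavers instead
lossm_fn <- matrix(c(0, 4, 1, 0), ncol = 2, dimnames = dimnames(lossm))
lossm_fn["Yes", "No"]  # 4
```

Under that reading, the matrix used in this section makes false positives the expensive error, which should be kept in mind when interpreting the results.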
rpart_loss <- rpart(Attrition ~ . ,
data = training,
parms = list(loss = lossm),
method = "class")
rpart.plot(rpart_loss)
Check against the testing data:
testing_pred <- testing %>%
mutate(Attrition_pred = predict(rpart_loss, newdata = testing, type = "class"))
tab <- testing_pred %>%
select(Attrition_pred, Attrition) %>%
table()
confusionMatrix(tab, positive = "Yes")
Confusion Matrix and Statistics
Attrition
Attrition_pred No Yes
No 356 68
Yes 4 13
Accuracy : 0.8367
95% CI : (0.7989, 0.87)
No Information Rate : 0.8163
P-Value [Acc > NIR] : 0.1476
Kappa : 0.2153
Mcnemar's Test P-Value : 1.131e-13
Sensitivity : 0.16049
Specificity : 0.98889
Pos Pred Value : 0.76471
Neg Pred Value : 0.83962
Prevalence : 0.18367
Detection Rate : 0.02948
Detection Prevalence : 0.03855
Balanced Accuracy : 0.57469
'Positive' Class : Yes
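Compare sensitivity and specificity with the original model without a loss matrix. With this loss matrix the tree predicts "Yes" much more rarely (17 cases against 32): specificity rises from 0.950 to 0.989, while sensitivity falls slightly from 0.173 to 0.160. This is the pattern we expect when false positives are the more costly error, which is how rpart() reads a matrix with the weight 4 in the upper right cell (rows are the actual class, columns the predicted class). Weighting false negatives more heavily would push the tree the other way: it catches more of those who actually leave, at the expense of more false positives. This trade-off between different types of errors is central to the rest of the course.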