Thursday, September 22, 2016

Become 70s on the Dataset SYLVA with a simple tree based algorithm with R

The project involves the SYLVA dataset that is part of a competition for the best search algorithm. The information about the contest is available at http://www.modelselect.inf.ethz.ch/datasets.php. SYLVA The data set consists of 216 variables with data 13086 training and validation data in 1308. The test data are incomplete and can not be used. They are used only to compare in-the algorithms fine them. Only the contest organizers have the results. The purpose of the model is a classification problem. To compare the results, calculate different scores. The origin of this data set is Ecological. There is no further information on the contents of this dataset. His discovery will therefore enable better understand and make decisions about the treatment that will be used to solve the modeling problem.

The programming language used is R. For purposes of analysis, the following libraries are used:
library(ade4)
library(lattice)
library(gridExtra)
library(randomForest)

Preliminary analysis


Loading data

Like any project , we first load the data. The datasets are divided into two : one for exploratory data and the other by the results. This scenario is the same as it is for training or validation data data. Test data results do not contain data; What makes them unusable for our analyse.


#Chargement des données d'entrainement et de validation
train.data <- read.table("data/sylva_train.data")
train.labels <- read.table("data/sylva_train.labels")
valid.data <- read.table("data/sylva_valid.data")
valid.labels <- read.table("data/sylva_valid.labels")

#on change la variable de resultat Y en facteur
train.labels$V1 <- as.factor(train.labels$V1)
colnames(train.labels)[1] <- "Y"
valid.labels$V1 <- as.factor(valid.labels$V1)
colnames(valid.labels)[1] <- "Y"



Kingdom - wide analysis

The number of variables 216 are too numerous to describe one by one. However it is possible to see a quick summary on a single image of all variables as individual boxes mustache.
#Analyse univarie
nb<-ncol(train.data)
#generation de boxplot pour toutes les variables
png("boxplot.png",width = 2400, height = 1400)
par(mfrow=c(12,18),mar=c(1.0,1.8,1.0,1.0))
for (i in 1:nb){
boxplot(train.data[,i],main = paste("Variable",i))
}
dev.off()

On the previous figure , one quickly notes that include different types of data. Quantitative data easily identifiable by complete box plots with identifiable quartiles. Values ​​thereof all vary between 0 and 1000. Qualitative or rather binary data which are characterized by box plots formed by a single line or trunk as a function of the frequency of 0 and 1. Of these, one can see domination with near 0 0 average .

Our variable of interest on it is comprised of a majority of class "-1" .
summary(train.labels)

Y
Effectif
-1
12281
1
805
The imbalance between "-1" and "1" will be in model accuracy .

Separation of quantitative and qualitative variables

To continue the analysis we separate variable types . We remove , qualitative variables that have a single modality .
train.quali<-train.data[,c(1,2,4:6,8,9,11:14,16,17,19:40,44:49,51:54,56:72,74,77:80,82:87,90:99,101:103,106:109,111,112,114:116,118:124,128:138,140:142,144,146,148,149,151:154,156:158,160:165,167:170,172:178,180,181,184:186,189,190,192:200,202:204,206,207,209,210,212,214:216)]
train.quant<-train.data[,c(3,7,10,18,41,42,43,50,55,75,76,81,88,89,100,104,105,110,113,117,125,126,127,139,143,147,150,155,166,171,179,182,183,187,188,191,201,205,208,211)]
valid.quali <- valid.data[,c(1,2,4:6,8,9,11:14,16,17,19:40,44:49,51:54,56:72,74,77:80,82:87,90:99,101:103,106:109,111,112,114:116,118:124,128:138,140:142,144,146,148,149,151:154,156:158,160:165,167:170,172:178,180,181,184:186,189,190,192:200,202:204,206,207,209,210,212,214:216)]
valid.quant <- valid.data[,c(3,7,10,18,41,42,43,50,55,75,76,81,88,89,100,104,105,110,113,117,125,126,127,139,143,147,150,155,166,171,179,182,183,187,188,191,201,205,208,211)]


There are 171 variables and 40 quantitative variables. This imbalance should be taken into account during the modeling .
qualitative variables changing factor for the rest of the analysis.
quali <- as.data.frame(lapply(rbind(train.quali,valid.quali), as.factor))
valid.quali <- quali[(nrow(train.quali)+1):nrow(quali),]
train.quali <- quali[1:nrow(train.quali),]


multivariate analysis

It is difficult with many variables to multivariate analysis in a short page . Separating the multivariate analysis of qualitative and quantitative data. The joint analysis still raises more questions in terms of study and therefore will not be performed due to space saving.

Multivariate analysis of quantitative variables

There are 171 variables to analyze. Given the number of elements, the analysis of multiple correspondancees can not view calmly . However , one can generate a Burt table which is a symmetric matrix by variable level .
#Tableau de burt des variables quantitatives
burtd<-acm.burt(train.quali,train.quali)
n <- names(burtd)
burtd.0=burtd[grep(pattern = ".0$", x = n),grep(pattern = ".0$", x = n)]
burtd.1=burtd[grep(pattern = ".1$", x = n),grep(pattern = ".1$", x = n)]
burtd.0.1=burtd[grep(pattern = ".0$", x = n),grep(pattern = ".1$", x = n)]
In the previous charts , we see the impact that the modality of a variable can have on another modality of another variable . So when we look at the modality "0" against the other terms "0" , we realize that with few exceptions , there is a strong correspondence in counting between the terms "0" for other variables. I do not know if we can speak of strong correlation in this case . Yet this is suspect. One can say the same thing between the terms "1" in smaller numbers because less present in our data. The terms "0" against "1" shows some variables so against strong sense counting and many variables with a low count .

Multivariate analysis of quantitative variables

In the context of this document we can look to fully study the data , so we choose to interest us more than simple correlations of quantitative variables.
corMAt<-cor(train.quant)
png("correlations.png",width = 1024, height = 768)
levelplot(corMAt,aspect="iso",pretty=TRUE,xlim=c(1,ncol(train.quant)), ylim=c(1,ncol(train.quant)),xlab="",ylab="",main="Correlation")
dev.off()

We see from the graph that many variables are not correlated. We see a couple of strong anti- correlation and a dozen strong correlation of the 800 correlations .


Conclusion

Studying 216 different kind of variables is tedious. Nevertheless , one could understand the characteristics of our variables. The variables are many and have such variables that behave identically . Quantitative variables are fewer and weakly correlated. The imbalance between the amounts of both types of variables , allows us to focus our thinking on the modeling methods to implement .

Model

Modeling is particular because the number of variables is greater than the number of quantitative variables. Conventionally, would have transformed our variables into quantitative variables with a Multiple Correspondence Analysis. This approach can not be envisaged without causing an imbalance in the modeling.
Another problem is the number of variables that we will try to reduce to facilitate the modeling effort.
The previous two elements lead us to change our quantitative variables variables. Then we use a Multiple Correspondence Analysis on which retrieves contact information for modeling.
We choose the study of a regression tree. We will compare this choice to an extension of this model: a random forest.
Of course, we will compare the results with the scores of the contest of our data set.

Reduction of dimension

To start cutting , we will transform the quantitative variables variables.
To perform this treatment, we cut each of our quantitative variables quantile .
train.quantQuali<-train.quant
for(i in 1:ncol(train.quantQuali)){
train.quantQuali[,i]=cut(train.quantQuali[,i],breaks= quantile(train.quantQuali[,i]),include.lowest=TRUE)
}
valid.quantQuali<-valid.quant
for(i in 1:ncol(valid.quantQuali)){
valid.quantQuali[,i]=cut(valid.quantQuali[,i], breaks=quantile(train.quantQuali[,i]),include.lowest=TRUE)
}

Then we gather all in one table .
train.FullAcm<-cbind(train.quantQuali,train.quali)
valid.FullAcm<-cbind(valid.quantQuali,valid.quali)
Then starts the Multiple Correspondence Analysis :
train.qualAcm<-dudi.acm(train.FullAcm,nf=ncol(train.FullAcm),scannf = FALSE)
On visualise les valeurs propres :
png("eigenvalue.png")
barplot(train.acm$eig)
dev.off()


The first eigenvalue is strongest . Its impact on the modeling is certainly the most important . There are several bearings. The choice of the number of variables to remember could be the fifth of the number of variables.
We use the Kaiser rule to define precisely this choice.






For a Multiple Correspondence Analysis , the Kaiser rule is :
kaiserLimit<-100/ncol(train.FullAcm)
kaiserLimitCol<-ncol(train.FullAcm)
for(i in 1:nrow(inertie)){
if(inertie[i,1]<kaiserLimit)
{
break
}
kaiserLimitCol<-i
}
Kaiser rule selects the first 39 axes.

Tree Rating

We must prepare the data from the ACM by recovering only the first 39 variables and adding our variable of interest.
#ajout des labels
train.full<-(cbind(train.labels,train.coord[,1:kaiserLimitCol]))
valid.full<-(cbind(valid.labels,valid.coord[,1:kaiserLimitCol]))
The tool used to calculate the regression tree is " rpart ". We use the default setting , which is to build a classification tree as our variable of interest and qualitative kind . This type of algorithm has been studied previously . The main idea of this algorithm is derived from CART is to create a complete tree then to support the tree minimizing the error rate.
We launch the calculation of the regression tree using rpart function.
#Calcul de l'Arbre de Classification
dataTree <- rpart(Y ~ ., data=train.full)
It displays the result tree.
#Affichage de l'arbre
png("arbre.png")
prp(dataTree,extra=1)
dev.off()


The generated tree shows that the first axis is a predominant component in the data classification . If it has a value less than 0.22 , is chosen directly "-1" class. If the value is greater than or equal to 0.36 , the class is "1". Between these two values ​​, we need other variables to make the decision.
Only five variables are necessary for this tree to deliver a ranking. The advantage of the reduction in variable compression allows this .

The results on the training set are provided in the following table :

-1
1
-1
12217
64
1
89
716
The overall error is 1.1 % = (64 + 89) / (12217 + 716) . This error rate is low. We must put this result, because it is the training data .
The results of the validation set are :

-1
1
-1
1223
5
1
6
74
The overall error is 0.84 % = (5 + 6) / (1223 + 74) . This result is better than that obtained on values. Our tree has good generalization ability .
Classes "-1" are effective in larger and has an error rate lower than the class "1".
In this exercise, we have to go one step further by calculating different indicators that allow us to calculate a score . The whole is comparable with the results of the contest in which the SYLVA data set is located .
We need to calculate the BER ( Balanced Error Rate) and Sigma to build the score
#Balanced Error Rate (BER)
ber<-function(tru,pred) {
test<-table(tru,pred)
a<-test[1,1]
b<-test[1,2]
c<-test[2,1]
d<-test[2,2]
retVal<-0.5*(b/(a+b) + c/(c+d))
return(retVal)
}
#Calcul du sigma
sigmaBer<-function(tru,pred) {
test<-table(tru,pred)
size<-test[1,2]+test[1,1]+test[2,1]+test[2,2]
p1<-test[1,2]/size
m1<-test[1,1]+test[1,2]
p2<-test[2,1]/size
m2<-test[2,1]+test[2,2]
sigma = 0.5*sqrt(p1*(1-p1)/m1+p2*(1-p2)/m2)
return(sigma)
}
After these two functions , you just apply and calculate ratios .
arbre.train<-ber(train.full$Y, predict(dataTree, train.coord, type="class"))
arbre.valid<-ber(valid.full$Y, predict(dataTree, valid.coord, type="class"))
arbre.sigma<-sigmaBer(valid.full$Y, predict(dataTree, valid.coord, type="class"))
#BER guess error
arbre.guess<-abs(arbre.valid-arbre.train)
#Test score
arbre.score<-arbre.valid + arbre.guess * (1- exp(-arbre.guess/arbre.sigma))

The table of results is:
Rank
Method
Balanced Error
Test guess
Guess error
Test score
Train
Valid
1
0.0034
0.0033
0.006
0.0001
0.0062
2
0.0046
0.0045
0.0065
0.001
0.0065
 
 
 
 
 
 
70
0.0491
0.0428
0.0491
0.0006
0.0501
 
Arbre de Classement
0.05788
0.03953
0.03953
0.01834
0.05772
71
0.0452
0.0557
0.0454
0.0078
0.0609
If we were to place our algorithm, we apparaîtrions between the 70th and the 71st . When looking at the algorithms used to make the best scores, we realize the complexity that can be implemented. Thus, we find all kinds of algorithms such as k-nearest neighbors , the selection of variables, Bayesian modeling, optimal discretization Baye or optimization of the output trigger on the BER indicator. In any event , the strategies are similar. First, we seek to reduce the dimensions by a transformation or a selection of variables. Then we look for a model that maximizes the reduction of the error.

Random forest

We continue with an algorithm which is an extension of regression trees. This is random forest. A random forest is composed of regression tree which was removed columns and rows by random sampling with replacement.
To initiate such a procedure, we use the library " randomForest ".
By reusing the data prepared for the regression tree , we are launching the calculation.
#Foret Aleatoire
train.rf <- randomForest(Y ~ ., data=train.full)
The number of trees of this forest defaults to 500 trees . It's a pretty big forest must by majority vote to be able to take right decision.
The results of the training data are:

-1
1
-1
12281
0
1
0
805
It's incredible. Our random forest has learned by heart our set of training data. There is no mistake.
The results of the validation data are less good.

-1
1
-1
1223
5
1
7
73
The results of our random forest are very similar to the results of our classification tree. An additional error on the class "1" is counting down .
This result may seem disappointing to the results obtained with the training data . Random forest is not larger generalization ability .