RESEARCHERS
FERNANDEZ Elmer Andres
Conferences and scientific meetings
Title:
SVM Tree with Optimal Multiclass Partition applied to Gene expression signature classification
Author(s):
PALLAROLO, MARIN; DIEGO ARAB COHEN; FRESNO, CRISTOBAL; PRATO, LAURA; FERNÁNDEZ, ELMER ANDRÉS
Location:
Parana
Meeting:
Conference; 3er Congreso Argentino de Bioinformática y Biología Computacional; 2012
Organizing institution:
FIUNER, A2B2C
Summary:
Abstract. Gene expression signatures are currently used to guide cancer therapy [1]. In many situations, they are
expected to successfully diagnose several disease types. However, this is not usually possible, either because a
large number of samples is needed or because the classes overlap in the feature space. One of the main
tools for multiclass classification problems is the Support Vector Machine (SVM), used under the well-known
one-vs-one (OVO) and one-vs-all (OVA) strategies and, more recently, the tree-based approach. Most tree-based SVM
classifiers split the multiclass space into several binary partitions, mostly by means of clustering-like algorithms.
One of the main drawbacks of this approach is that the natural class structure is not taken into account. Furthermore,
the same SVM parameterization is used for all partitions in the above-mentioned strategies. Here, we applied the SVMTOCP (SVM
tree optimal classification partition) [2], a new splitting methodology for multiclass problems with K > 2 classes.
It builds a two-class problem at each node of the tree by searching for the combination of input classes that yields
the best SVM performance at that node. This requires solving, at node i,

$$L_i = \eta \cdot \frac{K_i!}{r!\,(K_i - r)!} \qquad (1)$$

binary problems, where K_i is the number of classes at node i, η = 1 (0.5) for odd (even) K_i, and r = [K_i/2]. Once
the best solution is found at node i, r classes are passed to the child nodes and the process is repeated until a
leaf is reached. Although the training phase is expensive in time and computation, the proposed approach always
produces a balanced tree and preserves the original class structure. The latter property is very important from a
data mining point of view, because the resulting solution shows which class combinations yield soft- or hard-margin
solutions (tree nodes may have different kernel parameters) and automatically identifies which input classes are the
most difficult to split. These properties are very important for data analysts who need to extract hidden knowledge
from a multivariate database. The
SVMTOCP and the SVM OVO strategies were compared on three gene expression databases for classifying tumor
samples. In all cases, SVMTOCP achieved many more hard-margin (HM) solutions and fewer support vectors (SVs),
with no statistically significant difference in performance relative to the usual OVO approach. Reaching solutions
with fewer SVs and more HMs suggests a more robust classification strategy that needs fewer samples to achieve
efficient solutions. These are very desirable properties for genomic applications, where samples are scarce.
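The count in Eq. (1) and the node-wise search described in the abstract can be illustrated with a minimal Python sketch. It assumes that [K/2] denotes the integer part of K/2 and that η = 0.5 compensates for each even split appearing twice among the r-subsets. The function names and the `score` callable are hypothetical stand-ins (for training and evaluating one binary SVM), not part of the authors' implementation. Under these assumptions, the root node of an 8-class problem requires 35 binary problems and that of a 4-class problem requires 3.

```python
from itertools import combinations
from math import comb


def n_binary_problems(k: int) -> int:
    """Evaluate Eq. (1): eta * K! / (r! (K - r)!), assuming r = floor(K / 2)."""
    r = k // 2
    # eta = 0.5 for even K (assumed: every split shows up twice), 1 for odd K.
    return comb(k, r) // 2 if k % 2 == 0 else comb(k, r)


def candidate_splits(classes):
    """Yield each (left, right) class split with len(left) == floor(K / 2) once."""
    classes = sorted(classes)
    k = len(classes)
    for left in combinations(classes, k // 2):
        # For even K, a subset and its complement describe the same split;
        # keep only the copy whose left group contains the first class.
        if k % 2 == 0 and classes[0] not in left:
            continue
        right = tuple(c for c in classes if c not in left)
        yield left, right


def best_split(classes, score):
    """Return the split with the highest score.

    `score` is a hypothetical callable (left, right) -> float that stands in
    for fitting one binary SVM and measuring its performance.
    """
    return max(candidate_splits(classes), key=lambda split: score(*split))


def build_tree(classes, score):
    """Recursively split the class set until each leaf holds a single class."""
    classes = tuple(sorted(classes))
    if len(classes) == 1:
        return classes[0]
    left, right = best_split(classes, score)
    return (build_tree(left, score), build_tree(right, score))


if __name__ == "__main__":
    for k in (4, 5, 8):
        print(k, n_binary_problems(k), sum(1 for _ in candidate_splits(range(k))))
```

In a real setting, `score` would wrap the SVM training and model selection for one two-class problem; the toy scoring used to exercise the sketch carries no biological meaning.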
Table 1: Characteristics of the data sets used and classification performance of both strategies

                                      SVMTOCP             SVM OVO
DB             Instances  #Classes    %PE  %HMs  %SV      %PE  %HMs  %SV     Markers
NCI60 [4]      61         8 (5,9)     25   43    78       17   0     95      ** ** **
9 Tumors [5]   58         8 (6,9)     0    36    87       15   0     96      ** ** **
SCBR [6]       63         4 (8,23)    0    33    58       0    0     67      ** **
** p