
Abstract. Gene expression signatures are currently used to guide cancer therapy [1]. In many situations they are expected to diagnose several disease types successfully. However, this is often not possible, either because a large number of samples is needed or because the classes overlap in the feature space. One of the main tools for multiclass classification problems is the Support Vector Machine (SVM), used under the well-known one-versus-one (OVO) and one-versus-all (OVA) strategies and, more recently, under tree-based approaches. Most tree-based SVM classifiers split the multiclass space into several binary partitions, usually by means of clustering-like algorithms. One of the main drawbacks of this approach is that the natural class structure is not taken into account. Furthermore, the strategies mentioned above use the same SVM parameterization for all partitions. Here we applied SVMTOCP (SVM tree optimal classification partition) [2], a new splitting methodology for multiclass problems with $K > 2$ classes. It builds a two-class problem at each node of the tree by searching for the combination of input classes that yields the best SVM performance at that node. For node $i$ with $K_i$ input classes, this requires solving

$$L_i = \eta \cdot \frac{K_i!}{r!\,(K_i - r)!} \qquad (1)$$

binary problems, where $\eta = 1$ (0.5) for $K_i$ odd (even) and $r = [K_i/2]$. Once the best solution is found at node $i$, the $r$ classes of one group and the $K_i - r$ classes of the other are passed to the child nodes, and the process is repeated until a leaf is reached. Although the training phase is costly in both time and computation, the proposed approach always produces a balanced tree and preserves the original class structure. The latter property is very important from a data mining point of view, because the resulting solution shows which class combinations yield soft-margin or hard-margin solutions (tree nodes may have different kernel parameters) and automatically identifies the input classes that are hardest to separate.
These are very important properties for data analysts who need to extract hidden knowledge from a multivariate database. SVMTOCP and the SVM OVO strategy were compared on three gene expression databases for tumor sample classification. In all cases SVMTOCP achieved many more hard-margin (HM) solutions and fewer support vectors (SV) than the usual OVO approach, with no statistically significant difference in performance. Solutions with fewer SVs and more HM solutions suggest a more robust classification strategy that needs fewer samples to reach efficient solutions. These are valuable properties for genomic applications, where samples are scarce.

Table 1: Characteristics of the data sets used and classification performance of both strategies.

                                       SVMTOCP              SVM OVO
DB             Instances  #Classes     %PE  %HMs  %SV       %PE  %HMs  %SV
NCI60 [4]      61         8 (5,9)      25   43    78        17   0     95     ** ** **
9 Tumors [5]   58         8 (6,9)      0    36    87        15   0     96     ** ** **
SCBR [6]       63         4 (8,23)     0    33    58        0    0     67     ** **

** p
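
The node-splitting step described in the abstract can be illustrated with a short sketch. It is not the authors' implementation: it only counts and enumerates the candidate binary problems of Eq. (1) and builds a balanced tree by recursion. The names `count_partitions`, `balanced_partitions`, `build_tree` and `toy_score` are hypothetical, the integer part in $r = [K_i/2]$ is assumed to mean rounding down, and `toy_score` is a placeholder for the SVM performance measure, which the abstract does not specify.

```python
from itertools import combinations
from math import comb


def count_partitions(k):
    """Number of candidate binary problems at a node with k classes, Eq. (1)."""
    r = k // 2                      # assumed reading of r = [K_i / 2]
    c = comb(k, r)                  # K_i! / (r! (K_i - r)!)
    return c if k % 2 else c // 2   # eta = 1 for odd k, 0.5 for even k


def balanced_partitions(classes):
    """Yield every split of `classes` into groups of size r and k - r, once."""
    classes = tuple(classes)
    k = len(classes)
    r = k // 2
    for group in combinations(classes, r):
        # For even k, a split and its mirror image are the same binary
        # problem, so keep only the split whose first group has classes[0].
        if k % 2 == 0 and classes[0] not in group:
            continue
        rest = tuple(c for c in classes if c not in group)
        yield group, rest


def build_tree(classes, score):
    """Recursively keep the best-scoring split until single-class leaves.

    `score(group_a, group_b)` is a hypothetical stand-in for the SVM
    performance obtained on the corresponding two-class problem.
    """
    classes = tuple(classes)
    if len(classes) == 1:
        return classes[0]           # leaf: one original class
    left, right = max(balanced_partitions(classes), key=lambda p: score(*p))
    return (build_tree(left, score), build_tree(right, score))


def toy_score(group_a, group_b):
    """Toy criterion (an assumption): distance between group means."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return abs(mean_a - mean_b)


if __name__ == "__main__":
    for k in (4, 5, 8):
        print(k, count_partitions(k))
    print(build_tree(range(4), toy_score))
```

Under these assumptions, the root node of an 8-class problem such as NCI60 or 9 Tumors has 35 candidate binary problems, and the root of the 4-class problem has 3. In an actual application, `toy_score` would be replaced by training and evaluating an SVM on each candidate split, which is where the training cost mentioned in the abstract comes from.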
