Published on December 22, 2009
Slide 1: GENETIC PROGRAMMING FOR GENERATING PROTOTYPES IN CLASSIFICATION PROBLEMS Presented by: Tarundeep Dhot Dept of ECE Concordia University Slide 2: This presentation is based on a research paper written by the following authors: L.P. Cordella, C. De Stefano, F. Fontanella, A Marcelli This paper was published at: The 2005 IEEE Congress on Evolutionary Computation This presentation is solely meant for educational purposes. Acknowledgements Salient Features : Salient Features GP based approach for generating prototypes in a classification problem is proposed. The set of prototypes is encoded by a multitree. Multitrees are set of trees which represent the chromosome. Chromosomes are of variable length allowing classification for problems consisting of classes as well as further sub-classes. WHAT IS CLASSIFICATION? : WHAT IS CLASSIFICATION? Categorization: the act of distributing things into classes or categories of the same type. Classification problems undergo a supervised training phase. Set of labeled data samples indicating the class it belongs to is provided to the system. Based on such data, unknown data is classified using rules, decision trees or mathematical functions. SYSTEM / CLASSIFIER Data Sets (Labeled) Type X X1 X2 Type Y Y1 Fig: Training Phase PROPOSED APPROACH : Provides a new GP based method for determining the prototypes in a c-class problem where c 2. Prototypes describing samples belonging to ‘c’ different classes consists of logical expressions. Each prototype represents a cluster of samples in the training set. Each prototype consists of a set of assertions (logical predicates) connected by Boolean operators. The value of a particular feature in a sample depends on such assertions. A class can be represented by a variable number of logical expressions i.e. the length of a single expression (as well as its predicates) is variable, thus, the devised approach is able to automatically find subclasses present in the data set. PROPOSED APPROACH PROTOTYPE 1 PROTOTYPE 2 PROTOTYPE 3 X1 X2 X3 Y1 Y2 Y3 Z1 Z2 PROPOSED APPROACH (cont:) : PROPOSED APPROACH (cont:) The set of prototypes describing the classes makes up a single individual of the evolving population. Each prototype is encoded as a derivation tree, thus, an individual is a multi-tree (list of trees). Classification consists of attributing a sample to one of the classes i.e. associating the sample to one of the prototypes. Fitness function or value is the recognition rate obtained on the training set of an individual. Selection is done based on the fitness values of individuals. At the end of the process, the best individual obtained, constitutes the set of prototypes to be used for the considered application. DESCRIPTION OF THE APPROACH : DESCRIPTION OF THE APPROACH Set of prototypes Each prototype characterizes a class or subclass Each class or subclass consists of set of logical expressions Each expression may contain variable number of predicates. Each predicate establishes a condition on the value of a particular feature it is representing. If all predicates of an expression are satisfied by values in the feature vector describing a sample, we say the expression matches the sample. DESCRIPTION OF THE APPROACH (cont:) : DESCRIPTION OF THE APPROACH (cont:) CLASSIFICATION: Given a data set and a set of labeled expressions, the classification task is performed in the following way: Each sample of the data set is matched against the set of expressions and is either assigned to one of them or is rejected. Different cases may occur: Sample matches ONE expression: it is assigned to it. Sample matches more than one expression with DIFFERENT number of predicates: it is assigned to the expression with the SMALLEST number of predicates. Sample matches more than one expression with SAME number of predicates and different labels: sample is REJECTED. Sample matches NO expression: sample is REJECTED. Sample matches more than one expression with equal label: sample is assigned to the class the expression belongs to. IMPLEMENTATION OF EVOLUTIONARY APPROACH : IMPLEMENTATION OF EVOLUTIONARY APPROACH GP starts by randomly generating a population of p individuals. Phenotype String containing logical expressions, each one encoded as a derivation tree. Chromosome of an individual Multitree Number of trees in an individual’s chromosome is referred as length of the individual. Length of individual in the initial population [2, Lmax]. For generating a new population, best e individuals are selected and copied to new generation ELITISM. Tournament selection for remaining (p – e)/2 couples. This controls loss of diversity and selection intensity. Crossover is applied on selected couples with probability factor pc. Mutation is applied on individuals with probability factor pm. Finally, these individuals are added to the new population. Process is repeated for NG generations. LEARNING CLASSIFICATION RULES : LEARNING CLASSIFICATION RULES Before implementing the proposed evolutionary paradigm, the following steps must be executed: Structure Definition: definition of the structure to be evolved Training Phase and Fitness Function: choice of fitness function Genetic Operators: definition of genetic operators LEARNING CLASSIFICATION RULES (cont:) : LEARNING CLASSIFICATION RULES (cont:) STRUCTURE DEFINITION: The implementation requires a program generator providing syntactically correct programs and an interpreter to execute them. PROGRAM GENERATOR: Based on grammar written for S-expressions. Grammar G is defined as a quadruple = (T, N, S, P) where T and N are disjoint finite alphabets. S is the starting symbol P is the set of production rules used to define strings. INTERPRETER: is implemented by an automation that computes Boolean functions. Such an automation accepts an expression as an input and returns TRUE or FALSE as an output depending on whether the expression matches the sample or not . LEARNING CLASSIFICATION RULES (cont:) : LEARNING CLASSIFICATION RULES (cont:) TRAINING PHASE AND FITNESS FUNCTION: The system is trained with a set containing Ntr patterns. The training set is used for evaluating the fitness of the individuals in the population. Training set samples are assigned to the expressions belonging to each individual. Each valid expression of an individual is labeled with the most widely represented in the corresponding cluster. Recognition rate for each individual is evaluated using a classifier. This rate is assigned as the fitness value to the individual. In order to favor those individuals able to obtain good performance with lesser number of expressions, the fitness of each individual is increased by 0.1/Nc where Nc is the number of expressions in an individual. LEARNING CLASSIFICATION RULES (cont:) : LEARNING CLASSIFICATION RULES (cont:) GENETIC OPERATORS: Crossover and mutation are used as the genetic operator. Crossover: Applied to two chromosomes C1 and C2 and yields two new chromosomes by swapping the lists of the initial chromosomes. Fig: (a) and (b) Chromosome C1 and C2 of length 4 and 3. (c) and (d) Chromosomes obtained after crossover at t1 = 2 and t2 = 1 LEARNING CLASSIFICATION RULES (cont:) : LEARNING CLASSIFICATION RULES (cont:) GENETIC OPERATORS: Mutation: It is independently applied to every tree of the chromosome C with probability pm. More specifically, mutation operator is applied by randomly choosing a single non-terminal node Ti in a given tree Ts. If there are n non-terminal nodes in a chromosome C, probability of mutating each single node of C = pm / n. EXPERIMENTAL RESULTS : EXPERIMENTAL RESULTS The proposed approach was tested on three standard databases (IRIS, BUPA and Vehicle available on the UCI website and also compared another GP based approach. The method showed better results with higher recognition rates. Table: Average Recognition rates (%). R1 proposed classifer R2 comparison classifier. CONCLUSIONS : CONCLUSIONS GP based approach to prototype generation and classification is proposed. Population is evolved where each individual consists of a set of possible prototypes of the classes present in the data to be analyzed. A prototype consists of a set of logical expressions. Recognition rate obtained using each set of prototypes is used as fitness function for controlling evolution. Method is able to automatically cluster data. The method offers greater flexibility in comparison to other methods due to dynamic labeling mechanism of logical expressions. The comparison with another method show better results with the proposed GP. Slide 17: THANK YOU !!