Thursday, 16 February 2012

LIBSVM tutorial


What is SVM?

A Support Vector Machine is a learning algorithm typically used for classification
problems (Weather prediction, text categorization, handwritten character recognition, image
classification,Facial expression classification etc.). ie, if we have some sets of things classified
But we  know nothing about how we classified it, or we don't know the rules used for classification),
when a new data comes, SVM can PREDICT which set the data should belong to.

Suppose we have to differentiate a square from rectangle. how can we do it?
we know the logic is

if length=breadth then it is a square
else
it is a rectangle.

this is the logic we want to cluster using the SVM, we would have to give the have data in following format:

1 1:2 2:2
1 1:4 2:4
1 1:9 2:9
1 1:10 2:10
..........
..........
-1 1:5 2:6
-1 1:3 2:4
-1 1:6 2:9
-1 1:4 2:1
..........
..........

The above example is a TWO-class classification with labels +1 and -1 . +1 represents square and -1 represents rectangle.

Consider the 4 th line in the above format ie 1 1:10 2:10

1 represents this is a square.
For the second column (eg: 1:10) 1: represents the index value and 10 represents the length.
For the third column (eg: 2:10) 2: represents the index value and 10 represents the bredth.

Consider the 8 th line in the above format ie -1 1:4 2:1

-1 represents this is a rectangle.
For the second column (eg: 1:4) 1: represents the index value and 4 represents the length.
For the third column (eg: 2:1) 2: represents the index value and 1 represents the bredth.

This is the input file format of SVM.

[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...

label
Sometimes referred to as 'class', the class (or set) of your classification. Usually we put integers here.

index
Ordered indexes. usually continuous integers.

value
The data for training. Usually lots of real (floating point) numbers.

Why value, value2, ...?

The reason is usually the input data to the problem you were trying to solve involves lots of 'features'
or say 'attributes', so the input will be a set (or say vector/array).

For the above example we have 2 features lenght and breadth.

Installing Libsvm

We can download the libsvm from http://www.csie.ntu.edu.tw/~cjlin/libsvm/
libsvm.zip or libsvm.tar.gz

Contents in the .zip and .tar.gz are the same.
For windows users unzip the file. libsvm folder will be created.
UNIX users mostly prefer .tar.gz.

What is Libsvm?
Libsvm is a library for support vector machines.

How to Use Libsvm

We must follow the below procedure :
 1. Data Preparation for SVM
 2. Convert data into SVM format
 3. Conduct simple scaling on the data
 4. Consider the RBF kernel K(x; y) = e
 5. Use cross-validation to find the best parameter C and gamma
 6. Use the best parameter C and gamma to train the whole training set
 7. Test

We discuss this procedure in detail in the following sections

1. Data Preparation for SVM
We need to collect maximum data as possible. data set should contains both positive and negative data.
if it contains only one type of data(ie either positive or negative), it will shows wrong accuracy.
The selection of a negative data set is essential to the reliability of the prediction model.
we need to split the data set into two, one for training and other for testing.
After collecting the data , we need to convert both the training set and testing set into SVM format.

2.  Convert the data into SVM format
The SVM algorithm operates on numeric attributes.
So we need to convert the data into libsvm format which contains only numerical values.

For example , If we have imaginary data records like this:

man voice:low figure:big income:good
woman voice:high figure:slim income:fare

1. Convert the feature values to its numeric representation.
   Let's say, that best salary would be 5 and worst salary 1 (or no salary = 0), the same with other enumarated variables.
2. We have 2 classes, man and women . convert the classes to numeric values: man = 1, woman = -1
3. Save it in libsvm data format:

[class/target] 1:[firstFeatureValue] 2:[secondFeatureValue] etc.
ex.:
a women with great salary, low voice and small figure would be encoded like:
-1 1:5 2:1.5 3:1.8

In general the input file format of SVM is

[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...

label
Sometimes referred to as 'class', the class (or set) of your classification. Usually we put integers here.

index
Ordered indexes. usually continuous integers.

value
The data for training. Usually lots of real (floating point) numbers.


Is there a program to check if my data are in the correct format?
yes.
we can use the python script libsvm-3.11/tools /checkdata.py

Before doing that you need to install Python.
put checkdata.py in the Python folder. open a command prompt and type python checkdata.py filename.






3. Conduct simple scaling on the data

The original data maybe too huge or small in range, thus we can
rescale them to the proper range so that training and predicting will be faster.
The main advantage of scaling is to avoid attributes in greater numeric
ranges dominating those in smaller numeric ranges. Another advantage is to avoid
numerical difficulties during the calculation.

We recommend linearly scaling each attribute to the range [1; +1] or [0; 1].
we have to use the same method to scale both training and testing
data.

svm-scale is a tool for scaling input data file.
It is available at libsvm-3.11\windows

The syntax of svm-scale is
svm-scale [options] data_filename.

For options please refer README file in libsvm.






Running the above command generates 2 files. training.scale and range file.
In the above example we scale the file training.txt and scale each attribute/feature to the range [-1 +1].

The output of scaling is training.scale file, which is used for creating the model.
(model will be discussed later).
here we use option -s for saving the scaling parameters to range file.
Scaling factors are stored in the file range and that file is used for scaling the
test data.


we should scale the testing and scaling data in same range.
so for scaling the testing file we can use the range file .





Above example uses the range file for scaling the testing data. It generates the output testing.scale.

note: we can scale training and testing data in a similar way.
> svm-scale -s scaling_parameters train_data > scaled_train_data
> svm-scale -r scaling_parameters test_data > scaled_test_data

But using the same scaling factors for training and testing sets, we obtain much better
accuracy


4. Model Selection

After scaling the data set , we have to choose a kernel function for creating the model.
4 basic kernels are

  1. linear
  2. polynomial
  3. radial basis function
  4. sigmoid

In general, the RBF kernel is a reasonable first choice.
A recent result shows that if RBF is used with model selection, then there is no need to consider the linear kernel. The kernel matrix using sigmoid may not be positive definite and in general it's accuracy is not better than RBF.  Polynomial kernels are ok but if a high degree is used, numerical difficulties tend to happen.

In some cases RBF kernel is not used.
case 1: Number of instances < number of features/attributes

(instance means each entry in the txt file)

Consider an example1 .
we have 2 data sets one for training and other for testing
training set contains 38 instance and testing set contains 34 instance. number of features:7,129.

If you use RBF kernel the accuracy is 71.05
and for linear kernel accuracy is 94.73 (please see below)

In such cases linear kernel have more accuracy than RBF kernel.

There are two parameters for an RBF kernel: C and gamma
linear kernel has a penalty parameter C.
It is not known beforehand which C and gamma are best for a given problem; consequently some kind of model selection(parameter search) must be done. The goal is to identify good (C , gamma) so that the
classifier can accurately predict unknown data (i.e. testing data).

For selecting the best parameter value , we can use grid.py in the libsvm-3.11\tools directory.

grid.py is a parameter selection tool for C-SVM classification using
the RBF (radial basis function) kernel. It uses cross validation (CV) (will discuss later)
technique to estimate the accuracy of each parameter combination in
the specified range and helps you to decide the best parameters for
your problem.
grid.py directly executes libsvm binaries (so no python binding is needed)
for cross validation and then draw contour of CV accuracy using gnuplot.

Before using the grid.py , we should edit grid.py for setting the path of svmtrain_exe and gnuplot_exe

grid.py

if not is_win32:
       svmtrain_exe = "../svm-train"
       gnuplot_exe = "/usr/bin/gnuplot"
else:
       # example for windows
       svmtrain_exe = r"F:\lekshmi\libsvm-3.11\windows\svm-train.exe"
       # svmtrain_exe = r"c:\Program Files\libsvm\windows\svm-train.exe"
       gnuplot_exe = r"F:\lekshmi\gp444win32\gnuplot\binary\gnuplot.exe"

you should install python and gnuplot before running grid.py.
After installing python set the environment variable "PYTHONPATH" as C:\Python27(in my system python is installed in C directory)
put the grid.py in the python folder. open the command promt and change directory to the folder contains grid.py.

Consider the above example1 (Number of instances < number of features/attributes)

for linear kernel







for RBF kernel






now we can discuss the grid.py in detail
The syntax for grid.py is


grid.py [-log2c begin,end,step] [-log2g begin,end,step] [-v fold]
       [-svmtrain pathname] [-gnuplot pathname] [-out pathname] [-png pathname]
       [additional parameters for svm-train] dataset









10 comments:

  1. dear leksmi, can you tell me what ar x and y in svm scale oprion?
    --parthasarathi

    ReplyDelete
  2. Dear Lekshmi

    I am doing project on face recognition using MATLAB.I want to know how to use libsvm for multi-class classification.How it is used for images.Please help me out.

    Regards
    Mohan

    ReplyDelete
  3. Dear Lekshmi
    Please tell how to use SVM to separate noise from text in an image using MATLAB 7 and win Xp platform.

    ReplyDelete
  4. Dear Lekshmi
    please tell me what should we do for Gaussian kernel?

    Thanks

    ReplyDelete
  5. Dear Lekshmi,

    Do you have any java sample for face recognition?

    My email ID is sowyuen80@yahoo.com

    Thanks.

    ReplyDelete
  6. hi
    i need a help. if you impossible, how do i create a train and test set with 2 input and 1 output?
    i want to a matlab code same as : ex.m MatLab code.
    with 2 input and 1 out put with epsilon regression


    Thank you in advanced for your consideration and I am looking forward to
    hearing from you soon. Yours sincerely, H. Orouji
    my mail: hosseinoroji@gmail.com

    code: ex code
    N = 30;
    u = linspace(0,1,N)';
    y = 1 ./ (0.1 + u) + 0.3*randn(N,1);
    svm_type = 3;
    kernel_type = 2;
    gamma = 10;
    cost = 10;
    epsilon = 0.2;

    options = ['-s ', num2str(svm_type),...
    fprintf('Starting LIBSVM\n');
    tic;
    model = svmtrain(y, u, options);
    fprintf('Optimization finished in %3.2f sec\n',toc);
    yp = svmpredict(y, u, model);
    figure
    plot(u,y,'*')
    hold
    plot(u,yp,'r','linewidth',2)
    xlabel('u')
    ylabel('y')
    legend('process','model')

    ReplyDelete
  7. Dear Lekshmi,

    Is there any way to connect those LibSVM output using C#, so that we can use it for software with better GUI.

    If you have any experience with it we can share each other.

    This is my Google account : riezu.vgoth@gmail.com

    ReplyDelete
  8. hi,

    i have a set of data which is not in numerical but it is in words. So, how can use it in SVM? i didn't understand

    ReplyDelete
  9. Lekshmi,

    How can I use LibSVM using eclipse in java language? I didn't understand how to install it. Hope you can explain a little bit about this.

    You can reply to my email. iffah247@gmail.com

    ReplyDelete
  10. Dear Lekshmi

    i want to give a grade to each vector in the training mode
    and after that somehow measure the distance between the separator of the classes and each vector position and by that connect between the grads of each vector and their distance form the separator .... could you help me in that subject.
    also please know that i'm using c#
    thanks
    my email : talwaserman@gmail.com

    ReplyDelete