MATLABArsenal

A MATLAB Wrapper for Classification

Developed at:
Informedia
School of Computer Science
Carnegie Mellon University

Version: 0.99 Debug Version
Date: 05.03.2004

 

Overview

MATLABArsenal is an open-source wrapper for classification written in MATLAB.

Description

MATLABArsenal is an open-source wrapper written in MATLAB for the problem of supervised learning / classification.

Source Code

The program is free for scientific use. Please contact me if you plan to use the software for commercial purposes. The software must not be modified and distributed without prior permission of the author. If you use MATLABArsenal in your scientific work, please cite it as

http://finalfantasyxi.inf.cs.cmu.edu/tmp/MATLABArsenal.zip

This wrapper is powered by several other machine learning packages. Their binaries are already included in this release. If necessary, you can download their latest versions from their websites.

Installation for MATLAB source code

  1. You can skip this step if you already have a MATLAB environment; otherwise you have to install MATLAB on your machine.

  2. Download the package from http://finalfantasyxi.inf.cs.cmu.edu/tmp/MATLABArsenal.zip

  3. Unzip the .zip file into an arbitrary directory, say $MATLABArsenalRoot

  4. Add the path $MATLABArsenalRoot and its subfolders in MATLAB, using the addpath command or the menu File->Set Path.

Then it is ready to go.

 

Installation for binary code

  1. You can skip this step if you already have a Java Runtime Environment; otherwise it is better to install Java on your machine (for WEKA).

  2. Download the package from http://finalfantasyxi.inf.cs.cmu.edu/tmp/MATLABArsenalExec.zip

  3. Unzip the .zip file into an arbitrary directory, say $MATLABArsenalRoot

  4. Add the path $MATLABArsenalRoot/bin/win32 to the system path.

  5. Run demo1.bat in $MATLABArsenalRoot

Then it is ready to go.

 

How to use

This section explains how to use the MATLABArsenal software.

The main module of MATLABArsenal is called "test_classify".

It can be called in any of the following ways:

(1) In MATLAB Command line, type

 test_classify('classify -t input_file [options] [--@Evaluation [options]] ...
				-- Classifier [param] [-- Classifiers]');
For example, the following command uses SVM_LIGHT with an RBF kernel ('-Kernel 2') for classification, with kernel parameter 0.01 and cost factor 3. The first 100 examples in TREC03_com.CNN.hstat1 are used as training data, and the rest as testing data. The data features will be normalized to [0, 1] before classification ('-n 1').
   test_classify(strcat(
		'classify -t TREC03_com.CNN.hstat1 -n 1', ...
		' -- train_test_validate -t 100 ', ...
		' -- train_test_multiple_class ', ...
		' -- SVM_LIGHT -Kernel 2 -KernelParam 0.01 -CostFactor 3'));

You might need to run

clear global preprocess; 

before classification to clear the global variable "preprocess".

(2) Write a .m file in the current directory as follows, and run it:

	global preprocess; 
	%Normalize the data
   	preprocess.Normalization = 1;
	%Evaluation Method
   	%0: Train-Test Split
   	%1: Cross Validation
   	preprocess.Evaluation = 0;
	preprocess.TrainTestSplitBoundary = 100;


	% Multi-class classification
   	% 0: Classification
   	% 1: Multi-class Classification Wrapper
   	% 2: Multi-label Classification Wrapper
   	% 3: Multi-class Active Learning Wrapper
   	preprocess.MultiClassType = 1;
	preprocess.root = '.'; % Change to your directory
   	preprocess.output_file = sprintf('%s/_Result', preprocess.root);
   	preprocess.input_file = sprintf('%s/TREC03_com.CNN.hstat1', preprocess.root);
	run = test_classify('SVM_LIGHT -Kernel 2 -KernelParam 0.01 -CostFactor 3');

This example gives the same results as the one above.

(3) The binary runs from the DOS command line. Assuming the current directory is $MATLABArsenalRoot, type

 ./test_classify.exe "classify -t input_file [options] [--@Evaluation [options]] ...
				-- Classifier [param] [-- Classifiers]"
with the same parameters as before.

More detailed documentation is available at http://finalfantasyxi.inf.cs.cmu.edu/tmp/MATLABArsenalDoc/

Input & Output Formats

Available options are: (To be continued)

Output options: 
         -o OUTPUT_FILE   - The file name of the result output file
         -of [{'a'};'w']  - Overwrite the output file or append
Input options:
		 -if 0/1		  -	Use the first or the second type of input format

The input file contains the training examples. Currently two types of input format are accepted. The first type is,

<line> .=. <value> <value> ... <value> <target>
<target> .=. <integer>

<value> .=. <float>

Sample Input Format: (Each line represents one training example; the last number is the label, all others are the features)

   0, 0, 0.40, 0
   0, 1, 0.10, 0
   0, 0, 0.05, 1
   0, 0, 0.10, 1
   0, 0, 0.15, 0
   0, 0, 0.05, 0

For the second type of input format, each of the following lines represents one training example and is of the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
<target> .=. <integer>
 
<feature> .=. <integer> | "qid" 
<value> .=. <float>

Sample Input Format: (Each line represents one training example; the first number is the label, all others are the features; zero-valued features can be omitted)

0 1:0 2:0 3:0.40
0 1:0 2:1 3:0.10
1 1:0 2:0 3:0.05
1 1:0 2:0 3:0.10
0 1:0 2:0 3:0.15
0 1:0 2:0 3:0.05
OR
0 3:0.40
0 2:1 3:0.10
1 3:0.05
1 3:0.10
0 3:0.15
0 3:0.05
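
As an illustration of the two formats (this is a sketch in Python, not MATLABArsenal code; the helper names are hypothetical), both samples above encode the same data:

```python
# Illustrative only: minimal readers for the two input formats described
# above (dense comma/space-separated rows, and sparse "label feature:value"
# rows). Function names are hypothetical, not part of MATLABArsenal.

def parse_dense_line(line):
    """Dense format: feature values followed by an integer label."""
    parts = line.replace(",", " ").split()
    *features, label = parts
    return [float(v) for v in features], int(label)

def parse_sparse_line(line, num_features):
    """Sparse format: integer label, then 1-based feature:value pairs."""
    parts = line.split()
    label = int(parts[0])
    features = [0.0] * num_features          # omitted entries default to zero
    for pair in parts[1:]:
        idx, val = pair.split(":")
        features[int(idx) - 1] = float(val)  # feature indices are 1-based
    return features, label

# The corresponding lines of the two sample files describe the same example:
dense = parse_dense_line("0, 1, 0.10, 0")
sparse = parse_sparse_line("0 2:1 3:0.10", num_features=3)
```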

There are two major output files. By default, their names are set to $(input_file).pred and $(input_file).result. The file $(input_file).pred contains the prediction results for each test instance. The sample output format is,

Index   Prob    Pred    Truth
1       0.98    0       0
2       0.76    1       0
3       0.60    1       1
4       0.79    0       1
5       0.52    1       0
6       0.67    0       0

 

The file $(input_file).result contains the overall prediction statistics for the test set. The sample output format is,

Processing Filename: demo/DataExample1.txt
   Classifier:kNN_classify -k 3
   Message: Cross Validation, Folder: 3, Classification, 
   Error = 0.234679, Precision = 0.375293, Recall = 0.337795, F1 = 0.354157, MAP    = 0.289376, MBAP = 0.209378, 
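
The Error, Precision, Recall and F1 figures above are the standard binary-classification statistics. As an illustration (a Python sketch, not MATLABArsenal's own code; the helper name summarize is hypothetical), they can be computed from the Pred and Truth columns of the .pred file:

```python
# Illustrative only: conventional definitions of the statistics reported in
# a .result file, for the binary case with positive class = 1.

def summarize(pred, truth):
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    error = sum(p != t for p, t in zip(pred, truth)) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return error, precision, recall, f1

# Using the Pred and Truth columns of the sample .pred output above:
stats = summarize(pred=[0, 1, 1, 0, 1, 0], truth=[0, 0, 1, 1, 0, 0])
```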

Options and Classifiers

The basic grammar for MATLABArsenal's command is as follows. Note that "--" is used to separate different parts of the input commands.
Do not forget to add spaces before and after the "--", otherwise the wrappers cannot parse the command correctly.

 test_classify('classify -t input_file [general_option] [-- EvaluationMethod [evaluation_options]] ...
				[-- ClassifierWrapper [param] ] -- BaseClassifier [param]');

More details on the available options, classifiers and their default values follow.

1. The general options

	-v    (def 1):  preprocess.Verbosity, verbosity of messages
	-sf   (def 0):  preprocess.Shuffled, shuffle the data or not. 0 for no shuffling
	-n    (def 1):  preprocess.Normalization, normalize the data or not. 1 for normalizing
	-sh   (def -1): preprocess.ShotAvailable, shot information available or not. -1 for automatic detection
	-vs   (def 1):  preprocess.ValidateByShot, validate by shot or not
	-ds   (def 0):  preprocess.DataSampling, do data sampling or not. 0 for none
	-dsr  (def 0):  preprocess.DataSamplingRate, data sampling rate
	-svd  (def 0):  preprocess.SVD, SVD dimension reduction. The parameter is the number of reduced dimensions
	-fld  (def 0):  preprocess.FLD, FLD dimension reduction. The parameter is the number of reduced dimensions
	-map  (def 0):  preprocess.ComputeMAP, report mean average precision
	-if   (def 0):  preprocess.InputFormat, the input format, either 0 or 1
	-of   (def 0):  preprocess.OutputFormat, the output format, either 0 or 1
	-pf   (def 0):  preprocess.PredFormat, the prediction file format, either 0 or 1
	-chi  (def 0):  preprocess.ChiSquare, feature selection using the chi-squared measure
	-t    (def ''): preprocess.input_file, the input file name
	-o    (def ''): preprocess.output_file, the output file name
	-p    (def ''): preprocess.pred_file, the prediction file name
	-oflag (def 'a'): preprocess.OutputFlag, output flag. 'a' for appending, 'w' for overwriting
	-dir  (def ''): preprocess.WorkingDir, the working directory, i.e. $MATLABArsenalRoot
	-drf  (def ''): preprocess.DimReductionFile, the intermediate file for dimension reduction
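
The '-n 1' option scales the data features to [0, 1]. A plausible reading of this is ordinary per-feature min-max normalization; the following Python sketch is an assumption about that behaviour, not MATLABArsenal's actual code:

```python
# Illustrative only: per-feature min-max scaling to [0, 1], as implied by
# "-n 1" (preprocess.Normalization). This is an assumption about the
# behaviour, not code shipped with MATLABArsenal.

def minmax_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns become 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

# Normalize the first two feature columns of the dense sample data:
normalized = minmax_normalize([[0.0, 0.40], [1.0, 0.10], [0.0, 0.05]])
```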

2. The evaluation methods. The default method is to split the input file equally into training and testing sets.

 	train_test_validate (default method): split the input data into training and testing sets
		options: -t (def -2): The training-testing split boundary for the data set
	cross_validate: cross validation
		options: -t (def 3): The number of folds for cross validation
	test_file_validate: use the input file as the training set and an additional file as the testing set
		options: -t (def ''): The additional testing file
	train_only: use the input file for training only
		options: -m (def ''): The output model file
	test_only: use the input file for testing only
		options: -m (def ''): The input model file
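
The index bookkeeping behind train_test_validate (split at a boundary) and cross_validate (k folds) can be sketched as follows. This is illustrative Python; the helper names are hypothetical, and MATLABArsenal performs the equivalent internally:

```python
# Illustrative only: index generation for the two main evaluation methods.

def train_test_split_indices(n, boundary):
    """train_test_validate -t: first `boundary` examples train, rest test."""
    return list(range(boundary)), list(range(boundary, n))

def cross_validation_folds(n, k):
    """cross_validate -t: partition indices 0..n-1 into k contiguous test folds."""
    folds = []
    for i in range(k):
        test = list(range(i * n // k, (i + 1) * n // k))
        train = [j for j in range(n) if j not in set(test)]
        folds.append((train, test))
    return folds

train, test = train_test_split_indices(6, 4)   # e.g. "-t 4" on 6 examples
folds = cross_validation_folds(6, 3)           # e.g. 3-fold cross validation
```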

3. The multiclass classification wrappers. By default no wrappers are applied.

 	train_test_simple (default method): no wrappers are applied
	train_test_multiple_class: multi-class classification wrapper
		options: -CodeType (def 0): Coding scheme. 0 is one-against-all, 1 is pairwise coupling, 2 is ECOC-16
				 -LossFuncType (def 2): The type of loss function. 0 is logistic loss, 1 is exp loss, 2 is hinge loss
	train_test_multiple_label: multi-label classification wrapper
	train_test_multiple_class_AL: multi-class classification wrapper with active learning
		options: -CodeType (def 0): Coding scheme. 0 is one-against-all, 1 is pairwise coupling, 2 is ECOC-16
				 -LossFuncType (def 2): The type of loss function. 0 is logistic loss, 1 is exp loss, 2 is hinge loss
				 -ALIter (def 4): Iterations for active learning
				 -ALIncrSize (def 10): Incremental size per iteration
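
The one-against-all coding scheme (-CodeType 0) reduces a k-class problem to k binary sub-problems and predicts with the highest-scoring one. The Python sketch below shows the idea only; the per-class base scorer here is a trivial centroid distance, standing in for a real base classifier such as SVM_LIGHT:

```python
# Illustrative only: one-against-all decomposition. One scoring model is
# built per class; prediction takes the argmax over the per-class scores.
# The "base classifier" here is a toy centroid scorer, not a real SVM.

def one_against_all_train(X, y):
    """One sub-model per class: the mean (centroid) of that class's examples."""
    centroids = {}
    for c in sorted(set(y)):
        members = [x for x, t in zip(X, y) if t == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def one_against_all_predict(centroids, x):
    """Score each class model (here: negative squared distance), take argmax."""
    def score(c):
        return -sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
    return max(centroids, key=score)

model = one_against_all_train([[0.0], [0.1], [1.0], [1.1]], [0, 0, 1, 1])
label = one_against_all_predict(model, [0.9])
```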

4. The classifier wrappers. By default no wrappers are applied.

 	WekaClassify (param) -- (additional classifier and options): WEKA classification
		options: -MultiClassWrapper (def -1): Multi-class wrapper for WEKA. 1 for activation, 0 for deactivation, -1 for automatic selection
	MCActiveLearning: Active learning module
		options: -Iter (def 10): Iterations for active learning
				 -IncrSize (def 10): Incremental size per iteration
	MCAdaBoostM1: AdaBoost.M1
		options: -Iter (def 10): Iterations for AdaBoost
				 -SampleRatio (def 1): The ratio of data to be resampled per iteration. 1 means 100% of the data is resampled
	MCBagging: Bagging
		options: -Iter (def 10): Iterations for Bagging
				 -SampleRatio (def 1): The ratio of data to be resampled per iteration. 1 means 100% of the data is resampled
	MCDownSampling: Down sampling
		options: -PosNegRatio (def 0.5): The ratio of positive to negative data after sampling
	MCUpSampling: Up sampling
		options: -PosNegRatio (def 0.5): The ratio of positive to negative data after sampling
	MCHierarchyClassify (param) -- (meta classifier, param) [-- BaseClassifier]: Hierarchical classification, using the meta classifier on top
		options: -PosNegRatio (def 0.5): The ratio of positive to negative data after sampling
				 -SampleDevSet (def 0): Whether to use a sampled development set to learn the meta classifier. 0 for no
	MCWithMultiFSet: Hierarchical classification on multiple groups of features. See Example 5
		options: -Voting (def 0): Use the sum rule or majority voting to combine. 0 is the sum rule
				 -Separator (def 0): Separators for the multiple feature groups

5. The base classifiers.
 	SVM_LIGHT: SVM_light classification
		options: -Kernel (def 0): Kernel Type. 0 for linear, 1 for polynomial, 2 for RBF, 3 for sigmoid
				 -KernelParam (def 0.05): Kernel Parameter. d for polynomial, g for RBF, a/b for sigmoid
				 -CostFactor (def 1): Cost Factor, roughly the ratio of positive and negative data 
				 -Threshold (def 0): Classification threshold. Classified as positive if larger than the threshold  

 	SVM_LIGHT_TRANSDUCTIVE: SVM_light transductive classification
		options: -Kernel (def 0): Kernel Type. 0 for linear, 1 for polynomial, 2 for RBF, 3 for sigmoid
				 -KernelParam (def 0.05): Kernel Parameter. d for polynomial, g for RBF, a/b for sigmoid
				 -CostFactor (def 1): Cost Factor, roughly the ratio of positive and negative data 
				 -Threshold (def 0): Classification threshold. Classified as positive if larger than the threshold  
				 -TransPosFrac (def 1): Transductive postive fraction
 	
 	libSVM: libSVM classification 
		options: -Kernel (def 0): Kernel Type. 0 for linear, 1 for polynomial, 2 for RBF, 3 for sigmoid
				 -KernelParam (def 0.05): Kernel Parameter. d for polynomial, g for RBF, a/b for sigmoid
				 -CostFactor (def 1): Cost Factor, roughly the ratio of positive and negative data 
				 -Threshold (def 0): Classification threshold. Classified as positive if larger than the threshold  

 	mySVM: mySVM classification
		options: -Config (def N/A): the configuration file

	kNN_classify: kNN classification
		options: -k (def 1): number of neighbors
				 -d (def 2): distance type. 0 for Euclidean, 1 for chi-squared, 2 for cosine similarity
	GMM_classify: Gaussian Mixture Model classification
		options: -NumMix (def 1): number of mixtures for each class
	LDA_classify: Linear Discriminant Analysis classification
		options: -RegFactor (def 0.1): Regularization factor
				 -QDA (def 0): 0 for LDA, 1 for QDA
	IIS_classify: Maximum entropy model, IIS implementation
		options: -Iter (def 50): number of iterations
				 -MinDiff (def 1e-7): Minimum difference of log-likelihood
				 -Sigma (def 0): Regularization factor
	NeuralNet: Multi-layer perceptron (N/A for binary mode)
		options: -NHidden (def 10): Hidden units
				 -NOut (def 1): Output units
				 -Alpha (def 0.2): Weight decay
				 -NCycles (def 10): Number of training cycles
	LogitReg: Logistic regression
		options: -RegFactor (def 0): Regularization factor
				 -CostFactor (def 1): Cost factor
	LogitRegKernel: Kernel logistic regression
		options: -RegFactor (def 0): Regularization factor
				 -Kernel (def 0): Kernel Type. 0 for linear, 1 for polynomial, 2 for RBF, 3 for sigmoid
				 -KernelParam (def 0.05): Kernel Parameter. d for polynomial, g for RBF, a/b for sigmoid
	ZeroR: Do nothing, predict everything as zero
	WekaClassify -- trees.J48: C4.5 decision trees
	WekaClassify -- bayes.NaiveBayes: Naive Bayes
	For more WEKA classifiers, please refer to the WEKA manual.
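
What kNN_classify does conceptually, with Euclidean distance ('-d 0'), can be sketched in a few lines. This is a from-scratch illustration in Python, not the shipped classifier:

```python
# Illustrative only: k-nearest-neighbor classification with Euclidean
# distance and majority voting, as performed conceptually by
# "kNN_classify -k ... -d 0".
from collections import Counter

def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k nearest training examples."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Classify a query point against four 1-D training examples with k = 3:
pred = knn_predict([[0.40], [0.10], [0.05], [0.10]], [0, 0, 1, 1], [0.06], k=3)
```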

Getting started: some examples

Example 1

Classify DataExample1.txt
Shuffle the data before classification ('-sf 1')
50%-50% train-test split (default)
Linear Kernel Support Vector Machine

test_classify('classify -t DataExample1.txt -sf 1 -- LibSVM -Kernel 0 -CostFactor 3');

Example 2

Classify DataExample1.txt
Shuffle the data before classification ('-sf 1')
Reduce the number of dimensions to 15
3-fold Cross Validation
3-Nearest Neighbor

test_classify('classify -t DataExample1.txt -sf 1 -svd 15 -- cross_validate -t 3 -- kNN_classify -k 3');

Example 3

Classify DataExample2.txt
Do not shuffle the data
Use the first 100 examples as training, the rest as testing
Apply a multi-class classification wrapper
RBF Kernel SVM_LIGHT Support Vector Machine

test_classify('classify -t DataExample2.txt -sf 0 -- train_test_validate -t 100 -- train_test_multiple_class -- SVM_LIGHT -Kernel 2 -KernelParam 0.01 -CostFactor 3');

Example 4

Train with DataExample2.train.txt, Test with DataExample2.test.txt
Do not shuffle the data
Use Weka provided C4.5 Decision Trees
AdaBoostM1 Wrapper
No Multi-class Wrapper for Weka

test_classify(strcat('classify -t DataExample2.train.txt -sf 0 ', ...
   ' -- test_file_validate -t DataExample2.test.txt -- MCAdaBoostM1 -- WekaClassify -NoWrapper -- trees.J48'));

Example 5

Classify DataExample2.txt
Do not shuffle the data
Rewrite the output file
Use the first 100 examples as training, the rest as testing
Apply a stacking classification wrapper: first learn three classifiers based on features (1..120), (121..150) and (154..225), with majority voting on top
Improved Iterative Scaling with 50 iterations

test_classify(strcat('classify -t DataExample2.txt -sf 0 -of w', ...
' -- train_test_validate -t 100 -- MCWithMultiFSet -Voting 1 -Separator 1,120,121,150,154,225 -- IIS_classify -Iter 50'));
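
How the -Separator list maps to feature groups can be sketched as follows. This is illustrative Python, not MATLABArsenal code; the pairing of the flat separator list into 1-based inclusive (start, end) ranges is inferred from the example above:

```python
# Illustrative only: interpreting "-Separator 1,120,121,150,154,225" as the
# feature ranges (1..120), (121..150) and (154..225). A classifier would then
# be trained on each slice of the feature vector.

def feature_groups(separators):
    """Pair a flat separator list into 1-based inclusive (start, end) ranges."""
    return list(zip(separators[0::2], separators[1::2]))

def slice_group(x, group):
    """Extract one feature group from a feature vector x."""
    start, end = group
    return x[start - 1:end]      # convert 1-based inclusive range to Python slice

groups = feature_groups([1, 120, 121, 150, 154, 225])
```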

Example 6

Classify DataExample1.txt in two stages
Train the model using DataExample1.train.txt, saving it to DataExample1.libSVM.model
Linear Kernel Support Vector Machine

test_classify(strcat('classify -t DataExample1.train.txt -- train_only -m DataExample1.libSVM.model -- LibSVM -Kernel 0 -CostFactor 3'));

Test the new data in DataExample1.test.txt using DataExample1.libSVM.model
Linear Kernel Support Vector Machine

test_classify(strcat('classify -t DataExample1.test.txt -- test_only -m DataExample1.libSVM.model -- LibSVM -Kernel 0 -CostFactor 3'));

Example 7

Dimension Reduction of DataExample2.txt
Do not shuffle the data
Reduce the data to 15 dimensions with SVD, saving the reduced data to DataExample2_SVD15.txt ('-drf')
Use ZeroR as a placeholder classifier

test_classify('classify -t DataExample2.txt -sf 0 -svd 15 -drf DataExample2_SVD15.txt -- train_test_validate -t 1 -- ZeroR');

Example 8 (for binary code)

Similar to Example 1, assuming the current directory is $MATLABArsenalRoot
Classify DataExample1.txt
Shuffle the data before classification ('-sf 1')
3-fold Cross Validation
Linear Kernel Support Vector Machine

./test_classify.exe 'classify -t demo/DataExample1.txt -sf 1 -- cross_validate -t 3 -- LibSVM -Kernel 0 -CostFactor 3'

Example 9 (for binary code)

Similar to Example 1, assuming the current directory is $MATLABArsenalRoot/demo
Classify DataExample1.txt
Shuffle the data before classification ('-sf 1')
3-fold Cross Validation
Linear Kernel Support Vector Machine

../test_classify.exe 'classify -dir .. -t DataExample1.txt -sf 1 -- cross_validate -t 3 -- LibSVM -Kernel 0 -CostFactor 3'
 

Extensions and Additions

Questions and Bug Reports

If you find bugs, or you have problems with the code that you cannot solve by yourself, please contact me via email at yanrong@cs.cmu.edu.

Disclaimer

This software is free only for non-commercial use. It must not be modified and distributed without prior permission of the author. The author is not responsible for implications from the use of this software.

History

References

Last modified May 3rd, 2004 by Rong Yan