BCAR - an associative classification library

    Contents
  1. Introduction
  2. Composition
  3. Usage
  4. Source Code Download
  5. Citation
  6. Contact Information

  1. Introduction

  BCAR is a library for associative classification; the name stands for
"Boosting Class Association Rules". BCAR provides a general tool for
classification tasks with various types of input data: discrete, sequence, and
real-valued data. The library consists of two parts: front-end application
programs and back-end core library methods. Three front-end applications are
executed sequentially to perform classification.

	A. `wrules', mines and writes association rules to a file.
	B. `brules', builds a classifier using a boosting technique.
	C. `predict', predicts classes for test documents.

These three binaries call the routines inside the BCAR library. The generation
part for class association rules was extended from the implementation of the
FP-tree frequent pattern mining algorithms by Bart Goethals
(http://adrem.ua.ac.be/~goethals/software/). The classifier-building and
prediction parts were newly written for this library. The library is released
under the GNU Lesser General Public License, so anyone who wants to study
associative classification can use it freely.


  2. Composition

  An archive of the library can be downloaded from the website as a single
`tar' file. Most sources are written in C++; the utilities are Perl and Bash
scripts. Unpacking the tar file yields the following files.

	Table 1. Files in the `bcar.tar' archive.
	--------------------------------------------------------------------------
	Type                Source Files
	--------------------------------------------------------------------------
	library             fpgrowth_c.cpp, fptree_c.cpp, data_cf.cpp, item.cpp,
	                    boost.cpp, classify.cpp, fpgrowth.h, fptree.h,
	                    data.h, item.h, rule.h

	front-end           Write_r.cpp, Boost_r.cpp, Predict.cpp

	utility             batchrun.sh, writerl.sh, buildcl.sh, predicttd.sh,
	                    evalm.pl, evals.pl

	etc                 Makefile

	documentation       README, COPYING, COPYING.LESSER
	--------------------------------------------------------------------------

The program sources use the C++ Standard Template Library. The three target
binaries were compiled with the GNU C++ compiler using the accompanying
`Makefile', so it should be easy to port the sources to other platforms such as
Microsoft Windows. In a Linux environment, extract the files from the `tar'
file into a directory, then run the following command:

	$ make -f Makefile

This produces the library `bcar.a' and the binaries `wrules', `brules', and
`predict'. For simplicity, the output files are generated in the current
directory. You may install the necessary files into a system directory by
modifying the Makefile. Besides the source files, sample data files can also be
downloaded in a compressed format; we provide a pre-processed version of the
Reuters-21578 text collection.


  3. Usage

	Each front-end binary has a set of pre-defined arguments and input/output
filename conventions. For simple use of the library, we provide the utility
scripts listed in Table 1. The `batchrun.sh' script contains all the necessary
commands and arguments. Before explaining the commands and arguments, three
subdirectories are assumed to exist:

	`doc_mtx',      where training and test files reside,
	`out',          where rule files are stored,
	`cout',         where prediction score files are stored.

`batchrun.sh' and its subordinate scripts `writerl.sh', `buildcl.sh', and
`predicttd.sh' are designed to read input files from and write output files to
those directories. These scripts provide users with easy-to-use interfaces to
the front-end applications `wrules', `brules', and `predict'. Next, we explain
the format of the input/output files and the usage of the three executables in
detail.

	A. training data file

	  In association rule mining, a training data file is a set of
	transactional log records. Each record contains a set of items whose
	order is not important to the mining. The items may have discrete or
	continuous values, but continuous values must be discretized before
	being input to our library. The library accepts only integer-coded
	values as items. In text categorization, for example, the words in a
	document must be converted into integers using a document-indexing
	tool. A line in a training data file corresponds to one transaction
	record. Items in a line are separated by a blank. The first item in the
	line is not an actual item but a class label, which may consist of
	alphabetic characters plus the `_', `-', and `.' characters. The
	following lines are an example of training data.

		three 3 10 19 27 36 44 51 59 67 75 82 94 101 112 118 125 135 ...
		two 3 15 21 29 37 45 54 62 69 76 84 91 100 108 116 122 132 141 ...
		one 4 11 21 30 37 45 54 61 68 76 82 91 101 107 115 125 132 140 ...
		two 5 12 22 28 36 45 52 63 71 78 85 94 101 108 115 124 132 139 ...
		...

	  If the training data is multi-labeled, a record may have more than
	one class label. In that case, the record should be duplicated, with
	the same items but a different class label in each copy.

	B. test data file

	  Test data files have a format similar to the training data files,
	except for one point: in multi-labeled data sets, test instances carry
	multiple answer class labels on one line. They look like:

		trade  3 12 16 20 26 31 40 41 48 56 79 94 96 104 107 117 137 ...
		trade grain corn  28 31 50 79 83 88 96 107 113 117 149 261 262 ...
		money-fx interest  27 29 31 40 79 86 117 162 168 190 194 195 202 ...
		...

	In the above format, the class labels come first, then the items; the
	class labels and the items are separated by two blank characters.
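
	  To make both formats concrete, the following is a minimal C++ sketch
	of a line parser. It is our own illustration, not part of the BCAR
	sources; the names `ParsedRecord' and `parse_line' are hypothetical.
	The sketch treats every non-numeric token as a class label and every
	integer token as an item, which covers both formats above.

		#include <cctype>
		#include <sstream>
		#include <string>
		#include <vector>

		// One parsed record: its class label(s) and integer-coded items.
		struct ParsedRecord {
		    std::vector<std::string> labels; // one or several labels
		    std::vector<int> items;          // integer-coded items
		};

		ParsedRecord parse_line(const std::string& line) {
		    ParsedRecord rec;
		    std::istringstream iss(line);
		    std::string tok;
		    while (iss >> tok) {
		        if (std::isdigit(static_cast<unsigned char>(tok[0])))
		            rec.items.push_back(std::stoi(tok)); // item code
		        else
		            rec.labels.push_back(tok);           // e.g. "money-fx"
		    }
		    return rec;
		}

		int main() {
		    ParsedRecord r = parse_line("trade grain corn  28 31 50 79");
		    return (r.labels.size() == 3 && r.items.size() == 4) ? 0 : 1;
		}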

	C. wrules

	  `wrules' is invoked by the `writerl.sh' script, which is in turn
	called from `batchrun.sh'. This program mines class association rules
	from training examples. It accepts parameters related to frequent
	pattern mining, such as the minimum support and minimum confidence
	thresholds. These parameters can easily be set through the
	`batchrun.sh' script, and the input and output filenames can be
	designated there as well. For example, if the training data file is
	`./doc_mtx/reu.in', the `df' variable in `batchrun.sh' should be set to
	`reu'. Take particular care when setting the value of `gdep', the
	growth depth of the sub-FP-trees: it determines the length of the
	rules, and values greater than 3 can make the mining take an excessive
	amount of time, so they should be avoided.

	  If `wrules' executes successfully, it writes class association rules
	to the file "./out/filename.rule", where `filename' encodes the given
	parameter values. Although there is no need to edit this file directly,
	we show its content here:

		1 > 6552 0 1472 3 2709
		2 > 5764 0 1295 3 2438
		1 > 5705 0 1278 3 2426
		3 > 4253 0 1401 3 2639
		1 > 4216 0 1384 3 2622
		2 > 3749 0 1219 3 2366
		1 > 3720 0 1204 3 2354
		4 > 3384 0 754 3 1707
		1 > 3266 0 725 3 1655
		2 > 2915 0 631 3 1485
		...

	This rule file uses an intermediate format that is compact rather than
	intuitive. The left-hand side of `>' is the condition part, and the
	right-hand side is the consequent part. In the first line, `1' denotes
	an item expressed as a rank-id, `6552' the support of the item, `0' a
	class label code, `1472' the support of the rule with class label `0',
	and `3' another class label code. These first-step rule files generally
	contain a huge number of class association rules, so storing the rules
	in a very compact format is unavoidable.
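
	  As a reading aid, here is a small C++ sketch (our own illustration,
	not library code) that decodes one such line. It assumes the fields
	after the itemset support continue as (class label code, rule support)
	pairs, matching the description of the first line above.

		#include <iostream>
		#include <sstream>
		#include <string>
		#include <vector>

		int main() {
		    // Assumed layout: item rank-ids, '>', itemset support, then
		    // (class label code, rule support) pairs.
		    std::string line = "1 > 6552 0 1472 3 2709";
		    std::istringstream iss(line);
		    std::vector<int> items;
		    std::string tok;
		    while (iss >> tok && tok != ">")
		        items.push_back(std::stoi(tok)); // condition: rank-ids
		    long support, cls, clsSupport;
		    iss >> support;                      // support of the itemset
		    while (iss >> cls >> clsSupport)     // consequent alternatives
		        std::cout << "class " << cls << ": support "
		                  << clsSupport << '\n';
		    return 0;
		}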

	  `wrules' also produces another output file, the "word code" file.
	This file contains the mapping between the original item codes and the
	internally converted codes. The internal code of an item is its rank
	among all items in the transactions when they are ordered by support.
	The word code file is used as an input file by the `brules' and
	`predict' binaries.

	D. brules

	  `brules' is invoked by the `buildcl.sh' script, which is in turn
	called from `batchrun.sh'. `brules' boosts the rules generated by
	`wrules', then builds a final classifier from a small number of
	selected rules. Strictly speaking, `brules' only "selects" the rules;
	the classifier itself is realized by executing `predict'. The selected
	rules are stored in the file "./out/filename2.fnr", where `filename2'
	is augmented with the hit score threshold. The final classification
	rules have a more intuitive format:

		9 13 -> earn (1389, 1.0000)
		1 9 13 -> earn (1385, 1.0000)
		3 9 13 -> earn (1359, 1.0000)
		...
		12 587 -> acq (42, 0.8750)
		6 130 561 -> trade (42, 0.8750)
		...
	
	where the condition and consequent parts are separated by `->'. The
	first rule has the itemset "9 13", class label `earn', rule support
	1389, and confidence 1.0.
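
	  For illustration, one final rule can be represented as below, and
	whether a rule fires on a document reduces to an itemset-inclusion
	test. This is a sketch under our own naming, not the library's
	internal representation.

		#include <algorithm>
		#include <string>
		#include <vector>

		// One final rule such as "9 13 -> earn (1389, 1.0000)".
		struct FinalRule {
		    std::vector<int> items;  // condition itemset, sorted
		    std::string label;       // consequent class label
		    long support;            // rule support
		    double confidence;       // rule confidence
		};

		// A rule fires if every item of its condition occurs in the
		// document; both lists must be sorted for std::includes.
		bool fires(const FinalRule& r, const std::vector<int>& doc) {
		    return std::includes(doc.begin(), doc.end(),
		                         r.items.begin(), r.items.end());
		}

		int main() {
		    FinalRule r{{9, 13}, "earn", 1389, 1.0};
		    std::vector<int> doc = {1, 9, 13, 42}; // sorted item codes
		    return fires(r, doc) ? 0 : 1;          // fires here
		}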

	  `brules' accepts another parameter in `batchrun.sh', `hsth', which
	denotes the hit score threshold. This parameter plays a significant
	role in improving the classification accuracy of the classifier. An
	optimal value for the hit score threshold must still be determined
	empirically; a grid search over `hsth' can be conducted using the
	`batchrun.sh' script.

	E. predict

	  `predict' is invoked by the `predicttd.sh' script, which is in turn
	called from `batchrun.sh'. When you handle class-skewed data, you can
	optionally apply a weighting scheme to the prediction scores by setting
	the `wgt' variable to 1. When you use multi-labeled data, set the `ml'
	(multi-labeling) variable to 1. It is necessary to designate a test
	file as "tf=doc_mtx/testfile.in" in `batchrun.sh'. The output of the
	program is stored as "./cout/filename3", where `filename3' is
	`filename2' with `_c' appended. For example, multi-label prediction
	results look like this:

		1: trade  ; trade  1052.8018 acq   190.1348 earn   162.8906
		   money-fx    45.9139 interest    10.3302 crude     7.0687
		   grain     0.5263
		2: grain  ; grain    13.0862 earn     2.9189 acq     1.6973
		3: crude  ; crude    76.1770 trade    44.4820 money-fx    19.7688
		   earn     5.8108
		4: trade grain corn  ; trade    97.5271 earn    39.4855
		   crude    15.0773 grain     2.1764
		...

	  The answer labels and the prediction scores are separated by ` ; '.
	The prediction scores are output in descending order. In single-labeled
	classification, the first label appearing in the score part is the
	predicted one, while in multi-labeled classification the first m class
	labels can be taken as predicted ones. The value of m is determined by
	a pre-defined principle: in our scheme, every class label whose score
	exceeds a given ratio of the total score of all the class labels is
	taken as a predicted one.
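
	  As a concrete reading of this principle, the following C++ sketch
	selects the predicted labels from a score list. It is our own
	illustration; `select_labels' and the `ratio' parameter are
	hypothetical names, not part of the library API.

		#include <iostream>
		#include <string>
		#include <utility>
		#include <vector>

		// Select every label whose score exceeds ratio * (total score).
		std::vector<std::string> select_labels(
		        const std::vector<std::pair<std::string, double>>& s,
		        double ratio) {          // the given ratio, e.g. 0.2
		    double total = 0.0;
		    for (const auto& p : s) total += p.second;
		    std::vector<std::string> out;
		    for (const auto& p : s)
		        if (p.second > ratio * total)
		            out.push_back(p.first);
		    return out;
		}

		int main() {
		    // With ratio 0.2, only "trade" exceeds 0.2 of the total.
		    std::vector<std::pair<std::string, double>> s =
		        { {"trade", 1052.8}, {"acq", 190.1}, {"earn", 162.9} };
		    for (const auto& l : select_labels(s, 0.2))
		        std::cout << l << '\n';
		    return 0;
		}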

	  The performance of a classifier can be evaluated with overall
	performance measures such as accuracy, recall, precision, and the F1
	measure. We provide utilities to measure the performance of a
	classifier: `evalm.pl' for multi-label classification and `evals.pl'
	for single-label classification. You can easily run `evalm.pl' through
	the `batchrun.sh' script; it is also possible to determine an optimal
	score ratio for multi-label classification using the script. The
	following is an example output of `evalm.pl' when applying the
	Reuters-21578 collection to text categorization:

	Classification Evaluation Result - cout/reu_5p20_d3_h50_c

	  Classname    TP   FP   FN    TN   sum      P      R     F1   mBEP   Acc
	acq           710   25    9  1801  2545   96.6   98.7   97.7   97.7   98.7
	corn           17    2   39  2487  2545   89.5   30.4   45.3   59.9   98.4
	crude         155    9   34  2347  2545   94.5   82.0   87.8   88.3   98.3
	earn         1080   31    7  1427  2545   97.2   99.4   98.3   98.3   98.5
	grain         129    4   20  2392  2545   97.0   86.6   91.5   91.8   99.1
	interest      102   19   29  2395  2545   84.3   77.9   81.0   81.1   98.1
	money-fx      150   30   29  2336  2545   83.3   83.8   83.6   83.6   97.7
	ship           67    2   22  2454  2545   97.1   75.3   84.8   86.2   99.1
	trade         106   11   11  2417  2545   90.6   90.6   90.6   90.6   99.1
	wheat          27    2   44  2472  2545   93.1   38.0   54.0   65.6   98.2

	micro-Avged  2543  135  244              94.96  91.25  93.065  93.10   98.5
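
	  For reference, the measures in the table are computed from the
	per-class confusion counts; the following minimal C++ sketch (our own
	illustration, not `evalm.pl' itself) computes the micro-averaged
	values, which pool TP, FP, and FN over all classes before computing
	precision, recall, and F1.

		#include <cstdio>
		#include <vector>

		struct Counts { long tp, fp, fn; };

		int main() {
		    // Per-class TP/FP/FN counts, e.g. acq and corn from above.
		    std::vector<Counts> perClass = { {710, 25, 9}, {17, 2, 39} };
		    long TP = 0, FP = 0, FN = 0;
		    for (const auto& c : perClass) {
		        TP += c.tp; FP += c.fp; FN += c.fn;
		    }
		    double P  = 100.0 * TP / (TP + FP); // micro precision
		    double R  = 100.0 * TP / (TP + FN); // micro recall
		    double F1 = 2.0 * P * R / (P + R);  // harmonic mean
		    std::printf("micro P %.2f  R %.2f  F1 %.2f\n", P, R, F1);
		    return 0;
		}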

  4. Source Code Download

  You can download the source code of the BCAR library and build your own
binaries for your computing platform. We also provide pre-processed test data,
the Reuters-21578 document collection, which can be directly input to our
library.

  5. Citation

  The library implementation is based on the paper published at ICSC 2008 and
on others that have been submitted to related scientific journals. The
following is a reference to the ICSC 2008 paper:

	Y. Yoon and G. G. Lee, "Text categorization based on boosting
	association rules," in Proceedings of ICSC 2008, Second IEEE
	International Conference on Semantic Computing, 2008, pp. 136-143.


  6. Contact Information

  If you have any questions about the BCAR library, do not hesitate to contact
the author:

	Yongwook Yoon,
	
	Dept. of Computer Science and Engineering, Pohang University of Science
	and Technology (POSTECH), San 31 Hyoja-dong, Pohang, Korea, 790-784.
	Tel) +82-54-279-5581, Fax) +82-54-279-2299, Mobile) +82-10-2878-2357,
	Email) ywyoon@postech.ac.kr,
	Homepage) http://isoft.postech.ac.kr/~ywyoon