ARFF reader/writer for MATLAB

After the earlier posts (1, 2) on ARFF tools for MATLAB, I would like to show some usage examples, but first I’m going to give some insight into their design, which should help in understanding the approach used by the ARFF library.

While writing the code I started from Weka’s ARFF documentation, in particular the stable version of the ARFF specification, and I chose to leave out Sparse ARFF support (at least for the first implementation).

ARFF MATLAB

Since MATLAB code is a different world from the standard Java API used by Weka, I chose to represent the ARFF payload, the instances, as a single struct array holding the whole dataset. As you may know, the instances (the real data) come with a brief description of their attributes. This extra piece of information, which is common to all instances, is located in the header section of the ARFF file. It gives each attribute’s name and type and tells Weka how to read, update and write the file content.

For instance, a common header could include several attributes and just one nominal specification:

@RELATION example_dataset

@ATTRIBUTE idx NUMERIC
@ATTRIBUTE low NUMERIC
@ATTRIBUTE med NUMERIC
@ATTRIBUTE high NUMERIC
@ATTRIBUTE type_class { front, middle, rear }

and after that comes the payload (aka the instances):

@DATA
1,6,53,95,rear
2,27,57,96,rear
3,6,66,70,middle
4,7,42,78,front
5,17,65,80,middle
6,20,57,80,rear
...
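
To make the struct array idea concrete, here is a minimal sketch of how the first three instances above could look once they are in memory. It is written by hand and assumes one struct per instance with one field per attribute; it is an illustration, not output produced by the library.

```matlab
% Hypothetical in-memory form of the first three instances listed above:
% one struct per instance, one field per @ATTRIBUTE.
data = struct( ...
    'idx',        {1, 2, 3}, ...
    'low',        {6, 27, 6}, ...
    'med',        {53, 57, 66}, ...
    'high',       {95, 96, 70}, ...
    'type_class', {'rear', 'rear', 'middle'});

% Numeric attributes can be gathered into plain vectors ...
high_values = [data.high];

% ... while nominal values are collected into a cell array of strings.
types = {data.type_class};

% Select the instances that carry a given nominal value.
rear_only = data(strcmp(types, 'rear'));
fprintf('%d of %d instances are "rear"\n', numel(rear_only), numel(data));
```

Passing cell arrays to struct() builds a 1x3 struct array in one call, which keeps the sketch short; assigning data(1), data(2), ... field by field would give the same result.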

A simple text format is an advantage for in-place editing, and implementing the parser directly in MATLAB code is not hard. However, there is a small problem: dealing with nominal attributes (i.e. attributes which allow only a limited set of string values) isn’t so straightforward inside the MATLAB environment. To keep things simple I used a small workaround: an extra argument (nomspec) that describes the value mappings of the nominal attributes during ARFF parsing. It probably isn’t the cleanest solution, but it does its job.
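
As a rough idea of what reading a nominal specification involves, the sketch below turns one header line into a nomspec entry with a regular expression. This is not the library’s actual parser: the pattern is my own assumption and it does not handle quoted values that contain commas or spaces.

```matlab
% Sketch only: a hand-rolled parse of a single nominal @ATTRIBUTE line.
line = '@ATTRIBUTE type_class { front, middle, rear }';

% Capture the attribute name and the comma-separated list inside the braces.
% ARFF keywords are case-insensitive, hence the 'ignorecase' option.
tok = regexp(line, '^@ATTRIBUTE\s+(\S+)\s*\{(.*)\}\s*$', ...
             'tokens', 'once', 'ignorecase');

nomspec = struct();
if ~isempty(tok)
    name   = tok{1};
    % Split on commas and strip the surrounding blanks from each value.
    values = strtrim(strsplit(tok{2}, ','));
    nomspec.(name) = values;
end

disp(nomspec.type_class);
```

Numeric attributes simply do not match the pattern, so they leave nomspec untouched.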

Using a simple struct array to hold the whole dataset payload doesn’t help when it comes to nominal attributes, because any datatype can be assigned to a struct field. From the parser’s point of view, however, this can be overcome with a simple convention: just append the “_class” string to the name of each struct field that needs to be mapped to a nominal attribute.

In terms of code, this approach needs just a few lines when saving an ARFF dataset:
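
The convention can be illustrated with a small check that finds the “_class” fields and compares their values against nomspec. This is a sketch under my own assumptions (the second instance and the variable names are made up for the example), not code taken from the library.

```matlab
% Two hand-made instances; the second one is hypothetical and deliberately
% uses a value that is missing from the nominal specification.
data(1).amplitude  = 123;
data(1).type_class = 'a';
data(2).amplitude  = 456;
data(2).type_class = 'z';

nomspec.type_class = {'a', 'b', 'c'};

% Fields whose name ends with "_class" are treated as nominal attributes.
names = fieldnames(data);
is_nominal = ~cellfun(@isempty, regexp(names, '_class$', 'once'));
nominal_fields = names(is_nominal);

for k = 1:numel(nominal_fields)
    f = nominal_fields{k};
    % true where the instance value is listed in the specification
    ok = ismember({data.(f)}, nomspec.(f));
    invalid_count = sum(~ok);
    fprintf('%s: %d invalid value(s)\n', f, invalid_count);
end
```

A check like this could be run before writing a file, so that an unexpected value is caught in MATLAB rather than later in Weka.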

% add attributes to an instance
data(1).amplitude = 123;
data(1).type_class = 'a';
% ...

% define nominal specification
nomspec.type_class = { 'a', 'b', 'c' };

% save the dataset
arff_write(arff_file, data, relname, nomspec);

Or when loading it:

% import data
[data, relname, nomspec] = arff_read(arff_file);

% check nominal spec attribute
nomspec.type_class

>> { 'a' 'b' 'c' }
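
Once data and nomspec are available, the nominal strings can be turned into integer codes, which is often handier for numeric work in MATLAB. The sketch below uses hand-made stand-ins for the loaded variables (the values are hypothetical) so that it runs without an ARFF file.

```matlab
% Stand-ins for what a load might return (hypothetical values).
nomspec.type_class = {'a', 'b', 'c'};
data = struct('amplitude', {123, 45, 67}, 'type_class', {'a', 'c', 'b'});

% Position of each value inside the nominal specification;
% a 0 would mean the value is not listed there.
[~, codes] = ismember({data.type_class}, nomspec.type_class);
disp(codes);

% The codes can be mapped back to the original labels at any time.
labels = nomspec.type_class(codes);
disp(labels);
```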

Using this hacky solution you can unleash the power of the ARFF file format in your MATLAB/Weka simulations without needing any dataset conversion. Handy, isn’t it?

For more extensive usage examples or for more info about these tools please look at the ARFF reader/writer page.
