About me

New York, New York, United States
My name is 江奕賢

June 16, 2008

How to use Weka to predict new data

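Someone asked me how to use Weka to actually predict things.
Most of the material online is pretty formulaic:
it just walks you through loading a bunch of labeled data,
then tells you how much accuracy you get if you run some classifier to "predict".
That's fine for learning and playing around with data mining,
but it never gets you as far as predicting genuinely new data.

So I'm jotting down a quick note that I can follow step by step later.

In the example below, test.csv is the known data with labels,
and pred.csv plays the role of the unknown data.
Real-world data is usually large, and you'll want to be able to throw it onto a server,
so the example below drives Weka from the command line.
I wrote this casually, so read it casually. No customer support included.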
======================

I made a test.csv file as follows:

id value cat
1 1 a
2 2 a
3 3 b
4 4 b
5 5 c
6 6 c

Then I loaded it into Weka and saved it as test.arff, which gave me:

@relation test

@attribute id numeric
@attribute value numeric
@attribute cat {a,b,c}

@data

1,1,a
2,2,a
3,3,b
4,4,b
5,5,c
6,6,c
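
Since the whole point is to run this on a server, the CSV-to-ARFF step can probably be done without the GUI as well. Here is a minimal sketch using Weka's weka.core.converters.CSVLoader class, which prints the ARFF to standard output. The WEKA_JAR variable and the csv_to_arff function name are assumptions for illustration, not something Weka ships.

```shell
# Convert a CSV file to ARFF from the command line instead of the Weka GUI.
# WEKA_JAR is an assumed variable; point it at wherever weka.jar lives.
WEKA_JAR="${WEKA_JAR:-weka.jar}"

csv_to_arff() {
  # $1 = input CSV, $2 = output ARFF
  java -cp "$WEKA_JAR" weka.core.converters.CSVLoader "$1" > "$2"
}

# Usage:
#   csv_to_arff test.csv test.arff
```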


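Then I used the following command to train a J48 tree and save it. Here -t names the training file, -d saves the trained model to j48.model, and -x 3 asks for 3-fold cross-validation: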

java -cp weka.jar weka.classifiers.trees.J48 -t test.arff -d j48.model -x 3

The next step is the prediction itself.
Prepare the new data the same way as test.arff.
This time you don't have the labels, so put in whatever label you like, as long as it is a valid one (for this example, one of {a,b,c}).
For me, I made pred.csv as follows:

id value cat
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
6 6 a
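
If the new data arrives with no cat column at all, the dummy label can be appended on the command line. This is a minimal sketch, assuming a comma-separated file whose first line is the header. The file name unlabeled.csv and the function name add_dummy_label are made up for illustration.

```shell
# Append a "cat" column filled with a placeholder label to every data row.
# Assumes comma-separated input whose first line is the header.
add_dummy_label() {
  # $1 = placeholder label; it must be one of the valid labels, e.g. a
  awk -v label="$1" 'NR == 1 { print $0 ",cat"; next } { print $0 "," label }'
}

# Usage (hypothetical file names):
#   add_dummy_label a < unlabeled.csv > pred.csv
```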

Then I made the ARFF file using Weka.
I ended up with the following lines in pred.arff:
@relation pred

@attribute id numeric
@attribute value numeric
@attribute cat {a,b,c}

@data

1,1,a
2,2,a
3,3,a
4,4,a
5,5,a
6,6,a


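As generated by Weka, this file gives wrong predictions, because Weka thinks "cat" can only be "a":
every row in pred.csv is labeled a, so the generated header says "@attribute cat {a}".
Change that line to "@attribute cat {a,b,c}" (the listing above already shows the corrected line).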
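
On a server you may prefer to patch the header without opening an editor. This is a minimal sketch using sed, assuming the generated line looks exactly like the one quoted above. fix_cat_header is a made-up helper name, and the label set {a,b,c} is hard-coded for this example.

```shell
# Replace whatever label set Weka inferred for "cat" with the full set
# used in training, so pred.arff matches the header of test.arff.
fix_cat_header() {
  # $1 = ARFF file to read; the patched ARFF goes to standard output
  sed 's/^@attribute cat {.*}$/@attribute cat {a,b,c}/' "$1"
}

# Usage:
#   fix_cat_header pred.arff > pred_fixed.arff
```

Write to a new file and then rename it (or pass the new file to -T), because redirecting straight back into pred.arff would truncate it before sed reads it.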

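Then you can run the following, where -T gives the file to predict, -l loads the saved model, and -p 0 prints one prediction per instance:
java -cp weka.jar weka.classifiers.trees.J48 -T pred.arff -l j48.model -p 0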

and you'll get output like this:
0 a 1.0 a
1 a 1.0 a
2 b 1.0 a
3 b 1.0 a
4 c 1.0 a
5 c 1.0 a

This means Weka predicts the first instance (index 0) as "a" with confidence 1.0, while you stated it is "a". But remember, the "a" you stated here is just a dummy label.

So, as you can see, the prediction is in the second column and the confidence is in the third column. The fourth column is only the dummy label you put in.
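
To hand the predictions to another program, it helps to drop the dummy column. This is a minimal sketch with awk, assuming the four whitespace-separated columns shown above (other Weka versions may print a different layout). extract_predictions is a made-up name.

```shell
# Keep instance index, predicted label and confidence as CSV;
# drop the fourth column, which is only the dummy label.
extract_predictions() {
  # Reads Weka's -p 0 output from the given files, or from standard input.
  awk 'NF == 4 { print $1 "," $2 "," $3 }' "$@"
}

# Usage:
#   java -cp weka.jar weka.classifiers.trees.J48 -T pred.arff -l j48.model -p 0 | extract_predictions
```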