Saturday, January 31, 2009

Data Mining, Deeper

In this exercise, I found the limits of expression in the realm of data within the first part. I can see some utility in Weka (pronounced like Mecca), and watching the data dance about on the screen proves to be more than a little interesting.
However, while using the tool and fiddling around with the data (I am not really sure which data set I should be using), I discovered some tendencies and patterns that could prove useful in the long run. I would have to use this with some familiar data, just so I could better understand what is coming out of it. (Something like needing to see the answer before asking the question.)

Using the Heart data and J48, I got something like this:


=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: cleveland-14-heart-disease
Instances: 303
Attributes: 14
age
sex
cp
trestbps
chol
fbs
restecg
thalach
exang
oldpeak
slope
ca
thal
num
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

thal = fixed_defect
| ca <= 0
| | exang = no: <50 (5.06)
| | exang = yes: >50_1 (3.06/1.0)
| ca > 0: >50_1 (10.0)
thal = normal
| ca <= 0: <50 (117.21/12.55)
| ca > 0
| | cp = typ_angina
| | | trestbps <= 138: >50_1 (4.0/1.0)
| | | trestbps > 138: <50 (3.0)
| | cp = asympt: >50_1 (20.0/3.0)
| | cp = non_anginal: <50 (13.9/1.0)
| | cp = atyp_angina
| | | restecg = left_vent_hyper
| | | | exang = no: >50_1 (3.0)
| | | | exang = yes: <50 (2.0)
| | | restecg = normal: <50 (4.0)
| | | restecg = st_t_wave_abnormality: <50 (0.0)
thal = reversable_defect
| cp = typ_angina
| | chol <= 229: <50 (3.0)
| | chol > 229
| | | age <= 48: >50_1 (2.0)
| | | age > 48: <50 (3.0/1.0)
| cp = asympt
| | oldpeak <= 0.6
| | | restecg = left_vent_hyper: >50_1 (8.0/1.0)
| | | restecg = normal
| | | | trestbps <= 136
| | | | | ca <= 0: <50 (4.0)
| | | | | ca > 0
| | | | | | thalach <= 151: <50 (2.0)
| | | | | | thalach > 151: >50_1 (3.0)
| | | | trestbps > 136: >50_1 (4.0)
| | | restecg = st_t_wave_abnormality: >50_1 (0.0)
| | oldpeak > 0.6: >50_1 (57.39)
| cp = non_anginal
| | slope = up: <50 (7.39/1.0)
| | slope = flat
| | | ca <= 0
| | | | trestbps <= 122: <50 (3.0)
| | | | trestbps > 122: >50_1 (3.0)
| | | ca > 0: >50_1 (8.0/1.0)
| | slope = down: <50 (1.0)
| cp = atyp_angina
| | ca <= 0
| | | oldpeak <= 0.1: <50 (4.0)
| | | oldpeak > 0.1: >50_1 (2.75/0.75)
| | ca > 0: >50_1 (2.25/0.25)

Number of Leaves : 30

Size of the tree : 51


Time taken to build model: 0.14 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 235 77.5578 %
Incorrectly Classified Instances 68 22.4422 %
Kappa statistic 0.5443
Mean absolute error 0.1044
Root mean squared error 0.2725
Relative absolute error 52.0476 %
Root relative squared error 86.5075 %
Total Number of Instances 303

=== Detailed Accuracy By Class ===

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.83      0.29      0.774       0.83     0.801       0.809      <50
                 0.71      0.17      0.778       0.71     0.742       0.809      >50_1
                 0         0         0           0        0           ?          >50_2
                 0         0         0           0        0           ?          >50_3
                 0         0         0           0        0           ?          >50_4
Weighted Avg.    0.776     0.235     0.776       0.776    0.774       0.809

=== Confusion Matrix ===

a b c d e <-- classified as
137 28 0 0 0 | a = <50
40 98 0 0 0 | b = >50_1
0 0 0 0 0 | c = >50_2
0 0 0 0 0 | d = >50_3
0 0 0 0 0 | e = >50_4
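
To convince myself where the summary numbers come from, here is a rough sketch in Python (my choice for illustration, not anything Weka itself runs) that recomputes the accuracy and the Kappa statistic from the confusion matrix above. The function names are my own, and I am assuming Weka's Kappa is the usual Cohen's kappa, which the numbers seem to bear out.

```python
# Confusion matrix copied from the 10-fold cross-validation run above.
# Rows are actual classes, columns are predicted classes (<50, >50_1).
# The three empty classes (>50_2 .. >50_4) are all zeros and are left out.
cv_matrix = [
    [137, 28],
    [40, 98],
]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement, from the actual and predicted class frequencies.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


print(f"accuracy: {accuracy(cv_matrix) * 100:.4f} %")  # 77.5578 %
print(f"kappa:    {kappa(cv_matrix):.4f}")             # 0.5443
```

Both values agree with the "Correctly Classified Instances" and "Kappa statistic" lines in the summary, so at least those two figures can be read straight off the matrix.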

The other output, evaluated on the training data instead of with cross-validation, looks like this:

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: cleveland-14-heart-disease
Instances: 303
Attributes: 14
age
sex
cp
trestbps
chol
fbs
restecg
thalach
exang
oldpeak
slope
ca
thal
num
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

thal = fixed_defect
| ca <= 0
| | exang = no: <50 (5.06)
| | exang = yes: >50_1 (3.06/1.0)
| ca > 0: >50_1 (10.0)
thal = normal
| ca <= 0: <50 (117.21/12.55)
| ca > 0
| | cp = typ_angina
| | | trestbps <= 138: >50_1 (4.0/1.0)
| | | trestbps > 138: <50 (3.0)
| | cp = asympt: >50_1 (20.0/3.0)
| | cp = non_anginal: <50 (13.9/1.0)
| | cp = atyp_angina
| | | restecg = left_vent_hyper
| | | | exang = no: >50_1 (3.0)
| | | | exang = yes: <50 (2.0)
| | | restecg = normal: <50 (4.0)
| | | restecg = st_t_wave_abnormality: <50 (0.0)
thal = reversable_defect
| cp = typ_angina
| | chol <= 229: <50 (3.0)
| | chol > 229
| | | age <= 48: >50_1 (2.0)
| | | age > 48: <50 (3.0/1.0)
| cp = asympt
| | oldpeak <= 0.6
| | | restecg = left_vent_hyper: >50_1 (8.0/1.0)
| | | restecg = normal
| | | | trestbps <= 136
| | | | | ca <= 0: <50 (4.0)
| | | | | ca > 0
| | | | | | thalach <= 151: <50 (2.0)
| | | | | | thalach > 151: >50_1 (3.0)
| | | | trestbps > 136: >50_1 (4.0)
| | | restecg = st_t_wave_abnormality: >50_1 (0.0)
| | oldpeak > 0.6: >50_1 (57.39)
| cp = non_anginal
| | slope = up: <50 (7.39/1.0)
| | slope = flat
| | | ca <= 0
| | | | trestbps <= 122: <50 (3.0)
| | | | trestbps > 122: >50_1 (3.0)
| | | ca > 0: >50_1 (8.0/1.0)
| | slope = down: <50 (1.0)
| cp = atyp_angina
| | ca <= 0
| | | oldpeak <= 0.1: <50 (4.0)
| | | oldpeak > 0.1: >50_1 (2.75/0.75)
| | ca > 0: >50_1 (2.25/0.25)

Number of Leaves : 30

Size of the tree : 51


Time taken to build model: 0.02 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances 279 92.0792 %
Incorrectly Classified Instances 24 7.9208 %
Kappa statistic 0.8396
K&B Relative Info Score 21518.6331 %
K&B Information Score 232.2202 bits 0.7664 bits/instance
Class complexity | order 0 305.5409 bits 1.0084 bits/instance
Class complexity | scheme 99.2382 bits 0.3275 bits/instance
Complexity improvement (Sf) 206.3027 bits 0.6809 bits/instance
Mean absolute error 0.0532
Root mean squared error 0.1624
Relative absolute error 26.5595 %
Root relative squared error 51.542 %
Total Number of Instances 303

=== Detailed Accuracy By Class ===

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.952     0.116     0.908       0.952    0.929       0.952      <50
                 0.884     0.048     0.938       0.884    0.91        0.952      >50_1
                 0         0         0           0        0           ?          >50_2
                 0         0         0           0        0           ?          >50_3
                 0         0         0           0        0           ?          >50_4
Weighted Avg.    0.921     0.085     0.922       0.921    0.921       0.952

=== Confusion Matrix ===

a b c d e <-- classified as
157 8 0 0 0 | a = <50
16 122 0 0 0 | b = >50_1
0 0 0 0 0 | c = >50_2
0 0 0 0 0 | d = >50_3
0 0 0 0 0 | e = >50_4
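
As another sanity check, this small Python sketch (again my own illustration, with made-up helper names, not part of Weka) pulls the per-class precision and recall out of the training-set confusion matrix, and then compares the two accuracies reported in the runs above. The same tree looks a good deal better when it is scored on the data it was built from than under 10-fold cross-validation.

```python
# Confusion matrix copied from the "evaluate on training data" run.
# Rows are actual classes, columns are predicted classes.
train_matrix = [
    [157, 8],
    [16, 122],
]
labels = ["<50", ">50_1"]


def precision_recall(matrix, k):
    """Precision and recall for class index k."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column total
    actually_k = sum(matrix[k])                     # row total
    return true_positives / predicted_as_k, true_positives / actually_k


for k, label in enumerate(labels):
    precision, recall = precision_recall(train_matrix, k)
    print(f"{label:>6}: precision {precision:.3f}, recall {recall:.3f}")

# Correct counts taken from the two summaries: 235 (cross-validation)
# and 279 (training set), both out of 303 instances.
cv_accuracy = 235 / 303
train_accuracy = 279 / 303
gap_points = (train_accuracy - cv_accuracy) * 100
print(f"training accuracy is {gap_points:.2f} percentage points higher")
```

The precision and recall figures line up with the "Detailed Accuracy By Class" table, and the gap works out to about 14.5 percentage points, which is why I would lean on the cross-validation figure rather than the training-set one.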


This is what I am looking for with the final set. The bookmark recommendations are sketchy at best.
