K-NN_Demo
By calculating the distance between each sample in training set and the samples in testing set (samples to be classified or predicted so to speak), take the `K` training samples that is (are) closest to the sample that is to be classified, and to see which of the K samples occupies the class majority, the majority is then assumed to be the classification result. Basicly, it's a voting process.
Pros and Cons of K-nearest-neighbour algorithm
Pros
- Can work with large sample size classification problem.
- Good efficiency. (Relatively speaking)
- Algorithm's concept is easy to comprehend and then implement.
- It's a 'Incremental learning' model like Naive Bayesian. Does not retrain the model while input data grow.
Cons
- Every sample in training set requies global search, will effect efficiency when data size is huge (due to the amount of calculation).
- Misclassification happen when sample distribution is uneven. (Value majority more then minority)
- Needs more physical memory when data size increase.
UCI letter recognition data set
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16000 items and then use the resulting model to predict the letter category for the remaining 4000. See the article cited above for more details.
Attribute Information:
- lettr capital letter (26 values from A to Z)
- x-box horizontal position of box (integer)
- y-box vertical position of box (integer)
- width width of box (integer)
- high height of box (integer)
- onpix total # on pixels (integer)
- x-bar mean x of on pixels in box (integer)
- y-bar mean y of on pixels in box (integer)
- x2bar mean x variance (integer)
- y2bar mean y variance (integer)
- xybar mean x y correlation (integer)
- x2ybr mean of x * x * y (integer)
- xy2br mean of x * y * y (integer)
- x-ege mean edge count left to right (integer)
- xegvy correlation of x-ege with y (integer)
- y-ege mean edge count bottom to top (integer)
- yegvx correlation of y-ege with x (integer)
My Data Input
Training set contains 19900 samples and testing set contains 100 samples, this is not an optimal sampling solution but since I'm only demoing it, so what the heck.├── test100.txt
└── train19900.txt
- test100 contains 100 samples, acting as testing set.
- train19900 contains 19900 samples, acting as training set.
Compile and Run
$> gcc K-NN_Demo.c -o knn
$> ./knn
it will then prompt to input K
in K-NN algorithm: here I'm using 1
************************************************************************************************
* K-NN Demo using Data Set: UCI Letter Recognition *
* - [Yang (Simon) Guo] - *
************************************************************************************************
rows of training data set: 19900
rows of test data set: 100
input K in K-nearest-neighbour algorithm: 1
Results
-------------------TEST RESULT SUMMARY-------------------
right prediction: 97
wrong prediction: 3
accuracy: 97.000% [0.970000]
Check out this project on Github