K-Nearest Neighbors (KNN) Tutorial
[ 2009/10/30 16:16:00 | By: 梦翔儿 ]
 

This tutorial is an introduction to an instance-based learning method called the K-Nearest Neighbor (KNN) algorithm. KNN is a supervised learning method that has been used in many applications in data mining, statistical pattern recognition, image processing, and other fields. Successful applications include recognition of handwriting, satellite images, and EKG patterns. Instead of using sophisticated software or any programming language, I will use only the spreadsheet functions of Microsoft Excel, without any macros. You can download the spreadsheet companion of this tutorial for free.

First, you will learn KNN for classification; then we will extend the same method to smoothing and prediction for time series data.

Topics of this tutorial (click any of them to enter):

What is the K-Nearest Neighbor (KNN) Algorithm?

How Does the K-Nearest Neighbor (KNN) Algorithm Work?

Numerical Example (hand computation)

KNN for Smoothing and Prediction

How do we use the spreadsheet for KNN?

Strengths and Weaknesses of the K-Nearest Neighbor Algorithm

Resources for K Nearest Neighbors Algorithm


From: http://people.revoledu.com/kardi/tutorial/KNN/index.html

================

What is the K-Nearest Neighbor (KNN) Algorithm?

Let us start with the K-nearest neighbor algorithm for classification. K-nearest neighbor is a supervised learning algorithm in which a new query instance is classified based on the majority category of its K nearest neighbors. The purpose of the algorithm is to classify a new object based on its attributes and a set of training samples. The classifier does not fit any model; it works purely from memory. Given a query point, we find the K objects (training points) closest to it and classify by majority vote among the categories of those K objects, breaking any ties at random. In other words, the KNN algorithm uses the neighborhood's classification as the prediction value for the new query instance.

For example

We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/m²) | Y = Classification
------------------------------ | ---------------------- | ------------------
7                              | 7                      | Bad
7                              | 4                      | Bad
3                              | 4                      | Good
1                              | 4                      | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess the classification of this new tissue? Fortunately, the k-nearest neighbor (KNN) algorithm can help us predict this type of problem.

------------------------------------------------

How Does the K-Nearest Neighbor (KNN) Algorithm Work?

The K-nearest neighbor algorithm is very simple. It determines the K nearest neighbors by minimum distance from the query instance to the training samples. After we gather the K nearest neighbors, we take the simple majority of their categories as the prediction for the query instance.

The data for the KNN algorithm consist of several multivariate attributes Xi that will be used to classify Y. The data for KNN can be on any measurement scale, from ordinal and nominal to quantitative, but for the moment let us deal only with quantitative Xi and a binary (nominal) Y. Later in this section, I will explain how to deal with other types of measurement scale.

Suppose we have the following data, where each training sample has two quantitative attributes and a binary category ('+' or '-'), and the last row is the query instance that we want to predict. (The data table and the graph of this problem are shown in the companion spreadsheet.)

Suppose we determine K = 8 (we will use the 8 nearest neighbors) as the parameter of this algorithm. Then we calculate the distance between the query instance and all the training samples. Because we use only quantitative attributes, we can use Euclidean distance. Suppose the query instance has coordinates (a, b) and a training sample has coordinates (c, d); then the squared Euclidean distance is D² = (a - c)² + (b - d)². See the Euclidean distance tutorial when you have more than two variables. If the data contain categorical variables on a nominal or ordinal measurement scale, see the similarity tutorial on how to compute a weighted distance over the multivariate variables.
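
As a minimal sketch of this formula in Python (the tutorial itself uses only Excel; the function name squared_distance is illustrative):

    # Squared Euclidean distance between two points of equal dimension
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    squared_distance((3, 7), (7, 7))  # (3 - 7)^2 + (7 - 7)^2 = 16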

The next step is to find the K nearest neighbors. We include a training sample among the nearest neighbors if its distance to the query instance is less than or equal to the K-th smallest distance. In other words, we sort the distances of all training samples to the query instance and determine the K-th minimum distance.

If the distance of a training sample is within the K-th minimum, we gather the category of that nearest neighbor. In MS Excel, we can use the function =SMALL(array, K) to determine the K-th minimum value in an array. A special case happens in our example: the 3rd through the 8th minimum distances happen to be equal. In this case we directly use the highest, K = 8, because choosing arbitrarily among the 3rd through 8th nearest neighbors would be unstable.
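
For instance, under a hypothetical layout where the squared distances sit in cells D2:D5 and K = 3 (these cell references are illustrative, not necessarily the companion spreadsheet's actual layout), this step can be written as:

    =SMALL($D$2:$D$5, 3)                          the 3rd smallest (K-th minimum) distance
    =IF(D2 <= SMALL($D$2:$D$5, 3), "Yes", "No")   flags row 2 as one of the 3 nearest neighbors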

The KNN prediction of the query instance is based on the simple majority of the categories of the nearest neighbors. In our example, the category is binary, so the majority can be taken by simply counting the number of '+' and '-' signs. If the number of pluses is greater than the number of minuses, we predict the query instance as plus, and vice versa. If the number of pluses equals the number of minuses, we can choose arbitrarily, or settle on one of plus or minus.

If the Y values of your training samples are categorical, take the simple majority among them. If Y is quantitative, take the average or another measure of central tendency, such as the median or geometric mean.

--------------------

KNN Numerical Example (hand computation)

Here, step by step, is how to compute the K-nearest neighbors (KNN) algorithm:

  1. Determine parameter K = number of nearest neighbors
  2. Calculate the distance between the query-instance and all the training samples
  3. Sort the distance and determine nearest neighbors based on the K-th minimum distance
  4. Gather the category of the nearest neighbors
  5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance
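
To make these five steps concrete, here is a minimal sketch in Python (the tutorial itself uses only Excel; the function and variable names are illustrative):

    from collections import Counter

    def knn_classify(query, samples, k):
        # samples: list of ((x1, x2, ...), label) pairs
        # Step 2: squared Euclidean distance from the query to every sample
        scored = [(sum((a - b) ** 2 for a, b in zip(query, x)), label)
                  for x, label in samples]
        # Step 3: sort by distance and keep the K nearest neighbors
        scored.sort(key=lambda pair: pair[0])
        neighbors = scored[:k]
        # Steps 4-5: gather the categories and take the simple majority
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Training samples from the paper-tissue example
    samples = [((7, 7), "Bad"), ((7, 4), "Bad"),
               ((3, 4), "Good"), ((1, 4), "Good")]
    print(knn_classify((3, 7), samples, k=3))  # prints: Good

With K = 3 this reproduces the hand computation that follows: the three nearest neighbors are (3, 4), (1, 4), and (7, 7), and the majority vote is Good.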

We will again use the previous example to calculate KNN by hand. If you want to download the MS Excel companion of this tutorial, click here.

Example

We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/m²) | Y = Classification
------------------------------ | ---------------------- | ------------------
7                              | 7                      | Bad
7                              | 4                      | Bad
3                              | 4                      | Good
1                              | 4                      | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess the classification of this new tissue?

1. Determine parameter K = number of nearest neighbors

Suppose we use K = 3.

2. Calculate the distance between the query-instance and all the training samples

The coordinate of the query instance is (3, 7). Instead of calculating the distance, we compute the squared distance, which is faster to calculate (no square root needed).

X1 = Acid Durability (seconds) | X2 = Strength (kg/m²) | Squared distance to query instance (3, 7)
------------------------------ | ---------------------- | -----------------------------------------
7                              | 7                      | (7 - 3)² + (7 - 7)² = 16
7                              | 4                      | (7 - 3)² + (4 - 7)² = 25
3                              | 4                      | (3 - 3)² + (4 - 7)² = 9
1                              | 4                      | (1 - 3)² + (4 - 7)² = 13

 

3. Sort the distance and determine nearest neighbors based on the K-th minimum distance

X1 = Acid Durability (seconds) | X2 = Strength (kg/m²) | Squared distance to query (3, 7) | Rank of minimum distance | Included in 3-nearest neighbors?
------------------------------ | ---------------------- | -------------------------------- | ------------------------ | --------------------------------
7                              | 7                      | 16                               | 3                        | Yes
7                              | 4                      | 25                               | 4                        | No
3                              | 4                      | 9                                | 1                        | Yes
1                              | 4                      | 13                               | 2                        | Yes

4. Gather the categories of the nearest neighbors. Notice in the second row, last column, that the category of the nearest neighbor (Y) is not included, because the rank of that sample is greater than 3 (= K).

X1 = Acid Durability (seconds) | X2 = Strength (kg/m²) | Squared distance to query (3, 7) | Rank of minimum distance | Included in 3-nearest neighbors? | Y = Category of nearest neighbor
------------------------------ | ---------------------- | -------------------------------- | ------------------------ | -------------------------------- | --------------------------------
7                              | 7                      | 16                               | 3                        | Yes                              | Bad
7                              | 4                      | 25                               | 4                        | No                               | -
3                              | 4                      | 9                                | 1                        | Yes                              | Good
1                              | 4                      | 13                               | 2                        | Yes                              | Good

 

5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance

We have 2 Good and 1 Bad. Since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 is included in the Good category.

---------------------------------

KNN for Smoothing and Prediction

Using the same principle, we can extend the K-Nearest Neighbor (KNN) algorithm to smoothing (interpolation) and prediction (extrapolation) of quantitative data (e.g., time series). In classification, the dependent variable Y is categorical; in this section, the dependent variable takes quantitative values.

Here, step by step, is how to compute the KNN algorithm for quantitative data:

  1. Determine parameter K = number of nearest neighbors
  2. Calculate the distance between the query-instance and all the training samples
  3. Sort the distance and determine nearest neighbors based on the K-th minimum distance
  4. Gather the Y values of the nearest neighbors
  5. Use the average of the Y values of the nearest neighbors as the prediction value of the query instance
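
Here is the same kind of sketch adapted to quantitative Y, again an illustrative Python version of the steps rather than the spreadsheet's actual formulas:

    def knn_predict(query_x, samples, k):
        # samples: list of (x, y) pairs with scalar x and quantitative y
        # Steps 2-3: one-dimensional distance is the absolute difference;
        # sorting by it puts the K nearest neighbors first
        nearest = sorted(samples, key=lambda xy: abs(xy[0] - query_x))[:k]
        # Steps 4-5: average the Y values of the K nearest neighbors
        return sum(y for _, y in nearest) / k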

Example (KNN for Extrapolation)

We have 5 data pairs (X, Y), quantitative in nature, tabulated in the companion spreadsheet. Suppose the data are sorted as a time series. The problem is then to estimate the value of Y at X = 6.5 using the K-Nearest Neighbor (KNN) algorithm.

1. Determine parameter K = number of nearest neighbors

Suppose we use K = 2.


 

2. Calculate the distance between the query-instance and all the training samples

The coordinate of the query instance is 6.5. As we are dealing with a one-dimensional distance, we simply take the absolute value of the difference between the query instance and each value of X.

For instance, for X = 5.1 the distance is |6.5 - 5.1| = 1.4; for X = 1.2 the distance is |6.5 - 1.2| = 5.3; and so on.


 

3. Sort the distance and determine nearest neighbors based on the K-th minimum distance

As the data are already sorted, the nearest neighbors are the last K data points.


 

4. Gather the Y values of the nearest neighbors

We simply copy the Y values of the last K = 2 data points. The result is tabulated in the companion spreadsheet.


 

5. Use average of nearest neighbors as the prediction value of the query instance

In this case, the prediction value is the average of the Y values of these two nearest neighbors.

You can play around with different data and values of K if you download the spreadsheet of this example here.

Example (KNN for Interpolation)

Using the same training data and the same technique, we can also use KNN for smoothing (interpolation between values). Our data are the same five (X, Y) pairs as before.

Suppose we know the X data lie between 0 and 6, and we would like to compute the values of Y between them.

  1. We define dx = 0.1 and set the value of x from 0 to 6 in increments of dx
  2. Compute the distance between x (treated as the query instance) and each of the data points X

For instance, the distance between query instance x = 0.1 and X2 = 1.2 is d(x, X2) = |0.1 - 1.2| = 1.1. Similarly, the distance between query instance x = 0.5 and X5 = 5.1 is d(x, X5) = |0.5 - 5.1| = 4.6. The companion spreadsheet tabulates the distances for x = 0 to 0.5 against all X data.

  3. We obtain the nearest neighbors based on the K-th minimum distance and copy their Y values
  4. The smoothing estimate is the arithmetic average of the Y values of the nearest neighbors

The companion spreadsheet shows an example of the KNN smoothing computation for x = 2.5 to 3.5.

Playing around with the value of K gives the graphs shown in the spreadsheet (download the spreadsheet of this example here). In general, the plot of KNN smoothing has many discontinuities. For K = 1, the smoothing line passes through all the data points, so the sum of squared errors is zero, but the plot is the roughest. For K = 5 (all the data points), we get a single horizontal line, the average of all the data. Between the two extremes, we can adjust the value of K as a parameter to tune the smoothing. Among K = 2, K = 3, and K = 4, we find that K = 4 has the smallest sum of squared errors (SSE).
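
Here is a sketch of the smoothing sweep and the SSE comparison, assuming the knn_predict function from the extrapolation sketch above (the tutorial's actual data values are in the spreadsheet):

    # Smooth over the grid x = lo, lo + dx, ..., hi with K nearest neighbors
    def knn_smooth(samples, k, lo=0.0, hi=6.0, dx=0.1):
        n = int(round((hi - lo) / dx))
        return [(lo + i * dx, knn_predict(lo + i * dx, samples, k))
                for i in range(n + 1)]

    # Sum of squared errors of the smoothed curve at the training points;
    # for K = 1 each training point predicts itself, so the SSE is zero
    def sse(samples, k):
        return sum((y - knn_predict(x, samples, k)) ** 2 for x, y in samples)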

I have demonstrated through several examples how we can use the simple KNN algorithm for classification, interpolation, and extrapolation.

-----------------

How do we use the spreadsheet for KNN?

You can download the spreadsheet companion of this tutorial here.

The spreadsheet does not contain any macros. The KNN computation uses only simple MS Excel functions:

  • SMALL – returns the k-th smallest value of the array input
  • COUNTIF – counts the number of cells that pass a simple criterion
  • RANDBETWEEN – generates a random integer between two values

You can change the value of K (the number of neighbors) or the data to see the effect on the prediction. Alternatively, you can just press F9 and the spreadsheet will automatically generate new data.

Yellow cells are for input; cells of other colors are results of computation.
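
For instance, assuming a hypothetical layout with squared distances in D2:D5 and neighbor categories in F2:F5 (the actual layout of the companion spreadsheet may differ), the three functions are used along these lines:

    =SMALL($D$2:$D$5, 3)          the K-th (here 3rd) minimum distance
    =COUNTIF($F$2:$F$5, "Good")   how many nearest neighbors vote "Good"
    =RANDBETWEEN(1, 10)           a random integer attribute value; pressing F9 regenerates it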

--------------

Strengths and Weaknesses of the K-Nearest Neighbor Algorithm

Advantages

  • Robust to noisy training data (especially if we weight votes by the inverse square of the distance)
  • Effective if the training data is large

Disadvantages

  • We need to determine the value of the parameter K (the number of nearest neighbors)
  • Distance-based learning does not make clear which type of distance to use, or which attributes to use, to produce the best results. Should we use all attributes or only certain ones?
  • The computational cost is quite high, because we must compute the distance from each query instance to all training samples. Some indexing (e.g., a K-D tree) may reduce this cost

----------------

Resources for K Nearest Neighbors Algorithm

Aside from my tutorial on K nearest neighbors, there are many good Internet resources that discuss the K-Nearest Neighbor (KNN) algorithm; a selected list appears on the original tutorial page. I welcome feedback and suggestions of other good resources to include in the list; please tell me about them.
