
[Python] K-NN

Yoppie 2023. 12. 19. 11:02

The K-Nearest Neighbor (K-NN) classification algorithm is a model that, given a new data point, infers its class from the k nearest data points (typically by majority vote).
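As a minimal sketch of the idea on made-up toy points (not the breast-cancer data used below):

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Two small 2-D clusters with known classes
X = np.array([[1, 1], [1, 2], [2, 1],    # class 0
              [6, 6], [6, 7], [7, 6]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# With k=3, a new point gets the majority class of its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 2], [6.5, 6.5]]))  # -> [0 1]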

 

Breast Cancer Data

In [1]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
In [2]:
cancer = pd.read_csv(r'D:\Backup\바탕 화면\wisc_bc_data.csv')

del cancer['id']  # drop the id column, which carries no predictive information
In [3]:
result = []
accuracy = []

X = cancer.loc[:, cancer.columns != 'diagnosis']
y = cancer['diagnosis']

k_range = range(1, 11)
for k in k_range:
    result.append(k)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    # Note: this scores on the training data itself, so k=1 is trivially 1.0
    score = knn.score(X, y)
    print('k is %d, Accuracy is %f' % (k, score))
    accuracy.append(score)
    
def draw(x, y, title='K value for kNN'):
    plt.plot(x, y, label='k value')
    plt.title(title)
    plt.xlabel('k')
    plt.ylabel('Accuracy')
    plt.grid(True)
    plt.legend(loc='best', framealpha=0.5, prop={'size':'small'})
    plt.tight_layout(pad=1)
    plt.gcf().set_size_inches(8,4)
    plt.show()

draw(result, accuracy, 'kNN')
k is 1, Accuracy is 1.000000
k is 2, Accuracy is 0.947276
k is 3, Accuracy is 0.956063
k is 4, Accuracy is 0.949033
k is 5, Accuracy is 0.947276
k is 6, Accuracy is 0.947276
k is 7, Accuracy is 0.943761
k is 8, Accuracy is 0.940246
k is 9, Accuracy is 0.942004
k is 10, Accuracy is 0.938489
  • The training accuracy peaks at k = 3 (excluding k = 1, which is trivially 1.0 because the model is scored on its own training data), so k = 3 is taken as the optimal k-NN hyperparameter; see the cross-validation sketch below for a more robust way to choose k.
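Since the scores above come from the training data itself, a more reliable choice of k would use held-out data; a sketch using 5-fold cross-validation (assuming the same cancer frame with the id column dropped):

from sklearn.model_selection import cross_val_score

X = cancer.loc[:, cancer.columns != 'diagnosis']
y = cancer['diagnosis']

# Mean 5-fold cross-validated accuracy for each candidate k
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    print('k is %d, CV accuracy is %f' % (k, cross_val_score(knn, X, y, cv=5).mean()))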
 
In [4]:
from sklearn.metrics import accuracy_score

X = cancer.loc[:, cancer.columns != 'diagnosis']
y = cancer['diagnosis']

Accuracy = []

# Repeat 100 random 80/20 splits to see how the test accuracy varies at k = 3
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    Accuracy.append(accuracy_score(y_test, y_pred))
In [5]:
plt.hist(Accuracy)
Out[5]:
(array([ 3.,  2.,  7., 26., 14., 12., 27.,  6.,  1.,  2.]),
 array([0.86842105, 0.87982456, 0.89122807, 0.90263158, 0.91403509,
        0.9254386 , 0.93684211, 0.94824561, 0.95964912, 0.97105263,
        0.98245614]),
 <a list of 10 Patch objects>)
In [6]:
ax = sns.violinplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[6]:
[(0, 1.05)]
In [7]:
ax = sns.boxplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[7]:
[(0, 1.05)]
  • Looking at the distribution, it is bimodal with many values clustered around 0.92, and there are no outliers.
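A quick numeric summary of the same 100 accuracies (a small sketch reusing the Accuracy list from above):

print('mean %.3f, std %.3f, min %.3f, max %.3f' %
      (np.mean(Accuracy), np.std(Accuracy), np.min(Accuracy), np.max(Accuracy)))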
In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

cancer = pd.read_csv(r'D:\Backup\바탕 화면\wisc_bc_data.csv')
# Note: unlike In [2], the id column is not dropped this time

X = cancer.loc[:, cancer.columns != 'diagnosis']
y = cancer['diagnosis']


Accuracy=[]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 

logit = LogisticRegression()
logit.fit(X_train, y_train)

print(logit.coef_)
print(logit.intercept_)
[[-6.84573971e-10 -8.74980516e-16 -2.28407647e-15 -4.93652136e-15
   3.61357716e-14 -1.35260963e-17  5.80930211e-19  1.71260542e-17
   9.24055907e-18 -2.58844919e-17 -1.08308542e-17  2.12237622e-17
  -2.19251182e-16  1.51224708e-16  7.66234897e-15 -1.34269082e-18
  -1.14677428e-18 -1.07636739e-18 -5.34266433e-19 -3.62715204e-18
  -4.94589517e-19 -5.68383126e-16 -2.81195853e-15 -2.84528248e-15
   9.44573724e-14 -1.72061844e-17  1.14462145e-17  3.26742762e-17
   1.06599824e-17 -3.50722302e-17 -1.08889991e-17]]
[-1.69965859e-16]
In [9]:
y_pred = logit.predict(X_test)
print(y_test)
print(y_pred)
419    M
507    B
193    M
110    B
217    B
      ..
174    M
169    B
360    M
196    M
374    B
Name: diagnosis, Length: 114, dtype: object
['B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B'
 'B' 'B' 'B' 'B' 'B' 'B']
In [10]:
print(accuracy_score(y_test, y_pred))
0.543859649122807
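The near-zero coefficients and the all-'B' predictions suggest the solver effectively degenerated to predicting the majority class on these unscaled features (the reloaded frame also still contains the id column). A sketch of one common remedy, standardizing the features in a pipeline; this is an assumption about what the fit needs, not what was run above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean / unit variance before the logistic fit
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logit.fit(X_train, y_train)
print(accuracy_score(y_test, scaled_logit.predict(X_test)))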
In [11]:
# Note: the reloaded cancer frame still contains the id column
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    Accuracy.append(accuracy_score(y_test, y_pred))
In [12]:
plt.hist(Accuracy)
Out[12]:
(array([ 2.,  4.,  3., 13., 16., 21., 17., 12., 10.,  2.]),
 array([0.6754386 , 0.69298246, 0.71052632, 0.72807018, 0.74561404,
        0.76315789, 0.78070175, 0.79824561, 0.81578947, 0.83333333,
        0.85087719]),
 <a list of 10 Patch objects>)
In [13]:
ax = sns.violinplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[13]:
[(0, 1.05)]
In [14]:
ax = sns.boxplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[14]:
[(0, 1.05)]
  • Compared with the previous graph, the distribution is somewhat wider and unimodal, with values clustered around 0.77. (The drop is most likely because the reloaded data still contains the id column, which distorts the distance computation; see the sketch below.)
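If the id column is the culprit, re-dropping it before the loop should restore the earlier accuracy; a hypothetical fix, not run above:

cancer = pd.read_csv(r'D:\Backup\바탕 화면\wisc_bc_data.csv')
del cancer['id']  # drop the id column again so distances are meaningful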

 


Wine Data

 
In [2]:
wine = pd.read_csv(r'D:\Backup\바탕 화면\wine.csv')
In [3]:
result = []
accuracy = []

X = wine.loc[:, wine.columns != 'Cultivar']
y = wine['Cultivar']

k_range = range(1, 11)
for k in k_range:
    result.append(k)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    # Again, this is accuracy on the training data itself, so k=1 is trivially 1.0
    score = knn.score(X, y)
    print('k is %d, Accuracy is %f' % (k, score))
    accuracy.append(score)

def draw(x, y, title='K value for kNN'):
    plt.plot(x, y, label='k value')
    plt.title(title)
    plt.xlabel('k')
    plt.ylabel('Accuracy')
    plt.grid(True)
    plt.legend(loc='best', framealpha=0.5, prop={'size':'small'})
    plt.tight_layout(pad=1)
    plt.gcf().set_size_inches(8,4)
    plt.show()

draw(result, accuracy, 'kNN')
k is 1, Accuracy is 1.000000
k is 2, Accuracy is 0.876404
k is 3, Accuracy is 0.870787
k is 4, Accuracy is 0.825843
k is 5, Accuracy is 0.786517
k is 6, Accuracy is 0.775281
k is 7, Accuracy is 0.747191
k is 8, Accuracy is 0.775281
k is 9, Accuracy is 0.775281
k is 10, Accuracy is 0.792135
  • The training accuracy peaks at k = 2 (again excluding the trivial k = 1), so k = 2 is taken as the optimal k-NN hyperparameter; see the grid-search sketch below.
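As with the cancer data, these scores come from the training set; a sketch that instead searches k on cross-validated accuracy with GridSearchCV (assuming the same wine frame):

from sklearn.model_selection import GridSearchCV

X = wine.loc[:, wine.columns != 'Cultivar']
y = wine['Cultivar']

# 5-fold cross-validated grid search over k = 1..10
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': range(1, 11)}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)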
 
In [4]:
from sklearn.metrics import accuracy_score

X = wine.loc[:, wine.columns != 'Cultivar']
y = wine['Cultivar']

Accuracy = []

# Repeat 100 random 80/20 splits with the chosen k = 2
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    knn = KNeighborsClassifier(n_neighbors=2)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    Accuracy.append(accuracy_score(y_test, y_pred))
In [5]:
plt.hist(Accuracy)
Out[5]:
(array([ 3.,  3.,  5., 13., 28., 14., 17.,  7.,  8.,  2.]),
 array([0.5       , 0.53333333, 0.56666667, 0.6       , 0.63333333,
        0.66666667, 0.7       , 0.73333333, 0.76666667, 0.8       ,
        0.83333333]),
 <a list of 10 Patch objects>)
In [6]:
ax = sns.violinplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[6]:
[(0, 1.05)]
In [8]:
ax = sns.boxplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[8]:
[(0, 1.05)]
  • Looking at the distribution, many values cluster around 0.65.
 
In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

wine = pd.read_csv(r'D:\Backup\바탕 화면\wine.csv')

X = wine.loc[:, wine.columns != 'Cultivar']
y = wine['Cultivar']

Accuracy = []

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

logit = LogisticRegression()
logit.fit(X_train, y_train)

print(logit.coef_)
print(logit.intercept_)
[[-0.13836717  0.18750986  0.1183952  -0.21702569 -0.03848878  0.19306668
   0.50283914 -0.04057603  0.15091832 -0.07847443 -0.00338881  0.34536572
   0.01038515]
 [ 0.5264141  -0.71594001 -0.09558026  0.1919263   0.01110074  0.33311383
   0.47136046  0.05460182  0.28285192 -1.05768273  0.2530053   0.43859586
  -0.01167248]
 [-0.38804693  0.52843015 -0.02281494  0.02509939  0.02738805 -0.52618052
  -0.9741996  -0.01402579 -0.43377024  1.13615716 -0.24961649 -0.78396159
   0.00128733]]
[-0.03385263  0.08774793 -0.05389529]
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
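The warning itself suggests either increasing max_iter or scaling the features (as sketched earlier for the cancer data); a minimal version of the first option:

# Give the lbfgs solver more iterations to converge on the unscaled features
logit = LogisticRegression(max_iter=10000)
logit.fit(X_train, y_train)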
In [10]:
y_pred = logit.predict(X_test)
print(y_test)
print(y_pred)
153    3
152    3
81     2
103    2
17     1
26     1
53     1
170    3
165    3
158    3
116    2
137    3
70     2
45     1
162    3
161    3
84     2
115    2
6      1
102    2
135    3
7      1
100    2
117    2
93     2
15     1
131    3
78     2
42     1
69     2
24     1
154    3
177    3
151    3
74     2
13     1
Name: Cultivar, dtype: int64
[3 3 1 2 1 1 1 3 3 3 2 3 1 1 3 3 2 2 1 2 3 1 2 2 2 1 3 2 1 2 1 3 3 3 1 1]
In [11]:
print(accuracy_score(y_test, y_pred))
0.9166666666666666
In [12]:
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    knn = KNeighborsClassifier(n_neighbors=2)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    Accuracy.append(accuracy_score(y_test, y_pred))
In [13]:
plt.hist(Accuracy)
Out[13]:
(array([ 2.,  3.,  8., 33., 15., 17., 18.,  3.,  0.,  1.]),
 array([0.5       , 0.53611111, 0.57222222, 0.60833333, 0.64444444,
        0.68055556, 0.71666667, 0.75277778, 0.78888889, 0.825     ,
        0.86111111]),
 <a list of 10 Patch objects>)
In [14]:
ax = sns.violinplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[14]:
[(0, 1.05)]
In [15]:
ax = sns.boxplot(Accuracy)
ax.set(xlim=(0, 1.05))
Out[15]:
[(0, 1.05)]
  • Compared with the previous histogram, the values cluster around 0.63 and there is one outlier.

 
