[Machine Learning] Let's Build a Classifier for the IRIS Data (Hands-on Code)

J.S.Y 2022. 2. 8. 14:06


IRIS Classification

Today I am going to review my previous posts using the widely used IRIS dataset.

Data description

Iris flower data

The dataset consists of 150 samples in total.
With 4 features and 1 label, it has shape (150, 5).

Sepal Length	Length of the sepal (cm)
Sepal Width	Width of the sepal (cm)
Petal Length	Length of the petal (cm)
Petal Width	Width of the petal (cm)
Species	Species of the flower, one of three (Setosa / Versicolor / Virginica)

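The data can be downloaded as a CSV file, but it can also be loaded directly from Scikit-Learn's "sklearn" package.

For this exercise, 85% of the 150 samples are used for training and the remaining 15% are held out for validation (test), to measure the accuracy of the models I design.

I will use two models in total.

A plain 3-layer linear-transformation model
A model with Self-Attention applied

I will design and train both models and compare their performance using accuracy as the metric.
(Of course, Self-Attention will probably come out far ahead... haha)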

Importing the required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from tqdm import tqdm, notebook

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(torch.__version__)
print(torch.cuda.is_available())
print(DEVICE)

1.10.1
True
cuda

Looking at the data

from sklearn.datasets import load_iris

data = load_iris()
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

scikit-learn provides the data as a dictionary-like object; for training I use its 'data', 'target', and 'target_names' entries.

# Split into features and labels
X = data['data']
Y = data['target']
label_name = data['target_names']
n_classes = max(Y)+1

print(X[:5])
print(Y[:5])
print(label_name)
print(n_classes)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]
['setosa' 'versicolor' 'virginica']
3

I pulled out the features and labels and checked them.

# One-hot encode Y
n = np.unique(Y, axis=0).shape[0]
Y = np.eye(n)[Y]

print(Y.shape)

(150, 3)

The labels are one-hot encoded so that the model can tell the classes apart.
I used numpy's unique() and eye() functions for this.
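To see why `np.eye(n)[Y]` produces one-hot rows, here is a minimal sketch on a toy label array (the array `labels` is made up for illustration and is not the IRIS target). Indexing the identity matrix with integer labels picks one row per label, and `argmax` recovers the original labels.

```python
import numpy as np

# Toy labels for illustration only (not taken from the IRIS data)
labels = np.array([0, 2, 1, 2])

n = np.unique(labels).shape[0]   # number of distinct classes (3 here)
one_hot = np.eye(n)[labels]      # row `label` of the identity matrix, once per label

print(one_hot)

# Recover the integer labels from the one-hot rows
recovered = one_hot.argmax(axis=1)
print(recovered)
```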

# Split into train and valid datasets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(X, Y, stratify = Y, random_state = 17, test_size = 0.15)
print("Train Feature Shape : {}".format(train_x.shape))
print("Train Label Shape : {}".format(train_y.shape))
print("Valid Feature Shape : {}".format(valid_x.shape))
print("Valid Label Shape : {}".format(valid_y.shape))

Train Feature Shape : (127, 4)
Train Label Shape : (127, 3)
Valid Feature Shape : (23, 4)
Valid Label Shape : (23, 3)

I split the 150 samples 85:15 with the train_test_split() function from the sklearn package.
The resulting shapes are shown in the output above.
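The 127/23 split comes from how a fractional `test_size` gets rounded. As far as I understand, scikit-learn rounds the test count up and gives the remainder to the training set; the sketch below reproduces that arithmetic by hand instead of calling the library, so treat the rounding rule as my assumption.

```python
import math

n_samples = 150
test_size = 0.15

# Assumed rounding rule: test count rounded up, remainder goes to training
n_valid = math.ceil(n_samples * test_size)  # 22.5 rounded up
n_train = n_samples - n_valid

print(n_train, n_valid)
```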

# array to Tensor
train_x = torch.tensor(train_x)
train_y = torch.tensor(train_y)
valid_x = torch.tensor(valid_x)
valid_y = torch.tensor(valid_y)

print("Train Feature Shape : {}".format(train_x.shape))
print("Train Label Shape : {}".format(train_y.shape))
print("Valid Feature Shape : {}".format(valid_x.shape))
print("Valid Label Shape : {}".format(valid_y.shape))

Train Feature Shape : torch.Size([127, 4])
Train Label Shape : torch.Size([127, 3])
Valid Feature Shape : torch.Size([23, 4])
Valid Label Shape : torch.Size([23, 3])

In this exercise I will use Dataset and DataLoader, but the arrays can also be converted to Tensors as above and fed straight into the model.
(This is possible because the dataset is small.)

Defining the Dataset

class MyDataset(Dataset):
    def __init__(self, x_data, y_data):
        self.x = x_data
        self.y = y_data

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        xx = self.x[idx].float()
        yy = self.y[idx].float()
        return xx, yy

train = MyDataset(train_x, train_y)
valid = MyDataset(valid_x, valid_y)

train_loader = DataLoader(train, batch_size = 4, shuffle=True)
valid_loader = DataLoader(valid, batch_size = 2)

print(train_loader)
print(valid_loader)

<torch.utils.data.dataloader.DataLoader object at 0x0000020186E45AC8>
<torch.utils.data.dataloader.DataLoader object at 0x0000020186E45A48>

I defined the Dataset, created the train and valid instances, and passed them to DataLoader.
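With these batch sizes, the number of batches per epoch can be worked out by hand. A small sketch, assuming the default `drop_last=False` so that the final, smaller batch is kept:

```python
import math

n_train, n_valid = 127, 23
train_bs, valid_bs = 4, 2

train_batches = math.ceil(n_train / train_bs)   # batches per training epoch
valid_batches = math.ceil(n_valid / valid_bs)   # batches per validation pass

# Size of the final, incomplete batch in each loader
last_train_batch = n_train - (train_batches - 1) * train_bs
last_valid_batch = n_valid - (valid_batches - 1) * valid_bs

print(train_batches, last_train_batch)
print(valid_batches, last_valid_batch)
```

So `len(train_loader)` should be 32 with a last batch of 3 samples, and `len(valid_loader)` should be 12 with a last batch of a single sample.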

Defining the training & validation functions

loss_fn = nn.CrossEntropyLoss()

def calc_acc(X, Y):
    x_val, x_idx = torch.max(X, dim=1)
    y_val, y_idx = torch.max(Y, dim=1)
    return (x_idx == y_idx).sum().item()

def train(EPOCHS, model, train_loader, valid_loader, opt):
    train_loss_history = []
    valid_loss_history = []
    train_acc_history = []
    valid_acc_history = []
    for epoch in range(1, EPOCHS+1):
        model.train()
        train_acc = 0
        print("<<< EPOCH {} >>>".format(epoch))
        for batch_idx, (x,y) in enumerate(notebook.tqdm(train_loader)):
            x, y = x.to(DEVICE), y.to(DEVICE)

            output = model(x)                 # forward pass
            loss = loss_fn(output, y)         # compute the loss

            opt.zero_grad()                   # reset accumulated gradients
            loss.backward()                   # backpropagate the loss
            opt.step()                        # update the weights

            train_acc += calc_acc(output, y)
            if batch_idx % 10 == 0 and batch_idx != 0:
                print("Training : [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Acc : {:.3f}".format(
                    batch_idx * len(x), 
                    len(train_loader.dataset), 
                    100. * batch_idx / len(train_loader), 
                    loss.item(),
                    train_acc / len(train_loader.dataset)))
        print("\n{} Training : Loss: {:.6f}\t Acc : {:.3f}".format(
                    epoch,  
                    loss.item(),
                    train_acc / len(train_loader.dataset)))

        t_loss, t_acc = evaluate(model, valid_loader)
        print("{} Validation : Loss : {:.4f}\t Acc: {:.2f}%\n\n\n".format(epoch, t_loss, t_acc*100.))

        train_loss_history.append(loss.item())
        train_acc_history.append(train_acc / len(train_loader.dataset))

        valid_loss_history.append(t_loss.item())
        valid_acc_history.append(t_acc)

    return train_loss_history, train_acc_history, valid_loss_history, valid_acc_history

def evaluate(model, valid_loader):
    model.eval()
    t_loss = 0
    correct = 0

    with torch.no_grad():
        for x, y in notebook.tqdm(valid_loader):
            x, y = x.to(DEVICE), y.to(DEVICE)

            output = model(x)
            t_loss += loss_fn(output, y)

            correct += calc_acc(output, y)

    t_loss /= len(valid_loader)
    t_acc = correct / len(valid_loader.dataset)
    return t_loss, t_acc

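I defined the training function and the validation function (plus calc_acc, which counts correct predictions).
Print statements are included so that loss and acc can be checked during training.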
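Because the labels are one-hot vectors, `nn.CrossEntropyLoss` receives class probabilities rather than class indices. To my understanding this kind of target is supported from PyTorch 1.10 onward, and for pure one-hot targets it gives the same value as the index form. A minimal sketch with made-up logits:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# Made-up logits for two samples and three classes (illustration only)
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
idx_target = torch.tensor([0, 2])            # labels as class indices
one_hot_target = torch.eye(3)[idx_target]    # the same labels as one-hot rows

loss_idx = loss_fn(logits, idx_target)
loss_one_hot = loss_fn(logits, one_hot_target)
print(loss_idx.item(), loss_one_hot.item())

# Count correct predictions the same way calc_acc does: compare argmax positions
correct = (logits.argmax(dim=1) == one_hot_target.argmax(dim=1)).sum().item()
print(correct)
```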

Defining the models

I defined two models in total.

A simple linear-transformation model (3 layers)
A model with Self-Attention applied

Model 1

class MyModel1(nn.Module):
    def __init__(self):
        super(MyModel1, self).__init__()

        self.fc1 = nn.Linear(4, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, n_classes)

        self.act_fn = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.act_fn(x)

        x = self.fc2(x)
        x = self.act_fn(x)

        x = self.fc3(x)
        return x

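Model 1 has three linear layers.
It is a very simple stack of linear transformations.

(batch, 4) -> (batch, 16) -> (batch, 8) -> (batch, 3)
The input goes through three linear transformations in the order above.
ReLU is used as the activation function between them.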

Model 2

class MyModel2(nn.Module):
    def __init__(self):
        super(MyModel2, self).__init__()

        self.fc1 = nn.Linear(4, 16*4)

        self.Q = nn.Linear(16, 8)
        self.K = nn.Linear(16, 8)
        self.V = nn.Linear(16, 8)

        self.fc2 = nn.Linear(32, 8)
        self.fc3 = nn.Linear(8, n_classes)

        self.act_fn = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act_fn(x)

        x = x.view(-1, 4, 16)

        q = self.Q(x) # (batch, 4, 8)
        k = self.K(x) # (batch, 4, 8)
        v = self.V(x) # (batch, 4, 8)

        score = torch.matmul(q, torch.transpose(k, 1, 2)) # (batch, 4, 4)
        # Note: the division by sqrt(8) is applied after the softmax here
        score = self.softmax(score) / np.sqrt(8)         # (batch, 4, 4)

        z = torch.matmul(score, v) # (batch, 4, 8)
        z = z.view(-1, 4*8)

        x = self.fc2(z)
        x = self.act_fn(x)

        x = self.fc3(x)
        return x

Model 2 uses the Self-Attention technique.
Self-Attention is mainly used in the Transformer architecture, which I will cover in a later post.
It is more complex than Model 1.
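For reference, the textbook scaled dot-product attention divides the scores by sqrt(d_k) before the softmax, whereas the `forward` above divides after it. The sketch below shows the textbook form; it is not the code that produced the results in this post. The tensor sizes mirror Model 2, and the random input and the names `d_in`, `d_k`, `tokens` are my own for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, tokens, d_in, d_k = 2, 4, 16, 8
x = torch.randn(batch, tokens, d_in)   # stand-in for the reshaped hidden state

Q, K, V = nn.Linear(d_in, d_k), nn.Linear(d_in, d_k), nn.Linear(d_in, d_k)
q, k, v = Q(x), K(x), V(x)             # each (batch, 4, 8)

# Scale the raw scores first, then normalize each row with softmax
score = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(d_k)  # (batch, 4, 4)
weights = torch.softmax(score, dim=-1)                       # rows sum to 1
z = torch.matmul(weights, v)                                 # (batch, 4, 8)

print(weights.shape, z.shape)
```

Dividing after the softmax, as `MyModel2` does, multiplies every attention weight by the same constant, so each row sums to 1/sqrt(8) instead of 1.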

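Training the models and checking performance

Now it is time to train the models and compare their performance. Let's train the simple linear-transformation model first and see how it does.

Simple linear model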

model = MyModel1().to(DEVICE)
opt = optim.Adam(model.parameters())

print("Model :",model)
print("model's number of Parameters: ", sum([p.numel() for p in model.parameters()]))

Model : MyModel1(
  (fc1): Linear(in_features=4, out_features=16, bias=True)
  (fc2): Linear(in_features=16, out_features=8, bias=True)
  (fc3): Linear(in_features=8, out_features=3, bias=True)
  (act_fn): ReLU()
)
model's number of Parameters:  243

The model has 243 parameters (weights and biases).
It is trained for 10 epochs.

t_loss_his, t_acc_his, v_loss_his, v_acc_his = train(EPOCHS = 10, model = model, train_loader = train_loader, valid_loader = valid_loader, opt = opt)

<<< EPOCH 1 >>>
Training : [40/127 (31%)]    Loss: 1.046869     Acc : 0.189
Training : [80/127 (62%)]    Loss: 1.130223     Acc : 0.276
Training : [120/127 (94%)]    Loss: 1.117985     Acc : 0.370

1 Training : Loss: 1.069064     Acc : 0.378
1 Validation : Loss : 1.0696     Acc: 30.43%

<<< EPOCH 2 >>>
Training : [40/127 (31%)]    Loss: 1.101946     Acc : 0.094
Training : [80/127 (62%)]    Loss: 1.070963     Acc : 0.205
Training : [120/127 (94%)]    Loss: 0.989257     Acc : 0.323

2 Training : Loss: 0.992205     Acc : 0.339
2 Validation : Loss : 1.0526     Acc: 30.43%

. . . . .
(intermediate epochs omitted)
. . . . .

<<< EPOCH 9 >>>
Training : [40/127 (31%)]    Loss: 0.802042     Acc : 0.205
Training : [80/127 (62%)]    Loss: 0.774793     Acc : 0.457
Training : [120/127 (94%)]    Loss: 0.491868     Acc : 0.661

9 Training : Loss: 0.822680     Acc : 0.669
9 Validation : Loss : 0.6623     Acc: 65.22%

<<< EPOCH 10 >>>
Training : [40/127 (31%)]    Loss: 0.847679     Acc : 0.205
Training : [80/127 (62%)]    Loss: 0.639355     Acc : 0.417
Training : [120/127 (94%)]    Loss: 0.595597     Acc : 0.654

10 Training : Loss: 0.383359     Acc : 0.677
10 Validation : Loss : 0.5935     Acc: 65.22%

plt.plot(t_loss_his, label="train")
plt.plot(v_loss_his, label="valid")
plt.legend()
plt.title("Loss")
plt.show()

[Figure: Model 1 loss curves (train / valid)]

plt.plot(t_acc_his, label="train")
plt.plot(v_acc_his, label="valid")
plt.legend()
plt.title("Accuracy")
plt.show()

[Figure: Model 1 accuracy curves (train / valid)]

After the final epoch

Training : Loss: 0.383359 Acc : 0.677
Validation : Loss : 0.5935 Acc: 65.22%

Accuracy comes out at roughly 65-68%. That means it gets about 2 out of 3 right....
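The low mid-epoch `Acc` values in the log are expected: `train` divides the running count of correct predictions by the full training-set size, so the figure only reaches its real value at the end of the epoch. A quick check of the numbers printed at `batch_idx = 10`:

```python
n_train = 127
batch_size = 4
n_batches = 32          # ceil(127 / 4)
batch_idx = 10

seen_in_log = batch_idx * batch_size        # the "[40/127" shown in the log
progress = 100.0 * batch_idx / n_batches    # 31.25, printed as "31%"
processed = (batch_idx + 1) * batch_size    # samples actually processed so far
max_running_acc = processed / n_train       # upper bound on Acc at this point

print(seen_in_log, round(progress), round(max_running_acc, 3))
```

Even with every prediction correct, the running accuracy at that point cannot exceed about 0.346, which is why the first progress line of each epoch looks so low.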

Self-attention model

model = MyModel2().to(DEVICE)
opt = optim.Adam(model.parameters())

print("Model :",model)
print("model's number of Parameters: ", sum([p.numel() for p in model.parameters()]))

Model : MyModel2(
  (fc1): Linear(in_features=4, out_features=64, bias=True)
  (Q): Linear(in_features=16, out_features=8, bias=True)
  (K): Linear(in_features=16, out_features=8, bias=True)
  (V): Linear(in_features=16, out_features=8, bias=True)
  (fc2): Linear(in_features=32, out_features=8, bias=True)
  (fc3): Linear(in_features=8, out_features=3, bias=True)
  (act_fn): ReLU()
  (softmax): Softmax(dim=-1)
)
model's number of Parameters:  1019

It has 1019 parameters, roughly 4 times as many as Model 1 (243).
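Both parameter counts can be checked by hand, since an `nn.Linear(in, out)` layer with bias holds `in * out + out` values. The helper `linear_params` below is a name I made up for this illustration:

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Model 1: fc1, fc2, fc3
model1 = linear_params(4, 16) + linear_params(16, 8) + linear_params(8, 3)

# Model 2: fc1, then Q/K/V, then fc2 and fc3
model2 = (linear_params(4, 64)
          + 3 * linear_params(16, 8)
          + linear_params(32, 8)
          + linear_params(8, 3))

print(model1, model2, round(model2 / model1, 2))
```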

t_loss_his, t_acc_his, v_loss_his, v_acc_his = train(EPOCHS = 10, model = model, train_loader = train_loader, valid_loader = valid_loader, opt = opt)

<<< EPOCH 1 >>>
Training : [40/127 (31%)]    Loss: 1.113500     Acc : 0.102
Training : [80/127 (62%)]    Loss: 1.031586     Acc : 0.189
Training : [120/127 (94%)]    Loss: 1.065747     Acc : 0.323

1 Training : Loss: 0.859659     Acc : 0.339
1 Validation : Loss : 1.0244     Acc: 30.43%

<<< EPOCH 2 >>>
Training : [40/127 (31%)]    Loss: 1.127708     Acc : 0.142
Training : [80/127 (62%)]    Loss: 0.841290     Acc : 0.276
Training : [120/127 (94%)]    Loss: 1.086602     Acc : 0.472

2 Training : Loss: 0.887653     Acc : 0.496
2 Validation : Loss : 0.8946     Acc: 65.22%

. . . . .
(intermediate epochs omitted)
. . . . .

<<< EPOCH 9 >>>
Training : [40/127 (31%)]    Loss: 0.060780     Acc : 0.315
Training : [80/127 (62%)]    Loss: 0.543211     Acc : 0.614
Training : [120/127 (94%)]    Loss: 0.054991     Acc : 0.921

9 Training : Loss: 0.023854     Acc : 0.945
9 Validation : Loss : 0.0623     Acc: 100.00%

<<< EPOCH 10 >>>
Training : [40/127 (31%)]    Loss: 0.153605     Acc : 0.315
Training : [80/127 (62%)]    Loss: 0.285837     Acc : 0.598
Training : [120/127 (94%)]    Loss: 0.031138     Acc : 0.898

10 Training : Loss: 0.043692     Acc : 0.921
10 Validation : Loss : 0.0581     Acc: 100.00%

plt.plot(t_loss_his, label="train")
plt.plot(v_loss_his, label="valid")
plt.legend()
plt.show()

[Figure: Model 2 loss curves (train / valid)]

plt.plot(t_acc_his, label="train")
plt.plot(v_acc_his, label="valid")
plt.legend()
plt.show()

[Figure: Model 2 accuracy curves (train / valid)]

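Performance is far higher than Model 1.
By epoch 2 it already reaches a validation accuracy (65.22%) on par with Model 1's final result.

To explain briefly, Self-Attention learns on its own how much weight to give each feature and applies it.
In other words, it decides from the data itself which of the 4 features to attend to.
Fortunately, this worked well even with only 150 samples, and it seems to show in the results.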

That's it for today's post. In the next one I plan to explain CNNs.
The CNN material will be split into 2 parts (theory and practice).

The full code is available at the link below.
https://github.com/JoSangYeon/Machine_Learning_Project/blob/master/IRIS_Classification/IRIS_Classification.ipynb
