머신러닝 기법과 R 프로그래밍 2: ⅩⅥ. 딥러닝과 텍스트 마이닝

2020-12-24

2021-02-01

Postech, 딥러닝, 텍스트마이닝

POSTECH에서 제공하는 MOOC 중, 머신러닝기법과 R프로그래밍 Ⅱ 과정입니다.

ⅩⅥ. 딥러닝과 텍스트 마이닝

1. Neural Networks

Concepts

인공신경망은 기계학습(Machine Learning)의 통계적 학습 알고리즘 중 하나
컴퓨터 비전, 자연어 처리, 음석 인식 등 영역에서 사용
AI ⊂ Machine Learning ⊂ Neural Network

신경망 모델(Neural Network)
- Perceptron을 한 단위로 하는 네트워크를 구축하여, 인간의 신경세포(Neuron)와 유사한 기능을 하도록 제안

Perceptron - Single Layer

하나의 Perceptron은 단순히 다수의 입력과 가중치의 선형 결합을 계산하는 역할을 수행
Activation 함수에 따라 선형결합으로 생성되는 출력값이 결정
- Binary Threshold
- Sigmoid
- ReLU(Rectified Linear Unit)
- tanh

Multi-layer perceptron

Perceptron으로 구성된 Single Layer들이 Multi-layer를 만듦
- Input layer와 Output layer 사이에는 Hidden layer가 존재하여 Non-linear transformation 수행
Output layer에서 Softmax 함수를 통해 가장 큰 값을 손쉽게 알 수 있음
- Exponential 함수로 항상 양수 결과치가 도출되고 이를 통해 확률값을 도출

Neural Network 수행

mxnet 패키지는 R4.0 버전 이후부터 실행 불가

R 3.6.2버전 설치해서 실행 필요

# require mxnet
install.packages("https://github.com/jeremiedb/mxnet_winbin/raw/master/mxnet.zip",repos = NULL)
library(mxnet)

# If you have Error message "no package called XML or DiagrmmeR", then install
install.packages("XML")
install.packages("DiagrammeR")
#library(XML)
#library(DiagrammeR)

#devtools::install_github("rich-iannone/DiagrammeR")

#install.packages('devtools')
#library(devtools)

데이터 불러오기

# read csv file
iris<-read.csv("data/week16_1/iris.csv")
attach(iris)

# change to numeric from character variable : Species
iris[,5] = as.numeric(iris[,5])-1
table(Species)
# check the data
head(iris, n=10)

학습데이터와 검증데이터

# split train & test dataset
# training (n=100)/ test data(n=50) 
set.seed(1000,sample.kind="Rounding")
N<-nrow(iris)
tr.idx<-sample(1:N, size=N*2/3, replace=FALSE)

# split train data and test data
train<-data.matrix(iris[tr.idx,])
test<-data.matrix(iris[-tr.idx,])

# feature & Labels
# 객체별 Feature와 Label로 분리
# Label은 5번째 열에 위치
train_feature<-train[,-5]
trainLabels<-train[,5]
test_feature<-test[,-5]
testLabels <-test[,5]

Hidden Layer 구성

# Build nn model
# first layers
require(mxnet)
my_input = mx.symbol.Variable('data')
fc1 = mx.symbol.FullyConnected(data=my_input, num.hidden = 200, name='fc1') # 200개의 뉴런 형성
relu1 = mx.symbol.Activation(data=fc1, act.type='relu', name='relu1')

1
2
3

# second layers
fc2 = mx.symbol.FullyConnected(data=relu1, num.hidden = 100, name='fc2') # 100개의 뉴런 형성
relu2 = mx.symbol.Activation(data=fc2, act.type='relu', name='relu2')

Output Layer 구성

# third layers
fc3 = mx.symbol.FullyConnected(data=relu2, num.hidden = 3, name='fc3') # 3개로 분류(0,1,2)해야 하므로 3개의 Output 뉴런 생성

# softmax
softmax = mx.symbol.SoftmaxOutput(data=fc3, name='sm')
# sofrmax 결과를 통해 가장 큰 값 선택

모델 학습

# training
mx.set.seed(123, sample.kind="Rounding")
device <- mx.cpu()
model <- mx.model.FeedForward.create(softmax,
                                     optimizer = "sgd", # Stochastic Gradient Descent
                                     array.batch.size=10, # 총 10개 그룹
                                     num.round = 500, learning.rate=0.1,
                                     X=train_feature, y=trainLabels, ctx=device,
                                     eval.metric = mx.metric.accuracy,
                                     array.layout = "rowmajor",
                                     epoch.end.callback=mx.callback.log.train.metric(100))
graph.viz(model$symbol)

모델 테스트

# testing
predict_probs <- predict(model, test_feature, array.layout = "rowmajor")
predicted_labels <- max.col(t(predict_probs)) - 1
table(testLabels, predicted_labels)
sum(diag(table(testLabels, predicted_labels)))/length(predicted_labels)

2. Convolutional Neural Networks

Features

신경망 모델(Neural Net)
- 입력값으로 객체의 특성(feature)을 받고, 출력된 값과 실제 값을 비교하는 과정을 거침(지도학습; Supervised Learning)
- 하나의 이미지는 수많은 픽셀들이 모여 형성하며, 특정 색에 해당하는 특정 값을 가짐
  - 이미지의 모든 픽셀값을 입력값으로 갖는 신경망 모델을 만들 수 있음

Intuitions

고해상도 이미지의 경우 특성 feature 수가 너무 많아짐
- 모든 뉴런이 모든 픽셀과 연결(fully connected)되어 있을 경우, 모델 학습에 큰 어려움이 있음
- 각 뉴런들이 이미지의 일부 특성 feature에만 연결될 수 있는 구조가 더 더 적합함
- Convolution operation을 통해 구현 가능
Convolutional Neural Network
- Feed forward Network: xi^n^을 구함
  - Convilution
  - Max Pooling
  - Activation function
- Backpropagation: Error 최소화

Concolution Operation

임의의 값으로 설정된 filter가 전체 이미지 중 일부의 선형 결합을 계산
- 각각 결괏값은 하나의 Neuron이 되며, filter는 해당 Neuron의 가중치과 됨
- 결괏값의 사이즈를 정하기 위해서는 Stride, Padding, Depth 고려 필요
  - Stride: filter를 몇 칸 이동할지 결정
  - Padding: input 주변에 0으로 padding을 삽입
  - Depth number of filter: 3차원상의 neuron의 깊이를 결정

Pooling

Convolutional Layer 사이에 Pooling Layer를 넣는 방법이 많이 사용됨
- 추출한 이미지에서 지역적인 부분 특징만 뽑아 다음 layer로 넘겨줌
- 이를 통해 ① 가중치의 수를 줄일 수 있으며, ② 과적합(overfitting)을 방지할 수 있음
- 대표적으로 가장 큰 값(Local Maxima)만을 뽑아내는 Max Pooling이 많이 사용됨

MNIST

Modified National Institute of Standards and Technology
손으로 쓴 숫자를 인식하기 위해 사용되는 데이터
- 28x28 pixel(784)의 흑백 이미지(0-255)들이 있음
- 0부터 9까지 총 70,000개의 손글씨 이미지가 있음

CNN in R

신경망 모델 생성을 위한 패키지: mxnet

# Load MNIST mn1
# 28*28, 1 channel images
mn1 <- read.csv("mini_mnist.csv")
set.seed(123,sample.kind="Rounding")
N<-nrow(mn1)
tr.idx<-sample(1:N, size=N*2/3, replace=FALSE)

# split train data and test data
train_data<-data.matrix(mn1[tr.idx,])
test_data<-data.matrix(mn1[-tr.idx,])

# 0과 1 사이에 분포하도록 Nomalized(0: 검정색 / 255: 흰색)
test<-t(test_data[,-1]/255)
features<-t(train_data[,-1]/255)
labels<-train_data[,1]

# data preprocession
features_array <- features
# 입력 데이터의 차원을 설정(픽셀*객체 개수)
# ncol(features): 학습 데이터 수(866)
dim(features_array) <- c(28,28,1,ncol(features))
test_array <- test
dim(test_array) <- c(28,28,1,ncol(test))

ncol(features)
table(labels)

Convolutional Layer 구성

# Build cnn model
# first conv layers
my_input = mx.symbol.Variable('data')
conv1 = mx.symbol.Convolution(data=my_input, kernel=c(4,4), stride=c(2,2), pad=c(1,1), num.filter = 20, name='conv1')
relu1 = mx.symbol.Activation(data=conv1, act.type='relu', name='relu1')
mp1 = mx.symbol.Pooling(data=relu1, kernel=c(2,2), stride=c(2,2), pool.type='max', name='pool1')

# second conv layers
conv2 = mx.symbol.Convolution(data=mp1, kernel=c(3,3), stride=c(2,2), pad=c(1,1), num.filter = 40, name='conv2')
relu2 = mx.symbol.Activation(data=conv2, act.type='relu', name='relu2')
mp2 = mx.symbol.Pooling(data=relu2, kernel=c(2,2), stride=c(2,2), pool.type='max', name='pool2')

# fully connected
fc1 = mx.symbol.FullyConnected(data=mp2, num.hidden = 1000, name='fc1')
relu3 = mx.symbol.Activation(data=fc1, act.type='relu', name='relu3')
fc2 = mx.symbol.FullyConnected(data=relu3, num.hidden = 3, name='fc2')

# softmax
sm = mx.symbol.SoftmaxOutput(data=fc2, name='sm')

# training
mx.set.seed(100,sample.kind="Rounding")
device <- mx.cpu()
model <- mx.model.FeedForward.create(symbol=sm, 
                                     optimizer = "sgd",
                                     array.batch.size=30,
                                     num.round = 70, learning.rate=0.1,
                                     X=features_array, y=labels, ctx=device,
                                     eval.metric = mx.metric.accuracy,
                                     epoch.end.callback=mx.callback.log.train.metric(100))
graph.viz(model$symbol)

# test
predict_probs <- predict(model, test_array)
predicted_labels <- max.col(t(predict_probs)) - 1
table(test_data[, 1], predicted_labels)
sum(diag(table(test_data[, 1], predicted_labels)))/length(predicted_labels)

네트워크 시각화 함수: graph.viz(model$symbol)
1
graph.viz(model$symbol)

3. 텍스트 마이닝

텍스트 마이닝이란?

텍스트마이닝(Text mining)
- 웹페이지, 이메일, 소셜네트워크 기록 등 전자문서 파일로부터 특정 연관성(동시적으로 빈도 높은 단어 추출)을 분석하는 방법
- 다양한 방식의 알고리즘을 이용해 대용량의 텍스트로부터 트렌드와 관심어를 찾아내는 기법으로 사용

텍스트 마이닝 필요 패키지

Natural language processing
- install.packages(‘NLP’)
text mining package
- install.packages(‘tm’)
visualizing
- install.packages(‘wordcloud’)
color displaying
- install.packages(‘RColorBrewer’)
Korean processing
- install.packages(‘KoNLP’)
import twitter data
- install.packages(‘twitteR’)

# set library (set in order)
library(NLP)
library(tm)
library(RColorBrewer)
library(wordcloud)

사용 데이터: tm에 포함된 crude

# 20 new articles from Reuter- 21578 data set
data(crude)
# To know abour crude data
help(crude)

텍스트 마이닝- 함수

# information about the first file in crude data
str(crude[[1]])
content(crude[[1]])
meta(crude[[1]])
lapply(crude, content)

1
2
3

# inspect function
inspect(crude[1:3]) 
inspect(crude[[1]])

텍스트 마이닝- 전처리 함수

함수|설명
|—|—|
tm_map(x, removePunctuation)|문장부호 제거(. , “” ‘’)
tm_map(x, stripWhitespace)|공백문자 제거
tm_map(x, removeNumbers)|숫자 제거

텍스트 마이닝 실습

문장부호 없애기: tm_map(x, removePunctuation)

1
2
3

# 1. remove punctuation in documnet
crude<-tm_map(crude, removePunctuation)
content(crude[[1]])

숫자 제거: tm_map(x, removeNumbers)

1
2
3

# 2. remove numbers
crude<-tm_map(crude, removeNumbers)
content(crude[[1]])

스톱워즈 제거: 언어별로 다르며 별도 지정 가능

1
2
3

# 3. remove stopwords
crude<-tm_map(crude, function(x) removeWords(x,stopwords()))
content(crude[[1]])

참고: 스톱워즈 리스트

1	stopwords() #stopwords("en") 174개, stopwords("SMART")" 517개

문서 행렬 구성: TermDocumentMatrix(문서이름)

1
2
3

# 4. contruct term-doucument matrix 
tdm<-TermDocumentMatrix(crude)
inspect(tdm)

결과 해석
- DOCS 144: crude 문서 번호 144에 나오는 단어들의 빈도

문서 행렬을 행렬로 변환

1
2
3

# 5. read tdm as a matrix
m<-as.matrix(tdm)
head(m)

단어의 빈도 순서로 정렬

1
2
3

# 6. sorting in high frequency to low frequency 
v<-sort(rowSums(m), decreasing=TRUE)
v[1:10] # 가장 높은 것부터 [1:10]번까지

텍스트마이닝 수행
- 단어 이름과 빈도를 결합한 행렬을 데이터 프레임으로 저장
- crude 관련 기사 파일로부터 962개 단어를 분류해 빈도 계산
  1
  2
  3
  4
  # 7. match with freq and word names
  d<-data.frame(word=names(v), freq=v)
  head(d) #가장 빈도 높은 단어 6개
  d[957:962, ] #가장 빈도 낮은 단어 6개

텍스트마이닝 결과 그리기

1
2
3

# 7. plot a word cloud
wordcloud(d$word, d$freq)
# help(wordcloud)

# 7-1. Now lets try it with frequent words plotted first
# par(mfrow = c(1, 1),mar=c(1,2,3,1))
wordcloud(d$word,d$freq,c(8,.3),2,,FALSE,.1)

# 7-2. color plot with frequent words plotted first
pal <- brewer.pal(9,"BuGn")
pal <- pal[-(1:4)]
wordcloud(d$word,d$freq,c(8,.3),2,,FALSE,,.15,pal)

StudyPostechML