빅데이터 분석과 R 프로그래밍 1: Ⅳ. R 그래픽 Ⅰ

2020-12-06

2021-02-01

Postech, ggplot2

POSTECH에서 제공하는 MOOC 중, 빅데이터분석과 R프로그래밍 Ⅰ 과정입니다.

Ⅳ. R 그래픽 Ⅰ

1. R 그래픽 Ⅲ: 히스토그램

히스토그램 (1차원)

데이터 불러오기

# read brain data
brain <- read.csv(file="week4_1/brain2210.csv")

# attach 적용하기
attach(brain)

히스토그램: hist(변수이름)

hist(brain$wt)
hist(wt)

# 색상 선택
hist(wt, col = "lightblue")

히스토그램(색과 제목): hist(변수이름, col=”colname”, main=” “)

1 2	# histogram with color and title, legend hist(wt, breaks = 10, col = "lightblue", main="Histogram of Brain weight")

색(657가지 색)

colors() → 모든 색의 이름을 볼 수 있음

# see rgb values for 657 colors, choose what you like
# colors()

# select colors including "blue"
grep("blue", colors(), value=TRUE)

밀도함수 그려보기

# fit function(find density function)
par(mfrow=c(1,1))
d <- density(brain$wt)
plot(d)

그룹별 히스토그램(동일한 x축, y축 범위): xlim, ylim

# histogram with same scale
# 2 multiple plot
par(mfrow=c(2,1)) # 그래프 화면 분할을 2행 1열로 하라는 의미
brainf<-subset(brain,brain$sex=='f') 
hist(brainf$wt, breaks = 12,col = "green", xlim=c(900,1700),ylim=c(0,20),cex=0.7, main="Histogram with Normal Curve (Female)", xlab="brain weight")

brainm<-subset(brain,brain$sex=='m') 
hist(brainm$wt, breaks = 12,col = "orange",xlim=c(900,1700),ylim=c(0,20), main="Histogram with Normal Curve (Male)", xlab="brain weight")

grep(“violet”, colors(), value=TRUE)

2. R 그래픽 Ⅱ: 상자그림, 파이차트

상자그림(Boxplot, 1차원)

상자그림: boxplot(변수이름, col=”green”)

# boxplot
par(mfrow=c(1,2))
# boxplot for all data
boxplot(brain$wt, col=c("coral"))

그룹별 상자그림: boxplot(변수이름~그룹이름, col=c(“col1”, “col2”)

# boxplot by gender (female, male)
boxplot(brain$wt~brain$sex, col = c("green", "orange"))
#- 수평 상자그림: boxplot(변수이름, col="colname", horizontal=TRUE)
```{r}
par(mfrow=c(1,1))
boxplot(brain$wt~brain$sex, boxwex=0.5, horizontal=TRUE, col = c("grey", "red"))

박스플롯 폭 조정 옵션: boxwex=

1
2
3

par(mfrow=c(1,2))
boxplot(brain$wt, boxwex = 0.25, col = c("coral"), main = "Boxplot for all data")
boxplot(brain$wt, boxwex = 0.5, col=c("coral"), main="Boxplot for all data")

상자그림에 기술통계치 넣기: 관측치수(n) 넣기

# add text(n) over a boxplot
par(mfrow=c(1,2))
boxplot(brain$wt~brain$sex, col=c("green", "orange"))
text(c(1:nlevels(brain$sex)), a$stats[nrow(a$stats),]+30, paste("n = ",table(brain$sex),sep=""))

막대그림

차의 연비 데이터(autompg)

1 2	car <- read.csv("week4_2/autompg.csv") attach(car)

barplot(변수빈도, col=c(“col1”,”col2”,…))

# bar plot with cyliner count
# par(mfrow=c(1,1))
table(car$cyl)
freq_cyl <- table(cyl)
names(freq_cyl) <- c("3cyl", "4cyl", "5cyl", "6cyl",
                      "8cyl")
barplot(freq_cyl, col = c("lightblue", "mistyrose", "lightcyan",
                          "lavender", "cornsilk"))

파이차트

pie(변수빈도, labels=c(“ “, …))

파이차트를 그리기 위해서는 table(변수이름)을 이용하여 빈도를 계산해야 함

# you can alse custom the labels
freq_cyl <- table(cyl)
names(freq_cyl) <- c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl")

pie(freq_cyl)

시계방향으로 파이차트 그리기

1
2
3

# pie chart clockwise
pie(freq_cyl, labels = c("3cyl", "4cyl", "5cyl", "6cyl","8cyl"),
    clockwise = TRUE)

몇 개의 변수만 뽑아서 그래프 그리기

# 4-3 pie chart of subset
# subset with cylinder (4,6,8) - refresh creating subset data lec3_2.R
car1<-subset(car, cyl==4 | cyl==6 | cyl==8)
table(car1$cyl)

freq_cyl1<-table(car1$cyl)
pie(freq_cyl1, labels = c("4cyl","6cyl","8cyl"),
    clockwise = TRUE)

3. R 그래픽 Ⅲ: 산점도

산점도: plot(x,y)

par(mforw=c(1,1))
x2 <- c(1, 4, 9)
y2 = 2+x2
plot(x2, y2)

par(mfrwo=c(2,1))
x <- seq(0, 2*pi, by=0.001)
y1 <- sin(x)
plot(x, y1, main = "sin curve (0:2*pi)")

y2 <- cos(x)
plot(x,y2,main="cosine curve (0:2*pi)")

wt(차의 무게)과 mpg(연비) 간의 산점도: plot(wt, mpg)
hp(마력)과 mpg(연비) 간의 산점도: plot(hp, mpg)
1
2
3
par(mfrow=c(2,1))
plot(wt, mpg)
plot(hp, mpg)
산점도 해석
- 차 무게가 무거울수록 연비는 낮다.
- 마력과 연비 간 산점도에서는 두 개의 클러스터가 보임(클러스터 내에서는 마력이 높을수록 연비 낮음)

plot(x, y, col=as.integer(그룹변수)) 색으로 표시

1
2
3

par(mfrow=c(2,1), mar=c(4,4,2,2))
plot(disp, mpg, col=as.integer(car$cyl))
plot(wt, mpg,  col=as.integer(car$cyl))

Conditioning plot: coplot(y~x|z)는 factor(그룹)

그룹에 따른 (x와 y간) 산점도

그룹변수(factor변수)간 평균 차이를 제공

1
2
3

car1<-subset(car, cyl==4 | cyl==6 | cyl==8)
coplot(car1$mpg ~ car1$disp | as.factor(car1$cyl), data = car1,
       panel = panel.smooth, rows = 1)

위 그래프로 확인 가능한 그룹별 산점도
- cylinder에 따른 차이를 보여줌
- 4cyl, 6cyl, 8cyl별로 (배기량과 연비) 관계를 구체적으로 해석할 수 있음

pairwise scatterplot: pairs(변수리스트)

1 2	# cross-tab plot to see how explanatory variables are related each other pairs(car1[,1:6], col=as.integer(car1$cyl), main = "autompg")

최적 적합 함수 추정(선형회귀모형, 비선형회귀모형)

lm(y변수~x변수): lm(linear model, 선형모형)

abline: add line (선 추가 함수)

# scatterplot with best fir lines
plot(wt, mpg, col=as.integer(car$cyl), pch=19)
# best fit linear line
abline(lm(mpg~wt), col="red", lwd=2, lty=1)
# lwd: width, 선의 굵기, lty: type, 라인의 타입

최적 적합 함수 추정: 비선형회귀모형, lowess 이용

lowess: locally-weighted polynomial regression (see the references)

# scatterplot with best fit lines
par(mfrow=c(1,1))
plot(wt, mpg,  col=as.integer(car$cyl), pch=19)
# best fit linear line
abline(lm(mpg~wt), col="red", lwd=2, lty=1)

# lowess : smoothed line, nonparmetric fit line (locally weighted polynomial regression)
lines(lowess(wt, mpg), col="red", lwd=3, lty=2)
help(lowess)

4. 그래픽과 레이아웃

그래프의 기본 함수

그래프 종류: plot(), barplot(), boxplot(), hist(), pie(), persp()
그래프 조정 사항: 점/선 종류, 글자 크기, 여백 조정
점 그리기: points()
선 그리기: lines(), abline(), arrows()
문자 출력: text()
도형: rect(), ploygon()
좌표축: axis()
격자표현: grid()

그래픽 옵션

par(): 그래프 출력 조정- 화면 분할, 마진, 글자 크기, 색상
pty=”s”: x축과 y축을 동일 비율로 설정, square
pty=”m”: 최대 크기로 설정, maximal
legend = c(“name1”, “name2”)
bty=”o”: box type 그래프 상자 모양 설정- o, l, 7, c, u
pch=1(default): point character (1=동그라미, 2=세모, … , 19=채운 동그라미)
Lty=(solid가 default): line type, 1=직선, 2=점선
cex=1(default): character expansion, 문자나 점의 크기, 숫자 클수록 글자 크기 커짐
mar: 아래, 왼쪽, 위쪽, 오른쪽 여백

선 그리기

abline(h=위치, v=위치, col=colname)

par(mfrow=c(1,1))
plot(wt, mpg, main = "abline on the scatterplot")
# horizontal
abline(h = 20)
abline(h = 30) # 수평선을 y축 위치 20과 30에
# vertical
abline(v = 3000, col="blue") # 수직선을 x축 위치 3000에, 색은 파란색

abline(절편값, 기울기값, lty=1, lwd=1, col=colname)

lty=1(직선), lty=2(점선), lwd=1 (line width, 숫자 클수록 선 굵어짐)

# y = a + bx
abline(a = 40, b = -0.0076, col="red")
# linear model coefficients, lty (line type), lwd (line with)
# linear model (mpg=f(wp))
z <- lm(mpg ~ wt, data = car)
z
abline(z, lty = 2, lwd = 2, col="green")

layout 함수

par(mfrow=c(2,2))

# 2*2 mulitple plot
par(mfrow=c(2,2))
plot(wt, mpg)
plot(disp, mpg)
plot(hp, mpg)
plot(accler, mpg)

margin 조정: mar(아래, 왼쪽, 위쪽, 오른쪽)

# 2*2 mulitple plot adjusting margin
par(mfrow=c(2,2), mar=c(4,4,2,2))
plot(wt, mpg)
plot(disp, mpg)
plot(hp, mpg)
plot(accler, mpg)

layout 조정

layout 행렬 m

# top 1 plot, bottom 2 plot
(m <- matrix(c(1, 1, 2, 3), ncol = 2, byrow = T))
layout(mat = m)
plot(car$wt, car$mpg, main = "scatter plot of autompg", pch = 19, col = 4)
hist(car$wt)
hist(car$mpg)

legend 달기

legend(x축 위치, y축 위치, legend=범례라벨, pch=1, col=c(번호 혹은 색으로 지정), lty=1)

# scatterplot coloring group variable
par(mfrow=c(1,1), mar=c(4,4,4,4))
plot(wt, mpg,  col=as.integer(car$cyl))
labels = c("3cyl", "4cyl", "5cyl", "6cyl","8cyl")
legend(4000, 45, legend = labels, pch = 1, col =c(3,4,5,6,8), lty =1)

R 그래픽

히스토그램과 밀도함수(histogram and density)
상자그림(boxplot)
파이차트(pie chart), 막대그림(bar plot)
산점도(scatterplot)
ggplot2를 이용한 그래픽
덴드로그램, 애니메이션
지도분석(map)
3D, 히트맵

StudyPostechBigdata