Data Analysis Libraries

푸닥거리 2020. 7. 30. 07:33

Exploratory Data Analysis (EDA) -> Insight

-> NumPy, Pandas: packages for handling data

-> Matplotlib, Seaborn: packages for visualizing data

 

Machine Learning -> Optimization

-> Scikit-learn: machine learning library

-> Statsmodels: statistics library

 

 

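NumPy package

- The fundamental package for scientific computing with Python

- NumPy's main object is the homogeneous multidimensional array

- Creates and manages N-dimensional array objects

- Provides linear algebra, Fourier transform, and random number generation routines

- In NumPy, dimensions are called axes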

https://numpy.org/devdocs/

 

 

Main NumPy functions

- Array creation: arange, array, copy, empty, empty_like, eye, fromfile, fromfunction, identity, linspace, logspace, mgrid, ogrid, ones, ones_like, r_, zeros, zeros_like

- Conversions: ndarray.astype, atleast_1d, atleast_2d, atleast_3d, mat

- Manipulations: array_split, column_stack, concatenate, diagonal, dsplit, dstack, hsplit, hstack, ndarray.item, newaxis, ravel, repeat, reshape, resize, squeeze, swapaxes, take, transpose, vsplit, vstack

- Searching: all, any, nonzero, where

- Ordering: argmax, argmin, argsort, max, min, ptp, searchsorted, sort

- Operations: choose, compress, cumprod, cumsum, inner, ndarray.fill, imag, prod, put, putmask, real, sum

- Basic statistics: cov, mean, std, var

- Linear algebra: cross, dot, outer, linalg.svd, vdot

 

 

import numpy as np

A = np.arange(15).reshape(3,5)

 

A

 

array([[ 0, 1, 2, 3, 4], [ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14]])

 

 

 

 

ndarray attributes

- ndarray.ndim: the number of axes (dimensions) of the array

 

A.ndim

 

 

- ndarray.shape: a tuple of integers giving the size of the array along each dimension. For a matrix with n rows and m columns, shape is (n, m). The length of the shape tuple is the number of axes (ndim).

 

A.shape

 

 

- ndarray.size: the total number of elements in the array, equal to the product of the elements of shape

 

A.size

 

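- ndarray.dtype: the type of the elements in the array. It can be specified with standard Python types or NumPy types (numpy.int32, numpy.int16, numpy.float64, etc.). Assigning a new dtype is not a type conversion: the existing bytes are re-sliced to the size of the new type and reinterpreted as that type. Use astype(t) to convert values.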

 

A.dtype -> dtype('int32')  # dtype('int64') on platforms whose default integer is 64-bit

 

 

- ndarray.itemsize: the size in bytes of each element of the array. An array of float64 elements has itemsize 8 (=64/8), and one of type complex32 has itemsize 4 (=32/8). It is the same as ndarray.dtype.itemsize.

 

A.itemsize 
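As a quick check of the attributes listed above, the following minimal sketch builds the same 3x5 array and collects the attributes in one dictionary. The explicit dtype=np.int64 and the names attrs and F are additions for illustration; they are not in the original notes.

```python
import numpy as np

# Request the dtype explicitly so the result does not depend on the
# platform's default integer size (an assumption made for this sketch).
A = np.arange(15, dtype=np.int64).reshape(3, 5)

attrs = {
    "ndim": A.ndim,          # number of axes
    "shape": A.shape,        # size along each axis
    "size": A.size,          # total number of elements
    "dtype": A.dtype.name,   # element type
    "itemsize": A.itemsize,  # bytes per element
}
print(attrs)

# astype() converts the values and returns a new array; A is unchanged.
F = A.astype(np.float32)
print(F.dtype, F.itemsize)
```

Note that size equals the product of the entries of shape (3 * 5 = 15), and itemsize follows directly from the dtype.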

 

 

 

 

 

 

https://numpy.org/doc/stable/index.html

 

 

import numpy as np

A = np.array([2,3,4])

 

A

 

A.dtype

C = np.array([[1,2], [3,4]], dtype=complex)

 

C

 

D = np.array(C, copy=False)

 

id(C), id(D)

 

E = np.array(A, copy=True)

 

id(E), id(A)

 

 

np.zeros((3,4))

 

 

np.ones((2,3,4), dtype=np.int16)

 

 

np.empty((2,3))  # uninitialized: the values depend on the state of memory

 

 

np.arange(10)

 

 

np.arange(11,20)

 

 

np.arange(10, 30, 5)

 

 

np.arange(0, 2, 0.3)

 

 

np.linspace(0,2) # the end value is included

 

 

np.linspace(0,2,9)

 

 

np.linspace(0,10,20)

 

 

 

x = np.linspace(0, 2*np.pi, 100)

y = np.sin(x)

 

A = np.arange(12).reshape(3,4)

A

 

A.shape

 

A.ravel()

 

A.reshape(6,2)

 

 

A.T

 

A.resize(2,6) # modifies the array itself

 

A.shape

 

A.shape = (3,4)

 

 

A.reshape(2,-1) # returns the reshaped array

 

 

A = np.arange(6)

print(A)

 

B = np.arange(12).reshape(4,3)

print(B)

 

C = np.arange(24).reshape(2,3,4)

 

C

 

 

print(np.arange(10000).reshape(100,100))

 

np.set_printoptions(threshold=10000)

print(np.arange(100).reshape(10,10))

 

A = np.array([20,30,40,50])

B = np.arange(4)

 

print(A)

print(B)

 

A - B

 

[1,2,3]*2

 

A*2

 

 

B ** 2

 

 

10 * np.sin(A)

 

A < 35

 

A = np.array([[1,1],[0,1]])

B = np.array([[2,0],[3,4]])

 

A * B

 

A @ B # 행렬의 곱

 

A.dot(B) # 행렬의 곱

 

A = np.ones((2,3), dtype=int)

B = np.random.random((2,3))

 

A

 

 

 

B

 

A *= 3

A

 

B += 3

B

 

 

A += B

 

A = np.ones(3, dtype=np.int32)

B = np.linspace(0,np.pi,3)

B.dtype

 

C = A+B

C.dtype

 

A = np.random.random((2,3))

A

A.sum(), A.min(), A.max()

 

np.sum(A)

 

A = np.arange(12).reshape(3,4)

A

A.sum(axis=0) # axis 0 is the row axis: fix the column and aggregate over the changing row index

 

np.sum(A, axis=0)

 

A.sum(axis=1) # axis 1: fix the row and aggregate over the changing column index

 

B = np.arange(24).reshape(2,3,4)

B

 

B[1,2,1] # plane (depth), row, column

 

B.sum()

 

B.sum(axis=0) # axis 0 is the plane (depth): aggregate entries that share the same row and column

 

 

B.sum(axis=1) # axis 1: the row index varies; aggregate entries that share the same depth and column

 

B.sum(axis=2) # axis 2: depth and row are fixed; aggregate over the changing column index

 

A = np.arange(12).reshape(3,4)

A

 

A.sum(axis=0) # fix the column; the row index varies

 

 

reshape: returns the reshaped array; -1 can be used for one dimension

resize: modifies the array itself; -1 (a negative value) cannot be used

 

For 2-D arrays

axis = 0: the column index is fixed and data are aggregated over the changing row index

 

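For 3-D arrays

axis = 0: the row and column indices are fixed and data are aggregated over the changing depth (plane) index

axis = 1: the depth and column indices are fixed and data are aggregated over the changing row index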
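One way to remember the axis rule is that the axis you name is the one that disappears from the shape. The sketch below illustrates this with the same 2-D and 3-D arrays used in these notes; the variable names col_sums, row_sums, and shapes are illustrative only.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
col_sums = A.sum(axis=0)   # row index varies, one result per column -> shape (4,)
row_sums = A.sum(axis=1)   # column index varies, one result per row -> shape (3,)
print(col_sums, row_sums)

B = np.arange(24).reshape(2, 3, 4)
# Summing over axis k removes that axis from the shape (2, 3, 4).
shapes = [B.sum(axis=k).shape for k in range(3)]
print(shapes)
```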

 

A = np.arange(3)

B = np.arange(4,7)

 

print(A)

print(B)

 

A + B

 

np.add(A,B)

 

id(A), id(B) # object identities (memory addresses in CPython)

 

np.add(A, B, A) # add A and B and store the result in A

 

A = np.arange(3)

B = np.arange(4,7)

 

A + B

 

print(id(A)), print(id(B)) # object identities (memory addresses in CPython)

 

np.add(A, B, A) # add A and B and store the result in A

 

id(A)

A = np.arange(3)

B = np.arange(4,7)

C = np.array([20,30,40])

print(A)

print(B)

print(C)

 

 

G = A*B+C

 

T1 = A * B

G = T1 + C

del T1

 

G = A * B

np.add(G,C,G)

 

G = A * B

G += C

 

 

 

NumPy universal functions (ufuncs)

https://numpy.org/doc/stable/reference/ufuncs.html

 

 

import numpy as np

A = np.array([1,2,3])

B = np.array([2,2,2])

 

A * B

 

 

np.multiply(A, B)

 

 

B = 2

 

A * B

 

A = np.array([[0,0,0],

             [10,10,10],

             [20,20,20],

             [30,30,30]])

 

B = np.array([1,2,3])

 

A + B

 

 

A = np.array([0,10,20,30])

B = np.array([1,2,3])

 

A[:, np.newaxis]

 

 

A = np.array([0,10,20,30])

B = np.array([1,2,3])

 

A[:, np.newaxis] # adds one axis

B

 

A[:,np.newaxis] + B # the result has 4 rows and 3 columns
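The broadcasting step above can be sketched as follows: adding an axis turns A into a (4, 1) column, and NumPy then stretches it against the (3,) row. The names col, total, and same are illustrative.

```python
import numpy as np

A = np.array([0, 10, 20, 30])
B = np.array([1, 2, 3])

col = A[:, np.newaxis]        # shape (4, 1)
total = col + B               # (4, 1) + (3,) broadcasts to (4, 3)
same = A.reshape(-1, 1) + B   # reshape(-1, 1) builds the same column

print(col.shape, total.shape)
print(total)
```

Without the extra axis, A + B would fail because shapes (4,) and (3,) cannot be broadcast together.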

 

import numpy as np

A = np.arange(10)**3

A

 

A[2:5]

 

A[0:7:2]

 

A[::2]

 

A[::-1]

 

A[:6:2] = -1000

 

A

 

B = np.arange(20).reshape(5,4)

B

 

B[2,3]

 

B[-3,-1]

 

C = np.arange(24).reshape(2,3,4)

C

 

C[0,1,2]

 

C[1,2,3]

 

B[0:5,1]

 

B[:,1]

 

B[1]

 

C[:,0:2,1:3]

 

C[:,0:2,]

 

 

C[1]

 

C[0:2]

 

 

Stacking arrays

 

hstack(), vstack(), dstack()

 

 

A = np.arange(12).reshape(3,4)

B = np.arange(12,24).reshape(3,4)

 

print(A)

print(B)

 

np.vstack((A,B)) # stack vertically: the second array is appended below

 

 

np.hstack((A,B)) # stack horizontally: the second array is appended to the side

 

 

np.dstack((A,B)) # stack along the third axis (depth)

 

 

A = np.array((1,2,3,4))

B = np.array((5,6,7,8))

C = np.array((9,10,11,12))

 

print(A)

print(B)

print(C)

 

 

np.column_stack((A,B,C))

 

 

np.hstack((A,B))

 

 

A[:,np.newaxis]

 

 

np.hstack((A[:,np.newaxis], B[:,np.newaxis]))

 

 

np.row_stack((A,B))

 

np.vstack((A,B))

 

A = np.arange(12).reshape(3,4)

B = np.arange(12,24).reshape(3,4)

 

print(A)

print(B)

 

 

np.stack((A,B), axis=0)

 

 

np.stack((A,B), axis=1) # the new axis is inserted at position 1 (rows); the result is 3-D

 

np.stack((A,B), axis=2) # the new axis is inserted at position 2 (columns); the result is 3-D

 

 

A = np.array((1,2,3,4))

B = np.array((5,6,7,8))

C = np.array((9,10,11,12))

 

print(A)

print(B)

print(C)

 

 

np.r_[A,B,C]

 

 

np.c_[A,B,C]

 

 

Splitting arrays

vsplit(), hsplit(), dsplit()

 

 

A = np.arange(12).reshape(3,4)

A

 

 

 

np.vsplit(A,3)

 

 

np.vsplit(A, 3)[0]

 

 

np.hsplit(A, 2)

 

np.hsplit(A, 4)

 

 

B = np.arange(24).reshape(2,3,4)

B

 

 

np.dsplit(B,2)

 

 

np.split(A,3, axis=0)

 

 

np.split(A, 2, axis=1)

 

 

np.hsplit(A,2)

 

 

 

# 2-D

- axis 0: the row index varies, the column is fixed

- axis 1: the column index varies, the row is fixed

 

 

A = np.arange(24).reshape(3,8)

A

 

 

np.hsplit(A, 2)

 

np.hsplit(A, (2,5,6)) # split at the column indices in the tuple: columns [:2], [2:5], [5:6], [6:]

 

 

A = np.arange(12).reshape(3,4)

A

 

 

 

np.split(A, 2, axis=1)

 

 

 

np.array_split(A, 2, axis=0)

 

 

 

A = np.arange(10)

A

 

 

 

np.split(A, 4) # raises an error: split requires an equal division

-> np.array_split(A, 4) # works even when the division is not equal

 

 

Slicing,

stack: c_ and the newaxis attribute turn a 1-D array into a vertical (column) 2-D structure

split: requires an equal division
array_split does not require an equal division
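The difference between split and array_split noted above can be checked with a 10-element array divided into 4 parts. This is a minimal sketch; split_failed, parts, and sizes are names chosen here for illustration.

```python
import numpy as np

A = np.arange(10)

# np.split insists on equal-sized pieces, so 10 elements into 4 parts fails.
try:
    np.split(A, 4)
    split_failed = False
except ValueError:
    split_failed = True

# np.array_split allows uneven pieces; the earlier parts get the extra elements.
parts = np.array_split(A, 4)
sizes = [len(p) for p in parts]
print(split_failed, sizes)
```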

 

 

A = np.arange(12)

B = A

print(B)

print(A)

 

 

B.shape = (3,4)

 

B[::2,] = 0

 

print(B)

print(A)

 

 

A = np.arange(12).reshape(3,4)

C = A.view()

C is A

C.flags.owndata

C.shape = (2,6)

A[0,] = [10,20,30,40]

 

A = np.arange(12).reshape(3,4)

S = A[:,1:3]

S[:,1] = 100

 

 

A = np.arange(12).reshape(3,4)

D = A.copy()

D is A

D.shape = (2,6)

D[0,:] = [0,0,0,0,0,0]
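The three cases above (plain view, slice, copy) differ in whether they share memory with the original array. The sketch below makes that explicit with np.shares_memory; the names V, S, and D follow the spirit of the cells above but are otherwise illustrative.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

V = A.view()     # new array object that looks at the same data
S = A[:, 1:3]    # basic slicing also returns a view
D = A.copy()     # independent data

S[:, 1] = 100    # writes through to column 2 of A
D[0, 0] = -1     # does not touch A

print(V is A, np.shares_memory(A, V), np.shares_memory(A, S), np.shares_memory(A, D))
print(A)
```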

 

import numpy as np

A = np.arange(12)**2

i = np.array([1,1,3,8,5])

A[i]

j = np.array([[3,4],[9,7]])

A[j]

 

A = np.arange(12).reshape(3,4)

 

ind_i = np.array([[0,1],[1,2]])

A[ind_i]

A[ind_i, :]

 

ind_j = np.array([[2,1],[3,3]])

A[ind_i, ind_j]

 

 

A = np.arange(5)

A[[1,3,4]] = 0

 

 

A = np.arange(5)

A[[0,0,2]] = [10,20,30]

 

A = np.arange(5)

A[[0,0,2]] += 1

A
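Note that with a repeated index, += applies the increment only once per distinct position. If accumulation is wanted, np.add.at performs the operation unbuffered. A small sketch comparing the two (buffered and B are illustrative names):

```python
import numpy as np

A = np.arange(5)
A[[0, 0, 2]] += 1            # index 0 appears twice but is incremented once
buffered = A.copy()

B = np.arange(5)
np.add.at(B, [0, 0, 2], 1)   # unbuffered: index 0 is incremented twice

print(buffered)
print(B)
```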

 

A = np.array([2,3,4,5])

B = np.array([8,5,4])

 

np.ix_(A, B)

 

 

A = np.array([2,3,4,5])

B = np.array([8,5,4])

C = np.array([5,4,6,8,3])

 

np.ix_(A,B,C)

 

AX, BX = np.ix_(A,B)

 

AX + BX

 

#A + B * C

 

AX, BX, CX = np.ix_(A,B,C)

AX + BX * CX

 

A = np.array([2,3,4,5])

B = np.array([8,5,4])

 

def reduce_func(*arrs, func=np.add):
  # np.ix_ reshapes each input so that they broadcast against one another
  aix = np.ix_(*arrs)
  result = aix[0]
  for item in aix[1:]:
    result = func(result, item)
  return result

 

reduce_func(A,B)

 

reduce_func(A,B, func=np.divide)

 

A = np.arange(20).reshape(4,5)

A

 

A % 2 == 0

 

A[A%2==0]

 

A[A%2==0] = A[A%2==0]**2

A

 

A[A%2==0] = 0

A

 

 

import numpy as np

A = np.array([[1,2],[3,4]])

A

 

A.T

 

B = np.array([[0,-1], [1,0]])

B

 

A @ B # matrix product

 

A.dot(B) # matrix product

 

np.dot(A, B) # matrix product

 

np.linalg.inv(A) # inverse matrix

 

np.eye(2) # identity matrix

 

np.diag(A) # diagonal elements of A

 

np.trace(A) # trace (sum of the diagonal)

 

np.linalg.det(A) # determinant

 

# solve: solves a system of linear equations

x = [2, 3]

y = [6.8, 7.3]

 

A = np.c_[x, np.ones(2)]

A

 

B = np.array(y)

B

 

np.linalg.solve(A, B)

 

# y = wx + b, y = ax + b

 

import matplotlib.pyplot as plt

%matplotlib inline

 

x = [2, 3]

y = [6.8, 7.3]

 

plt.scatter(x, y)

plt.plot(x, np.multiply(x, 0.5)+5.8)

plt.show()

 

 

# eig: computes eigenvalues and eigenvectors

np.linalg.eig(A)

 

# svd() performs singular value decomposition; the result is U, S, V.T

np.linalg.svd(A)

 

 

# transpose: reorders the axes of an array while keeping the same number of dimensions

imgs = np.ones(shape=[4,3,28,28])

imgs.shape

 

# keep the number of dimensions and move the channel axis to the last position

trans_imgs = imgs.transpose([0,2,3,1])

trans_imgs.shape

 

 

# With many points, express the model estimation problem in matrix form and then apply linear algebra (a system of linear equations)

 

X = [32,64,96,118,126,144,152,158]

Y = [17,24,62,49,52,105,130,125]

 

A = np.c_[X, np.ones(len(X))]

A

 

B = np.array(Y)

B

 

inv_A = np.linalg.inv(A.T @ A) @ A.T # pseudo-inverse of A

w, b = inv_A @ B

 

w, b # slope, bias (intercept)
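The pseudo-inverse fit above can be cross-checked against NumPy's built-in least-squares routines. The sketch below assumes the same X and Y data and compares three routes: the normal equations, np.linalg.lstsq, and np.linalg.pinv. The suffixed names (w_ne, w_ls, w_pinv) are illustrative.

```python
import numpy as np

X = [32, 64, 96, 118, 126, 144, 152, 158]
Y = [17, 24, 62, 49, 52, 105, 130, 125]

A = np.c_[X, np.ones(len(X))]   # design matrix with a column of ones for the bias
B = np.array(Y, dtype=float)

# 1) Normal equations, as in the notes above.
w_ne, b_ne = np.linalg.inv(A.T @ A) @ A.T @ B

# 2) Least-squares solver; it returns (solution, residuals, rank, singular values).
(w_ls, b_ls), *_ = np.linalg.lstsq(A, B, rcond=None)

# 3) Moore-Penrose pseudo-inverse computed via SVD.
w_pinv, b_pinv = np.linalg.pinv(A) @ B

print(w_ne, b_ne)
print(w_ls, b_ls)
print(w_pinv, b_pinv)
```

All three should agree up to floating-point error here. For larger or ill-conditioned problems, lstsq or pinv is generally the safer choice because it avoids explicitly inverting A.T @ A.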

 

 

 

import matplotlib.pyplot as plt

%matplotlib inline

 

plt.scatter(X,Y)

plt.plot(X, np.multiply(X, w)+b)

plt.show()

 

import pandas as pd

 

d = [{'col1':1, 'col2':3}, {'col1':2, 'col2':4}]

pd.DataFrame(data=d)

 

 

d = [{'col1':1, 'col2':3}, {'col1':2, 'col2':4}, {'col1':2}]

pd.DataFrame(d)

 

 

L1 = [1,2,3,4,5]

L2 = [6,7,8,9,10]

 

pd.DataFrame({'col1':L1, 'col2': L2})

 

 

 

import numpy as np

pd.DataFrame(np.c_[L1, L2], columns=['col1','col2'])

 

 

from sklearn import datasets

iris_dic = datasets.load_iris()

type(iris_dic)

 

 

print(iris_dic.target_names)

 

X = pd.DataFrame(iris_dic.data, columns=iris_dic.feature_names)

X

 

iris_dic.target_names[[0,0,1,1,2,2,2,2]]

 

y = pd.DataFrame(iris_dic.target_names[iris_dic.target], columns=['species'])

y

 

 

iris = pd.concat([X, y], axis=1)

iris.head()

 

 

import statsmodels.api as sm

iris_data = sm.datasets.get_rdataset("iris", package="datasets", cache=True)

type(iris_data)

 

iris_data.data

 

import seaborn as sns

iris = sns.load_dataset("iris")

type(iris)

 

iris.head()

 

 

iris['sepal_length']

 

iris.to_csv("iris.csv.gz", sep=',', mode='w', encoding='utf-8', index=False, compression='infer')

 

del iris

iris = pd.read_csv("iris.csv.gz")  # read back the gzip-compressed file written above

iris.head()

 

iris = pd.read_csv("iris.csv.gz", skiprows=[0,2])

iris.head()

 

 

import seaborn as sns

iris = sns.load_dataset("iris")

iris.columns # column names

iris.columns = ["sl", 'sw', "pl", "pw", "sp"]

iris.head()

 

iris.index

 

iris.index = range(150, 300)

iris.head()

 

iris.columns = [["sepal", "sepal", "petal", "petal", "species"],
                ["length", "width", "length", "width", "species"]]

iris.head()

 

iris.columns.names = ["sps", "lw"]

iris.head()

 

 

iris.index = [["setosa" for i in range(50)] + 

              ["versicolor" for i in range(50)] + 

              ["virginica" for i in range(50)],

              list(range(150))]

iris.head(10)

 

iris.index.names = ["sp", "rownum"]

iris.head()

 

import seaborn as sns

iris = sns.load_dataset("iris")

 

iris.sepal_length

 

iris["sepal_length"]

# loc[] : select rows or columns by label or by condition

# iloc[] : select rows or columns by integer position

 

iris.loc[0]

 

iris.loc[:, "sepal_length"]

iris.loc[0:5]

iris.loc[:, "sepal_width":"petal_width"]

 

iris.iloc[0:5]

iris.iloc[0:50:3]

 

 

 

iris.iloc[0:10:2, ::2]

 

 

iris.loc[iris.species=='versicolor']

 

 

 

iris.loc[iris.species=='versicolor', ["sepal_length", "species"]].head()

iris.loc[(iris.species=='versicolor') & (iris.sepal_length > 6.5)].head()
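One detail worth seeing in isolation: a label slice with loc includes both endpoints, while a position slice with iloc excludes the end. The sketch below uses a small hand-made DataFrame (a stand-in for iris, with made-up values and a non-zero-based index) so that labels and positions differ.

```python
import pandas as pd

# Hypothetical mini-dataset; the index starts at 10 so labels != positions.
df = pd.DataFrame(
    {
        "length": [5.1, 7.0, 6.3, 5.8],
        "species": ["setosa", "versicolor", "virginica", "versicolor"],
    },
    index=[10, 11, 12, 13],
)

by_label = df.loc[10:12]       # label slice: both ends included
by_position = df.iloc[0:2]     # position slice: end excluded

# Conditions are combined with & and each one needs its own parentheses.
mask = (df.species == "versicolor") & (df.length > 6.5)
selected = df.loc[mask, ["length", "species"]]

print(len(by_label), len(by_position))
print(selected)
```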

import seaborn as sns

iris = sns.load_dataset("iris")

iris.drop(0)

 

iris = sns.load_dataset("iris")

iris.drop(0, inplace=True) # modifies the current object

iris.head()

 

iris.drop("species", axis=1).head()

iris = sns.load_dataset("iris")

iris.species

 

iris["species"]

 

 

iris.year = 2020

iris.head()

 

iris.year

 

 

iris["year"] = 2020

iris.head()

 

 

iris["no2"] = [10,20,30] + [None]*147

iris.head()

 

 

iris["no3"] = None

iris.head()

 

 

iris.loc[0:2, "no3"] = [10,20,30]

iris.head()

 

import pandas as pd

new_data = pd.Series([10,20,30], index=[0,1,2])

new_data

iris["no4"] = new_data

iris.head()

iris = sns.load_dataset("iris")

row1 = {"sepal_length":10, "sepal_width":5, "petal_length":20, "petal_width":10, "species":"setosa"}

row1

iris.append(row1, ignore_index=True).tail()

 

new_row = pd.Series([1,2,3,4,"versicolor"], index=iris.columns)

new_row

iris.append(new_row, ignore_index=True)
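DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the two append calls above only run on older versions. A minimal sketch of the equivalent with pd.concat, using a small hypothetical frame in place of iris:

```python
import pandas as pd

# Hypothetical two-row stand-in for the iris DataFrame.
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})
row1 = {"sepal_length": 10.0, "species": "setosa"}

# Wrap the dict in a one-row DataFrame, then concatenate along the rows.
df2 = pd.concat([df, pd.DataFrame([row1])], ignore_index=True)

print(df2.tail())
```

As with append, the original frame is left unchanged and a new one is returned.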

 

import pandas as pd

df1 = pd.DataFrame({'key': ['a','b','c','f'], 'c1':[1,2,3,5]})

df2 = pd.DataFrame({'key': ['a','b','d','f'], 'c2':[5,6,7,8]})

df1.merge(df2)

 

df1.merge(df2, how='left')

 

df1.merge(df2, how='right')

 

df1.merge(df2, how="outer") # keeps keys that appear in either frame
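To compare the four join types side by side, the sketch below collects the keys that survive each merge of the same df1 and df2. The keys are sorted before printing because the row order of an outer merge can differ between pandas versions; the names keys and outer are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c", "f"], "c1": [1, 2, 3, 5]})
df2 = pd.DataFrame({"key": ["a", "b", "d", "f"], "c2": [5, 6, 7, 8]})

# Which keys remain for each value of `how` (default is "inner").
keys = {
    how: sorted(df1.merge(df2, how=how)["key"])
    for how in ("inner", "left", "right", "outer")
}
print(keys)

# Rows without a match on the other side are filled with NaN.
outer = df1.merge(df2, how="outer")
print(outer)
```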

df1

 

 

df2

 

df3 = pd.DataFrame({'key3': ['a','b','c','f'], 'c1':[1,2,3,5]})

df4 = pd.DataFrame({'key4': ['a','b','d','f'], 'c2':[5,6,7,8]})

 

df3

 

df4

df3.merge(df4, left_on='key3', right_on='key4')

 

df3.merge(df4, left_index=True, right_index=True)

 

df1 = pd.DataFrame({'c1': [1,2,3,4], 'c2': [5,6,7,8]})

df2 = pd.DataFrame({'c3': ['a','b','c','d'], 'c4': [1.2, 3.4, 5.5, 7.6]})

 

pd.concat([df1, df2], axis=1) # axis 1: concatenate side by side, left to right

 

pd.concat([df1, df2], axis=0)

 

 

df1 = pd.DataFrame({'c1': [1,2,3,4], 'c2': [5,6,7,8]}, index=[0,2,4,6])

df2 = pd.DataFrame({'c3': ['a','b','c','d'], 'c4': [1.2, 3.4, 5.5, 7.6]}, index=[0,1,2,3])

 

df1

 

 

df2

pd.concat([df1, df2], axis=1) # axis 1: concatenate side by side, left to right

 

df1.reset_index()

 

df1.reset_index(drop=True)

 

df1.reset_index(drop=True, inplace=True)

 

df1

 

df2

 

pd.concat([df1, df2], axis=1)

 

import seaborn as sns

import pandas as pd

 

iris = sns.load_dataset("iris")

iris.head()

 

 

iris.sort_index(ascending=False).head()

iris.sort_index(axis=1).head()

 

 

iris.sort_index(axis=1, inplace=True)

iris.head()

 

 

iris = sns.load_dataset("iris")

iris.sort_values(by=["sepal_length"], inplace=True)

iris.head()

 

 

 

iris.sort_values(by=["sepal_length", "sepal_width"], inplace=True)

iris.head()

iris = sns.load_dataset("iris")

iris.columns = [["sepal","sepal","petal","petal","species"],iris.columns]

iris.columns.names = ["info", "details"]

iris.head()

iris.sort_index(level=["info"], axis=1).head()

 

iris.sort_index(level=0, axis=1).head()

import seaborn as sns

iris = sns.load_dataset("iris")

iris.head()

 

iris.min() # minimum of each column

 

iris.max()

 

iris.median()

 

iris.mean() # mean

 

iris.var() # variance: average of the squared deviations from the mean (pandas uses ddof=1 by default)

 

iris.std() # standard deviation

 

iris.std(ddof=1)

 

 

iris.cov() # covariance

 

iris.corr() # correlation coefficients; rule of thumb: below about 0.3 is weak, above about 0.6 is strong

 

iris.describe()

 

 

iris.species.describe()

 

 

iris.describe(include="all")

 

import pandas as pd

df = pd.DataFrame({'a':[1,2]*3, 'b':[True, False]*3, 'c':[2.3,4.0]*3})

df

 

df.describe(include=['int64'])

 

 

df.describe(exclude=["float64"])

 

 

import seaborn as sns

iris = sns.load_dataset("iris")

iris_grouped = iris.groupby(by=iris.species)

iris_grouped

 

 

 

iris_grouped.mean()

 

import numpy as np

iris["num"] = np.ravel([[i]*25 for i in range(6)])

iris.head()

iris.tail()

 

iris_grouped2 = iris.groupby(by=[iris.species, iris.num])

iris_grouped2.mean()

 

 

for type, group in iris_grouped: 

  print(group.head())

 

iris_grouped.take([1,2,3])

 

iris_grouped.take([0,1,2])

 

 

# Reshaping data

 

import statsmodels.api as sm

 

airquality_data = sm.datasets.get_rdataset("airquality")

airquality = airquality_data.data

airquality.head()

 

import pandas as pd

 

pd.melt(airquality, id_vars=["Month", "Day"], value_vars=["Ozone"])

 

 

import pandas as pd

 

pd.melt(airquality, id_vars=["Month", "Day"], value_vars=["Ozone", "Solar.R"])

 

 

import pandas as pd

 

pd.melt(airquality, id_vars=["Month", "Day"], value_vars=["Ozone", "Solar.R", "Wind", "Temp"])

 

import pandas as pd

 

pd.melt(airquality, id_vars=["Month", "Day"])

 

 

airquality.melt(id_vars=["Month", "Day"])

 

# long format -> wide format

airquality_melted = airquality.melt(id_vars=["Month", "Day"])

airquality_melted.pivot_table(index=["Month", "Day"], columns=["variable"], values=["value"])

 

 

airquality2 = airquality_melted.pivot_table(index=["Month", "Day"], columns=["variable"], values=["value"])

airquality2.head()

 

 

airquality2.reset_index(level=["Month", "Day"], col_level=1)

 

 

airquality2 = airquality2.reset_index(level=["Month", "Day"], col_level=1)

airquality2.head()

 

 

 

airquality2.columns.droplevel(level=0)

 

 

airquality2.columns = airquality2.columns.droplevel(level=0)

airquality2.head()
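The wide -> long -> wide round trip above can be reproduced offline with a tiny hand-made table. The values and the names wide, long_df, and back are made up for illustration. Passing columns and values as plain strings (rather than lists, as in the cells above) yields single-level columns, so no droplevel step is needed in this variant.

```python
import pandas as pd

# Hypothetical two-day excerpt shaped like the airquality data.
wide = pd.DataFrame(
    {"Month": [5, 5], "Day": [1, 2], "Ozone": [41.0, 36.0], "Wind": [7.4, 8.0]}
)

# Wide -> long: one row per (Month, Day, variable).
long_df = wide.melt(id_vars=["Month", "Day"])

# Long -> wide: string arguments give a flat column index.
back = long_df.pivot_table(
    index=["Month", "Day"], columns="variable", values="value"
).reset_index()
back.columns.name = None   # drop the leftover "variable" label

print(long_df)
print(back)
```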

 

 

# Applying functions to a DataFrame

 

import seaborn as sns

iris_df = sns.load_dataset("iris")

import numpy as np

iris_df.iloc[:, :-1].apply(np.round).head()

 

 

iris_df.head()

 

 

 

iris_df.iloc[:, :-1].apply(np.sum)

 

 

iris_df.iloc[:, :-1].apply(np.mean)

 

 

iris_df.iloc[:, :-1].apply(np.sum, axis=1)

 

iris_avg = iris_df.iloc[:, :-1].apply(np.average)

iris_avg

 

 

iris_df.iloc[:, :-1].apply(lambda x :x-iris_avg, axis=1).head()

# difference between every value and the mean of its column

 

 

 

iris_df2 = iris_df.iloc[:, :-1]

iris_df2.apply(lambda x: list(x-iris_avg), axis=1).head()

 

 

 

iris_df2.apply(lambda x: list(x-iris_avg), axis=1, result_type="broadcast").head()

 

 

 

iris_df2.applymap(lambda x : x**2).head()

 

 

iris_df.sepal_length.map(np.round)

 

 

 

iris_df.sepal_length.map(np.sum)

 

 

# Handling and replacing missing values, bulk value replacement

 

import seaborn as sns

 

iris = sns.load_dataset("iris")

iris_x = iris.iloc[:,:-1]

 

import random

random.seed(1)

 

for col in range(4):

  iris_x.iloc[[random.sample(range(len(iris)), 20)], col] = float('nan')

 

iris_x.head(15)

 

 

 

iris_x.dropna().head()

 

 

 

iris_x.dropna(thresh=2).head(10)

 

 

 

iris_x.dropna(subset=["sepal_length", "sepal_width"]).head()

 

 

 

iris_x.dropna(inplace=True)

iris_x.head()

 

 

 

 

iris = sns.load_dataset("iris")

iris_x = iris.iloc[:,:-1]

 

import random

random.seed(1)

 

for col in range(4):

  iris_x.iloc[[random.sample(range(len(iris)), 20)], col] = float('nan')

 

iris_x.fillna(value=0).head()

 

 

 

 

iris_x.fillna(method="ffill").head(10)

 

 

iris_x.head(10)

 

 

 

 

import numpy as np

 

np.round(iris_x.mean(), 1)

 

iris_x.fillna(value=np.round(iris_x.mean(), 1)).head(5)

 

 

iris_x.fillna(value=np.round(iris_x.mean(), 1), limit=2).head(10)

 

iris_x.head(10)
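The dropna and fillna variants above are easier to compare on a frame small enough to check by eye. The sketch below uses a made-up four-row DataFrame. It also uses ffill() directly, since newer pandas releases deprecate the method= argument of fillna in favor of ffill()/bfill().

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.nan], "b": [np.nan, np.nan, 6.0, 8.0]}
)

dropped = df.dropna()                 # keep rows with no NaN at all
at_least_one = df.dropna(thresh=1)    # keep rows with at least 1 non-NaN value
filled_mean = df.fillna(df.mean())    # fill each column with its own mean
filled_ffill = df.ffill()             # carry the last valid value forward

print(dropped.index.tolist(), at_least_one.index.tolist())
print(filled_mean)
print(filled_ffill)
```

Forward fill cannot fill leading NaNs, so column b keeps its first two missing values.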

 

 

iris = sns.load_dataset("iris")

iris_x = iris.iloc[:,:-1]

 

import random

random.seed(1)

 

for col in range(4):

  iris_x.iloc[[random.sample(range(len(iris)), 20)], col] = float('nan')

 

iris_x.replace(float('nan'), 10).head()

 

 

 

 

iris_x.sepal_length.replace([5, 4.6], method="bfill").head(10)

 

 

iris.replace(r"^se[a-z]*", "set", regex=True).head()

 

 

iris.head()

 

 

 

iris.replace(regex=r"^se[a-z]*", value="set").head()

 

 

import pandas as pd

 

df = pd.DataFrame({'A':[0,1,2,3,4], 'B':[5,6,7,8,9], 'C':['a','b','c','d','e']})

df

 

 

 

df.replace([0,1,2,3,4], 4)

 

 

df.replace([0,1,2,3], [4,3,2,1])

 

 

df.replace({0:10, 1:100})

 

 

 

df.replace({'A':0, 'B':5}, 100)

 

 

df.replace({'A': {0:100, 4:400}})

 

 

 

df = pd.DataFrame({'A':['bat', 'foo', 'bait'], 'B':['abc', 'bar', 'xyz']})

df

 

 

 

df.replace(r"^ba.$", 'new', regex=True)

 

 

df.replace({'A':r"^ba.$"}, {'A':'new'}, regex=True)

 

 

 

df.replace(regex=r"^ba.$", value="new")

 

 

 

 

df.replace(regex={r"^ba.$":'new', 'foo':'xyz'})

 

 

 

df.replace(regex=[r"^ba.$", "foo"], value="new")

 

 

 

 

s = pd.Series([10, 'a','a','b','a'])

s.replace({'a':None})

 

 

s.replace('a',None)

 

 

s.replace(to_replace='a', value=None, method='pad')

 

 

 

iris = sns.load_dataset("iris")

iris_x = iris.iloc[:,:-1]

 

iris_x.where(iris_x > 5).head(10)

 

 

 

 

iris_x.where(iris_x > 5, other=0).head(10)

 

 

 

 

df = pd.DataFrame(np.arange(10).reshape(-1,2), columns=["A","B"])

df

 

 

 

df.where(df%3==0, other = -df)

 

 

np.where(df%3==0, df, -df)

 

 

df.mask(~(df%3==0), -df) # same result as df.where(df%3==0, other=-df)
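where keeps values where the condition is True and replaces the rest; mask does the opposite. So where(m, other) and mask(~m, other) give the same result, and np.where(m, df, other) is the NumPy counterpart. A small check on the same 5x2 frame (m, kept, masked, and via_numpy are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=["A", "B"])
m = df % 3 == 0                       # True where the value is a multiple of 3

kept = df.where(m, -df)               # keep where m is True, otherwise use -df
masked = df.mask(~m, -df)             # replace where ~m is True
via_numpy = np.where(m, df, -df)      # returns a plain ndarray

print(kept)
```

The parentheses matter when writing the condition inline: ~df % 3 == 0 is parsed as ((~df) % 3) == 0, which is a different condition from ~(df % 3 == 0).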

 

 

 

iris_x.astype(int).head()

 

 

 

iris_x.astype({"sepal_length":int, "sepal_width":int}).head()

 

 

# Series: a one-dimensional data structure

 

from pandas import Series, DataFrame

fruits = Series([2500, 3800, 1200, 6000], index=['apple','banana','pear','cherry']) # no columns, only an index

fruits

 

 

fruits.values

 

 

fruits.index

 

 

fruit_dic = {'apple':2500, 'banana':3800, 'pear':1200, 'cherry':6000}

type(fruit_dic)

 

 

fruits = Series(fruit_dic)

type(fruits)

 

fruits

 

fruits.values

 

 

fruits.index

 

 

fruits.drop('banana')

 

fruits.drop('banana', inplace=True)

fruits

 

fruits[:]

 

 

fruits[0:2]

 

 

fruits['apple':'pear']

 

 

 

fruits1 = Series([5,9,10,3], index=['apple', 'banana', 'cherry', 'pear'])

fruit2 = Series([3,2,9,5,10], index=['apple', 'orange', 'banana', 'cherry', 'mango'])

 

fruits1

 

fruit2

 

 

fruits1 + fruit2
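Adding two Series aligns them on their index labels, so labels present in only one Series produce NaN. If missing labels should count as 0 instead, Series.add accepts a fill_value. A short sketch with the same data (s1, s2, total, and total_filled are illustrative names):

```python
import pandas as pd

s1 = pd.Series([5, 9, 10, 3], index=["apple", "banana", "cherry", "pear"])
s2 = pd.Series([3, 2, 9, 5, 10], index=["apple", "orange", "banana", "cherry", "mango"])

total = s1 + s2                         # NaN where a label is missing on either side
total_filled = s1.add(s2, fill_value=0) # treat a missing label as 0

print(total)
print(total_filled)
```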

 

 

fruits = Series([2500, 3800, 1200, 6000], index=['apple', 'banana', 'pear','cherry'])

fruits.sort_values(ascending=False)

 

 

 

fruits.to_frame()

 

 

fruits.to_frame().T

 

 

# matplotlib.org, seaborn.pydata.org

 

https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py

 


 

https://plotnine.readthedocs.io/en/stable/

http://python-visualization.github.io/folium/

https://plotly.com/python/

http://pyecharts.org/#/

 

 

# Matplotlib 

 

https://matplotlib.org/gallery

 


 

 

import matplotlib.pyplot as plt

%matplotlib inline 

%config InlineBackend.figure_format='retina'

 

import matplotlib

matplotlib.__version__

 

 

plt.plot([1,2,3,4])

plt.ylabel('some numbers')

plt.show()

 

 

plt.figure(figsize=(10,8)) # 10 inches wide, 8 inches tall

plt.plot([1,2,3,4])

plt.ylabel('some numbers')

plt.show()

 

 

plt.gcf().set_size_inches(10,8) # resize the current figure

plt.rcParams["figure.figsize"]=(10,8) # default size for figures created afterwards

 

 

 

https://matplotlib.org/gallery/showcase/anatomy.html?highlight=anatomy

 


 

 

 

import numpy as np

x = np.arange(0, 10, 0.01)

plt.subplot(2,1,1)

plt.plot(x, np.sin(x))

plt.show()

 

 

import numpy as np

x = np.arange(0, 10, 0.01)

plt.subplot(2,1,1)

plt.plot(x, np.sin(x))

plt.subplot(223)

plt.plot(x, np.cos(x))

plt.subplot(224)

plt.plot(x, np.sin(x)*np.cos(x))

plt.show()

 

 

plt.plot(x, np.sin(x)*np.cos(x))

plt.show()

 

 

plt.plot(x, np.sin(x)*np.cos(x))

 

 

 

fig, axes = plt.subplots(nrows=2, ncols=2) # figure (canvas) and axes objects

 

fig # figure (canvas) object

 

 

 

axes # axes objects

 

 

axes[0,0].plot(x, np.sin(x))

 

 

 

 

fig, axes = plt.subplots(nrows=2, ncols=2)

axes[0,0].plot(x, np.sin(x))

plt.show()

 

 

fig, axes = plt.subplots(nrows=2, ncols=2)

axes[0,0].plot(x, np.sin(x))

axes[0,1].plot(x, np.cos(x))

axes[1,0].plot(x, np.tanh(x))

axes[1,1].plot(x, np.sin(x)*np.cos(x))

plt.show()

 

 

 

 

fig, axes = plt.subplots(nrows=4)

 

 

axes

 

 

fig, axes = plt.subplots(nrows=4)

for i, ax in enumerate(axes):

  ax.plot(x, np.sin(x))

plt.show()

 

 

 

fig, axes = plt.subplots(nrows=4)

for i, ax in enumerate(axes.flat):

  ax.plot(x, np.sin(x))

plt.show()

 

 

 

 

def sin_cos(x):

  return np.sin(x)*np.cos(x)

 

func_list = [np.sin, np.cos, np.tanh, sin_cos]

 

fig, axes = plt.subplots(nrows=4)

for i, ax in enumerate(axes.flat):

  ax.plot(x, func_list[i](x))

plt.show()

 

 

 

def sin_cos(x):

  return np.sin(x)*np.cos(x)

 

func_list = [np.sin, np.cos, np.tanh, sin_cos]

 

fig, axes = plt.subplots(ncols=4)

for i, ax in enumerate(axes.flat):

  ax.plot(x, func_list[i](x))

plt.show()

 

 

def sin_cos(x):

  return np.sin(x)*np.cos(x)

 

func_list = [np.sin, np.cos, np.tanh, sin_cos]

 

fig, axes = plt.subplots(nrows=2, ncols=2)

for i, ax in enumerate(axes.flat):

  ax.plot(x, func_list[i](x))

plt.show()

 

 

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html

 


 

 

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

fig.suptitle("Figure Sample Plots")

 

 

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

fig.suptitle("Figure Sample Plots")

axes[0,0].plot([1,2,3,4], 'ro-')

plt.show()

 

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

fig.suptitle("Figure Sample Plots")

axes[0,0].plot([1,2,3,4], 'ro--')

plt.show()

 

 

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline 

%config InlineBackend.figure_format='retina'

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

fig.suptitle("Figure Sample Plots")

axes[0,0].plot([1,2,3,4], 'ro-')

axes[0,1].plot(np.random.randn(4,10), np.random.randn(4,10), 'cs-.')

plt.show()

 

 

 

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline 

%config InlineBackend.figure_format='retina'

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

fig.suptitle("Figure Sample Plots")

axes[0,0].plot([1,2,3,4], 'ro-')

axes[0,1].plot(np.random.randn(4,10), np.random.randn(4,10), 'cs-.')

 

axes[1,0].plot(np.linspace(0,5), np.cos(np.linspace(0,5)))

plt.show()

 

 

 

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline 

%config InlineBackend.figure_format='retina'

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

fig.suptitle("Figure Sample Plots")

axes[0,0].plot([1,2,3,4], 'ro-')

axes[0,1].plot(np.random.randn(4,10), np.random.randn(4,10), 'cs-.')

 

axes[1,0].plot(np.linspace(0,5), np.cos(np.linspace(0,5)))

 

axes[1,1].plot([3,6], [3,5], 'b^:')

 

plt.show()

 

 

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline 

%config InlineBackend.figure_format='retina'

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

fig.suptitle("Figure Sample Plots")

axes[0,0].plot([1,2,3,4], 'ro-')

axes[0,1].plot(np.random.randn(4,10), np.random.randn(4,10), 'cs-.')

 

axes[1,0].plot(np.linspace(0,5), np.cos(np.linspace(0,5)))

 

axes[1,1].plot([3,6], [3,5], 'b^:')

axes[1,1].plot([4,5], [5,4], 'kx--')

plt.show()

 

 

import seaborn as sns

iris = sns.load_dataset("iris")

plt.scatter(x=iris.petal_length, y=iris.petal_width, 

            s=iris.sepal_length*10, c=iris.sepal_width, 

            alpha=0.5)

plt.show()

 

 

np.random.seed(7902)

N = 50

x = np.random.rand(50)

y = np.random.rand(N)

colors = np.random.rand(N)

area = (30*np.random.rand(N))**2

plt.scatter(x, y, s=area, c=colors, alpha=0.5)

plt.show()

 

 

 

 

plt.bar([1,2,3], [3,4,5])

plt.show()

 

 

 

plt.barh([1,3,5], [1,2,3])

plt.show()

 

 

 

 

plt.axvline(0.6)

plt.show()

 

 

 

plt.axhline(0.4)

plt.show()

 

 

plt.hist(iris.sepal_length, bins=10, color='r')

plt.show()

 

 

 

fig, axes = plt.subplots(2,2)

axes[0,0].hist(iris.sepal_length, bins=10, color='r')

axes[0,1].hist(iris.sepal_length, bins=15, cumulative=True)

axes[1,0].hist(iris.sepal_length, bins=5, orientation='horizontal')

axes[1,1].hist(iris.sepal_length, bins=10, histtype='step')

plt.show()

 

 

 

 

plt.boxplot(iris.sepal_length)

plt.show()

 

 

 

plt.violinplot(iris.sepal_length)

plt.show()

 

x = np.linspace(0, 10, 50)

y = np.cos(x)

plt.fill(x, y, c='blue')

plt.show()

 

 

 

plt.fill_between(x, y, color='red')

plt.show()

 

 

plt.fill_betweenx(x, y, color='red')

plt.show()

 

 

# Visualization with Matplotlib 2/2: customizing graphs

 

 

import numpy as np

import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

y = np.cos(x)

 

plt.plot(x, y)

plt.show()

 

 

 

plt.plot(x, y, linewidth=2, color="#FF00FF")

plt.show()

 

 

 

 

 

plt.plot(x, y, linestyle="dotted", linewidth=7, color="purple")

 

 

 

plt.plot(x, y, ls="--", c="r", lw=3)

plt.show()

 

 

 

plt.plot(x, y, 'b^:')

plt.show()

 

 

 

 

plt.plot(x, y, 'b.')

plt.show()

 

 

 

fig, axes = plt.subplots(1,2)

axes[0].scatter(x, y, marker=".")

axes[0].text(2, 0, "Example Graph", style="italic")

plt.show()

 

 

fig, axes = plt.subplots(1,2)

axes[0].scatter(x, y, marker=".")

axes[0].text(2, 0, "Example Graph", style="italic")

axes[1].scatter(x, y, marker="*")

axes[1].annotate("Sine", xy=(5,0.5), xytext=(2,0.75), 

                 arrowprops=dict(arrowstyle="->", connectionstyle="angle3"))

plt.show()

 

 

https://matplotlib.org/tutorials/text/annotations.html#sphx-glr-tutorials-text-annotations-py

 


 

plt.scatter(x, y, marker=".")

plt.text(3, 0.5, r"$\sum_{i=0}^\infty X_i$", fontsize=20)

plt.show()

 

 

 

https://matplotlib.org/3.1.0/api/axis_api.html

 


 

 

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

axes[0,0].scatter(x, y, marker=".")

axes[0,0].set(title="An Example Axes", ylabel="Y-Axis",

              xlabel="X-Axis")

plt.show()

 

# 23:03 Visualization with Matplotlib 2/2: customizing graphs

 

 

import numpy as np

import matplotlib.pyplot as plt

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

axes[0,0].scatter(x, y, marker=".")

axes[0,0].set(title="An Example Axes", ylabel="Y-Axis",

              xlabel="X-Axis")

 

axes[0,1].scatter(x, y, marker='^', c='r')

axes[0,1].set_xlim(0, 5)

axes[0,1].set_ylim(-2, 2)

axes[0,1].set_xlabel("X 0-5")

axes[0,1].set_ylabel("Y -2~2")

 

plt.show()

 

 

import numpy as np

import matplotlib.pyplot as plt

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

axes[0,0].scatter(x, y, marker=".")

axes[0,0].set(title="An Example Axes", ylabel="Y-Axis",

              xlabel="X-Axis")

 

axes[0,1].scatter(x, y, marker='^', c='r')

axes[0,1].set_xlim(0, 5)

axes[0,1].set_ylim(-2, 2)

axes[0,1].set_xlabel("X 0-5")

axes[0,1].set_ylabel("Y -2~2")

 

axes[1,0].scatter(x, y, marker='v', c='b')

axes[1,0].set_xticks(range(1,8,2))

axes[1,0].set_xticklabels([3,100,-12,'foo'])

axes[1,0].set_yticks([-2,0,1,2])

axes[1,0].set_yticklabels([-200,0,100,"Max"])

 

plt.show()

 

 

 

 

 

import numpy as np

import matplotlib.pyplot as plt

 

fig, axes = plt.subplots(2,2, figsize=(8,5))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

axes[0,0].scatter(x, y, marker=".")

axes[0,0].set(title="An Example Axes", ylabel="Y-Axis",

              xlabel="X-Axis")

 

axes[0,1].scatter(x, y, marker='^', c='r')

axes[0,1].set_xlim(0, 5)

axes[0,1].set_ylim(-2, 2)

axes[0,1].set_xlabel("X 0-5")

axes[0,1].set_ylabel("Y -2~2")

 

axes[1,0].scatter(x, y, marker='v', c='b')

axes[1,0].set_xticks(range(1,8,2))

axes[1,0].set_xticklabels([3,100,-12,'foo'])

axes[1,0].set_yticks([-2,0,1,2])

axes[1,0].set_yticklabels([-200,0,100,"Max"])

 

axes[1,1].scatter(x, y, marker=',', c='c')

axes[1,1].set(xticks=range(1,8,2),   # same effect as the set_xticks(range(1,8,2)) call on axes[1,0]

              xticklabels=[3,100,-12,'foo'],

              yticks=[-2,0,1,2],

              yticklabels=[-200,0,100,"Max"])

axes[1,1].spines["top"].set_visible(False)

axes[1,1].spines["bottom"].set_position(("outward", 10))

axes[1,1].grid(True)

 

plt.show()

 

 

 

axes[1,1].scatter(x, y, marker=',', c='c')

axes[1,1].set(xticks=range(1,8,2),   # same effect as the set_xticks(range(1,8,2)) call on axes[1,0]

              xticklabels=[3,100,-12,'foo'],

              # yticks=[-2,0,1,2],

              yticklabels=[-200,0,100,"Max"])

axes[1,1].spines["top"].set_visible(False)

axes[1,1].spines["bottom"].set_position(("outward", 10))

axes[1,1].grid(True)

 

plt.show()

 

 

 

 

 

 

x = np.arange(0,10)

y1 = 0.5 * x**2

y2 = -1*y1

fig, ax1 = plt.subplots()

ax1.plot(x, y1, "g^:")  # green, triangle markers, dotted line

ax1.set_xlabel("X data")

ax1.set_ylabel("Y1 data", color="g")

plt.show()

 

 

 

 

x = np.arange(0,10)

y1 = 0.5 * x**2

y2 = -1*y1

fig, ax1 = plt.subplots()

ax1.plot(x, y1, "g^:")  # green, triangle markers, dotted line

ax1.set_xlabel("X data")

ax1.set_ylabel("Y1 data", color="g")

 

ax2 = ax1.twinx()  # second y-axis that shares the same x-axis

ax2.plot(x, y2, 'bv--')

ax2.set_ylabel("Y2 data", color="b")

 

plt.show()

 

 

x = np.arange(0,10)

y1 = 0.5 * x**2

y2 = -1*y1

fig, ax1 = plt.subplots()

ax1.plot(x, y1, "g^:")  # green, triangle markers, dotted line

ax1.set_xlabel("X data")

ax1.set_ylabel("Y1 data", color="g")

 

ax2 = ax1.twinx()  # second y-axis that shares the same x-axis

ax2.plot(x, y2, 'bv--')

ax2.set_ylabel("Y2 data", color="b")

 

ax3 = ax1.twiny()

ax3.plot(-x, y1, 'ro-.')

ax3.set_xlabel('-x data', color="r")

 

plt.show()

 

 

 

 

 

fig, axes = plt.subplots(1,2, figsize=(8,3))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

plt.suptitle("Main Title")

plt.show()

 

 

 

fig, axes = plt.subplots(1,2, figsize=(8,3))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

plt.suptitle("Main Title")

 

axes[0].set_title("Title 1")

axes[0].set_xlabel("W")

 

plt.show()

 

 

fig, axes = plt.subplots(1,2, figsize=(8,3))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

plt.suptitle("Main Title")

 

axes[0].set_title("Title 1")

axes[0].set_xlabel("W")

 

axes[1].set(title="Title 2")

 

plt.show()

 

 

 

 

fig, axes = plt.subplots(1,2, figsize=(8,3))

plt.subplots_adjust(hspace=0.4, wspace=0.3)

plt.suptitle("Main Title")

 

axes[0].set_title("Title 1")

axes[0].set_xlabel("W")

 

axes[1].set_title("Title 2", loc="right")

axes[1].set_xlabel("X")

 

plt.show()

 

 

 

 

plt.style.use("default")

fig, ax1 = plt.subplots()

ax1.plot(x, y1, "g^:", label="GREEN")  # green, triangle markers, dotted line

ax1.plot(x, y2, "bv-", label="BLUE")

ax1.set_xlabel("X data")

ax1.set_ylabel("Y1 data", color="g")

ax1.legend() 

plt.show()

 

 

 

plt.style.use("default")

fig, ax1 = plt.subplots()

ax1.plot(x, y1, "g^:")

ax1.plot(x, y2, "bv-")

ax1.set_xlabel("X data")

ax1.set_ylabel("Y1 data", color="g")

ax1.legend(labels=["GREEN", "BLUE"])

plt.show()

 

 

import numpy as np

x = np.arange(0, 10)

y = 0.5 * x**2

 

plt.style.use('default')

fig, ax1 = plt.subplots(figsize=(8,4))

ax1.plot(x, y, 'g^:')

ax1.set_xlabel('X data')

ax1.set_ylabel('Y data', color='g')

 

ax2 = ax1.twinx()

ax2.plot(x, -y, 'bv-.')

ax2.set_ylabel('-Y data', color='b')

 

 

import numpy as np

x = np.arange(0, 10)

y = 0.5 * x**2

 

plt.style.use('default')

fig, ax1 = plt.subplots(figsize=(8,4))

ax1.plot(x, y, 'g^:')

ax1.set_xlabel('X data')

ax1.set_ylabel('Y data', color='g')

 

ax2 = ax1.twinx()

ax2.plot(x, -y, 'bv-.')

ax2.set_ylabel('-Y data', color='b')

 

import matplotlib.patches as mpatches

red_patch = mpatches.Patch(color='red', label='RED')

green_patch = mpatches.Patch(color='green', label='GREEN')

blue_patch = mpatches.Patch(color='blue', label='BLUE')

ax1.legend(handles=[red_patch, green_patch, blue_patch])

 

 

 

import numpy as np

x = np.arange(0, 10)

y = 0.5 * x**2

 

plt.style.use('default')

fig, ax1 = plt.subplots(figsize=(8,4))

ax1.plot(x, y, 'g^:')

ax1.set_xlabel('X data')

ax1.set_ylabel('Y data', color='g')

 

ax2 = ax1.twinx()

ax2.plot(x, -y, 'bv-.')

ax2.set_ylabel('-Y data', color='b')

 

import matplotlib.patches as mpatches

red_patch = mpatches.Patch(color='red', label='RED')

green_patch = mpatches.Patch(color='green', label='GREEN')

blue_patch = mpatches.Patch(color='blue', label='BLUE')

ax1.legend(handles=[red_patch, green_patch, blue_patch])

 

import matplotlib.lines as mlines

dot_line = mlines.Line2D([], [], color='green',

                         marker='^', markersize=5,

                         linestyle=":", linewidth=2,

                         label='Dot Line')

dash_line = mlines.Line2D([], [], color='red',

                          marker='o', markersize=5,

                          linestyle="--", linewidth=2,

                          label='Dash Line')

dash_dot_line = mlines.Line2D([], [], color='blue',

                              marker='v', markersize=5,

                              linestyle="-.", linewidth=2,

                              label='Dash Dot Line')

ax2.legend(handles=[dot_line, dash_line, dash_dot_line],

           loc='lower right', ncol=3, borderaxespad=3.,

           mode="expand")

 

 

 

 

plt.style.use('ggplot')

 

fig, ax1 = plt.subplots()

ax1.plot(x, y1, "g^:")

ax1.plot(x, y2, "bv-")

ax1.set_xlabel("X data")

ax1.set_ylabel("Y1 data", color="g")

ax1.legend(labels=["GREEN", "BLUE"])

plt.show()

 

 

 

plt.style.available

 

 

 

 

plt.style.use('default')

 

fig, ax1 = plt.subplots()

ax1.plot(x, y1, "g^:")

ax1.plot(x, y2, "bv-")

ax1.set_xlabel("X data")

ax1.set_ylabel("Y1 data", color="g")

ax1.legend(labels=["GREEN", "BLUE"])

plt.show()

 

 

 

plt.rc("lines", ls="-.", lw=20)

plt.plot([1,2,3,4,5])

plt.show()

 

 

 

 

import matplotlib as mpl

plt.rc("lines", ls="-.", lw=20)

plt.rc("axes", prop_cycle=mpl.cycler(color=['g']))

plt.plot([1,2,3,4,5])

plt.show()

 

 

plt.rcParams["lines.linestyle"] = "-."

plt.rcParams["lines.linewidth"] = 20

plt.rcParams["axes.prop_cycle"] = mpl.cycler(color=["g"])

plt.plot([1,2,3,4,5])

plt.show()

 

 

 

 

# Property cycle (cycler)

from cycler import cycler

my_cycler = (cycler('color',

                    ['r','g','b','c','m','y','k'])+

             cycler(linestyle=['-','--',':','-.','-','--',':'])+

             cycler(lw=np.linspace(5,20,7)))

plt.rcParams["axes.prop_cycle"] = my_cycler

plt.plot([1,2], [1,1])

plt.plot([1,2], [2,2])

plt.plot([1,2], [3,3])

plt.plot([1,2], [4,4])

plt.plot([1,2], [5,5])

plt.plot([1,2], [6,6])

plt.plot([1,2], [7,7])

plt.plot([1,2], [8,8])

plt.show()
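
The block above plots eight lines against a cycle that has only seven entries. As a rough sketch of what happens, assuming that adding cyclers of equal length pairs their entries position by position and that the cycle restarts once it is exhausted, the same pairing can be imitated with plain `zip` and `itertools.cycle` (a stand-in for the `cycler` package, not its API):

```python
from itertools import cycle, islice

colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']
linestyles = ['-', '--', ':', '-.', '-', '--', ':']
linewidths = [5 + 2.5 * i for i in range(7)]  # same values as np.linspace(5, 20, 7)

# Pair the three property lists position by position, one dict per style.
styles = [
    {"color": c, "linestyle": ls, "lw": lw}
    for c, ls, lw in zip(colors, linestyles, linewidths)
]

# Eight plotted lines draw eight styles from a seven-entry cycle.
used = list(islice(cycle(styles), 8))
for i, style in enumerate(used, start=1):
    print(i, style)
```

The eighth style equals the first, which is why the last line in the plot looks like the first one.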

 

 

 

 

 

# Saving the figure

from cycler import cycler

my_cycler = (cycler('color',

                    ['r','g','b','c','m','y','k'])+

             cycler(linestyle=['-','--',':','-.','-','--',':'])+

             cycler(lw=np.linspace(5,20,7)))

plt.rcParams["axes.prop_cycle"] = my_cycler

plt.plot([1,2], [1,1])

plt.plot([1,2], [2,2])

plt.plot([1,2], [3,3])

plt.plot([1,2], [4,4])

plt.plot([1,2], [5,5])

plt.plot([1,2], [6,6])

plt.plot([1,2], [7,7])

plt.plot([1,2], [8,8])

plt.savefig("poo.png", transparent=True)

 

 

 

 

 

# Visualization with Seaborn 1/2

 

http://seaborn.pydata.org/

 


 

http://seaborn.pydata.org/api.html

 

 

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

iris = sns.load_dataset("iris")

iris.head()

 

titanic = sns.load_dataset("titanic")

titanic.head()

 

 

iris.describe()

 

 

titanic.describe()

 

 

 

titanic.describe(include='all')

 

 

 

 

 

 

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(5,6))

plt.show()

 

 

plt.style.available

 

 

 

 

 

sns.set(style="darkgrid")

sns.set_context("notebook", font_scale=1.5,

                rc={"lines.linewidth":2.5})

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris)

plt.show()

 

 

 

 

sns.set_style("whitegrid")

sns.set_context("notebook", font_scale=1.5,

                rc={"lines.linewidth":2.5})

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris)

plt.show()

 

 

 

# 9:38 Visualization with Seaborn 1/2

 

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

 

iris = sns.load_dataset("iris")

 

sns.set(style="white")

sns.set_context("notebook", font_scale=1.5,

                rc={"lines.linewidth":3.5})

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris)

sns.lineplot(x="petal_length", y="petal_width", data=iris)

 

 

 

 

sns.set_palette("dark", 3)

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris, hue="species")

plt.show()

 

 

sns.set()

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris)

 

 

 

sns.set()

_ = sns.scatterplot(x="petal_length", y="petal_width",

                data=iris)

 

 

sns.set()

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris)

plt.show()

 

 

 

 

 

sns.scatterplot(x="petal_length", y="petal_width",

               hue="species", style="species",

               data=iris)

plt.show()

 

 

sns.lineplot(x="petal_length", y="petal_width",

               data=iris)

plt.show()

 

 

 

 

 

sns.lineplot(x="petal_length", y="petal_width",

             hue="species", style="species",

             data=iris)

plt.show()

 

 

 

sns.lineplot(x="petal_length", y="petal_width",

             hue="species", style="species",

             markers=True, dashes=False,

             data=iris)

plt.show()

 

 

 

 

fig, axes = plt.subplots(ncols=2, figsize=(8,5))

plt.subplots_adjust(wspace=0.3)

 

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris, ax=axes[0])

 

sns.lineplot(x="petal_length", y="petal_width",

                data=iris, ax=axes[1])

plt.show()

 

 

 

sns.scatterplot(x="petal_length", y="petal_width",

                data=iris)

 

sns.lineplot(x="petal_length", y="petal_width",

                data=iris)

 

plt.show()

 

 

 

 

sns.relplot(x="petal_length", y="petal_width",

            col="species", data=iris)

plt.show()

 

 

 

 

 

 

sns.stripplot(x="species", y="petal_length", data=iris)

plt.show()

 

 

titanic = sns.load_dataset("titanic")

sns.barplot(x="sex", y="survived", hue="class",

            data=titanic)

plt.show()

 

 

 

sns.countplot(x="deck", data=titanic)

plt.show()

 

 

 

 

sns.pointplot(x="class", y="survived", hue="sex",

              data=titanic,

              palette={"male":"g", "female":"m"},

              markers=["^","o"], linestyles=["-","--"])

plt.show()

 

 

sns.boxplot(x="alive", y="age", hue="adult_male",

            data=titanic)

plt.show()

 

 

 

 

sns.violinplot(x="age", y="sex", hue="survived",

               data=titanic)

plt.show()

 

 

 

 

 

sns.jointplot(x="petal_width", y="petal_length", data=iris,

              kind="kde", color="g")

plt.show()

 

 

 

sns.jointplot(x="petal_width", y="petal_length", data=iris,

              kind="scatter", color="g")

plt.show()

 

 

 

 

sns.jointplot(x="petal_width", y="petal_length", data=iris,

              kind="scatter", color="g")

 

 

 

g = sns.jointplot(x="petal_width", y="petal_length", data=iris,

              kind="scatter", color="g")

g.plot_joint(sns.kdeplot, color="c")

plt.show()

 

 

 

sns.pairplot(iris, hue="species", palette="husl",

             markers=["o", "s", "D"])

 

 

 

 

 

 

import numpy as np

np.random.seed(0)

x = np.random.randn(100)

from scipy.stats import norm

sns.distplot(x, fit=norm, kde=False)

plt.show()

 

 

 

from scipy.stats import norm

sns.distplot(x, kde=False)

plt.show()

 

 

from scipy.stats import norm

sns.distplot(x, kde=True)

plt.show()

 

 

from scipy.stats import norm

sns.distplot(x, fit=norm, kde=True)

plt.show()

 

 

 

# Visualization with Seaborn 2/2

 

 

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

 

iris = sns.load_dataset("iris")

 

sns.lmplot(x="sepal_width", y="sepal_length", data=iris,

           hue="species")

plt.show()

 

 

 

 

sns.lmplot(x="sepal_width", y="sepal_length", data=iris)

 

 

sns.regplot(x="sepal_width", y="sepal_length", data=iris)

 

 

sns.regplot(x="petal_length", y="petal_width", data=iris)

 

 

sns.clustermap(iris.iloc[:, :-1])

plt.show()

 

 

 

 

 

species = iris.pop("species")

iris.head()

 

sns.clustermap(iris)

plt.show()

 

 

 

 

 

iris.corr()  # correlation coefficients of the iris columns

 

 

sns.heatmap(iris.corr(), vmin=-1, vmax=1)

plt.show()

 

 

 

 

sns.heatmap(iris.corr(), vmin=-1, vmax=1, annot=True)

plt.show()

 

 

 

 

 

sns.heatmap(iris.corr(), vmin=-1, vmax=1, annot=True,

            cmap="cool_r")

plt.show()

 

 

 

import numpy as np

mask = np.zeros_like(iris.corr())

mask

 

mask[np.triu_indices_from(mask)]

 

mask[np.triu_indices_from(mask)] = True

 

mask[np.triu_indices_from(mask)]

 

mask

 

 

with sns.axes_style("white"):

  sns.heatmap(iris.corr(), mask=mask, square=True)

  plt.show()
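
The mask steps above are easier to follow on a tiny matrix. A minimal sketch, assuming a made-up 3x3 correlation matrix (the numbers are illustrative, not taken from iris) and a boolean mask instead of the float mask used above:

```python
import numpy as np

# Made-up symmetric matrix standing in for iris.corr()
corr = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])

# Start with nothing masked, then mark the upper triangle (diagonal included).
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
print(mask)

# Cells where the mask is False are the ones a masked heatmap would still draw.
visible = corr[~mask]
print(visible)
```

Only the three values below the diagonal stay visible, so each pair of columns appears once instead of twice.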

 

 

 

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

%config InlineBackend.figure_format="retina"

iris=sns.load_dataset("iris")

 

g = sns.FacetGrid(iris, col="species", hue="species")

g.map(plt.hist, "sepal_length")

plt.show()

 

 

g = sns.FacetGrid(iris, col="species", hue="species")

g.map(plt.hist, "sepal_length", bins=5)

plt.show()

 

 

 

 

g = sns.FacetGrid(iris, col="species", hue="species")

g.map(plt.hist, "sepal_length", bins=5)

g.set_axis_labels("sepal_length", "Count")

plt.show()

 

 

 

g = sns.FacetGrid(iris, col="species", hue="species")

g.map(sns.scatterplot, "petal_width", "petal_length",

      size=iris.sepal_length)

plt.show()

 

 

 

 

titanic = sns.load_dataset("titanic")

g = sns.FacetGrid(titanic, col="survived", row="sex")

g.map(plt.hist, "age")

 

 

g = sns.PairGrid(iris)

g.map(sns.scatterplot)

plt.show()

 

 

 

 

g = sns.PairGrid(iris)

g.map_diag(sns.kdeplot)

g.map_lower(sns.scatterplot)

g.map_upper(sns.regplot)

plt.show()

 

 

 

g = sns.JointGrid(x="petal_length", y="petal_width",

                  data=iris)

g.plot(sns.regplot, sns.distplot)

 

 

 

 

g = sns.JointGrid(x="petal_length", y="petal_width",

                  data=iris)

g.plot_joint(sns.scatterplot)

g.plot_marginals(sns.distplot)

plt.show()

 

 

# BeautifulSoup and parsers

 

https://www.w3schools.com/css/default.asp

 

!pip install requests_file

 

 

 

from requests_file import FileAdapter

import requests

 

s = requests.Session()

s.mount("file://", FileAdapter()) 

res = s.get("file:///sample.html")

res

 

 

 

!pip install beautifulsoup4

 

 

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.content, "html.parser")

soup

 

 

el = soup.select_one("h1")

el

 

 

el.text

 

 

 

div_el = soup.select("div")

div_el

 

 

soup.select_one("div")

 

soup.select("h1, span")

 

soup.select("div b")

 

soup

 

soup.select("div > b")

 

soup.select(".contents")

 

soup.select("div.contents")

 

soup.select("#subject")

 

 

soup.select("#subject")[0]

 

soup.select("[id=subject]")
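
The contents of sample.html are not shown here, so the selector calls above are hard to check by eye. A small sketch with an inline HTML string (hypothetical markup, chosen only so that each selector has something to match) shows how the descendant, child, class, and id selectors differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real sample.html
html = """
<html><body>
  <h1 id="subject">Title</h1>
  <div class="contents">
    <b>direct child</b>
    <p><b>nested</b> text <span>note</span></p>
  </div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# "div b" matches any <b> inside a <div>, at any depth.
descendants = [b.text for b in soup.select("div b")]
# "div > b" matches only <b> elements that are direct children of a <div>.
children = [b.text for b in soup.select("div > b")]
# Class and id selectors.
by_class = soup.select("div.contents")
by_id = soup.select_one("#subject")
# A comma groups selectors; matches come back in document order.
grouped = [el.name for el in soup.select("h1, span")]

print(descendants, children, len(by_class), by_id.text, grouped)
```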

 

 

https://www.w3schools.com/css/css_selectors.asp

 

url = "https://finance.naver.com/marketindex/"

market_index = requests.get(url)

market_index

 

soup = BeautifulSoup(market_index.content, "html.parser")

price = soup.select_one("div.head_info > span.value")

price

 

 

price.text

 

 

 

!pip install requests

 

import requests

requests.get("https://api.github.com")

 

 

response = requests.get("https://api.github.com")

response.content

 

 

response.status_code

 

response

 

 

if response.status_code == 200:

  print("Success")

elif response.status_code == 404:

  print("Not Found")

 

 

if response:

  print("Success")

else:

  print("Error")

 

 

print(response.content)

 

 

response.text

 

 

res = requests.get("http://javaspecialist.co.kr")

res.content

 

 

res.text

 

 

response.json()

 

 

 

response.headers

 

 

 

response.headers["Content-Type"]

 

 

 

requests.get("https://api.github.com/search/repositories",

                        params={'q':'request+language:python'})
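
The `params` dict ends up in the URL as a percent-encoded query string. As a sketch of that step, using the standard library's `urllib.parse.urlencode` as a stand-in (the assumption is that requests encodes the dict in a comparable way; this does not call requests itself):

```python
from urllib.parse import urlencode, parse_qs

params = {"q": "request+language:python"}

# Characters such as "+" and ":" are percent-encoded in the query string.
query = urlencode(params)
url = "https://api.github.com/search/repositories?" + query
print(url)

# Decoding the query string gives back the original value.
decoded = parse_qs(query)
print(decoded)
```

Passing a dict rather than building the URL by hand avoids having to escape these characters manually.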

 

 

json_response = response.json()

json_response

 

 

response = requests.get(

                        "https://api.github.com/search/repositories",

                        params={'q':'request+language:python'},

                        headers={"Accept":"application/vnd.github.v3.text-match+json"})

response.json()

 

 

 

 

requests.post("https://httpbin.org/post",

              data={"key":"value"})

 

 

 

requests.put("https://httpbin.org/put",

              data={"key":"value"})

 

 

 

 

requests.delete("https://httpbin.org/delete",

              data={"key":"value"})

 

 

 

requests.head("https://httpbin.org/get")

 

 

requests.post("https://httpbin.org/post",

              data={"key":"value"})

 

 

 

requests.post("https://httpbin.org/post",

              data=[("key","value"),("key1","value1")])

 

 

 

response = requests.post("https://httpbin.org/post",

                         json={"key":"value"})

json_response = response.json()

json_response

 

 

response.request.headers["content-Type"]

 

 

 

from getpass import getpass

requests.get("https://api.github.com/user",

             auth=("id", getpass()))

 

 

 

 

 

requests.get("https://api.github.com", verify=False)

 

 

 

requests.get("https://api.github.com", timeout=1)

 

 

from requests.exceptions import Timeout

 

try:

  response = requests.get("http://api.github.com",

                        timeout=1)

except Timeout:

  print("Request timed out")

else:

  print("Request succeeded")

 

 

 

with requests.Session() as session:

  session.auth = ('id', getpass())

  response = session.get("https://api.github.com/user")

 

  print(response.headers)

  print(response.json())

 

 

 

from requests.adapters import HTTPAdapter

from requests.exceptions import ConnectionError

github_adapter = HTTPAdapter(max_retries=3)

session = requests.Session()

session.mount("https://api.github.com", github_adapter)

 

try:

  session.get("https://api.github.com")

except ConnectionError as ce:

    print(ce)

 

 

 

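# Text mining overview

# Text preprocessing, count-based word representations, document similarity, topic modeling, association analysis, NLP with deep learning, word embeddings, text classification, tagging, translation



# The NLTK natural language processing package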

# corpus, tokenizing, morphological analysis, POS tagging

 

import nltk

nltk.download()

 

 

import nltk

nltk.download("treebank")

 

 

 

from nltk.corpus import treebank

print(treebank.fileids())

 

 

treebank.sents("wsj_0001.mrg")

 

 

 

wsj_0001 = treebank.sents("wsj_0001.mrg")

for line in wsj_0001:

  print(' '.join(line))

 

 

treebank.tagged_words("wsj_0001.mrg")

 

 

print(treebank.parsed_sents("wsj_0001.mrg")[0])

 

 

nltk.download("book", quiet=True)  # suppress log output

 

from nltk.book import *

 

 

 

type(text1)

 

 

text1

 

 

nltk.corpus.gutenberg.fileids()

 

 

emma = nltk.corpus.gutenberg.raw("austen-emma.txt")

print(emma[:200])

 

 

from nltk.tokenize import sent_tokenize

print(sent_tokenize(emma[:1000])[3])

 

 

 

 

 

from nltk.tokenize import word_tokenize

word_tokenize(emma[50:100])

 

 

 

 

from nltk.tokenize import RegexpTokenizer

ret = RegexpTokenizer(r"[\w]+")  # one or more word characters

ret.tokenize(emma[50:100])

 

 

# Stemming

words = ["sending", "cooking", "files", "lives", "crying", "dying"]

 

from nltk.stem import PorterStemmer

pst = PorterStemmer()

 

pst.stem(words[0])

 

 

[pst.stem(w) for w in words]

 

 

from nltk.stem import LancasterStemmer

lst = LancasterStemmer()

[lst.stem(w) for w in words]

 

 

from nltk.stem.regexp import RegexpStemmer

rest = RegexpStemmer('ing')

[rest.stem(w) for w in words]
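
As far as I know, `RegexpStemmer('ing')` deletes every match of the pattern, not only a trailing one. The sketch below imitates that with `re.sub` (an assumption about the stemmer's behavior, not a call into NLTK) and compares the unanchored pattern with `'ing$'`. The extra word "singing" is added here only to make the difference visible:

```python
import re

words = ["sending", "cooking", "files", "lives", "crying", "dying", "singing"]

def strip_pattern(pattern, word):
    """Delete every match of pattern from word."""
    return re.sub(pattern, "", word)

# Unanchored: removes "ing" wherever it occurs.
unanchored = [strip_pattern("ing", w) for w in words]
# Anchored: removes "ing" only at the end of the word.
anchored = [strip_pattern("ing$", w) for w in words]

print(unanchored)
print(anchored)
```

With the unanchored pattern "singing" collapses to "s", so anchoring the suffix with `$` is usually the safer choice.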

 

 

words2 = ['enviar', 'cocina', 'moscas', 'vidas', 'llorar', 'morir']

from nltk.stem.snowball import SnowballStemmer

sbst = SnowballStemmer('spanish')

[sbst.stem(w) for w in words2]

 

 

 

 

# Lemmatization

words3 = ["cooking", "believes"]

from nltk.stem.wordnet import WordNetLemmatizer

wl = WordNetLemmatizer()

[wl.lemmatize(w) for w in words]

 

 

 

[wl.lemmatize(w, pos='v') for w in words3]

 

 

# POS tagging

nltk.help.upenn_tagset('NNP')

 

 

nltk.help.upenn_tagset()

 

 

 

 

sentense = emma[50:289]

print(sentense)

 

 

 

from nltk.tag import pos_tag

tagged_list = pos_tag(word_tokenize(sentense))

print(tagged_list)

 

 

 

nouns_list = [t[0] for t in tagged_list if t[1] == "NN"]  # nouns tagged exactly NN

nouns_list

 

 

import re

pattern = re.compile('NN?')  # matches tags that start with N, i.e. the noun tags NN, NNS, NNP, NNPS

nouns_list = [t[0] for t in tagged_list if pattern.match(t[1])]

nouns_list
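
The difference between the exact tag comparison and the regex filter can be checked without running the tagger. In this sketch the tagged pairs are hand-written for illustration (not `pos_tag` output), and the pattern is simplified to `"NN"`:

```python
import re

# Hand-written (word, tag) pairs, for illustration only
tagged = [("Emma", "NNP"), ("was", "VBD"), ("a", "DT"), ("handsome", "JJ"),
          ("girl", "NN"), ("with", "IN"), ("friends", "NNS")]

# Exact comparison keeps only singular common nouns.
exact = [word for word, tag in tagged if tag == "NN"]

# re.match anchors at the start, so this keeps every tag beginning with NN.
pattern = re.compile("NN")
prefixed = [word for word, tag in tagged if pattern.match(tag)]

print(exact)
print(prefixed)
```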

 

tagged_list

 

 

from nltk.tag import untag

untag(tagged_list)

 

 

["/".join(p) for p in tagged_list]

 

 

ret = RegexpTokenizer("[\w]{2,}")

from nltk import Text

emma_text = Text(ret.tokenize(emma))

emma_text.plot(20)

 

 

 

emma_text.concordance('Emma', lines=5)

 

 

emma_text.similar("general")

 

emma_text.similar("general", 10)

 

 

 

emma_text.common_contexts(["general","strong"])

 

 

emma_text.dispersion_plot(["Emma","Knightley","Frank","Jane","Harriet","Robert"])

 

len(emma_text)  # number of tokens

 

len(set(emma_text))  # number of distinct tokens

 

 

 

 

len(set(emma_text)) / len(emma_text)  # lexical richness
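
Lexical richness here is simply the number of distinct tokens divided by the total number of tokens. A small worked example on a made-up sentence, using `re.findall` in place of the NLTK tokenizer:

```python
import re

# Made-up sentence; any token list works the same way
text = "the cat sat on the mat and the dog sat on the log"
tokens = re.findall(r"\w+", text)

total = len(tokens)          # every token counts
distinct = len(set(tokens))  # repeated words count once
richness = distinct / total

print(total, distinct, round(richness, 3))
```

A long novel repeats common words heavily, so its ratio comes out much lower than for a short sentence like this one.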

 

 

 

 

emma_fd = emma_text.vocab()

type(emma_fd)

 

# Pick out person names (NNP tags) from the emma corpus and count how often each occurs

from nltk import FreqDist

stopwords = ["Mr.", "Mrs", "Miss", "Mr", "Mrs", "Dear"]  # stopwords

emma_tokens = pos_tag(emma_text)

names_list = [t[0] for t in emma_tokens

              if t[1] == "NNP" and t[0] not in stopwords]

emma_fd_names = FreqDist(names_list)

emma_fd_names

 

 

emma_fd_names.most_common(5)
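
`FreqDist` is a subclass of `collections.Counter`, so the counting step can be tried out with the standard library alone. A minimal sketch with a made-up list of names (not the real tagger output):

```python
from collections import Counter

# Made-up name tokens standing in for the NNP words found in the corpus
names = ["Emma", "Harriet", "Emma", "Weston", "Emma", "Harriet", "Mrs"]
stopwords = {"Mr", "Mrs", "Miss", "Dear"}

# Count every name that is not a stopword.
counts = Counter(name for name in names if name not in stopwords)
top2 = counts.most_common(2)

print(top2)
```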

 

# Korean morphological analysis; a morpheme is the smallest unit that carries meaning

# KoNLPy: NLP toolkit for Python, http://konlpy.readthedocs.org, http://konlpy.org, https://github.com/konlpy/konlpy

# KOMORAN: morphological analyzer written in Java, https://shineware.tistory.com/tag/KOMORAN/

# HanNanum: morphological analyzer written in Java, http://semanticweb.kaist.ac.kr/home/index.php/HanNanum

# KoNLP: NLP toolkit for R, https://github.com/haven-jeon/KoNLP

 

 

!pip install konlpy

 

 

 

http://jdk.java.net/

 

# POS tags

https://konlpy-ko.readthedocs.io/ko/v0.4.3/morph/#comparison-between-pos-tagging-classes

 

https://konlpy.org/en/latest/morph/#pos-tagging-with-konlpy

https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0

 

 


 

from konlpy.tag import Hannanum

 

text = """아름답지만 다소 복잡하기도한 한국어는 

전세계에서 13번째로 많이 사용되는 언어입니다."""

 

han = Hannanum()

han.analyze(text)

 

 

han.morphs(text)

 

 

han.nouns(text) 

 

 

 

han.pos(text)

 

 

han.pos(text, ntags=22)  # default is 9

 

 

from konlpy.tag import Kkma

Kkma = Kkma()

print(Kkma.morphs(text))

 

 

print(Kkma.nouns(text))  # extract nouns

 

 

print(Kkma.pos(text))

 

 

 

from konlpy.tag import Komoran

kom = Komoran()

kom.morphs(text)

 

 

 

print(kom.nouns(text))  # extract nouns only

 

 

 

print(kom.pos(text))

 

 

 

from konlpy.corpus import kolaw

c = kolaw.open("constitution.txt").read()

print(c[:100])

 

 

 

 

 

from konlpy.corpus import kobill

d = kobill.open('1809890.txt').read()

print(d[150:300])

 

 

# Word cloud

 

!pip install wordcloud

 

 

from konlpy.corpus import kolaw

data = kolaw.open("constitution.txt").read()

from konlpy.tag import Komoran

komoran = Komoran()

print(komoran.nouns("%r"%data[0:1000]))

 

 

 

word_list = komoran.nouns("%r"%data[0:1000])

text = ' '.join(word_list)

text

 

 

import matplotlib.pyplot as plt

%matplotlib inline

from wordcloud import WordCloud

wordc = WordCloud()

wordc.generate(text)  # text to build the cloud from

 

 

plt.imshow(wordc)

plt.show()

 

 

 

wordc = WordCloud(background_color="white", max_words=20,

                  relative_scaling=0.2,

                  font_path='/H2PORL.TTF')

wordc.generate(text)

plt.figure()

plt.imshow(wordc)

plt.axis("off")

 

 

 

 

word_list = komoran.nouns("%r"%data)

text = ' '.join(word_list)

wordcloud = WordCloud(background_color="white",

                      max_words=2000,

                      relative_scaling=0.2,

                      font_path='/H2PORL.TTF')

wordcloud.generate(text)

plt.figure(figsize=(15,10))

plt.imshow(wordcloud)

plt.axis("off")

plt.show()

 

 

 

# Stopwords

stow_w = set(["국가","대통령"])

wordcloud = WordCloud(background_color="white",

                      max_words=2000,

                      stopwords=stow_w,

                      relative_scaling=0.2,

                      font_path="/H2PORL.TTF")

wordcloud.generate(text)

plt.figure(figsize=(15,10))  # size in inches

plt.imshow(wordcloud)

plt.axis("off")

plt.show()

 

 

 

 

 

http://coderby.com/img

 

 

import numpy as np

from PIL import Image

img = Image.open("/south_korea.png").convert("RGBA")

mask_ar = np.array(img)

wordcloud = WordCloud(background_color="white",

                      max_words=2000,

                      font_path="/H2PORL.TTF",

                      mask=mask_ar, random_state=42)

wordcloud.generate(text)

wordcloud.to_file("/result1.png")

 

 

 

 

import random

def grey_color(*args, **kwargs):

  return 'hsl(40, 100%%, %d%%)'% random.randint(50,100)

 

wordcloud.generate(text)

wordcloud.recolor(color_func=grey_color)

wordcloud.to_file("/result2.png")

 

 

 

import nltk

import matplotlib.font_manager as fm

plt.figure(figsize=(12,6))

font_path = "/H2PORL.TTF"

font_name = fm.FontProperties(fname=font_path).get_name()

plt.rc("font", family=font_name)

nltk.Text(word_list).plot(50)

 

 

 



Natural Language Processing with Deep Learning

푸닥거리 2020. 7. 4. 12:45

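Deep learning

 Nonlinear

 Artificial neural networks

 tensorflow for training, keras for implementation

 - the flow of tensors (multidimensional arrays, variables)

 Image processing, signal processing

 RNN: lexical and syntactic analysis

 - Time-series processing

 - Natural language processing



  Machine learning: autocomplete, prediction, computing distances between strings, "cute puppy" vs "cute cockroach", weight

  - Regression, classification

  - Clustering

  - Linear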

 

 

https://www.anaconda.com/products/individual

 

 

 

1. The NLTK natural language processing package

 

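Text mining: finding meaningful information in natural language. Starting from unstructured document data (a corpus), it builds a per-document word matrix, yields insight, and supports decision-making.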

 

Documents: unstructured data -> Corpus: stored documents -> Term-Document Matrix: structured documents -> Analysis: classification, clustering, association, sentiment analysis
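
The step from a corpus to a term-document matrix can be sketched with the standard library. The two "documents" below are made up, and whitespace splitting stands in for real tokenization:

```python
from collections import Counter

# Made-up corpus: document name -> raw text
docs = {
    "doc1": "spam offer free offer",
    "doc2": "meeting agenda free lunch",
}

# Count the terms of each document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is every term seen in any document.
vocab = sorted(set().union(*counts.values()))

# One row per term, one column per document (missing terms count as 0).
matrix = {term: [counts[name][term] for name in docs] for term in vocab}

for term, row in matrix.items():
    print(f"{term:8s} {row}")
```

Most cells are zero once the vocabulary grows, which is why such matrices are usually stored in sparse form.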

 

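NLP study topics (linear combinations, sparse matrices)

- Text preprocessing

- Count-based word representations

- Document similarity

- Topic modeling

- Association analysis

- NLP with deep learning: RNN, LSTM

- Word embeddings: the Word2vec package

- Text classification: spam mail classification

- Tagging

- Translation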

 

NLTK (English-language NLP) and KoNLPy (Korean NLP): main features the packages provide

- Corpora (corpus)

- Token generation (tokenizing)

- Morphological analysis: root analysis, nouns

- POS tagging

 

Downloading corpora

 

 

nltk.download("book")

 

 

 

Morphological analysis

 

 

from nltk.tokenize import word_tokenize

 

->

 

from nltk.tokenize import RegexpTokenizer

 

ret = RegexpTokenizer(r"[\w]+")  # keep only runs of letters, digits, and underscores as tokens

 

ret.tokenize(emma[50:100])

 

 

 

Examples of morphological analysis



- Stemming

- Lemmatizing

- POS tagging

 

 

 

 

 

 

Stemming vs lemmatization

 

 

POS tagging

 

 

 

Text class (Korean not supported)



- Word counts

- Number of words in the novel, 6%



. Tab key

 

 

 

 

 

 

 

 

 

 

2. Korean morphological analysis and word clouds

 

 

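NLP terminology

- Morpheme

- Predicate (verbs and adjectives)

- Root

- Ending

- Jamo (consonant and vowel letters)

- Part of speech

- Eojeol (word-unit) classification

- Stopwords

- n-gram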
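
Of the terms above, the n-gram is the easiest to show in code: it is a run of n consecutive tokens. A minimal sketch with a hypothetical helper `ngrams` (written here for illustration, not taken from a library):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```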

 

 

Morphemes

 

KoNLPy: Korean NLP in Python

 

https://konlpy.org/

 

https://konlpy.org/en/latest/#api

 

https://konlpy.org/en/latest/morph/#comparison-between-pos-tagging-classes

 

 

 

https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0

 

 

 

 

 

 

Eojeol (word unit) -> preprocessing -> generate analysis candidates -> check combination constraints -> select an analysis candidate -> morphemes

 

Morphological analysis engines

- KoNLPy

- KOMORAN

- HanNanum

- KoNLP; note that KoNLPy depends on the JPype1 package

 

https://www.oracle.com/java/technologies/

-> Java SE 8u251

 

 

 

 

 

 

Morphological analyzers

 

 

Hannanum

- morphs

- nouns

- pos

 

Komoran

- morphs

- nouns

- pos

 

Kkma

- morphs

- nouns

- pos

 

 

 

 

 

 

 

 

Legal corpus

 

 

 

 

 

 

 

 

 

 

 

http://coderby.com/forums/topic/jupyter-notebook-extension-%ec%a3%bc%ed%94%bc%ed%84%b0-%eb%85%b8%ed%8a%b8%eb%b6%81-%ed%99%95%ec%9e%a5%ed%8c%a9

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Word embedding: representing words as vectors



- Sparse representation
- Dense representation: distances between words

- Word embedding: the process above
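
The contrast between the sparse and dense representations can be made concrete with a toy vocabulary. In this sketch the one-hot vectors are exact, while the dense vectors are made-up numbers chosen by hand, not trained embeddings:

```python
import math

vocab = ["king", "queen", "apple"]

def one_hot(word):
    """Sparse representation: a 1 at the word's index, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two different one-hot vectors never overlap, so they carry no similarity.
sparse_sim = dot(one_hot("king"), one_hot("queen"))

# Hand-picked dense vectors (illustrative values only)
dense = {"king": [0.9, 0.8], "queen": [0.85, 0.9], "apple": [-0.7, 0.1]}
close = cosine(dense["king"], dense["queen"])
far = cosine(dense["king"], dense["apple"])

print(sparse_sim, round(close, 3), round(far, 3))
```

With dense vectors the distance between words becomes meaningful, which is what a word embedding is meant to learn.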

 

Word2Vec

- CBOW: predict the center word from the surrounding words

- Skip-Gram: predict the center word's surrounding words from the center word
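
The two Word2Vec variants differ only in which side of a (center, context) pair is the input. A minimal sketch of how the training examples could be formed, assuming a window of one word on each side; the helper `context_pairs` is hypothetical and is not gensim's implementation:

```python
def context_pairs(tokens, window):
    """For each position, return (center word, list of words within the window)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((center, context))
    return pairs

tokens = ["the", "quick", "brown", "fox"]
pairs = context_pairs(tokens, window=1)

# CBOW: input is the context words, target is the center word.
cbow = [(context, center) for center, context in pairs]
# Skip-Gram: input is the center word, one target per context word.
skipgram = [(center, c) for center, context in pairs for c in context]

print(cbow)
print(skipgram)
```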

 

pip install gensim

 

import nltk

nltk.download("book")

 

from nltk.book import *

nltk.corpus.gutenberg.fileids()

emma = nltk.corpus.gutenberg.raw("austen-emma.txt")

print(emma[:200])

 

from nltk.tokenize import sent_tokenize

sent_tokens = sent_tokenize(emma)

type(emma)

sent_tokens[10]

len(sent_tokens)

 

from nltk.tokenize import word_tokenize

print(word_tokenize(sent_tokens[10]))

 

from nltk.tokenize import RegexpTokenizer

ret = RegexpTokenizer("[\w]+")

print(ret.tokenize(sent_tokens[10]))

 

words = ["sending", "cooking", "files", "lives", "crying", "dying"]

from nltk.stem import PorterStemmer

 

pst = PorterStemmer()

[pst.stem(w) for w in words]

 

from nltk.stem import LancasterStemmer

lst = LancasterStemmer()

[lst.stem(w) for w in words]

 

from nltk.stem.regexp import RegexpStemmer

rest = RegexpStemmer('ing')

[rest.stem(w) for w in words]

 

words = ["sending", "cooking", "files", "lives", "crying", "dying"]

from nltk.stem.snowball import SnowballStemmer

sbst = SnowballStemmer("english")

[sbst.stem(w) for w in words]

 

words3 = ["cooking", "believes"]

 

lst = LancasterStemmer()

[lst.stem(w) for w in words3]

 

from nltk.stem.wordnet import WordNetLemmatizer

wl = WordNetLemmatizer()

[wl.lemmatize(w) for w in words3]

 

from nltk.tag import pos_tag

sent_tokens[10]

tagged_list = pos_tag(word_tokenize(sent_tokens[10]))

print(tagged_list)

 

nouns_list = [t[0] for t in tagged_list if t[1] == 'NN']

nouns_list

 

ret = RegexpTokenizer("[a-zA-Z]{3,}")

emma_tokens = ret.tokenize(emma)

nouns_list = [t[0] for t in tagged_list if t[1] == 'NN']

len(set(emma_tokens))

len(emma_tokens)

len(set(emma_tokens))/len(emma_tokens)

 

from nltk import Text

emma_text = Text(emma_tokens)

 

type(emma_text)

 

emma_text.plot(20)

emma_text.concordance("Emma", lines=5)

emma_text.similar("general")

 

emma_text.dispersion_plot(["Emma", "Jane", "Robert"])

 

ret = RegexpTokenizer("[a-zA-Z]{3,}")

emma_tokens = ret.tokenize(emma)

nouns_list = [t[0] for t in tagged_list if t[1] == 'NN']

emma_text = Text(emma_tokens)

emma_text.plot(20)

 

from nltk import FreqDist

 

emma_tokens = pos_tag(emma_text)

stopwords = ["CHAPTER", "End", "Nothing"]

 

names_list = [t[0] for t in emma_tokens if t[1] == "NNP" and t[0] not in stopwords]

emma_df_names = FreqDist(names_list)

 

emma_df_names

 

!pip install konlpy

 

from konlpy.tag import Hannanum

han = Hannanum()

text = "아름답지만 다소 복잡하기만 한 한국어는 전세계에서 13번째로 많이 사용되는 언어입니다."

han.analyze(text)

han.nouns(text)

pos_t = han.pos(text, ntags=22)

[t[0] for t in pos_t if t[1] == 'PV']

 

!pip install wordcloud

 

print(r"Hello\nworld")

word_list = komoran.nouns("%r"%data[0:1000])

import matplotlib.pyplot as plt

%matplotlib inline

 

from wordcloud import WordCloud

wordc = WordCloud()

wordc.generate(text)

plt.figure()

plt.imshow(wordc, interpolation="bilinear")

 

!pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install

 

wordc = WordCloud(background_color='white', max_words=20, font_path='c:/Windows/Fonts/malgun.ttf', relative_scaling=0.2)

wordc.generate(text)

 

plt.figure()

plt.imshow(wordc, interpolation="bilinear")

plt.axis('off')

 

from konlpy.corpus import kolaw

data = kolaw.open('constitution.txt').read()

from konlpy.tag import Komoran

komoran = Komoran()

word_list = komoran.nouns("%r"%data)

text = ' '.join(word_list)

wordcloud = WordCloud(background_color='white', max_words=20, font_path='c:/Windows/Fonts/malgun.ttf', relative_scaling=0.2)

wordcloud.generate(text)

plt.figure(figsize=(15,10))

plt.imshow(wordcloud, interpolation="bilinear")

plt.axis('off')

 

from PIL import Image

import numpy as np

img = Image.open("south_korea.png").convert("RGBA")

mask = Image.new("RGB", img.size, (255,255,255))

mask.paste(img,img)

mask = np.array(mask)

wordcloud = WordCloud(background_color='white', max_words=2000, font_path='c:/Windows/Fonts/malgun.ttf', mask=mask, random_state=42)

wordcloud.generate(text)

wordcloud.to_file("result1.png")

 

import requests

rss_url = "http://fs.jtbc.joins.com/RSS/economy.xml"

jtbc_economy = requests.get(rss_url)

from bs4 import BeautifulSoup

economy_news_list = BeautifulSoup(jtbc_economy.content, "xml")

link_list = economy_news_list.select("item > link")

 

from konlpy.tag import Kkma

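kkma = Kkma()  # lowercase name so the Kkma class itself is not shadowed

 

news = []

for link in link_list:

    news_url = link.text

    news_response = requests.get(news_url)

    news_soup = BeautifulSoup(news_response.content, "html.parser")

    news_content = news_soup.select_one("#articlebody > .article_content")

    if news_content is None:  # skip pages where the article body was not found

        continue

    # keep only nouns longer than one character

    news.append([word for word in kkma.nouns(news_content.text) if len(word) > 1])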

 

!pip install -U gensim

 

from gensim.models import Word2Vec

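model = Word2Vec(news, size=100, window=5, min_count=2, workers=4)  # workers must be positive (workers=-1 starts no training threads); gensim 4+ renames size to vector_size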

model.wv.most_similar("부동산")

 

 

 

 

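3. Artificial Neural Networks

 

Answers (labels) are required

 

Classification, regression, clustering (clustering has no X, y split)

 

The AI winter

1. Overfitting

2. Local optima

3. The error should keep decreasing but stops decreasing: the weights stay the same and the network does not learn (commonly called the vanishing gradient problem)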

 

Decision trees, linear data

 

CNN, ImageNet Challenge 2012

 

Shallow AI: deep learning, machine learning

-> Deep AI

 

 

Structure of a human neuron

 

 

Structure of an artificial neuron

 

 

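weight: the value that has to be obtained through training

f: the activation function; an appropriate function has to be chosen

 

Common activation functions: used to decide what a neuron passes on as its output

- Softmax: the outputs sum to 1; used in the output layer for classification

- Sigmoid

- tanh(x)

- Binary step

- Gaussian

- ReLU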
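
To make the list above concrete, here is a minimal sketch of a few of these activation functions using only the standard library. The function names are chosen for illustration; deep learning frameworks ship their own vectorized versions.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Pass positive values through unchanged and clamp negatives to 0."""
    return max(0.0, x)


def binary_step(x):
    """Output 1 for non-negative input, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(sigmoid(0.0), math.tanh(0.0), relu(-3.0), binary_step(-3.0))
```

Softmax is the only one here that looks at all outputs together, which is why it is used on the output layer of a classifier.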

 

 

 

Artificial neural network: input layer, hidden layer, output layer

 

 

 

 

 

 

Input layer: the number of input variables

Output layer: 

Hidden layer: 

 

Multilayer neural network (MLP, DNN): several hidden layers

 

 

 

 

DNN Layer

 

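Hidden1: 50 neurons, activation function -> relu

Hidden2: 30 neurons, activation function -> relu

Output: 10 neurons, activation function -> softmax

 

Loss function: cross-entropy

Optimizer: gradient descent; it follows the derivative (slope) of the loss, and the closer the derivative is to zero, the closer the error is to a minimum; learning rate: 0.001

Batch size: 100

Number of epochs: 200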
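
The optimizer note above can be illustrated with a minimal gradient descent sketch on a one-variable loss. The loss function (w - 3)^2 and the learning rate of 0.1 are assumptions chosen so the example converges quickly; they are not the settings listed in the notes.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0              # initial weight
learning_rate = 0.1  # assumed value for this toy problem
for step in range(200):
    w -= learning_rate * gradient(w)  # move against the slope

print(w, loss(w))
```

Each step moves the weight in the direction that reduces the loss; as the slope approaches zero the updates shrink and the weight settles near the minimum.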

 

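- MAE: mean absolute error

- MSE: mean squared error

- RMSE: square root of the MSE, comparable to a standard deviation of the errors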
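
A minimal sketch of the three error measures above, written with the standard library only. The function names and the sample values are made up for illustration; scikit-learn provides equivalent metrics.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as y."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical targets and predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because MSE squares each error, one large miss weighs more heavily in MSE and RMSE than in MAE.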

 

 

MLPClassifier

 

import seaborn as sns

iris = sns.load_dataset("iris")

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

le.fit(iris.species)

 

iris.species = le.transform(iris.species)

from sklearn.model_selection import train_test_split

iris_X = iris.iloc[:,:-1]

iris_y = iris.species # iris.iloc[:,-1]

train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size=0.3, random_state=1)

train_X.shape, test_X.shape

 

from sklearn.neural_network import MLPClassifier 

mlp = MLPClassifier(hidden_layer_sizes=(50,50,30), max_iter=500)

mlp.fit(train_X, train_y)

pred = mlp.predict(test_X)

pred

 

mlp.score(test_X, test_y)

 

 

 

https://archive.ics.uci.edu/ml/index.php

 

https://archive.ics.uci.edu/ml/datasets/wine+quality

 

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

 

 

Classifying winequality grades with MLPClassifier

 

#from IPython.core.display import display, HTML

#display(HTML("""<style>div."""))

 

import pandas as pd

redwine = pd.read_csv("winequality-red.csv", sep=";")

redwine.head()

redwine_X = redwine.iloc[:, :-1]

redwine_y = redwine.quality # redwine.iloc[:, -1]

 

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(redwine_X, redwine_y, test_size=0.3, random_state=1)

 

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(50,50,30), max_iter=500)  # optionally add verbose=True

 

model.fit(train_X, train_y)

 

model.score(test_X, test_y)

 

pred = model.predict(test_X)

 

pd.crosstab(test_y, pred)

 

 

 

 

Hadoop, Spark 

 

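Points to consider when defining an artificial neural network model in a deep learning framework

- Number of layers

- Number of neurons

- Activation function

- Loss function

- Optimizer

- Learning rate

- Number of epochs

- Batch size

 

Scikit-learn MLPClassifier (focused on machine learning) vs TensorFlow DNNClassifier (focused on deep learning)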

 

 

 

 

4. Implementing artificial neural networks with Keras

 

 

 

 

Keras, https://keras.io/

- A high-level interface that helps users implement deep learning easily

- Automates the matrix handling of the hidden layers

 

Installing TensorFlow with conda vs pip gives different versions

 

 

Sequential model vs Functional API

 

dropout: reduces overfitting
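
A minimal sketch of what dropout does during training, assuming the common "inverted dropout" formulation: each activation is zeroed with probability `rate`, and the survivors are scaled up so the expected sum stays the same. The helper below is illustrative only and is not the Keras `Dropout` layer.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; scale the rest by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at prediction time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0] * 1000   # hypothetical hidden-layer outputs
dropped = dropout(hidden, rate=0.5, rng=rng)

print(sum(1 for a in dropped if a == 0.0), "of", len(dropped), "units dropped")
print(sum(dropped) / len(dropped))  # stays close to the original mean of 1.0
```

Because a different random subset of units is silenced on each pass, the network cannot rely on any single unit, which is why dropout tends to reduce overfitting.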

 

 

 

 

conda install tensorflow

 

 

Classifying iris species with Keras

 

#from IPython.core.display import display, HTML

#display(HTML(

#"""<style>

#div.container { width:100% !import; } 

#div.CodeMirror {font-family: Consolas; font-size: 16pt;} 

#div.output { font-size:16pt; font_weight: bold;} 

#div.input { font-family; Consolas; font-size: 16pt; }

#div.prompt { min-width: 100px; }

#</style>

#"""))

 

import tensorflow # conda install tensorflow

#tensorflow.__version__

 

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

 

import seaborn as sns

 

iris = sns.load_dataset("iris")

 

iris_X = iris.iloc[:, :-1]

iris_y = iris.iloc[:, -1]

 

import pandas as pd

 

iris_onehot = pd.get_dummies(iris_y)

#iris_onehot.to_numpy()

 

model = Sequential()

 

model.add(Dense(4, activation="relu"))

model.add(Dense(50, activation="relu"))

model.add(Dense(50, activation="relu"))

model.add(Dense(30, activation="relu"))

model.add(Dense(3, activation="softmax"))

 

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]) #metrics=["acc"])

 

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_onehot, test_size=0.3)

 

train_X.to_numpy().shape, train_y.to_numpy().shape

 

model.fit( train_X.to_numpy(), train_y.to_numpy(), batch_size=50, epochs=200, verbose=1 )

 

model.predict(test_X)

 

model.evaluate(test_X, test_y)

 

 

 

 

 

 

import numpy as np

pred = np.argmax(model.predict(test_X), axis=1)  # predict returns per-class probabilities, so argmax picks the column index of the largest one

 

pred # predicted classes

 

np.argmax(test_y.to_numpy(), axis=1)  # true classes of the test data

 

pd.crosstab(np.argmax(test_y.to_numpy(), axis=1), pred) # cross-tabulation

 

model.evaluate(test_X, test_y) 

 

 

 

Optimizer

- SGD

- RMSprop

- Adagrad

- Adadelta

- Adam: continues learning a little past the optimal value

- Adamax

- Nadam

 

Activation functions

- softmax

- elu

- selu

- softsign

- relu

- tanh

- sigmoid

- hard_sigmoid

 

Advanced Activation functions

- LeakyReLU

- PReLU

- ELU

- ThresholdedReLU

- Softmax: used in the output layer

- ReLU: image processing

 

 

 

 

Batch normalization

 

Why training becomes unstable: Internal Covariate Shift

 

Columns with zero variance must be excluded from training, weight = 0

 

Dense (hidden layer) -> Dropout -> BatchNormalization -> Dense (hidden layer)
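
A minimal sketch of the normalization step itself, applied to one column of a batch: subtract the batch mean and divide by the batch standard deviation. The learnable scale and shift parameters that the Keras `BatchNormalization` layer also has are left out here, and the `eps` value is an assumption.

```python
import math


def batch_normalize(column, eps=1e-5):
    """Normalize one feature column of a batch to roughly zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    # eps keeps the division defined when the variance is zero
    return [(x - mean) / math.sqrt(var + eps) for x in column]


normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
constant = batch_normalize([5.0, 5.0, 5.0])  # zero-variance column

print(normalized)
print(constant)
```

The zero-variance column comes out as all zeros: it carries no information, which matches the note above that such columns should be left out of training.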

 

 

 

 

Loss functions

- mean_squared_error

- mean_absolute_error

 

 

 

Classifying winequality grades with Keras

 

import pandas as pd

import numpy as np

 

redwine = pd.read_csv("winequality-red.csv", sep=";")

 

redwine_X = redwine.iloc[:, :-1].to_numpy()

 

redwine_y = redwine.iloc[:, -1]

 

redwine_onehot = pd.get_dummies(redwine_y).to_numpy()

 

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(redwine_X, redwine_onehot, test_size=0.3)

 

from tensorflow.keras.models import Sequential

model = Sequential()

 

from tensorflow.keras.layers import Input, Dense

 

model.add(Input(shape=(11,)))  # 11 input features

model.add(Dense(50, activation="relu"))

model.add(Dense(50, activation="relu"))

model.add(Dense(30, activation="relu"))

model.add(Dense(6, activation="softmax"))

 

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(train_X, train_y, batch_size=200, epochs=200, verbose=1)

 

import numpy as np

pred = np.argmax(model.predict(test_X), axis=1)

 

pred+3 # grades start at 3, so the predicted index has to be shifted back

 

pd.crosstab(np.argmax(test_y, axis=1)+3, pred+3)  # cross-tabulation

 

model.evaluate(test_X, test_y)

 

 

 

 

 

 

Callback: an object that runs when a specific condition is met during training

- ModelCheckpoint

- EarlyStopping

- LearningRateScheduler

- TensorBoard

- CSVLogger
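
To show the kind of condition a callback checks, here is a simplified sketch of the patience rule behind early stopping: training stops once the monitored score has failed to improve for `patience` consecutive epochs. The function name and the score history are made up, and the real Keras `EarlyStopping` callback has additional options that are not modeled here.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score  # new best score: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracy per epoch
history = [0.50, 0.55, 0.54, 0.53, 0.56, 0.55, 0.55, 0.54]
stop_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch)
```

A larger patience tolerates longer plateaus before giving up, at the cost of more epochs spent without improvement.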

 

 

import pandas as pd

import numpy as np

 

redwine = pd.read_csv("winequality-red.csv", sep=";")

 

redwine_X = redwine.iloc[:, :-1].to_numpy()

 

redwine_y = redwine.iloc[:, -1]

 

redwine_onehot = pd.get_dummies(redwine_y).to_numpy()

 

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(redwine_X, redwine_onehot, test_size=0.3)

 

from tensorflow.keras.models import Sequential

model = Sequential()

 

from tensorflow.keras.layers import Input, Dense

 

model.add(Input(shape=(11,)))  # 11 input features

model.add(Dense(50, activation="relu"))

model.add(Dense(50, activation="relu"))

model.add(Dense(30, activation="relu"))

model.add(Dense(6, activation="softmax"))

 

from tensorflow.keras.callbacks import ModelCheckpoint #, EarlyStopping

 

checkpoint = ModelCheckpoint(filepath="redwine-{epoch:03d}-{val_accuracy:.4f}.hdf5",  # or the .h5 extension; TensorFlow 1.x logs this metric as val_acc

                             monitor="val_accuracy",  # a val_ metric needs validation_split=0.2 or validation_data in fit()

                             save_best_only=True,  # mode='auto', save_weights_only=False

                             verbose=1  # print a log line whenever a checkpoint is saved

                            )

 

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

 

model.fit(train_X, train_y, 

          validation_data=(test_X, test_y), # validation data with labels

          callbacks=[checkpoint],

          batch_size=200, epochs=200, verbose=1)

 

 

 

 

 

 

 

 

 

 

Installing TensorFlow
Installing with conda from the Anaconda Prompt -> 1.x
pip -> 2.0

In the Anaconda Prompt

Check the installed TensorFlow
conda list tensorflow

Remove TensorFlow
conda remove tensorflow
conda remove tensorflow-base
pip uninstall tensorflow-estimator

Install TensorFlow 2.2.0 with pip
pip install tensorflow==2.2.0
pip install tensorflow-cpu

conda install tensorflow

 

Windows cmd

pscp.exe model.py userid@ip-address:/home/userid/data/model.py

 

Linux terminal

python model.py &

 

ps -ef | grep python

 

In the Windows cmd window

pscp userid@ip-address:/home/userid/data/...h5 model...h5

 

 

 

 

 

 

 

Early Stopping Callback

 

import pandas as pd

import numpy as np

 

redwine = pd.read_csv("winequality-red.csv", sep=";")

 

redwine_X = redwine.iloc[:, :-1].to_numpy()

 

redwine_y = redwine.iloc[:, -1]

 

redwine_onehot = pd.get_dummies(redwine_y).to_numpy()

 

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(redwine_X, redwine_onehot, test_size=0.3)

 

from tensorflow.keras.models import Sequential

model = Sequential()

 

from tensorflow.keras.layers import Input, Dense

 

model.add(Input(shape=(11,)))  # 11 input features

model.add(Dense(50, activation="relu"))

model.add(Dense(50, activation="relu"))

model.add(Dense(30, activation="relu"))

model.add(Dense(6, activation="softmax"))

 

from tensorflow.keras.callbacks import ModelCheckpoint #, EarlyStopping

 

checkpoint = ModelCheckpoint(filepath="redwine-{epoch:03d}-{val_accuracy:.4f}.hdf5",  # or the .h5 extension; TensorFlow 1.x logs this metric as val_acc

                             monitor="val_accuracy",  # a val_ metric needs validation_split=0.2 or validation_data in fit()

                             save_best_only=True,  # mode='auto', save_weights_only=False

                             verbose=1  # print a log line whenever a checkpoint is saved

                            )



from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor="val_accuracy", patience=10)  # TensorFlow 1.x logs this metric as val_acc

 

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

 

model.fit(train_X, train_y, 

          validation_data=(test_X, test_y), # validation data with labels

          callbacks=[checkpoint, early_stopping],

          batch_size=200, epochs=2000, verbose=1)

 

 

 

 

 

 

 

 

 

import tensorflow

 

import pandas as pd

import numpy as np

 

redwine = pd.read_csv("winequality-red.csv", sep=";")

 

redwine_X = redwine.iloc[:, :-1].to_numpy()

redwine_y = redwine.iloc[:, -1]

 

redwine_onehot = pd.get_dummies(redwine_y).to_numpy()

 

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(redwine_X, redwine_onehot, test_size=0.3)

 

from tensorflow.keras.models import Sequential

model = Sequential()

 

from tensorflow.keras.layers import Input, Dense

 

model.add(Input(shape=(11,)))  # 11 input features

model.add(Dense(50, activation="relu"))

model.add(Dense(50, activation="relu"))

model.add(Dense(30, activation="relu"))

model.add(Dense(6, activation="softmax"))

 

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

model.load_weights("redwine-062-0.5521.hdf5")

model.evaluate(test_X, test_y)

 

 

 

 

 

 

 

5. RNN

 

 

 

Loading a saved model and predicting

 

CNN: learns image filters, convolution

 

 

 

 

 

 

 

RNN: recurrent neural network

 

 

 

 

 

The y value learned at the previous step

The previous y value

 

- What was learned at the previous step is passed on to the next step
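
The idea of passing the previous result forward can be sketched with a single recurrent unit: the new hidden state depends on the current input and on the previous hidden state. This is a scalar toy version with made-up weights, not the Keras `SimpleRNN` layer, which uses weight matrices learned during training.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


h = 0.0  # initial hidden state
states = []
for x in [1.0, 0.5, -1.0]:  # hypothetical input sequence
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)  # assumed weights
    states.append(h)

print(states)
```

The second state differs from what the same input would produce with an empty history, which is exactly the memory effect the note describes.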

 

 

Bidirectional recurrent neural network

 

 

 

 

 

 

 

 

 

Predicting the next word from the context

 

vocab_size: sparse (one-hot) matrix; several rows are built from a single sentence; word indices start at 1

 

Embedding

SimpleRNN

 

nltk, keras_preprocessing.text

 

Morpheme classification

 

경마장에 있는 말이 뛰고 있다 (The horse at the racetrack is running)

-> 경마장에 있는 말이, 있는 말이, 말이 뛰고, ---
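
One way to turn a single sentence into several training rows is to take growing prefixes of it, which is what the training loop later in this section does with `encoded[:i+1]`. A minimal sketch on the raw words; the helper name `prefix_sequences` is an assumption, and the real code works on integer-encoded tokens instead of strings.

```python
def prefix_sequences(tokens):
    """Return every prefix of length 2 or more; the last item of each is the word to predict."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]


tokens = "경마장에 있는 말이 뛰고 있다".split()
samples = prefix_sequences(tokens)

for sample in samples:
    print(sample[:-1], "->", sample[-1])
```

A sentence of five words therefore yields four rows, each pairing a context with the word that follows it.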

 

pad_sequences, padding='pre': fills the front of each sequence with 0
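
A minimal sketch of what front padding does, written in plain Python so the effect is visible. The helper `pad_pre` is hypothetical and only mimics the 'pre' behaviour for a positive `maxlen`; in practice the Keras `pad_sequences` function is used.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`; keep the last `maxlen` items if longer."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front when too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


padded = pad_pre([[2, 3], [2, 3, 1], [4]], maxlen=4)
for row in padded:
    print(row)
```

Padding at the front keeps the most recent words at the end of every row, so the last column can be split off as the prediction target.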

 

 

 

text = """경마장에 있는 말이 뛰고 있다\n

그의 말이 법이다\n

가는 말이 고와야 오는 말이 곱다\n"""

 

from keras_preprocessing.text import Tokenizer

t = Tokenizer()

t.fit_on_texts([text])

encoded = t.texts_to_sequences([text])[0]

 

vocab_size = len(t.word_index) + 1

print('Vocabulary size: %d' % vocab_size)

 

print(t.word_index)

 

sequences = list()

for line in text.split('\n'):

    encoded = t.texts_to_sequences([line])[0]

    for i in range(1, len(encoded)):

        sequence = encoded[:i+1]

        sequences.append(sequence)

        

print('Number of training samples: %d' % len(sequences))

 

print(sequences)

 

print(max(len(I) for I in sequences))

 

from keras.preprocessing.sequence import pad_sequences

sequences = pad_sequences(sequences, maxlen=6, padding='pre')

 

import numpy as np

sequences = np.array(sequences)

 

X = sequences[:,:-1]

y = sequences[:,-1]

 

print(X)

 

print(y)

 

from keras.utils import to_categorical

y = to_categorical(y, num_classes=vocab_size)

 

print(y)

 

from keras.layers import Embedding, Dense, SimpleRNN

from keras.models import Sequential

 

model = Sequential()

 

model.add(Embedding(vocab_size, 10, input_length=5))

model.add(SimpleRNN(32))

model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X, y, epochs=200, verbose=2)

 

def sentence_generation(model, t, current_word, n):

    init_word = current_word

    sentence = ''

    

    for _ in range(n):

        encoded = t.texts_to_sequences([current_word])[0]

        encoded = pad_sequences([encoded], maxlen=5, padding='pre')

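        result = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]  # index of the most likely word; predict_classes was removed in newer Keras versions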

        

        for word, index in t.word_index.items():

            if index == result:

                break

        

        current_word = current_word + ' ' + word

        sentence = sentence + ' ' + word

        

    sentence = init_word + sentence

    return sentence

 

print(sentence_generation(model, t, '경마장에', 4))

 

print(sentence_generation(model, t, '그의', 2))

 

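from keras.models import model_from_json

model_json = model.to_json()

with open("redwin.json", "w") as json_file:  # save the architecture so it can be read back below

    json_file.write(model_json)

with open("redwin.json", "r") as json_file:

    loaded_model_json = json_file.read()

    model = model_from_json(loaded_model_json)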

 

 

model.evaluate(test_X, test_y) # the model must be compiled before evaluate can be used

 

 

 

6. LSTM

 

 

RNN: remembers only the immediately preceding step

LSTM: also remembers earlier steps

 

 

 

 

 

np.uint8

 

PIL

opencv-python

 

Working with N-dimensional arrays

DataFrames

Data visualization

Web data collection

 

 

 

 

 

 

 

 

 

 



Registering an availability multi-step web test

푸닥거리 2020. 7. 3. 09:01

https://github.com/uglide/azure-content/blob/master/articles/application-insights/app-insights-monitor-web-app-availability.md#multi-step-web-tests

 


Multi-step web tests

You can monitor a scenario that involves a sequence of URLs. For example, if you are monitoring a sales website, you can test that adding items to the shopping cart works correctly.

To create a multi-step test, you record the scenario by using Visual Studio, and then upload the recording to Application Insights. Application Insights will replay the scenario at intervals and verify the responses.

Note that you can't use coded functions in your tests: the scenario steps must be contained as a script in the .webtest file.

1. Record a scenario

Use Visual Studio Enterprise or Ultimate to record a web session.

  1. Create a web performance test project.

  2. Open the .webtest file and start recording.

  3. Do the user actions you want to simulate in your test: open your website, add a product to the cart, and so on. Then stop your test.

    Don't make a long scenario. There's a limit of 100 steps and 2 minutes.

  4. Edit the test to:

  • Add validations to check the received text and response codes.

  • Remove any superfluous interactions. You could also remove dependent requests for pictures or to ad or tracking sites.

    Remember that you can only edit the test script - you can't add custom code or call other web tests. Don't insert loops in the test. You can use standard web test plug-ins.

  5. Run the test in Visual Studio to make sure it works.

    The web test runner opens a web browser and repeats the actions you recorded. Make sure it works as you expect.

2. Upload the web test to Application Insights

  1. In the Application Insights portal, create a new web test.

  2. Select multi-step test, and upload the .webtest file.

     

    Set the test locations, frequency, and alert parameters in the same way as for ping tests.

View your test results and any failures in the same way as for single-url tests.

A common reason for failure is that the test runs too long. It mustn't run longer than two minutes.

Don't forget that all the resources of a page must load correctly for the test to succeed, including scripts, style sheets, images and so forth.

Note that the web test must be entirely contained in the .webtest file: you can't use coded functions in the test.

Plugging time and random numbers into your multi-step test

Suppose you're testing a tool that gets time-dependent data such as stocks from an external feed. When you record your web test, you have to use specific times, but you set them as parameters of the test, StartTime and EndTime.

When you run the test, you'd like EndTime always to be the present time, and StartTime should be 15 minutes ago.

Web Test Plug-ins provide the way to do this.

  1. Add a web test plug-in for each variable parameter value you want. In the web test toolbar, choose Add Web Test Plugin.

    In this example, we'll use two instances of the Date Time Plug-in. One instance is for "15 minutes ago" and another for "now".

  2. Open the properties of each plug-in. Give it a name and set it to use the current time. For one of them, set Add Minutes = -15.

  3. In the web test parameters, use {{plug-in name}} to reference a plug-in name.
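
The time arithmetic the two plug-in instances perform can be sketched as follows. This only illustrates the "now" and "15 minutes ago" values; it is not how the Date Time Plug-in is implemented, and the function name and the fixed example timestamp are assumptions.

```python
from datetime import datetime, timedelta, timezone


def time_window(now, minutes_back=15):
    """Return (StartTime, EndTime) where EndTime is `now` and StartTime is `minutes_back` earlier."""
    return now - timedelta(minutes=minutes_back), now


# A fixed timestamp keeps the example repeatable; a real run would use datetime.now(timezone.utc)
example_now = datetime(2020, 7, 3, 9, 1, tzinfo=timezone.utc)
start_time, end_time = time_window(example_now)

print(start_time.isoformat(), end_time.isoformat())
```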

Now, upload your test to the portal. It will use the dynamic values on every run of the test.

Dealing with sign-in

If your users sign in to your app, you have a number of options for simulating sign-in so that you can test pages behind the sign-in. The approach you use depends on the type of security provided by the app.

In all cases, you should create an account just for the purpose of testing. If possible, restrict its permissions so that it's read-only.

  • Simple username and password: Just record a web test in the usual way. Delete cookies first.
  • SAML authentication. For this, you can use the SAML plugin that is available for web tests.
  • Client secret: If your app has a sign-in route that involves a client secret, use that. Azure Active Directory provides this.
  • Open Authentication - for example, signing in with your Microsoft or Google account. Many apps that use OAuth provide the client secret alternative, so the first tactic is to investigate that. If your test has to sign in using OAuth, the general approach is:
      • Use a tool such as Fiddler to examine the traffic between your web browser, the authentication site, and your app.
      • Perform two or more sign-ins using different machines or browsers, or at long intervals (to allow tokens to expire).
      • By comparing different sessions, identify the token passed back from the authenticating site, that is then passed to your app server after sign-in.
      • Record a web test using Visual Studio.
      • Parameterize the tokens, setting the parameter when the token is returned from the authenticator, and using it in the query to the site. (Visual Studio will attempt to parameterize the test, but will not correctly parameterize the tokens.)

Edit or disable a test

Open an individual test to edit or disable it.

You might want to disable web tests while you are performing maintenance on your service.

Performance tests

You can run a load test on your website. Like the availability test, you can send either simple requests or multi-step requests from our points around the world. Unlike an availability test, many requests are sent, simulating multiple simultaneous users.

From the Overview blade, open Settings, Performance Tests. When you create a test, you are invited to connect to or create a Visual Studio Team Services account.

When the test is complete, you'll be shown response times and success rates.

Automation
