Examples from the book 처음 배우는 데이터 과학

It has been a while.
My name is 길당 홍길한.
The code below works through examples from the book 처음 배우는 데이터 과학.

Chapter 5

In [22]:
import matplotlib
matplotlib.rc('font', family="NanumBarunGothic")  

%matplotlib inline

5.2 The Iris Dataset

In [23]:
import pandas as pd
from matplotlib import pyplot as plt
import sklearn.datasets

def get_iris_df():
    ds = sklearn.datasets.load_iris()
    df = pd.DataFrame(ds['data'], columns=ds['feature_names'])
    code_species_map = dict(zip(
        range(3), ds['target_names']))
    df['species'] = [code_species_map[c] for c in ds['target']]
    return df

df = get_iris_df()
df_iris = df
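
The `dict(zip(...))` lookup in `get_iris_df` maps each integer target code to a species name. As a minimal sketch of the same idea, the snippet below compares that approach with `pd.Categorical.from_codes`. The `codes` and `names` values are made-up stand-ins for `ds['target']` and `ds['target_names']`, not the real dataset.

```python
import pandas as pd

# Toy stand-ins (assumptions) for ds['target'] and ds['target_names']
codes = [0, 0, 1, 2, 1]
names = ['setosa', 'versicolor', 'virginica']

# Approach used in get_iris_df: build a code -> name dict, then look up each code
code_species_map = dict(zip(range(3), names))
species_via_dict = [code_species_map[c] for c in codes]

# Alternative: let pandas map the integer codes onto category labels
species_via_categorical = pd.Categorical.from_codes(codes, categories=names)

print(species_via_dict)
print(list(species_via_categorical))
```

Both approaches should yield the same labels. The categorical version also stores the column more compactly, which may matter for larger datasets.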

5.3 Pie Charts

In [24]:
sums_by_species = df.groupby('species').sum()
var = 'sepal width (cm)'
sums_by_species[var].plot(kind='pie', fontsize=20)
plt.ylabel(var, horizontalalignment='left')
plt.title('꽃받침 너비로 분류한 붓꽃', fontsize=25)
# plt.savefig('iris_pie_for_one_variable.png')
# plt.close()
Out[24]:
Text(0.5,1,'꽃받침 너비로 분류한 붓꽃')
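
Each pie slice above is one species' share of the summed column. The sketch below shows that computation on a tiny made-up frame (the `toy` data and the `width` column are assumptions for illustration, not iris values).

```python
import pandas as pd

# Made-up measurements (not the real iris values)
toy = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b', 'c'],
    'width': [1.0, 3.0, 2.0, 2.0, 2.0],
})

# Total per species: this is what groupby(...).sum() hands to the pie chart
sums = toy.groupby('species')['width'].sum()

# Fraction of the grand total per species: this is the size of each slice
shares = sums / sums.sum()

print(sums)
print(shares)
```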
In [25]:
sums_by_species = df.groupby('species').sum()
sums_by_species.plot(kind='pie', subplots=True,
                     layout=(2,2), legend=False)
plt.title('종에 따른 전체 측정값Total Measurements, by Species')
# plt.savefig('iris_pie_for_each_variable.png')
# plt.close()
Out[25]:
Text(0.5,1,'종에 따른 전체 측정값Total Measurements, by Species')

5.4 Bar Charts

In [26]:
sums_by_species = df.groupby('species').sum()
var = 'sepal width (cm)'
sums_by_species[var].plot(kind='bar', fontsize=15, rot=30)

plt.title('꽃받침 너비(cm)로 분류한 붓꽃', fontsize=20)
# plt.savefig('iris_bar_for_one_variable.png')
# plt.close()
sums_by_species = df.groupby('species').sum()
sums_by_species.plot(
   kind='bar', subplots=True, fontsize=12)
plt.suptitle('종에 따른 전체 측정값')
# plt.savefig('iris_bar_for_each_variable.png')
# plt.close()
Out[26]:
Text(0.5,0.98,'종에 따른 전체 측정값')

5.5 Histograms

In [12]:
df.plot(kind='hist', subplots=True, layout=(2,2))
plt.suptitle('붓꽃 히스토그램', fontsize=20)
# plt.show()
Out[12]:
Text(0.5,0.98,'붓꽃 히스토그램')
In [13]:
for spec in df['species'].unique():
    forspec = df[df['species']==spec]
    forspec['petal length (cm)'].plot(kind='hist', alpha=0.4, label=spec)

plt.legend(loc='upper right')
plt.suptitle('종에 따른 꽃잎 길이')
# plt.savefig('iris_hist_by_spec.png')
Out[13]:
Text(0.5,0.98,'종에 따른 꽃잎 길이')

5.6 Mean, Standard Deviation, Median, and Percentiles

In [16]:
col = df['petal length (cm)']
average = col.mean()
std = col.std()
median = col.quantile(0.5)
percentile25 = col.quantile(0.25)
percentile75 = col.quantile(0.75)
print(average, std, median, percentile25, percentile75)
3.75866666667 1.76442041995 4.35 1.6 5.1
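
As a rough cross-check, `Series.describe()` reports the same summary statistics in one call. The sketch below uses a small made-up series rather than the iris column, so the numbers are easy to verify by hand.

```python
import pandas as pd

# Made-up values chosen so the quartiles land exactly on data points
s = pd.Series([1, 2, 3, 4, 100])

summary = s.describe()

# describe() labels the quartiles '25%', '50%', '75%'
print(summary['mean'], summary['std'])
print(summary['25%'], summary['50%'], summary['75%'])

# The same numbers via the individual calls used above
print(s.mean(), s.std(), s.quantile(0.25), s.quantile(0.5), s.quantile(0.75))
```

Note how the single large value pulls the mean far above the median, which is the motivation for the outlier filtering that follows.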

Filtering Out Outliers

In [19]:
col = df['petal length (cm)']
perc25 = col.quantile(0.25)
perc75 = col.quantile(0.75)
clean_avg = col[(col>perc25)&(col<perc75)].mean()
print(clean_avg)
4.0984375
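
The filter above keeps only values strictly between the 25th and 75th percentiles before averaging. A minimal sketch on made-up data shows why this is less sensitive to extreme values than the plain mean.

```python
import pandas as pd

# Made-up data with one extreme value at the end
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1000])

perc25 = s.quantile(0.25)
perc75 = s.quantile(0.75)

# Same filter as above: keep values strictly inside the interquartile range
clean_avg = s[(s > perc25) & (s < perc75)].mean()
plain_avg = s.mean()

print(plain_avg)  # dominated by the single extreme value
print(clean_avg)  # close to the middle of the bulk of the data
```

Because the comparison is strict, values equal to a quartile are dropped as well; using `>=` and `<=` would be an equally reasonable choice.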

5.7 Box Plots

In [20]:
col = 'sepal length (cm)'
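# Row position within each species (the dataset holds 50 rows per species, in order)
df['ind'] = pd.Series(df.index).apply(lambda i: i % 50)
# Newer pandas versions require keyword arguments for pivot
df.pivot(index='ind', columns='species')[col].plot(kind='box')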
# plt.show()
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b273a20>
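
The `pivot` call reshapes the long table into one column per species, so the box plot draws one box per column. The sketch below shows that reshaping on a tiny made-up frame. It uses `groupby(...).cumcount()` to number rows within each species, which is a variant of the `i % 50` trick and does not rely on the rows being stored in equal-sized, ordered groups.

```python
import pandas as pd

# Made-up long-format data: one row per measurement
toy = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'length': [5.0, 5.2, 6.1, 6.3],
})

# Position of each row within its species: 0, 1, 0, 1
toy['ind'] = toy.groupby('species').cumcount()

# Wide format: rows indexed by position, one column per species
wide = toy.pivot(index='ind', columns='species', values='length')
print(wide)
```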

5.8 Scatter Plots

In [23]:
df.plot(kind="scatter",
    x="sepal length (cm)", y="sepal width (cm)")
plt.title("Length vs Width")
# plt.show()
Out[23]:
Text(0.5,1,'Length vs Width')
In [25]:
colors = ["r", "g", "b"]
markers= [".", "*", "^"]
fig, ax = plt.subplots(1, 1)
for i, spec in enumerate(df['species'].unique() ):
    ddf = df[df['species']==spec]
    ddf.plot(kind="scatter",
        x="sepal width (cm)", y="sepal length (cm)",
        alpha=0.5, s=10*(i+1), ax=ax,
        color=colors[i], marker=markers[i], label=spec)
    
plt.legend()
plt.show()
In [27]:
import pandas as pd
import sklearn.datasets as ds
import matplotlib.pyplot as plt
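# Create a pandas DataFrame
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
bs = ds.load_boston()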
df = pd.DataFrame(bs.data, columns=bs.feature_names)
df['MEDV'] = bs.target
# Ordinary scatter plot on linear axes
df.plot(x='CRIM',y='MEDV',kind='scatter')
plt.title('일반축에 나타낸 범죄 발생률')
# plt.show()
Out[27]:
Text(0.5,1,'일반축에 나타낸 범죄 발생률')

Applying a Log Scale

In [29]:
df.plot(x='CRIM',y='MEDV',kind='scatter',logx=True)
plt.title('Crime rate on logarithmic axis')
plt.show()
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/matplotlib/mathtext.py:854: MathTextWarning: Font 'default' does not have a glyph for '-' [U+2212]
  MathTextWarning)
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/matplotlib/mathtext.py:855: MathTextWarning: Substituting with a dummy symbol.
  warn("Substituting with a dummy symbol.", MathTextWarning)
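
The `MathTextWarning` above says the selected font lacks a glyph for the Unicode minus sign (U+2212). A commonly suggested workaround with Korean fonts is to tell matplotlib to use the ASCII hyphen-minus instead. This is only a sketch: whether it silences this particular mathtext warning may depend on the matplotlib version.

```python
import matplotlib

# Use the ASCII hyphen-minus for negative tick labels instead of U+2212,
# which many CJK fonts do not include
matplotlib.rcParams['axes.unicode_minus'] = False

print(matplotlib.rcParams['axes.unicode_minus'])
```

Setting this once near the top of the notebook, next to the `matplotlib.rc('font', ...)` call, keeps the font configuration in one place.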

5.10 Scatter Matrix

In [33]:
from pandas.plotting import scatter_matrix
scatter_matrix(df_iris)
plt.show()

5.11 Heat Maps

In [35]:
df_iris.plot(kind="hexbin", x="sepal width (cm)", y="sepal length (cm)")
plt.show()

5.12 Correlation

In [36]:
df_iris["sepal width (cm)"].corr(df_iris["sepal length (cm)"])  # Pearson corr
Out[36]:
-0.10936924995064937
In [37]:
df_iris["sepal width (cm)"].corr(df_iris["sepal length (cm)"], method="pearson")
Out[37]:
-0.10936924995064937
In [38]:
df_iris["sepal width (cm)"].corr(df_iris["sepal length (cm)"], method="spearman")
Out[38]:
-0.15945651848582867
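
To see how the two methods differ, recall that Spearman correlation is Pearson correlation applied to ranks. The sketch below computes it that way on made-up data with a monotonic but nonlinear relationship.

```python
import pandas as pd

# Made-up data: y grows with x, but not linearly
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3

# Pearson measures linear association, so it falls short of 1 here
pearson = x.corr(y, method='pearson')

# Spearman computed by hand: Pearson correlation of the ranks
spearman = x.rank().corr(y.rank(), method='pearson')

print(pearson)
print(spearman)
```

Because the ranks of `x` and `y` agree exactly, the rank-based value reaches 1 (up to floating-point error) while Pearson stays below it.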

5.12 Time Series Data

In [41]:
# $ pip install statsmodels
import statsmodels.api as sm
dta = sm.datasets.co2.load_pandas().data
dta.plot()
plt.title("이산화탄소 농도")
plt.ylabel("PPM")
plt.show()
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/importlib/_bootstrap.py:321: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  return f(*args, **kwds)

The code for loading Google stock prices is omitted because the Yahoo API no longer works.

 

 
