淘先锋技术网


Without further ado, let's get started!

import pandas as pd
# Read a CSV text file
gapminder = pd.read_csv('gapminder.csv')
# To read an Excel file instead: gapminder = pd.read_excel(xlsx_file)
print(type(gapminder))
gapminder.head()
<class 'pandas.core.frame.DataFrame'>
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
# Dimensions as (rows, columns)
gapminder.shape
(1704, 6)
# Column names
gapminder.columns
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
# Row index
gapminder.index
RangeIndex(start=0, stop=1704, step=1)
# Column types and non-null counts
gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
# Summary statistics for the numeric columns
gapminder.describe()
             year      lifeExp           pop      gdpPercap
count  1704.00000  1704.000000  1.704000e+03    1704.000000
mean   1979.50000    59.474439  2.960121e+07    7215.327081
std      17.26533    12.917107  1.061579e+08    9857.454543
min    1952.00000    23.599000  6.001100e+04     241.165877
25%    1965.75000    48.198000  2.793664e+06    1202.060309
50%    1979.50000    60.712500  7.023596e+06    3531.846989
75%    1993.25000    70.845500  1.958522e+07    9325.462346
max    2007.00000    82.603000  1.318683e+09  113523.132900
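
If gapminder.csv is not at hand, the same loading and inspection steps can be tried on an in-memory sample. This is only a sketch: the three rows are copied from the tables in this post, and the names csv_text and sample are made up for the illustration.

```python
import io
import pandas as pd

# Three-row sample in the same column layout as gapminder.csv
# (csv_text and sample are illustrative names, not part of the original post)
csv_text = """country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.445314
Afghanistan,Asia,1957,30.332,9240934,820.853030
China,Asia,1952,44.0,556263527,400.448611
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column names
print(sample.dtypes)         # one dtype per column
print(sample.describe())     # summary statistics for the numeric columns
```

Text columns come back as object dtype and whole-number columns as int64, which matches what info() reports for the full dataset.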

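Data wrangling

The core of dplyr (an R package) is six functions that map onto SQL query syntax:

filter(): the WHERE clause of a SQL query

select(): the SELECT clause of a SQL query

mutate(): derived (computed) columns in a SQL query

arrange(): the ORDER BY clause of a SQL query

summarise(): the aggregate functions of a SQL query

group_by(): the GROUP BY clause of a SQL query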

# Write a boolean condition to pick the matching observations out of the DataFrame,
# which does the job of filter(); for example, select China:
gapminder[gapminder['country'] == 'China']
    country continent  year   lifeExp         pop    gdpPercap
288   China      Asia  1952  44.00000   556263527   400.448611
289   China      Asia  1957  50.54896   637408000   575.987001
290   China      Asia  1962  44.50136   665770000   487.674018
291   China      Asia  1967  58.38112   754550000   612.705693
292   China      Asia  1972  63.11888   862030000   676.900092
293   China      Asia  1977  63.96736   943455000   741.237470
294   China      Asia  1982  65.52500  1000281000   962.421380
295   China      Asia  1987  67.27400  1084035000  1378.904018
296   China      Asia  1992  68.69000  1164970000  1655.784158
297   China      Asia  1997  70.42600  1230075000  2289.234136
298   China      Asia  2002  72.02800  1280400000  3119.280896
299   China      Asia  2007  72.96100  1318683096  4959.114854
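# Combine multiple conditions with | or &, wrapping each condition in parentheses;
# for example, select the Asian countries in 2007 (.iloc limits the display to the first rows):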
gapminder[(gapminder['year'] == 2007) & (gapminder['continent'] == 'Asia')].iloc[0:10,]
              country continent  year  lifeExp         pop     gdpPercap
11        Afghanistan      Asia  2007   43.828    31889923    974.580338
95            Bahrain      Asia  2007   75.635      708573  29796.048340
107        Bangladesh      Asia  2007   64.062   150448339   1391.253792
227          Cambodia      Asia  2007   59.723    14131858   1713.778686
299             China      Asia  2007   72.961  1318683096   4959.114854
671  Hong Kong, China      Asia  2007   82.208     6980412  39724.978670
707             India      Asia  2007   64.698  1110396331   2452.210407
719         Indonesia      Asia  2007   70.650   223547000   3540.651564
731              Iran      Asia  2007   70.964    69453570  11605.714490
743              Iraq      Asia  2007   59.545    27499638   4471.061906
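
The parentheses around each condition matter, because & binds more tightly than == in Python. As a sketch of an alternative spelling, DataFrame.query takes the whole condition as a string. The four-row frame below is a hypothetical stand-in for gapminder, and the France row is invented for the illustration.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the gapminder DataFrame
df = pd.DataFrame({
    "country": ["China", "China", "India", "France"],
    "continent": ["Asia", "Asia", "Asia", "Europe"],
    "year": [2002, 2007, 2007, 2007],
})

# Boolean-mask form: each comparison is wrapped in its own parentheses
by_mask = df[(df["year"] == 2007) & (df["continent"] == "Asia")]

# Same filter as a query string; "and" replaces & and no parentheses are needed
by_query = df.query("year == 2007 and continent == 'Asia'")

print(by_mask)
print(by_mask.equals(by_query))
```

Both forms return the same rows; which one reads better is mostly a matter of taste.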
# Index with a list of column names to extract just those columns (the select() equivalent)
gapminder[['country', 'continent']].iloc[0:10,]
       country continent
0  Afghanistan      Asia
1  Afghanistan      Asia
2  Afghanistan      Asia
3  Afghanistan      Asia
4  Afghanistan      Asia
5  Afghanistan      Asia
6  Afghanistan      Asia
7  Afghanistan      Asia
8  Afghanistan      Asia
9  Afghanistan      Asia
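# Writing the derived formula and assigning it to a new column name does the job of mutate();
# apply() with a lambda applies the formula to every observation.
# For example, add a country_abb column holding the first three letters of country: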
gapminder['country_abb'] = gapminder['country'].apply(lambda x: x[:3])
gapminder.iloc[1:10,]
       country continent  year  lifeExp       pop   gdpPercap country_abb
1  Afghanistan      Asia  1957   30.332   9240934  820.853030         Afg
2  Afghanistan      Asia  1962   31.997  10267083  853.100710         Afg
3  Afghanistan      Asia  1967   34.020  11537966  836.197138         Afg
4  Afghanistan      Asia  1972   36.088  13079460  739.981106         Afg
5  Afghanistan      Asia  1977   38.438  14880372  786.113360         Afg
6  Afghanistan      Asia  1982   39.854  12881816  978.011439         Afg
7  Afghanistan      Asia  1987   40.822  13867957  852.395945         Afg
8  Afghanistan      Asia  1992   41.674  16317921  649.341395         Afg
9  Afghanistan      Asia  1997   41.763  22227415  635.341351         Afg
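
The list of dplyr functions includes arrange(), which this walkthrough never demonstrates. A minimal sketch, assuming DataFrame.sort_values is the intended pandas counterpart. The four rows reuse 2007 values from the Asia table, and the names asc and desc are illustrative.

```python
import pandas as pd

# Four 2007 rows with values taken from the Asia table in this post
df = pd.DataFrame({
    "country": ["Afghanistan", "Bahrain", "China", "India"],
    "lifeExp": [43.828, 75.635, 72.961, 64.698],
    "pop": [31889923, 708573, 1318683096, 1110396331],
})

# Ascending order is the default, comparable to arrange(lifeExp) or ORDER BY lifeExp
asc = df.sort_values(by="lifeExp")

# Descending order, comparable to arrange(desc(pop)) or ORDER BY pop DESC
desc = df.sort_values(by="pop", ascending=False)

print(asc)
print(desc)
```

sort_values returns a new DataFrame and keeps the original row labels, so the index shows where each row came from.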
# Call the DataFrame's aggregation methods on columns to do the job of summarise(); for example, the total world population in 2007:
gapminder[gapminder['year'] == 2007][['pop']].sum()
pop    6251013179
dtype: int64
# Or the mean life expectancy and mean GDP per capita across countries in 2007:
gapminder[gapminder['year'] == 2007][['lifeExp', 'gdpPercap']].mean()
lifeExp         67.007423
gdpPercap    11680.071820
dtype: float64
# Finally, the DataFrame groupby method does the job of group_by(); for example, total population by continent in 2007:
gapminder[gapminder['year'] == 2007].groupby(by = 'continent')['pop'].sum()
continent
Africa       929539692
Americas     898871184
Asia        3811953827
Europe       586098529
Oceania       24549947
Name: pop, dtype: int64
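# Or mean life expectancy and mean GDP per capita by continent in 2007
# (the column subset must be a list; pandas 2.x rejects a bare tuple of names here):
gapminder[gapminder['year'] == 2007].groupby(by = 'continent')[['lifeExp', 'gdpPercap']].mean()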
             lifeExp     gdpPercap
continent
Africa     54.806038   3089.032605
Americas   73.608120  11003.031625
Asia       70.728485  12473.026870
Europe     77.648600  25054.481636
Oceania    80.719500  29810.188275
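
When different columns need different aggregations in one pass (a sum for population, a mean for life expectancy), groupby can be combined with agg and named aggregation. The sketch below uses made-up numbers rather than gapminder values, and the output column names total_pop and mean_lifeExp are illustrative.

```python
import pandas as pd

# Hypothetical mini-sample; the numbers are illustrative, not gapminder data
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [70.0, 60.0, 80.0, 78.0],
    "pop": [100, 300, 50, 70],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = df.groupby("continent").agg(
    total_pop=("pop", "sum"),
    mean_lifeExp=("lifeExp", "mean"),
)

print(summary)
```

This is close in spirit to group_by() followed by summarise() with several named results.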

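The foundation of Python visualization is pyplot from the Matplotlib package. Its plotting philosophy is to assemble a figure piece by piece, with separate calls for elements such as axes, lines, points, and text. The upside is great flexibility; the downside is a somewhat steep learning curve for beginners. To address this, pandas wraps the basic matplotlib.pyplot charts in a single method, so that calling df.plot() is enough to draw a chart. A wide range of chart types is available through the kind= parameter:
'line': line chart (default)
'bar': vertical bar chart
'barh': horizontal bar chart
'hist': histogram
'box': box plot
'scatter': scatter plot
'hexbin': hexbin plot
...etc.
Before plotting, we load matplotlib.pyplot and seaborn. The former is the base plotting package, and the latter makes the charts look nicer: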

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # apply seaborn's default style; since seaborn 0.8 the import alone no longer does this
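
To get a feel for the kind= parameter before touching the real data, here is a small sketch that draws the same toy frame three ways. The data is made up, and the Agg backend is an assumption so that the snippet also runs without a display; in a notebook you would drop that line and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.0, 3.0, 5.0]})

# One row of three subplots; ax= tells df.plot() which subplot to draw into
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.plot(kind="line", x="x", y="y", ax=axes[0], title="line")
df.plot(kind="bar", x="x", y="y", ax=axes[1], title="bar", rot=0)
df.plot(kind="scatter", x="x", y="y", ax=axes[2], title="scatter")
fig.tight_layout()
```

Passing ax= is optional; without it, each df.plot() call creates its own figure.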

# Visualizing a value over time: line chart
# Filter the China rows and plot the population change from 1952 to 2007:

gapminder_cn = gapminder[gapminder['country'] == 'China']
gapminder_cn[['year', 'pop']].plot(kind = 'line', x = 'year', y = 'pop', title = 'Pop vs. Year in China')
plt.show()

# Or filter the China, Japan, and South Korea rows and plot the change in life expectancy from 1952 to 2007

gapminder_northasia = gapminder.loc[gapminder['country'].isin(['China', 'Japan', 'Korea, Rep.'])]
gapminder_northasia_pivot = gapminder_northasia.pivot_table(values = 'lifeExp', columns = 'country', index = 'year')
gapminder_northasia_pivot.plot(title = 'Life Expectancies in North Asia')
plt.show()
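
The pivot_table call is what turns one line into one line per country: it reshapes the long table (one row per country and year) into a wide one with a column per country, and df.plot() then draws each column as its own line. A sketch on made-up numbers; long_df and wide_df are illustrative names and the values are not gapminder data.

```python
import pandas as pd

# Hypothetical long-format sample: one row per (country, year)
long_df = pd.DataFrame({
    "country": ["China", "China", "Japan", "Japan"],
    "year": [1952, 1957, 1952, 1957],
    "lifeExp": [44.0, 50.5, 63.0, 65.5],
})

# Wide format: years become the index, countries become the columns
wide_df = long_df.pivot_table(values="lifeExp", index="year", columns="country")

print(wide_df)
```

pivot_table averages the values when several rows share the same index and column pair; here every pair occurs once, so the values pass through unchanged.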

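# Visualizing the distribution of values: histogram, box plot
# Filter the 2007 rows and draw histograms of population, life expectancy, and GDP per capita in three subplots: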

gapminder_2007 = gapminder[gapminder['year'] == 2007]
gapminder_2007[['pop', 'gdpPercap', 'lifeExp']].hist(bins = 15)
plt.show()

# Or draw only the GDP per capita histogram:
gapminder_2007[['gdpPercap']].plot(kind = 'hist', title = 'GDP Per Capita in 2007', legend = False, bins = 15)
plt.show()

# Or draw the GDP per capita histogram with a different color for each continent:
gapminder_continent_pivot = gapminder_2007.pivot_table(values = 'gdpPercap', columns = 'continent', index = 'country')
gapminder_continent_pivot.plot(kind = 'hist', alpha=0.5, bins = 20, title = 'GDP Per Capita by Continent')
plt.show()

# Or draw GDP per capita as one box plot per continent
gapminder_continent_pivot.plot(kind = 'box', title = 'GDP Per Capita by Continent')
plt.show()

# Visualizing correlation: scatter plot, hexbin plot
gapminder_2007.plot(kind = 'scatter', x = 'gdpPercap', y = 'lifeExp', title = 'Wealth vs. Health in 2007')
plt.show()
# Switch to a hexbin plot
gapminder_2007.plot(kind = 'hexbin', x = 'gdpPercap', y = 'lifeExp', title = 'Wealth vs. Health in 2007', gridsize = 20)
plt.show()

# Visualizing rankings: bar chart
# Plot the total population of each continent in 2007:
summarized_df = gapminder[gapminder['year'] == 2007].groupby(by = 'continent')['pop'].sum()
summarized_df.plot(kind = 'bar', rot = 0)
plt.show()

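# Or plot mean life expectancy and mean GDP per capita by continent in 2007
# (the column subset must be a list; pandas 2.x rejects a bare tuple of names here):
mean_df = gapminder[gapminder['year'] == 2007].groupby('continent')[['lifeExp','gdpPercap']].mean()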
mean_df = mean_df.reset_index()

mean_df.head()
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(mean_df['continent'], mean_df['lifeExp'], '-', label = 'lifeExp')
ax2 = ax.twinx()
ax2.plot(mean_df['continent'], mean_df['gdpPercap'], '-r', label = 'gdpPercap')
ax.set_ylim(40,100)
ax2.set_ylim(0, 50000)
ax.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()
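
The twinx() code above builds the second y-axis by hand. pandas can also set it up through the secondary_y argument of df.plot(). This is a sketch of that alternative rather than a replacement for the code above. The means are the 2007 continent values from this post, rounded to two decimals, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import pandas as pd

# 2007 continent means from the table in this post, rounded to two decimals
mean_df = pd.DataFrame({
    "continent": ["Africa", "Americas", "Asia", "Europe", "Oceania"],
    "lifeExp": [54.81, 73.61, 70.73, 77.65, 80.72],
    "gdpPercap": [3089.03, 11003.03, 12473.03, 25054.48, 29810.19],
}).set_index("continent")

# lifeExp stays on the left axis; gdpPercap is drawn against a right-hand axis
ax = mean_df.plot(secondary_y=["gdpPercap"], title="lifeExp and gdpPercap by continent, 2007")
ax.set_ylabel("lifeExp")
ax.right_ax.set_ylabel("gdpPercap")
```

The returned ax is the left-hand axis, and the right-hand one is reachable as ax.right_ax for labels or limits.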
