1: Framework
- Scrape the Douban Top 250 movie list with requests + BeautifulSoup.
2: Workflow
2.1 Fetching the page with requests
- As shown in the figure above, all of the movie information sits under the `ol` tag; our task is to point BeautifulSoup at that location in the fetched page.
- The code below uses requests to fetch the raw HTML:
```python
def get_html(url, code="utf-8"):
    # Douban rejects the default requests User-Agent, so send a browser-like one
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        print("request failed")
        return ""
```
- The URL pattern (https://movie.douban.com/top250?start=25&filter=) shows that each "next page" click increases start by 25, for ten pages in total.
- So on each request we fill in the start parameter like this:
```python
for i in range(10):
    get_url = url.format(i*25)
```
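As a quick sanity check, the `url` template expands to the ten page URLs like this (a minimal standalone sketch, no network access needed):

```python
# Template used by the crawler; {} is filled with the start offset
url = "https://movie.douban.com/top250?start={}"

# Build the ten page URLs, one per block of 25 films
page_urls = [url.format(i * 25) for i in range(10)]

print(page_urls[0])   # start=0   covers films 1-25
print(page_urls[-1])  # start=225 covers films 226-250
```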
- The fetched HTML is handed to BeautifulSoup for parsing; `find` locates the target tags, and the extracted fields are appended to a list:
```python
def parse_html(html, i):
    soup = BeautifulSoup(html, "html.parser")
    info_page = soup.find("ol", attrs={"class": "grid_view"})
    films = info_page.find_all('li')
    for film in films:
        rank = film.find("em").text
        name = film.find("span", attrs={"class": "title"}).text
        # A few films have no one-line review, so guard against a missing tag
        inq = film.find('span', attrs={'class': 'inq'})
        description = inq.text if inq else ""
        score = film.find("span", attrs={"class": "rating_num"}).text
        list_info.append([int(rank) + i*25, name, description, score])
```
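The same `find` logic can be exercised offline against a minimal HTML fragment that mimics the markup described above (the fragment itself is made up for illustration, not copied from Douban):

```python
from bs4 import BeautifulSoup

# Made-up fragment with the same tag/class structure the crawler expects
html = """
<ol class="grid_view">
  <li>
    <em>1</em>
    <span class="title">肖申克的救赎</span>
    <span class="rating_num">9.7</span>
    <span class="inq">希望让人自由。</span>
  </li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
film = soup.find("ol", attrs={"class": "grid_view"}).find("li")
rank = film.find("em").text
name = film.find("span", attrs={"class": "title"}).text
score = film.find("span", attrs={"class": "rating_num"}).text
print(rank, name, score)
```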
- Finally, store all the results, here mainly in an Excel file. The complete code for the whole process is below:
```python
from bs4 import BeautifulSoup
import requests
import xlwt

list_info = []
url = "https://movie.douban.com/top250?start={}"

def get_html(url, code="utf-8"):
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        print("request failed")
        return ""

def parse_html(html, i):
    soup = BeautifulSoup(html, "html.parser")
    info_page = soup.find("ol", attrs={"class": "grid_view"})
    films = info_page.find_all('li')
    for film in films:
        rank = film.find("em").text
        name = film.find("span", attrs={"class": "title"}).text
        inq = film.find('span', attrs={'class': 'inq'})
        description = inq.text if inq else ""
        score = film.find("span", attrs={"class": "rating_num"}).text
        list_info.append([int(rank) + i*25, name, description, score])

def save_Films(list_infos, filePath):
    # Append one record per line to a plain-text file
    with open(filePath, 'a', encoding="utf-8") as f:
        for info in list_infos:
            f.write(str(info) + '\n')

def save_excel(list_infos, filename, list_names):
    workbook = xlwt.Workbook()
    sheet_01 = workbook.add_sheet("sheet_01")
    row = 0
    col = 0
    # Header row
    for list_name in list_names:
        sheet_01.write(row, col, list_name)
        col += 1
    row += 1
    # One row per film
    for info in list_infos:
        col = 0
        for field in info:
            sheet_01.write(row, col, field)
            col += 1
        row += 1
    workbook.save(filename)

if __name__ == "__main__":
    file_name = "./douban_text.xls"
    list_names = ["排名", "名字", "短评", "评分"]  # rank, title, one-line review, score
    for i in range(10):
        get_url = url.format(i*25)
        html = get_html(url=get_url)  # fetch the i-th page
        parse_html(html, i)
    save_excel(list_info, file_name, list_names)
```
- save_Films writes the records to a plain-text file instead; to use it, change the file_name suffix (e.g. to .txt) and call it in place of save_excel.
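As an alternative to xlwt, the same records could also be written as CSV with the standard library. This is a minimal sketch, not part of the original script; the sample records and the output filename `douban_top250.csv` are made up for illustration:

```python
import csv

# Hypothetical sample records in the same [rank, name, description, score] shape
list_info = [
    [1, "肖申克的救赎", "希望让人自由。", "9.7"],
    [2, "霸王别姬", "风华绝代。", "9.6"],
]
list_names = ["排名", "名字", "短评", "评分"]

# utf-8-sig adds a BOM so Excel displays the Chinese text correctly
with open("douban_top250.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(list_names)   # header row
    writer.writerows(list_info)   # one row per film
```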