
1: Framework

  • Scrape the information of the Douban Top 250 movies with requests + BeautifulSoup

2: Workflow

2.1 Fetching the page information with requests

Page source (screenshot omitted)
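The original screenshot is not reproduced here. The snippet below is a rough, simplified sketch of what the relevant markup is assumed to look like (tag and class names are taken from the parsing code further down; the values are placeholders), together with how BeautifulSoup locates it:

from bs4 import BeautifulSoup

# simplified, hypothetical markup -- the real page contains many more tags and attributes
sample = """
<ol class="grid_view">
  <li>
    <em>1</em>
    <span class="title">FILM_TITLE</span>
    <span class="rating_num">9.0</span>
    <span class="inq">ONE_LINE_QUOTE</span>
  </li>
</ol>
"""
soup = BeautifulSoup(sample, "html.parser")
ol = soup.find("ol", attrs={"class": "grid_view"})
print(ol.find("span", attrs={"class": "title"}).text)  # FILM_TITLE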

  • As sketched above, all of the movie information sits inside an <ol> tag; the task is to hand the fetched page to BeautifulSoup and navigate to that element.
  • The code below uses requests to download the page:
def get_html(url, code="utf-8"):
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        print("request failed:", url)
  • The URL pattern (https://movie.douban.com/top250?start=25&filter=) shows that each click on "next page" adds 25 to the start parameter, and there are 10 pages in total.
  • So each request is built by filling in that parameter:
    for i in range(10):
        get_url = url.format(i*25)
  • The fetched HTML is handed to BeautifulSoup for parsing; find locates the target tags and the extracted fields are appended to a list:
def parse_html(html, i):
    soup = BeautifulSoup(html, "html.parser")
    info_page = soup.find("ol", attrs={"class": "grid_view"})
    films = info_page.find_all('li')
    for film in films:
        rank = film.find("em").text  # rank
        name = film.find("span", attrs={"class": "title"}).text  # title
        inq = film.find('span', attrs={'class': 'inq'})
        description = inq.text if inq else ""  # some entries may have no one-line quote
        score = film.find("span", attrs={"class": "rating_num"}).text  # rating
        list_info.append([int(rank) + i*25, name, description, score])
  • Finally, all results are stored, mainly in an Excel file; the complete code for the whole process is as follows:
# Scrape the Douban Top 250 movie list with beautifulsoup + requests
from bs4 import BeautifulSoup
import requests
import xlwt

list_info = []
url = "https://movie.douban.com/top250?start={}"


def get_html(url, code="utf-8"):
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        print("request failed:", url)


def parse_html(html, i):
    soup = BeautifulSoup(html, "html.parser")
    info_page = soup.find("ol", attrs={"class": "grid_view"})
    films = info_page.find_all('li')
    for film in films:
        rank = film.find("em").text  # rank
        name = film.find("span", attrs={"class": "title"}).text  # title
        inq = film.find('span', attrs={'class': 'inq'})
        description = inq.text if inq else ""  # some entries may have no one-line quote
        score = film.find("span", attrs={"class": "rating_num"}).text  # rating
        list_info.append([int(rank) + i*25, name, description, score])


def save_Films(list_infos, filePath):
    # plain-text alternative: append one film per line to filePath
    with open(filePath, 'a', encoding="utf-8") as f:
        for info in list_infos:
            f.write(str(info) + '\n')


def save_excel(list_infos, filename, list_names):
    workbook = xlwt.Workbook()
    sheet_01 = workbook.add_sheet("sheet_01")
    row = 0
    col = 0
    for list_name in list_names:  # header row
        sheet_01.write(row, col, list_name)
        col += 1
    row += 1
    for info in list_infos:  # one row per film
        col = 0
        for value in info:
            sheet_01.write(row, col, value)
            col += 1
        row += 1
    workbook.save(filename)


if __name__ == "__main__":
    file_name = "./douban_text.xls"
    list_names = ["Rank", "Title", "Quote", "Rating"]
    for i in range(10):
        get_url = url.format(i*25)
        html = get_html(url=get_url)
        if html:  # skip pages that failed to download
            parse_html(html, i)
    save_excel(list_info, file_name, list_names)
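
To quickly spot-check the generated spreadsheet, it can be read back with xlrd; this is only a sketch and assumes xlrd is installed alongside xlwt:

import xlrd

book = xlrd.open_workbook("./douban_text.xls")
sheet = book.sheet_by_index(0)
for r in range(min(sheet.nrows, 5)):  # header row plus the first few films
    print(sheet.row_values(r))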
  • save_Films writes the same results to a plain txt file instead of Excel; to use it, just give file_name a .txt extension and call save_Films, as in the sketch below.
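A minimal sketch (reusing the names defined in the script above) of swapping the Excel step for the plain-text one:

file_name = "./douban_text.txt"   # .txt instead of .xls
save_Films(list_info, file_name)  # writes one film per line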