All the chapter titles and content
Although I feel that XPath and regular expressions are enough for data parsing, BeautifulSoup is still a hurdle you can't get around.
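For comparison, here is a minimal sketch showing the same links extracted once with lxml's XPath and once with BeautifulSoup's CSS selectors. The HTML snippet is a made-up fragment modeled on the page structure used below; the variable names are illustrative.

from lxml import etree
from bs4 import BeautifulSoup

html = '<div class="book-mulu"><ul><li><a href="/book/sanguoyanyi/1.html">第一回</a></li></ul></div>'

# XPath via lxml
tree = etree.HTML(html)
titles_xpath = tree.xpath('//div[@class="book-mulu"]//li/a/text()')

# The same selection with BeautifulSoup CSS selectors
soup = BeautifulSoup(html, 'lxml')
titles_bs = [a.string for a in soup.select('.book-mulu li a')]

print(titles_xpath, titles_bs)  # both print ['第一回']

Both approaches land on the same elements; which one reads better is largely a matter of taste.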
import requests
from bs4 import BeautifulSoup
import time

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.50'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    # The next two lines avoid mojibake; if the Chinese text comes back clean,
    # you can simply write: page_text = requests.get(url=url, headers=headers).text
    r = requests.get(url=url, headers=headers).content  # fetch the response as raw bytes
    page_text = str(r, 'utf-8')  # decode the bytes to text
    # Parse the chapter titles and detail-page links out of the index page
    # 1. Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')
    # Select the chapter list items (title + detail-page link)
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('./sanguo.txt', 'w', encoding='utf-8')  # create the output file
    for li in li_list:
        # Chapter title
        title = li.a.string
        detail_url = 'https://www.shicimingju.com' + li.a['href']
        # Request the detail page
        detail_r = requests.get(url=detail_url, headers=headers).content
        detail_page_text = str(detail_r, 'utf-8')
        # Parse the detail page to get the chapter content
        detail_page_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_page_soup.find('div', class_='chapter_content')
        # Chapter content
        content = div_tag.text
        fp.write(title + ':' + content + '\n')  # append this chapter to the file
        print(title, 'crawled successfully.......')
        time.sleep(0.1)  # be polite to the server between requests
    fp.close()  # flush and close the output file
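Instead of decoding the raw bytes by hand with str(r, 'utf-8'), requests can do the decoding itself. The following is a minimal sketch of that alternative, assuming the page really is UTF-8 (apparent_encoding guesses the charset from the body when the response header is missing or wrong):

import requests

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
r = requests.get(url)
# Tell requests which charset to use before reading .text
r.encoding = r.apparent_encoding  # or simply r.encoding = 'utf-8'
page_text = r.text  # decoded text, no manual byte handling needed

As a design note, opening the output file with "with open('./sanguo.txt', 'w', encoding='utf-8') as fp:" would also close it automatically even if a request fails partway through.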