BeautifulSoup Crawler Code

This is the same website as last time, but this time we crawl it with requests + BeautifulSoup.

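The approach is essentially the same as before. The only change is in how the information is located: instead of the methods of Python's built-in str type, we now use the third-party BeautifulSoup module. Its documentation is available online in a good Chinese translation and is easy to follow on a first read. I kept comments to a minimum this time because the logic is the same as last time; for the full commentary, see the previous post.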
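To make the contrast concrete, here is a minimal sketch that locates a title, a body and a set of links with BeautifulSoup rather than with `str.find` and slicing. The HTML fragment and the example.com URLs are made up for illustration; only the `h2` tag and the class names `liszw` and `arwzks` come from the script in this post. It is written for Python 3.

```python
from bs4 import BeautifulSoup

# A made-up fragment for illustration; the real pages will look different.
SAMPLE_HTML = """
<html><body>
  <h2>Sample title</h2>
  <div class="arwzks"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div class="liszw">
    <a href="http://example.com/a.html" target="_blank">A</a>
    <a href="http://example.com/b.html" target="_blank">B</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate by tag name: the first <h2> in the document.
title = soup.find("h2").get_text().strip()

# Locate by CSS class. The keyword is class_ (with an underscore)
# because "class" is a reserved word in Python.
body = soup.find(class_="arwzks").get_text(" ", strip=True)

# Collect link targets by reading the href attribute of each <a>.
links = [a["href"] for a in soup.find(class_="liszw").find_all("a", href=True)]

print(title)
print(body)
print(links)
```

Reading attributes from the parsed tree does not depend on the order of attributes inside a tag, which is where string slicing tends to break.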

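Storage also differs from the urllib version. Last time the raw HTML was saved directly, so you needed a tool that handles structured documents to read an article, and the files were named after the site's URLs, which says nothing about what is inside. This time the scraped article title becomes the file name, and the text is saved as a plain .txt file.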
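The storage idea can be sketched on its own, without any network access. The helper name `save_article`, the sample title and the temporary directory below are all hypothetical. Replacing path separators in the title is an added precaution, not something the script in this post does. It is written for Python 3.

```python
import tempfile
from pathlib import Path


def save_article(directory, title, text):
    """Write text to <directory>/<title>.txt as UTF-8 and return the path."""
    # A path separator inside a title would be read as a directory,
    # so replace it before building the file name.
    safe_title = title.strip().replace("/", "_").replace("\\", "_")
    path = Path(directory) / (safe_title + ".txt")
    # An explicit encoding keeps Chinese text intact whatever the system locale is.
    path.write_text(text, encoding="utf-8")
    return path


# Usage with a throwaway directory and made-up content.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_article(tmp, "Locked room / basics", "Article body goes here.")
    print(saved.name)
```

Naming the file after the title makes the download folder browsable at a glance. The trade-off is that two articles with the same title would overwrite each other.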

#!/usr/bin/env python3
# coding=utf-8
import time

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with every request
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
for page in range(1, 39):
    # Fetch one page of the article list; the pages are decoded as GB2312
    htmls = requests.get('http://tuilixue.com/zhentantuilizhishi/list_4_' + str(page) + '.html', headers=header)
    htmls.encoding = 'gb2312'
    pageContent = htmls.text
    bsContent = BeautifulSoup(pageContent, 'html.parser')
    # The article links sit inside the element with class "liszw"
    urlContent = bsContent.find(class_="liszw")
    if urlContent is None:
        continue
    # Read each link's href attribute directly; keep at most 20 per page
    url = [a['href'] for a in urlContent.find_all('a', href=True)][:20]
    # Visit every second link (each article is presumably linked twice in the list)
    for i in range(0, len(url), 2):
        articleHtml = requests.get(url[i], headers=header)
        articleHtml.encoding = 'gb2312'
        articleContent = articleHtml.text
        bsContent = BeautifulSoup(articleContent, 'html.parser')
        # A '/' in the title would be read as a directory separator
        title = bsContent.find('h2').get_text().strip().replace('/', '_')
        content = bsContent.find(class_="arwzks")
        article = content.get_text()
        print(title + ' start')
        # Use the article title as the file name and write the text as UTF-8
        with open('/home/wukong/testTuiLiXue/download/' + title + '.txt', 'w', encoding='utf-8') as f:
            f.write(article)
        print(title + ' end')
        time.sleep(1)
print('finish')
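
The script sets the response encoding to gb2312 before reading `.text`. The small experiment below suggests why: when the bytes are decoded with a different codec, the Chinese text does not survive. The sample string is made up, and the sketch uses only the standard library in Python 3. As far as I know, requests falls back to ISO-8859-1 for text responses that declare no charset, so treat that codec choice as an assumption about the failure mode, not a description of this site.

```python
# Bytes as a GB2312 page would deliver them (made-up sample text).
text = "侦探推理"
raw = text.encode("gb2312")

# Decoding with the matching codec restores the original string.
restored = raw.decode("gb2312")
print(restored == text)

# ISO-8859-1 maps every byte to one character, so decoding never fails,
# but the result is mojibake rather than Chinese.
garbled = raw.decode("iso-8859-1")
print(len(text), len(garbled))

# UTF-8 is stricter and will normally reject such bytes outright.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decoding failed:", exc.reason)
else:
    print("utf-8 decoding happened to succeed, but the text is still wrong")
```

With requests, assigning `response.encoding` before the first access to `response.text` has the same effect as choosing the codec by hand here.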