py + bs4 + requests: a multithreaded crawler

Same site as before, but this time, instead of fetching the articles one by one, we'll crawl them with multiple threads.

Python's multithreading has long been a contested topic: because of the Global Interpreter Lock (GIL), CPython executes Python bytecode on only one core at a time, even on multi-core hardware, which is why Python threads are sometimes dismissed as "pseudo-threads". For I/O-bound work like web crawling, however, threads still beat a single thread by a wide margin, because the GIL is released while a thread is blocked waiting on network I/O, so the waits overlap.
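A quick way to see this is to time the same batch of "downloads" serially and through a thread pool. This is a minimal sketch: `fake_download` is a hypothetical stand-in for a network request, using `time.sleep` to simulate I/O wait (the GIL is released during `sleep`, just as it is during socket reads):

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

def fake_download(url):
    # Stand-in for a network request: spends its time blocked on "I/O"
    time.sleep(0.2)
    return url

urls = ['page-%d' % i for i in range(8)]

# Serial: the waits add up (~8 * 0.2 s)
start = time.time()
serial = [fake_download(u) for u in urls]
serial_elapsed = time.time() - start

# Threaded: 4 workers overlap their waits (~2 * 0.2 s)
pool = ThreadPool(4)
start = time.time()
threaded = pool.map(fake_download, urls)
pool.close()
pool.join()
threaded_elapsed = time.time() - start

print(serial_elapsed, threaded_elapsed)
```

`pool.map` returns results in input order, so `threaded` matches `serial` even though the calls ran concurrently.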

The approach is the same as before; the only change is that the final download step now runs through a thread pool.

#!/usr/bin/python
# -*- coding: utf8 -*-
import requests
import time
from multiprocessing.dummy import Pool as ThreadPool
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}

# Fetch the page source
def getHtml(url):
    htmls = requests.get(url, headers=header)
    # The site is GBK-encoded, so set the encoding before decoding
    htmls.encoding = 'gb2312'
    # .text decodes the response body into a string
    pageContent = htmls.text
    return pageContent

# Collect the article URLs from a list page
def getContentUrl(html):
    urls = []
    bsContent = BeautifulSoup(html, 'html.parser')
    urlContent = bsContent.find(class_="liszw")
    for link in urlContent.find_all('a'):
        url_lib = link.get('href')
        urls.append(url_lib)
    return urls

# Download one article and save it to disk
def readContent(urls):
    articleHtml = getHtml(urls)
    bsContent = BeautifulSoup(articleHtml, 'html.parser')
    title = bsContent.find('h2').string
    content = bsContent.find(class_="arwzks")
    article = content.get_text()
    txt = article.encode('utf-8')
    print title + ' start'
    with open(r'/home/wukong/testTuiLiXue/download/' + title.encode('utf-8') + '.txt', 'w+') as f:
        f.write(txt)
    print time.strftime('%Y-%m-%d %X', time.localtime(time.time())) + title + ' end'
    return txt

if __name__ == '__main__':
    # Pool size; the GIL means only one core runs Python code at a time,
    # but four workers can still overlap their network waits
    pool = ThreadPool(4)
    links = []
    for i in range(1, 40):
        links.append('http://tuilixue.com/zhentantuilizhishi/list_4_' + str(i) + '.html')
    for link in links:
        html = getHtml(link)
        urls = getContentUrl(html)
        # Keep every other of the 20 links on the list page
        url = []
        for i in range(0, 20, 2):
            url.append(urls[i])
        result = pool.map(readContent, url)
    pool.close()
    pool.join()
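The script above is written for Python 2 (`print` statements, manual `encode` calls). On Python 3 the same fan-out step is usually written with `concurrent.futures`, which handles pool shutdown via a context manager. This is a sketch only: `read_content` here is a placeholder for the `readContent` function above, so the example runs without touching the real site.

```python
from concurrent.futures import ThreadPoolExecutor

def read_content(url):
    # Placeholder for the readContent step above; a real version would
    # fetch the page with requests and parse it with BeautifulSoup
    return 'downloaded ' + url

urls = ['http://tuilixue.com/zhentantuilizhishi/list_4_%d.html' % i
        for i in range(1, 4)]

# Like pool.map above, executor.map preserves input order;
# the with-block replaces the manual pool.close()/pool.join()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(read_content, urls))

print(results)
```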

Original article: http://damiantuan.xyz/2017/11/19/py-bs-requests多线程爬虫/
Please credit the source when reposting. Thanks!
