Building a High-Concurrency Crawler in Python with grequests

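I used to rely on multithreading to speed up my crawlers. Because of Python's GIL, though, only one thread can execute Python bytecode at a time, so multithreading is not true parallelism. While one thread is waiting for the response to a request it has sent, the interpreter switches to another thread. The threads take turns, which makes this pseudo-concurrency: it overlaps the waiting, not the computing.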
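To make the "taking turns while waiting" behaviour concrete, here is a minimal sketch using only the standard library. `time.sleep` is an assumption standing in for the time a real request spends waiting on the network, and `fake_fetch`, `run_sequential` and `run_threaded` are illustrative names, not part of any library.

```python
import threading
import time


def fake_fetch(delay, results, index):
    # time.sleep stands in for waiting on a network response (assumption).
    # Like a blocking socket read, it releases the GIL while it waits.
    time.sleep(delay)
    results[index] = index


def run_sequential(n, delay):
    """Run n fake fetches one after another; return (elapsed, results)."""
    results = [None] * n
    start = time.perf_counter()
    for i in range(n):
        fake_fetch(delay, results, i)
    return time.perf_counter() - start, results


def run_threaded(n, delay):
    """Run n fake fetches in n threads; return (elapsed, results)."""
    results = [None] * n
    threads = [
        threading.Thread(target=fake_fetch, args=(delay, results, i))
        for i in range(n)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results


if __name__ == "__main__":
    seq_time, _ = run_sequential(5, 0.2)
    thr_time, _ = run_threaded(5, 0.2)
    print("sequential: {:.2f}s, threaded: {:.2f}s".format(seq_time, thr_time))
```

The threaded run should finish in roughly the time of one wait, because the waits overlap even though only one thread runs Python code at any instant.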

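The grequests library wraps gevent and requests and sends requests concurrently on gevent's lightweight coroutines (greenlets) in place of OS threads. That does not mean it can run as many requests at once as you throw at it. There is a practical ceiling, so use the size parameter to cap how many requests are in flight at the same time.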
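The effect of a concurrency cap can be observed without touching the network. The sketch below swaps in a standard-library thread pool as a stand-in for the greenlet pool that grequests uses, so it illustrates the idea of a cap and is not how grequests is implemented. `peak_in_flight` and `fake_request` are hypothetical names, and `time.sleep` again stands in for network latency.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_in_flight(num_tasks, size, delay=0.05):
    """Run num_tasks fake requests with at most `size` running at once.

    Returns the highest number of fake requests observed in flight together.
    """
    lock = threading.Lock()
    state = {"current": 0, "peak": 0}

    def fake_request(_):
        with lock:
            state["current"] += 1
            state["peak"] = max(state["peak"], state["current"])
        time.sleep(delay)  # stand-in for waiting on the server (assumption)
        with lock:
            state["current"] -= 1

    # max_workers plays the role that size plays in grequests.map.
    with ThreadPoolExecutor(max_workers=size) as pool:
        list(pool.map(fake_request, range(num_tasks)))
    return state["peak"]


if __name__ == "__main__":
    print(peak_in_flight(20, 5))
```

However many tasks are queued, the peak never exceeds the cap, which is the behaviour the size parameter is meant to give you.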


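import grequests

urls = ["https://www.baidu.com", "https://www.hao123.com", "https://www.taobao.com"]
# Build the unsent requests lazily; nothing is sent until map() is called.
reqs = (grequests.get(u) for u in urls)
# size=10 allows at most 10 requests in flight at once.
resps = grequests.map(reqs, size=10)
print(resps)
for r in resps:
    # map() returns None in place of a request that failed.
    if r is not None:
        print(r.text)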
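One thing to watch for: as far as I know, grequests.map returns its results in the same order as the requests and puts None in the slot of any request that failed (when no exception handler is supplied). A small helper can pair each URL with its response and collect the failures. `split_results` is a hypothetical name of my own, and the helper is plain Python that works on any two lists of equal length.

```python
def split_results(urls, responses):
    """Pair each URL with its response and separate out the failures.

    `responses` is expected to be in the same order as `urls`, with None
    marking a request that did not produce a response.
    """
    ok, failed = [], []
    for url, resp in zip(urls, responses):
        if resp is None:
            failed.append(url)
        else:
            ok.append((url, resp))
    return ok, failed
```

Passing the urls list together with the list returned by grequests.map gives you the successful (url, response) pairs and a list of URLs worth retrying. This needs urls to be a list, not a generator that has already been consumed.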


Comparing grequests with requests

# coding: utf-8
# grequests is imported first so that gevent's patching is applied
# before requests is loaded.
import grequests
import time
import json
import requests


adata = json.dumps({"key": "value"})
header = {"Content-Type": "application/json", "Accept": "application/json"}


def use_grequests(num):
    # Build num unsent POST requests, then send them with at most 5 in flight.
    urls = ["http://hao.jobbole.com/python-docx/" for _ in range(num)]
    task = [grequests.request("POST", url, data=adata, headers=header) for url in urls]
    return grequests.map(task, size=5)


def use_requests(num):
    # Send the same num POST requests one at a time.
    urls = ["http://hao.jobbole.com/python-docx/" for _ in range(num)]
    resps = []
    for index, url in enumerate(urls, start=1):
        resps.append(requests.post(url=url, headers=header, data=adata))
        if index % 10 == 0:
            print('Now on request number {}'.format(index))
    return resps


def main(num):
    time1 = time.time()
    final_res = use_requests(num)
    print(final_res)
    time2 = time.time()
    T = time2 - time1
    print('use_requests took {1} seconds to send {0} requests'.format(num, T))

    print('Sending requests with the grequests module...')
    time3 = time.time()
    final_res2 = use_grequests(num)
    print(final_res2)
    time4 = time.time()
    T2 = time4 - time3
    print('use_grequests took {1} seconds to send {0} requests'.format(num, T2))


if __name__ == '__main__':
    main(100)
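The start/stop timing in main is written out twice. If you want to compare more variants, a small helper keeps the measurement in one place. This is only a sketch: `timed` is a name I made up, and it uses time.perf_counter, which is better suited to measuring intervals than time.time.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With this in place, a call such as timed(use_requests, 100) returns both the responses and the elapsed seconds, and the same call works for use_grequests.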


Keywords: crawler, grequests
