Scrapy is one of Python's most powerful crawling frameworks: efficient, stable, and ready for distributed crawling!
| Feature | Description |
| --- | --- |
| High performance | Asynchronous I/O; often an order of magnitude faster than sequential requests-based code |
| Complete feature set | Requesting, parsing, and storage in one framework |
| Middleware | Supports custom extensions |
| Distributed | Scrapy-Redis enables distributed crawling |
| Easy debugging | Interactive shell mode |
II. Installation

```bash
pip install scrapy
scrapy version   # verify the install
```
III. Quick Start

1. Create a project

```bash
scrapy startproject tutorial
cd tutorial
```
2. Project structure

```
tutorial/
├── scrapy.cfg
└── tutorial/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```
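Instead of writing the spider by hand as in step 4 below, you can also let Scrapy scaffold a skeleton into the spiders/ directory. This optional step (not part of the original walkthrough) uses names matching the spider we build next:

```bash
# Creates tutorial/spiders/quotes.py with a QuotesSpider stub
scrapy genspider quotes quotes.toscrape.com
```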
3. Define an Item

```python
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```
4. Write a Spider

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
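The spider above yields plain dicts. If you'd rather use the QuoteItem defined in step 3, the parse callback can build items explicitly — a minimal sketch of the same loop:

```python
from tutorial.items import QuoteItem

def parse(self, response):
    for quote in response.css('div.quote'):
        item = QuoteItem()
        item['text'] = quote.css('span.text::text').get()
        item['author'] = quote.css('small.author::text').get()
        item['tags'] = quote.css('div.tags a.tag::text').getall()
        yield item  # pipelines receive a QuoteItem instead of a dict
```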
IV. Running the Spider

```bash
scrapy crawl quotes -o quotes.json   # JSON
scrapy crawl quotes -o quotes.csv    # CSV
scrapy crawl quotes -o quotes.jl     # JSON Lines
```
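In newer Scrapy releases the -o flag can also be replaced by the FEEDS setting, which keeps the export configuration inside the project; a minimal sketch:

```python
# settings.py — equivalent to `scrapy crawl quotes -o quotes.json`
FEEDS = {
    'quotes.json': {
        'format': 'json',
        'encoding': 'utf8',
    },
}
```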
V. Selectors

CSS selectors

```python
response.css('title::text').get()
response.css('div.quote span.text::text').getall()
response.css('a::attr(href)').get()
response.css('li.next a').xpath('@href').get()   # CSS and XPath can be chained
```
XPath selectors

```python
response.xpath('//div[@class="quote"]//span[@class="text"]/text()').get()
response.xpath('//div[@class="quote"]').xpath('.//a[@class="tag"]/text()').getall()
```
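Both selector styles additionally support regex extraction through .re() and .re_first(), which apply a pattern to the extracted text; a small sketch:

```python
# First quote text without the surrounding curly quotes
response.css('span.text::text').re_first(r'“(.+)”')

# All author names starting with "A"
response.css('small.author::text').re(r'^A.*')
```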
VI. Pagination and Following Links

```python
# Build a Request by hand
yield scrapy.Request(url, callback=self.parse)

# response.follow resolves relative URLs for you
yield response.follow(next_page, callback=self.parse)

# response.follow also accepts <a> selectors directly
for a in response.css('li.next a'):
    yield response.follow(a, self.parse)
```
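Scrapy 2.0 added response.follow_all, which condenses the loop above into one call — a minimal sketch:

```python
# Follow every matched link and parse each resulting page
yield from response.follow_all(css='li.next a', callback=self.parse)
```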
VII. Simulating Login

```python
import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        # from_response pre-fills the form found on the login page
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        # quotes.toscrape.com shows a "Logout" link once logged in
        if 'Logout' in response.text:
            self.logger.info('✅ Login succeeded!')
            yield scrapy.Request('http://quotes.toscrape.com/',
                                 callback=self.parse_quotes)

    def parse_quotes(self, response):
        # Continue scraping with the authenticated session here
        ...
```
VIII. Settings

```python
# settings.py
BOT_NAME = 'tutorial'

# Concurrency and politeness
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True

# Headers attached to every request by default
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

LOG_LEVEL = 'INFO'
```
IX. Storing Data

Processing with a Pipeline

```python
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item
```
Enabling the Pipeline

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWriterPipeline': 300,
}
```
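Pipelines can also validate and drop items: raising DropItem keeps an item from reaching later pipelines and the feed exporters. A small sketch (the class name is illustrative); register it in ITEM_PIPELINES with a number below the writer's 300 so it runs first:

```python
from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    def process_item(self, item, spider):
        # Discard any quote that somehow lost its text
        if not item.get('text'):
            raise DropItem(f'Missing text in {item!r}')
        return item
```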
X. Proxies and User-Agents

Random User-Agent

```python
import random

class RandomUserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0...',
            'Mozilla/5.0...',
        ]

    def process_request(self, request, spider):
        # Assign a random UA to each outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)
```
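The middleware only runs once it is registered in settings.py; assuming the class lives in tutorial/middlewares.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgentMiddleware': 400,
}
```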
Proxies

A proxy is set per request through request.meta['proxy'], again inside a downloader middleware (the ProxyMiddleware class name here is illustrative):

```python
class ProxyMiddleware:
    def process_request(self, request, spider):
        # Route the request through an HTTP proxy (replace ip:port)
        request.meta['proxy'] = 'http://ip:port'
```
XI. Hands-On: Scraping Douban Top 250

```python
import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']

    def start_requests(self):
        # 10 pages of 25 movies each
        for i in range(0, 250, 25):
            url = f'https://movie.douban.com/top250?start={i}&filter='
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for movie in response.css('div.item'):
            rank = movie.css('em::text').get()
            title = movie.css('span.title::text').get()
            rating = movie.css('span.rating_num::text').get()
            yield {
                'rank': rank,
                'title': title,
                'rating': rating,
            }
```
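Note that douban is known to reject requests carrying Scrapy's default User-Agent, so in practice this spider usually also needs a browser-like UA — for example via the USER_AGENT setting (the value below is simply the header already used in section VIII):

```python
# settings.py — douban tends to refuse the default Scrapy User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
```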
XII. Scrapy Shell

```bash
scrapy shell 'http://quotes.toscrape.com/'
```
Interactive debugging:

```python
response.css('title::text').get()
response.css('div.quote span.text::text').getall()
view(response)                                 # open the response in your browser
fetch('http://quotes.toscrape.com/page/2/')    # load a new page into the shell
```
XIII. Common Commands

| Command | Description |
| --- | --- |
| `scrapy startproject <name>` | Create a project |
| `scrapy genspider <name> <domain>` | Generate a spider |
| `scrapy crawl <name>` | Run a spider |
| `scrapy list` | List all spiders |
| `scrapy shell <url>` | Interactive shell |
| `scrapy view <url>` | Open a page in the browser |
| `scrapy check` | Check spider contracts |
Summary

✅ Scrapy project structure
✅ Defining Items
✅ Writing Spiders
✅ CSS/XPath selectors
✅ Pagination and link following
✅ Simulated login
✅ Pipeline storage
✅ Middleware configuration
Coming next: advanced Scrapy, distributed crawling, and Redis integration.
#PythonCrawler #Scrapy #DataCollection