13 - Python Web Scraping: An Introduction to the Scrapy Framework

Scrapy is one of Python's most powerful web scraping frameworks: efficient, stable, and ready for distributed crawling.

I. Features

| Feature | Description |
| --- | --- |
| High performance | Asynchronous I/O (built on Twisted); far higher throughput than sequential requests-based scripts |
| Complete toolchain | Requesting, parsing, and storing data handled end to end |
| Middleware | Supports custom extensions |
| Distributed | Scrapy-Redis enables distributed crawling |
| Easy debugging | Interactive shell mode |

II. Installation

pip install scrapy
scrapy version    # verify the installation

III. Quick Start

1. Create a project

scrapy startproject tutorial
cd tutorial

2. Project structure

tutorial/
├── scrapy.cfg            # project config file
└── tutorial/
    ├── __init__.py
    ├── items.py          # data models (Items)
    ├── middlewares.py    # middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spiders live here
        └── __init__.py

3. Define an Item

# items.py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
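
For reference, the spider in the next step yields plain dicts, but it could just as well populate this Item. A minimal sketch of a hypothetical variant (the spider name and file are only for illustration):

# spiders/quotes_item.py (hypothetical) - yields QuoteItem objects instead of dicts
import scrapy
from tutorial.items import QuoteItem

class QuotesItemSpider(scrapy.Spider):
    name = 'quotes_item'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item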

4. Write a Spider

# spiders/quotes.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # pagination: follow the "next" link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
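
Besides the scrapy crawl command shown in the next section, a spider can also be run from a plain Python script with CrawlerProcess. A minimal sketch, assuming the module path from the project layout above:

# run_quotes.py (hypothetical) - run the spider without the scrapy CLI
from scrapy.crawler import CrawlerProcess
from tutorial.spiders.quotes import QuotesSpider

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(QuotesSpider)
process.start()   # blocks until the crawl finishes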

IV. Running the Spider

scrapy crawl quotes -o quotes.json    # save as JSON
scrapy crawl quotes -o quotes.csv     # save as CSV
scrapy crawl quotes -o quotes.jl      # save as JSON Lines
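
In recent Scrapy releases (2.1+), output can also be configured in settings.py through the FEEDS setting instead of the -o flag. A minimal sketch:

# settings.py - equivalent of `scrapy crawl quotes -o quotes.json`
FEEDS = {
    'quotes.json': {'format': 'json', 'encoding': 'utf8'},
}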

V. Selectors

CSS selectors

response.css('title::text').get()                      # get the text
response.css('div.quote span.text::text').getall()     # get all matches
response.css('a::attr(href)').get()                     # get an attribute
response.css('li.next a').xpath('@href').get()          # mix CSS and XPath

XPath selectors

response.xpath('//div[@class="quote"]//span[@class="text"]/text()').get()
response.xpath('//div[@class="quote"]').xpath('.//a[@class="tag"]/text()').getall()
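
Selectors (backed by the parsel library) also support defaults, regular-expression extraction, and direct attribute access. A few extra patterns:

response.css('span.text::text').get(default='')        # default value instead of None
response.css('small.author::text').re_first(r'\w+')    # apply a regex to the matched text
response.css('li.next a').attrib.get('href')           # attributes dict of the first match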

VI. Pagination and Link Following

# Method 1: absolute URL
yield scrapy.Request(url, callback=self.parse)

# Method 2: relative URL (recommended)
yield response.follow(next_page, callback=self.parse)

# Method 3: Selector object
for a in response.css('li.next a'):
    yield response.follow(a, self.parse)
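
Scrapy 2.0+ also provides response.follow_all, which builds one request per matched link in a single call. A minimal sketch:

# Method 4 (Scrapy 2.0+): follow every matched link at once
yield from response.follow_all(css='li.next a', callback=self.parse)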

VII. Simulating Login

import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        # the logged-in page on quotes.toscrape.com shows a "Logout" link
        if 'Logout' in response.text:
            self.logger.info('✅ Logged in successfully!')
            yield scrapy.Request('http://quotes.toscrape.com/', self.parse)
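
Scrapy's cookie middleware carries the session cookie forward automatically, so requests yielded after after_login stay logged in. When debugging a login flow, these settings make the cookie traffic visible (a minimal sketch):

# settings.py - cookie handling is on by default; COOKIES_DEBUG logs every cookie sent/received
COOKIES_ENABLED = True
COOKIES_DEBUG = True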

VIII. Settings

# settings.py
BOT_NAME = 'tutorial'

# concurrency
CONCURRENT_REQUESTS = 16

# download delay (seconds)
DOWNLOAD_DELAY = 1

# automatic throttling
AUTOTHROTTLE_ENABLED = True

# default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

# log level
LOG_LEVEL = 'INFO'
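
Settings can also be overridden per spider via the custom_settings class attribute, which takes precedence over settings.py. A minimal sketch:

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # per-spider overrides of the project settings
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'LOG_LEVEL': 'DEBUG',
    }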

IX. Data Storage

Pipeline processing

# pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider yields
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

Enable the pipeline

# settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWriterPipeline': 300,
}
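
Pipelines can also validate or filter items: raising scrapy.exceptions.DropItem discards an item. A minimal sketch of a hypothetical deduplication pipeline (register it in ITEM_PIPELINES the same way, with its own priority number):

# pipelines.py - drop quotes whose text has already been seen (illustrative)
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        text = dict(item).get('text')
        if text in self.seen:
            raise DropItem(f'Duplicate item: {text!r}')
        self.seen.add(text)
        return item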

X. Proxies and User-Agent

Random User-Agent

# middlewares.py
import random

class RandomUserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0...',
            'Mozilla/5.0...',
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
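
Like pipelines, a downloader middleware only takes effect after it is registered in settings.py. A minimal sketch for the class above:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgentMiddleware': 543,
}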

Proxy

def process_request(self, request, spider):
    request.meta['proxy'] = 'http://ip:port'

XI. Hands-On: Scraping Douban Top 250

# spiders/douban.py
import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']

    def start_requests(self):
        for i in range(0, 250, 25):
            url = f'https://movie.douban.com/top250?start={i}&filter='
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for movie in response.css('div.item'):
            rank = movie.css('em::text').get()
            title = movie.css('span.title::text').get()
            rating = movie.css('span.rating_num::text').get()
            yield {
                'rank': rank,
                'title': title,
                'rating': rating,
            }
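
Note that Douban tends to reject requests carrying Scrapy's default User-Agent, so in practice this spider usually needs a browser-like UA and a polite delay. A minimal sketch using custom_settings (the UA string is only an example):

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    custom_settings = {
        # send a browser-like User-Agent and slow down the crawl
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'DOWNLOAD_DELAY': 1,
    }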

XII. Scrapy Shell

scrapy shell 'http://quotes.toscrape.com/'

Interactive debugging:

response.css('title::text').get()
response.css('div.quote span.text::text').getall()
view(response)                                  # open the response in a browser
fetch('http://quotes.toscrape.com/page/2/')     # fetch a new URL in the shell

XIII. Common Commands

| Command | Description |
| --- | --- |
| scrapy startproject <name> | create a project |
| scrapy genspider <name> <domain> | generate a spider |
| scrapy crawl <name> | run a spider |
| scrapy list | list all spiders in the project |
| scrapy shell <url> | interactive shell |
| scrapy view <url> | open a URL in the browser as Scrapy sees it |
| scrapy check | run contract checks on spiders |
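
For example, the quotes spider from the Quick Start section could have been scaffolded with genspider, which fills in the name and allowed domain:

scrapy genspider quotes quotes.toscrape.com    # creates spiders/quotes.py with a skeleton class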

Summary

✅ Scrapy project structure
✅ Item definition
✅ Spider writing
✅ CSS/XPath selectors
✅ Pagination and link following
✅ Simulating login
✅ Pipeline storage
✅ Middleware configuration

Coming up next: advanced Scrapy, distributed crawling, and Redis integration.

#PythonWebScraping #Scrapy #DataCollection

