If the terminal in VS Code does not recognize scrapy, add the path to scrapy.exe to your environment variables.

Getting started

After installation, run scrapy startproject tutorial in the target folder; it will create the following files:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Create our first spider in the tutorial/spiders directory and name it quotes_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

Run the spider from the terminal with scrapy crawl quotes; you will get two files, quotes-1.html and quotes-2.html.

scrapy shell

Before parsing those two files, let's introduce the Scrapy shell, which we use for interactive debugging: scrapy shell <url>

After pip install ipython, the Scrapy shell will use IPython automatically. If IPython does not work for you, find the scrapy.cfg file in the project's root directory and add the following under the [settings] section:

shell = bpython
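For reference, a sketch of what scrapy.cfg could look like after that edit, assuming the tutorial project generated above:

[settings]
default = tutorial.settings
shell = bpython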

Type exit to quit the shell.

scrapy shell "https://quotes.toscrape.com/page/1/"

'''
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s] item {}
[s] request <GET https://quotes.toscrape.com/page/1/>
[s] response <200 https://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
[s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser'''

The lines above list some of the returned objects you can work with.
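For example, the fetch() shortcut listed above downloads another URL and refreshes the local response object; a quick sketch:

fetch('https://quotes.toscrape.com/page/2/')
response.url
# 'https://quotes.toscrape.com/page/2/'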

response.css('title')
# [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

This lets you experiment interactively.

CSS syntax

::text

response.css('title::text').getall()
# ['Quotes to Scrape']

response.css('title').getall()
# ['<title>Quotes to Scrape</title>']

get/getall

get() returns a single result, while getall() returns a list of all matches.
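A minimal sketch contrasting the two in the same shell session:

response.css('title::text').get()
# 'Quotes to Scrape'
response.css('title::text').getall()
# ['Quotes to Scrape']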

Regular expressions

response.css('title::text').re(r'Quotes.*')
# ['Quotes to Scrape']
response.css('title::text').re(r'Q\w+')
# ['Quotes']
response.css('title::text').re(r'(\w+) to (\w+)')
# ['Quotes', 'Scrape']

XPath

The official docs recommend XPath, but I find CSS selectors more convenient to write.

response.xpath('//title')
# [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
response.xpath('//title/text()').get()
# 'Quotes to Scrape'
  • /html/head/title: selects the <title> element inside the document's <head>
  • /html/head/title/text(): selects the text of that <title> element
  • //td: selects all <td> elements
  • //div[@class="mine"]: selects all <div> elements that have class="mine"

Extracting data

'''
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>'''

scrapy shell 'https://quotes.toscrape.com'

Extracting a single quote

response.css("div.quote")
'''
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]'''

Each result has two parts, the selector and its data; the data is the part we work with.

quote = response.css("div.quote")[0]
text = quote.css("span.text::text").get()
text
'''
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”' '''

author = quote.css("small.author::text").get()
author
# 'Albert Einstein'

The response.css selector here uses the form 'tag.class'.

response.css("div.quote") returns all elements with class="quote" in the HTML, and quote here is the first of them. Within each quote, the span.text element holds the quotation and small.author holds the author.

Extracting a group

tags = quote.css("div.tags a.tag::text").getall()
tags
# ['change', 'deep-thoughts', 'thinking', 'world']

Extracting all quotes

for quote in response.css("div.quote"):
    text = quote.css("span.text::text").get()
    author = quote.css("small.author::text").get()
    tags = quote.css("div.tags a.tag::text").getall()
    print(dict(text=text, author=author, tags=tags))
'''
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}'''

Saving the data

scrapy crawl spiderman -O spn.json

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Starting the spider produces output like the following:

Note: the spider must be started from the project's root directory.

'''
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}'''

Output formats

scrapy crawl quotes -O quotes.json

-O overwrites any existing content in a file of the same name;

-o appends to an existing file, but the appended records may not mix cleanly with the existing JSON, so for appending it is better to use JSON Lines:

scrapy crawl quotes -o quotes.jsonl

The available formats include json, jsonl, csv, and xml.
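For reference, JSON Lines stores one JSON object per line, which is why appending with -o stays valid; using the quotes extracted earlier, quotes.jsonl would look roughly like:

{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}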

Crawling the whole site

'''
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>'''
response.css('li.next a').get()
# '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

response.css('li.next a::attr(href)').get()
# '/page/2/'

response.css('li.next a').attrib['href']
# '/page/2/'

Like ::text, ::attr() is an extension you can call in a CSS selector; it extracts an attribute's value.

Recursive crawling

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

The urljoin() method builds an absolute URL, since next_page above is just a relative string such as '/page/2/'.
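For example, with the response from page 1 still loaded in the shell:

response.urljoin('/page/2/')
# 'https://quotes.toscrape.com/page/2/'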

response.follow

The response.follow method does not require a full URL; it accepts relative links directly:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

You can use response.follow_all as a drop-in replacement for the loop:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

yield from yields the requests built from anchors one at a time.
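A minimal, generic illustration of the yield from semantics (not Scrapy-specific):

def numbers():
    yield from [1, 2, 3]   # same as: for n in [1, 2, 3]: yield n

list(numbers())
# [1, 2, 3]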

yield from response.follow_all(css='ul.pager a', callback=self.parse)

Debugging

import os
from scrapy.cmdline import execute

# make sure Scrapy runs from the project root (where scrapy.cfg lives)
os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'spiderman',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass

Create a run.py file in the project root, copy the code above into it, and run it; you can then set breakpoints and debug the spider.

Testing

Below is my spider for crawling each author's biography.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "spiderman"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        author_list = []
        for quote in response.css('div.quote'):
            auth_name = quote.css('small.author::text').get()
            if auth_name not in author_list:
                author_list.append(auth_name)
                url = 'http://quotes.toscrape.com/author/' + auth_name.replace(' ', '-')
                yield scrapy.Request(url=url, callback=self.au_parse)

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def au_parse(self, response):
        born_dt = response.css('span.author-born-date::text').get().strip()
        born_lc = response.css('span.author-born-location::text').get().strip()
        des = response.css('div.author-description::text').get().strip()
        name = response.css('h3.author-title::text').get().strip()
        yield {
            'name': name,
            'data': born_dt,
            'location': born_lc,
            'description': des,
        }

The parse function collects the author names on each page and assembles them into author-page URLs.

The line url = 'http://quotes.toscrape.com/author/' + auth_name.replace(' ', '-') builds the URL and hands it to Scrapy as a new request,

with au_parse as the callback that produces the data.
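For example, for the first author on the page, the hand-built URL matches the (about) link seen in the HTML earlier:

auth_name = 'Albert Einstein'
url = 'http://quotes.toscrape.com/author/' + auth_name.replace(' ', '-')
# 'http://quotes.toscrape.com/author/Albert-Einstein'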

Run it from the command line: scrapy crawl spiderman -O au_bio.json