If the terminal in VS Code does not recognize scrapy, add the path to scrapy.exe to your environment variables.

Getting started

After installation, run scrapy startproject tutorial in the target folder; it will create the following files:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Create our first spider in the tutorial/spiders directory and name it quotes_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

Run the spider from the terminal with scrapy crawl quotes; you will get two files, quotes-1.html and quotes-2.html.

scrapy shell

Before parsing those two files, let's introduce the Scrapy shell, which we use for interactive debugging: scrapy shell <url>

After pip install ipython, the Scrapy shell will use IPython automatically. If IPython does not work for you, find the scrapy.cfg file in the project's root directory and add the following under the [settings] section:

shell = bpython
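For reference, a sketch of what scrapy.cfg could look like after that edit, assuming the tutorial project generated above:

[settings]
default = tutorial.settings
shell = bpython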

Type exit to quit the shell.

scrapy shell "https://quotes.toscrape.com/page/1/"

'''
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s] item {}
[s] request <GET https://quotes.toscrape.com/page/1/>
[s] response <200 https://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
[s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser'''

The lines above list some of the returned objects you can work with.
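For example, the fetch() shortcut listed above downloads another URL and refreshes the local response object; a quick sketch:

fetch('https://quotes.toscrape.com/page/2/')
response.url
# 'https://quotes.toscrape.com/page/2/'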

response.css('title')
# [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

This lets you experiment interactively.

CSS syntax

::text

response.css('title::text').getall()
# ['Quotes to Scrape']

response.css('title').getall()
# ['<title>Quotes to Scrape</title>']

get/getall

get() returns a single result, while getall() returns a list of all matches.
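A minimal sketch contrasting the two in the same shell session:

response.css('title::text').get()
# 'Quotes to Scrape'
response.css('title::text').getall()
# ['Quotes to Scrape']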

Regular expressions

response.css('title::text').re(r'Quotes.*')
# ['Quotes to Scrape']
response.css('title::text').re(r'Q\w+')
# ['Quotes']
response.css('title::text').re(r'(\w+) to (\w+)')
# ['Quotes', 'Scrape']

XPath

The official docs recommend XPath, but I find CSS selectors more convenient to write.

response.xpath('//title')
# [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
response.xpath('//title/text()').get()
# 'Quotes to Scrape'
  • /html/head/title: selects the <title> element inside the document's <head>
  • /html/head/title/text(): selects the text of that <title> element
  • //td: selects all <td> elements
  • //div[@class="mine"]: selects all <div> elements that have class="mine"

Extracting data

'''
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>'''

scrapy shell 'https://quotes.toscrape.com'

Extracting a single quote

response.css("div.quote")
'''
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]'''

Each result has two parts, the selector and its data; the data is the part we work with.

quote = response.css("div.quote")[0]
text = quote.css("span.text::text").get()
text
'''
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”' '''

author = quote.css("small.author::text").get()
author
# 'Albert Einstein'

The response.css selector here uses the form 'tag.class'.

response.css("div.quote") returns all elements with class="quote" in the HTML, and quote here is the first of them. Within each quote, the span.text element holds the quotation and small.author holds the author.

Extracting a group

tags = quote.css("div.tags a.tag::text").getall()
tags
# ['change', 'deep-thoughts', 'thinking', 'world']

Extracting all quotes

for quote in response.css("div.quote"):
    text = quote.css("span.text::text").get()
    author = quote.css("small.author::text").get()
    tags = quote.css("div.tags a.tag::text").getall()
    print(dict(text=text, author=author, tags=tags))
'''
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}'''

Saving the data

scrapy crawl spiderman -O spn.json

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Starting the spider produces output like the following:

Note: the spider must be started from the project's root directory.

'''
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}'''

Output formats

scrapy crawl quotes -O quotes.json

-O overwrites any existing content in a file of the same name;

-o appends to an existing file, but the appended records may not mix cleanly with the existing JSON, so for appending it is better to use JSON Lines:

scrapy crawl quotes -o quotes.jsonl

The available formats include json, jsonl, csv, and xml.
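For reference, JSON Lines stores one JSON object per line, which is why appending with -o stays valid; using the quotes extracted earlier, quotes.jsonl would look roughly like:

{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}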

Crawling the whole site

'''
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>'''
response.css('li.next a').get()
# '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

response.css('li.next a::attr(href)').get()
# '/page/2/'

response.css('li.next a').attrib['href']
# '/page/2/'

Like ::text, ::attr() is an extension you can call in a CSS selector; it extracts an attribute's value.

Recursive crawling

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

The urljoin() method builds an absolute URL, since next_page above is just a relative string such as '/page/2/'.
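For example, with the response from page 1 still loaded in the shell:

response.urljoin('/page/2/')
# 'https://quotes.toscrape.com/page/2/'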

response.follow

The response.follow method does not require a full URL; it accepts relative links directly:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

You can use response.follow_all as a drop-in replacement for the loop:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

yield from yields the requests built from anchors one at a time.
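A minimal, generic illustration of the yield from semantics (not Scrapy-specific):

def numbers():
    yield from [1, 2, 3]   # same as: for n in [1, 2, 3]: yield n

list(numbers())
# [1, 2, 3]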

yield from response.follow_all(css='ul.pager a', callback=self.parse)

Debugging

import os
from scrapy.cmdline import execute

# make sure Scrapy runs from the project root (where scrapy.cfg lives)
os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'spiderman',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass

Create a run.py file in the project root, copy the code above into it, and run it; you can then set breakpoints and debug the spider.

Testing

Below is my spider for crawling each author's biography.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "spiderman"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        author_list = []
        for quote in response.css('div.quote'):
            auth_name = quote.css('small.author::text').get()
            if auth_name not in author_list:
                author_list.append(auth_name)
                url = 'http://quotes.toscrape.com/author/' + auth_name.replace(' ', '-')
                yield scrapy.Request(url=url, callback=self.au_parse)

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def au_parse(self, response):
        born_dt = response.css('span.author-born-date::text').get().strip()
        born_lc = response.css('span.author-born-location::text').get().strip()
        des = response.css('div.author-description::text').get().strip()
        name = response.css('h3.author-title::text').get().strip()
        yield {
            'name': name,
            'data': born_dt,
            'location': born_lc,
            'description': des,
        }

The parse function collects the author names on each page and assembles them into author-page URLs.

The line url = 'http://quotes.toscrape.com/author/' + auth_name.replace(' ', '-') builds the URL and hands it to Scrapy as a new request,

with au_parse as the callback that produces the data.
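For example, for the first author on the page, the hand-built URL matches the (about) link seen in the HTML earlier:

auth_name = 'Albert Einstein'
url = 'http://quotes.toscrape.com/author/' + auth_name.replace(' ', '-')
# 'http://quotes.toscrape.com/author/Albert-Einstein'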

Run it from the command line: scrapy crawl spiderman -O au_bio.json