//div[@class="mine"]: 选择所有具有 class="mine" 属性的 div 元素
提取数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
''' <div class="quote"> <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span> by <small class="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div>'''
quote = response.css("div.quote")[0] text = quote.css("span.text::text").get() text ''' '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”' '''
''' <div class="quote"> <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span> by <small class="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div>'''
for quote in response.css("div.quote"): text = quote.css("span.text::text").get() author = quote.css("small.author::text").get() tags = quote.css("div.tags a.tag::text").getall() print(dict(text=text, author=author, tags=tags)) ''' {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']} {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}'''
数据保存
scrapy crawl spiderman -O spn.json
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
import scrapy
classQuotesSpider(scrapy.Spider): name = "quotes" start_urls = [ 'https://quotes.toscrape.com/page/1/', 'https://quotes.toscrape.com/page/2/', ]
defparse(self, response): for quote in response.css('div.quote'): yield { 'text': quote.css('span.text::text').get(), 'author': quote.css('small.author::text').get(), 'tags': quote.css('div.tags a.tag::text').getall(), }
启动爬虫会获得如下内容:
注要在爬虫的根目录启动爬虫
1 2 3 4 5
''' 2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/> {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'} 2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/> {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}'''