Crawler Framework: Scrapy Quick Start - FreeNAS中文网

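Contents

1. Introduction to Scrapy
- 1.1 Example code
- 1.2 How the example code runs
2. Installing Scrapy
- 2.1 Installing on Ubuntu
- 2.2 Installing on Windows
- 2.3 Installing on macOS
3. Scrapy quick start
- 3.1 Creating a Scrapy project
- 3.2 Writing a spider
- 3.3 Running the spider
  - 3.3.1 A shortcut for the start_requests method
- 3.4 Extracting data
  - 3.4.1 Extracting quotes and authors
  - 3.4.2 Extracting data in a spider
- 3.5 Storing the scraped data
- 3.6 Following links
- 3.7 A shortcut for creating requests
- 3.8 More examples and patterns
- 3.9 Using spider arguments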

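1. Introduction to Scrapy

Scrapy is a web crawling framework for crawling websites and extracting structured data from their pages. It is widely used, for purposes ranging from data mining to monitoring and automated testing.

1.1 Example code

Here is a minimal spider: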

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

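Save it in a text file named quotes_spider.py and run it with the runspider command:

scrapy runspider quotes_spider.py -o quotes.jl

When the run finishes, quotes.jl contains the scraped quotes in JSON Lines format: one JSON object per line, each with the author and text fields.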

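1.2 How the example code runs

When we run scrapy runspider quotes_spider.py, Scrapy looks for a spider definition in quotes_spider.py (a subclass of scrapy.Spider) and runs it through its crawler engine:

- The crawl starts by sending requests to the URLs in the start_urls list;
- When a response arrives, Scrapy calls the default callback method, parse, and passes the response to it;
- In the callback, we loop over the target elements with a CSS selector, extract the data, and yield it;
- After the loop, we look for the link to the next page and schedule another request that uses the same parse method as its callback.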
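
The flow above can be modelled in a few lines of plain Python. This is only a toy sketch, not Scrapy's engine: FAKE_SITE, Request, parse and crawl are hypothetical stand-ins introduced here for illustration, and nothing touches the network.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (quotes on the page, next-page URL or None).
FAKE_SITE = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], None),
}


class Request:
    """Stand-in for a request: a URL plus the callback that handles its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url):
    """Callback: yield one item per quote, then a request for the next page."""
    quotes, next_page = FAKE_SITE[url]
    for text in quotes:
        yield {"text": text}
    if next_page is not None:
        yield Request(next_page, parse)


def crawl(start_urls):
    """Run callbacks until no requests remain and return the collected items."""
    queue = deque(Request(url, parse) for url in start_urls)
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, Request):
                queue.append(result)  # schedule the follow-up request
            else:
                items.append(result)  # an extracted item
    return items


items = crawl(["/page/1/"])
print(items)
```

The point of the sketch is that a callback may yield both items and further requests, and the loop keeps going until the queue is empty.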

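2. Installing Scrapy

The Scrapy version used in this article (2.5.x) requires Python 3.6+, either the CPython implementation (the default) or PyPy 7.2.0+ (see "Alternate Implementations" in the official documentation).

2.1 Installing on Ubuntu

Install the dependencies by running the following in a terminal: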

sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Then install Scrapy with pip:

pip install Scrapy

2.2 Installing on Windows

On Windows, installing with conda (Anaconda or Miniconda) is recommended:

conda install -c conda-forge scrapy

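Installing with pip can run into dependency problems. Workarounds exist, but conda is still the recommended route.

2.3 Installing on macOS

Please refer to the installation guide in the official Scrapy documentation.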

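3. Scrapy quick start

We will get started with Scrapy by scraping quotes.toscrape.com, a website that lists quotes from famous people.

We will work through the following tasks:

- Create a Scrapy project;
- Write a spider (a spider is only one part of the project, not the whole crawler project) to crawl the site and extract data;
- Export the scraped data from the command line;
- Change the spider to follow links recursively;
- Use spider arguments.

3.1 Creating a Scrapy project

Before scraping, we must set up a new Scrapy project, here named tutorial. Run in a terminal: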

scrapy startproject tutorial

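Scrapy creates the following directory structure:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module; our code goes here
        __init__.py
        items.py          # item definitions
        middlewares.py    # middlewares
        pipelines.py      # pipelines (e.g. for persisting items)
        settings.py       # project settings
        spiders/          # directory for the spiders we write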
            __init__.py

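3.2 Writing a spider

Scrapy uses spiders to scrape websites. A spider is defined as a class that must subclass scrapy.Spider.

Create a file named quotes_spider.py in the tutorial/spiders directory with the following code: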

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

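Our spider subclasses scrapy.Spider and defines some attributes and methods:

- name: the name of the spider. It is a unique identifier, so no two spiders in the same project may share a name.
- start_requests(): must return an iterable of requests, such as a list or a generator. The spider starts crawling from these requests, and subsequent requests are generated successively from them.
- parse(): a callback method that handles the response downloaded for each request. The response argument is an instance of TextResponse, which holds the page content and offers further methods for processing it.

The parse() method usually parses the response, extracts the scraped data as dictionaries, finds new URLs to follow, and creates new requests from them.

3.3 Running the spider

Run the following command in the directory that contains scrapy.cfg: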

scrapy crawl quotes

This command runs the spider named "quotes" and prints output similar to the following:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

Now check the files in the current directory. Two new files should have appeared, quotes-1.html and quotes-2.html, each holding the content of its respective URL.
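
The file names come from the line page = response.url.split("/")[-2] in parse(). A standalone sketch of that step follows; page_filename is a hypothetical helper name used only for illustration, and it assumes URLs that end in a trailing slash, like the two above.

```python
def page_filename(url):
    """Build the output file name from a URL such as .../page/1/."""
    # Splitting on "/" gives [..., "page", "1", ""] because of the trailing slash,
    # so index -2 is the page number.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("http://quotes.toscrape.com/page/1/"))
print(page_filename("http://quotes.toscrape.com/page/2/"))
```

Without the trailing slash, index -2 would pick the path segment before the page number, so the expression depends on the URL shape.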

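3.3.1 A shortcut for the start_requests method

Instead of implementing start_requests, we can define a start_urls class attribute holding the list of URLs to request. The default start_requests implementation builds the initial requests from that list, and parse is called for each response even though we never register it explicitly, because parse is the default callback.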

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        ...  # body omitted
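
To see why start_urls can replace start_requests, here is a toy model of the idea in plain Python. MiniSpider is a hypothetical stand-in, not scrapy.Spider, and its "requests" are simple (url, callback) tuples rather than real Request objects.

```python
class MiniSpider:
    """Toy base class (not scrapy.Spider) that mimics the start_urls shortcut."""

    start_urls = []

    def start_requests(self):
        # Default behaviour: one (url, callback) pair per entry in start_urls,
        # with parse as the default callback.
        for url in self.start_urls:
            yield (url, self.parse)

    def parse(self, response):
        raise NotImplementedError


class QuotesSpider(MiniSpider):
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        return f"parsed {response}"


requests = list(QuotesSpider().start_requests())
print([url for url, _ in requests])
```

The subclass only declares data and a callback; the inherited start_requests does the wiring.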

3.4 Extracting data

The best way to learn data extraction is to try selectors in the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

We will see output like this:

2022-01-17 10:56:00 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: tutorial)
……
<GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fb27098a1c0>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fb270986910>
[s]   spider     <DefaultSpider 'default' at 0x7fb270643640>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Try selecting elements from the response object with CSS. The result is a list-like object of Selector instances:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

To extract the text from the title selected above:

>>> response.css('title::text').getall()
['Quotes to Scrape']

Note that we added ::text to the CSS query, which means we want to select only the text directly inside the <title> element. Without ::text we get the full title element, including its tags:

>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

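Calling .getall() returns a list, because a selector may match several elements. To get only the first result, use .get():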

>>> response.css('title::text').get()
'Quotes to Scrape'

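The code above can also be written with an index. Indexing raises IndexError when nothing matches, whereas calling .get() directly on the selector list returns None:

>>> response.css('title::text')[0].get()  # index the first selector, then extract its data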
'Quotes to Scrape'

Besides get() and getall(), the re() method extracts data with regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
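
The three results above can be reproduced with Python's standard re module. The helper below is a rough approximation written for this article (re_flat is a hypothetical name, not the selector library's implementation): it returns the whole match when the pattern has no groups, and the flattened groups otherwise.

```python
import re

title = "Quotes to Scrape"


def re_flat(pattern, text):
    """Return all matches as a flat list of strings, flattening capture groups."""
    flat = []
    for match in re.finditer(pattern, text):
        groups = match.groups()
        # No capture groups: keep the whole match. Otherwise: keep each group.
        flat.extend(groups if groups else [match.group(0)])
    return flat


print(re_flat(r"Quotes.*", title))
print(re_flat(r"Q\w+", title))
print(re_flat(r"(\w+) to (\w+)", title))
```

This explains why the last pattern yields two strings rather than one tuple: each capture group becomes its own list entry.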

3.4.1 Extracting quotes and authors

On http://quotes.toscrape.com, each quote is represented by HTML like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

First open the Scrapy shell on http://quotes.toscrape.com:

scrapy shell 'http://quotes.toscrape.com'

Get the list of selectors for the quote elements:

>>> response.css("div.quote")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 ...]

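Take the first selector from the list above:

>>> quote = response.css("div.quote")[0]

Note: get() returns the extracted data as a string, whereas [0] returns the first selector in the list, which can be queried further.

Extract the text and author from quote: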

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'

Since the tags (the quote's tags, not HTML tags) form a list of strings, we use .getall() to get all of them:

>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

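Now that we know how to extract each piece of data, the next step is to iterate over all the quote elements and put the extracted data into a dictionary: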

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
……

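3.4.2 Extracting data in a spider

The extraction above was done in the Scrapy shell. Next we write the same logic into the spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the Python yield keyword in the callback: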

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Then run the spider with scrapy crawl quotes. The terminal shows output like this:

……
2022-01-17 16:20:27 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2022-01-17 16:20:27 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
……

3.5 Storing the scraped data

The simplest way to store the scraped data is to use Feed exports. Run in a terminal:

scrapy crawl quotes -O quotes.json

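This serializes the scraped items as JSON and saves them in quotes.json.

**The -O option (uppercase letter O) overwrites any existing file with the new data, while -o (lowercase) appends to the existing file. Appending to a JSON file leaves its contents invalid JSON!** So to append, use a different serialization format, such as JSON Lines: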

scrapy crawl quotes -o quotes.jl

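The JSON Lines format is useful because each record sits on its own line: appending new records is straightforward, and large files can be processed without loading everything into memory.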
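
A short sketch with the standard json module illustrates both points. The helper names append_jsonl and read_jsonl are hypothetical, the file lives in a temporary directory, and the sample records are made up for the demonstration. The string passed to json.loads assumes that two appended runs amount to two JSON arrays written back to back.

```python
import json
import os
import tempfile


def append_jsonl(path, records):
    """Append records to a JSON Lines file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield records one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


path = os.path.join(tempfile.mkdtemp(), "quotes.jl")
append_jsonl(path, [{"author": "Albert Einstein"}])  # first run
append_jsonl(path, [{"author": "J.K. Rowling"}])     # second run appends
authors = [record["author"] for record in read_jsonl(path)]
print(authors)

# Two JSON arrays written back to back do not form a valid JSON document.
try:
    json.loads('[{"a": 1}][{"a": 2}]')
    appended_json_is_valid = True
except json.JSONDecodeError:
    appended_json_is_valid = False
print(appended_json_is_valid)
```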

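3.6 Following links

Usually we do not scrape just the first few pages of a site but dozens or hundreds of pages, or even the whole site. So we need a way to discover those URLs dynamically. That way is following links: extracting the URLs of other pages from the page itself.

The first step is to extract the link we want to follow. Inspecting the page, we can see a link to the next page with the following markup: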

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

That returns the anchor element, but we want its href attribute. Use ::attr() to extract the value:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

There is also an attrib property available (see "Selecting element attributes" in the official documentation for more):

>>> response.css('li.next a').attrib['href']
'/page/2/'

Now back in the spider, modify the code to follow the link to the next page recursively and extract data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

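After extracting the data, parse() looks for the link to the next page, builds an absolute URL with urljoin() (the link may be relative), and yields a new scrapy.Request for the next page, registering itself as the callback that handles data extraction for that page.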
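
The URL joining step can be tried in isolation with the standard library's urllib.parse.urljoin, which resolves a link against a base URL. This sketch assumes the page URL acts as the base, as it does for the pages in this tutorial; the example.com link is a made-up value.

```python
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/page/1/"

# A root-relative link replaces the whole path of the base URL.
next_url = urljoin(base, "/page/2/")
print(next_url)

# An absolute link is returned unchanged.
other_url = urljoin(base, "http://example.com/other/")
print(other_url)
```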

3.7 A shortcut for creating requests

As a shortcut for creating Request objects, we can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

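Unlike scrapy.Request, response.follow supports relative URLs directly (here, a URL path without the scheme and domain), so there is no need to call response.urljoin(). Note that response.follow only returns a Request instance; we still have to yield it.

We can also pass a selector to response.follow instead of a string. The selector should extract the necessary attribute, such as href: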

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements, response.follow uses their href attribute automatically, so the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, use response.follow_all:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

Or, more concisely:

yield from response.follow_all(css='ul.pager a', callback=self.parse)

3.8 More examples and patterns

Here is another spider that illustrates callbacks and following links, this time scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

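This spider starts from the home page and follows every link to an author page, calling the parse_author callback for each. It also follows the pagination links with the parse callback, as we saw before.

Here we pass the callbacks to response.follow_all as positional arguments, which makes the code shorter. The same works for Request.

The parse_author callback defines a helper function that extracts and cleans up data from a CSS query, and it yields a Python dictionary with the author data.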

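Also, we do not need to worry about visiting the same author page several times. By default, Scrapy filters out duplicate requests to URLs that have already been visited, which avoids hitting a server too often because of a programming mistake. This behaviour can be configured through the DUPEFILTER_CLASS setting.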
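
The idea behind duplicate filtering can be sketched with a set of already seen URLs. This is a deliberately simplified model: filter_duplicates is a hypothetical function, it compares raw URL strings only, and it is not how Scrapy's own duplicate filter is implemented.

```python
def filter_duplicates(urls):
    """Yield each URL the first time it appears and skip later repeats."""
    seen = set()
    for url in urls:
        if url in seen:
            continue  # already scheduled once, drop the repeat
        seen.add(url)
        yield url


author_links = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",  # the same author quoted a second time
]
unique_links = list(filter_duplicates(author_links))
print(unique_links)
```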

3.9 Using spider arguments

We can provide command-line arguments to a spider with the -a option when running it:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

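These arguments are passed to the spider's __init__ method and become spider attributes by default.

In this example, the value passed for the tag argument is available as self.tag. We can use it to make the spider fetch only quotes with a specific tag, building the URL from the argument: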

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If we pass tag=humor to this spider, it only visits URLs for the humor tag, such as http://quotes.toscrape.com/tag/humor.
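
The argument handling can be tried without running a crawl. The class below is a toy stand-in (MiniSpider and start_url are hypothetical names, not Scrapy's API) that stores keyword arguments as attributes and builds the start URL the same way the spider above does.

```python
class MiniSpider:
    """Toy stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        # Mimics the "arguments become spider attributes" behaviour described above.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when no -a tag=... was given
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(MiniSpider(tag="humor").start_url())
print(MiniSpider().start_url())
```

Using getattr with a default keeps the spider working when the argument is omitted.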

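This tutorial covers only the basics of Scrapy; many other features are not mentioned here. See the "What else?" section of the official documentation for a quick overview of the most important ones.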

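Copyright notice: This article, "Crawler Framework: Scrapy Quick Start", was contributed by a site user, and the views expressed are the author's own. To reprint it, please contact the author and cite the source: http://www.freenas.com.cn/jishu/1726434664h959986.html. The site only provides information storage, claims no ownership of the content, and accepts no related legal liability. Content found to be plagiarized, infringing, or illegal will be removed once verified.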