
Python Advanced - 04 - Mastering the Python Scrapy (2.12) Crawler Framework in One Article, with a Hands-on Example

1. Introduction

In this installment of the Python Advanced series we cover Scrapy 2.12, the latest release and well ahead of the older versions still taught elsewhere. Scrapy has a big reputation in the crawling world. Before we start, think about what problems Scrapy actually solves, or which kinds of sites it can crawl, and, given its limitations, how we can still put it to good use. With that in mind, let's get on with today's bit of steady progress!

2. Installing Python Scrapy

# Install from the Douban mirror to speed up installation
pip install Scrapy -i http://pypi.doubanio.com/simple --trusted-host pypi.doubanio.com

3. Scrapy Documentation (Official and Chinese)

The best way to learn any technology is still the official documentation, so here it is first:

https://scrapy.org/

Scrapy also has a fairly good Chinese translation of the documentation:

https://scrapy-chs.readthedocs.io/zh-cn/stable/intro/tutorial.html

Pick whichever suits you; the framework itself is quite simple.

4. Creating a Scrapy Project

Before the theory, let me walk you through a simple spider first; I will explain Scrapy's execution flow at the end, which makes everything easier to understand. Let's create a new Scrapy project by running the following command in the VS Code terminal:

scrapy startproject tutorial

An overview of the generated file structure:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Note: a Scrapy project is not limited to a single spider. One project can contain many spiders, and we choose which one to run by its name, as the sketch below shows.
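As a quick illustration (the spider names and URLs here are my own, made up for the example), two spiders defined in separate files under spiders/ are told apart purely by their name attribute:

# spiders/books.py
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"              # run with: scrapy crawl books
    start_urls = ["https://example.com/books"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


# spiders/news.py would hold another class with name = "news",
# which you would run with: scrapy crawl news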

5. Creating Our First Spider

Create a new file quotes_spider.py under the tutorial/spiders directory.

You can of course also generate a spider with a command; while you are still learning, creating the file by hand is just as good.

scrapy genspider mydomain mydomain.com
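For reference, the file generated by genspider looks roughly like this (a sketch from memory; the exact template can differ slightly between Scrapy versions):

# spiders/mydomain.py (generated by scrapy genspider)
import scrapy


class MydomainSpider(scrapy.Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    start_urls = ["https://mydomain.com"]

    def parse(self, response):
        pass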

The contents of quotes_spider.py are as follows:

#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)  # write the raw response body to a file
        self.log(f"Saved file {filename}")  # log a message via the spider's logger
        
        
# Alternatively, you can write it like this.
# parse() is Scrapy's default callback for requests without an explicitly assigned callback,
# so with the start_urls shortcut below no start_requests() is needed at all.
#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)  # write the raw response body to a file
        self.log(f"Saved file {filename}")  # log a message via the spider's logger
        
        

Notes:

name = "quotes":Scrapy项目中,name必须是唯一;

def start_requests(self): must return an iterable of Requests (either a list of requests or a generator function); Scrapy starts crawling from these, and subsequent requests are generated successively from the initial ones.

def parse(self, response): the method called to handle the response downloaded for each request. The response argument is an instance of TextResponse, which holds the page content and offers further helpful methods for handling it. parse() usually parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new Requests from them.

6. Running Our Scrapy Project

Change into our Scrapy project directory tutorial and run scrapy crawl quotes:

(my_venv) PS F:\开发源码\python_demo_06> cd tutorial
(my_venv) PS F:\开发源码\python_demo_06\tutorial> 
(my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl quotes

If everything ran correctly, you should see the spider finish successfully, and the tutorial folder now contains two new files, quotes-1.html and quotes-2.html. At this point we have our first working Scrapy spider.

How the run works:

1. When you run scrapy crawl quotes, Scrapy looks in spiders for the spider whose name is quotes and starts it;

2. It then goes through the urls in start_requests and issues Scrapy's built-in requests with yield scrapy.Request(url=url, callback=self.parse). If a callback is specified, the response goes to that function; if not, Scrapy falls back to the default self.parse; and if there is nothing at all, the spider shuts down;

3. self.parse receives each downloaded response and runs the parsing.

Here is something to think about: why use yield instead of return? What would happen if we used return?
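A minimal sketch of the difference (illustrative only, not part of the tutorial project): start_requests only has to return an iterable of Requests, so returning a list also works, but yield turns the method into a generator that Scrapy can consume lazily, scheduling requests one at a time; a plain return of a single Request inside the loop would end the method after the first URL and would not even be iterable.

import scrapy

class YieldDemoSpider(scrapy.Spider):
    name = "yield_demo"
    urls = ["https://quotes.toscrape.com/page/1/", "https://quotes.toscrape.com/page/2/"]

    def start_requests(self):
        # generator: each Request is handed to the scheduler as it is produced
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # def start_requests(self):
    #     # a list is also "an iterable of Requests", so this works as well,
    #     # but every Request object is built up front
    #     return [scrapy.Request(url=url, callback=self.parse) for url in self.urls]

    def parse(self, response):
        self.log(f"Got {response.url}")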

Up to this point we have not parsed any HTML yet. Bear with me, the main course is still coming.

7. Parsing Data with Scrapy's Built-in XPath and CSS Selectors

In an earlier article we extracted data with BeautifulSoup and XPath, but Scrapy is powerful enough to ship with its own CSS and XPath selectors that we can use directly (you can of course still use BeautifulSoup and XPath if you prefer). Since today's topic is Scrapy, we will use its built-in selectors. The code stays the same as above; we only change quotes_spider.py.

The modified code:

#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = f"quotes-{page}.html"
    #     Path(filename).write_bytes(response.body)
    #     self.log(f"Saved file {filename}")
    def parse(self, response):
        print("************** extraction start ******************")
        print(response.css("title"))
        print("************** extraction end ******************")

        '''
        Output:
        ************** extraction start ******************
        [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
        ************** extraction end ******************
        2024-11-20 22:57:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
        ************** extraction start ******************
        [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
        ************** extraction end ******************
        '''

See that? The output looks like this (for all of the extractions that follow I will only show the key parts):

print(response.css("title"))
#[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
CSS selectors:
  1. Extract content as a list

    response.css("title::text").getall()
    #['Quotes to Scrape']
    

    Note the ::text here. What happens if you leave it out? You get the whole node including the tag, which is not what we want, so add ::text to get just the text content:

    response.css("title").getall()
    #['<title>Quotes to Scrape</title>']
    
  2. Get only the first result

    response.css("title::text").get()
    # 'Quotes to Scrape'
    

    You can also write it like this:

    response.css("title::text")[0].get()
    #'Quotes to Scrape'
    

    **Note:** what is the difference between these two forms?

    response.css("title::text")[0].get(): if there is no result, the indexing raises IndexError

    response.css("title::text").get(): if there is no result, it returns None (you can also pass a fallback, e.g. .get(default=""))

  3. CSS selectors + regular expressions

    response.css("title::text").re(r"Quotes.*")
    #['Quotes to Scrape']
    response.css("title::text").re(r"Q\w+")
    #['Quotes']
    response.css("title::text").re(r"(\w+) to (\w+)")
    #['Quotes', 'Scrape']
    
  4. Using a CSS selector to grab whole elements

    response.css("div.quote")
    '''
    Output:
    [<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    ...]
    '''
    
XPath selectors:
  1. Extract nodes

    response.xpath("//title")
    #[<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
    
  2. Extract text content

    response.xpath("//title/text()").get()
    #'Quotes to Scrape'
    
  3. Extract elements whose text contains a given string

            next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
            next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
            next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
    
  4. Extract all text of an element and its descendants

    textContent = response.xpath('//div[@id="content"]//text()').getall()
    
Using the SelectorGadget extension to quickly build CSS and XPath selectors:

I have prepared a download link for the latest version: https://download.csdn.net/download/Lookontime/90025172

Installation:

  1. Download and unzip the package

  2. Open chrome://extensions/ in Chrome, enable Developer mode, and load the unpacked extension folder

In real projects, building CSS and XPath selectors by hand like this is a bit slow. Is there a better way? There is, although it is not perfect: the SelectorGadget extension can build an XPath for us quickly, and we only need to tweak it slightly.

Note: while reading the official docs I noticed that CSS selectors are in fact converted to XPath by the Scrapy engine, and the docs recommend learning XPath. I have written a dedicated article about XPath before; if you are interested, go take a look. For anyone with front-end experience this should feel very easy.
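As a small illustration of that point (my own example, assuming you have a response for https://quotes.toscrape.com loaded, e.g. inside scrapy shell), the two selectors below target the same text nodes; Scrapy translates the CSS one into XPath internally:

# assumes: scrapy shell "https://quotes.toscrape.com"
texts_css = response.css("div.quote span.text::text").getall()
texts_xpath = response.xpath('//div[@class="quote"]//span[@class="text"]/text()').getall()
print(texts_css == texts_xpath)  # True: both select the same quote texts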

8. Using scrapy shell 'https://quotes.toscrape.com' to Verify Our Parsing

The page structure of https://quotes.toscrape.com looks like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's run this command:

scrapy shell 'https://quotes.toscrape.com'

The shell fetches the page and waits; then we run our selectors:

>>> response.css("div.quote")
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]

>>> quote = response.css("div.quote")[0]

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']



>>> for quote in response.css("div.quote"):
...    text = quote.css("span.text::text").get()
...    author = quote.css("small.author::text").get()
...    tags = quote.css("div.tags a.tag::text").getall()
...    print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...

How do we exit the scrapy shell?

quit()

9. Following the Next Page Until a Stop Condition Is Met

For example, say we want to quietly read a novel without a screen full of ads. That gives us a crawling requirement: scrape the content of a page, find the next page, scrape it, find its next page, and keep going until the condition is no longer met. How do we implement this?

Someone might suggest putting every page URL into urls in def start_requests(self). That works, but it would wear you out: for one thing, the URLs rarely increase in a neat, predictable pattern, and for another we may need to control the crawl order, so we need a better approach.

So can we provide only the start page, have each page yield the URL of the next page, and feed it back to parse to keep the loop going? Of course we can.

Say the next-page markup looks like this:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Build our spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

As we just saw, we parse out the next page's URL, build an absolute request URL, and hand the request back to self.parse until the condition fails. Great! But Scrapy is more powerful still and offers simpler ways; read on.

Option 1: response.follow

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Option 2: a for loop over the href selectors

for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)

Option 3: pass the <a> elements directly

for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)

Option 4: response.follow_all

anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)

Option 5: pass the CSS selector straight to follow_all

yield from response.follow_all(css="ul.pager a", callback=self.parse)

A complete example:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["https://quotes.toscrape/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }

10. Passing Arguments to the Spider

To pass arguments to the spider at run time, just add the -a option to the command.

Run the command:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

Spider code:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Breaking down the command:

  1. scrapy crawl quotes
    • Runs the spider named quotes. quotes is the spider name defined in your Scrapy project, usually in a file under the spiders folder.
  2. -O quotes-humor.json
    • -O specifies the output file, and quotes-humor.json is its name.
    • Scrapy saves the scraped data as a JSON file and overwrites a file of the same name if it exists (the lowercase -o would append instead).
  3. -a tag=humor
    • -a passes the spider an argument named tag with the value humor.
    • Inside the spider the argument is available as self.tag (see the sketch right after this list). Such arguments are usually used as a filter, for example to scrape only "humor"-related content.
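Under the hood, each -a argument is passed as a keyword argument to the spider's __init__ and ends up as an instance attribute, which is exactly why getattr(self, "tag", None) works above. A small sketch (illustrative only, not required in the tutorial code):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, tag=None, *args, **kwargs):
        # `scrapy crawl quotes -a tag=humor` ends up calling __init__ with tag="humor"
        super().__init__(*args, **kwargs)
        self.tag = tag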

11. Scrapy Data Containers: Item and Field

By now you probably feel you almost understand Scrapy but not quite: we can crawl and parse data, but how do we actually use it? That is where Scrapy's data containers and item pipelines come in. Let's start with the data containers.

Scrapy provides two classes for this, Item and Field. To use them, import scrapy in items.py first; the items.py code looks like this:

#items.py
import scrapy


# class TutorialItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass
# class DmozItem(scrapy.Item):
#     title = scrapy.Field()
#     link = scrapy.Field()
#     desc = scrapy.Field()

class QuoteItem(scrapy.Item):
    imgBase64 = scrapy.Field()
    file_name = scrapy.Field()  

class VideoItem(scrapy.Item):
    video_url = scrapy.Field()
    file_name = scrapy.Field()  


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()  

**Item base class:** every custom data class must inherit from Item, e.g. class TextItem(scrapy.Item)

**Field class:** describes the fields the custom data class contains, e.g. title and Content

Create an Item object before use:

# requires: import re  and  from tutorial.items import TextItem
item = TextItem()
item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
textContent = response.xpath('//div[@id="content"]//text()').getall()
# strip '\r' and '\xa0'
cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
item['Content'] = cleaned_list

Get field values:

print(item['title'])
print(item['Content'])

Get all field names:

item.keys()

Copy an Item:

item2 = item.copy()
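A couple of other operations that often come in handy (a quick sketch; nothing beyond the TextItem above is assumed): an Item behaves like a dict, and ItemAdapter wraps it with a uniform interface that also works for plain dicts and dataclass items.

from itemadapter import ItemAdapter

data = dict(item)             # convert the Item to a plain dict
adapter = ItemAdapter(item)   # uniform wrapper, the same one pipelines use
print(adapter.get("title"))   # like item["title"], but returns None if missing
print(adapter.asdict())       # recursive conversion to a dict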

12. Scrapy Item Pipelines

So far we have a data container, but how do we consume it? That is the job of the Scrapy item pipeline, and this is the important part: a pipeline automatically receives the items produced by the spider and, depending on the item, can do different things with it, such as storing it in a database, downloading images or files, or writing the data to JSON, Excel or txt.

An item pipeline has to be registered before use. Once the spider starts, Scrapy automatically sends each item through every registered pipeline, so different pipelines can handle different kinds of content.

Registering a pipeline: done in settings.py

#settings.py

ITEM_PIPELINES = {

#   'tutorial.pipelines.TutorialPipeline': 300,

#   'tutorial.save_Image_pipeline.SaveImagePipeline': 300,

#   'tutorial.video_download_pipeline.VideoDownloadPipeline': 500,

  'tutorial.text_download_pipeline.TextDownloadPipeline':300

}

'tutorial.text_download_pipeline.TextDownloadPipeline': the import path of the pipeline class

300: the lower the number, the higher the priority; items pass through the registered pipelines in ascending order of this value.
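For example (the second pipeline here is hypothetical, just to show the ordering), with two registered pipelines every item flows through them in ascending order of the number:

#settings.py
ITEM_PIPELINES = {
    'tutorial.text_download_pipeline.TextDownloadPipeline': 300,  # runs first
    'tutorial.pipelines.StatsPipeline': 800,                      # hypothetical, runs second
}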

A complete pipeline example:

#text_download_pipeline.py
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline():
    def __init__(self):
        # target folder for the downloaded text
        self.target_folder = "DownLoadText"
        # create the target folder if it does not exist
        if not os.path.exists(self.target_folder):
            os.makedirs(self.target_folder)
    def open_spider(self, spider):
        # called when the spider opens
        pass
    def close_spider(self, spider):
        # called when the spider closes
        pass
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', '未知标题')
        content = adapter.get('Content', [])
        text_name = '下载的文档内容.txt'
        file_path = os.path.join(self.target_folder, text_name)

        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"标题: {title}\n")
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # 每个 item 之间添加空行分隔
            return item

Note: def process_item(self, item, spider): is the one method every pipeline must implement.

This pipeline consumes the Scrapy item and writes its content, line by line, into a txt file.
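Note that process_item must either return the item (so the next pipeline receives it) or raise DropItem to discard it. A minimal sketch of a filtering pipeline (my own example, not part of the project above):

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class RequireTitlePipeline:
    """Drop items without a title; everything else is passed through unchanged."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if not adapter.get('title'):
            raise DropItem("missing title")
        return item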

Besides our own custom pipelines, Scrapy ships with two special pipelines for handling files and images respectively: FilesPipeline and ImagesPipeline. Let's get a handle on those next:

               FilesPipeline                          ImagesPipeline
Import path    scrapy.pipelines.files.FilesPipeline   scrapy.pipelines.images.ImagesPipeline
Item fields    file_urls, files                       image_urls, images
Storage path   FILES_STORE                            IMAGES_STORE

FilesPipeline

  1. Configure settings.py

    import os
    # register the pipeline
    ITEM_PIPELINES = {
      'scrapy.pipelines.files.FilesPipeline':300
    
    }
    # configure the file storage path
    FILES_STORE="F:\\DownloadFiles"
    if not os.path.exists(FILES_STORE):
        os.makedirs(FILES_STORE)
    
  2. Define the data container in items.py

    class FileItem(scrapy.Item):
        file_urls = scrapy.Field()
        files = scrapy.Field()
    
  3. The spider code

    # create a new spider with the command
    scrapy genspider dload_files model
    # or create the file directly under spiders/
    
    #dload_files.py
    import scrapy
    from tutorial.items import FileItem
    
    class DloadFilesSpider(scrapy.Spider):
        name = 'dload_files'
        allowed_domains = ['www.model']

        def start_requests(self):
            urls = [
                "https://www.model/html/265/265564/229802.html",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for line in response.xpath('//div[@class="bookname"]/ul/li'):
                for example in line.xpath('.//ul/li'):
                    url = example.xpath('.//a//@href').extract_first()
                    url = response.urljoin(url)
                    yield scrapy.Request(url, callback=self.parse_files)

        def parse_files(self, response):
            href = response.xpath('//a/@href').extract_first()
            url = response.urljoin(href)
            fileItem = FileItem()
            fileItem['file_urls'] = [url]
            return fileItem
    
  4. Run the spider

    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl dload_files -o myfiles.json
    

ImagesPipeline

  1. Configure settings.py

    import os
    # register the pipeline
    ITEM_PIPELINES = {
      'scrapy.pipelines.images.ImagesPipeline':300
    
    }
    # configure the image storage path
    IMAGES_STORE="F:\\ImageFiles"
    if not os.path.exists(IMAGES_STORE):
        os.makedirs(IMAGES_STORE)
    # thumbnail sizes to generate
    IMAGES_THUMBS = {
        'small': (50, 50),
    #    'big': (270, 270),
    }
    # filter out images that are too small
    #IMAGES_MIN_WIDTH = 50   # minimum width
    #IMAGES_MIN_HEIGHT = 50  # minimum height
    
  2. Define the data container in items.py

    import scrapy
    
    class MyImageItem(scrapy.Item):
        image_urls = scrapy.Field()  # list of image URLs to download
        images = scrapy.Field()  # info about the downloaded images, filled in by the pipeline
    
  3. The spider code

    # create a new spider with the command
    scrapy genspider imagespider example.com
    # or create the file directly under spiders/
    
    #image_files.py
    import scrapy
    from tutorial.items import MyImageItem
    
    class ImageSpider(scrapy.Spider):
        name = 'imagespider'
        start_urls = ['https://example.com']
    
        def parse(self, response):
            item = MyImageItem()
            item['image_urls'] = response.css('img::attr(src)').extract()  # extract the image URLs
            yield item
    
    
  4. Run the spider

    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl imagespider -o myimages.json
    

13. A Complete Example: Downloading a Novel with Scrapy

Create the spider file xbiqugu.py under the spiders folder.

#xbiqugu.py
import scrapy
from tutorial.items import TextItem
import re

class XbiquguSpider(scrapy.Spider):
    name = 'xbiqugu'
    allowed_domains = ['www.477zw3']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.count = 0

    def start_requests(self):
        urls = [
            "https://www.477zw3/html/265/265564/229802.html",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        self.count += 1
        print(f"start crawling-----------{self.count}")
        item = TextItem()
        item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
        print(item['title'])
        textContent = response.xpath('//div[@id="content"]//text()').getall()
        # strip '\r' and '\xa0'
        cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
        item['Content'] = cleaned_list
        # print(item['Content'])
        yield item
        # follow the next chapter
        # next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
        # next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
        next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
        print("next page:", next_page)  # e.g. /html/265/265564/229803.html
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_item)
        

Create the Scrapy data container

#items.py
import scrapy


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()  

Create the pipeline text_download_pipeline.py

# text_download_pipeline
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline():
    def __init__(self):
        # target folder for the downloaded text
        self.target_folder = "DownLoadText"
        # create the target folder if it does not exist
        if not os.path.exists(self.target_folder):
            os.makedirs(self.target_folder)
    def open_spider(self, spider):
        # called when the spider opens
        pass
    def close_spider(self, spider):
        # called when the spider closes
        pass
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', '未知标题')
        content = adapter.get('Content', [])
        text_name = '测试技术.txt'
        file_path = os.path.join(self.target_folder, text_name)

        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"标题: {title}\n")
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # 每个 item 之间添加空行分隔
            return item

Register the pipeline in settings.py

#settings.py
import os
import random

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0 # delay between requests: 0 seconds

ITEM_PIPELINES = {
    'tutorial.text_download_pipeline.TextDownloadPipeline':300
}

USER_AGENT_LIST = [
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
      "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
USER_AGENT = random.choice(USER_AGENT_LIST)
# logging configuration
LOG_LEVEL = "INFO"

from  datetime import datetime



LOG_DIR = "log"

if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)

today = datetime.now()

LOG_FILE = f"{LOG_DIR}/scrapy_{today.year}_{today.month}_{today.day}.log"

Keep in mind that a crawler will not keep working forever; you need to adjust it as the target site changes. That said, Scrapy really does cut down the amount of crawler logic we have to write ourselves, and it is very powerful!

14. Summary

I spent several evenings writing this Scrapy guide, and by now you should understand both how it works and how to use it. Back to the question from the beginning: what are Scrapy's limitations, and how do we deal with them? Indeed, most page content nowadays is generated dynamically with JavaScript, and plain Scrapy cannot handle that on its own. The answer is dynamic crawling with Scrapy, which I will introduce in a follow-up article, so stay tuned!
