Connecting Scrapy to MySQL (Windows 8 Pro 64-bit, Python 2.7, Scrapy v1.2)

The following example was tested on a Windows 8 Pro 64-bit machine with Python 2.7 and Scrapy v1.2. It assumes the Scrapy framework is already installed.

The MySQL database we will use in this tutorial:

CREATE TABLE IF NOT EXISTS `scrapy_items` (
  -- AUTO_INCREMENT is needed because the pipeline inserts rows without an explicit id
  `id` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `quote` varchar(255) NOT NULL,
  `author` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO `scrapy_items` (`id`, `quote`, `author`) 
VALUES (1, 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'Albert Einstein');
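
Note that the dump above does not select a database. Create and select one first, for example as below (the test script later in this tutorial assumes a database named test, while the pipeline assumes sandbox, so adjust the names to your setup):

CREATE DATABASE IF NOT EXISTS test;
USE test;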

Installing the MySQL driver

  1. Download the driver: mysql-connector-python-2.2.1.zip OR MySQL-python-1.2.5.zip (MD5)
  2. Extract the zip to a folder, e.g. C:\mysql-connector\
  3. Open cmd, change to C:\mysql-connector where you will find the setup.py file, and run python setup.py install
  4. Copy and run the example.py below
from __future__ import print_function
import mysql.connector
from mysql.connector import errorcode

class MysqlTest(object):
    table = 'scrapy_items'
    conf = {
        'host': '127.0.0.1',
        'user': 'root',
        'password': '',
        'database': 'test',  # the database where you created scrapy_items
        'raise_on_warnings': True
    }

    def __init__(self, **kwargs):
        self.cnx = self.mysql_connect()

    def mysql_connect(self):
        try:
            return mysql.connector.connect(**self.conf)
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("Something is wrong with your user name or password")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("Database does not exist")
            else:
                print(err)

    def select_item(self):
        cursor = self.cnx.cursor()
        select_query = "SELECT * FROM " + self.table

        cursor.execute(select_query)
        for row in cursor.fetchall():
            print(row)

        cursor.close()
        self.cnx.close()

def main():
    mysql = MysqlTest()
    mysql.select_item()

if __name__ == "__main__":
    main()
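
If the connection settings match your environment, running python example.py should print the seeded row, something like:

(1, u'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', u'Albert Einstein')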

Connecting Scrapy to MySQL

First, create a new Scrapy project by running the following command:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

https://i.stack.imgur.com/e7SqL.jpg
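
For reference, this is the standard skeleton that scrapy startproject generates in Scrapy 1.2:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where you'll put your spiders
            __init__.py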

Here is the code for our first spider. Save it in a file named quotes_spider.py under the project's tutorial/spiders directory.

Our first spider

import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import TutorialItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        boxes = response.css('div[class="quote"]')
        for box in boxes:
            item = ItemLoader(item=TutorialItem())
            quote = box.css('span[class="text"]::text').extract_first()
            author = box.css('small[class="author"]::text').extract_first()
            item.add_value('quote', quote.encode('ascii', 'ignore'))
            item.add_value('author', author.encode('ascii', 'ignore'))
            yield item.load_item()
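
Run the spider from the project's top-level directory:

scrapy crawl quotes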

The Scrapy Item class

To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data and to specify field metadata. They provide a dictionary-like API with a convenient syntax for declaring their available fields. See the Scrapy documentation for details.

import scrapy
from scrapy.loader.processors import TakeFirst

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    quote = scrapy.Field(output_processor=TakeFirst())
    author = scrapy.Field(output_processor=TakeFirst())
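
The ItemLoader used in the spider collects every value for a field as a list internally; the TakeFirst output processor returns the first non-null value, so each field ends up stored as a single string rather than a one-element list.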

The Scrapy pipeline

After an item has been scraped by a spider, it is sent to the item pipeline, which processes it through several components that are executed sequentially. This is where we save the scraped data into the database. See the Scrapy documentation for details.

Note: don't forget to add the pipeline to the ITEM_PIPELINES setting in the tutorial/tutorial/settings.py file, as shown below.
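
A minimal entry for this tutorial's pipeline looks like this (the integer determines the order in which pipelines run, within the 0-1000 range):

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}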

from __future__ import print_function
import mysql.connector
from mysql.connector import errorcode

class TutorialPipeline(object):
    table = 'scrapy_items'
    conf = {
        'host': '127.0.0.1',
        'user': 'root',
        'password': '',
        'database': 'sandbox',  # adjust to the database where you created scrapy_items
        'raise_on_warnings': True
    }
    
    def __init__(self, **kwargs):
        self.cnx = self.mysql_connect()

    def open_spider(self, spider):
        print("spider open")

    def process_item(self, item, spider):
        print("Saving item into db ...")
        self.save(dict(item))
        return item
    
    def close_spider(self, spider):
        self.mysql_close()
    
    def mysql_connect(self):
        try:
            return mysql.connector.connect(**self.conf)
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("Something is wrong with your user name or password")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("Database does not exist")
            else:
                print(err)

    def save(self, row):
        cursor = self.cnx.cursor()
        create_query = ("INSERT INTO " + self.table + 
            "(quote, author) "
            "VALUES (%(quote)s, %(author)s)")

        # Insert new row
        cursor.execute(create_query, row)
        lastRecordId = cursor.lastrowid

        # Make sure data is committed to the database
        self.cnx.commit()
        cursor.close()
        print("Item saved with ID: {}" . format(lastRecordId)) 

    def mysql_close(self):
        self.cnx.close()
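
Hard-coding the credentials in the class keeps the example short; in a real project you would typically define them in settings.py and read them via the pipeline's from_crawler class method. A minimal sketch of that pattern, assuming settings keys such as MYSQL_HOST and MYSQL_DATABASE that you define yourself:

class TutorialPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline; crawler.settings
        # exposes everything defined in settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST', '127.0.0.1'),
            user=crawler.settings.get('MYSQL_USER', 'root'),
            password=crawler.settings.get('MYSQL_PASSWORD', ''),
            database=crawler.settings.get('MYSQL_DATABASE', 'sandbox'),
        )

    def __init__(self, host, user, password, database):
        self.conf = {
            'host': host,
            'user': user,
            'password': password,
            'database': database,
            'raise_on_warnings': True
        }
        self.cnx = self.mysql_connect()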

Reference: https://doc.scrapy.org/en/latest/index.html