Connecting Scrapy to MySQL (Windows 8 Pro 64-bit, Python 2.7, Scrapy v1.2)
The following example was tested on Windows 8 Pro 64-bit with Python 2.7 and Scrapy v1.2. It assumes the Scrapy framework is already installed.
The MySQL table we will use throughout this tutorial:
CREATE TABLE IF NOT EXISTS `scrapy_items` (
`id` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`quote` varchar(255) NOT NULL,
`author` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `scrapy_items` (`id`, `quote`, `author`)
VALUES (1, 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'Albert Einstein');
Installing the MySQL driver
- Download the driver: mysql-connector-python-2.2.1.zip OR MySQL-python-1.2.5.zip (MD5)
- Extract the zip into a folder, e.g. C:\mysql-connector\
- Open cmd, go to C:\mysql-connector (where you will find the setup.py file) and run python setup.py install
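Before moving on, it is worth confirming the driver is importable from Python. A minimal check (note: mysql.connector is the module name for mysql-connector-python; the MySQL-python package is imported as MySQLdb instead):

```python
# Quick sanity check that the MySQL driver is importable.
# mysql.connector is the module for mysql-connector-python;
# MySQL-python would be imported as MySQLdb instead.
try:
    import mysql.connector
    driver_ok = True
except ImportError:
    driver_ok = False

print("mysql.connector importable:", driver_ok)
```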
- Copy and run the following example.py
from __future__ import print_function
import mysql.connector
from mysql.connector import errorcode

class MysqlTest(object):
    table = 'scrapy_items'
    conf = {
        'host': '127.0.0.1',
        'user': 'root',
        'password': '',
        'database': 'test',
        'raise_on_warnings': True
    }

    def __init__(self, **kwargs):
        self.cnx = self.mysql_connect()

    def mysql_connect(self):
        try:
            return mysql.connector.connect(**self.conf)
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("Something is wrong with your user name or password")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("Database does not exist")
            else:
                print(err)

    def select_item(self):
        cursor = self.cnx.cursor()
        select_query = "SELECT * FROM " + self.table
        cursor.execute(select_query)
        for row in cursor.fetchall():
            print(row)
        cursor.close()
        self.cnx.close()

def main():
    mysql = MysqlTest()
    mysql.select_item()

if __name__ == "__main__":
    main()
Connecting Scrapy to MySQL
First, create a new Scrapy project by running the following command:
scrapy startproject tutorial
This will create a tutorial directory with the standard Scrapy project layout (screenshot: https://i.stack.imgur.com/e7SqL.jpg):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Here is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory of the project.
Our first spider
import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import TutorialItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        boxes = response.css('div[class="quote"]')
        for box in boxes:
            item = ItemLoader(item=TutorialItem())
            quote = box.css('span[class="text"]::text').extract_first()
            author = box.css('small[class="author"]::text').extract_first()
            item.add_value('quote', quote.encode('ascii', 'ignore'))
            item.add_value('author', author.encode('ascii', 'ignore'))
            yield item.load_item()
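To see what the two CSS selectors in parse() extract from each quote box, here is a stand-alone sketch that applies equivalent attribute predicates with the standard library's xml.etree on a miniature, well-formed stand-in for the page markup (the quote text below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Miniature, well-formed stand-in for one <div class="quote"> box
# from quotes.toscrape.com.
html = (
    '<div class="quote">'
    '<span class="text">The quote text goes here.</span>'
    '<small class="author">Albert Einstein</small>'
    '</div>'
)

box = ET.fromstring(html)
# Equivalent to box.css('span[class="text"]::text') and
# box.css('small[class="author"]::text') in the spider.
quote = box.find("span[@class='text']").text
author = box.find("small[@class='author']").text
print(quote, "-", author)  # -> The quote text goes here. - Albert Einstein
```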
The Scrapy Item class
To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data and to specify field metadata. They provide a dictionary-like API with a convenient syntax for declaring their available fields. See the Scrapy documentation for details.
import scrapy
from scrapy.loader.processors import TakeFirst
class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    quote = scrapy.Field(output_processor=TakeFirst(),)
    author = scrapy.Field(output_processor=TakeFirst(),)
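TakeFirst is an output processor that returns the first non-null, non-empty value collected for a field, so each field of the loaded item ends up as a single scalar rather than a list. A minimal stand-in (not Scrapy's actual implementation) behaves like this:

```python
def take_first(values):
    # Return the first value that is neither None nor an empty string,
    # mimicking scrapy.loader.processors.TakeFirst; returns None if
    # no such value exists.
    for value in values:
        if value is not None and value != '':
            return value

print(take_first(['', None, 'first quote', 'second quote']))  # -> first quote
```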
The Scrapy pipeline
After an item has been scraped by a spider, it is sent to the item pipeline, which processes it through several components executed sequentially. This is where we save the scraped data into the database. See the Scrapy documentation for details.
Note: do not forget to add the pipeline to the ITEM_PIPELINES setting in the tutorial/tutorial/settings.py file.
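Assuming the pipeline class below is saved as TutorialPipeline in tutorial/pipelines.py, the setting would look like this (300 is an arbitrary ordering value between 0 and 1000; lower values run first):

```python
# tutorial/tutorial/settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
```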
from __future__ import print_function
import mysql.connector
from mysql.connector import errorcode

class TutorialPipeline(object):
    table = 'scrapy_items'
    conf = {
        'host': '127.0.0.1',
        'user': 'root',
        'password': '',
        'database': 'sandbox',
        'raise_on_warnings': True
    }

    def __init__(self, **kwargs):
        self.cnx = self.mysql_connect()

    def open_spider(self, spider):
        print("spider open")

    def process_item(self, item, spider):
        print("Saving item into db ...")
        self.save(dict(item))
        return item

    def close_spider(self, spider):
        self.mysql_close()

    def mysql_connect(self):
        try:
            return mysql.connector.connect(**self.conf)
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("Something is wrong with your user name or password")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("Database does not exist")
            else:
                print(err)

    def save(self, row):
        cursor = self.cnx.cursor()
        create_query = ("INSERT INTO " + self.table + " "
                        "(quote, author) "
                        "VALUES (%(quote)s, %(author)s)")
        # Insert new row
        cursor.execute(create_query, row)
        lastRecordId = cursor.lastrowid
        # Make sure data is committed to the database
        self.cnx.commit()
        cursor.close()
        print("Item saved with ID: {}".format(lastRecordId))

    def mysql_close(self):
        self.cnx.close()
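save() passes the values through named placeholders so the driver escapes them, rather than concatenating user-controlled strings into the SQL. The same pattern can be tried without a running MySQL server using Python's built-in sqlite3 module, which uses :name placeholders where MySQL Connector uses %(name)s (the in-memory table and sample row below are stand-ins for the tutorial data):

```python
import sqlite3

# In-memory stand-in for the scrapy_items table from the tutorial.
cnx = sqlite3.connect(':memory:')
cursor = cnx.cursor()
cursor.execute(
    "CREATE TABLE scrapy_items ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, quote TEXT, author TEXT)"
)

row = {'quote': 'An example quote.', 'author': 'An example author'}
# sqlite3 named-placeholder equivalent of the %(quote)s / %(author)s query.
cursor.execute(
    "INSERT INTO scrapy_items (quote, author) VALUES (:quote, :author)", row
)
last_id = cursor.lastrowid
cnx.commit()
cursor.close()
cnx.close()
print("Item saved with ID: {}".format(last_id))  # -> Item saved with ID: 1
```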