Scrapy's commonly used commands fall into two groups: global commands and project commands. Global commands do not depend on a Scrapy project and can be run anywhere; project commands can only be run inside a Scrapy project.
一、Global commands

Running `scrapy -h` lists the commonly used global commands:

```
[root@aliyun ~]# scrapy -h
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
```
Everything below bench in this list is a global command. bench itself is a special case: even though it appears under Available commands with no active project, it really belongs with the project commands and is covered there below.
1. The fetch command

The fetch command is mainly used to show how a page is fetched with the Scrapy downloader: `scrapy fetch <url>`.

```
[root@aliyun ~]# scrapy fetch http://www.baidu.com
2018-03-15 10:50:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-15 10:50:02 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.4.2 (default, Mar 15 2018, 10:26:10) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2k-fips  26 Jan 2017), cryptography 2.1.4, Platform Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-03-15 10:50:02 [scrapy.crawler] INFO: Overridden settings: {}
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-15 10:50:02 [scrapy.core.engine] INFO: Spider opened
2018-03-15 10:50:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 10:50:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 10:50:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
...<title>百度一下,你就知道</title>...
2018-03-15 10:50:02 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 10:50:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 212,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 15, 2, 50, 2, 425038),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'memusage/max': 44892160,
 'memusage/startup': 44892160,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 3, 15, 2, 50, 2, 241466)}
2018-03-15 10:50:02 [scrapy.core.engine] INFO: Spider closed (finished)
```
When I ran this command I hit an error:

```
ImportError: No module named _sqlite3
```

The fix was to install sqlite-devel with yum and then rebuild and reinstall Python:

```
yum install -y sqlite-devel
cd /usr/local/src/Python-3.4.2
./configure --prefix=/usr/local/python3
make && make install
ln -fs /usr/local/python3/bin/python3 /usr/bin/python
```
Note that if you run fetch outside a Scrapy project directory, Scrapy's default spider is used for the crawl; if you run it inside a project directory, that project's spider is used to fetch the page.
The command's options can be listed with `scrapy fetch -h`:

```
[root@aliyun ~]# scrapy fetch -h
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h            show this help message and exit
--spider=SPIDER       use this spider
--headers             print response HTTP headers instead of body
--no-redirect         do not handle HTTP 3xx status codes and print response as-is

Global Options
--------------
--logfile=FILE        log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                      log level (default: DEBUG)
--nolog               disable logging completely
--profile=FILE        write python cProfile stats to FILE
--pidfile=FILE        write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                      set/override setting (may be repeated)
--pdb                 enable pdb on failure
```
--headers prints the response headers instead of the body, --logfile writes the log to a file, --nolog suppresses the crawl log entirely, --spider selects which spider to use, and --loglevel sets the log level.
Fetch only the response headers with --headers, and hide the crawl log with --nolog:

```
[root@aliyun ~]# scrapy fetch --headers --nolog http://www.baidu.com
> User-Agent: Scrapy/1.5.0 (+https://scrapy.org)
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> Accept-Encoding: gzip,deflate
>
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:28:28 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Server: bfe/1.0.8.18
< Date: Thu, 15 Mar 2018 03:15:23 GMT
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
```
fetch is thus a convenient way to see exactly how a page is fetched.
2. The runspider command

The runspider command runs a single spider file directly, without needing a Scrapy project: `scrapy runspider <spider_file.py>`.
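As a minimal sketch of what such a self-contained file can look like (this example is not from the original post; the file name standalone_spider.py and the spider itself are made up for illustration):

```python
# standalone_spider.py -- run with: scrapy runspider standalone_spider.py
import scrapy


class StandaloneSpider(scrapy.Spider):
    # No project is needed; runspider loads this single file directly.
    name = 'standalone'
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Yield a simple dict item with the URL and page title.
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
```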
3. The settings command

The settings command shows Scrapy's configuration. Used inside a Scrapy project directory it shows that project's settings; used globally it shows the default settings.

For example, query BOT_NAME from inside the project directory and then from outside it:

```
[root@aliyun test_scrapy]# cd /python/test_scrapy/myfirstpjt/
[root@aliyun myfirstpjt]# scrapy settings --get BOT_NAME
myfirstpjt
[root@aliyun myfirstpjt]# cd
[root@aliyun ~]# scrapy settings --get BOT_NAME
scrapybot
```
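The same values are also available from code through a spider's settings attribute. The spider below is a hypothetical sketch used only to illustrate this; it is not part of the original post:

```python
import scrapy


class SettingsPeekSpider(scrapy.Spider):
    # Hypothetical spider: logs the resolved BOT_NAME, the same value
    # that `scrapy settings --get BOT_NAME` prints.
    name = 'settings_peek'
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        self.logger.info('BOT_NAME = %s', self.settings.get('BOT_NAME'))
```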
4. The shell command
The shell command starts Scrapy's interactive console (`scrapy shell`), which is used heavily during development and testing. It can be run globally, outside any project, e.g. `scrapy shell <url>`.
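Inside the shell the downloaded page is bound to `response`, and the typical workflow is to experiment with selectors against it. The snippet below is a standalone sketch of the same selector calls (the HTML string is made up to mirror the Baidu title seen earlier; it is not from the original post):

```python
# Trying out selectors, as you would interactively in `scrapy shell <url>`.
from scrapy import Selector

html = '<html><head><title>百度一下,你就知道</title></head><body></body></html>'
sel = Selector(text=html)
print(sel.xpath('//title/text()').extract_first())  # XPath version
print(sel.css('title::text').extract_first())       # same result via CSS
```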
5. The startproject command

The startproject command creates a new Scrapy project: `scrapy startproject <projectname>`.

6. The version command

The version command prints Scrapy's version:

```
[root@aliyun ~]# scrapy version
Scrapy 1.5.0
```

Add -v to show the versions of the related components as well:

```
[root@aliyun ~]# scrapy version -v
Scrapy       : 1.5.0
lxml         : 4.1.1.0
libxml2      : 2.9.1
cssselect    : 1.0.3
parsel       : 1.4.0
w3lib        : 1.19.0
Twisted      : 17.9.0
Python       : 3.4.2 (default, Mar 15 2018, 10:26:10) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
pyOpenSSL    : 17.5.0 (OpenSSL 1.0.2k-fips  26 Jan 2017)
cryptography : 2.1.4
Platform     : Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
```
7. The view command
The view command downloads a page and opens it directly in a browser, showing it as Scrapy sees it: `scrapy view <url>`.
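The same "show me what Scrapy downloaded" trick can be used inside a spider while debugging, via Scrapy's open_in_browser helper. The spider below is a hypothetical sketch, not part of the original post:

```python
import scrapy
from scrapy.utils.response import open_in_browser


class ViewDebugSpider(scrapy.Spider):
    # Hypothetical spider: opens each downloaded response in a local browser,
    # roughly what `scrapy view <url>` does from the command line.
    name = 'view_debug'
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        open_in_browser(response)
```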
二、Project commands

Project commands must be run inside a project directory.

1. The bench command

The bench command runs a quick benchmark that tests the performance of the local hardware:

```
[root@aliyun myfirstpjt]# scrapy bench
……
2018-03-16 14:56:22 [scrapy.extensions.logstats] INFO: Crawled 255 pages (at 1500 pages/min), scraped 0 items (at 0 items/min)
2018-03-16 14:56:23 [scrapy.extensions.logstats] INFO: Crawled 279 pages (at 1440 pages/min), scraped 0 items (at 0 items/min)
2018-03-16 14:56:24 [scrapy.extensions.logstats] INFO: Crawled 303 pages (at 1440 pages/min), scraped 0 items (at 0 items/min)
……
```

The output shows that this machine can crawl roughly 1440 pages per minute.
2. The genspider command
The genspider command creates a Scrapy spider file from a template, which is a quick way to bootstrap a new spider.

List the spider templates that are currently available:

```
[root@aliyun myfirstpjt]# scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```
Create a spider file based on one of the templates with `scrapy genspider -t <template> <new_spider_name> <domain_to_crawl>`:

```
[root@aliyun myfirstpjt]# scrapy genspider -t basic test www.baidu.com
Created spider 'test' using template 'basic' in module:
  myfirstpjt.spiders.test
```
Inside the project directory you can now see the generated test.py file, with the domain already filled in:
```
[root@aliyun myfirstpjt]# cd myfirstpjt/
[root@aliyun myfirstpjt]# ls
__init__.py  items.py  middlewares.py  pipelines.py  __pycache__  settings.py  spiders
[root@aliyun myfirstpjt]# cd spiders/
[root@aliyun spiders]# ls
__init__.py  __pycache__  test.py
[root@aliyun spiders]# cat test.py
# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass
```
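The generated parse method is only a stub. As an illustrative sketch (not from the original post), it could be filled in like this to extract the page title; the selector and item fields are assumptions for the example:

```python
# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Extract the page title and yield it as a simple dict item.
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
```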
3. The check command
The check command runs contract checks against a spider file: `scrapy check <spider_name>`. Checking the test spider passes:

```
[root@aliyun myfirstpjt]# scrapy check test
----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
```
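"Ran 0 contracts" means test.py does not define any contracts yet. Contracts live in a callback's docstring; the version below is an illustrative sketch (not from the original post) showing what they look like:

```python
# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        """Contracts exercised by `scrapy check test`.

        @url http://www.baidu.com/
        @returns items 1 1
        @returns requests 0 0
        @scrapes title
        """
        # Must actually scrape a `title` field for @scrapes to pass.
        yield {'title': response.xpath('//title/text()').extract_first()}
```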
4. The crawl command
The crawl command starts a specific spider: `scrapy crawl <spider_name>`.

```
[root@aliyun myfirstpjt]# scrapy crawl test --loglevel=INFO
2018-03-16 18:35:39 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: myfirstpjt)
2018-03-16 18:35:39 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.4.2 (default, Mar 15 2018, 10:26:10) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2k-fips  26 Jan 2017), cryptography 2.1.4, Platform Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-03-16 18:35:39 [scrapy.crawler] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['myfirstpjt.spiders'], 'BOT_NAME': 'myfirstpjt', 'NEWSPIDER_MODULE': 'myfirstpjt.spiders'}
……
 'start_time': datetime.datetime(2018, 3, 16, 10, 35, 39, 671815)}
2018-03-16 18:35:39 [scrapy.core.engine] INFO: Spider closed (finished)
```
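The same crawl can also be started from a Python script instead of the command line. The script below is a sketch (not from the original post) using Scrapy's CrawlerProcess; it assumes it is run from inside the project directory and that the spider is the test spider created above:

```python
# run_crawl.py -- hypothetical script, roughly equivalent to `scrapy crawl test`
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings (the same ones `scrapy settings` reports).
process = CrawlerProcess(get_project_settings())
process.crawl('test')   # spider name, as listed by `scrapy list`
process.start()         # blocks until the crawl finishes
```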
5. The list command
The list command lists the spiders available in the current project:

```
[root@aliyun myfirstpjt]# scrapy list
test
```
6. The edit command
The edit command opens a spider file directly in an editor, which is most convenient on Linux:

```
[root@aliyun myfirstpjt]# scrapy edit test
```
7. The parse command
The parse command fetches the given URL and processes it with the corresponding spider, showing the items and requests the spider produces:

```
[root@aliyun myfirstpjt]# scrapy parse http://www.baidu.com --nolog
>>> STATUS DEPTH LEVEL 0 <<<
# Scraped Items  ------------------------------------------------------------
[]
# Requests  -----------------------------------------------------------------
[]
```

Both lists are empty here because the test spider's parse method does nothing yet.
Reposted from: https://blog.51cto.com/lsfandlinux/2087747