
Web scraping in Python with Scrapy

The overall approach is:

first, test your selectors,

then write the parsing logic,

and extract the links to the next level of pages.

Create the project:

scrapy startproject myproject
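
This generates the standard Scrapy skeleton, roughly the following (the exact files vary a little between Scrapy versions):

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The spider file we write below goes into the spiders/ directory.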

Open a cmd window and start the Scrapy shell:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/


fetch(r"<url>") downloads the given page

view(response) opens the response in a browser to check that the page is what you expect
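
A typical shell session might look like this (the URL is just the one used above):

fetch("http://www.dmoz.org/Computers/Programming/Languages/Python/Books/")
view(response)   # opens the downloaded page in your default browser
response.status  # 200 if the fetch succeeded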

response.xpath("//div[@class='site-title']")  # parse the page
response.xpath("//div[contains(@class, 'site-descr')]/text()")[0].extract()
response.xpath("//div[@class='site-title']")[0].xpath("..").xpath("@href")[0].extract()  # ".." moves up to the parent node, "@" selects an attribute
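
To see how the ".." / "@href" chaining works outside the shell, here is a minimal, self-contained sketch using scrapy's Selector on made-up HTML (the markup and values are invented purely for illustration):

from scrapy.selector import Selector

# Hypothetical markup mimicking the dmoz listing structure
html = '<a href="/some/link"><div class="site-title">Example Book</div></a>'
sel = Selector(text=html)

title_div = sel.xpath("//div[@class='site-title']")[0]
href = title_div.xpath("..").xpath("@href")[0].extract()  # ".." climbs to the <a>, "@href" reads its attribute
print(href)  # -> /some/link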


Create spiders\myproject_spider.py:

from scrapy import Spider
from scrapy.http import Request


class DmozSpider(Spider):
    name = 'dmoz'
    # allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        # "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"  # the starting pages, i.e. the first level
    ]

    def start_requests(self):  # the URLs the crawl starts from
        for url in self.start_urls:
            yield Request(url, callback=self.parse_aaa)

    def parse_aaa(self, response):  # parse the first-level page and act on it
        title = response.xpath("//div[@class='site-title']/text()")[0].extract()
        description = response.xpath("//div[contains(@class, 'site-descr')]/text()")[0].extract()
        link = response.xpath("//div[@class='site-title']")[0].xpath("..").xpath("@href")[0].extract()
        # print(title)
        # print(description)
        print(link)  # the link to the next level
        # urljoin handles relative hrefs; Request needs an absolute URL
        yield Request(response.urljoin(link), callback=self.parse_bbb)  # crawl and parse the next level

    def parse_bbb(self, response):
        print("hello")
        bb = response.xpath("//a/text()")[0].extract()
        print(bb)
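
If you want Scrapy to collect the scraped values instead of just printing them, a common variant (a sketch, not part of the original post) is to yield a plain dict from the callback; Scrapy's feed exporters can then write it out:

    def parse_aaa(self, response):
        title = response.xpath("//div[@class='site-title']/text()")[0].extract()
        description = response.xpath("//div[contains(@class, 'site-descr')]/text()")[0].extract()
        yield {"title": title, "description": description}  # collected by Scrapy rather than printed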

Test it from cmd:

scrapy crawl dmoz
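
To save whatever the spider yields to a file, you can add Scrapy's -o option, for example:

scrapy crawl dmoz -o items.json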

Reference links:

https://www.cnblogs.com/wuxl360/p/5567631.html

https://www.xncoding.com/2016/03/10/scrapy-02.html

On selectors and parsing:

https://www.xncoding.com/2016/03/14/scrapy-04.html

http://cuiqingcai.com/2621.html