Notes on Scraping a Direct-Hire Job Site

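A friend recently asked me to scrape the job postings from a direct-hire recruiting site, and since I had some free time I gave it a try.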


A site like this almost certainly has anti-scraping measures, so I drove a real browser with Selenium + Chrome.

Main tools used:

Python 3.5

selenium

scrapy

The site's listings can be queried by city, so the first step is to fetch the list of cities the site supports.

That request is made through scrapy's self.start_urls:

self.start_urls = ['https://www.zhipin.com/wapi/zpCommon/data/city.json',]

At the same time, Selenium opens the site's home page:

self.driver.get('https://www.zhipin.com/')

It turned out the site can detect Selenium and returns no data, so I added:

options = webdriver.ChromeOptions()

options.add_experimental_option('excludeSwitches', ['enable-automation'])

self.driver = webdriver.Chrome(options=options)

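These options launch Chrome without its enable-automation switch (loosely, "developer mode"), and after that the data came back normally.

Next, parse the names of the searchable cities and collect them into a format we can use: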

    dic = {}
    city_list = json.loads(response.text)['zpData']['cityList']
    for entry in city_list:
        # Province name, used as the dictionary key
        province = entry['name']
        sub_cities = entry['subLevelModelList']
        # A province lists its cities; a municipality with at most one
        # sub-entry is stored as a single city named after itself
        if len(sub_cities) > 1:
            citys = [sub['name'] for sub in sub_cities]
        else:
            citys = [province]
        dic[province] = citys
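To make the resulting shape concrete, here is a minimal, self-contained sketch of the same grouping run against a made-up payload. The function name build_city_map and the sample data are illustrative assumptions; only the field names (zpData, cityList, name, subLevelModelList) come from the code above.

```python
import json


def build_city_map(raw_json):
    """Group searchable city names under their province (hypothetical helper)."""
    city_list = json.loads(raw_json)['zpData']['cityList']
    city_map = {}
    for entry in city_list:
        province = entry['name']
        # Assumption: treat a missing/null sub-list like an empty one
        sub_cities = entry['subLevelModelList'] or []
        if len(sub_cities) > 1:
            city_map[province] = [sub['name'] for sub in sub_cities]
        else:
            # Municipality: the "province" is itself the only city
            city_map[province] = [province]
    return city_map


# Made-up sample that only mimics the fields used above; not real site data.
SAMPLE = json.dumps({
    'zpData': {'cityList': [
        {'name': '北京', 'subLevelModelList': [{'name': '北京'}]},
        {'name': '广东', 'subLevelModelList': [{'name': '广州'}, {'name': '深圳'}]},
    ]}
})

city_map = build_city_map(SAMPLE)
print(city_map)
```

With this sample, the municipality maps to a one-element list containing its own name, and the province maps to the list of its cities, which is the structure the search loop later iterates over.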

With the preparation done, the next step is to request the data:

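    # Home-page search box; only a stepping stone to the results page
    # (By is imported from selenium.webdriver.common.by)
    self.driver.find_element(By.XPATH, '//*[@id="wrap"]/div[3]/div/div/div[1]/form/div[2]/p/input').send_keys('job title to search for')
    sleep(2)
    self.driver.find_element(By.XPATH, '//*[@id="wrap"]/div[3]/div/div/div[1]/form/button').click()
    sleep(2)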

At this point the program is on track, so here is the full code:

# -*- coding: utf-8 -*-
import json
import random
import re
from time import sleep

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By

from ..items import ZhaopinBossZhipinItem  # defined in items.py (not shown)


class ZP_boss(scrapy.Spider):
    # boss > dental job postings by region
    name = "boss"
    custom_settings = {
        'ITEM_PIPELINES': {'zhaopin_bosszhipin.pipelines.ZhaopinBossPipeline': 300},
        'DOWNLOADER_MIDDLEWARES': {'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 1},
        'DOWNLOAD_DELAY': 0.5,
        'MYEXT_ENABLED': True,
    }

    def __init__(self, *args, **kwargs):
        super(ZP_boss, self).__init__(*args, **kwargs)
        self.allowed_domains = ["zhipin.com"]  # domain names only, no scheme
        # JSON endpoint listing the cities the site lets you search
        self.start_urls = ['https://www.zhipin.com/wapi/zpCommon/data/city.json']
        options = webdriver.ChromeOptions()
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        self.driver = webdriver.Chrome(options=options)
        self.driver.maximize_window()  # maximize the browser window
        self.driver.get('https://www.zhipin.com/')

    def parse(self, response):
        dic = {}
        city_list = json.loads(response.text)['zpData']['cityList']
        for entry in city_list:
            # Province name, used as the dictionary key
            province = entry['name']
            sub_cities = entry['subLevelModelList']
            # A province lists its cities; a municipality is stored as itself
            if len(sub_cities) > 1:
                citys = [sub['name'] for sub in sub_cities]
            else:
                citys = [province]
            dic[province] = citys

        # Home-page search box; only a stepping stone to the results page
        self.driver.find_element(By.XPATH, '//*[@id="wrap"]/div[3]/div/div/div[1]/form/div[2]/p/input').send_keys('python')
        sleep(2)
        self.driver.find_element(By.XPATH, '//*[@id="wrap"]/div[3]/div/div/div[1]/form/button').click()
        sleep(2)

        for prov in dic.keys():  # every province collected above
            cts = dic[prov]  # all cities in this province or municipality
            for ct in cts:  # a single city name
                query = 'job title to search for' + ct
                self.driver.find_element(By.XPATH, '//p[@class="ipt-wrap"]/input[@name="query"]').clear()
                self.driver.find_element(By.XPATH, '//p[@class="ipt-wrap"]/input[@name="query"]').send_keys(query)
                sleep(0.2)
                self.driver.find_element(By.XPATH, '//button[@class="btn btn-search"]').click()  # run the search
                sleep(1)
                panduan = True  # "keep paging" flag
                while panduan:  # page through the results
                    sou = Selector(text=self.driver.page_source)
                    # Every <li> on the page is one job posting
                    link_lens = sou.xpath('//*[@id="main"]/div/div[2]/ul/li').extract()
                    for link_text in link_lens:
                        sel = Selector(text=link_text)
                        # Employer
                        company = ''.join(sel.xpath('//div[@class="company-text"]/h4/a/text()').extract()).strip()
                        # City
                        city = ct
                        # Education requirement
                        education = ''.join(sel.xpath('//div[@class="info-primary"]/p/text()[3]').extract()).strip()
                        # Work experience
                        experience = ''.join(sel.xpath('//div[@class="info-primary"]/p/text()[2]').extract()).strip()
                        # Location text of the posting
                        adrs_text = sel.xpath('//p/text()').extract()
                        # The page occasionally misbehaves; skip postings with no
                        # location text or no match instead of raising
                        # IndexError / AttributeError
                        if not adrs_text:
                            continue
                        adrs_match = re.search(r'(\w+?)\s', adrs_text[0])  # city of this posting
                        if adrs_match is None:
                            continue
                        adrs = adrs_match.group().strip()
                        if adrs != ct:
                            # When nothing matches, the site falls back to other cities
                            # in the same province; drop those, keep exact matches only
                            panduan = False
                            continue
                        are = re.search(r'\s(\w+?)\s', adrs_text[0])  # district within the city
                        area = are.group().strip() if are else ''
                        main_url = 'https://www.zhipin.com'
                        link_href = ''.join(sel.xpath('//div[@class="info-primary"]/h4[@class="name"]/a/@href').extract()).strip()
                        url = main_url + link_href
                        # Index of the posting, used to locate its link
                        href_index = ''.join(sel.xpath('//div[@class="info-primary"]/h4[@class="name"]/a/@data-index').extract()).strip()
                        # Click through to the detail page
                        link_page = self.driver.find_element(By.XPATH, '//div[@class="info-primary"]/h4/a[@data-index="{}"]/div[@class="job-title"]'.format(href_index))
                        link_page.click()
                        # Switch the driver to the new window to read the detail page
                        n = self.driver.window_handles  # list of all windows, index starts at 0
                        self.driver.switch_to.window(n[1])  # page_source now refers to the new page
                        sleep(1)
                        se = Selector(text=self.driver.page_source)
                        # Job title
                        job_name = ''.join(se.xpath('//div[@class="name"]/h2/text()').extract()).strip()
                        # Salary
                        salary = ''.join(se.xpath('//div[@class="name"]/span[@class="salary"]/text()').extract()).strip()
                        # Benefits
                        welfare = ';'.join(se.xpath('//*[@id="main"]/div[1]/div/div/div[2]/div[3]/div[2]/span/text()').extract()).strip()
                        # Publish time
                        publishtime = ''.join(re.findall(r'\d+.*', ''.join(se.xpath('//*[@id="main"]/div[3]/div/div[1]/div[2]/p[@class="gray"]/text()').extract()))).strip()
                        # Job responsibilities
                        Duty = ''.join(se.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div[@class="text"]').extract()).strip()
                        # Full address
                        address = ''.join(se.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div/div[@class="job-location"]/div[@class="location-address"]/text()').extract()).strip()
                        print('Published:', publishtime)
                        print('Job title:', job_name)
                        print('Employer:', company)
                        print('Education:', education)
                        print('Experience:', experience)
                        print('Salary:', salary)
                        print('Benefits:', welfare)
                        print('Address:', address)
                        print('Responsibilities:', Duty)
                        # Close the detail window; leaving them open eats resources
                        # and can bring the machine down on large crawls
                        self.driver.close()
                        sleep(0.5)
                        self.driver.switch_to.window(n[0])  # back to the results page

                    # A full page holds 30 postings (30 <li> tags);
                    # fewer than 30 means there is no next page
                    if len(link_lens) < 30:
                        print('No next page')
                        panduan = False
                    elif panduan:
                        # panduan may already be False here: a next page can exist but
                        # hold other cities' data (see the adrs != ct check above)
                        if ''.join(sou.xpath('//a[@ka="page-next"]/@href').extract()) == "javascript:;":
                            # The site shows at most 10 pages; without this check the loop never ends
                            panduan = False
                        else:
                            next_page = self.driver.find_element(By.XPATH, '//a[@ka="page-next"]')  # next-page button
                            next_page.click()
                            print('Moving on to the next page')
                            sleep(random.randint(1, 5))  # pause to lower the risk of an IP ban
                sleep(random.randint(5, 15))
        self.driver.quit()  # crawl finished, shut down the browser process
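Two pieces of logic in the loop are easy to get wrong: pulling the city and district out of the location text, and deciding when to stop paging. As a rough sketch, they can be lifted into small pure functions and checked without a browser. The names parse_location, has_next_page, PAGE_SIZE and DEAD_NEXT_HREF are hypothetical, and the sample strings are guesses at the "city district area" layout the regexes imply, not captured site output.

```python
import re

PAGE_SIZE = 30                   # a full results page holds 30 postings, per the code above
DEAD_NEXT_HREF = "javascript:;"  # href of the next-page link once it is disabled


def parse_location(text):
    """Return (city, district), or None when no city can be found."""
    city_match = re.search(r'(\w+?)\s', text)
    if city_match is None:
        # Guard: calling .group() on None is the error the original warns about
        return None
    district_match = re.search(r'\s(\w+?)\s', text)
    district = district_match.group(1) if district_match else ''
    return city_match.group(1), district


def has_next_page(items_on_page, next_href, city_matched):
    """Mirror the three stop conditions used when paging through results."""
    if not city_matched:             # site fell back to other cities' postings
        return False
    if items_on_page < PAGE_SIZE:    # short page: nothing after it
        return False
    return next_href != DEAD_NEXT_HREF  # disabled link marks the last page


print(parse_location('北京 朝阳区 望京'))
print(has_next_page(30, '/job_list/?page=2', True))
```

Keeping these checks separate from the Selenium calls means the stop conditions can be tested quickly, which matters because a missed condition here shows up as an endless loop in the browser.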

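That completes the crawl.

The pipelines, settings, and item code is the usual boilerplate, so I won't post it here.

Because this relies on Selenium, the crawl is never going to be fast.

Data is priceless, so scrape with care.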

