Python爬虫基础之广西人才网的信息爬取(1) - 简单爬虫
最近刚学习了Python的语法,迫不及待想做自己的第一个爬虫了,这里记录自己编写一个简单的爬虫程序爬取广西人才网的过程
前期准备
1.拟定好要爬取的数据,因为这是第一次做爬虫,简单爬取一下广西人才网上各大编程语言的招聘信息,以每个岗位的标题为准,之后再做扩展
2.分析网页结构,来到广西人才网首页,如图所示,在分类中找到本次要爬取的岗位分类,都是开发类:
点进这两个分类的,可以看到浏览器URL栏中他们的URL分别为:
https://s.gxrc.com/sJob?schType=1&expend=1&PosType=5480
https://s.gxrc.com/sJob?schType=1&expend=1&PosType=5484
点击进去,可以看到如图的页面:
可以看出,本次,我们主要需要爬取分析的就是这些岗位的标题部分,以及薪资部分,这两个分类,每个分类都有几十页,这里如何获取到每一页的URL呢,我们试着跳转到本分类的下一页,可以看到,URL变成了https://s.gxrc.com/sJob?schType=1&expend=1&PosType=5480&page=2
再多往后面后面翻几页,可以印证我们的猜想,这个网站各分类中每一页的岗位信息分别为: https://s.gxrc.com/sJob?schType=1&expend=1&PosType=岗位类型&page=页码
开始编写代码
1.这里直接上代码,关于Beautiful Soup具体用法和如何查看页面中元素的标签这里不再赘述,网上有详细教程,不在此篇讨论范围之内
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
from bs4 import BeautifulSoup
import urllib.request
from collections import OrderedDict
# IT类工作地址
listTypes = ['5480', '5484']
jobsNum = []
jobList = ["c#/.net", "java", "php"]
dicResult = OrderedDict()
# 伪装浏览器头部
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/81.0.4044.138 Safari/537.36'
}
for type_number in listTypes:
url_prefix = 'https://s.gxrc.com/sJob?schType=1&expend=1&PosType=' + type_number + '&page='
# Request类的实例,构造时需要传入Url,Data,headers等等的内容
request = urllib.request.Request(url=url_prefix + '1', headers=headers)
first_page = urllib.request.urlopen(request)
soup = BeautifulSoup(first_page, 'lxml')
intLastPageNumber = int(soup.find('i', {"id": "pgInfo_last"}).text)
urls = [url_prefix + str(i) for i in range(1, intLastPageNumber + 1)]
for url in urls:
request = urllib.request.Request(url=url, headers=headers)
page = urllib.request.urlopen(request)
search_result = BeautifulSoup(page, 'lxml').find_all(name='div', attrs='rlOne')
bolTypeRight = False
for job in search_result:
tag_a = job.find('a')
href = tag_a.get('href')
company = job.find('li', 'w2').text
salary = job.find('li', 'w3').text
jobName = tag_a.text.lower()
if href not in jobsNum:
jobsNum.append(href)
for jobType in jobList:
if '-' in salary:
if '/' in jobType:
bolTypeRight = jobType.split('/')[0] in jobName or jobType.split('/')[1] in jobName
else:
bolTypeRight = jobType in jobName
if bolTypeRight:
if jobType not in dicResult.keys():
dicResult[jobType] = [int(salary.split('-')[0])]
else:
dicResult[jobType].append(int(salary.split('-')[0]))
print(company + " " + jobName + ":" + salary + " " + href)
print('广西人才网IT岗统计结果(按岗位标题)')
print('总岗位数量:' + str(len(jobsNum)))
for resultKey, value in dicResult.items():
print(resultKey + '岗位总数量:' + str(len(value)) + ',平均工资(按岗位最低工资为准):' + str(sum(value) / len(value)))
本文由作者按照 CC BY 4.0 进行授权