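Aoao recently got stuck trying to scrape data that only shows up after JavaScript rendering, so it was time to bring out our secret weapon, the mighty Selenium. Our test environment is a BaoTa (宝塔) terminal, so it needs a bit of setup first:
If the API weren't so outrageously expensive, Aoao wouldn't be scraping data in such a clumsy way, hahaha!
Step 1: Install Chrome and the chromedriver driver
First, download the package with wget: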
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
Then install the package:
sudo yum localinstall google-chrome-stable_current_x86_64.rpm
Once the install finishes, check the version number:
google-chrome --version
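As you can see, Chrome is installed. The version we got is 110.0.5481.177, and we'll use that number to pick the matching driver.
-----------------------------------Break time: slack off with Aoao for a bit-------------------------------
Right, let's install the driver, chromedriver
First open the official site (if it's too slow, bring your own proxy):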
http://chromedriver.storage.googleapis.com/index.html
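Look carefully for the version that matches your Chrome. A mismatched driver won't work, so pay attention here!
But the list doesn't have the exact version we found. Isn't that torture? Aoao didn't lose heart, though; tinkering is half the fun. Moving on: a build with the same major version is enough.
Our version is 110.0.5481.177, so we take 110.0.5481.77.
Which file you need depends on your platform. We're on BaoTa, so we use the linux64 build; for other platforms you're on your own! Windows is actually the simplest, so I won't go into it.
Download and unzip it and you have the driver file. Now toss it onto the BaoTa server!
Remember which directory you uploaded it to. Any directory works, since you'll be setting the path yourself in a moment anyway.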
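If uploading through the panel feels like a chore, the driver can probably be fetched straight from the server instead. Below is a rough sketch, assuming the storage bucket keeps the `<version>/chromedriver_linux64.zip` layout and the `LATEST_RELEASE_<major>` helper files; the function names are made up for illustration and are not part of any library.

```python
import io
import urllib.request
import zipfile

# Assumption: the bucket linked above serves files under this base URL.
BASE_URL = "https://chromedriver.storage.googleapis.com"


def latest_release_url(major):
    """URL of the text file that names the newest driver for a Chrome major version."""
    return f"{BASE_URL}/LATEST_RELEASE_{major}"


def driver_zip_url(version, platform="linux64"):
    """URL of the driver archive for an exact driver version and platform."""
    return f"{BASE_URL}/{version}/chromedriver_{platform}.zip"


def download_driver(version, dest_dir="."):
    """Download the archive into memory and unpack it into dest_dir."""
    with urllib.request.urlopen(driver_zip_url(version)) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    # extractall() does not restore the executable bit,
    # so the chmod +x step is still needed afterwards.
    archive.extractall(dest_dir)
```

Calling `download_driver("110.0.5481.77", "/tmp")` should leave a `chromedriver` file in `/tmp`, ready to be moved and made executable.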
Next, move chromedriver to its new home:
mv chromedriver /usr/bin/
Then make it executable:
chmod +x /usr/bin/chromedriver
Then check its version:
chromedriver -version
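Since everything hinges on the two major versions agreeing, the check can be scripted rather than eyeballed. This is a minimal sketch, assuming both commands print a four-part version number somewhere in their output; `major_version`, `versions_match` and `read_version` are hypothetical helper names.

```python
import re
import subprocess


def major_version(version_output):
    """Pull the major version out of text like 'Google Chrome 110.0.5481.177'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError("no version number found in %r" % version_output)
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and chromedriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def read_version(command):
    """Run a command such as ['google-chrome', '--version'] and return its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

On the server, `versions_match(read_version(["google-chrome", "--version"]), read_version(["chromedriver", "--version"]))` should come back `True` for the 110.0.5481.177 and 110.0.5481.77 pair used here.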
That's pretty much the setup done! Let's get to work!!!
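Moving the driver into /usr/bin means Selenium can find it on PATH. If yours ended up somewhere else, the path can be passed in explicitly. Here's a sketch assuming Selenium 4, where the driver location goes through a `Service` object; `resolve_driver_path` and `make_driver` are names invented for this example.

```python
import os


def resolve_driver_path(candidates):
    """Return the first candidate that is an existing, executable file, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def make_driver(driver_path):
    """Start headless Chrome using the chromedriver binary at driver_path."""
    # Imported here so resolve_driver_path() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    return webdriver.Chrome(service=Service(executable_path=driver_path), options=options)
```

For example, `make_driver(resolve_driver_path(["/usr/bin/chromedriver", "/www/chromedriver"]))`, where the second path is only a placeholder for wherever you uploaded the file.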
Step 2: Hook it up with a script
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Note: proxy, get_header() and url are defined elsewhere in the full script and are not shown here.

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # headless mode: no browser window is shown
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--start-maximized')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--proxy-server={}'.format(proxy.http_proxy))
# Set headers
headers = get_header()
# Start Chrome
driver = webdriver.Chrome(options=chrome_options)
# Load the page
driver.get(url)
# Wait 3 seconds for the page to finish rendering
time.sleep(3)
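A fixed sleep is either too short on a slow day or wasted time on a fast one. One alternative is to poll until the content actually shows up. Selenium ships its own `WebDriverWait` for this; the snippet below is just a hand-rolled sketch of the same idea in plain Python, and `wait_until` is a hypothetical helper, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns something truthy, or raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)
```

With a driver in hand this could replace the sleep, for example `wait_until(lambda: driver.find_elements(By.TAG_NAME, 'img'))`, since `find_elements` returns an empty (falsy) list until something matches.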
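That's just one snippet from me. For anything more, search around yourself; here I'm only covering how to use it!
I was originally writing a crawler for a webmaster tool, but it still has a few problems, so I'm not publishing it!
If any experts out there are interested, though, ping me on QQ 741500926!
Addendum, working source code: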
import os
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

os.system('clear')
world = input("Enter a domain to look up its SEO weight: ")
url = 'https://seo.chinaz.com/' + str(world)

driver = None
try:
    # Configure Chrome options
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # headless mode: no browser window is shown
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--start-maximized')
    chrome_options.add_argument('--ignore-certificate-errors')
    # chrome_options.add_argument('--proxy-server={}'.format(proxies))
    # Start Chrome
    driver = webdriver.Chrome(options=chrome_options)
    # Load the page
    driver.get(url)
    # Wait 1 second for the page to finish rendering
    time.sleep(1)
    img_elems = driver.find_elements(By.XPATH, '//div[@class="_chinaz-seo-newrank"]/span/a/img')
    for img_elem in img_elems:
        src_info = img_elem.get_attribute('src')
        if 'baidu0' in src_info:
            print('Baidu PC weight 0')
        elif 'baidu1' in src_info:
            print('Baidu PC weight 1')
        elif 'baidu2' in src_info:
            print('Baidu PC weight 2')
        elif 'baidu3' in src_info:
            print('Baidu PC weight 3')
        elif 'baidu4' in src_info:
            print('Baidu PC weight 4')
        elif 'baidu5' in src_info:
            print('Baidu PC weight 5')
        elif 'baidu6' in src_info:
            print('Baidu PC weight 6')
        elif 'baidu7' in src_info:
            print('Baidu PC weight 7')
        elif 'baidu8' in src_info:
            print('Baidu PC weight 8')
        elif 'baidu9' in src_info:
            print('Baidu PC weight 9')
        elif 'bd0' in src_info:
            print('Baidu mobile weight 0')
        elif 'bd1' in src_info:
            print('Baidu mobile weight 1')
        elif 'bd2' in src_info:
            print('Baidu mobile weight 2')
        elif 'bd3' in src_info:
            print('Baidu mobile weight 3')
        elif 'bd4' in src_info:
            print('Baidu mobile weight 4')
        elif 'bd5' in src_info:
            print('Baidu mobile weight 5')
        elif 'bd6' in src_info:
            print('Baidu mobile weight 6')
        elif 'bd7' in src_info:
            print('Baidu mobile weight 7')
        elif 'bd8' in src_info:
            print('Baidu mobile weight 8')
        elif 'bd9' in src_info:
            print('Baidu mobile weight 9')
        elif 'sogou0' in src_info:
            print('Sogou weight 0')
        elif 'sogou1' in src_info:
            print('Sogou weight 1')
        elif 'sogou2' in src_info:
            print('Sogou weight 2')
        elif 'sogou3' in src_info:
            print('Sogou weight 3')
        elif 'sogou4' in src_info:
            print('Sogou weight 4')
        elif 'sogou5' in src_info:
            print('Sogou weight 5')
        elif 'sogou6' in src_info:
            print('Sogou weight 6')
        elif 'sogou7' in src_info:
            print('Sogou weight 7')
        elif 'sogou8' in src_info:
            print('Sogou weight 8')
        elif 'sogou9' in src_info:
            print('Sogou weight 9')
        elif 'bing0' in src_info:
            print('Bing weight 0')
        elif 'bing1' in src_info:
            print('Bing weight 1')
        elif 'bing2' in src_info:
            print('Bing weight 2')
        elif 'bing3' in src_info:
            print('Bing weight 3')
        elif 'bing4' in src_info:
            print('Bing weight 4')
        elif 'bing5' in src_info:
            print('Bing weight 5')
        elif 'bing6' in src_info:
            print('Bing weight 6')
        elif 'bing7' in src_info:
            print('Bing weight 7')
        elif 'bing8' in src_info:
            print('Bing weight 8')
        elif 'bing9' in src_info:
            print('Bing weight 9')
        elif '3600' in src_info:
            print('360 weight 0')
        elif '3601' in src_info:
            print('360 weight 1')
        elif '3602' in src_info:
            print('360 weight 2')
        elif '3603' in src_info:
            print('360 weight 3')
        elif '3604' in src_info:
            print('360 weight 4')
        elif '3605' in src_info:
            print('360 weight 5')
        elif '3606' in src_info:
            print('360 weight 6')
        elif '3607' in src_info:
            print('360 weight 7')
        elif '3608' in src_info:
            print('360 weight 8')
        elif '3609' in src_info:
            print('360 weight 9')
        elif 'shenma0' in src_info:
            print('Shenma weight 0')
        elif 'shenma1' in src_info:
            print('Shenma weight 1')
        elif 'shenma2' in src_info:
            print('Shenma weight 2')
        elif 'shenma3' in src_info:
            print('Shenma weight 3')
        elif 'shenma4' in src_info:
            print('Shenma weight 4')
        elif 'shenma5' in src_info:
            print('Shenma weight 5')
        elif 'shenma6' in src_info:
            print('Shenma weight 6')
        elif 'shenma7' in src_info:
            print('Shenma weight 7')
        elif 'shenma8' in src_info:
            print('Shenma weight 8')
        elif 'shenma9' in src_info:
            print('Shenma weight 9')
        else:
            print('Data error')
except Exception as e:
    print(e)
finally:
    # Close the browser even if something above failed
    if driver is not None:
        driver.quit()
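That wall of elif branches works, but every branch has the same shape: an engine prefix plus a digit from 0 to 9. Here's a sketch of how it could be collapsed into a small lookup table while checking in the same order. `ENGINES` and `describe_weight` are names made up for this example, and the English labels are my own rendering of the printed text.

```python
# (substring prefix, label) pairs, in the same order the elif chain checks them.
ENGINES = [
    ("baidu", "Baidu PC weight"),
    ("bd", "Baidu mobile weight"),
    ("sogou", "Sogou weight"),
    ("bing", "Bing weight"),
    ("360", "360 weight"),
    ("shenma", "Shenma weight"),
]


def describe_weight(src_info):
    """Map an image src to a label like 'Baidu PC weight 3', or 'Data error' if nothing matches."""
    for prefix, label in ENGINES:
        for digit in range(10):
            if f"{prefix}{digit}" in src_info:
                return f"{label} {digit}"
    return "Data error"
```

Inside the loop over `img_elems`, the whole chain would then shrink to `print(describe_weight(img_elem.get_attribute('src')))`.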
This is what it scrapes:
The result:
I've built an API for it too:
Keep learning; the more, the better!