A Failed Crawler, a Successful Attempt

After finishing the comics-site crawler, I wondered what else I could do with my shallow knowledge. Since my Python fundamentals are not that solid and my own requirements were fuzzy, I eventually settled on scraping novels from 笔趣阁: a way to practice Python and get familiar with bs4 and requests.

Because I wanted to use multiprocessing, the code is structured differently from before. Also, PyCharm opens the browser if you press Enter right after typing a URL, so by convention I type one extra character after the URL (a space by default) and strip it off in the program.
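Slicing off exactly one character breaks if the padding is ever forgotten. A more forgiving sketch (my suggestion, not what the code below does) is to strip surrounding whitespace instead:

# Strips leading/trailing whitespace, so the URL works whether or not
# a padding space was typed after it.
url = input('Enter the specific link to the novel: ').strip()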

First, scraping the novel's title:

# bqg_get_titles.py

from bs4 import BeautifulSoup
import requests


def get_titles(urlx):
    # Fetch the novel's index page and read its title from the <h1> tag.
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    return soup.select('h1')[0].get_text()
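Called on its own it looks like this (the book URL is a made-up example, not a real page):

# Hypothetical book index URL, for illustration only.
print(get_titles('https://www.biquge.com.cn/book/12345/'))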


Next, scraping the chapter list:

# bqg_get_chapters.py

from bs4 import BeautifulSoup
import requests
import pymongo

head = 'https://www.biquge.com.cn'
client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']


def get_chapters(urlx):
    # Fetch the index page and store every chapter link, numbered in page
    # order, into the `chapters` collection.
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    local_chapters = soup.select('dd > a')
    for index, each in enumerate(local_chapters):
        local_chapter = {
            'index': index,
            'url': head + each.get('href')  # hrefs on the page are relative
        }
        chapters.insert_one(local_chapter)
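The main program later iterates chapters.find() and relies on insertion order; sorting by the stored index is the safer way to read the links back. A quick sketch:

# Read the chapter links back in guaranteed page order.
for doc in chapters.find().sort('index'):
    print(doc['index'], doc['url'])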


Then, scraping the chapter content:

# bqg_get_articles.py

from bs4 import BeautifulSoup
import requests
import pymongo

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
articles = BQG['articles']


def get_articles(urlx):
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    wb_data.close()  # release the connection once the page is parsed
    # 'T|C' flags a document as a chapter title (1) or a content paragraph (0).
    title = {'T|C': 1, 'm': soup.select('div.bookname > h1')[0].get_text()}
    articles.insert_one(title)
    # Crude string surgery: stringify the result set, strip the wrapping
    # <div id="content"> markup and the \xa0 non-breaking spaces, then
    # split into paragraphs on double <br/> tags.
    content = str(soup.select('#content')).replace('[<div id="content">', '').replace('</div>]', '').replace('\xa0', '').split('<br/><br/>')
    for each in content:
        paragraph = {
            'T|C': 0,
            'm': each
        }
        articles.insert_one(paragraph)
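The string surgery works, but BeautifulSoup can do the same job directly. A sketch of an alternative, assuming the page keeps its text inside <div id="content"> with paragraphs separated by <br/> tags:

# Each text node between <br/> tags becomes one paragraph;
# stripped_strings also trims the \xa0 indentation for us.
content_div = soup.select_one('#content')
paragraphs = list(content_div.stripped_strings)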

The main program (main.py):

import pymongo
from multiprocessing import Pool
from zerox.bqg_get_chapters import get_chapters
from zerox.bqg_get_articles import get_articles
from zerox.bqg_get_titles import get_titles

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
books = BQG['books']
chapters = BQG['chapters']
articles = BQG['articles']
rootpath = '/home/x/BORBER/File/Tmp/novel/'
filend = '.txt'


def write_in(item):
    # Titles ('T|C' == 1) get extra blank lines around them; everything
    # else is a paragraph of content.
    if item['T|C'] == 1:
        file.write(item['m'])
        file.write('\n\n\n\n')
    else:
        print('  ')  # crude progress output, one line per paragraph written
        file.write(item['m'])
        file.write('\n\n')


if __name__ == '__main__':
    # Start from a clean slate on every run.
    BQG.drop_collection('books')
    BQG.drop_collection('chapters')
    BQG.drop_collection('articles')
    pool = Pool()  # created for the multiprocessing plan, but unused below
    print('Enter the specific link to the novel:')
    url = input()[:-1]  # drop the padding character added because of PyCharm
    title = get_titles(url)
    file = open(rootpath + title + filend, 'w')
    get_chapters(url)
    for item in chapters.find():
        get_articles(item['url'])
    for item in articles.find():
        write_in(item)
    file.close()
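Note that the Pool above is created but never used, so the chapters are actually fetched one by one. A sketch of how it could be wired in, with two caveats: pymongo recommends creating the MongoClient after the fork, so each worker should really open its own connection, and parallel inserts mean the paragraph order in articles is no longer guaranteed:

# Collect the chapter URLs first, then fan the downloads out to workers.
urls = [item['url'] for item in chapters.find().sort('index')]
with Pool() as pool:
    pool.map(get_articles, urls)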



A monitoring script (count.py) is also provided:

import time
import pymongo
import os

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']
articles = BQG['articles']

while True:
    # Clear the terminal and print a running total of stored documents.
    # count_documents({}) replaces the deprecated cursor .count().
    os.system('clear')
    print(chapters.count_documents({}) + articles.count_documents({}))
    time.sleep(2)

File structure:

-zero
	-zerox  # Python package
		-bqg_get_chapters.py
		-bqg_get_articles.py
		-bqg_get_titles.py
		-count.py
		-main.py
		-__init__.py

So why is this post called a failed crawler? Because it only succeeded twice; every run after that errored out, presumably thanks to 笔趣阁's anti-scraping measures. As usual, once my skills improve I will come back and fix this crawler. If you want to try it yourself in the meantime, a good first step is adding a headers dict to the get calls.
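A minimal sketch of that first step (the User-Agent string is only an example, and the delay between requests is my addition, not something the original code does):

import time
import requests

# Identify as a regular browser; many anti-scraping checks look at this header.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'}
url = 'https://www.biquge.com.cn'  # the site root used above
wb_data = requests.get(url, headers=headers)
time.sleep(1)  # pause between requests to ease the load on the site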

This crawler requires MongoDB.

Now to really put in the hard work on the fundamentals (๑•̀ㅂ•́)و✧