Beautiful Soup findAll doesn#39;t find them all(Beautiful Soup findAll 没有找到它们)
问题描述
I'm trying to parse a website and get some info with the find_all()
method, but it doesn't find them all.
This is the code:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())
manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)
for manga in manga_img:
print (manga['href'])
It only prints half of them...
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml
parser is not dealing very well with it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser
has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
Translating that to your specific code sample using urllib
, you would specify the parser thus:
soup = BeautifulSoup(page, 'html.parser') # BeatifulSoup can do the reading
这篇关于Beautiful Soup findAll 没有找到它们的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:Beautiful Soup findAll 没有找到它们
- 如何调试 CSS/Javascript 悬停问题 2022-01-01
- 是否可以将标志传递给 Gulp 以使其以不同的方式 2022-01-01
- 使用 iframe URL 的 jQuery UI 对话框 2022-01-01
- 在不使用循环的情况下查找数字数组中的一项 2022-01-01
- 从原点悬停时触发 translateY() 2022-01-01
- 如何向 ipc 渲染器发送添加回调 2022-01-01
- 如何显示带有换行符的文本标签? 2022-01-01
- 为什么我的页面无法在 Github 上加载? 2022-01-01
- 为什么悬停在委托事件处理程序中不起作用? 2022-01-01
- 我不能使用 json 使用 react 向我的 web api 发出 Post 请求 2022-01-01