Python ElementTree - extract text from element, stripping tags
Question
With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?
For example, say I have the following:
<tag>
Some <a>example</a> text
</tag>
I want to return "Some example text". How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.
Recommended answer
If you are running under Python 3.2+, you can use itertext(). (The method is also available in Python 2.7.)
itertext() creates a text iterator that loops over this element and all of its subelements, in document order, and yields each piece of inner text. Joining the pieces gives the element's content with the tags stripped:
import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
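One detail worth checking against the markup in the question: that version has line breaks inside <tag>, and itertext() returns them along with the words. The following is a minimal sketch, not part of the original answer, of one way to tidy the result. The variable names are illustrative, and collapsing every run of whitespace into a single space is an assumption about what the asker wants.

```python
import xml.etree.ElementTree as ET

# The multi-line markup from the question, line breaks included.
source = """<tag>
Some <a>example</a> text
</tag>"""

element = ET.fromstring(source)

# Joining the fragments keeps the newlines that surround the words.
raw = ''.join(element.itertext())

# Assumption: runs of whitespace (newlines too) should become single spaces.
normalized = ' '.join(raw.split())

print(repr(raw))
print(normalized)
```

If only the leading and trailing whitespace matters, raw.strip() would be enough and would leave the inner spacing untouched.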
If you are running an older version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly as above:
import xml.etree.ElementTree as ET

# original implementation of .itertext() for Python 2.7
def itertext(self):
    tag = self.tag
    # skip nodes whose tag is not a string (e.g. comments)
    if not isinstance(tag, basestring) and tag is not None:
        return
    # text that appears before the first child
    if self.text:
        yield self.text
    for e in self:
        # text inside the child, recursively
        for s in e.itertext():
            yield s
        # text that follows the child's closing tag
        if e.tail:
            yield e.tail

# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext

xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
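The traversal above relies on how ElementTree stores text: each element's text attribute holds what comes before its first child, and each child's tail holds what follows that child's closing tag. As a rough illustration only, the same walk can be written for Python 3 as a standalone function, which avoids patching the class. The name iter_inner_text is hypothetical and does not exist in the library.

```python
import xml.etree.ElementTree as ET

def iter_inner_text(element):
    """Yield the text fragments of element and its descendants in document order.

    Hypothetical helper that mirrors the logic of Element.itertext().
    """
    # Nodes such as comments have a non-string tag; their text is skipped.
    if not isinstance(element.tag, str) and element.tag is not None:
        return
    # Text located before the first child element.
    if element.text:
        yield element.text
    for child in element:
        # Text inside the child, gathered recursively.
        yield from iter_inner_text(child)
        # Text located after the child's closing tag.
        if child.tail:
            yield child.tail

root = ET.fromstring('<tag>Some <a>example</a> text</tag>')
fragments = list(iter_inner_text(root))
print(fragments)
print(''.join(fragments))
```

Printing the fragment list before joining shows which piece comes from text and which from tail, which can help when the joined output is not what was expected.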