Parsing large XML using SAX in Java
Question
I am trying to parse the Stack Overflow data dump. One of the tables is called posts.xml, which has around 10 million entries. Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="26" CreationDate="2010-07-07T19:06:25.043" Score="10" ViewCount="1192" Body="&lt;p&gt;Now that the Engineer update has come, there will be lots of Engineers building up everywhere. How should this best be handled?&lt;/p&gt;
" OwnerUserId="11" LastEditorUserId="56" LastEditorDisplayName="" LastEditDate="2010-08-27T22:38:43.840" LastActivityDate="2010-08-27T22:38:43.840" Title="SW4gVGVhbSBGb3J0cmVzcyAyLCB3aGF0IGlzIGEgZ29vZCBzdHJhdGVneSB0byBkZWFsIHdpdGggbG90cyBvZiBlbmdpbmVlcnMgdHVydGxpbmcgb24gdGhlIG90aGVyIHRlYW0/" Tags="&lt;strategy&gt;&lt;team-fortress-2&gt;&lt;tactics&gt;" AnswerCount="5" CommentCount="7" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="184" CreationDate="2010-07-07T19:07:58.427" Score="5" ViewCount="469" Body="&lt;p&gt;I know I can create a Warp Gate and teleport to Pylons, but I have no idea how to make Warp Prisms or know if there's any other unit capable of transporting.&lt;/p&gt;

&lt;p&gt;I would in particular like this to built remote bases in 1v1&lt;/p&gt;
" OwnerUserId="10" LastEditorUserId="68" LastEditorDisplayName="" LastEditDate="2010-07-08T00:16:46.013" LastActivityDate="2010-07-08T00:21:13.163" Title="V2hhdCBwcm90b3NzIHVuaXQgY2FuIHRyYW5zcG9ydCBvdGhlcnM/" Tags="&lt;starcraft-2&gt;&lt;how-to&gt;&lt;protoss&gt;" AnswerCount="3" CommentCount="2" />
<row Id="3" PostTypeId="1" AcceptedAnswerId="56" CreationDate="2010-07-07T19:09:46.317" Score="7" ViewCount="356" Body="&lt;p&gt;Steam won't let me have two instances running with the same user logged in.&lt;/p&gt;

&lt;p&gt;Does that mean I cannot run a dedicated server on a PC (for example, for Left 4 Dead 2) &lt;em&gt;and&lt;/em&gt; play from another machine?&lt;/p&gt;

&lt;p&gt;Is there a way to run the dedicated server without running steam? Is there a configuration option I'm missing?&lt;/p&gt;
" OwnerUserId="14" LastActivityDate="2010-07-07T19:27:04.777" Title="SG93IGNhbiBJIHJ1biBhIGRlZGljYXRlZCBzZXJ2ZXIgZnJvbSBzdGVhbT8=" Tags="&lt;steam&gt;&lt;left-4-dead-2&gt;&lt;dedicated-server&gt;&lt;account&gt;" AnswerCount="1" />
<row Id="4" PostTypeId="1" AcceptedAnswerId="14" CreationDate="2010-07-07T19:11:05.640" Score="10" ViewCount="201" Body="&lt;p&gt;When I get to the insult sword-fighting stage of The Secret of Monkey Island, do I have to learn every single insult and comeback in order to beat the Sword Master?&lt;/p&gt;
" OwnerUserId="17" LastEditorUserId="17" LastEditorDisplayName="" LastEditDate="2010-07-08T21:25:04.787" LastActivityDate="2010-07-08T21:25:04.787" Title="RG8gSSBoYXZlIHRvIGxlYXJuIGFsbCBvZiB0aGUgaW5zdWx0cyBhbmQgY29tZWJhY2tzIHRvIGJlIGFibGUgdG8gYWR2YW5jZSBpbiBUaGUgU2VjcmV0IG9mIE1vbmtleSBJc2xhbmQ/" Tags="&lt;monkey-island&gt;&lt;adventure&gt;" AnswerCount="3" CommentCount="2" />
I would like to parse this XML but only load certain attributes: Id, PostTypeId, AcceptedAnswerId and two others. Is there a way in SAX to load only these attributes? If so, how? I am pretty new to SAX, so some guidance would help.
Otherwise, loading the whole thing would be slow, and some of the attributes will never be used anyway, so loading them is wasted work.
One other question: is it possible to jump to a particular row that has a given row Id X? If so, how do I do this?
Recommended answer
The SAX "startElement" event lets you process a single XML element at a time.
In Java code you must implement this method:
@Override
public void startElement(String uri, String localName,
        String qName, Attributes attributes)
        throws SAXException {
    // Compare against qName: localName may be empty unless the parser is namespace-aware.
    if ("row".equals(qName)) {
        // This code is executed for every XML element "row".
        // Attribute names are case-sensitive, so use "Id", not "id".
        String id = attributes.getValue("Id");
        String postTypeId = attributes.getValue("PostTypeId");
        String acceptedAnswerId = attributes.getValue("AcceptedAnswerId");
        // Read the other two attributes the same way.
        // You now have the attribute values for one "row" element.
    }
}
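To show how such a handler plugs into a parser, here is a minimal, self-contained sketch. The class name PostRowHandler, the parse helper, and the two-row sample document are assumptions made for illustration, not part of the original answer. Note that SAX still hands the handler every attribute of each row; the saving comes from keeping only the values you ask for and never building the whole document in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler name; it keeps only the attributes of interest.
class PostRowHandler extends DefaultHandler {
    // Each entry holds {Id, PostTypeId, AcceptedAnswerId}; absent attributes are null.
    final List<String[]> rows = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // qName is used because the default SAXParserFactory is not namespace-aware,
        // in which case localName may be empty.
        if ("row".equals(qName)) {
            rows.add(new String[] {
                attributes.getValue("Id"),
                attributes.getValue("PostTypeId"),
                attributes.getValue("AcceptedAnswerId")
            });
        }
    }

    // Streams the input through SAX and returns the collected rows.
    static List<String[]> parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PostRowHandler handler = new PostRowHandler();
        parser.parse(in, handler);
        return handler.rows;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for posts.xml (assumed sample, not real dump data).
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
            + "<posts>"
            + "<row Id=\"1\" PostTypeId=\"1\" AcceptedAnswerId=\"26\" Body=\"&lt;p&gt;text&lt;/p&gt;\" />"
            + "<row Id=\"5\" PostTypeId=\"2\" Score=\"3\" />"
            + "</posts>";
        for (String[] r : parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(r[0] + " " + r[1] + " " + r[2]);
        }
    }
}
```

For the real file you would pass a FileInputStream (or use the SAXParser.parse overload that takes a File). With around 10 million rows it is probably better to process each row inside startElement, for example by writing it to a database, than to collect everything in a list as this sketch does.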
For every element, you can access:
- the namespace URI
- the XML qualified name (qName)
- the XML local name
- the attribute map, from which you can extract your other two attributes...
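The four pieces of information listed above can be inspected with a small probe. This is a sketch under assumptions: the class name NameProbe and the helper probe are hypothetical. It records what the parser passes to startElement, with namespace awareness switched on or off through SAXParserFactory.setNamespaceAware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe that records the arguments of every startElement call.
class NameProbe extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        StringBuilder sb = new StringBuilder();
        sb.append("uri=").append(uri)
          .append(" local=").append(localName)
          .append(" qName=").append(qName);
        // The attribute map can also be walked by index.
        for (int i = 0; i < attributes.getLength(); i++) {
            sb.append(" ").append(attributes.getQName(i)).append("=").append(attributes.getValue(i));
        }
        seen.add(sb.toString());
    }

    // Parses the given XML string and returns one line per start tag.
    static List<String> probe(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        NameProbe handler = new NameProbe();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<posts><row Id=\"1\" PostTypeId=\"1\" /></posts>";
        System.out.println(probe(xml, true));
        System.out.println(probe(xml, false));
    }
}
```

The SAX documentation allows localName to be empty when namespace processing is off, which is the factory default. That is why matching on qName is the safer choice for a file like posts.xml that uses no namespaces.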
See the ContentHandler implementation for specific details.
Bye
UPDATED: improved previous snippet.