August 16, 2008
New SgmlReader Release(s)
Steve Bjorg @ 7:51 am
With the launch of the MindTouch Deki 8.08 RC1, we’re also releasing updated versions of SgmlReader, the versatile .NET library written in C# for parsing HTML/SGML files. The benefit of SgmlReader is that it can cope with some fairly loosely formatted documents and convert most of the content into valid XML.
SgmlReader is being released in two versions: 1.7.5 and 1.8.0. 1.7.5 marks the last version compiled for .NET 1.1. Starting with 1.8.0, .NET 2.0 is required. Both can be downloaded from SourceForge.net and include compiled binaries, as well as the source code.
Improvements in 1.7.5 (.NET 1.1)
- Detect missing ending quote in attributes (e.g. <p class="para>...</p>)
- Each unknown prefix is mapped to a unique namespace, allowing duplicate local names (e.g. <p o:x="foo" m:x="bar">...</p>)
Improvements in 1.8.0 (.NET 2.0)
- Major code review & clean-up to use generics by jamesgmbutler (thanks!)
- Support XML-only entity &apos; in HTML/SGML documents (e.g. <p>It&apos;s ok</p>)
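For context on the last item: &apos; is one of the five entities predefined by XML, but it is not part of the HTML 4 entity set, which is why an HTML/SGML parser needs explicit support for it. The small sketch below uses only System.Xml (not SgmlReader) to show that a plain XML parser resolves it without any DTD; the class name AposEntityDemo is just an illustrative choice.

```csharp
using System;
using System.Xml;

static class AposEntityDemo {
    static void Main() {
        // &apos; is predefined in XML, so no DTD is needed to resolve it
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<p>It&apos;s ok</p>");

        // The entity is expanded to a plain apostrophe in the text content
        Console.WriteLine(doc.DocumentElement.InnerText); // prints: It's ok
    }
}
```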
To parse an HTML document into valid XML:
using System.IO;
using System.Xml;

XmlDocument FromHtml(TextReader reader) {

    // set up SgmlReader to treat the input as HTML
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create document, keeping whitespace and skipping external resolution
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
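As a rough usage sketch, the method above could be wrapped in a small console program like the one below. It assumes the SgmlReader assembly from the download is referenced by the project; the class name FromHtmlDemo and the sample markup are made up for illustration, and the exact XML written out will depend on the SgmlReader version and its built-in HTML DTD.

```csharp
using System;
using System.IO;
using System.Xml;

static class FromHtmlDemo {

    // Same steps as the FromHtml method shown in the post
    static XmlDocument FromHtml(TextReader reader) {
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;

        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
    }

    static void Main() {
        // Hypothetical sloppy input: upper-case tag, unquoted attribute,
        // unclosed <br> and <p> elements
        string html = "<html><body><P class=para>Hello<br>world</body></html>";

        using (TextReader reader = new StringReader(html)) {
            XmlDocument doc = FromHtml(reader);

            // Write the resulting well-formed XML to the console
            Console.WriteLine(doc.OuterXml);
        }
    }
}
```

Any TextReader works as input, so a StreamReader over a file or an HTTP response stream can be passed in the same way as the StringReader used here.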
All in all, some good improvements. If you have any recommendations on how to make it even better, please leave a comment or join us on the forums.




