DoctypeChanger is a Java class that lets you add, modify or remove the DOCTYPE declaration in a byte stream as it is fed into an XML parser.
DoctypeChanger is based on code by a nice fellow called Simon St.Laurent, who pointed me to it when I asked on the XML-DEV mailing list. The original code can be downloaded here. Thanks Simon!
Development of DoctypeChanger was sponsored by Social Change Online, a great company doing amazing things in webmapping and online GIS.
A DOCTYPE declaration, in case you were wondering, is the line at the top of some XML/SGML docs. For example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Most of your life you can happily ignore these things. Then one day you'll want to validate your XML, perhaps against a local DTD. "No problem, I'll just tell my parser that my local DTD is here and I always want to validate what I'm getting." Aha.. at this point you'll find that, due to some fundamental law, all XML parsers will stubbornly:
Now this isn't very useful. Say I'm getting XML from a remote source over which I have no control. If the incoming XML doesn't have a DOCTYPE, I simply can't validate. If it does have a DOCTYPE:
So until XML parsers change, it seems the only workaround is to intercept the incoming XML stream and rewrite the DOCTYPE declaration before the parser sees it. That is what DoctypeChanger does.
Update (18/11/01): it seems the XML gods have heard our cries, and there is an enhanced SAX EntityResolver in the works, which will let one set a DTD even if no DOCTYPE declaration is present. The relevant interface is org.xml.sax.ext.EntityResolver2.
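As a rough illustration of that interface (not part of DoctypeChanger), here is a minimal sketch of a resolver that supplies a DTD for documents lacking a DOCTYPE declaration. The class name DefaultDtdResolver and the idea of holding the DTD as a string are assumptions made for the example; org.xml.sax.ext.DefaultHandler2 and its getExternalSubset method are the standard SAX2 extension API.

```java
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

/**
 * Hypothetical resolver: offers a fixed DTD as the external subset
 * for any document that does not declare one itself.
 */
public class DefaultDtdResolver extends DefaultHandler2 {
    private final String dtdText;

    public DefaultDtdResolver(String dtdText) {
        this.dtdText = dtdText;
    }

    // A parser that supports EntityResolver2 calls this when the
    // document has no DOCTYPE declaration (or no external subset).
    @Override
    public InputSource getExternalSubset(String name, String baseURI) {
        return new InputSource(new StringReader(dtdText));
    }
}
```

A resolver like this would normally be registered with XMLReader.setEntityResolver; whether getExternalSubset is actually consulted depends on the parser supporting the EntityResolver2 extension.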
With DoctypeChanger, you can do just about any manipulation of a DOCTYPE declaration. Add it, remove it, replace bits of it.. anything :). You are given access to the old DOCTYPE declaration, so it's possible to express conditional replacement, like "if the old public id is -//OASIS//DTD DocBook XML V4.1.2//EN, then make the system id /usr/share/sgml/docbook/xml-dtd-4.1.2/docbookx.dtd".
It works like this: you implement a DoctypeGenerator, which determines the new DOCTYPE declaration. Your generator is passed the old DOCTYPE declaration, so the new declaration can depend on the old one. Typically this is done with an anonymous inner class, as with Swing event handlers.
To remove the DOCTYPE declaration, always return a null Doctype object.
InputStream in = ... // get your XML InputStream
DoctypeChangerStream changer = new DoctypeChangerStream(in);
changer.setGenerator(
    new DoctypeGenerator() {
        public Doctype generate(Doctype old) {
            return null;
        }
    }
);
// .. and pass it on to the parser.
To add a DOCTYPE declaration (or replace one unconditionally), ignore the old doctype declaration and just return a new one.
InputStream in = ... // get your XML InputStream
DoctypeChangerStream changer = new DoctypeChangerStream(in);
changer.setGenerator(
    new DoctypeGenerator() {
        public Doctype generate(Doctype old) {
            return new DoctypeImpl("rootElement", "pubId", "sysId", "internalSubset");
        }
    }
);
// .. and pass it on to the parser.
To replace only the system id, "pass through" all parts of the old doctype declaration except the system id, which we replace.
InputStream in = ... // get your XML InputStream
DoctypeChangerStream changer = new DoctypeChangerStream(in);
changer.setGenerator(
    new DoctypeGenerator() {
        public Doctype generate(Doctype old) {
            return new DoctypeImpl(
                old.getRootElement(),
                old.getPublicId(),
                "/home/jeff/dtds/mydtd.dtd",
                old.getInternalSubset()
            );
        }
    }
);
// .. and pass it on to the parser.
This is our earlier example: "if the old public id is -//OASIS//DTD DocBook XML V4.1.2//EN, then make the system id /usr/share/sgml/docbook/xml-dtd-4.1.2/docbookx.dtd. Otherwise, strip out the doctype declaration."
InputStream in = ... // get your XML InputStream
DoctypeChangerStream changer = new DoctypeChangerStream(in);
changer.setGenerator(
    new DoctypeGenerator() {
        public Doctype generate(Doctype old) {
            // Compare defensively: old, or its public id, is assumed
            // to be possibly null when the document declares none.
            if (old != null && "-//OASIS//DTD DocBook XML V4.1.2//EN".equals(old.getPublicId())) {
                return new DoctypeImpl(
                    old.getRootElement(),
                    old.getPublicId(),
                    "/usr/share/sgml/docbook/xml-dtd-4.1.2/docbookx.dtd",
                    old.getInternalSubset()
                );
            } else return null;
        }
    }
);
// .. and pass it on to the parser.
I hope you get the picture :)
To get you going, here is a bit of standard JAXP code which demonstrates the typical DoctypeChanger usage:
import net.socialchange.doctype.DoctypeChangerStream;
import net.socialchange.doctype.DoctypeGenerator;
import net.socialchange.doctype.Doctype;
import net.socialchange.doctype.DoctypeImpl;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.SAXException;
import java.io.InputStream;
import java.io.IOException;
import java.io.FileInputStream;
import java.net.URL;

public class Parse {

    /** Opens the given URL and wraps the stream in a DoctypeChangerStream. */
    public static InputStream getXMLStream(String uri) throws IOException {
        InputStream in = new URL(uri).openStream();
        DoctypeChangerStream changer = new DoctypeChangerStream(in);
        changer.setGenerator(
            new DoctypeGenerator() {
                public Doctype generate(Doctype old) {
                    // Compare defensively: old, or its public id, is assumed
                    // to be possibly null when the document declares none.
                    if (old != null && "-//OASIS//DTD DocBook XML V4.1.2//EN".equals(old.getPublicId())) {
                        return new DoctypeImpl(
                            old.getRootElement(),
                            old.getPublicId(),
                            "/usr/share/sgml/docbook/xml-dtd-4.1.2/docbookx.dtd",
                            null
                        );
                    } else return null;
                }
            }
        );
        return changer;
    }

    public static void main(String args[]) throws Exception {
        if (args.length != 1) {
            // The argument is handed to java.net.URL, so it must be a URL
            // (e.g. file:///path/to/doc.xml), not a bare file name.
            System.err.println("Usage: java "+Parse.class.getName()+" <URL of valid XML file>");
            return;
        }
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // always do this. See http://xml.apache.org/~edwingo/jaxp-faq.html
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse( getXMLStream(args[0]) );
        DocumentType doctype = doc.getDoctype();
        if (doctype != null) {
            System.out.println("Doctype:");
            System.out.println("\tName: "+doctype.getName());
            System.out.println("\tPublic Id: "+doctype.getPublicId());
            System.out.println("\tSystem Id: "+doctype.getSystemId());
            System.out.println("\tInternal Subset: "+doctype.getInternalSubset());
        }
    }
}
DoctypeChanger works by parsing the first part of the stream, byte by byte. One byte is assumed to encode one character, and the encoding is assumed to be ASCII-compatible. This is not true for multi-byte encodings, specifically UTF-16 and UCS-4. So please don't use DoctypeChanger where you suspect multi-byte encoding may be used.
This can't be "fixed" either, since the XML may have come with a MIME type specifying the encoding. In this case, the MIME type charset parameter takes precedence over anything in the XML. Since DoctypeChanger has no access to this info, we can never reliably figure out the encoding.
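Even though the encoding can't be determined reliably, obvious UTF-16 or UCS-4 input can at least be caught before it reaches DoctypeChanger. The following is a minimal sketch, not part of DoctypeChanger: the class and method names are made up for the example. It relies on the fact that an ASCII-compatible XML document never starts with a zero byte or a UTF-16 byte order mark, whereas UTF-16 and UCS-4 documents do (this is the autodetection idea described in Appendix F of the XML 1.0 specification).

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: peeks at a stream to spot UTF-16 / UCS-4 input. */
public class EncodingGuard {

    /**
     * Returns true if the first bytes look like UTF-16 or UCS-4.
     * The stream must support mark/reset (e.g. a BufferedInputStream);
     * it is reset afterwards, so no bytes are consumed.
     */
    public static boolean looksMultiByte(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(4);
        byte[] b = new byte[4];
        int n = 0;
        while (n < b.length) {
            int r = in.read(b, n, b.length - n);
            if (r < 0) break; // end of stream
            n += r;
        }
        in.reset();
        if (n < 2) return false; // too short to tell

        int b0 = b[0] & 0xFF;
        int b1 = b[1] & 0xFF;
        // UTF-16 byte order marks (FF FE also begins the UCS-4 LE mark).
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
            return true;
        }
        // Without a byte order mark, '<' in UTF-16 or UCS-4 is padded
        // with zero bytes, which never occur in ASCII-compatible XML.
        return b0 == 0x00 || b1 == 0x00;
    }
}
```

If the check returns true, the stream could be handed to the parser unmodified instead of being wrapped. This is only a heuristic on the bytes themselves; it does nothing about an encoding declared in a MIME type.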
And if you didn't understand a word of the above, you probably don't have to worry. 95% of XML out there is plain old UTF-8 or ISO-8859-1, and will work fine with DoctypeChanger. If you want to impress your friends with your knowledge of Unicode encoding, read UTF-8 and Unicode FAQ for Unix/Linux. It's
FilterInputStream vs FilterReader
The original code implemented a filter for both byte streams and character streams. I have limited myself to byte streams. If anyone wants to port this to char streams, it should be very easy (literally a few lines), but porting the unit tests won't be as fun. Here is some justification for byte streams, culled from the original source:
/*
Comments from Nigel Whitaker regarding using a Stream rather than a Reader:

My main comment is that I think extending a FilterInputStream would be a
better choice than a FilterReader. Let me try and explain my (and
xml-dev's) reasoning/research:

I am trying to use DoctypeChanger as a direct filter on the input to SAX
(well SAX2 as in Xerces' XMLReader). My original code created an
org.xml.sax.InputSource and fed this into the parse method. I then tried
to add your Filter and started getting validation exceptions, even though
I ensured your code was correctly adding a Doctype.

The problem arises from the use of a Reader in this chain:

  java.io.File -> java.io.FileInputStream -> java.io.InputStreamReader
    -> DoctypeChanger -> org.xml.sax.InputSource -> parser.parse()

My original working/validating code (with the Doctype added by hand) was:

  java.io.File -> java.io.FileInputStream -> org.xml.sax.InputSource
    -> parser.parse()

Although the use of Readers and Writers is the preferred way of doing Java
IO since Java 1.1, it causes some problems with SAX parsing. It seems that
SAX prefers to handle InputStreams of raw bytes because otherwise (with
java.lang.Character based Readers) it cannot properly handle the encoding
conversion issues of the input file.

I did a bit of research and found that this issue was previously covered
in a July 1999 xml-dev discussion entitled "encoding problem fixed". It
appears that depending on the platform-specific JVM Readers may default to
different encodings. See this thread:

  http://lists.xml.org/archives/xml-dev/199907/msg00413.html

David Megginson also issued an "Important SAX Guideline" in this message:

  http://lists.xml.org/archives/xml-dev/199907/msg00414.html

  Always use an InputStream in preference to a Reader when you don't
  know the XML document's character encoding in advance.

Users who use DoctypeChanger to write a temporary or intermediate file may
not encounter this problem, only those who apply the Filter as part of the
parsing, as I was doing. Following the guideline seems to have solved my
problems. The conversion to a FilterInputStream was a fairly simple change
to your code.
*/
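To make the guideline concrete, here is a small self-contained sketch using only standard JAXP classes (the class and method names are invented for the example). The parser is given raw bytes through an InputSource, so it can read the encoding declaration itself instead of relying on whatever default a Reader would have picked.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical demo: parse raw bytes and let the parser pick the encoding. */
public class ByteStreamParse {

    /** Parses the document from bytes and returns the root element's text. */
    public static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // A byte stream, not a Reader: the parser decodes the bytes using
        // the encoding named in the XML declaration.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        Document doc = db.parse(source);
        return doc.getDocumentElement().getTextContent();
    }
}
```

Wrapping the same bytes in an InputStreamReader without naming the charset would decode them with the platform default, which is the kind of mismatch described in the comment above.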
The following section has little to do with DoctypeChanger, but falls in the same problem domain. If you're using DoctypeChanger, it's probably because you want to parse XML from dubious sources. In this situation, there are a few other tips and code snippets you may find useful.
The best way to validate against a local DTD in Java is to retrieve the DTD as a resource. A resource is anything loaded through the Java classloader mechanism, with methods like Class.getResourceAsStream. This is more portable than using java.io.File as a data source, because you can retrieve your DTD from inside a jar, inside unpackaged webapps, etc.
The way to make the parser retrieve the DTD from a resource is to implement an org.xml.sax.EntityResolver. The parser will then call the resolveEntity method when it wishes to retrieve the DTD, and this is where you can return a DTD loaded as a resource.
Here is an EntityResolver class which, when it gets a request for an entity with a specific public ID, will return bytes from a specific resource.
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
/**
 * This SAX EntityResolver allows us to validate against local DTDs.
 *
 * Usage:
 *
 *   public static final String DSML_DTD_PUBLIC_ID = "http://www.dsml.org/DSML";
 *   public static final String DSML_DTD_RESOURCE = "/net/socialchange/bob/dtds/dsml.dtd";
 *   ..
 *   SAXBuilder builder = new SAXBuilder(true);
 *   builder.setEntityResolver(new LocalEntityResolver(DSML_DTD_PUBLIC_ID, DSML_DTD_RESOURCE));
 *   ..
 *
 * Stolen from jakarta tomcat 3.2.2, share/org/apache/jasper/compiler/JspUtil.java
 */
public class LocalEntityResolver implements EntityResolver {

    String dtdId;
    String dtdResource;

    public LocalEntityResolver(String id, String resource) {
        this.dtdId = id;
        this.dtdResource = resource;
    }

    public InputSource resolveEntity(String publicId, String systemId)
        throws SAXException, IOException
    {
        //System.out.println ("publicId = " + publicId);
        //System.out.println ("systemId is " + systemId);
        //System.out.println ("resource is " + dtdResource);
        if (publicId == null) { return null; }
        if (publicId.equals(dtdId)) {
            InputStream input =
                this.getClass().getResourceAsStream(dtdResource);
            InputSource isrc =
                new InputSource(input);
            return isrc;
        }
        else {
            //System.out.println ("returning null");
            return null;
        }
    }
}
This can be used in conjunction with DoctypeChanger. You would typically configure DoctypeChanger to change the public id to a preset value. Then configure LocalEntityResolver to return a DTD for that preset value, and you're set. All incoming XML will be validated against your local DTD.
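The resolver half of that setup can be tried out with plain JAXP. The sketch below is only an illustration under stated assumptions: the public id -//EXAMPLE//DTD Demo//EN is a made-up "preset value", and the DTD is held in a string rather than loaded as a resource so that the example is self-contained. In real use the document would first pass through DoctypeChanger, and the resolver would be a LocalEntityResolver.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: validate against a locally supplied DTD. */
public class LocalDtdDemo {
    // Made-up preset public id and a stand-in for the DTD resource.
    static final String PUBLIC_ID = "-//EXAMPLE//DTD Demo//EN";
    static final String DTD_TEXT = "<!ELEMENT a EMPTY>";

    /** Parses with validation on and returns any validation errors. */
    public static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();

        // Serve the DTD locally when the preset public id is requested;
        // returning null lets the parser fall back to the system id.
        db.setEntityResolver((publicId, systemId) ->
            PUBLIC_ID.equals(publicId)
                ? new InputSource(new StringReader(DTD_TEXT))
                : null);

        // Collect validation errors instead of printing them.
        final List<String> errors = new ArrayList<>();
        db.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage());
            }
        });

        db.parse(new InputSource(new StringReader(xml)));
        return errors;
    }
}
```

An empty list means the document validated against the local DTD; the system id in the document is never fetched as long as the public id matches.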
Entity catalogs are wonderful things from SGML land, which are still relatively rare in XML applications. If you're not familiar with them, read Norm Walsh's article, which also introduces some code for catalog-based entity resolution. You should see some parallels with what DoctypeChanger tries to achieve. Like DoctypeChanger, catalogs allow you to remap from one DOCTYPE declaration to another.
Despite some overlap, the two tools (DoctypeChanger and catalog resolution code) are really aimed at solving different problems. DoctypeChanger is more flexible in what it can do to DOCTYPE declarations, but it is meant for once-off parsing jobs; it is not a generalized entity resolution system like catalogs.
Currently (2002-04-15), you can download the public domain Arbortext catalog software mentioned in the article, here, or you can download the much-enhanced Sun code from http://www.sun.com/xml/developers/resolver/.
Any comments, bug reports, enhancements and the like welcome. Email Jeff Turner at jeff_turner@users.sourceforge.net.
Then the commented-out doctype would be used/modified instead of the correct one.
Thanks to José María Fernández González for reporting this.
Also changed from Mozilla license to a less restrictive/complex Apache derivative (consent obtained from the original author).
Copyright (c) 1999-2001 by Simon St.Laurent and Jeff Turner. All Rights Reserved.
This code is licensed under a modified Apache Software License 1.1 without the advertising clause.
And the usual disclaimer..
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the LICENSE.txt for more details.