Difficulty: ★☆☆ (easy)
Keywords: pretty-print, xmlformat

Problem

Your file, whether autogenerated or somehow “mangled”, is poorly indented and you want to fix the indentation.

Solution

There are different solutions to this problem:

  • a “pretty-print” stylesheet

  • the XML parser xmllint

  • the xmlformat command

XML Parser xmllint

The XML parser xmllint offers the --format option to turn on indentation for each element.

xmllint --format XMLFILE

The Pretty-Print Stylesheet

The simplest stylesheet for indentation is shown in Example 3.2, “pretty.xsl”. It relies on the copy.xsl stylesheet.

Example 3.2. pretty.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:d="http://docbook.org/ns/docbook">
  
  <xsl:import href="copy.xsl"/>
  <xsl:output indent="yes"/>

  <xsl:strip-space elements="*"/>
  <xsl:preserve-space elements="d:screen d:literallayout
    d:programlisting d:address"/>
</xsl:stylesheet>
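
The copy.xsl stylesheet that pretty.xsl imports is not reproduced in this section. As a minimal sketch, assuming it is nothing more than a plain identity transform (the real file may contain more), it could look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical copy.xsl: a plain identity transform.
     This is an assumption; the real file may differ. -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy every node and attribute unchanged, then recurse -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

With both files in the same directory, an XSLT 1.0 processor such as xsltproc should be able to apply the stylesheet with xsltproc pretty.xsl XMLFILE.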

Command xmlformat

The xmlformat tool is a Perl script which is available from the Web site http://www.kitebird.com/software/xmlformat/.

The tool distinguishes between block elements, inline elements, and verbatim elements (similar to DocBook). The types differ in how their whitespace is normalized.

Block elements typically begin with a new line and children are indented. Spacing before and after can be controlled too.

Inline elements occur in block elements. Normalization and line wrapping are applied relative to the enclosing block element.

Verbatim elements are not formatted at all. That is, the content of the output element is identical to the content of the input element, including all whitespace.
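
The three element types are assigned in an xmlformat configuration file. The fragment below is only a sketch: the element lists and the wrap length are illustrative assumptions, and the option names should be checked against the xmlformat documentation for your version.

```
# Hypothetical xmlformat configuration; element lists are examples only

# Block elements: start on a new line, text is normalized and wrapped
para
  format = block
  normalize = yes
  wrap-length = 72

# Inline elements: wrapped together with the enclosing block element
emphasis literal
  format = inline

# Verbatim elements: content is passed through untouched
screen programlisting literallayout
  format = verbatim
```

Assuming the fragment is saved as xmlformat.conf (a hypothetical file name), it would typically be passed with the --config-file option, for example xmlformat --config-file xmlformat.conf XMLFILE.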

Discussion

The xmllint command with its --format option is the easiest option but offers little customization. It is useful if you have no other tools at hand and a quick, rough reformatting is enough.

The pretty.xsl stylesheet is a pure XSLT solution. As such, it works on every platform for which an XSLT processor is available. It is adaptable to your needs, but mixed content (as in para) is problematic.

The most adaptable method is xmlformat, because the formatting of each element can be configured individually as block, inline, or verbatim.
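
To illustrate the mixed-content problem: xsl:strip-space removes text nodes that consist only of whitespace, so a single space between two inline elements can be lost. One possible adjustment, shown here as an untested sketch, is to add the affected elements to the preserve-space list in pretty.xsl. Which elements to add is an assumption that depends on your documents, and the indentation a processor then produces inside them may vary.

```xml
<!-- Input: the two inline elements are separated only by a
     whitespace-only text node, which strip-space elements="*" removes:

     <para><emphasis>very</emphasis> <literal>important</literal></para>
-->

<!-- Possible adjustment for pretty.xsl: also preserve whitespace in
     elements with mixed content. The list here is only an example. -->
<xsl:preserve-space elements="d:screen d:literallayout
    d:programlisting d:address d:para d:title"/>
```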

