Difficulty: ★☆☆ (easy)
Keywords: chunk, chunking

Problem

You want to split your result HTML into different files, all correctly linked.

Solution

The DocBook XSL stylesheets use the term chunking for splitting the result into several HTML files. A chunk is therefore a single HTML file. Use the chunk.xsl stylesheet, which is available for all HTML variants. Usually this is enough to chunk your result.

To influence the chunking process, use the following parameters:

base.dir

Sets the output directory for all chunks. If not set, the output directory is system dependent. Usually it is the directory from which you ran your XSLT processor.

chunk.section.depth

Sets the depth to which sections should be chunked. Default is 1.

chunk.first.sections

Controls whether the first top-level sect1 or section element of a component is chunked. If non-zero, a separate file (“chunk”) is created for it; otherwise the section is included in the file of its component. Default is 0 (= zero, no separate chunk is created).

use.id.as.filename

Controls the filename of the chunked element. If non-zero, the filename is derived from the ID of the element. If zero, the filename is generated and numbered according to its position. Default is 0 (= zero, do not use IDs for file names).
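
One way to set these parameters is a small customization layer that imports chunk.xsl. The following is a minimal sketch, not the only approach. The import URI is an assumption: it depends on your installation and HTML variant, and is normally resolved through an XML catalog or replaced with a local path. The value html/ for base.dir is only illustrative; the other three values are the defaults described above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Assumed location of the chunking stylesheet; adjust to your installation -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/chunk.xsl"/>

  <!-- Write all chunks below html/ (illustrative value) -->
  <xsl:param name="base.dir">html/</xsl:param>

  <!-- Chunk only level-one sections (default) -->
  <xsl:param name="chunk.section.depth" select="1"/>

  <!-- Keep the first section in the file of its component (default) -->
  <xsl:param name="chunk.first.sections" select="0"/>

  <!-- Use generated, numbered file names instead of IDs (default) -->
  <xsl:param name="use.id.as.filename" select="0"/>

</xsl:stylesheet>
```

Pass this file to your XSLT processor in place of chunk.xsl. Most processors also accept parameters on the command line, for example through the --stringparam option of xsltproc.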

Discussion

To better understand what the stylesheet creates, let's assume the following book structure:

Example 5.4. Book Structure With Components and Sections
book
  preface
  chapter
    sect1
    sect1
  appendix
    sect1
    sect1

The following subsections show how to influence the output.

Knowing the Default Behaviour

Using the chunk.xsl stylesheet with xsltproc, saxon, or any other XSLT processor leads to the following file names:

Example 5.5. Result of Chunking Without Parameters (Default)
# No parameters set, default behaviour
book         --> index.html
  preface    --> pr01.html
  chapter    --> ch01.html
    sect1
    sect1    --> ch01s02.html
  appendix   --> apa.html
    sect1
    sect1    --> apas02.html

As you can see, the file name consists of several components:

  • An abbreviation of the chunked element. Each chunked element is assigned an abbreviation, one or two characters long. The available abbreviations are shown in Table 5.1.

    Table 5.1. Abbreviations for Chunked Elements
    Abbreviation  Element
    ap            appendix
    ar            article
    bi            bibliography
    bk            book
    ch            chapter
    co            colophon
    go            glossary
    ix            index
    pr            preface
    pt            part
    re            refentry
    rn            reference
    s             section
    se            set
    si            setindex
    to            topic
  • A consecutive number. Each chunked component gets a number. For example, the first chapter has ch01, the second chapter ch02, and so on. Appendices are lettered instead, as apa.html in Example 5.5 shows.

  • Additional subcomponents. If a component has subcomponents such as sections, each subcomponent's abbreviation and number are appended to the file name. For example, the second section in the first chapter gets the file name ch01s02.html.

Writing to a Directory

If you want to have your files in a specific directory, set the parameter base.dir to your preferred value, for example:

# base.dir=html/
book         --> html/index.html
  preface    --> html/pr01.html
  chapter    --> html/ch01.html
    sect1
    sect1    --> html/ch01s02.html
  appendix   --> html/apa.html
    sect1
    sect1    --> html/apas02.html

Chunking the First Section

Example 5.5 showed that the first section is not chunked. This is the default behavior. However, if you also want the first section written to a separate file, set the chunk.first.sections parameter to 1 to get the following result:

book         --> index.html
  preface    --> pr01.html
  chapter    --> ch01.html
    sect1    --> ch01s01.html
    sect1    --> ch01s02.html
  appendix   --> apa.html
    sect1    --> apas01.html
    sect1    --> apas02.html

Influencing the Chunking Depth

If we have very deeply nested structures with sections and subsections, we may want to chunk these as well. As an example, let's assume the following chapter with these subsections:

chapter
  sect1
    sect2
      sect3
        sect4
    sect2
  sect1

By default, only sect1 elements are written to separate files. Anything below a sect1, such as sect2 or sect3, ends up in the same file as its sect1 element.

To control the chunking process for sections, use the parameter chunk.section.depth. By default, it is set to 1, which means that only level-one sections are chunked. Setting chunk.section.depth to 2 (together with chunk.first.sections=1, so that first sections get their own files too) has the following effect:

# chunk.section.depth=2, chunk.first.sections=1
chapter         --> ch01.html
  sect1         --> ch01s01.html
    sect2       --> ch01s01s01.html
      sect3
        sect4
    sect2       --> ch01s01s02.html
  sect1         --> ch01s02.html

As you can see, with a value of 2, level-two sections (sect2) are also written to separate files. A value of 3 has the following effect:

# chunk.section.depth=3, chunk.first.sections=1
chapter         --> ch01.html
  sect1         --> ch01s01.html
    sect2       --> ch01s01s01.html
      sect3     --> ch01s01s01s01.html
        sect4
    sect2       --> ch01s01s02.html
  sect1         --> ch01s02.html

In other words, chunk.section.depth sets the deepest section level that still gets its own file.
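
The combination used in the two listings above could be written as a customization layer like the following sketch. The import URI is an assumption that depends on your installation and HTML variant; only the two parameter names and values come from the listings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Assumed location of the chunking stylesheet; adjust to your installation -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/chunk.xsl"/>

  <!-- Chunk sections down to level two (sect2); use 3 to include sect3 -->
  <xsl:param name="chunk.section.depth" select="2"/>

  <!-- Give the first section on each chunked level its own file as well -->
  <xsl:param name="chunk.first.sections" select="1"/>

</xsl:stylesheet>
```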

Create Stable File Names through IDs

The previous sections used predictable file names, but not stable ones. If you add or remove a section or chapter, the numbering of the chapters and sections changes, and so do the file names. If you want to share a link, this naming scheme is of little use because the link can break.

Stable file names are not affected when you restructure your document. If you add or remove a structural element, the file names will still be the same.

To create such stable file names, set the parameter use.id.as.filename to a non-zero value. The file name is then derived from the xml:id attribute of your component. However, keep the following issues in mind when you use this naming scheme:

  • Validate your document before you transform it.  Validation reveals problems with IDs, such as duplicate, missing, or syntactically wrong IDs. This is very useful because the DocBook XSL stylesheets do not check for file names that occur twice, so one file could overwrite another.

  • Assign IDs to your components.  Components without an ID fall back to the default naming scheme.

  • Use “speaking” IDs.  Some tools can generate IDs automatically, which could lead to something like y8w739zya. Such IDs are impossible to memorize and give no hint about the content. Replace them with meaningful, easy-to-remember names. This also benefits your file names.

  • Avoid unusual characters in your IDs.  Although it may be tempting to use umlauts, diacritics, or other Unicode characters, it is recommended to stay within the ASCII character set. Depending on the file system, the tools you use, or the operating system, Unicode characters might not be fully supported, which could lead to wrong file names.

  • Structure your IDs consistently.  It is easier to find an HTML file if it is named consistently. For example, if you have an introductory chapter, you could set its ID to intro. Any section inside this chapter would use that ID as a prefix and append its own name. A section that gives an overview could have the ID intro.overview. This helps when you search for a specific HTML file.

Let's amend Example 5.4, “Book Structure With Components and Sections” with IDs:

Example 5.6. Book Structure with IDs
book xml:id="book"
  preface xml:id="preface"
  chapter xml:id="intro"
    sect1 xml:id="intro.concept"
    sect1 xml:id="intro.requirements"
  appendix xml:id="app.overview"
    sect1 xml:id="app.overview.method-a"
    sect1 xml:id="app.overview.method-b"
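
As actual markup, this structure could look like the following sketch. It assumes DocBook 5, which the xml:id attributes suggest. All titles and paragraph texts are placeholders invented for illustration; only the element names and IDs come from the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- DocBook 5 assumed; titles and paragraphs are placeholders -->
<book xmlns="http://docbook.org/ns/docbook" version="5.0" xml:id="book">
  <title>Example Book</title>
  <preface xml:id="preface">
    <title>Preface</title>
    <para>Placeholder text.</para>
  </preface>
  <chapter xml:id="intro">
    <title>Introduction</title>
    <sect1 xml:id="intro.concept">
      <title>Concept</title>
      <para>Placeholder text.</para>
    </sect1>
    <sect1 xml:id="intro.requirements">
      <title>Requirements</title>
      <para>Placeholder text.</para>
    </sect1>
  </chapter>
  <appendix xml:id="app.overview">
    <title>Overview</title>
    <sect1 xml:id="app.overview.method-a">
      <title>Method A</title>
      <para>Placeholder text.</para>
    </sect1>
    <sect1 xml:id="app.overview.method-b">
      <title>Method B</title>
      <para>Placeholder text.</para>
    </sect1>
  </appendix>
</book>
```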

Transforming it through chunk.xsl leads to the following result:

# use.id.as.filename=1, chunk.first.sections=1
book         --> index.html
  preface    --> preface.html
  chapter    --> intro.html
    sect1    --> intro.concept.html
    sect1    --> intro.requirements.html
  appendix   --> app.overview.html
    sect1    --> app.overview.method-a.html
    sect1    --> app.overview.method-b.html

The file name of the book may be surprising: its basename defaults to index, even though the book has an ID. If you want to change that too, set the parameter root.filename to your preferred value (without a file extension).

Of course, you can combine it with all the other parameters which are explained in this topic.
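
As an illustration of such a combination, the following customization layer sketches ID-based file names together with a custom root file name and an output directory. The import URI is an assumption that depends on your installation and HTML variant, and the values start and html/ are hypothetical examples.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Assumed location of the chunking stylesheet; adjust to your installation -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/chunk.xsl"/>

  <!-- Derive file names from the IDs of the chunked elements -->
  <xsl:param name="use.id.as.filename" select="1"/>

  <!-- Write the first section of each component to its own file -->
  <xsl:param name="chunk.first.sections" select="1"/>

  <!-- Basename of the root chunk, without extension (hypothetical value) -->
  <xsl:param name="root.filename">start</xsl:param>

  <!-- Output directory for all chunks (hypothetical value) -->
  <xsl:param name="base.dir">html/</xsl:param>

</xsl:stylesheet>
```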
