Difficulty: ★☆☆ (easy)
Keywords: preserving entities, entities, placeholders

Problem

You want to process your XML file, but you do not want to expand any of the defined entities.

Solution

Replace all your entities with a placeholder string that is easy to search for. For example, you can replace the entity &product; with the string [[[product]]] (assuming that the string [[[product]]] doesn't occur anywhere else in your document).
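
The replacement idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the code of ents2text.py. The function names, the regular expressions, and the decision to leave the five predefined XML entities and character references untouched are assumptions made for this example.

```python
import re

# Assumption: an entity reference has the form &name; where name starts with
# a letter or underscore. Character references such as &#160; do not match.
ENTITY = re.compile(r"&([A-Za-z_][\w.-]*);")
PLACEHOLDER = re.compile(r"\[\[\[([A-Za-z_][\w.-]*)\]\]\]")

# Assumption: the predefined XML entities are kept as they are, because
# every XML parser resolves them without a DTD.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def ents_to_text(line):
    """Replace &name; with [[[name]]] in a single line of text."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return "[[[{}]]]".format(name)
    return ENTITY.sub(repl, line)


def text_to_ents(line):
    """Replace [[[name]]] with &name; in a single line of text."""
    return PLACEHOLDER.sub(r"&\1;", line)
```

Applying text_to_ents() to the result of ents_to_text() returns the original line, as long as the placeholder notation does not already occur in the document.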

Before you can proceed, you need to prepare your system as described in Procedure 3.3, “Preparing the Workflow”.

Procedure 3.3. Preparing the Workflow
  1. Make sure you have Python 3 installed on your system. Find installation instructions on the project's home page at https://www.python.org/. The script works with Python 3.4 and above.

    On Linux or macOS, Python may already be installed. If it is not, install it with the system's package manager.

  2. Download the Python script ents2text.py.

  3. Save the script in a directory where the system can find it, that is, a directory listed in your PATH. Linux users can store the script in either ~/bin or /usr/bin. In this procedure, we use ~/bin. Make the script executable with:

    $ chmod +x ~/bin/ents2text.py

  4. Create a link:

    $ ln -rs ~/bin/ents2text.py ~/bin/text2ents.py

    The link is required for converting in both directions, from entities to text and back. The script uses the name under which it was called to determine the conversion direction.
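
A script can pick its behavior from the name it was called with by inspecting sys.argv[0]. The following is a minimal sketch of that technique. The function name direction_from_name() is hypothetical, and the real script may implement the check differently.

```python
import os


def direction_from_name(argv0):
    """Return 'ents2text' or 'text2ents' based on the invoked script name.

    When the script is started through the symbolic link, argv0 holds the
    name of the link, not the name of the file the link points to.
    """
    name = os.path.splitext(os.path.basename(argv0))[0]
    if name not in ("ents2text", "text2ents"):
        raise ValueError("unknown script name: {!r}".format(name))
    return name
```

In a script, the call would typically be direction_from_name(sys.argv[0]).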

Once you have done all the preparations, proceed with Procedure 3.4, “Workflow for Protecting Custom Entities”.

Procedure 3.4. Workflow for Protecting Custom Entities
  1. Convert all entities to text with the following command:

    $ ents2text.py XMLFILE1 XMLFILE2...

    The script converts all entities to their protected notation in place and creates backup files with the extension .bak.

  2. Process your XML file with your XML tools.

  3. Convert the protected notation back to the original entity notation:

    $ text2ents.py XMLFILE1 XMLFILE2...
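
In-place editing with a backup file can be sketched with the fileinput module from the Python standard library. This is an assumption about how such a step could be written, not the script's actual code. The function convert_in_place() and its convert parameter are hypothetical names.

```python
import fileinput


def convert_in_place(paths, convert):
    """Rewrite each file in paths line by line, keeping NAME.bak as a backup.

    convert is any function that takes a line and returns the new line.
    """
    if not paths:
        # Without this guard, fileinput would fall back to reading stdin.
        return
    with fileinput.input(files=paths, inplace=True, backup=".bak") as stream:
        for line in stream:
            # With inplace=True, print() writes to the file being rewritten.
            print(convert(line), end="")
```

Note that fileinput opens files with the default encoding of your locale, so files in another encoding need extra care.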

Discussion

Not expanding entities looks like an unusual use case, as XML parsers expand entities by default. However, expanding entities has one disadvantage: after the XML parser has resolved entities, the content is indistinguishable from other content. In other words, you cannot distinguish content that comes from an entity definition from existing content. You cannot “go back” and revert this process once all entities are expanded.

For this reason, preserving entities can be useful when you want to process XML, but keep the defined entities untouched. Procedure 3.4 showed one solution. However, be aware of certain issues that might cause problems with XML files:

  • The script does not know XML. It reads the XML file line by line and applies regular expressions to each line.

  • The script can replace entities in situations where it is not desirable. For example, if you have an entity in the href attribute of an xi:include element, such a reference no longer works when XIncludes are resolved.

  • The script doesn't handle external entities. External entities refer to external storage objects, for example:

    <!DOCTYPE book [
      <!ENTITY intro SYSTEM "chap-intro.xml">
    ]>
    <book>
      <title>...</title>
      &intro;
    </book>

    Such external entities usually refer to whole XML structures like chapters or sections. Replacing such structures with ordinary text would lead to syntax errors in your XML files. However, such entities are not used very often and they should be replaced by XIncludes (see Section 1.3, “Modularize Your Document with XIncludes”).
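
The XInclude issue from the list above is easy to reproduce. The sketch below applies a line-based replacement to an xi:include element and shows how the href is changed. The entity name basedir and the helper risky_lines() are made up for this example. The check assumes that the element uses the xi prefix and fits on one line.

```python
import re

ENTITY = re.compile(r"&([A-Za-z_][\w.-]*);")

# A line-based replacement does not know that this entity sits in an href.
line = '<xi:include href="&basedir;/chap-intro.xml"/>'
converted = ENTITY.sub(r"[[[\1]]]", line)
print(converted)
# <xi:include href="[[[basedir]]]/chap-intro.xml"/>

# Hypothetical pre-check: report the numbers of lines where an entity
# appears inside an xi:include start tag.
XINCLUDE_WITH_ENTITY = re.compile(r"<xi:include\b[^>]*&[A-Za-z_][\w.-]*;")


def risky_lines(lines):
    return [number for number, text in enumerate(lines, 1)
            if XINCLUDE_WITH_ENTITY.search(text)]
```

Running such a check before the conversion gives you a chance to resolve XIncludes first or to rewrite the affected references by hand.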

If you prefer a solution that can handle XML, use the Python script ents2pi.py. This script parses XML and replaces each entity with a processing instruction (PI). For example, the entity &product; is converted to the PI <?entity product?>. That makes it easier to process the result with XSLT or any other XML-aware tool.
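
To illustrate why a processing instruction is easy to handle for XML-aware tools, the following sketch collects the entity names from a converted document. It assumes PIs of the form <?entity name?> and uses xml.etree.ElementTree from the standard library instead of lxml. The insert_pis option requires Python 3.8 or later. The function name entity_pis() is hypothetical.

```python
import xml.etree.ElementTree as ET


def entity_pis(xml_text):
    """Return the names found in <?entity name?> PIs, in document order."""
    # By default, ElementTree drops PIs; insert_pis=True keeps them in the tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
    root = ET.fromstring(xml_text, parser=parser)
    names = []
    for node in root.iter():
        if node.tag is ET.ProcessingInstruction:
            # For a PI node, text holds the target and the data, separated
            # by a space, for example "entity product".
            target, _, data = node.text.partition(" ")
            if target == "entity":
                names.append(data.strip())
    return names
```

A parser sees each PI as a separate node, so no regular expressions are needed to tell it apart from the surrounding text.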

Table 3.4. Comparison Between ents2text.py and ents2pi.py

                              ents2text.py                  ents2pi.py
                              Entity to Text                Entity to PI
  XML-aware                   No                            Yes
  Easy to process it further  More difficult                Easy
  Preserving source code      Always                        Usually
  Dependencies                Only Python standard library  lxml
