You want to process your XML file, but you do not want to expand any of the defined entities.
Replace all your entities with a string that is easy to search for. For example, you can replace the entity &product; with the string [[[product]]] (assuming that the string [[[product]]] doesn't occur anywhere else in your document).
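The replacement idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the code of ents2text.py; the function names `protect` and `unprotect` are hypothetical, and leaving the five predefined XML entities untouched is an assumption of this sketch.

```python
import re

# Predefined XML entities; this sketch assumes they should stay untouched.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches named entity references such as &product;
# (character references like &#160; do not match).
ENTITY_RE = re.compile(r"&([A-Za-z_][\w.-]*);")

# Matches the protected notation, for example [[[product]]]
PROTECTED_RE = re.compile(r"\[\[\[([A-Za-z_][\w.-]*)\]\]\]")


def protect(text):
    """Replace custom entity references with the [[[name]]] notation."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return "[[[" + name + "]]]"
    return ENTITY_RE.sub(repl, text)


def unprotect(text):
    """Turn the [[[name]]] notation back into entity references."""
    return PROTECTED_RE.sub(r"&\1;", text)
```

A round trip through both functions should return the original text, provided the [[[...]]] notation does not already occur in the document.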
Before you can proceed, you need to prepare your system as described in Procedure 3.3, “Preparing the Workflow”.
Make sure you have Python 3 installed on your system. Find installation instructions on the project's home page at https://www.python.org/. The script works with Python 3.4 and above.
On Linux or macOS, Python may already be installed. If it is not, install it using the system's package manager.
Download the Python script ents2text.py.
Save the script where it can be found by the system. Linux users can store the script in either ~/bin or /usr/bin. In this procedure, we use ~/bin. Make the script executable with:
$ chmod +x ~/bin/ents2text.py
Create a link:
$ ln -rs ~/bin/ents2text.py ~/bin/text2ents.py
The link is required for converting from entities to text and vice versa: the script uses the name it was invoked with to determine the conversion direction.
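How a script can pick its behavior from the name it was called with can be sketched as follows. This is an assumption about the general technique, not the actual code of ents2text.py; the function name `direction_from_name` is hypothetical.

```python
import os


def direction_from_name(argv0):
    """Return 'ents2text' or 'text2ents' based on the invoked script name."""
    # Strip the directory and the .py extension from the invoked path.
    name = os.path.splitext(os.path.basename(argv0))[0]
    if name in ("ents2text", "text2ents"):
        return name
    raise ValueError("unknown script name: " + name)


# In a real script this would typically be called as:
#   direction = direction_from_name(sys.argv[0])
```

Because the link points to the same file, both names run identical code and only the detected direction differs.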
Once you have done all the preparations, proceed with Procedure 3.4, “Workflow for Protecting Custom Entities”.
Convert all entities to text with the following command:
$ ents2text.py XMLFILE1 XMLFILE2...
The script converts all entities to their protected notation in place and creates backup files with the extension .bak.
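In-place conversion with a .bak backup can be sketched with the `fileinput` module from the Python standard library. Whether ents2text.py works this way is not stated here, so treat this as one possible approach; the function name `protect_files` is hypothetical, and skipping the predefined XML entities is an assumption of the sketch.

```python
import fileinput
import re

# Named entity references, excluding the five predefined XML entities.
ENTITY_RE = re.compile(r"&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_][\w.-]*);")


def protect_files(paths):
    """Rewrite each file in place, keeping the original as <name>.bak."""
    if not paths:
        return  # an empty list would make fileinput read from stdin
    with fileinput.input(files=paths, inplace=True, backup=".bak") as lines:
        for line in lines:
            # With inplace=True, print() writes into the file being rewritten.
            print(ENTITY_RE.sub(r"[[[\1]]]", line), end="")
```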
Process your XML file with your XML tools.
Revert the protected notation to the original entity notation:
$ text2ents.py XMLFILE1 XMLFILE2...
Not expanding entities looks like an abnormal use case as XML parsers are supposed to expand entities by default. However, expanding entities has one disadvantage: after the XML parser has resolved entities, the content is indistinguishable from other content. In other words, you cannot distinguish content that comes from an entity definition from existing content. You cannot “go back” and revert this process once all entities are expanded.
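The loss of information can be illustrated with the XML parser from the Python standard library. The entity value "Example Product" and the function name `expand` are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# A small document with one internal entity; the same text also
# appears literally in the paragraph.
DOC = """<!DOCTYPE para [
  <!ENTITY product "Example Product">
]>
<para>&product; and Example Product</para>"""


def expand(doc):
    """Parse the document and serialize it again after entity expansion."""
    return ET.tostring(ET.fromstring(doc), encoding="unicode")


# After parsing, both occurrences are plain text:
# <para>Example Product and Example Product</para>
print(expand(DOC))
```

Nothing in the output indicates which of the two occurrences came from the entity.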
For this reason, preserving entities can be useful when you want to process XML, but keep the defined entities untouched. Procedure 3.4 showed one solution. However, be aware of certain issues that might cause problems with XML files:
The script is not XML-aware. It reads the XML file line by line and applies regular-expression replacements to each line.
The script can replace entities in situations where it is not desirable. For example, if you have an entity in the href attribute of an xi:include element, the reference will no longer work when XIncludes are resolved.
The script doesn't handle external entities. External entities refer to external storage objects, for example:
<!DOCTYPE book [
  <!ENTITY intro SYSTEM "chap-intro.xml">
]>
<book>
  <title>...</title>
  &intro;
</book>
Such external entities usually refer to whole XML structures like chapters or sections. Replacing such structures with ordinary text would lead to syntax errors in your XML files. However, such entities are not used very often and they should be replaced by XIncludes (see Section 1.3, “Modularize Your Document with XIncludes”).
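Before converting a file, it can help to check whether it declares external entities at all. The following sketch looks for such declarations with a regular expression. It is not part of ents2text.py, the function name `external_entity_names` is hypothetical, and like the script itself it is not XML-aware, so it only inspects the text it is given.

```python
import re

# Matches general entity declarations that use SYSTEM or PUBLIC,
# for example: <!ENTITY intro SYSTEM "chap-intro.xml">
EXTERNAL_RE = re.compile(
    r"<!ENTITY\s+([A-Za-z_][\w.-]*)\s+(?:SYSTEM|PUBLIC)\b"
)


def external_entity_names(xml_text):
    """Return the names of external entities declared in the given text."""
    return set(EXTERNAL_RE.findall(xml_text))
```

If the returned set is not empty, consider replacing those entities with XIncludes before running the conversion.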
If you prefer a solution that can handle XML, use the Python script ents2pi.py. This script can parse XML and replaces each entity with a processing instruction. For example, the entity &product; is converted to the PI <?entity product?>. That makes it easier to process it with XSLT or any other XML-aware tool.
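Once entities are stored as processing instructions, XML-aware tools can find them without string matching. The following sketch uses only the Python standard library (Python 3.8 or later for `insert_pis`) to collect the entity names from such PIs. It assumes the PI target is `entity`, as in the example above; the function name `entity_names_from_pis` is hypothetical, and the sketch does not reproduce what ents2pi.py does internally.

```python
import xml.etree.ElementTree as ET


def entity_names_from_pis(xml_text):
    """Collect the names stored in <?entity NAME?> processing instructions."""
    # insert_pis=True keeps processing instructions in the parsed tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
    root = ET.fromstring(xml_text, parser=parser)
    names = []
    for node in root.iter():
        if node.tag is ET.ProcessingInstruction:
            # ElementTree stores a PI as the text "target data".
            target, _, data = node.text.partition(" ")
            if target == "entity":
                names.append(data.strip())
    return names
```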
ents2text.py and ents2pi.py
 | ents2text.py | ents2pi.py |
---|---|---|
Conversion | Entity to Text | Entity to PI |
XML-aware | No | Yes |
Easy to process further | More difficult | Easy |
Preserving source code | Always | Usually |
Dependencies | Only Python standard library | lxml[a] |
 | Project@GitHub | Issue#8 |