You want to process your XML file, but you do not want to expand any of the defined entities.
Replace all your entities with a string that is easy to search for. For example, you can replace the entity &product; with the string [[[product]]] (assuming that the string [[[product]]] doesn't occur anywhere else in your document).
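The replacement idea can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the procedures below: the function names `protect_entities` and `restore_entities` are hypothetical, and skipping the five predefined XML entities (such as &amp;) is an assumption made here so that the protected file stays well-formed.

```python
import re

# Matches general entity references such as &product;. Character references
# (&#160; or &#xA0;) start with '#' and are therefore not matched.
ENTITY_RE = re.compile(r"&([A-Za-z_][\w.-]*);")
PLACEHOLDER_RE = re.compile(r"\[\[\[([A-Za-z_][\w.-]*)\]\]\]")

# Assumption: the predefined XML entities are left alone.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}


def protect_entities(text):
    """Replace &name; with [[[name]]] (hypothetical helper)."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return f"[[[{name}]]]"
    return ENTITY_RE.sub(repl, text)


def restore_entities(text):
    """Turn [[[name]]] back into &name; (hypothetical helper)."""
    return PLACEHOLDER_RE.sub(r"&\1;", text)
```

Calling `restore_entities(protect_entities(text))` should give back the original text, provided the placeholder string does not already occur in the document.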
Before you can proceed, you need to prepare your system as described in Procedure 3.3, “Preparing the Workflow”.
Once you have done all the preparations, proceed with Procedure 3.4, “Workflow for Protecting Custom Entities”.
Not expanding entities looks like an abnormal use case as XML parsers are supposed to expand entities by default. However, expanding entities has one disadvantage: after the XML parser has resolved entities, the content is indistinguishable from other content. In other words, you cannot distinguish content that comes from an entity definition from existing content. You cannot “go back” and revert this process once all entities are expanded.
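The loss of information can be seen with Python's standard library parser. The sketch below uses a made-up entity named `product` with the made-up value "Example Product"; after parsing, the text that came from the entity cannot be told apart from the text that was typed literally.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one internal entity and one literal occurrence
# of the same words.
DOC = """<!DOCTYPE para [
  <!ENTITY product "Example Product">
]>
<para>Install &product; before using Example Product.</para>"""

para = ET.fromstring(DOC)

# The parser has already expanded the entity; the reference is gone.
expanded_text = para.text
serialized = ET.tostring(para, encoding="unicode")
```

Here `expanded_text` contains "Example Product" twice, and `serialized` no longer contains `&product;`.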
For this reason, preserving entities can be useful when you want to process XML but keep the defined entities untouched. Procedure 3.4 showed one solution. However, be aware of certain issues that might cause problems with XML files:
The script is not XML-aware. It reads the XML file line by line and replaces the entities in each line using regular expressions.
The script can replace entities in situations where it is not desirable. For example, if you have an entity in the href attribute of an xi:include element, the reference will no longer work when resolving XIncludes.
The script doesn't handle external entities. External entities refer to external storage objects, for example:
<!DOCTYPE book [
  <!ENTITY intro SYSTEM "chap-intro.xml">
]>
<book>
  <title>...</title>
  &intro;
</book>
Such external entities usually refer to whole XML structures like chapters or sections. Replacing such structures with ordinary text would lead to syntax errors in your XML files. However, such entities are not used very often, and they should be replaced by XIncludes (see Section 1.3, “Modularize Your Document with XIncludes”).
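Before running a text-based replacement, it can help to check a file for the two risky cases described above. The following is a rough sketch, not part of the scripts discussed here: the function name `find_risky_entities` is hypothetical, it assumes the XInclude namespace is bound to the prefix `xi`, and because it relies on regular expressions it is no more XML-aware than the replacement script itself.

```python
import re

# External general entities: <!ENTITY name SYSTEM "..."> or PUBLIC.
# Parameter entities (<!ENTITY % name ...>) are deliberately not matched.
EXTERNAL_ENTITY_RE = re.compile(
    r"<!ENTITY\s+([^\s%]\S*)\s+(?:SYSTEM|PUBLIC)\b"
)
# href attribute value of an xi:include element (prefix 'xi' is assumed).
XINCLUDE_HREF_RE = re.compile(
    r"<xi:include\b[^>]*\bhref\s*=\s*([\"'])(.*?)\1", re.S
)
ENTITY_REF_RE = re.compile(r"&([A-Za-z_][\w.-]*);")


def find_risky_entities(xml_text):
    """Report entities that a plain text replacement would break."""
    external = EXTERNAL_ENTITY_RE.findall(xml_text)
    in_href = []
    for match in XINCLUDE_HREF_RE.finditer(xml_text):
        in_href.extend(ENTITY_REF_RE.findall(match.group(2)))
    return {"external": external, "in_xinclude_href": in_href}
```

An empty result for both keys suggests that neither issue applies to the file, although a regular expression check like this can miss unusual formatting.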
If you prefer a solution that can handle XML, use the Python script ents2pi.py. This script parses XML and replaces each entity with a processing instruction (PI). For example, the entity &product; is converted to the PI <?entity product?>. That makes it easier to process the result with XSLT or any other XML-aware tool.
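To illustrate why PIs are convenient for XML-aware tools, the sketch below reads such PIs back with Python's standard library. It does not show how ents2pi.py works internally. The helper name `entity_names_from_pis` is hypothetical, the PI is assumed to have the form <?entity name?>, and the `insert_pis` option requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def entity_names_from_pis(xml_text):
    """Collect entity names stored as <?entity name?> PIs (hypothetical helper)."""
    # insert_pis=True keeps processing instructions in the tree
    # instead of silently dropping them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
    root = ET.fromstring(xml_text, parser=parser)

    names = []
    for node in root.iter():
        # PI nodes use the ProcessingInstruction factory as their tag;
        # their text holds "target data".
        if node.tag is ET.ProcessingInstruction:
            target, _, data = node.text.partition(" ")
            if target == "entity":
                names.append(data.strip())
    return names
```

Because the PIs are regular nodes in the tree, no string matching on the raw file is needed to locate them.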
ent2text.py and ents2pi.py

| | ent2text.py | ents2pi.py |
|---|---|---|
| | Entity to Text | Entity to PI |
| XML-aware | No | Yes |
| Easy to process further | More difficult | Easy |
| Preserves source code | Always | Usually |
| Dependencies | Only Python standard library | lxml[a] |