XML Boiler program

XML Boiler is a command-line program (an HTTP(S) proxy interface is planned for the future) that automatically processes XML based on its namespaces.

This is an alpha (not thoroughly tested) release. The URLs used below will surely change in future versions, so backward compatibility will not be preserved.

This program conforms (well, almost) to the specification. See the specification for more details on what this program does and what its purpose is. One important thing that does not work yet is XML validation.

Prerequisites

You need to know what XML and XML namespaces are before reading this document. You also need to know RDF to do anything beyond simple examples like those shown in this document.

To install XML Boiler, first install Python 3.7 (or above), then run:

pip install xml-boiler

Introduction

Consider an XML file with an XInclude directive:

<y xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="simple.xml"/>
</y>
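
For readers new to XInclude, the effect of expanding the directive can be sketched with Python's standard xml.etree.ElementInclude module. This only illustrates the expected result; it is not how XML Boiler implements the transformation, and the content of simple.xml used here is a made-up assumption.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

DOC = '''<y xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="simple.xml"/>
</y>'''

# Hypothetical content of simple.xml (not given in this document).
SIMPLE = "<x>hello</x>"


def loader(href, parse, encoding=None):
    """Serve the included document from memory instead of the file system."""
    if href == "simple.xml" and parse == "xml":
        return ET.fromstring(SIMPLE)
    raise OSError("cannot load %r" % href)


def expand(doc_text):
    """Parse doc_text and replace every xi:include element in place."""
    root = ET.fromstring(doc_text)
    ElementInclude.include(root, loader=loader)
    return root


expanded = expand(DOC)
print(ET.tostring(expanded, encoding="unicode"))
```

After expansion, the xi:include element is replaced by the root element of the included document.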
        

To expand the XInclude directive, we can run XML Boiler as follows:

boiler -i xinclude.xml script 'http://portonvictor.org/ns/trans/XInclude#script1'

But here is the magic: boiler can figure out which script to run, given an input XML file and possibly a precedence (such as the include precedence below).

boiler -i xinclude.xml chain -u http://portonvictor.org/ns/trans/precedence-include

We can chain several commands, separating them with a plus sign:

boiler --preload 'http://portonvictor.org/ns/base' -i xinclude.xml pipe \
  'script http://portonvictor.org/ns/trans/XInclude#script1 + script http://portonvictor.org/ns/base#NSClean-script'

or

boiler --preload 'http://portonvictor.org/ns/base' -i xinclude.xml pipe \
  'chain -u http://portonvictor.org/ns/trans/precedence-include + script http://portonvictor.org/ns/base#NSClean-script'

The script http://portonvictor.org/ns/base#NSClean-script, which runs last in both pipelines, removes unused namespaces (http://www.w3.org/2001/XInclude in our case).
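
To make "unused namespaces" concrete, here is a minimal Python sketch that reports namespace URIs that are declared in a document but not used by any element or attribute name. It is an illustration only, not the NSClean implementation, and it deliberately ignores namespace prefixes that appear inside attribute values or text.

```python
import io
import xml.etree.ElementTree as ET


def namespace_of(name):
    """Return the namespace URI of an ElementTree '{uri}local' name, or None."""
    if name.startswith("{"):
        return name[1:].split("}", 1)[0]
    return None


def unused_namespaces(xml_text):
    """Return the set of declared namespace URIs not used by any tag or attribute."""
    declared, used = set(), set()
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            _prefix, uri = payload  # a namespace declaration
            declared.add(uri)
        else:
            # payload is an element: check its tag and its attribute names
            for name in (payload.tag, *payload.attrib):
                uri = namespace_of(name)
                if uri is not None:
                    used.add(uri)
    return declared - used


# A document in which the XInclude directive has already been expanded.
EXPANDED = '<y xmlns:xi="http://www.w3.org/2001/XInclude"><x>hello</x></y>'
print(unused_namespaces(EXPANDED))
```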

The pipe commands above need to preload the http://portonvictor.org/ns/base asset, because this asset defines the script http://portonvictor.org/ns/base#NSClean-script that we use. (The two earlier examples used the asset http://www.w3.org/2001/XInclude, which was loaded automatically because the URL of the asset was present in the input file as a namespace URI.)

In the current version of this software, assets are actually loaded from local files, even though their names look like URLs. A future version should make it possible to load assets from real Web URLs, allowing namespace owners to put descriptions of their XML tags online for use by this software.

The magic of chain

The chain command is magical. It automatically finds a chain of transformations that takes the source document to the destination namespace. To do this, it consults the assets located at the namespaces used in the document, preloaded (--preload) assets, etc.

For more details on how it works see the specification.
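
As a rough intuition only, finding such a chain resembles a path search in a graph whose nodes are namespaces and whose edges are transformations. The following Python sketch uses a plain breadth-first search over hypothetical transformation names and urn:example namespaces; the actual algorithm (including precedences) is the one defined in the specification and differs from this toy.

```python
from collections import deque


def find_chain(transformations, source_ns, target_ns):
    """Return the shortest list of transformation names leading from
    source_ns to target_ns, or None if no chain exists.

    transformations is a list of (name, from_namespace, to_namespace).
    """
    queue = deque([(source_ns, [])])
    seen = {source_ns}
    while queue:
        ns, path = queue.popleft()
        if ns == target_ns:
            return path
        for name, src, dst in transformations:
            if src == ns and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    return None


# Hypothetical transformations, for illustration only.
TOY = [
    ("a-to-b", "urn:example:a", "urn:example:b"),
    ("b-to-c", "urn:example:b", "urn:example:c"),
    ("a-to-d", "urn:example:a", "urn:example:d"),
]
chain = find_chain(TOY, "urn:example:a", "urn:example:c")
print(chain)
```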

Why not just run a Unix pipeline instead of XML Boiler? The reasons are:

Scripts vs transformations

The main things that assets define are transformations.

A transformation may, for example, define how to transform from one XML namespace to another.

You can (provided that the asset defining the transformation is loaded) call a transformation like this:

boiler --preload http://portonvictor.org/ns/base transform 'http://portonvictor.org/ns/base#xml-format'

A transformation can provide several alternative scripts. Instead of letting XML Boiler choose a script for a transformation automatically, you can run a particular script explicitly:

boiler --preload http://portonvictor.org/ns/base script 'http://portonvictor.org/ns/base#xml-format-script'

Command line

Common options

This list is not complete. Use --help for the full list of options.

-h, --help
Help message
-i INPUT, --input INPUT
input file (defaults to stdin)
-o OUTPUT, --output OUTPUT
output file (defaults to stdout)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set log level.
-p NAMESPACE, --preload NAMESPACE
Load the specified asset before the main loop.
-r {none,breadth,depth}, --recursive {none,breadth,depth}
recursive download mode (none, breadth-first, depth-first)
-y NAME=DIR, --directory NAME=DIR
additional directory with assets
--software {package,executable,both}
determine installed software by package manager and/or executables in PATH. 'package' is currently supported only on Debian-based systems. Defaults to 'both' on Debian-based systems and 'executable' on others.
-d DOWNLOADERS, --downloaders DOWNLOADERS
assets to be loaded before the main loop; a plus-separated list of comma-separated lists of "builtin","DIR" (DIR is given by --directory option, "builtin" is the assets distributed with XML Boiler)
-s {precedence,doc}, --next-script {precedence,doc}
next script algorithm (precedence is not supported)
-n {ignore,remove,error}, --not-in-target {ignore,remove,error}
What to do if the result XML file contains namespaces not in the target. remove is not supported.
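
The value format of the --downloaders option (a plus-separated list of comma-separated lists) can be read as a list of lists. Below is a minimal Python sketch of splitting such a value; it is not XML Boiler's own parser, and "mydir" is a hypothetical directory name of the kind defined with --directory.

```python
def parse_downloaders(value):
    """Split 'a,b+c' into [['a', 'b'], ['c']]."""
    return [
        [name.strip() for name in group.split(",")]
        for group in value.split("+")
    ]


# "mydir" is a hypothetical NAME given via --directory mydir=/some/path
groups = parse_downloaders("builtin,mydir+builtin")
print(groups)
```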

Chain

The chain (or c) command (boiler chain ...) runs an automatic transformation pipeline (see the specification). It accepts the name of the input file (none or - for stdin) and the following options:

-t NAMESPACE, --target NAMESPACE
target namespace (often the XHTML namespace http://www.w3.org/1999/xhtml)
-u URL, --universal-precedence URL
universal precedence (see the specification)

Supported namespaces and scripts

We currently have implicit support for the following namespaces:

http://www.w3.org/2001/XInclude (XInclude)

The XInclude standard allows including one XML document inside another.

See Wikipedia.

Extensible Modular Markup

The following transforms EMM to XHTML2:

PYTHONPATH=. ./bin/boiler -l DEBUG -i xmlboiler/tests/core/data/xml/emm.xml --preload http://portonvictor.org/ns/base --preload 'http://portonvictor.org/ns/EMM' \
pipe 'c -t https://www.w3.org/2002/06/xhtml2/ -n ignore + t http://portonvictor.org/ns/base#NSClean'

http://portonvictor.org/ns/comment (Comment)

Tags c:comment of this namespace are simply removed from the XML.

http://portonvictor.org/ns/EMM/sections (Structure)

This transforms an EMM module into an XHTML1 document.

See the source of this document for an example. <h?> tags with the correct nesting level are created automatically.

<struct:toc/> automatically generates a table of contents.

http://portonvictor.org/ns/syntax (Syntax highlighting)

<pre syntax:format="JavaScript">function() { return 123 }</pre>

produces

function() { return 123 }

We also support the following scripts and transformations:

http://portonvictor.org/ns/base#xml-format transformation (or http://portonvictor.org/ns/base#xml-format-script) from http://portonvictor.org/ns/base asset indents XML code.

http://portonvictor.org/ns/base#NSClean transformation (or http://portonvictor.org/ns/base#NSClean-script) from http://portonvictor.org/ns/base asset removes unused XML namespaces.

Your own transformations

You can create your own transformations (and scripts) after reading the specification. Currently, transformations in Python and XSLT are supported, but adding support for a new language is not very difficult. A script is simply a program that receives XML on stdin and prints the transformed XML to stdout.
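
As an illustration of the stdin-to-stdout contract, here is a minimal Python sketch of a script that drops the c:comment elements of the http://portonvictor.org/ns/comment namespace. It is not the script shipped with XML Boiler, and how to declare such a script in an asset is described in the specification, not here.

```python
import sys
import xml.etree.ElementTree as ET

COMMENT_TAG = "{http://portonvictor.org/ns/comment}comment"


def transform(xml_text):
    """Return xml_text with every c:comment element removed."""
    root = ET.fromstring(xml_text)
    # Materialize the element list first, so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == COMMENT_TAG:
                # The text following the element (its tail) is removed with it.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


def main():
    """Read XML from stdin and write the transformed XML to stdout."""
    sys.stdout.write(transform(sys.stdin.read()))


SAMPLE = ('<a xmlns:c="http://portonvictor.org/ns/comment">'
          '<c:comment>draft note</c:comment><b/></a>')
print(transform(SAMPLE))
```

To use this as a real script, call main() under the usual `if __name__ == "__main__":` guard and drop the sample lines at the end.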

You can put your own assets (containing your transformations, for example) into a directory, naming each asset file with the percent-encoded URI of the asset.

Then use the --directory NAME=DIR option to instruct XML Boiler to read assets from this directory.
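
A percent-encoded file name can be computed with Python's standard urllib.parse.quote, as in the sketch below. The asset URI is a made-up example, and the assumption that every reserved character (including / and :) is encoded, with no file extension added, should be checked against the assets distributed with XML Boiler.

```python
from urllib.parse import quote, unquote


def asset_file_name(asset_uri):
    """Percent-encode every reserved character, including '/' and ':'."""
    return quote(asset_uri, safe="")


# Hypothetical asset URI, for illustration only.
URI = "http://example.org/ns/my-asset"
name = asset_file_name(URI)
print(name)
```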

Future plans

The key opportunities this project opens:

Support/donate

Support this project with money:

If you know Python 3, contribute to the code.


Copyright © Victor Porton 2018