Convert ISO/NISO STS documents to Metanorma using mnconvert

Ronald Tse

Alexander Dyuzhev on 17 Nov 2022

Introduction

An increasingly number of National Standards Bodies (NSBs), as members of ISO and IEC, now use Metanorma to re-publish their adoptions of ISO and IEC standards.

Occasionally, these NSBs also re-publish "corrections" of the ISO and IEC standards when the source files happen to be erroneous or corrupt.

Typically, a somewhat manual process was needed to convert ISO STS and NISO STS files to Metanorma.

To address this need, Metanorma now provides mnconvert, a command line tool that helps automatically convert ISO STS or NISO STS formats into Metanorma.

With mnconvert, STS users can easily convert and republish their ISO- and IEC-sourced documents using Metanorma.

ISO STS and NISO STS formats

The ISO Standards Tag Set (ISO STS) was originally developed by ISO with its NSB members as an internal XML format to facilitate its XML-based publishing workflow.

The ISO STS XML schema defines a set of elements and attributes that allow encoding of metadata and content of an ISO/IEC standard document in XML.

The format was originally developed based on a tailored set of elements from the Journal Article Tag Set (JATS) schema, which was originally developed by the National Institute of Health (NIH) for usage in scientific publishing.

As ISO STS came into use, several NSBs wanted to generalize a format for national standards publishing. They took ISO STS to the National Information Standards Organization (NISO) to standardize it.

The result was NISO Standards Tag Set (NISO STS), a generalized tag set that supports notions of national and regional adoption of standards, used by NSBs.

Starting in 2022, ISO has began its migration from ISO STS to NISO STS. IEC, which held out from the previous XML exercise, has started to migrate its publication workflow to NISO STS starting in 2023.

Migrating from ISO STS and NISO STS formats to Metanorma

ISO and IEC now distribute standards to its members using the NISO STS format.

However, NISO STS is an "encoding format" meant to be read-only.

The usage of NISO STS does not facilitate the:

authoring or editing of standards;
adoption of standards into national or regional standards formats with modifications.

In addition, not all NSBs wished to adopt the same set of tools used by ISO and IEC to work with NISO STS, due to cost, quality, flexibility and integration issues.

An increasing number of NSBs have adopted Metanorma for their STS processing toolchain, since Metanorma enables ingestion of ISO STS and NISO STS files, authoring and editing of the standards, and the re-publication or adoption into the national / regional formats, or other presentational formats like HTML and Word DOC.

An ISO STS or NISO STS XML file, especially when at length, are generally difficult to manually migrate into the Metanorma format, even with advanced text editors.

For NSBs that work with both ISO and IEC publishing, the incompatibilities between the NISO STS files from both organizations further exacerbates the issue. Despite both using NISO STS, ISO and IEC use slightly different conventions and encodings in NISO STS. This presents an additional barrier for NSBs trying to use a unified XML toolchain to handle documents from both ISO and IEC.

The encoding differences between ISO’s NISO STS and IEC’s NISO STS is documented in the NISO STS 1.0 — IEC/ISO Coding Guidelines document.

Introducing `mnconvert`

General

The newly released mnconvert is a command-line tool that allows conversion of ISO STS and NISO STS XML content into Metanorma AsciiDoc.

Instead of a simple XML transformation process, mnconvert produces a directory of files following best practices in Metanorma document editing, providing split sections and an immediate toolset for re-publishing or modification of content in the same ISO and IEC presentation.

mnconvert is a Java program packed into a JAR file for execution.

Quick guidance on how to use mnconvert is provided below.

Note	For further information on how to use the command, please see `mnconvert` documentation.

Prerequisites

A system needs to meet the following prerequisites to run mnconvert.

Presence of a Java JRE. Java typically comes with your system, with the official versions at Oracle (install Java from java.com) or from OpenJDK.

Note
If you think you have the Java JRE already but unsure, execute the command java -version to confirm.
Download the latest JAR release of mnconvert from the mnconvert releases page;

If you wanted to extend mnconvert, you will also need to install:

a Java Development Kit (JDK), with the most popular one OpenJDK;
the make and maven build tools, as per these instructions.

Using `mnconvert`

The command structure of mnconvert is:

Command syntax for running mnconvert

java -jar <location to mnconvert JAR file> <path to STS XML file> [options]

The default behavior of mnconvert is to convert the given XML file to Metanorma AsciiDoc:

The converted files will be placed at the same directory as the XML file.
The filename of the main document will be kept the same as in the provided XML file, but with the .adoc extension.
Sections and images (if any) will be placed in separate subfolders, named sections and images, respectively, in accordance with Metanorma best practices in authoring (see samples).

Note	Optional parameters to `mnconvert` are listed and explained in the section below.

Optional parameters for conversion

The full list of mnconvert options are described at mnconvert documentation.

The following options are commonly used:

--output or -o

to specify an output path different to the XML file.

--output-format or -of

to specify the output format. Values are:

adoc: (default) Metanorma AsciiDoc output.
xml: Metanorma Presentational XML output.

--imagesdir or -img

to specify an :imagesdir: attribute value which determines where the images will be extracted to. Defaults to images.

--mathml or -m

to specify the MathML version of the XML STS. Values are:

2: for MathML version 2
3: for MathML version 3

--check-type or -ct

to perform a validation check of the XML file against ISO STS or NISO STS.

Values are:

xsd-niso: check against NISO XSD
dtd-niso: check against NISO DTD
iso-dtd: check against ISO DTD

Converting ISO STS and NISO STS into Metanorma AsciiDoc

General

While mnconvert is the main processor in converting STS documents into Metanorma AsciiDoc, there is a quality process surrounding conversion.

These are the major steps in converting an ISO STS or NISO STS document into Metanorma AsciiDoc:

Run mnconvert
Validate and perform semantic checks
Generate converted output

Running `mnconvert`

For illustrative purposes, let’s assume that:

the JAR and XML files are located in the same folder;
we are using JAR release 1.56.0, i.e. mnconvert-1.56.0.jar;
the XML STS file we want to convert is named standard.xml.

Taking these assumptions into account, the command becomes:

Example of running mnconvert on standard.xml

java -jar mnconvert-1.56.0.jar standard.xml [options]

Validate and perform semantic checks

General

Once the initial conversion with mnconvert tool is performed, it is likely that still some adjustments need to be made to obtain a well-encoded Metanorma AsciiDoc.

The reasons are given below:

ISO STS and NISO STS are "presentational" formats that are geared towards instructions to display or render content, the content itself does not come with detailed semantics. Metanorma, however, is a semantic publishing engine, which requires the user to encode content with proper semantics. This means that during the conversion, when mnconvert does not know the meaning of content, the user needs to supplement them post-conversion.
ISO- and IEC- sourced XML files can contain errors and occasionally miss critical data, for example the encoding of the committees and working groups. Such missing data require manual supplementation.
In order to generate the document, the document needs to pass all Metanorma validation rules, so a remedial process is necessary.

Metanorma documentation will be our best ally in this chore.

The complexity of remediation depends on the source XML file itself, but is typically a quick validation — given that mnconvert has done the majority of the work.

The following manual validation checks are useful in ensuring the correctness of the resulting document.

A quality checklist is provided at Annex: Conversion quality checklist for users to keep track of validation steps per conversion.

Check: Verify document attributes

Ensure the correctness of document attributes.

If the STS XML source is encoded properly, the document attributes will be correctly encoded.
Frequently the sourced XML files will not provide contribution information or publication attributes, such as publication date. Ensure that all attributes used for describing (sub)committees are available. Any missing information should be manually supplemented.
Metanorma supports conversion of STS files at any stage (at ISO DIS and IS stages utilize STS). Check whether the document is published as a draft or not.

Check: Remove unnecessary non-breaking spaces

XML STS documents tend to include numerous non-breaking spaces to avoid the split of continuity of some expressions by a line break.

Some of these non-breaking spaces are unnecessary in Metanorma encoding. These unnecessary non-breaking spaces should be removed.

For example:

white space in a percentage expression (e.g., 75 %) should not be broken by a line break;
non-breaking spaces in reference declarations are not needed.

Check: Validate "Normative references" clause boilerplate

Verify if the initial text of the "Normative references" clause corresponds to the usual predefined text, or if it is any other different.

In the "Normative references" clause, the initial paragraph will be encoded like this after using mnconvert:

Example of converted "Normative references" boilerplate

[NOTE,type=boilerplate]
====
The following referenced documents are indispensable for the application
of this document. For dated references, only the edition cited applies.
For undated references, the latest edition of the referenced document
(including any amendments) applies.
====

Occasionally, ISO and IEC documents will contain a non-standard (non-compliant to ISO/IEC DIR 2) boilerplate.

If the converted boilerplate is compliant to ISO/IEC DIR 2, we can remove this NOTE block, as this text is provided automatically by Metanorma.
If the text differs, it means that the ISO/IEC DIR 2 boilerplate was overridden. In which case, this markup should be retained.

Check: Validate "Terms and definitions" clause encoding

Verify the encoding of the "Terms and definitions" clause.

For a "Terms and definitions" clauses that imports terms from another document, it is advised to use [source="REFERENCE_ANCHOR1,REFERENCE_ANCHOR_2"] syntax before the clause definition to generate a predefined text for the "Terms and definitions" clause.
When a "Terms and definitions" clause contains sub-clauses, care must be taken to ensure the correctness of the clause.
Concept mentions should be checked to ensure that terms are encoded in the {{…}} syntax.

Enhancement: Encoding of source code blocks

Encoding of any "source code" blocks should be checked.

Metanorma supports [source] content blocks as containers for source code. In ISO STS and NISO STS, there is no such provision for source code, they are just treated as plain text with particular formatting styles.

Source code blocks mainly contain the code without any line breaks.
Insert necessary line breaks and indentations where needed (or, if there are too many lines, just copy and paste the whole code from the XML source)
Check whether all source code blocks are encoded properly with the [source] declaration. It is preferable to specify the programming language used in the code block.
In some cases, source code blocks are applied with formatting. If any source code text is boldfaced, italicized, or underlined, the text should be surrounded by delimiters {{{…}}} for enabling markup. You also have the option of the subs attribute configured as subs="verbatim,quotes", to enable rich text inside code.

Check: Validate references in "Bibliography" and "Normative references"

Check the references in "Bibliography" and Normative references sections.

Occasionally, in ISO or IEC sourced STS documents, an identical reference can be listed in both sections, despite violating the rules of the ISO/IEC DIR 2.

If that is the case, remove the reference from the "Bibliography" and make sure that all the cross-references are then made with the anchor given in the "Normative references" section.
Ensure that there are no cross-references pointing to the document itself (it happens).
Ensure that correct identifiers are given for auto-fetching references.

Check: Verify cross-references

All cross-references should be validated.

Original text often refers to the bibliographic entries differently, which makes it hard for mnconvert to convert all of them perfectly. That is mostly noticeable in references to some specific clause, or even paragraph from some bibliographic entry. Therefore, make sure to avoid hanging clauses in cross-references.

Anchors mostly use underscores for dividing separate words/numbers which specify some reference. In the AsciiDoc files generated by mnconvert, it is possible that an anchor will include unnecessary whitespace before or after an underscore, or two underscores instead of one. It is recommended to confirm that the same anchor is used for the same reference throughout the document and remove any needless whitespace to avoid broken cross-references.

Check: Validate document encoding

Perform a quick validation of the document’s encoding in general.

Ensure the document complies with the best practices for encoding a document in Metanorma AsciiDoc:

Remove non-ASCII characters;

Note
It is a good practice to use regular expressions for finding non-ASCII characters. In this type of conversions, it is especially handy since it finds unnoticeable non-ASCII whitespaces.
Remove width parameters in tables;
Check width and height parameters given to images, most are not needed;

Validate text encoding; some ISO and IEC sourced STS XML files originate from OCR, and they will contain OCR errors;

Note	It is possible that characters are wrongly encoded in the source XML file, e.g. number `1` written as `l` and vice versa. It is advised to check whether there are such typos in the converted AsciiDoc and to correct them.

Optionally, split lines longer than 100 characters into shorter ones for ease of editing.

Enhancement: Re-encode math into `stem` blocks

XML STS documents frequently contain math or numbers not in MathML but as textual encoding.

These entries can be migrated into the \$...\$ blocks used in Metanorma for proper formatting.

Generating the resulting document

After the validation process, it is time to attempt generating the document through Metanorma.

Generating the converted and validated document

$ metanorma site generate

Voila!

Conclusion

Metanorma now provides mnconvert that helps users convert existing ISO STS and NISO STS content into Metanorma for adoption or republication.

In fact, the same process can also be used to "normalize" STS XML files, since Metanorma also produces STS XML output, and some NSBs are finding that useful!

Annex: Conversion quality checklist

Conversion quality checklist:

Verify document attributes
Remove unnecessary non-breaking spaces
Validate "Normative references" clause boilerplate
Validate "Terms and definitions" clause encoding
Validate references in "Bibliography" and "Normative references"
Verify cross-references
Validate document encoding

Introduction

ISO STS and NISO STS formats

Migrating from ISO STS and NISO STS formats to Metanorma

Introducing mnconvert

General

Prerequisites

Using mnconvert

Optional parameters for conversion

Converting ISO STS and NISO STS into Metanorma AsciiDoc

General

Running mnconvert

Validate and perform semantic checks

General

Check: Verify document attributes

Check: Remove unnecessary non-breaking spaces

Check: Validate "Normative references" clause boilerplate

Check: Validate "Terms and definitions" clause encoding

Enhancement: Encoding of source code blocks

Check: Validate references in "Bibliography" and "Normative references"

Check: Verify cross-references

Check: Validate document encoding

Enhancement: Re-encode math into stem blocks

Generating the resulting document

Conclusion

Annex: Conversion quality checklist

Introducing `mnconvert`

Using `mnconvert`

Running `mnconvert`

Enhancement: Re-encode math into `stem` blocks