I had an “oop, ack” reaction reading the documentation for the Microsoft Office Word XML format the Danes posted yesterday.
Read the section on formatting text: WordML defines a run to represent some sequence of text within a paragraph. Inside the run, you turn text decoration on and off with semaphores. Instead of:
<w:r>
<w:t><w:b>Hello World</w:b></w:t>
</w:r>
The markup is:
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t>Hello, bold text</w:t>
</w:r>
And to turn it off:
<w:r>
<w:rPr>
<w:b />
</w:rPr>
<w:t>Hello, bold text.</w:t>
<w:rPr>
<w:b w:val="off" />
</w:rPr>
<w:t>Goodbye, bold text.</w:t>
</w:r>
I haven’t tried writing XSLT to process this, so I don’t have an opinion on how hard or easy it will be.
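To get a rough sense of the bookkeeping a processor needs, here is a minimal sketch in Python rather than XSLT. It walks the children of a single run in order and tracks the bold state as each `w:rPr` is encountered. The namespace URI, the `bold_spans` name, and the rule that a missing `w:val` means "on" are assumptions made for illustration, not something taken from the schema documentation.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the w: prefix; verify against the schema docs.
W = "http://schemas.microsoft.com/office/word/2003/wordml"

RUN = f"""<w:r xmlns:w="{W}">
<w:rPr><w:b/></w:rPr>
<w:t>Hello, bold text.</w:t>
<w:rPr><w:b w:val="off"/></w:rPr>
<w:t>Goodbye, bold text.</w:t>
</w:r>"""


def bold_spans(run_xml):
    """Return a list of (text, is_bold) pairs for one run (hypothetical helper)."""
    run = ET.fromstring(run_xml)
    bold = False  # state carried forward from the most recent w:rPr
    spans = []
    for child in run:  # direct children, in document order
        if child.tag == f"{{{W}}}rPr":
            b = child.find(f"{{{W}}}b")
            if b is not None:
                # Assumption: <w:b/> with no w:val switches bold on.
                bold = b.get(f"{{{W}}}val", "on") != "off"
        elif child.tag == f"{{{W}}}t":
            spans.append((child.text or "", bold))
    return spans


if __name__ == "__main__":
    for text, is_bold in bold_spans(RUN):
        print(is_bold, text)
```

The point of the sketch is that the formatting is state carried alongside the text rather than a wrapper around it, so any processor has to remember what the last property block said.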
It gets more interesting when you look at the namespace Microsoft has defined for annotations. To create a bookmark within a document, you use two complete elements:
<w:p><w:r><w:t>Before bookmark</w:t></w:r></w:p>
<aml:annotation aml:id="0" w:type="Word.Bookmark.Start"
w:name="MyBookmark" />
<w:p><w:r><w:t>Inside bookmark</w:t></w:r></w:p>
<aml:annotation aml:id="0" w:type="Word.Bookmark.End" />
<w:p><w:r><w:t>After bookmark</w:t></w:r></w:p>
So extracting the bookmarks from a WordML document is not a trivial bit of XSLT.
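The difficulty is that the start and end markers are siblings of the content, not its parent, so a processor has to pair them by `aml:id` and collect whatever falls in between. A minimal sketch of that pairing, again in Python rather than XSLT, might look like the following. The two namespace URIs, the `w:body` wrapper element, and the `extract_bookmarks` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify against the schema documentation.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
AML = "http://schemas.microsoft.com/aml/2001/core"

SAMPLE = f"""<w:body xmlns:w="{W}" xmlns:aml="{AML}">
<w:p><w:r><w:t>Before bookmark</w:t></w:r></w:p>
<aml:annotation aml:id="0" w:type="Word.Bookmark.Start" w:name="MyBookmark"/>
<w:p><w:r><w:t>Inside bookmark</w:t></w:r></w:p>
<aml:annotation aml:id="0" w:type="Word.Bookmark.End"/>
<w:p><w:r><w:t>After bookmark</w:t></w:r></w:p>
</w:body>"""


def extract_bookmarks(xml_text):
    """Map each bookmark name to the text between its start and end markers."""
    root = ET.fromstring(xml_text)
    open_marks = {}  # annotation id -> name, for bookmarks currently open
    collected = {}   # bookmark name -> list of text fragments
    for elem in root.iter():  # pre-order traversal, i.e. document order
        if elem.tag == f"{{{AML}}}annotation":
            kind = elem.get(f"{{{W}}}type")
            mark_id = elem.get(f"{{{AML}}}id")
            if kind == "Word.Bookmark.Start":
                name = elem.get(f"{{{W}}}name")
                open_marks[mark_id] = name
                collected[name] = []
            elif kind == "Word.Bookmark.End":
                # The end marker carries only the id, so pair on that.
                open_marks.pop(mark_id, None)
        elif elem.tag == f"{{{W}}}t" and elem.text:
            # Text belongs to every bookmark that is open at this point.
            for name in open_marks.values():
                collected[name].append(elem.text)
    return {name: "".join(parts) for name, parts in collected.items()}


if __name__ == "__main__":
    print(extract_bookmarks(SAMPLE))
```

A procedural language can keep the set of open bookmarks in a dictionary as it goes. XSLT has no mutable state of that kind, which is why the equivalent stylesheet is likely to be more awkward.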
The annotation language reminds me of Ted Nelson’s notion of keeping the markup out of the document: a separate file holds the offsets, annotations, and markup.