This week I learned I had a 1.5 MB Word file to convert to XML, so it was time to find a tool to do the conversion.
I looked at two products that run under Mac OS X: Infinity Loop’s upCast and Logictran’s R2Net.
After spending a day poking at them, I recommend upCast.
upCast can convert RTF to either XHTML 1.0 or Infinity Loop’s own XML format. The native XML format is a hierarchical form of the document, so it was easy to write XSLT to go from that to my target XML.
You’ll need to make sure that you’ve assigned styles consistently throughout the source RTF file for either the XML or XHTML option. That goes for both of these products: you have to add semantics to the Word/RTF file, or else all you’re going to get out of them is a list of DIV elements.
Logictran’s R2Net wasn’t as useful. To convert a document, you’ll spend some time customizing a configuration file (which has to be added to the OS X application bundle) using their own syntax, though it does come with documentation and examples. It’s also possible to generate output that violates XML’s well-formedness constraints.
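Whichever converter you use, a quick well-formedness check on the output is cheap insurance. Here is a minimal sketch in Python using the standard library parser. The function name and the two sample strings are made up for illustration; they are not output from either product.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Made-up samples: the second has overlapping tags, which XML forbids.
good = "<doc><p>Hello</p></doc>"
bad = "<doc><b><i>Hello</b></i></doc>"

print(is_well_formed(good))
print(is_well_formed(bad)[0])
```

This only checks well-formedness, not validity against a DTD or schema.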
upCast let me use XSLT, something I already knew, to customize the output. The intermediate XML file, besides nesting the content by using document headings, has the RTF style name as an attribute for each element, which provides plenty of hooks for writing a transform.
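To show why a style name on every element is such a useful hook, here is the same idea sketched in Python rather than XSLT. Everything specific in it is an assumption for illustration: the element names (`document`, `section`, `para`), the `style` attribute name, and the style-to-tag mapping are invented and are not upCast’s actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML; NOT upCast's real schema.
SAMPLE = """<document>
  <section>
    <para style="Heading 1">Getting Started</para>
    <para style="Body Text">Install the tool.</para>
    <para style="Code Sample">run --help</para>
  </section>
</document>"""

# Hypothetical mapping from Word style names to target element names.
STYLE_TO_TAG = {
    "Heading 1": "title",
    "Body Text": "p",
    "Code Sample": "programlisting",
}


def convert(element):
    """Recursively copy element, renaming it by its style attribute when mapped.

    Elements with no mapped style keep their original tag.
    Attributes (including style) are not copied to the output.
    """
    style = element.get("style")
    tag = STYLE_TO_TAG.get(style, element.tag)
    out = ET.Element(tag)
    out.text = element.text
    out.tail = element.tail
    for child in element:
        out.append(convert(child))
    return out


result = convert(ET.fromstring(SAMPLE))
print(ET.tostring(result, encoding="unicode"))
```

The mapping table is the whole point: if the styles in the source document are assigned consistently, each one becomes a single rule, which is the role a template match on the style attribute would play in XSLT.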
upCast is a more expensive product, but the incremental cost is worth what I expect you’ll save in time.