Over at w3.org, Dan Brickley wrote a web tool that extracts Adobe XMP metadata from Adobe files such as PDFs and Photoshop documents.
I asked him about the script, and he wrote that the key is a Perl regular expression applied to the document:
m/id='W5M0MpCehiHzreSzNTczkc9d'\s*(bytes=')*([^']*)'?\?>(.*)<\?xpacket end='([^']*)'\?>/sg
The XMP is returned in $3, the third capture group; $4 holds the value of the packet's end attribute.
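For anyone who wants to try the expression outside Perl, here is a minimal sketch of the same match in Python. It is my own port, not Dan's script: re.DOTALL stands in for Perl's /s modifier, and the extract_xmp helper and the sample text are made up for illustration. The sample uses single quotes because that is all the expression matches.

```python
import re

# Python port of the Perl expression above; re.DOTALL plays the role of /s.
XPACKET = re.compile(
    r"id='W5M0MpCehiHzreSzNTczkc9d'\s*(bytes=')*([^']*)'?\?>(.*)<\?xpacket end='([^']*)'\?>",
    re.DOTALL,
)

def extract_xmp(text):
    """Return the packet body (Perl's $3), or None if no packet is found.

    Hypothetical helper for illustration only.
    """
    match = XPACKET.search(text)
    return match.group(3) if match else None

# Made-up sample input, not taken from a real file.
sample = (
    "%PDF-1.4\n"
    "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n"
    "<x:xmpmeta xmlns:x='adobe:ns:meta/'>...</x:xmpmeta>\n"
    "<?xpacket end='w'?>\n"
    "%%EOF\n"
)
xmp = extract_xmp(sample)
```

With a real file you would read the contents first and pass them in; how to decode the bytes is left open here.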
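You’ll want to experiment with this expression because, in practice, it’s a little greedy. In some documents, I get back the XMP and the rest of the PDF file. Line endings are the culprit, I think: the /s modifier lets . match newlines, so the greedy (.*) keeps going until the last <?xpacket end=...?> in the file rather than stopping at the first. Making it non-greedy, (.*?), is the first thing to try.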
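Here is a rough sketch of that greedy behavior, again as a Python port rather than the original Perl. It assumes the troublesome files contain more than one xpacket trailer, which I have not confirmed. The two-packet document and the packet helper are invented for the demonstration, and the (.*?) variant is a suggested change, not part of Dan's expression.

```python
import re

HEAD = r"id='W5M0MpCehiHzreSzNTczkc9d'\s*(bytes=')*([^']*)'?\?>"
TAIL = r"<\?xpacket end='([^']*)'\?>"

GREEDY = re.compile(HEAD + r"(.*)" + TAIL, re.DOTALL)   # body as in the original
LAZY = re.compile(HEAD + r"(.*?)" + TAIL, re.DOTALL)    # assumed fix: non-greedy body

def packet(body):
    """Wrap a body in a single-quoted xpacket header and trailer (hypothetical helper)."""
    return ("<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
            + body + "<?xpacket end='w'?>")

# Made-up document: two packets with unrelated data between them.
doc = ("%PDF-1.4\n" + packet("<first/>")
       + "\nstream...endstream\n" + packet("<second/>") + "\n%%EOF\n")

# Greedy: one match that runs from the first header to the last trailer.
greedy_body = GREEDY.search(doc).group(3)

# Non-greedy: each match stops at the nearest trailer, one body per packet.
lazy_bodies = [m.group(3) for m in LAZY.finditer(doc)]
```

The greedy version swallows the data between the packets along with the second packet, while the non-greedy version returns each body separately.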
