Embed license info in PDF metadata

Just embed the license in metadata, he said as I was preparing a poster PDF for an online conference. Okay, but how?

My goal here is to add the authors and a Creative Commons license statement as PDF metadata. My starting point for a poster is usually Inkscape or Scribus, and I couldn't find such an export option in either of those. I need another software to edit the produced PDF file. I am looking to use only free software which is readily available on Debian and/or Fedora (so no Adobe Acrobat).

Creative commons has an (old?) wiki, and its PDF page points out two mechanisms of storing metadata in PDF: document information dictionary and metadata stream – as described in PDF 1.7 standard section 10.2.

The former is more simplistic. This unix stackexchange post describes how to add metadata with Ghostscript. That wouldn't do it for me. First, Ghostscript will re-process the entire document. Second, there is no standard property name for license.

The XMP metadata stream is more flexible, thanks to namespaces, including Dublin Core (which Adobe also uses). Adobe Acrobat's PDF properties and metadata help page also describes XMP. CC wiki's XMP page says that "Creative Commons is recommending XMP as the preferred format for embedded metadata, given its support for numerous file formats and the balkanized state of embedded metadata standards". This leads me to believe that XMP is indeed the preferred way. However, CC wiki's PDF page gives a rather obscure example (using a C library), so I needed a better tool.

Searching the Internet, I found ExifTool. Despite its name, it works with a large variety of file and metadata formats. ExifTool works with PDF and can insert XMP tags with built-in support for many namespaces, including Dublin Core (DC) and Creative Commons (CC). It is available on Debian as libimage-exiftool-perl and on Fedora as perl-Image-ExifTool. Whether it's the best tool for the job, I honestly don't know.

Reading the documentation is always helpful to get an idea of what's going on, but here's an example.

Given a PDF file, document.pdf, I add several authors (dc:Creator), a title (dc:Title), and license information (dc:Rights and cc:License):

  exiftool \
      -XMP-dc:Creator+="John Doe" \
      -XMP-dc:Creator+="Jane Doe" \
      -XMP-dc:Title="Lorem ipsum" \
      -XMP-dc:Rights="This work is licensed under a Creative Commons Attribution 4.0 International License" \
      -XMP-cc:License="https://creativecommons.org/licenses/by/4.0/" \
      document.pdf

Note that I am using dc:Rights to add a typical license statement, and cc:License to add a link to the license. ExifTool expects cc:License to be a string, so it's probably the best I can do.

As a first test of whether that worked, ExifTool can read the metadata it added (subset of output shown):

  exiftool document.pdf
ExifTool Version Number         : 12.57
File Name                       : document.pdf
...
XMP Toolkit                     : Image::ExifTool 12.57
License                         : https://creativecommons.org/licenses/by/4.0/
Creator                         : John Doe, Jane Doe
Rights                          : This work is licensed under a Creative Commons Attribution 4.0 International License
Title                           : Lorem ipsum
Producer                        : pdfTeX-1.40.24
Author                          :
Subject                         :
...

Pdfinfo can read the metadata too, but we need to provide the -meta flag. It prints raw XML, so we can see exactly what was added by ExifTool – it looks like a reasonable XML document:

  pdfinfo -meta document.pdf
  <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
  <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='Image::ExifTool 12.57'>
  <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

   <rdf:Description rdf:about=''
    xmlns:cc='http://creativecommons.org/ns#'>
    <cc:license rdf:resource='https://creativecommons.org/licenses/by/4.0/'/>
   </rdf:Description>

   <rdf:Description rdf:about=''
    xmlns:dc='http://purl.org/dc/elements/1.1/'>
    <dc:creator>
     <rdf:Seq>
      <rdf:li>John Doe</rdf:li>
      <rdf:li>Jane Doe</rdf:li>
     </rdf:Seq>
    </dc:creator>
    <dc:rights>
     <rdf:Alt>
      <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons Attribution 4.0 International License</rdf:li>
     </rdf:Alt>
    </dc:rights>
    <dc:title>
     <rdf:Alt>
      <rdf:li xml:lang='x-default'>Lorem ipsum</rdf:li>
     </rdf:Alt>
    </dc:title>
   </rdf:Description>
  </rdf:RDF>
  </x:xmpmeta>

  <!-- note: skipping several empty lines -->

  <?xpacket end='w'?>

Evince (GNOME document viever) picks up the creators as authors, alhough, sadly, none of the rights / license information. Strangely, I cound not find any of these metadata in Okular (KDE document viewer).

/img/embed-pdf-metadata/evince-properties.png

So, mission accomplished… kind of.