Thu 05 October 2017

Setting up a markdown/LaTeX document.

In this article we'll take a quick look on how to set up a markdown / LaTeX project with the help of Python and Pandoc. The goal is to have a single command to compile a set of markdown documents into a single LaTeX pdf. As an example project I've modified my master thesis paper to consist of markdown files.

Motivation

LaTeX is a great tool to write beautiful looking documents, however it can be a bit ehm... cumbersome, especially the syntax. Markdown on the other hand is a wonderfully lightweight markup language that makes writing a joy, at least in my opinion. Since I had to write a master thesis, I figured I would try and get the best of both worlds and write my content in markdown, which I would then transform into LaTeX documents, which I could compile normally. For those that don't know me, I must admit I'm a sucker for convoluted solutions to problems, and I can't deny that this played a role into settling on this approach.

I'm far from the first to explore this approach, there are several blogposts on the web that, ..., and I have drawn inspiration from those.

Tools

The main tool, which most blogposts use, is Pandoc. Pandoc is a tool that can transform text files from a wide variety of syntaxes into different syntaxes. On top of that it is free software released under the GPL. The scrooge in me can definitely appreciate that.

After we have transformed the files from markdown to latex, they need to be compiled to a pdf, I personally use pdflatex for that, mostly due to the fact I'm lazy and haven't looked into other options like XeLaTeX or LuaTeX. We won't venture too much into the domain of LaTeX itself. I'll assume you already are familiar with it, if not I definitely recommend taking a look at it. It is definitely my go to tool for creating documents.

Lastly we'll use Python to wrap all the different terminal commands into a single build command. This script will copy the relevant files from the source folder into a build folder, and then call the appropriate compile functions. For this we make use of several libraries. All the paths will be done with the pathlib library which comes bundled with Python 3. In order to make it possible to run the build function as a commandline task instead of a script we use Invoke. Finally we use plumbum to access the different executables we need to execute.

Document structure

Before we dive in to setting up the whole project, lets first take a look at the document structure. In the case of my paper we can discern three parts: an abstract, a bunch of section making up the main matter, and the appendix. Pandoc makes use of templates to convert files from one type to another. In this case we will define three templates, an abstract, section and appendix template. These will generate corresponding LaTeX documents. These LaTeX documents will be included in a single tex document which will also specify the document class and LaTeX packages used. This is done with the \input{<path>} command.

Each of the images used will be written in LaTeX. Images in LaTeX can be a finicky business. It would most likely be possible to do this neatly in markdown as well. However due to my experience with LaTeX, I figured I should not try and attempt to do these in markdown. Since Pandoc will automatically recognise LaTeX commands in markdown, we can freely mix and match LaTeX and markdown syntax. Each of the images can thus be efficiently included in a markdown document with the \input{<path>} command.

The references used in the paper will also be done with regular LaTeX commands. The bibliography file used is a single bibtex file. Within the markdown document the \cite{<label>} will be used to cite a source.

In case of the paper, paper.tex specifies the LaTeX source file, and paper.bib specifies the bibtex file.

Setting up the project

With the tools, motivation, and document structure out of the way it's time to get our hands dirty. I will describe the set up as used within my paper project however keep in mind that in no need you need to abide by this set up, and the code and folder structure could be easily adjusted to fit your personal needs. We will divide the project into three folders, a src source folder, which will contain all the files we write ourselves. a build folder generated by the script we will implement in the next section, and a target folder which will also be generated by the script. The compile commands of latex and markdown will be run on the build folder. The final outputted pdf will be copied to the target folder. Under normal circumstances we won't need to dig around in the build folder.

The source folder will contain all of our documents. In the top level we will have three files:

The tex document specifying our LaTeX document,
The bibtex document containing our bibliography
A conf.json specifying how our document is structured.

Furthermore we define three folders:

content, which contains all of our markdown files
img, which will contain all files pertaining to our images
layout, which will contain all the relevant pandoc templates

content and the conf.json file

The content in the paper our structured in a certain way. This structure needs to be explicitly defined. There are several ways for doing this, I personally prefer to save this data outside of my documents and use a conf.json file to indicate the structure of my document, however alternatives are possible. You could for example add additional metadata within the filenames and directory names of the content, and then use some sort of discovery algorithm. That being said, it would add additional complexity, which can be circumvented by just using a json file to capture the metadata.

So what's in the json file? Since the json file will be read by the script to determine the structure of the document will be, basically it contains all metadata. The top level structure contains the following entries:

main_tex_file : Specifies the main LaTeX file, which will be compiled with pdflatex.
main_bib_file : Specifies the main bibtex file, which will be compiled with bibtex.
units : Specifies the structure of the document.

units is a list describing each section within the paper. Personally I like to subdivide longer files into smaller self-contained elements. In this case, I have divided up my sections into subsections. Each subsection gets its own markdown file. Thus each unit is structure describing a section of the paper. Each section gets its own folder within content. The folder name is specified with the folder entry. The title describes the title of the section which will be used in the template. Finally the subunits describe the subsections of my paper, in the order they appear in. These are separate markdown files, which will each be read by the script. These subunits will be combined during the execution of the script to create a single markdown file per section.

Finally an abstract.md and appendix.md are placed within the content folder which will be transformed in there respective LaTeX files.

The content and conf.json file are closely linked. The conf.json file basically describes the structure of the content.

layout

The layout folder contains the Pandoc templates. A Pandoc template describes how the content of a file is translated into the different file. In our case we need three templates

For each of these the template is linked. They are rather small files, which specify where the body is placed. In the case of the section template, an additional parameter is added, the title variable. Here the title of the section will be placed.

img

The img folder contains all the files related to the images of the project. It contains two subfolders, a tex folder which contains all the tex files typesetting the images which are imported in the markdown/LaTeX files, and a raw folder which contains the actual jpeg, png, etc. images.

This is the basic document structure needed to create markdown/LaTeX pdf files. Next up, the build script that will do all the heavy lifting!

The automatisation script

In order to convert the src folder into a compiled pdf, a bunch of steps need to be taken. First the separate subunit files need to be combined to make the section documents. Then these section documents, the abstract and the appendix need to be copied to the build folder and converted to tex documents with pandoc. All the image files, the main tex file and the bibliography files also need to be copied to the build file. Then in order to get a nicely formatted pdf, first pdflatex needs to be run on the main tex file, then bibtex needs to be run, and then pdflatex needs to be run twice more (this has something to due with setting the citations and references within the text properly, though I don't know the exact details), only then we can copy the generated pdf to the target folder.

This, of course, is not something you'd want to have to do each time a small change is made to the document. So we automate it! For each of these steps we will write a small function that automates that exact step. Then all these functions are called in a single invoke task, which we can call by calling invoke build in the root folder of our project.

We start out with a tasks.py file. This file has to be called tasks.py such that invoke recognises it as a file containing invoke commands. I won't go into too much depth as much of the script should be rather self-explanatory.

Supporting functionality

local is a construct defined in plumbum. It basically allows us to obtain an executable which can be found on the path variable of our terminal, and call it, like it is a python function. Pretty sweet stuff. In order for this script to work, it does require pandoc, pdflatex, and bibtex to be on the path variable. You can test this easily by calling each of these functions within the terminal. If it throws an error that the command is not recognises, you either haven't installed it, or you need to add its directory to the path variable.

Pandoc requires a couple of flags, in order to know how the files should be converted. I wrapped these in three functions, to improve the readability.

Finally we have three helper functions which are responsible for copying files.

update_file: copies the file at the src to target if no file at target exists or it is not equal to src.
update_dir: copies all the files corresponding with the delimiter or set of delimiters, from the src folder to the target folder.
is_str_equal_to_file: compares the contents of the file and the string on equality. If no file exists it is not equal.

Compile functions

Each of the compile functions executes one of the steps defined in the beginning of this section.

build_markdown_files
update_main_tex
update_main_bib
update_templates
update_img
compile_markdown
compile_latex
copy_output

The update functions merely copy either the specified file or directories, and the copy_output file copies the created pdf to the target. The more interesting functions are the compile and build functions.

The build_markdown_files copies the abstract and appendix markdown files and then for each section construction a string containing the content of that section by appending all the subunits specified in the conf.json file. In case these have changed since last compilation, they are copied to the build directory.

The compile_markdown then compiles each of the markdown files in the md subfolder of build. For this it uses the section template for the sections and the appendix and abstract for the appendix and abstract files. The output is saved in build/content. At the end of the compilation, we have all the LaTeX files needed to construct the pdf.

The compile_latex function calls the 3 pdflatex functions and the bibtex function in the right order to construct the actual pdf. The actual code might is rather straightforward, it specifies the appropriate paths and then just calls the executable.

Task

The last thing left to do is to wrap all of the previous functions into a single task. This function is called build, and is wrapped inside a @task decorator. This will allow invoke to recognise it as a commandline function, which can then be called with invoke build. This function reads the configuration file and then calls all of the previously defined functions. And with that, we can compile the whole document with just a single command. Whoop whoop.

Conclusion and Future work

With all of that done, we have shown how to set up a markdown / LaTeX document, which can be compiled with a single command task. It is a rather straightforward process, and can be easily adapted to fit other documents, and it leads to a rather painless compilation process.

There are a lot more customisation options which could be included in the script. The configuration file could be extended with additional elements, like whether elements should be compiled as sections or chapters. The images should probably be done with markdown as well. Furthermore the external pandoc executable should probably be replaced with the appropriate pandoc bindings for the library. With each future project involving markdown and LaTeX, I will add a little bit of this functionality

That's it for now, I hope to see you again at the next article!