In this article we'll take a quick look at how to set up a markdown / LaTeX project with the help of Python and Pandoc. The goal is to have a single command that compiles a set of markdown documents into a single LaTeX pdf. As an example project I've modified my master's thesis paper to consist of markdown files.
Motivation
LaTeX is a great tool to write beautiful looking documents, however it can be a bit ehm... cumbersome, especially the syntax. Markdown on the other hand is a wonderfully lightweight markup language that makes writing a joy, at least in my opinion. Since I had to write a master's thesis, I figured I would try and get the best of both worlds: write my content in markdown, transform it into LaTeX documents, and compile those normally. For those that don't know me, I must admit I'm a sucker for convoluted solutions to problems, and I can't deny that this played a role in settling on this approach.
I'm far from the first to explore this approach; there are several blogposts on the web that, ..., and I have drawn inspiration from those.
Tools
The main tool, which most blogposts use, is Pandoc. Pandoc is a tool that can transform text files from a wide variety of syntaxes into different syntaxes. On top of that it is free software released under the GPL. The scrooge in me can definitely appreciate that.
After we have transformed the files from markdown to LaTeX, they need to be
compiled to a pdf. I personally use pdflatex for that, mostly because I'm lazy
and haven't looked into other options like XeLaTeX or LuaTeX.
We won't venture too much into the domain of LaTeX itself. I'll assume you
are already familiar with it; if not, I definitely recommend taking a look at it.
It is my go-to tool for creating documents.
Lastly we'll use Python to wrap all the different terminal commands into a single build command. This script will copy the relevant files from the source folder into a build folder, and then call the appropriate compile functions. For this we make use of several libraries. All the paths will be done with the pathlib library which comes bundled with Python 3. In order to make it possible to run the build function as a commandline task instead of a script we use Invoke. Finally we use plumbum to access the different executables we need to execute.
Document structure
Before we dive into setting up the whole project, let's first take a look at the
document structure. In the case of my paper we can discern three parts: an
abstract, a bunch of sections making up the main matter, and the appendix.
Pandoc makes use of templates to convert files from one type to another. In this
case we will define three templates: an abstract, a section and an appendix template.
These will generate the corresponding LaTeX documents, which will be included
in a single tex document that also specifies the document class and the LaTeX
packages used. This is done with the \input{<path>} command.
Each of the images used will be written in LaTeX. Images in LaTeX can
be a finicky business. It would most likely be possible to do this neatly in
markdown as well. However, given my experience with LaTeX, I figured I should
not attempt to do these in markdown. Since Pandoc will automatically
recognise LaTeX commands in markdown, we can freely mix and match LaTeX and
markdown syntax. Each of the images can thus be included in a markdown
document with the \input{<path>} command.
The references used in the paper will also be done with regular LaTeX commands.
The bibliography file used is a single bibtex file. Within the markdown documents
the \cite{<label>} command will be used to cite a source.
In case of the paper, paper.tex specifies the LaTeX source file, and paper.bib specifies the bibtex file.
Setting up the project
With the tools, motivation, and document structure out of the way it's time to
get our hands dirty. I will describe the setup as used within my paper project,
however keep in mind that you by no means need to abide by this setup; the
code and folder structure can easily be adjusted to fit your personal needs.
We will divide the project into three folders: a src (source) folder, which will
contain all the files we write ourselves, a build folder generated by the script
we will implement in the next section, and a target folder which will also be
generated by the script. The compile commands for LaTeX and markdown will be run
in the build folder. The final pdf will be copied to the target folder. Under
normal circumstances we won't need to dig around in the build folder.
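As a minimal sketch of this layout, the three folders could be expressed with pathlib as follows. The function names project_dirs and ensure_generated_dirs are my own for illustration, not necessarily what the original script uses.

```python
from pathlib import Path


def project_dirs(root: Path) -> dict[str, Path]:
    """Return the three top-level folders of the project, keyed by name."""
    return {
        "src": root / "src",        # everything we write by hand
        "build": root / "build",    # intermediate files, generated by the script
        "target": root / "target",  # the final pdf, generated by the script
    }


def ensure_generated_dirs(root: Path) -> None:
    """Create the two generated folders if they do not exist yet."""
    dirs = project_dirs(root)
    for name in ("build", "target"):
        dirs[name].mkdir(parents=True, exist_ok=True)
```

Only build and target are created here, since src is the folder we maintain ourselves.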
The source folder will contain all of our documents. In the top level we will have three files:
- The tex document specifying our LaTeX document
- The bibtex document containing our bibliography
- A conf.json specifying how our document is structured.
Furthermore we define three folders:
- content, which contains all of our markdown files
- img, which will contain all files pertaining to our images
- layout, which will contain all the relevant pandoc templates
content and the conf.json file
The content of the paper is structured in a certain way, and this structure needs to be explicitly defined. There are several ways of doing this. I personally prefer to keep this data outside of my documents and use a conf.json file to indicate the structure of my document, but alternatives are possible. You could for example add additional metadata to the filenames and directory names of the content, and then use some sort of discovery algorithm. That being said, this would add complexity, which can be circumvented by just using a json file to capture the metadata.
So what's in the json file? Since the json file will be read by the script to determine what the structure of the document will be, it basically contains all the metadata. The top level structure contains the following entries:
- main_tex_file: Specifies the main LaTeX file, which will be compiled with pdflatex.
- main_bib_file: Specifies the main bibtex file, which will be compiled with bibtex.
- units: Specifies the structure of the document.
units is a list describing each section within the paper. Personally I like
to subdivide longer files into smaller self-contained elements. In this case,
I have divided up my sections into subsections. Each subsection gets its own
markdown file. Thus each unit is a structure describing a section of the paper.
Each section gets its own folder within content. The folder name is specified
with the folder entry. The title entry describes the title of the section, which
will be used in the template. Finally the subunits entry describes the subsections
of my paper, in the order they appear in. These are separate markdown files,
which will each be read by the script. These subunits will be combined
during the execution of the script to create a single markdown file per section.
Finally an abstract.md and an appendix.md are placed within the content
folder, which will be transformed into their respective LaTeX files.
The content and conf.json file are closely linked. The conf.json file basically describes the structure of the content.
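To make the entries described above concrete, here is a sketch of what such a configuration could look like, written as a Python dictionary that serialises to conf.json. The key names come from the description above. The folder names, titles and subunit names are made-up placeholders, and validate_conf is a hypothetical helper of my own.

```python
import json

# Placeholder values: only the key names are taken from the article.
conf = {
    "main_tex_file": "paper.tex",
    "main_bib_file": "paper.bib",
    "units": [
        {
            "folder": "introduction",
            "title": "Introduction",
            "subunits": ["motivation", "outline"],
        },
        {
            "folder": "method",
            "title": "Method",
            "subunits": ["setup", "evaluation"],
        },
    ],
}


def validate_conf(conf: dict) -> None:
    """Raise ValueError if a required entry is missing."""
    for key in ("main_tex_file", "main_bib_file", "units"):
        if key not in conf:
            raise ValueError(f"conf.json is missing the '{key}' entry")
    for unit in conf["units"]:
        for key in ("folder", "title", "subunits"):
            if key not in unit:
                raise ValueError(f"a unit is missing the '{key}' entry")


if __name__ == "__main__":
    validate_conf(conf)
    print(json.dumps(conf, indent=2))  # the text that would go into conf.json
```

Checking the configuration up front is optional, but it turns a typo in the json file into a readable error instead of a failure halfway through the build.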
layout
The layout folder contains the Pandoc templates. A Pandoc template describes how
the content of a file is translated into a different file format. In our case
we need three templates: one for the abstract, one for the sections, and one for the appendix.
For each of these the template is linked. They are rather small files, which specify where the body is placed. In the case of the section template, an additional parameter is added, the title variable. Here the title of the section will be placed.
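As a rough sketch of what such templates might contain, the snippet below writes three small template files. Pandoc templates mark variables with dollar signs, so $body$ is where the converted content ends up and $title$ is the extra variable for sections. The file names and the LaTeX wrapped around the variables are my own assumptions, not the linked originals.

```python
from pathlib import Path

# Assumed template contents; the originals may wrap the body differently.
TEMPLATES = {
    "abstract.latex": "\\begin{abstract}\n$body$\n\\end{abstract}\n",
    "section.latex": "\\section{$title$}\n\n$body$\n",
    "appendix.latex": "\\appendix\n\n$body$\n",
}


def write_templates(layout_dir: Path) -> list[Path]:
    """Write the three templates into layout_dir and return their paths."""
    layout_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in TEMPLATES.items():
        path = layout_dir / name
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```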
img
The img folder contains all the files related to the images of the project.
It contains two subfolders: a tex folder, which contains all the tex files
typesetting the images which are imported in the markdown/LaTeX files, and
a raw folder, which contains the actual jpeg, png, etc. images.
This is the basic document structure needed to create markdown/LaTeX pdf files. Next up, the build script that will do all the heavy lifting!
The automatisation script
In order to convert the src folder into a compiled pdf, a bunch of steps need
to be taken. First the separate subunit files need to be combined to make the
section documents. Then these section documents, the abstract and the appendix
need to be copied to the build folder and converted to tex documents with pandoc.
All the image files, the main tex file and the bibliography file also need
to be copied to the build folder. Then in order to get a nicely formatted pdf,
first pdflatex needs to be run on the main tex file, then bibtex needs to be run,
and then pdflatex needs to be run twice more (this has something to do with setting
the citations and references within the text properly, though I don't know the exact
details). Only then can we copy the generated pdf to the target folder.
This, of course, is not something you'd want to have to do each time a small change
is made to the document. So we automate it! For each of these steps we will
write a small function that automates that exact step. Then all these functions
are called in a single invoke task, which we can run by calling invoke build
in the root folder of our project.
We start out with a tasks.py file. This file has to be called tasks.py so
that invoke recognises it as a file containing invoke tasks. I won't go into
too much depth, as much of the script should be rather self-explanatory.
Supporting functionality
local is a construct defined in plumbum. It basically allows us to obtain an
executable which can be found on the path variable of our terminal, and call
it as if it were a python function. Pretty sweet stuff. In order for this script
to work, it does require pandoc, pdflatex, and bibtex to be on the path
variable. You can test this easily by calling each of these commands within
the terminal. If it throws an error that the command is not recognised, you
either haven't installed it, or you need to add its directory to the path
variable.
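If you'd rather check this from Python than from the terminal, a small sketch using only the standard library could look like this. It relies on shutil.which rather than plumbum, and missing_executables is a name I made up for the example.

```python
import shutil

REQUIRED_EXECUTABLES = ("pandoc", "pdflatex", "bibtex")


def missing_executables(names=REQUIRED_EXECUTABLES) -> list[str]:
    """Return the names that cannot be found on the path variable."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on the path:", ", ".join(missing))
    else:
        print("All required executables are available.")
```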
Pandoc requires a couple of flags, in order to know how the files should be converted. I wrapped these in three functions, to improve the readability.
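A sketch of what such wrapper functions might look like is given below. They only build the argument lists, which could then be handed to the pandoc executable obtained through plumbum or to subprocess.run. The function names are hypothetical. The flags themselves (--from, --to, --template, --output and --variable) are regular pandoc options.

```python
from pathlib import Path


def _base_args(source: Path, output: Path, template: Path) -> list[str]:
    """Arguments shared by every markdown to LaTeX conversion."""
    return [
        str(source),
        "--from=markdown",
        "--to=latex",
        f"--template={template}",
        f"--output={output}",
    ]


def abstract_args(source: Path, output: Path, template: Path) -> list[str]:
    """Arguments for converting the abstract."""
    return _base_args(source, output, template)


def appendix_args(source: Path, output: Path, template: Path) -> list[str]:
    """Arguments for converting the appendix."""
    return _base_args(source, output, template)


def section_args(source: Path, output: Path, template: Path, title: str) -> list[str]:
    """Arguments for converting a section; the title fills the template variable."""
    return _base_args(source, output, template) + [f"--variable=title={title}"]
```

Keeping the flags in one place means a change in the conversion settings only has to be made once.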
Finally we have three helper functions which are responsible for copying files.
- update_file: copies the file at the src to target if no file at target exists or it is not equal to src.
- update_dir: copies all the files corresponding with the delimiter or set of delimiters, from the src folder to the target folder.
- is_str_equal_to_file: compares the contents of the file and the string on equality. If no file exists it is not equal.
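The three helpers could be sketched as follows. This is my reading of the descriptions above rather than the original code. In particular I assume that the "delimiters" of update_dir are glob patterns such as *.tex.

```python
import shutil
from pathlib import Path


def update_file(src: Path, target: Path) -> bool:
    """Copy src to target unless an identical target exists. Return True if copied."""
    if target.exists() and src.read_bytes() == target.read_bytes():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target)
    return True


def update_dir(src: Path, target: Path, patterns=("*",)) -> None:
    """Copy every file in src that matches one of the glob patterns (assumed)."""
    for pattern in patterns:
        for path in src.glob(pattern):
            if path.is_file():
                update_file(path, target / path.name)


def is_str_equal_to_file(text: str, path: Path) -> bool:
    """Return True only if path exists and its contents equal text."""
    return path.exists() and path.read_text(encoding="utf-8") == text
```

Skipping files that are already identical keeps modification times stable, which is handy if you ever want to skip work for unchanged files.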
Compile functions
Each of the compile functions executes one of the steps defined in the beginning of this section.
build_markdown_files
update_main_tex
update_main_bib
update_templates
update_img
compile_markdown
compile_latex
copy_output
The update functions merely copy either the specified file or directories, and
the copy_output function copies the created pdf to the target folder.
The more interesting functions are the compile and build functions.
The build_markdown_files function copies the abstract and appendix markdown files
and then for each section constructs a string containing the content of
that section by appending all the subunits specified in the conf.json file.
In case these have changed since the last compilation, they are written to the
build directory.
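The combining step for a single section could be sketched like this. I assume here that each entry in subunits is a file name without the .md extension, which may differ from the original set up. The function name is hypothetical as well.

```python
from pathlib import Path


def build_section_markdown(content_dir: Path, unit: dict) -> str:
    """Concatenate the subunit files of one section, in the configured order."""
    section_dir = content_dir / unit["folder"]
    parts = [
        (section_dir / f"{name}.md").read_text(encoding="utf-8").strip()
        for name in unit["subunits"]
    ]
    # A blank line between subunits keeps markdown blocks from running together.
    return "\n\n".join(parts) + "\n"
```

The resulting string can then be compared with the file already in the build folder, so that it is only rewritten when something actually changed.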
The compile_markdown function then compiles each of the markdown files in the md
subfolder of build. For this it uses the section template for the sections
and the appendix and abstract templates for the appendix and abstract files. The output
is saved in build/content. At the end of the compilation, we have all the
LaTeX files needed to construct the pdf.
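Choosing the template and the output path for each markdown file could be sketched as below. The function names are mine, and I assume the abstract and appendix can be recognised by their file names while everything else counts as a section.

```python
from pathlib import Path


def template_for(markdown_file: Path) -> str:
    """Pick the template name from the file name; anything else is a section."""
    stem = markdown_file.stem
    return stem if stem in ("abstract", "appendix") else "section"


def conversion_jobs(build_dir: Path) -> list[tuple[Path, str, Path]]:
    """List (source, template name, output) for every markdown file in build/md."""
    jobs = []
    for source in sorted((build_dir / "md").glob("*.md")):
        output = build_dir / "content" / f"{source.stem}.tex"
        jobs.append((source, template_for(source), output))
    return jobs
```

Each job then corresponds to one pandoc call.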
The compile_latex function calls pdflatex three times and bibtex once,
in the right order, to construct the actual pdf.
The actual code is rather straightforward: it specifies the appropriate
paths and then just calls the executables.
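The order of the calls can be sketched as follows. Note that this version uses subprocess from the standard library instead of plumbum, and that the -interaction=nonstopmode flag is my own addition so that pdflatex doesn't wait for input on errors.

```python
import subprocess
from pathlib import Path


def latex_commands(main_tex_file: str) -> list[list[str]]:
    """Return the commands in order: pdflatex, bibtex, pdflatex, pdflatex."""
    stem = Path(main_tex_file).stem
    pdflatex = ["pdflatex", "-interaction=nonstopmode", main_tex_file]
    # bibtex is given the name of the aux file written by the first pdflatex run.
    bibtex = ["bibtex", stem]
    return [list(pdflatex), bibtex, list(pdflatex), list(pdflatex)]


def compile_latex(build_dir: Path, main_tex_file: str) -> None:
    """Run every command inside build_dir, stopping at the first failure."""
    for command in latex_commands(main_tex_file):
        subprocess.run(command, cwd=build_dir, check=True)
```

Separating the list of commands from the code that runs them makes the order easy to inspect without having LaTeX installed.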
Task
The last thing left to do is to wrap all of the previous functions into a single
task. This function is called build, and is wrapped inside a @task decorator.
This will allow invoke to recognise it as a command-line task, which can
then be called with invoke build. This function reads the configuration file
and then calls all of the previously defined functions. And with that, we can
compile the whole document with just a single command. Whoop whoop.
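The core of such a task can be sketched without invoke as a plain function that reads the configuration and runs the steps in order. The name run_build and the idea of passing the steps in as a list are my own simplifications to keep the example self-contained.

```python
import json
from pathlib import Path
from typing import Callable, Iterable

# Each step receives the parsed configuration and the project root.
Step = Callable[[dict, Path], None]


def run_build(root: Path, steps: Iterable[Step]) -> dict:
    """Read src/conf.json and call every step in order with the configuration."""
    conf_path = root / "src" / "conf.json"
    conf = json.loads(conf_path.read_text(encoding="utf-8"))
    for step in steps:
        step(conf, root)
    return conf
```

In tasks.py a function decorated with invoke's @task, which receives a context object as its first argument, would then call something like run_build with the compile functions listed earlier.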
Conclusion and Future work
With all of that done, we have shown how to set up a markdown / LaTeX document, which can be compiled with a single command task. It is a rather straightforward process, and can be easily adapted to fit other documents, and it leads to a rather painless compilation process.
There are a lot more customisation options which could be included in the script. The configuration file could be extended with additional elements, like whether elements should be compiled as sections or chapters. The images should probably be done with markdown as well. Furthermore the external pandoc executable should probably be replaced with the appropriate Python bindings for pandoc. With each future project involving markdown and LaTeX, I will add a little bit of this functionality.
That's it for now, I hope to see you again at the next article!