- May 28, 2020
Introduction to Snakemake
In this tutorial, we will learn how to use Snakemake to create executable workflows.
- Jul 9, 2019
Parsing TEI XML documents with Python
In the previous blog post, we learned about GROBID, which takes PDFs as input and outputs TEI XML documents. We now gain some hands-on experience working with TEI XML documents.
- Jul 1, 2019
GROBID: Structured text from PDFs
In this post, we learn how to turn a PDF into a structured text document. To this end, we will use a tool called GROBID, which outputs a corresponding XML document for each PDF. Compared with OCR techniques, this approach is
- lightweight: computation takes a couple of seconds vs. 3-5 minutes with Tesseract, a state-of-the-art OCR framework
- easy to use: you just need to parse XML
- storage-efficient: the resulting (TEI) XML file takes just a few kB for an entire paper
- RESTful: GROBID can be run locally or remotely
I’ll conclude with a brief discussion of the TEI format, which gives a PDF’s content a (semi-)structured representation, and with an application of GROBID.
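To give a rough feel for the "you just need to parse XML" point, here is a minimal sketch using only Python's standard library. The TEI snippet is a hand-written, heavily simplified stand-in and not actual GROBID output; the helper name `extract_title_and_sections` and the element paths it queries are assumptions made for illustration, so real documents will likely need different paths.

```python
import xml.etree.ElementTree as ET

# TEI documents live in this XML namespace, so queries must be namespace-aware.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified TEI sample (an assumption, not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Paper</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
      <div><head>Methods</head><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def extract_title_and_sections(tei_xml):
    """Return the document title and the list of section headings."""
    root = ET.fromstring(tei_xml)
    # findtext returns the text of the first match, or the default if none.
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    # Collect the heading of every top-level section in the body.
    heads = [h.text for h in root.findall(".//tei:body/tei:div/tei:head", TEI_NS)]
    return title, heads


title, sections = extract_title_and_sections(SAMPLE_TEI)
print(title)
print(sections)
```

The same function should work on a TEI file read from disk by passing the file's contents as a string, provided the document uses the paths assumed here.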
- Jun 21, 2019
Welcome!
Hi there, I am Max. Thank you for taking the time to check out my new blog about code, data, and open science. In this blog post, I will give a brief overview of myself and this blog’s objective.