TIL that creating Markdown from HTML is hard

Today I wanted to take the text of a Confluence page and convert it to Markdown.

Getting the text was rather easy (after some duckduckgoing): Confluence provides a REST API that returns, among other things, the content of pages as JSON. Here’s the curl command to do this:

curl "https://$CONFLUENCE/rest/api/content/343109903?expand=body.view" | jq .body.view.value > text.html

This command fetches the page with the ID 343109903. jq then extracts the JSON key body.view.value (which holds the rendered documentation I want), and the result is written to text.html.
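For comparison, here is a minimal Python sketch of the same request and extraction, using only the standard library. The function names (`content_url`, `extract_view_html`, `fetch_view_html`), the host `wiki.example.com` and the sample response are made up for illustration, and authentication is not handled at all. One detail worth noting: `json.loads` decodes the string value, so escaped `\n` sequences in the JSON become real line breaks.

```python
import json
import urllib.parse
import urllib.request


def content_url(base_url: str, page_id: str) -> str:
    """Build the same REST URL that the curl command requests."""
    query = urllib.parse.urlencode({"expand": "body.view"})
    return f"{base_url}/rest/api/content/{page_id}?{query}"


def extract_view_html(response_text: str) -> str:
    """Return the decoded HTML stored under body.view.value."""
    data = json.loads(response_text)
    return data["body"]["view"]["value"]


def fetch_view_html(base_url: str, page_id: str) -> str:
    """Fetch a page and return its rendered HTML (no authentication handled)."""
    with urllib.request.urlopen(content_url(base_url, page_id)) as response:
        return extract_view_html(response.read().decode("utf-8"))


# Made-up response, trimmed to the keys used here.
sample = '{"id": "343109903", "body": {"view": {"value": "<p>Hi</p>\\n<pre>a\\nb</pre>"}}}'
html = extract_view_html(sample)
```

Calling `fetch_view_html("https://wiki.example.com", "343109903")` would perform the actual request. The sample at the bottom only exercises the parsing step, so it runs offline.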

Opening the HTML file shows the expected contents, though without images or styling. There are also some surprising newline characters before and after code blocks, and the code inside the code blocks has no line breaks.

I then converted the HTML file with the help of the Python package html2text:

html2text text.html > text.md
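The same conversion can be sketched with html2text's Python API. The helper names `load_html` and `to_markdown` are hypothetical. `load_html` rests on an assumption I have not verified against this page: jq without `-r` prints a string as a JSON literal, with surrounding quotes and `\n` escapes, so the file written by the curl command may need decoding before it is real HTML. Whether that accounts for the broken output is a guess.

```python
import json


def load_html(raw: str) -> str:
    """Return HTML from file content that may be a JSON string literal."""
    stripped = raw.strip()
    if stripped.startswith('"') and stripped.endswith('"'):
        try:
            decoded = json.loads(stripped)
        except json.JSONDecodeError:
            return raw  # not valid JSON after all, keep the input unchanged
        if isinstance(decoded, str):
            return decoded
    return raw


def to_markdown(html: str) -> str:
    """Convert HTML to Markdown with the third-party html2text package."""
    import html2text  # imported lazily; install with: pip install html2text

    converter = html2text.HTML2Text()
    converter.body_width = 0  # 0 disables hard-wrapping of long lines
    return converter.handle(html)
```

A call would look like `to_markdown(load_html(open("text.html", encoding="utf-8").read()))`. Even with clean input, complex tables and images that need a login may still not survive the conversion.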

The result was… just like the HTML file. Everything that was broken in the HTML was broken in the Markdown, too: the code blocks, the newlines and the images. On top of that, all but the simplest tables were completely broken.

Repairing this was not worth the hassle, so I decided I could live without Markdown documentation in my code.
