Save Asciidoctor AST as an AsciiDoc text document - asciidoctor

Working with AsciiDoc programmatically (I'm using AsciidoctorJ), is there a straightforward way to get the AsciiDoc text back from the AST DOM?
I can get the pre-processed AsciiDoc stream from the pre-processor, but if I want to make any changes to the AST as it's being loaded, I don't see any way to render a Document back into the AsciiDoc form.
I suppose it's possible to implement a Converter, or simply traverse the DOM tree and write out its contents as AsciiDoc text myself, but that's a serious undertaking, and there are a lot of nooks and crannies that I'm bound to miss.
Considering that the Asciidoctor code contains the information that lets it determine how to convert the text into the tree, I was wondering if there is a straightforward way to simply reverse that.

The Asciidoctor parser does not currently store enough information to reproduce the original source document. For more information, see: https://github.com/asciidoctor/asciidoctor/issues/3312
Depending on what you want to achieve, the best option is probably to use a Preprocessor extension to process the raw AsciiDoc before Asciidoctor parses it: https://docs.asciidoctor.org/asciidoctorj/latest/extensions/preprocessor/

Related

Markdown tabs in GitLab Markdown

I am writing documentation that has steps for Windows, Mac, and Linux.
I want it to look like this HTML5 tabbed example.
There is support for HTML in GitLab Markdown.
There is a reference to a sanitization class that validates the inline HTML in GitLab Markdown.
My question is:
How can I achieve tabbed documentation? Is there a workaround for displaying CSS correctly in Markdown?
How can I make this work?
Simply insert the relevant HTML/CSS/JS into your Markdown document.
As Markdown's Syntax Rules state (emphasis in original):
Markdown's syntax is intended for one purpose: to be used as a format
for writing for the web.
Markdown is not a replacement for HTML, or even close to it. Its
syntax is very small, corresponding only to a very small subset of
HTML tags. The idea is not to create a syntax that makes it easier
to insert HTML tags. In my opinion, HTML tags are already easy to
insert. The idea for Markdown is to make it easy to read, write, and
edit prose. HTML is a publishing format; Markdown is a writing
format. Thus, Markdown's formatting syntax only addresses issues that
can be conveyed in plain text.
For any markup that is not covered by Markdown's syntax, you simply
use HTML itself. There's no need to preface it or delimit it to
indicate that you're switching from Markdown to HTML; you just use
the tags.
The only restrictions are that block-level HTML elements -- e.g. <div>,
<table>, <pre>, <p>, etc. -- must be separated from surrounding
content by blank lines, and the start and end tags of the block should
not be indented with tabs or spaces. Markdown is smart enough not
to add extra (unwanted) <p> tags around HTML block-level tags.
However, there is one down side to this:
Note that Markdown formatting syntax is not processed within block-level
HTML tags. E.g., you can't use Markdown-style *emphasis* inside an
HTML block.
Finally, there is the concern that you appear to be looking to have this document hosted on a third-party site (perhaps in a readme on GitLab). Most third-party sites that process and host Markdown documents (including GitLab) run the output through an HTML sanitizer for security reasons (to avoid XSS attacks, etc.). Therefore, you are likely to find that various required hooks in your HTML will be stripped out and it won't work. Of course, this won't be a problem on your own site, where you have total control.
The solution was tried in a readme.md file written in the text editor of Microsoft VS Code and committed to GitLab. The attached picture shows the rendering. It was not as expected. Perhaps the functionality to have tabs will be available soon.
An alternative is "collapsible sections" in GitLab Flavored Markdown, which is described in the GitLab documentation.
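As a rough sketch of the collapsible-sections alternative, the following Python helper generates one `<details>` block per operating system. The function name `collapsible_sections` and the sample steps are made up for illustration. It assumes that a blank line after `</summary>` and before `</details>` is needed for the Markdown inside to render.

```python
from typing import Dict, List


def collapsible_sections(steps_by_os: Dict[str, List[str]]) -> str:
    """Build one collapsible <details> section per operating system."""
    blocks = []
    for os_name, steps in steps_by_os.items():
        # Render the steps as a numbered Markdown list.
        numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
        # Blank lines separate the HTML tags from the Markdown content.
        blocks.append(
            f"<details>\n<summary>{os_name}</summary>\n\n{numbered}\n\n</details>"
        )
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(collapsible_sections({
        "Windows": ["Run setup.exe"],
        "Linux": ["Run ./install.sh", "Reboot"],
    }))
```

The output can be pasted into the Markdown file as-is. Unlike tabs, it needs no CSS or JavaScript, so there is nothing for the sanitizer to strip.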

Is there a way to collaborate with non-markdown users when developing DOCX documents with pandoc markdown?

Say I use markdown to write a memo, and convert it with pandoc to a DOCX, which my non-technically-inclined collaborator uses, and say the collaborator changes a few things while tracking changes.
Now I want to accept some changes and reject some others, then get the new version back into markdown to work on the next draft. But converting docx -> markdown with pandoc tends to be lossy--viewed as functions, the functions are not inverses; ToMarkdown(ToWord(md_file)) != md_file.
With this limitation, the pandoc/markdown workflow is basically a dead-end after draft 1. It's great to use vim and plaintext instead of Word for the first draft, but if there are a significant number of changes, then it's often just as much work to recover and verify them and correct unintentional losses in v2 of a markdown file from the collaborator's DOCX as it is just to put up with MS Word from the get-go.
Does anyone have a workaround for this situation that gets them to "v2" or higher using markdown / plaintext with minimal manual work in Word?
There is a long discussion on pandoc-discuss about this issue. The short answer is no: there is no support for converting to docx and back to md without losses.
That said, @mb21 mentions the --track-changes flag, which gives you a little more control; however, you would still have to incorporate the changes manually.
The solution for your problem is either:
1. Convince collaborators to use md instead of Word, or
2. Start using Authorea, which uses pandoc in the background to generate the documents. You can even make it sync with GitHub while your collaborator uses the web version.
I have been experimenting with option 2, but it is super hard to convince most collaborators to move to an online interface.
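For anyone who wants to try the --track-changes route mentioned above, here is a minimal Python sketch. It only builds the pandoc command line and diffs two Markdown strings; the function names and file names are hypothetical, and it assumes pandoc is installed and on the PATH when the command is actually run.

```python
import difflib
from typing import List


def docx_to_markdown_cmd(docx_path: str, md_path: str,
                         track_changes: str = "all") -> List[str]:
    """Build a pandoc command that converts a DOCX file back to Markdown.

    track_changes is passed to pandoc's --track-changes option
    (accept, reject, or all).
    """
    if track_changes not in ("accept", "reject", "all"):
        raise ValueError("track_changes must be 'accept', 'reject', or 'all'")
    return [
        "pandoc", docx_path,
        "--from", "docx",
        "--to", "markdown",
        f"--track-changes={track_changes}",
        "--output", md_path,
    ]


def roundtrip_diff(original_md: str, roundtripped_md: str) -> List[str]:
    """Return a unified diff showing what changed in the round trip."""
    return list(difflib.unified_diff(
        original_md.splitlines(),
        roundtripped_md.splitlines(),
        fromfile="draft1.md",
        tofile="roundtrip.md",
        lineterm="",
    ))
```

The command can be executed with `subprocess.run(cmd, check=True)`. Diffing the result against the first draft at least makes the unintentional losses visible, even though merging them back remains manual work.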

Generate HTML from MathJax code in a batch script

I know how to use MathJax to convert TeX commands in a web page to mathematical formulae. The MathJax scripts would search the page for TeX commands and convert them inline to HTML statements.
Is there a way to do this as a form of pre-processing? In other words, I have some text or HTML files on my hard disk that contain raw TeX commands. I'd like to use MathJax to convert them to HTML, so that they can be viewed without having the MathJax scripts.
The reason I need this is that these pages are very long and contain many, many TeX statements. MathJax is fast, but it's not fast enough for such huge pages, so I need to preprocess them.
Thanks for any hints.
MathJax-node provides APIs for using MathJax in Node.js, which enables this kind of preprocessing. There are examples in the repository for handling HTML fragments.
The SVG output can be used this way, but the HTML-CSS output cannot because it is very client-dependent.
However, the new CommonHTML output, which has been completed in MathJax v2.6 (currently in beta), will be usable this way. It will be integrated into mathjax-node once v2.6 is out of beta.

Document format for writing homeworks in Vim

I'm a college student majoring in CS, and that means I spend a lot of time poking around in vim. I'm still a complete noob, but I love editing text in the terminal--it's more fun than writing documents has any right to be.
However, I'm curious if there's a basic, low-frills document format I can use (from within vim) to typeset my homework assignments. I'm familiar enough with LaTeX, and if it were possible I'd use it for everything, but it has two main disadvantages:
It takes a long time to write an entire LaTeX document, and
LaTeX doesn't handle code very well.
With that in mind, I'd like to know if some format exists which addresses both these needs and is still easy to hash out quickly from a terminal-based text editor. I use vim for literally everything else I write, so the need to keep LibreOffice Writer around just for homeworks seems a bit overbearing to me.
Thanks!
I would tend towards something light like Markdown, but the needed capabilities depend on what requirements you have for the output (formatting and styling).
I find the AsciiDoc project quite interesting. From their website:
AsciiDoc is a text document format for writing notes, documentation,
articles, books, ebooks, slideshows, web pages, blogs and UNIX man
pages. AsciiDoc files can be translated to many formats including
HTML, PDF, EPUB, man page. AsciiDoc is highly configurable: both the
AsciiDoc source file syntax and the backend output markups (which can
be almost any type of SGML/XML markup) can be customized and extended
by the user.
It even comes with a Vim syntax.

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a PDF with lots of Roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the PDF, I would like to add an additional layer, much as one does with hOCR, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding an hOCR layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough), and then add it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the pdf results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The pdf spec. doesn't really support my objective, and so I don't imagine I can do better than this hackish solution.
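The normalization step itself can be sketched with Python's standard library alone. This is a minimal example, not the exact script used above; the function name `strip_diacritics` is hypothetical. It assumes every diacritic is a combining mark after Unicode decomposition, which holds for characters such as ṣ, ś, ṝ and ǎ but not for letters like ø or đ, which have no decomposition.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining diacritical marks, keeping the base letters."""
    # NFD splits each precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character that has a non-zero combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("ṣ ś ṝ ǎ"))  # prints "s s r a"
```

Since the base letters keep their order and count, the normalized string can be written at the same coordinates as the original text run.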
I have written something similar to add searchable text by OCR'ing images and converting it to PDF in C#. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library. If your solution works well for you, then there is no use buying a commercial engine. The nice thing, though, is that it only requires one SDK, so it would be worth a look if you had more than a few PDFs to convert.
