I've recently begun coding for my degree, and for a project I am submitting my work as a PDF created in Jupyter so that my code can be seen. It all works within Jupyter, but when I export to PDF the image that I have embedded in Markdown doesn't load. All that loads in Microsoft Edge is a small black box with a white cross in it, and in Chrome there is a small image of mountains in two pieces. I am not sure where I'm going wrong. My image is written in like this:
<img src="files/masterbiaspic.png" />
And I don't know how to fix it.
I really don't have a wide knowledge of code so please be simple with your answers.
Kind regards and happy new year,
E
You appear to be using raw HTML to insert your images into your document. What you may not know is that most Markdown parsers do not look at the contents of raw HTML; they simply pass it through unaltered. However, raw HTML is not understood by the PDF file format, and when converting to PDF there is no clean way to handle raw HTML without also parsing the HTML (which is beyond the scope of Markdown parsers). Therefore, if you want to output to PDF, you should use only pure Markdown (without any raw HTML). That way the parser can easily convert everything to a proper format for PDF output.
As it turns out, Markdown includes its own syntax for images (see the documentation for details). Try this:
![alt text](files/masterbiaspic.png)
By doing that, Jupyter Notebook will know about the image and should import it into the PDF properly.
It could be that the above will not resolve the problem. It depends on which method is used to convert to PDF. Some tools may take the HTML output of Markdown and convert that to PDF, which would mean you have a different problem entirely.
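If a notebook contains many such tags, the swap could be scripted rather than done by hand. The following is only a sketch in Python: the function name img_tags_to_markdown is made up for illustration, and the regular expression only copes with simple tags that have a double-quoted src attribute, not with HTML in general.

```python
import re

# Matches a simple <img ...> tag and captures the value of its src attribute.
IMG_TAG = re.compile(r'<img\b[^>]*?\bsrc="([^"]+)"[^>]*?/?>', re.IGNORECASE)
# Finds an optional alt attribute inside a matched tag.
ALT_ATTR = re.compile(r'\balt="([^"]*)"', re.IGNORECASE)


def img_tags_to_markdown(text):
    """Replace simple <img src="..."> tags with Markdown image syntax."""

    def replace(match):
        alt = ALT_ATTR.search(match.group(0))
        alt_text = alt.group(1) if alt else ""
        return f"![{alt_text}]({match.group(1)})"

    return IMG_TAG.sub(replace, text)
```

Passing the tag from the question through this function gives `![](files/masterbiaspic.png)`, which can then be pasted into the Markdown cell.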
Related
I am working on a tutorial website using MERN, where I will be displaying tutorials on pretty much anything I know well, and new things that I learn. The backend and frontend work fine. I just don't know what to do about the tutorial's body in the submission form. The kind of tutorials that I want to add will have a combination of images, specially highlighted text (e.g. code examples), and plain text. So, I was thinking maybe I can upload all that as a Word document and have it parsed before it is saved in the database. Is this the way to go? Are there useful libraries that can make this easier to handle?
I suspect you'll like markdown.
Markdown is a simple markup language where you enter plain old text but can render the text with styles. There are many libraries (especially for React and Node) that will convert your Markdown to HTML, and it can be easily extended to style code snippets, images, and even React components.
I have a selection of PDFs that I want to text mine. I use Tika to parse the text out of each PDF and save it to a .txt file with UTF-8 encoding (I'm using Windows).
Most of the PDFs were OCR'd before I got them, but when I view the extracted text I see "pnÁnn¿¡c" where the PDF itself shows "Phádraig".
Is it possible for me to verify the text layer of the PDF (forgive me if that's the incorrect term), ideally without needing the full version of Acrobat?
It sounds like you are dealing with scanned books with "hidden OCR", i.e. the PDF shows an image of the original document, behind which there is a layer of OCRed text.
That allows you to use the search function and to copy-paste text out of the document.
When you highlight the text, the hidden characters become visible (though this behaviour may depend on the viewer you use).
To be sure, you can copy-paste the highlighted text to a text editor.
This will allow you to tell if you are actually dealing with OCR quality this terrible, or if your extraction process caused mojibake.
Since OCR quality heavily depends on language resources (dictionaries, language model), I wouldn't be surprised if the output was actually that bad for a low-resource language like Gaelic (Old Irish?).
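One way to test the mojibake hypothesis on the extracted text is to try reversing the most common mix-up, UTF-8 bytes that were read as a Windows code page. This is a minimal sketch, assuming cp1252 was the wrong encoding; the function name undo_mojibake is made up for illustration.

```python
def undo_mojibake(text, wrong_encoding="cp1252"):
    """Return the repaired string if text looks like UTF-8 that was decoded
    with wrong_encoding; return None if it does not fit that pattern."""
    try:
        repaired = text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The characters cannot be mapped back to valid UTF-8 bytes,
        # so this particular mix-up is not what happened.
        return None
    # Pure ASCII text round-trips unchanged; that is not mojibake either.
    return repaired if repaired != text else None
```

Genuine mojibake such as "PhÃ¡draig" comes back as "Phádraig". The string from the question, "pnÁnn¿¡c", comes back as None, which at least rules out this particular encoding mix-up and points towards the OCR layer itself being poor.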
I've noticed that when I use OCR to transform a scanned PDF document into text, in this case with Adobe Acrobat Pro, I'm getting very different outputs depending on how I extract the data.
In the above photo you can see a piece of a PDF that has been OCR'ed into fairly good quality text. If I select it in Adobe and copy it to, say, a Word or txt doc, it pastes over perfectly fine.
However, if I export it using Adobe to Rich Text Format, or use Python's PDFminer or Apache Tika from Python, then I get the result in the above photo, which as you can see is completely jumbled. The extraction results are very consistent between the approaches: basically all 3 jumble it in the exact same way.
Would any of you have any idea as to why an OCR'd PDF can be copied just fine to a text editor but is extracting in such a bizarre way?
Thank you!
Regards,
Mano
So what ended up working for me was running the initial parsing with Apache Tika and then passing the few files it didn't work on through PyPDF2. My theory is that PyPDF2 uses a different parsing mechanism that, unlike Tika, doesn't rely on the root of the PDF, and that the root is what seems to have become corrupted in a few of these OCR'd docs.
Not sure of the initial cause but that was my solution.
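Roughly, the two-pass approach might look like the sketch below. It assumes the third-party tika package and PyPDF2 2.x or later are installed. The looks_usable check is a hypothetical placeholder for however you decide that an extraction "didn't work" (I did that by eye), and the function names are made up for illustration.

```python
def extract_with_tika(path):
    # Needs the third-party "tika" package (and Java); imported lazily.
    from tika import parser
    return parser.from_file(path).get("content") or ""


def extract_with_pypdf2(path):
    # Needs PyPDF2 2.x or later; imported lazily.
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def looks_usable(text, min_letter_ratio=0.6):
    # Hypothetical placeholder check: non-empty and mostly letters/whitespace.
    # It will not catch text whose words are merely in the wrong order.
    if not text or not text.strip():
        return False
    ok = sum(ch.isalpha() or ch.isspace() for ch in text)
    return ok / len(text) >= min_letter_ratio


def extract_text(path, extractors=(extract_with_tika, extract_with_pypdf2),
                 is_usable=looks_usable):
    """Try each extractor in order and return the first usable result.
    If none is usable, return the last text obtained (possibly empty)."""
    text = ""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # this extractor failed outright; try the next one
        if is_usable(text):
            return text
    return text
```

Calling extract_text("some_file.pdf") tries Tika first and only falls back to PyPDF2 when the Tika output fails the check.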
I know how to use MathJax to convert TeX commands in a web page to mathematical formulae. The MathJax scripts would search the page for TeX commands and convert them inline to HTML statements.
Is there a way to do this as a form of pre-processing? In other words, I have some text or HTML files on my hard disk that contain raw TeX commands. I'd like to use MathJax to convert them to HTML, so that they can be viewed without needing the MathJax scripts.
The reason I need this is that these pages are very long and contain many, many TeX statements. MathJax is fast, but it's not fast enough for such huge pages, so I need to preprocess them.
Thanks for any hints.
MathJax-node provides APIs for using MathJax in Node.js, thus enabling this kind of preprocessing. There are examples in the repository for handling HTML fragments.
The SVG output can be used this way but the HTML-CSS output cannot because it is very client dependent.
However, the new CommonHTML output -- which has been completed in MathJax v2.6, currently in beta -- will be usable this way. It will be integrated into mathjax-node once v2.6 is out of beta.
How can I convert a Google Doc, which contains images and tables, into a Markdown file which can be published as a post using Jekyll?
Is it possible to first export the Google Doc to a PDF and then convert the PDF to Markdown? What will happen to the images and tables in that case?
May 2018 Update
The script originally suggested in this answer appears to no longer work and has not been updated for 5 years.
An alternative solution (which is based on the old script) can be found at https://github.com/evbacher/gd2md-html
I tried it out, it works pretty well.
Previous Answer
You can use a Google Script to do the conversion for you!
This one will let you convert to .md, and it will email you the converted file. I've tested it and it works fine. It works with basic tables, and if you have images in the doc, it will attach them to the email.
Instructions for installing are on the same link, in the GitHub description, but I've pasted them here for ease of access:
Add the script:
Open your Google Drive document (http://drive.google.com)
Tools -> Script Manager > New
Select "Blank Project", then paste this code in and save.
Clear the myFunction() default empty function and paste the contents of converttomarkdown.gapps into the code editor
File -> Save
Run the script:
Tools > Script Manager
Select "ConvertToMarkdown" function.
Click Run button (First run will require you to authorize it. Authorize and run again)
Converted doc with images attached will be emailed to you. Subject will be "[MARKDOWN_MAKER]...".
Good luck!
You can export as HTML. Jekyll can serve static HTML files.
Btw, "standard" Markdown doesn't have tables. There are implementations that have them, but I'm afraid you'll have to convert the tables by hand to the right format, which will be implementation dependent. I don't know about Jekyll; maybe it's easiest to just use HTML tables within the Markdown text.
You could create a new theme based on the HTML export. The export should contain the stylesheet embedded in a <style> tag within the HTML document. It's not really easy to create new themes, but doable. Or, if you just want the content and don't mind using whatever Jekyll theme you already have, then you can cut out the stylesheet part and keep the html only.
Another option would be to change how files are delimited in Excel on your computer. This guide can help you do that (http://www.howtogeek.com/howto/21456/export-or-save-excel-files-with-pipe-or-other-delimiters-instead-of-commas/)
Then every time you copy and paste from Excel to a Markdown file for Jekyll, you automatically have the pipes. All you will need to do is add a row of dashes to separate your top line from the rest of the table.
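If you would rather not change the Excel settings, the same result can be had with a few lines of Python. This is only a sketch: it assumes the cells arrive as tab-separated text (which is what Excel normally puts on the clipboard), that the first row is the header, and that no cell contains a pipe character. The function name tsv_to_markdown_table is made up for illustration.

```python
def tsv_to_markdown_table(tsv_text):
    """Turn tab-separated rows into a pipe-delimited Markdown table,
    treating the first row as the header."""
    rows = [line.split("\t") for line in tsv_text.strip("\n").splitlines()]
    header, body = rows[0], rows[1:]

    def fmt(cells):
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(header), fmt(["---"] * len(header))]  # dashes under the top line
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Pipe tables like this are an extension rather than part of "standard" Markdown, so check that the Markdown processor your Jekyll site uses supports them.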
Google Docs -> docx to Markdown -> md
I looked far and wide myself, and I believe the best way to do this is to download the Google Doc as a .docx file and convert that with Pandoc.
It works on all platforms (check their incredible website). What you are looking for is the following command in cmd or PowerShell (Windows):
pandoc input_filename.docx -s -o output.md
Pro Tip:
Pandoc comes with a little trick to save all of the images in your document to a folder of your choice and then add image tags in the Markdown at the correct places, using relative references to those images. The amazing line of code is:
pandoc --extract-media ./your_custom_folder input_filename.docx -o output_filename.md
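If you have a whole folder of exported .docx files, the same command can be run from a short Python script. This is a minimal sketch that assumes pandoc is installed and on your PATH; the function names and the default folder name "media" are made up for illustration.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path, media_dir="media"):
    """Build the pandoc argument list for one .docx file; the output
    file sits next to the input with a .md extension."""
    docx_path = Path(docx_path)
    output_path = docx_path.with_suffix(".md")
    return ["pandoc", f"--extract-media={media_dir}",
            str(docx_path), "-o", str(output_path)]


def convert_folder(folder, media_dir="media"):
    """Run pandoc on every .docx file directly inside folder."""
    for docx_path in sorted(Path(folder).glob("*.docx")):
        # check=True raises an error if pandoc exits with a failure code
        subprocess.run(pandoc_command(docx_path, media_dir), check=True)
```

Calling convert_folder("posts") would then convert every .docx file in a folder called posts.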