Not able to read file in Pypandoc

I am trying to convert a PDF to HTML using Pandoc. I have installed the pandoc binary, added its path to the environment variables, and am then using:
import pypandoc
import os
os.environ.setdefault('PYPANDOC_PANDOC', 'C://Program Files//Pandoc//pandoc.exe')
file_path = r"D:/46580375_1593783098922.pdf"
output = pypandoc.convert_file("46580375_1593783098922.pdf", to='html', outputfile= 'test.html')
It gives me this error:
RuntimeError: Invalid input format! Got "pdf" but expected one of
these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2,
gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown,
markdown_github, markdown_mmd, markdown_phpextra, markdown_strict,
mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki,
twiki, vimwiki
What am I missing?

As the error says, pandoc cannot read PDF: it is not in the list of supported input formats, so you can't convert PDF to HTML via pandoc.
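A small pre-flight check can surface this before pandoc is invoked. The sketch below is only an illustration: the format set is copied from the error message in the question, the helper name `check_input_format` is hypothetical, and it naively treats the file extension as the format name, so an extension that differs from the format name (such as .md for markdown) would need an explicit mapping.

```python
from pathlib import Path

# Input formats copied from the error message in the question.
PANDOC_INPUT_FORMATS = {
    "commonmark", "creole", "csv", "docbook", "docx", "dokuwiki", "epub",
    "fb2", "gfm", "haddock", "html", "ipynb", "jats", "jira", "json",
    "latex", "man", "markdown", "markdown_github", "markdown_mmd",
    "markdown_phpextra", "markdown_strict", "mediawiki", "muse", "native",
    "odt", "opml", "org", "rst", "t2t", "textile", "tikiwiki", "twiki",
    "vimwiki",
}


def check_input_format(path):
    """Return the format implied by the file extension, or raise ValueError
    if that format is not in the list of formats pandoc can read."""
    fmt = Path(path).suffix.lstrip(".").lower()
    if fmt not in PANDOC_INPUT_FORMATS:
        raise ValueError(f"pandoc has no reader for {fmt!r} (file: {path})")
    return fmt
```

Calling `check_input_format("46580375_1593783098922.pdf")` raises `ValueError`, which mirrors what pandoc reports, while a `.docx` or `.html` path passes.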

Related

Why is there a permission denied error while using node-tesseract-ocr?

I am using the node-tesseract-ocr library to add OCR to my Node.js project. I installed tesseract-ocr on my machine (Windows) using choco, and then node-tesseract-ocr using npm. When I request the route in question, I get the following error:
Error, cannot read input file "myActualPath": Permission denied
Error during processing.
This is the code I am using:
const config = {
  lang: 'eng',
  oem: 1,
  psm: 3,
};
tesseract
  .recognize(__dirname, `../public/data/${reciept}`, config)
  .then((text) => {
    serialResponse = text.match(new RegExp(serial, 'g'));
  })
  .catch((error) => {
    console.log(error.message);
  });

Make sure you have added the Tesseract-OCR path to your environment variables, then restart your IDE.
Note that for programs like PyCharm and many others, you also need to close and reopen the program after setting the system environment variable, as silas pointed out in a similar post.
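One way to confirm that the running process actually sees the updated PATH is to ask Python where it would find the executable. This is a minimal sketch, assuming the binary is named `tesseract`; the helper name `find_executable` is hypothetical.

```python
import shutil


def find_executable(name="tesseract"):
    """Return the full path to `name` as seen by this process, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        # The process was probably started before PATH was changed.
        print("tesseract is not on this process's PATH")
    else:
        print(f"tesseract found at {path}")
```

If this prints the "not on PATH" message from inside the IDE but finds the binary in a fresh terminal, the IDE is still running with the old environment.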
Make sure to import the necessary packages in your module
import pytesseract
import argparse
import cv2
Then construct the argument parser and parse the arguments.
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True, help="path to input image to be OCR'd")
args = vars(ap.parse_args())
Note: the first Python import in this script is pytesseract (Python Tesseract), a Python binding that ties in directly with the Tesseract OCR application running on your system. The power of pytesseract is that it lets us interface with Tesseract rather than relying on ugly os.cmd calls, as we had to before pytesseract existed.

Base64 encoded file says "GZIP", but decoding it in Python outputs corrupt HTML

I'm having trouble reading data from files I have from an old backup (Windows system).
Here is an example of what the content looks like:
GZIP
-}_HTML>
<H AD>
<META HTTP-EQUV="Conten-Type" CO!TENT="tex/html; chrset=wind&ws-1252">
It's almost proper HTML... but some characters are corrupted.
In Base64, it looks like this:
R1pJUAwAAAAKAAAALX0AAF9IVE1MPg0KPEggQUQ+DQo8TUVUQSBIVFRQLUVRVQ5WPSJDb250ZW4ZLVR5cGUiIENPIVRFTlQ9InRleBwvaHRtbDsgY2gTcnNldD13aW5kJndzLTEyNTIi
Since it says "GZIP" at the top, I tried decompressing it with gzip in Python.
import zlib
import base64
s = "R1pJUAwAAAAKAAAALX0AAF9IVE1MPg0KPEggQUQ+DQo8TUVUQSBIVFRQLUVRVQ5WPSJDb250ZW4ZLVR5cGUiIENPIVRFTlQ9InRleBwvaHRtbDsgY2gTcnNldD13aW5kJndzLTEyNTIi"
s = base64.b64decode(s.encode('Latin1'))
zlib.decompress(s, 31)
Though I'm getting the error:
zlib.error: Error -3 while decompressing data: incorrect header check
Same with this code:
import gzip
s = gzip.decompress(s)
s = str(s,'utf-8')
print(s)
gzip.BadGzipFile: Not a gzipped file (b'GZ')
Any idea how I can recover data from this file?

It is neither gzip nor any sort of compression at all, despite the word "GZIP" at the top. What you see is what the file contains.
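This can be checked from the first bytes: a real gzip stream begins with the two magic bytes 0x1f 0x8b, whereas this data begins with the ASCII letters "GZIP". A minimal sketch, using only the first 16 characters of the Base64 string from the question; the helper name `looks_like_gzip` is hypothetical.

```python
import base64

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(data: bytes) -> bool:
    """Return True if `data` starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# First 16 characters of the Base64 string from the question.
header = base64.b64decode("R1pJUAwAAAAKAAAA")
print(header[:4])               # b'GZIP' (plain ASCII text, not a magic number)
print(looks_like_gzip(header))  # False
```

This matches the `gzip.BadGzipFile: Not a gzipped file (b'GZ')` message, which reports the two bytes it found where the magic number should be.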

Download xml file from the server with Python3

I am trying to download an XML file from a public data bank:
http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml
I tried to do it with requests:
import requests
url = 'http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml'
response = requests.get(url)
response.encoding = 'utf-8'  # or response.apparent_encoding
print(response.content)
and with wget:
import wget
wget.download(url, './my.xml')
But both approaches produce garbage instead of a correct file (it looks like a broken encoding, but I cannot fix it).
If I download the file via a web browser, I get a correct UTF-8 XML file.
What am I doing wrong in the code?
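The thread records no answer. One thing worth checking, as an assumption rather than a confirmed diagnosis, is whether the response body is a ZIP archive rather than raw XML, since binary archive data printed as text looks like a broken encoding. The sketch below works on bytes already downloaded (for example `response.content`); the helper name `extract_xml_members` is hypothetical.

```python
import io
import zipfile


def extract_xml_members(payload: bytes) -> dict:
    """Return {member_name: bytes} for the .xml files inside `payload`
    if it is a ZIP archive, or an empty dict if it is not an archive."""
    buffer = io.BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        return {}
    with zipfile.ZipFile(buffer) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.lower().endswith(".xml")
        }
```

An empty result means the payload is not an archive and the cause lies elsewhere; a non-empty result gives the XML bytes, which can then be decoded as UTF-8.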

How to get WKHTMLTOPDF working on Heroku?

I created a website that generates PDFs using pdfkit, and I know how to install wkhtmltopdf and set up the environment variable path on Windows. I managed to deploy my first website on Heroku, but now I get the error No wkhtmltopdf executable found: "b''" when trying to generate a PDF.
I have no idea how to install and set up wkhtmltopdf on Heroku, because this is the first time I'm dealing with Linux.
I tried everything I could find before asking, but even following this question did not work for me:
Python 3 flask install wkhtmltopdf on heroku
If possible, please guide me step by step through installing and setting this up.
I followed all the resources I could find but couldn't make it work. Every time, I get the same error.
I'm using Django 2 and Python 3.7.
This is what I get if I run heroku stack:
Available Stacks
cedar-14
container
heroku-16
* heroku-18
This is the error I get when generating the PDF:
No wkhtmltopdf executable found: "b''"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
My website works well on localhost without any problem, so I suspect I have done something wrong when installing wkhtmltopdf.
Thank you

It's non-trivial. If you want to avoid the headache of everything below, you can just use my service, api2pdf: https://github.com/api2pdf/api2pdf.python. Otherwise, if you want to work through it, see below.
1) Add this to your requirements.txt to install a special wkhtmltopdf pack for Heroku, as well as pdfkit.
git+git://github.com/johnfraney/wkhtmltopdf-pack.git
pdfkit==0.6.1
2) I created a pdf_manager.py in my Flask app. In pdf_manager.py I have a method:
import os
import platform
import subprocess
import pdfkit

def _get_pdfkit_config():
    """wkhtmltopdf lives and functions differently depending on Windows or Linux. We
    need to support both since we develop on Windows but deploy on Heroku.
    Returns:
        A pdfkit configuration
    """
    if platform.system() == 'Windows':
        return pdfkit.configuration(wkhtmltopdf=os.environ.get('WKHTMLTOPDF_BINARY', 'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'))
    else:
        # On Linux (Heroku), ask `which` where the binary lives.
        WKHTMLTOPDF_CMD = subprocess.Popen(['which', os.environ.get('WKHTMLTOPDF_BINARY', 'wkhtmltopdf')], stdout=subprocess.PIPE).communicate()[0].strip()
        return pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_CMD)
The reason I have the platform check in there is that I develop on a Windows machine and have the local wkhtmltopdf binary on my PC. When I deploy to Heroku, however, it runs in their Linux containers, so I need to detect which platform we're on before running the binary.
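The `b''` in the question's error message is consistent with the `which` call printing nothing, that is, with the binary not being on PATH in the Heroku container. A minimal sketch, assuming a clearer failure is wanted in that case; the helper name `decode_which_output` is hypothetical and not part of pdfkit.

```python
def decode_which_output(raw: bytes) -> str:
    """Turn the raw stdout of `which` into a usable path string.

    Raises FileNotFoundError when `which` printed nothing, which is
    what happens when the binary is not on PATH.
    """
    path = raw.decode("utf-8").strip()
    if not path:
        raise FileNotFoundError(
            "wkhtmltopdf was not found on PATH; check that it is installed"
        )
    return path
```

The returned string could then be passed as `pdfkit.configuration(wkhtmltopdf=...)`, so that a missing binary fails at configuration time with an explicit message instead of at PDF generation time.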
3) Then I created two more methods: one to convert a URL to PDF and another to convert raw HTML to PDF.
def make_pdf_from_url(url, options=None):
    """Produces a pdf from a website's url.
    Args:
        url (str): A valid url
        options (dict, optional): for specifying pdf parameters like landscape
            mode and margins
    Returns:
        pdf of the website
    """
    return pdfkit.from_url(url, False, configuration=_get_pdfkit_config(), options=options)

def make_pdf_from_raw_html(html, options=None):
    """Produces a pdf from raw html.
    Args:
        html (str): Valid html
        options (dict, optional): for specifying pdf parameters like landscape
            mode and margins
    Returns:
        pdf of the supplied html
    """
    return pdfkit.from_string(html, False, configuration=_get_pdfkit_config(), options=options)
I use these methods to convert to PDF.

Just follow these steps to deploy a Django app (pdfkit) on Heroku:
Step 1: Add the following packages to the requirements.txt file
wkhtmltopdf-pack==0.12.3.0
pdfkit==0.6.0
Step 2: Add the lines below to views.py to set the path of the binary file
import os, sys, subprocess, platform
import pdfkit

if platform.system() == "Windows":
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=os.environ.get('WKHTMLTOPDF_BINARY', 'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'))
else:
    # Make the directory of the Python executable searchable before calling `which`.
    os.environ['PATH'] += os.pathsep + os.path.dirname(sys.executable)
    WKHTMLTOPDF_CMD = subprocess.Popen(['which', os.environ.get('WKHTMLTOPDF_BINARY', 'wkhtmltopdf')],
                                       stdout=subprocess.PIPE).communicate()[0].strip()
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_CMD)
Step 3: Then pass pdfkit_config as an argument, as below
pdf = pdfkit.from_string(html, False, options, configuration=pdfkit_config)

pdf2swf couldn't create a font for 'SimSun'

I was trying to convert a PDF into an SWF using swftools. To support Chinese, I downloaded xpdf-chinese-simplified.tar and modified the add-to-xpdfrc file like this:
#----- begin Chinese Simplified support package (2011-sep-02)
cidToUnicode Adobe-GB1 /usr/local/share/xpdf-chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN /usr/local/share/xpdf-chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN /usr/local/share/xpdf-chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK /usr/local/share/xpdf-chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 /usr/local/share/xpdf-chinese-simplified/CMap
toUnicodeDir /usr/local/share/xpdf-chinese-simplified/CMap
#fontFileCC Adobe-GB1 /usr/..../gkai00mp.ttf
displayCIDFontTT Adobe-GB1 /usr/local/share/xpdf-chinese-simplified/gkai00mp.ttf
#----- end Chinese Simplified support package
When I tried to convert the PDF,
/usr/local/bin/pdf2swf 10434_102_demo_1414995035745.pdf -o test.swf -s languagedir=/usr/local/share/xpdf-chinese-simplified
an error occurred:
Error: Couldn't create a font for 'SimSun'
PS. I have two environments, one Mac and the other Red Hat. Everything works on the Mac; this error only occurs on Red Hat.

As this question is missing an answer, I can add an optional solution (as a workaround) here.
For Flexpaper, if you use the "HTML5" render mode, it will render Chinese and other Unicode text correctly.
