Pentaho 7.1 PDF wrong diacritc - linux

We have installed new LINUX server for our Pentaho installation, but I am having problem with diacritics in PDF generated files.
For web (HTML), I have set encoding to utf-8 which is working perfectly.
BUT for PDF encoding utf-8 is not working. I have "fixed" it on old server by setting up CP-1250 encoding, but I don't want to use old standard anymore. So I have been trying to fix it.
I have set option in pentaho-server/tomcat/webapps/pentaho/WEB-INF/classes/classic-engine.properties to
org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.Encoding=UTF-8
but PDF versions of report are still ignoring letters with diacritics..
Soo, my thought is, that there must be some PDF encoding setting above this, perhaps some global PDF generator setting or perhaps Java or Linux itself?
Is anyone able to give me a hint where should I look, what to check?

Related

ImageMagick issue on AppEngine Standard (PDFs and NodeJS)

I am using App Engine Standard. Since ImageMagick is available on it, I tried a few PDF manipulation libraries and basically, what I would like to do, is simply converting a PDF into an image.
The issue I am getting is this:
'convert-im6.q16: not authorized /tmp/ygM1sF-Txq00JkGbpal8YWBQ.pdf\'
# error/constitute.c/ReadImage/412.\nconvert-im6.q16: no images
defined/tmp/ygM1sF-Txq00JkGbpal8YWBQ-0.png\' #
error/convert.c/ConvertImageCommand/3258.\n' }
After some research, I found out that post here: Fix for ImageMagick convert errors with pdf files. Here is what he says:
PDF files on Linux systems are usually handled by ghostscript (via the
terminal command gs). And, ImageMagick (done through the terminal
convert command) uses ghostscript for reading and writing PDF files.
Because the security problems are serious and numerous, ImageMagick’s
access to PDF files is then cut off.
Granted, through these security flaws in PDF someone could craft a
malicious image file that, when converted by ImageMagick into a PDF,
will then do very nasty things to your computer.
But, ghostscript has since been updated once and once again with
security fixes. How about a fix for ImageMagick to get PDF
functionality back? Or, at least an explanation of progress towards
fixing this issue?
I can't change the ImageMagick configuration on App Engine Standard, but I wonder if there is something else I can do. Or maybe the engineers at Google would be able to update ImageMagick instead and remove that limitation?
I really need to convert PDF into images, so I wonder if it worth waiting, or if I need to find another solution.
Thanks for your ideas.

Javascript export CSV encoding utf-8 and using excel to open issue

I have been reading quite some posts including this one
Javascript export CSV encoding utf-8 issue
I know lots mentioned it's because of microsoft excel that using something like this should work
https://superuser.com/questions/280603/how-to-set-character-encoding-when-opening-excel
I have tried on ubuntu (which didn't even have any issue), on windows10, which I have to use the second posts to import, on mac which has the biggest problem because mac does not import, does not read the unicode at all.
Is there anyway I can do it in coding while exporting to enforce excel to open with utf-8? or some other workaround I might be able to try?
Thanks in advance for any help and suggestions.
Many Windows applications, including Excel, assume the localized ANSI encoding (Windows-1252 on US Windows) when opening a file, unless the file starts with byte-order-mark (BOM) code point. While UTF-8 doesn't need a BOM, a UTF-8-encoded BOM at the start of a file clues Excel that the file is UTF-8. The byte sequence is EF BB BF and the equivalent Unicode code point is U+FEFF.

How to disable encoding in a text-editor?

This is such a basic question I am surprised I could not easily find an answer to it:
I use Notepad++ to write my scripts in. Someone sent me some code for a shell script (.sh) that I could modify to suit my needs. I simply changed a small bit of text using Notepad++ (on Windows) and used FileZilla (SFTP) to upload it to my server (Debian Linux).
There were a few problems with this that it took my server admin an hour to find, namely:
FileZilla, for whatever reason, defaults to ASCII rather than binary! (changed it to binary and removed the .sh association with ASCII)
The permissions were wrong, chmod took care of this
Problem is it STILL did not work. To fix it my server admin simply copied the text right on the server (using vim or nano) into a new shell script file and saved that. Before he kept saying the problem was Windows (which he loves to hate on) but it seems it is the encoding that text-editors are using that is corrupting the files.
He said my text-editor encoding needs to be said to "None". However, that is not an option - only ANSI, UTF and UTS variants are options!
How can I create a shell script on Windows with no encoding whatsoever so that it doesn't get corrupted?
I need to be able to simply transfer the file to the server, I can't mess around with modifying it once on the server which is wholly impractical.
To fix it EndOfLine and encoding on Notepad++ :
On the bottom right of Notepad++ you can right click on the left of the encoding "UTF-8" and click on Convert UNIX(LF) format. Be sure to change encoding to UTF-8 if it is not the case.
In Filezilla :
Transfert mode : auto

GhostScript - ImageMagick converts pdf to image to odd letters when converting Microsoft Print to PDF files

NOTICE: Watch updates at bottom.
I am building an API which supposed to convert PDF to base64 images (doesn't matter which type - jpg, jpeg, png..).
The API is built with NodeJS on CentOS 7.5 x64.
I have searched all over the web for npm packages which converts pdf to images, the very most of them uses ImageMagick and GhostScript (The others doesn't seem to work). These packages work well on code but the problem starts when GhostScript does it job.
For example, a simple pdf page with text will look like this after conversion:
This is the output in shell:
**** Warning: can't process font stream, loading font by the name.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Microsoft: Print To PDF <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
I have tried to convert the images with shell commands ended up with the same outputs.
Thanks by advance.
UPDATE:
Converting a sample pdf file which probably was not printed to pdf by Microsoft worked fine, maybe this is the problem?
UPDATE 2:
After converting a few more pdfs it turns out that this is Microsoft Print to PDF files only that making this problem.
This was reported as a bug to the Ghostscript Bugzilla here
As can be seen from the thread, this is due to using an old version of Ghostscript, and has been fixed at some point in the past. So the problem is due to using old (in this case more than 5 years old) software.

Issue in pdf to text depending on PDF file's version

I'm actually working on ubuntu as I'm trying to parse pdf files to extract text from them, which I managed to get working (using tesseract for example), BUT as I get a 1.7 pdf file version, conversion doesn't work (I get a blank page in my 'name.txt' file).
So I was wondering if anyone knew about some magic that can solve my problem regarding this pdf version issue...
I looked pretty much everywhere I could on the web, without seeing similar issues, therefore I came to y'all.
Hope you'll find a way to help me, cause google hasn't been such a friend so far...

Resources