CLI - soffice: convert CSV to PDF with semicolon as delimiter - excel

I want to convert a CSV file to a PDF file from the command line using the soffice command, but my CSV file is semicolon-separated instead of comma-separated.
If I use the command:
soffice --convert-to pdf ./sampleCSVFile.csv
This gives me a PDF file, but the semicolons (;) are still in it. I found an article about converting ODS to CSV with semicolon as delimiter: https://ask.libreoffice.org/t/cli-convert-ods-to-csv-with-semicolon-as-delimiter/5021
So similar to that I tried:
unoconv -f pdf -e FilterOptions="59,34,0,1" ./sampleCSVFile.csv
But it didn't help.
sampleCSVFile.csv is as follows:
Level 1;Level2
Level 1;Level2
Level 1 ;Level2
Level 1;Level2
Level 1 ;Level2
Level 1;Level2
Level 1;Level2
Level 1;Level2
Level 1;Level2
Is there a way to convert this semicolon-separated CSV file to PDF?
(without changing the delimiter from semicolon to comma)

Traditionally in DOS you used Edlin to write a text file, then either Copy or Type it to the CON, COM or LPT device (Line PrinTer).
Windows still allows the print command to do that, and it's possible to echo text via Notepad to a PDF virtual printer as a port. I will skip that as it is not quite suited to your usage.
However, by way of example, here I take your file and print it virtually to a PDF file port, then call the port result to the console. I could use one line rather than two, but this way is more visual in the GUI.
However, it's not cross-platform, and there are other, simpler ways to convert text to PDF on each platform.
You ask about soffice, and the principles are much the same as before PDFs were invented:
soffice --infilter="calc_pdf_export" --convert-to pdf sampleCSVFile.csv
The text you transport to export is the same as you import. However, printing blind can add default print headers, footers (Page 1) and styles.
Because this is the most basic of methods, whatever is in your character-separated values file will produce similar output. The only difference is that there is no such thing as a tab or line wrap in a PDF (it's a virtual laser printer, not a mechanical line-feed one).
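For the semicolon question itself, a minimal sketch (untested here, so treat it as an assumption to verify): LibreOffice's --infilter accepts import filter options appended after the filter name, so you may be able to make Calc treat the semicolon as the field separator during the conversion. In the options string, 59 is the ASCII code for ';', 34 is the double quote, and 76 selects UTF-8:
soffice --headless --infilter="Text - txt - csv (StarCalc):59,34,76,1" --convert-to pdf sampleCSVFile.csv
If your build accepts that, the resulting PDF should show two columns instead of the literal semicolons.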

Related

Agnostically convert spreadsheet to csv file

I have a script that takes a csv and does something with it.
I'd like this script to be agnostic to the spreadsheet file type (xlsx, xls, ods for instance), and always convert the file to csv for processing. Is there a way to do this without corrupting the data in any way?
You can use a headless version of the open-source software LibreOffice to convert the same extensions that LibreOffice can normally handle in the GUI. This solution does require you to install the whole office suite, which may be overkill depending on your particular situation.
However, via the command line, you can call Libreoffice to do this conversion:
soffice --headless --convert-to csv <input_file> --outdir </path/to/dir>
This example assumes that you are using a Unix-like machine; however, there should be a similar version for Windows as well (e.g. soffice.exe). Replace <input_file> with your file name and </path/to/dir> with the path to the directory where you want your output (the output directory option is optional). You can use the wildcard * as the input file, which will convert all the files in the directory to csv.
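As a quick usage sketch (the directory names here are only illustrative), converting every .xlsx file in a folder in one pass could look like:
soffice --headless --convert-to csv ~/spreadsheets/*.xlsx --outdir ~/csv_out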

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF and a TXT file containing the OCR'd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.
You can prepare a batch file that loops through the input images and outputs both txt and pdf at the same time -- more efficient, one single OCR operation instead of two. You can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf
Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: is the name of the output file.
txt pdf: are the output formats, you can also use only one of them.
Converting images to individual text files
On Linux, you can use a for loop to go through the files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt
Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. Although, from my observations, I'm not sure that Tesseract runs any faster when converting batch images to both PDF and TXT simultaneously (versus running it twice - once for PDF, and once for TXT).
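For reference, a rough shell sketch of that splitting step, assuming the default form-feed (\f) page separator and that the pages in the merged outfile.txt appear in the same order as the image paths listed in infile:
i=0
while IFS= read -r img; do
  i=$((i + 1))
  # pull page $i out of the merged text and name it after the image
  awk -v RS='\f' -v n="$i" 'NR == n { printf "%s", $0 }' outfile.txt > "${img}.txt"
done < infile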
Thank you!
BTW I'm using 4.1.1.
And I discovered another traineddata file for Spanish that does a better job than the standard one. It actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly, I don't know how the new traineddata file does the job better. I downloaded it at:
https://github.com/tesseract-ocr/tessdata_best

Pentaho - CSV Input not understanding special character [Windows to Linux]

I have a transformation in Pentaho Data Integration where the first thing I do is use the "CSV Input" step to map my flat file.
I've never had a problem with it on Windows, but now I'm changing the server that Spoon runs on to a Linux server, and I'm having problems with special characters.
The first thing I noticed was that my tables were being updated because the system was treating the names as different strings from the ones in my database.
Checking into the problem, I also noticed that if I go to my "CSV Input" -> Preview, it shows me a preview of my data with the problem above:
the special characters are not showing.
Where it should be:
Diretoria de Suporte à Decisão e Aplicação
I used a command to check my file's charset/encoding and it showed:
$ file -bi foo.csv
text/plain; charset=iso-8859-1
If I open foo.csv on vi, it understands the special characters.
Any idea on what could be the problem or what should I try?
I don't have any data files with this encoding, so you'll have to do some experimenting, but there are some steps designed to deal with these issues.
First, the CSV Input step has a field that allows you to select the encoding of the source file. The Text File Input step has both a "Format" (meaning line terminator) and "Encoding" selector under the "Content" tab.
In Transforms, you have the Change file encoding step under the Utility tab. This step is designed to copy many files while changing their encoding; that's why it's in a transform.
In Jobs, there's the Convert file between Windows and Unix step under the File Management tab, but this appears to only deal with line terminators.
Either way, it appears that if the CSV/Text file input steps don't suit your needs, you'll have to copy the file to a new encoding before reading it in. It will probably be easiest to try handling it with the file input steps first.
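If you do end up copying the file to a new encoding outside of PDI, a minimal sketch on the Linux box (file names are placeholders) is to run it through iconv before the transformation reads it:
iconv -f ISO-8859-1 -t UTF-8 foo.csv > foo-utf8.csv
Then point the CSV Input step at foo-utf8.csv and select UTF-8 as its encoding.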

How to convert xls format to arff format?

I have an (.xls) data file and I want to use it in Weka.
I tried to convert it with the Weka converter as described in the Weka tutorial, but after saving it as .csv I don't know how to tag the attributes or data.
Is there any other way to convert .xls format to arff?
You must work with csv files; they can be converted directly on the command line:
java -cp "path/to/weka.jar" weka.core.converters.CSVLoader file.csv > file.arff
Please make sure that there are no quotation marks (") around numeric data.
If you use non-standard characters, use UTF-8 and add -Dfile.encoding=utf-8 when you start Weka (see the example below).
Any nominal or string data will be converted automatically.
If you still have problems, please provide part of your csv file.
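Putting those two command-line hints together, a hedged example (the weka.jar path is a placeholder) that forces UTF-8 while running the CSVLoader would be:
java -Dfile.encoding=utf-8 -cp "path/to/weka.jar" weka.core.converters.CSVLoader file.csv > file.arff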
What do you mean by "how to tag the attribute or data"?
The following steps can solve your problem
Save your .xls file in .csv format.
Open the Weka GUI Chooser and then click on the Tools button in the top menu bar.
Click on the ArffViewer.
Choose the file types to be loaded, e.g. *.csv, *.data.
Open the *.csv file to view the data and values.
Name the file with the .arff extension.
Save the file.

How to extract (import) data from a mainframe dataset to excel table

I want to build a little application that calculates the critical batch of a batch flow.
As input I need to use a mainframe dataset. If possible, it should be dynamic; that is, I should be able to choose which fields apply at the time.
I've searched the internet about that but found nothing that suited what I wanted to do.
Is there a way to do that?
I have a dataset in a mainframe library and I want to ftp that file to Excel.
Convert the file to CSV on the mainframe (for example, via a REXX exec, a z/OS UNIX shell script, or a Lua4z program),
and then insert that CSV file into Excel via FTP.
You do not need to transfer the CSV file to your PC's file system and then, as a separate step, open it in Excel.
Instead, you define the FTP (or HTTP) URL for the CSV as a data source in Excel. One advantage of this technique is that you can refresh the data from that URL
without having to reapply formatting in Excel.
There are various tutorials on the web for doing this.
In brief:
Create a new blank workbook (I'm using Excel 2010).
Select the first cell in the empty worksheet (this step is unnecessary - the cell is already selected - if you've only just created the workbook).
On the Data tab, click From Text
In the File name text box of the Import Text File dialog, enter the FTP URL of the CSV file. For example:
ftp://zos1//u/me/data.csv
(This assumes that your mainframe is configured to allow FTP using this path.)
The two consecutive slash (/) characters following the host name (zos1) indicate that the path refers to a z/OS UNIX file (/u/me/data.csv).
The CSV file must be in a z/OS UNIX path. The FTP client does not accept MVS-style (dsname) paths such as 'me.csv(data)' (even when URL-encoded; that is, with the single quotes escaped as %27); by contrast, cURL accepts such paths just fine.
The CSV file on the mainframe must be ASCII encoded, not EBCDIC. (Here, I'm using the term ASCII imprecisely: the precise character encoding you want depends on your PC's settings. You probably want Windows-1252.) This is because the FTP client sets the default transfer type to binary.
Enter your user name and password (your z/OS TSO user ID and password).
Wait for the data to load.
Format the cells. For example, set the format of any columns containing date/time values.
On the Data tab, click Connections, select the connection (that Excel created when you specified a URL for the file name), and clear the check box Prompt for file name on refresh.
To refresh the data, replacing the current data with the results of a new FTP request: on the Data tab, click Refresh All. The data is replaced; the cell formatting remains intact.
Converting an EBCDIC-encoded CSV file to ASCII
(Strictly speaking, I mean ISO-8859, not ASCII.)
Suppose you have JCL that generates a CSV file encoded in EBCDIC. You want to make that CSV file available to Excel via FTP as an ASCII-encoded z/OS UNIX (zFS) file.
Replace your existing DD statement for the output CSV file with the following DD statement:
//OUTCSV DD PATH='/u/me/data-ebcdic.csv',
// PATHOPTS=(OWRONLY,OCREAT,OTRUNC),
// PATHDISP=(KEEP,DELETE),
// PATHMODE=(SIRUSR,SIWUSR,SIRGRP),
// FILEDATA=TEXT
Replace the ddname OUTCSV with your ddname, and the zFS file path /u/me/data-ebcdic.csv with the path that you want to use.
Thanks to the FILEDATA=TEXT parameter, the resulting CSV file will have an X'15' byte at the end of each line.
Append the following step to your JCL:
//ICONV EXEC PGM=IKJEFT01
//SYSTSIN DD *
BPXBATCH sh iconv -f IBM-037 -t iso8859-1 +
/u/me/data-ebcdic.csv +
> /u/me/data-ascii.csv
/*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
In case you're wondering why I'm calling iconv as a shell command via BPXBATCH, it's because the following:
//ICONV EXEC PGM=EDCICONV
// PARM=('FROMCODE(IBM-037),TOCODE(iso8859-1)')
didn't quite work: it left the X'15' bytes as is, whereas running iconv as a shell command correctly converted them to X'0A'. (z/OS 2.2.)
You've got some good information in the comments; the consensus appears to be that conversion to CSV (or TSV, to avoid commas embedded in your data) is the easiest route. Here is a bit more information, copied from another answer...
I would strongly suggest you get the files into a text format before
transferring them to another box with a different code page. Trying to
deal with mixed text (which must have its code page translated) and
binary (which must not have its code page translated but which likely
must be converted from big endian to little endian) is harder than
doing the conversion up front.
The conversion can likely be done via the SORT utility on the
mainframe. Mainframe SORT utilities tend to have extensive data
manipulation functions. There are other mechanisms you could use
(other utilities, custom code written in the language of your choice,
purchased packages) but this is what we tend to do in these
circumstances.
Once you have your flat files converted such that all data is text,
you can transfer them via FTP or SFTP or FTPS.
...and thanks for coming back and adding more information. Hopefully the people here have provided enough information to help you solve your problem.
XML would be another possible text-oriented solution. It would take more effort to create, but you could design your spreadsheet in Excel and save it as an XML document, then write a program to generate the XML text using the data from your mainframe dataset. While this would be more difficult to implement than a simple CSV or TSV file, it has the advantage of supporting spreadsheet formulas and attributes that a CSV file cannot. Another advantage: you can attach the XML document to an SMTP email note and deliver the document in "spreadsheet format" to your client.
