Pentaho - CSV Input not understanding special character [Windows to Linux] - linux

I have a transformation in Pentaho Data Integration where the first thing I do is use the "CSV Input" step to map my flat file.
I've never had a problem with it on Windows, but now I'm changing the server that Spoon runs on to a Linux server, and I'm having problems with special characters.
The first thing I noticed was that my tables were being updated because the system was treating the names as different strings from the ones in my database.
Checking into the problem, I also noticed that if I go to my "CSV Input" -> Preview, it shows the preview of my data with the same problem:
Special characters are not showing.
Where it should be:
Diretoria de Suporte à Decisão e Aplicação
I used a command to check my file's charset/encoding and it showed:
$ file -bi foo.csv
text/plain; charset=iso-8859-1
If I open foo.csv in vi, it displays the special characters correctly.
Any idea what the problem could be, or what I should try?

I don't have any data files with this encoding, so you'll have to do some experimenting, but there are some steps designed to deal with these issues.
First, the CSV Input step has a field that allows you to select the encoding of the source file. The Text File Input step has both a "Format" (meaning line terminator) and "Encoding" selector under the "Content" tab.
In Transforms, you have the Change file encoding step under the Utility tab. This step is designed to copy many files while changing their encoding; that's why it's in a transform.
In Jobs, there's the Convert file between Windows and Unix step under the File Management tab, but this appears to only deal with line terminators.
Either way, it appears that if the CSV/Text file input steps don't suit your needs, you'll have to copy the file to a new encoding before reading it in. It will probably be easiest to try handling it with the file input steps first.
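If the encoding selectors in those steps don't do the trick, here is a minimal sketch of the "convert the file first" approach from a shell. It assumes the source really is ISO-8859-1 (as file -bi reported) and that the transformation then reads the UTF-8 copy; the file names are placeholders:
# Re-encode the flat file to UTF-8 before Spoon reads it
iconv -f ISO-8859-1 -t UTF-8 foo.csv > foo-utf8.csv
Alternatively, setting the CSV Input step's encoding field to ISO-8859-1 should let it read the original file as-is.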

Related

(dd command linux) last byte goes to next line

Hi friends, I need some help.
We have a tool that converts binary files to text files and then stores them in Hadoop (HDFS).
In production, that ingestion tool uses FTP to download files from the mainframe in binary format (EBCDIC), and we don't have access to download files from the mainframe in the development environment.
In order to test the file conversion, we manually create text files, and we are trying to convert them using the dd command (Linux) with these parameters:
dd if=asciifile.txt of=ebcdicfile conv=ebcdic
After passing it through our conversion tool, the expected result is:
000000000000000 DATA
000000000000000 DATA
000000000000000 DATA
000000000000000 DATA
However, it's returning the following result:
000000000000000 DAT
A000000000000000 DA
TA000000000000000 D
ATA000000000000000
I have tried the cbs, obs and ibs parameters, setting them to the record length (the number of characters in each line), without success.
Can anyone help me?
A few things to consider:
How exactly is the data transferred via FTP? Your "in binary format (EBCDIC)" doesn't make sense on its own. FTP either transfers in binary mode, in which case nothing is changed or converted during the transfer, or it transfers in text mode (aka ASCII mode), in which case the data is converted from a specific EBCDIC code page to a specific non-EBCDIC code page. You need to know which mode is used and, if it's text mode, which two code pages are involved.
From the man pages for dd, it is unclear which EBCDIC and ASCII code pages are used for the conversion. I'm just guessing here: the EBCDIC code page might be CP-037, and the ASCII one might be CP-437. If these don't match the ones used in the FTP transfer, the resulting test data is incorrect.
I understand you don't have access to production data in the development environment. However, you should still be able to get test data from the development mainframe using FTP from there. If not, how will you be doing end to end testing?
The EBCDIC conversion is eating your line endings:
https://www.ibm.com/docs/en/zos/2.2.0?topic=server-different-end-line-characters-in-text-files
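If the conversion tool expects fixed-length records rather than newline-terminated lines, it may be worth letting dd do the blocking as well. This is only a sketch: the 80-byte record length is an assumption (use whatever LRECL the real dataset has), and some dd versions already imply conv=block when conv=ebcdic is combined with cbs=:
# Pad each newline-terminated line to a fixed 80-byte record, then convert it to EBCDIC
dd if=asciifile.txt of=ebcdicfile cbs=80 conv=block,ebcdic
Blocking removes the newline bytes from the data stream, which is what appears to be shifting each record by one character in the output shown above.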

How can I convert a text file to UCS-2 LE, from whatever the default is?

I am looking for a way to convert or save a text file in the UCS-2 LE format; specifically without a BOM... I guess.
I have zero knowledge of what any of that actually means, but I know I need it because of this wiki page on what I am trying to accomplish: https://developer.valvesoftware.com/wiki/Closed_Captions
In other words:
this is for a specific game engine, "Source Engine," which requires the format in order to compile in-game closed captions for sounds.
I have tried saving the file in Notepad++ using the "UCS-2 LE BOM" option under the Encoding menu. There is no option for just "UCS-2 LE", however, and because of this the captions cannot be compiled for the game engine. I need to save without a BOM, I guess (again, I don't really know what I'm talking about; I'm assuming, based on logical conclusions, that I need to not have a BOM, whatever that actually means).
I would like to know about a way to either save a txt file in that encoding format, or a way to convert an existing one.
In my specific case, it appears that my problem boils down to "the program is weird."
What I mean by this is that Notepad++ actually does save in the correct format, but I failed to realize that because of a quirk in the caption compiler: it only works if you drag the file onto it, not via the command line as I previously thought.
I will accept this as the answer when I am allowed to in 2 days.
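For the "way to convert one" part of the question, a hedged sketch with iconv (file names are placeholders; it assumes the source file is UTF-8, and that, as with most iconv builds, the explicitly little-endian UCS-2LE target is written without a BOM):
# Convert UTF-8 text to UCS-2 little-endian with no byte-order mark
iconv -f UTF-8 -t UCS-2LE captions_source.txt > captions_ucs2le.txt
If a tool turns out to want the BOM after all, Notepad++'s "UCS-2 LE BOM" option or iconv's generic UTF-16 target will typically produce one.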

Corrupt Text File read/write/open

I have a large text file that I take notes in. Recently, after saving it, it won't open and gives the following error. I tried a few things from the web that didn't work (opening it with a different encoding format, etc.). Nothing worked. Any idea how I can open it again? Is there a language I can use from bash? I'm very familiar with PHP. Any ideas? A different text editor?
Error:
"The document “ToDo.txt” could not be opened. Text encoding Unicode (UTF-8) isn’t applicable."
"The file may have been saved using a different text encoding, or it may not be a text file."
cat the file from the CLI and make sure your data is still there. Then you could simply copy and paste the output into a new file and hopefully get rid of whatever weird encodings are causing that text editor to not read the file.
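Along the same lines, a small sketch from the shell: first ask file what it thinks the encoding is, then write a re-encoded copy. The ISO-8859-1 source charset here is only an assumption for illustration; use whatever file reports:
# Report the detected charset, then write a clean UTF-8 copy
file -bi ToDo.txt
iconv -f ISO-8859-1 -t UTF-8 ToDo.txt > ToDo-utf8.txt
If iconv complains about an illegal input sequence, adding -c makes it drop the offending bytes instead of aborting, which is often enough to make the file openable again.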

How to extract (import) data from a mainframe dataset to excel table

I want to build a little application that calculates the critical batch of a batch flow.
As input I need to use a mainframe dataset. If possible, it should be dynamic, that is, I can choose which fields apply at the time.
I've searched the internet about that but found nothing that suited what I wanted to do.
Is there a way to do that?
I have a dataset in a mainframe library and I want to ftp that file to Excel.
Convert the file to CSV on the mainframe (for example, via a REXX exec, a z/OS UNIX shell script, or a Lua4z program),
and then insert that CSV file into Excel via FTP.
You do not need to transfer the CSV file to your PC's file system and then, as a separate step, open it in Excel.
Instead, you define the FTP (or HTTP) URL for the CSV as a data source in Excel. One advantage of this technique is that you can refresh the data from that URL
without having to reapply formatting in Excel.
There are various tutorials on the web for doing this.
In brief:
Create a new blank workbook (I'm using Excel 2010).
Select the first cell in the empty worksheet (this step is unnecessary - the cell is already selected - if you've only just created the workbook).
On the Data tab, click From Text
In the File name text box of the Import Text File dialog, enter the FTP URL of the CSV file. For example:
ftp://zos1//u/me/data.csv
(This assumes that your mainframe is configured to allow FTP using this path.)
The two consecutive slash (/) characters following the host name (zos1) indicate that the path refers to a z/OS UNIX file (/u/me/data.csv).
The CSV file must be in a z/OS UNIX path. The FTP client does not accept MVS-style (dsname) paths such as 'me.csv(data)' (even when URL-encoded; that is, with the single quotes escaped as %27); by contrast, cURL accepts such paths just fine.
The CSV file on the mainframe must be ASCII encoded, not EBCDIC. (Here, I'm using the term ASCII imprecisely: the precise character encoding you want depends on your PC's settings. You probably want Windows-1252.) This is because the FTP client sets the default transfer type to binary.
Enter your user name and password (your z/OS TSO user ID and password).
Wait for the data to load.
Format the cells. For example, set the format of any columns containing date/time values.
On the Data tab, click Connections, select the connection (that Excel created when you specified a URL for the file name), and clear the check box Prompt for file name on refresh.
To refresh the data, replacing the current data with the results of a new FTP request: on the Data tab, click Refresh All. The data is replaced; the cell formatting remains intact.
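Before wiring the URL into Excel, it can be worth checking the path and credentials from a command line. A hedged sketch using cURL (which, as noted above, also accepts MVS-style dataset names); the host, path, and user ID are the placeholder values from the example URL:
# Fetch the CSV over FTP to confirm that the path and login work
curl --user me ftp://zos1//u/me/data.csv
curl prompts for the password when only a user name is given; if the file comes back readable, the same URL should work as the Excel data source.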
Converting an EBCDIC-encoded CSV file to ASCII
(Strictly speaking, I mean ISO-8859, not ASCII.)
Suppose you have JCL that generates a CSV file encoded in EBCDIC. You want to make that CSV file available to Excel via FTP as an ASCII-encoded z/OS UNIX (zFS) file.
Replace your existing DD statement for the output CSV file with the following DD statement:
//OUTCSV DD PATH='/u/me/data-ebcdic.csv',
// PATHOPTS=(OWRONLY,OCREAT,OTRUNC),
// PATHDISP=(KEEP,DELETE),
// PATHMODE=(SIRUSR,SIWUSR,SIRGRP),
// FILEDATA=TEXT
Replace the ddname OUTCSV with your ddname, and the zFS file path /u/me/data-ebcdic.csv with the path that you want to use.
Thanks to the FILEDATA=TEXT parameter, the resulting CSV file will have a X'15' byte at the end of each line.
Append the following step to your JCL:
//ICONV EXEC PGM=IKJEFT01
//SYSTSIN DD *
BPXBATCH sh iconv -f IBM-037 -t iso8859-1 +
/u/me/data-ebcdic.csv +
> /u/me/data-ascii.csv
/*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
In case you're wondering why I'm calling iconv as a shell command via BPXBATCH, the following:
//ICONV EXEC PGM=EDCICONV
// PARM=('FROMCODE(IBM-037),TOCODE(iso8859-1)')
didn't quite work: it left the X'15' bytes as is, whereas running iconv as a shell command correctly converted them to X'0A'. (z/OS 2.2.)
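To confirm the conversion did what you expect, a quick check from a z/OS UNIX shell (the paths are the placeholders used throughout this answer) is to hex-dump the start of each file:
# Line ends should show 15 in the EBCDIC file and 0a in the converted one
od -An -tx1 /u/me/data-ebcdic.csv | head
od -An -tx1 /u/me/data-ascii.csv | head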
You've got some good information in the comments; the consensus appears to be that conversion to CSV (or TSV, to avoid commas embedded in your data) is the easiest route. Here is a bit more information, copied from another answer...
I would strongly suggest you get the files into a text format before
transferring them to another box with a different code page. Trying to
deal with mixed text (which must have its code page translated) and
binary (which must not have its code page translated but which likely
must be converted from big endian to little endian) is harder than
doing the conversion up front.
The conversion can likely be done via the SORT utility on the
mainframe. Mainframe SORT utilities tend to have extensive data
manipulation functions. There are other mechanisms you could use
(other utilities, custom code written in the language of your choice,
purchased packages) but this is what we tend to do in these
circumstances.
Once you have your flat files converted such that all data is text,
you can transfer them via FTP or SFTP or FTPS.
...and thanks for coming back and adding more information. Hopefully the people here have provided enough information to help you solve your problem.
XML would be another possible text-oriented solution. It would take more effort to create, but you could design your spreadsheet in Excel and save it as an XML document, then write a program to generate the XML text using the data from your mainframe dataset. While this would be more difficult to implement than a simple CSV or TSV file, it has the advantage of supporting the spreadsheet formulas and attributes that a CSV file cannot express. Another advantage: you can attach the XML document to an SMTP email note and deliver the document in "spreadsheet format" to your client.

How can I determine file encodings on Windows / IIS?

From the answers to this question it appears there's a file somewhere on our server that's been saved with the wrong encoding.
I've seen this happen before, most often when pasting from Word into Visual Studio, where "smart quotes" can interfere with Visual Studio's encoding settings when the file is saved.
Thing is - the problem I'm having involves 20-30 different script files, include files and so on (hey, that was how we kept it modular back in the day...) and I really don't want to open every one of them in Visual Studio and check the file encodings individually.
Is there any way I can analyze a folder tree full of files and spit out a list of each filename along with the text encoding used to save the file? (Or - if encodings aren't clearly specified - work out what encoding Microsoft IIS thinks was used to save the file?)
A text file's encoding is just a statement of how it was intended to be interpreted, so you cannot detect it in a reliable way. You can probably detect UTF-8 and 16-bit Unicode, but there's no way of distinguishing between ISO-8859-1/2/3/4 etc. (or Windows-1250/1251/1252 etc.).
If your document contains "weird" quotes, other than "" or '', you can simply find these and replace them manually.
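For a bulk report rather than a hard guarantee, here is a sketch that walks a folder tree and prints each file name with file's best guess at its encoding. The .asp and .inc extensions are assumptions about classic ASP files on IIS, and the file utility is not built into Windows (it is available through Git Bash, Cygwin, or WSL):
# List every script/include file together with the charset file detects for it
find . -type f \( -name '*.asp' -o -name '*.inc' \) -exec file --mime-encoding {} +
As noted above, treat the result for 8-bit code pages as a guess; only the UTF-8 and UTF-16 detections are reasonably reliable.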
