file transfer from Windows to Linux - linux

I am exporting data in a csv file using ssis. In my ssis package i compress the file in zip format and upload it on a linux server using sftp. The problem is that in the destination file system, the csv files include a ^M character which comes from the dos system.
I found three solutions.
First i could set the sftp transfer mode to ascii and not zip the file (i later found out this is only supported by ftp). Considering that my unzipped file is > 3Gb that is not efficient, the upload will take ages.
Secondly once transferred i could unzip the file and convert it using dos2unix utility, but again dos2unix is not installed and i am not authorized to install it to the target system.
Finally i could use a unix editor like sed to remove ^M from the end of lines. My file is consisted of more than 4 million lines and this would again take ages.
Q: Is there any way to preformat my file in ASCII using ssis, then zip and transfer?

While searching on this issue i found a very useful links were they described the cause and possible resolutions of this issue:
How to remove CTRL-M (^M) characters from a file in Linux
Why are special characters such as “carriage return” represented as “^M”?
Cause
File has been transferred between systems of different types with different newline conventions. For example, Windows-based text editors will have a special carriage return character (CR+LF) at the end of lines to denote a line return or newline, which will be displayed incorrectly in Linux (^M). This can be difficult to spot, as some applications or programs may handle the foreign newline characters properly while others do not. Thus some services may crash or not respond correctly. Often times, this is because the file is created or perhaps even edited on a Microsoft Windows machine and then uploaded or transferred to a Linux server. This typically occurs when a file is transferred from MS-DOS (or MS-Windows) without ASCII or text mode.
Possible resolutions
(1) Using dos2unix command
dos2unix includes utilities to convert text files with DOS or MAC line breaks to Unix line breaks and vice versa. It also includes conversion of UTF-16 to UTF-8.
dos2unix
Dos2Unix / Unix2Dos - Text file format converters
dos2unix and unix2dos commands
You can use a similar command via Execute Process Task:
dos2unix filename
(2) Data Flow Task
You can create a Data Flow task that transfer data from Flat File Source into a new Flat File Destination were both Flat File Connection mAnager has the same structure except the Row Delimiter property ({CR}{LF} in Source , {LF} in destination)
Flat File Connection Manager Editor (Columns Page)
(3) Using a Script Task - StreamReader/Writer
You can use a script task with a similar code:
string data = null;
//Open and read the file
using (StreamReader srFileName = new StreamReader(FileName))
{
data = srFileName.ReadToEnd();
data = data.Replace("\r\n","\n");
}
using (StreamWriter swFileName = new StreamWriter(FileName))
{
swFileName.Write(data);
}
Replacing LF with CRLF in text file
(4) Extract using unzip -a
From the following unzip documentation:
-a
convert text files. Ordinarily all files are extracted exactly as they are stored (as ''binary'' files). The -a option causes files identified by zip as text files (those with the 't' label in zipinfo listings, rather than 'b') to be automatically extracted as such, converting line endings, end-of-file characters and the character set itself as necessary. (For example, Unix files use line feeds (LFs) for end-of-line (EOL) and have no end-of-file (EOF) marker; Macintoshes use carriage returns (CRs) for EOLs; and most PC operating systems use CR+LF for EOLs and control-Z for EOF. In addition, IBM mainframes and the Michigan Terminal System use EBCDIC rather than the more common ASCII character set, and NT supports Unicode.) Note that zip's identification of text files is by no means perfect; some ''text'' files may actually be binary and vice versa. unzip therefore prints ''[text]'' or ''[binary]'' as a visual check for each file it extracts when using the -a option. The -aa option forces all files to be extracted as text, regardless of the supposed file type. On VMS, see also -S.
So you can use the following command to extract text files with changing line endings:
unzip -a filename
Credit to #jww comment
Other Useful links
How to Remove ^M in Linux & Unix
Remove CTRL-M characters from a file in UNIX
Convert DOS line endings to Linux line endings in vim
How to replace crlf with lf in a single file
How to remove Windows carriage returns for text files in Linux

I didn't try it, but I thought you could do a CR+LF -> LF conversion just when outputing to the csv file. I looked in this link here
Scroll down to the section "Header row delimiter". It seems that if you choose {LF} as a row delimiter, your resulting .zip file will show correctly in your linux box.
BTW, probably you know, but I have to mention that ^M is the representation of CR in a linux / unix box.
BTW2, in most cases the ^M in linux is not a problem, just some annoying thing.
I hope I could help!

Related

Why does `^M` appear in terminal output when looking at some files?

I'm trying to send file using curl to an endpoint and save the file to the machine.
Sending curl from Linux and saving it on the machine works well,
but doing the same curl from Windows is adding ^M character to every end of line.
I'm printing the file before saving it and can't see ^M. Only viewing the file on the remote machine after saving it shows me ^M.
A simple string replacement doesn't seem to work.
Why is ^M being added? How can I prevent this?
Quick Answer: That's a carriage return. They're a harmless but mildly irritating artifact of how Windows encodes text files. You can strip them out of your files with dos2unix. You can configure most text editors to use "Unix Line Endings" or "LF Line Endings" to prevent them from appearing in new files that you create from Windows PCs in the future.
Long Answer (with some historical trivia):
In a plain text file, when you create a new line (by pressing enter/return), a "line break" is embedded in the file. On Unix/Linux, this is a single character, '\n', the "line feed". On Windows, this is two sequential characters, '\r\n', the "carriage return" followed by the "line feed".
When physical teletype terminals, which behaved much like typewriters, were still in use, the "line feed" character meant "move the paper up to the next line" and the "carriage return" character meant "slide the carriage all the way over so the typing head is on the far left". From the very beginning, nearly all teletype terminals supported implicit carriage return; i.e., triggering a line feed would automatically trigger a carriage return. The developers working on what later evolved into Windows decided that it would be best to include explicit carriage returns, just in case (for some reason) the teletype does not perform one implicitly. The Unix developers, on the other hand, chose to work with the assumption of implicit carriage return.
The carriage return and line feed are ASCII Control Characters which means they do not have a visible representation as standalone printable characters, instead they affect the output cursor itself (in this case, the position of the output cursor).
The "^M" you see is a stand-in representation for the carriage return character, used by programs that don't fully "cook" their output (i.e., don't apply the effects of some ASCII Control Characters). (Other control characters have other representations starting with "^", and the "^" character is also used to represent the "ctrl" keyboard key in some Unix programs like nano.)
You can use dos2unix to convert the line endings from Windows-style to Unix-style.
$ curl https://example.com/file_with_crlf.txt | dos2unix > file.txt
On some distros, this tool is included by default, on others it can be installed via the package manager (e.g., on Ubuntu, sudo apt install dos2unix). There also exists a package, unix2dos, for the inverse.
Most "smart" text editors for coding (Sublime, Atom, VS Code, Notepad++, etc.) will happily read and write with either Windows-style or Unix-style line endings (this might require changing some configuration options). Often, the line-endings are auto-detected by scanning the contents of a file, and usually new files are created with the Operating System's native line endings (by default). Even the new version of Notepad supports Unix-style line endings. On the other hand, some Unix tools will produce strange results in the presence of Windows-style line breaks. If your codebase will be used by people on both Unix and Windows operating systems, the nice thing to do is to use Unix-style line endings everywhere.
Git on Windows also has an optional mode that checks out all files with Windows-style line breaks, but checks them back in with Unix-style line breaks.
Side Notes (interesting, but not directly related to your question):
What the carriage return actually does (on a modern virtual terminal, be it Windows or Unix) is move the output cursor to the beginning of the line. If you use the carriage return without a line feed, you can "overwrite" part of a string that has already been printed.
$ printf "dogdog" ; printf "\rcat\n"
catdog
Some Unix programs use this to asynchronously update part of the last line of output, to implement things like a live-updating progress indicator. For example, curl, which shows download progress on stdout if the file contents are piped elsewhere.
Also: If you had a tool that interpreted Windows-style line endings as literally as possible, and you fed it a string with Unix-style line endings such as "hello\nworld", you would get output like this:
hello
world
Fortunately, such implementations are extremely rare and, in general, the vast majority of Windows tools can render Unix-style line-endings identically to Windows-style line endings without any problem.

Replace CRLF with LF in Python 3.6

I've tried searching the web, and a number of different things I've read on the web, but don't seem to get the desired result.
I'm using Windows 7 and Python 3.6.
I'm connecting to an Oracle db with cx_oracle and creating a text file with the query results. The file that is created (which I'll call my_file.txt to make it easy) has 3688 lines in it all with CRLF which needs to be converted to the unix LF.
If I run python crlf.py my_file.txt it is all converted correctly & there is no issues, but that means I need to run another command manually which I do not want to do.
So I tried adding the code below to my file.
filename = "NameOfFileToBeConverted"
fileContents = open(filename,"r").read()
f = open(filename,"w", newline="\n")
f.write(fileContents)
f.close()
This does convert the majority of the CRLF to LF but # line 3501 it has a NUL character 3500 times on the one line followed by a row of data from the database & it ends with the CRLF, every line from here on still has the CRLF.
So with that not working, I removed it and then tried
import subprocess
subprocess.Popen("crlf.py "+ filename, shell=True)
I also tried using
import os
os.system("crlf.py "+ filename)
The "+ filename" in the two examples above is just providing the filename that is created during the data extract.
I don't know what else to try from here.
Convert Line Endings in-place (with Python 3)
Windows to Linux/Unix
Here is a short script for directly converting Windows line endings (\r\n also called CRLF) to Linux/Unix line endings (\n also called LF) in-place (without creating an extra output file):
# replacement strings
WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'
# relative or absolute file path, e.g.:
file_path = r"c:\Users\Username\Desktop\file.txt"
with open(file_path, 'rb') as open_file:
content = open_file.read()
content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)
with open(file_path, 'wb') as open_file:
open_file.write(content)
Linux/Unix to Windows
Just swap the line endings to content.replace(UNIX_LINE_ENDING, WINDOWS_LINE_ENDING).
Code Explanation
Important: Binary Mode We need to make sure that we open the file both times in binary mode (mode='rb' and mode='wb') for the conversion to work.
When opening files in text mode (mode='r' or mode='w' without b), the platform's native line endings (\r\n on Windows and \r on old Mac OS versions) are automatically converted to Python's Unix-style line endings: \n. So the call to content.replace() couldn't find any line endings to replace.
In binary mode, no such conversion is done.
Binary Strings In Python 3, if not declared otherwise, strings are stored as Unicode (UTF-8). But we open our files in binary mode - therefore we need to add b in front of our replacement strings to tell Python to handle those strings as binary, too.
Raw Strings On Windows the path separator is a backslash \ which we would need to escape in a normal Python string with \\. By adding r in front of the string we create a so called raw string which doesn't need any escaping. So you can directly copy/paste the path from Windows Explorer.
Alternative We open the file twice to avoid the need of repositioning the file pointer. We also could have opened the file once with mode='rb+' but then we would have needed to move the pointer back to start after reading its content (open_file.seek(0)) and truncate its original content before writing the new one (open_file.truncate(0)).
Simply opening the file again in write mode does that automatically for us.
Cheers and happy programming,
winklerrr

SQL Loader unable to load CSV file data into linux environment correctly

i was able to load the same comma delimited csv file's data into window oracle database correctly but in linux environment, the record being inserted having weird behavior. For example, the data being inserted are having a behavior like \n. i selected the record and paste it out notice that the record is like this
"data
"
the control file i used is as below
Load DATA
REPLACE INTO TABLE TABLE_NM
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
please advice what i can do to make this scenario right. thank you in advance
Its the classic issue where on *nix systems lines end with a linefeed, but on Windows lines end with a carriage return/linefeed. Since your data ends with carriage return/linefeed it is read fine on Windows, but Linux loads the carriage return.
You can either preprocess the data file and replace the line (record) termination character with a utility like dos2unix or change the control file by adding the STR clause to the INFILE option to set the record termination character to the carriage return:
INFILE "test.dat" "STR x'0D'"
I would opt for running the data through dos2unix to keep the control file more generic and not data filename specific.
After investigation, notice that the root cause of the issues is the feed file is not generated from Linux base environment. So after I manually convert the file into Linux version, the feed file is able to load into DB without any issues.

Charset after FTP binary transfer

We have two machines (unix and windows) and we send a file vía FTP from first (unix [IBM1047]) to second (windows [UTF16]). If you use ASCII mode, some especial characters such as Ñ ó... are not display correctly. So we changed to BINARY mode and after the transfer we set charset file to UTF16. But everything works fine except returns carriages which are not displayed (1 line un the file).
So what is we missing?
Binary mode means that there are no changes done to the file, which includes changes on the line endings. UNIX and Windows have traditionally different line endings, i.e. \n on UNIX vs. \r\n on Windows. If your application is not able to deal with UNIX-style line endings you have to convert all the line endings in the file. See also Windows command to convert Unix line endings?.

Komodo IDE FTP (ASCII, binary) end-of-line characters

I've some problem when working on remote files (perl scripts) with Komodo IDE. There is (as far as I know) no way to change ftp transfer mode from binary to ASCII, which result in "^M" character at the end of every line. My setup is Linux server, and Windows client. Is there any way to solve this issue without nessecity of correcting saved file on Linux every time. This behaviour disqualify Komodo IDE, which was my favourite IDE until now.
The "^M" you observe has nothing to do with your file being ASCII, but line ending format (carriage return and line feed characters.)
I have not verified this, but here's a link showing how to save files in Komodo using a different line ending method. Saving files in DOS mode is not needed anymore, since most editors recognize UNIX file format nowadays.
Add switch -w to your Perl shebang.

Resources