Replace CRLF with LF in Python 3.6

I've searched the web and tried a number of the things I read there, but I don't seem to get the desired result.
I'm using Windows 7 and Python 3.6.
I'm connecting to an Oracle db with cx_Oracle and creating a text file with the query results. The file that is created (which I'll call my_file.txt to make it easy) has 3688 lines in it, all ending with CRLF, which needs to be converted to the Unix LF.
If I run python crlf.py my_file.txt it is all converted correctly and there are no issues, but that means I need to run another command manually, which I do not want to do.
So I tried adding the code below to my file.
filename = "NameOfFileToBeConverted"
fileContents = open(filename,"r").read()
f = open(filename,"w", newline="\n")
f.write(fileContents)
f.close()
This does convert the majority of the CRLFs to LFs, but at line 3501 there is a NUL character repeated 3500 times on the one line, followed by a row of data from the database that ends with CRLF, and every line from there on still has the CRLF.
So with that not working, I removed it and then tried
import subprocess
subprocess.Popen("crlf.py "+ filename, shell=True)
I also tried using
import os
os.system("crlf.py "+ filename)
The "+ filename" in the two examples above is just providing the filename that is created during the data extract.
I don't know what else to try from here.

Convert Line Endings in-place (with Python 3)
Windows to Linux/Unix
Here is a short script for directly converting Windows line endings (\r\n also called CRLF) to Linux/Unix line endings (\n also called LF) in-place (without creating an extra output file):
# replacement strings
WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'

# relative or absolute file path, e.g.:
file_path = r"c:\Users\Username\Desktop\file.txt"

with open(file_path, 'rb') as open_file:
    content = open_file.read()

content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)

with open(file_path, 'wb') as open_file:
    open_file.write(content)
Linux/Unix to Windows
Just swap the line endings to content.replace(UNIX_LINE_ENDING, WINDOWS_LINE_ENDING).
Code Explanation
Important: Binary Mode. We need to make sure that we open the file both times in binary mode (mode='rb' and mode='wb') for the conversion to work.
When opening files in text mode (mode='r' or mode='w' without b), all recognized line endings (\r\n on Windows, \r on old Mac OS versions, \n on Unix) are automatically converted to Python's Unix-style line ending \n when reading. So the call to content.replace() couldn't find any \r\n sequences to replace.
In binary mode, no such conversion is done.
Binary Strings. In Python 3, strings are Unicode by default. But we open our files in binary mode, therefore we need to add a b in front of our replacement strings to tell Python to handle those strings as bytes, too.
Raw Strings. On Windows the path separator is a backslash \, which we would need to escape in a normal Python string with \\. By adding an r in front of the string we create a so-called raw string, which doesn't need any escaping. So you can directly copy/paste the path from Windows Explorer.
Alternative. We open the file twice to avoid having to reposition the file pointer. We could also have opened the file once with mode='rb+', but then we would have needed to move the pointer back to the start after reading the content (open_file.seek(0)) and truncate the original content before writing the new one (open_file.truncate(0)).
Simply opening the file again in write mode does that automatically for us.
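For files too large to read into memory at once, the same replacement can be done in bounded-size chunks; the fiddly part is a chunk that happens to end right between \r and \n. A minimal streaming sketch (the function name and chunk size are my own, and it writes to a second file rather than in-place):

```python
def convert_crlf_to_lf(src_path, dst_path, chunk_size=1 << 20):
    # Stream the file chunk by chunk so memory use stays bounded.
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        carry = b''
        while True:
            data = src.read(chunk_size)
            if not data:
                dst.write(carry)  # a lone \r at EOF is kept as-is
                break
            chunk = carry + data
            # A chunk may end right between \r and \n: hold back the
            # trailing \r and prepend it to the next chunk instead.
            if chunk.endswith(b'\r'):
                carry, chunk = b'\r', chunk[:-1]
            else:
                carry = b''
            dst.write(chunk.replace(b'\r\n', b'\n'))
```

Writing to a separate output file also means the original is not lost if the process is interrupted halfway through.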
Cheers and happy programming,
winklerrr

Related

How to detect what kind of break line in a text file in python?

My problem is the following: I have a text file with a bunch of lines in it, and this text might have been created on Windows, Unix, or Mac.
I want to open this text in Python (as a string block) and split it on the line breaks to get an array with all the lines at the end. The problem is I have only tested this with a Windows-created file, so I can easily split the string block on \n. But if I understand correctly, other environments use \r or \r\n, etc.
I want a general solution where I can detect what kind of line break is used in a file before I start splitting, in order to split it correctly. Is that possible to do?
thanks;
UNIX_NEWLINE = '\n'
WINDOWS_NEWLINE = '\r\n'
MAC_NEWLINE = '\r'
This is how the different operating systems apply line breaks in a file and how Python sees them.
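Assuming the file fits in memory, one way to detect the convention is to inspect the raw bytes, testing \r\n before the lone \r (a CRLF file contains both bytes, so order matters). This sketch is my own, not a standard recipe:

```python
UNIX_NEWLINE = '\n'
WINDOWS_NEWLINE = '\r\n'
MAC_NEWLINE = '\r'

def detect_newline(path):
    # Read raw bytes so Python performs no newline translation.
    with open(path, 'rb') as f:
        data = f.read()
    if b'\r\n' in data:        # test CRLF before the lone CR
        return WINDOWS_NEWLINE
    if b'\r' in data:
        return MAC_NEWLINE
    return UNIX_NEWLINE
```

For the splitting itself you may not need detection at all: str.splitlines() recognizes \n, \r\n and \r alike.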

Why do apostrophes (" ' ") turn into ▒'s when reading from a file in Python?

I used Bash to run a Python file. The Python file should read a UTF-8 file and display it in the terminal. It gives me a bunch of ▒'s ("Aaron▒s" instead of "Aaron's"). Here's the code:
# It reads text from a text file (done).
f = open("draft.txt", "r", encoding="utf8")
# It handles apostrophes and non-ASCII characters.
print(f.read())
I've tried different combinations of:
read formats with the open function ("r" and "rb")
strip() and rstrip() method calls
decode() method calls
text file encoding (specifically ANSI, Unicode, Unicode big endian, and UTF-8).
It still doesn't display apostrophes (" ' ") properly. How do I make it display apostrophes instead of ▒'s?
The issue is with Git Bash. If I switch to Powershell, Python displays the apostrophes (Aaron's) perfectly. The semantic read errors (Aaron▒s) appear only with Git Bash. I'll give more details if I learn more about it.
Update: @jasonharper and @entpnerd suggested that the draft.txt apostrophe might be "apostrophe-ish" and not a legitimate apostrophe. I compared the draft.txt apostrophe (copied and pasted from a Google Doc) with an apostrophe entered directly. They look different (’ vs. '). In xxd, the value for the apostrophe-ish character is 0x92; an actual apostrophe is 0x27. Git Bash only supports the latter (unless there's just something I need to configure, which is more likely).
Second Update: Clarified that I'm using Git Bash. I wasn't aware that there were multiple terminals (is that the right way of putting it?) that ran Bash.
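For what it's worth, a byte of 0x92 is the Windows-1252 encoding of the right single quotation mark (U+2019), the "smart quote" that word processors commonly substitute for a plain apostrophe. A small demonstration (the byte string here is illustrative):

```python
data = b'Aaron\x92s'  # the bytes xxd reported: 0x92 where the quote is

# Decoded as Windows-1252, 0x92 maps to U+2019 (a curly apostrophe)
print(data.decode('cp1252'))

# Decoded as UTF-8 it is invalid, which is why a UTF-8 terminal shows a
# replacement glyph instead of the character
try:
    data.decode('utf8')
except UnicodeDecodeError as err:
    print('not valid UTF-8:', err.reason)
```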

file transfer from Windows to Linux

I am exporting data to a csv file using SSIS. In my SSIS package I compress the file in zip format and upload it to a Linux server using SFTP. The problem is that on the destination file system, the csv files include a ^M character, which comes from the DOS line endings.
I found three solutions.
First, I could set the SFTP transfer mode to ASCII and not zip the file (I later found out this is only supported by FTP). Considering that my unzipped file is > 3 GB, that is not efficient; the upload would take ages.
Secondly, once the file is transferred, I could unzip it and convert it using the dos2unix utility, but dos2unix is not installed and I am not authorized to install it on the target system.
Finally, I could use a Unix tool like sed to remove the ^M from the end of each line. My file consists of more than 4 million lines and this would again take ages.
Q: Is there any way to preformat my file in ASCII using SSIS, then zip and transfer?
While searching on this issue I found some very useful links where the cause and possible resolutions of this issue are described:
How to remove CTRL-M (^M) characters from a file in Linux
Why are special characters such as “carriage return” represented as “^M”?
Cause
The file has been transferred between systems of different types with different newline conventions. For example, Windows-based text editors put a carriage return plus line feed (CR+LF) at the end of lines to denote a newline, which is displayed incorrectly on Linux (^M). This can be difficult to spot, as some applications or programs may handle the foreign newline characters properly while others do not; thus some services may crash or not respond correctly. Often this is because the file was created or edited on a Microsoft Windows machine and then uploaded or transferred to a Linux server, typically without ASCII or text mode.
Possible resolutions
(1) Using dos2unix command
dos2unix includes utilities to convert text files with DOS or MAC line breaks to Unix line breaks and vice versa. It also includes conversion of UTF-16 to UTF-8.
dos2unix
Dos2Unix / Unix2Dos - Text file format converters
dos2unix and unix2dos commands
You can use a similar command via Execute Process Task:
dos2unix filename
(2) Data Flow Task
You can create a Data Flow Task that transfers data from a Flat File Source into a new Flat File Destination, where both Flat File Connection Managers have the same structure except for the Row Delimiter property ({CR}{LF} in the source, {LF} in the destination).
Flat File Connection Manager Editor (Columns Page)
(3) Using a Script Task - StreamReader/Writer
You can use a script task with a similar code:
string data = null;

// Open and read the file
using (StreamReader srFileName = new StreamReader(FileName))
{
    data = srFileName.ReadToEnd();
    data = data.Replace("\r\n", "\n");
}

using (StreamWriter swFileName = new StreamWriter(FileName))
{
    swFileName.Write(data);
}
Replacing LF with CRLF in text file
(4) Extract using unzip -a
From the following unzip documentation:
-a
convert text files. Ordinarily all files are extracted exactly as they are stored (as ''binary'' files). The -a option causes files identified by zip as text files (those with the 't' label in zipinfo listings, rather than 'b') to be automatically extracted as such, converting line endings, end-of-file characters and the character set itself as necessary. (For example, Unix files use line feeds (LFs) for end-of-line (EOL) and have no end-of-file (EOF) marker; Macintoshes use carriage returns (CRs) for EOLs; and most PC operating systems use CR+LF for EOLs and control-Z for EOF. In addition, IBM mainframes and the Michigan Terminal System use EBCDIC rather than the more common ASCII character set, and NT supports Unicode.) Note that zip's identification of text files is by no means perfect; some ''text'' files may actually be binary and vice versa. unzip therefore prints ''[text]'' or ''[binary]'' as a visual check for each file it extracts when using the -a option. The -aa option forces all files to be extracted as text, regardless of the supposed file type. On VMS, see also -S.
So you can use the following command to extract text files with changing line endings:
unzip -a filename
Credit to @jww's comment
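If neither unzip -a nor dos2unix is available on the target, a rough equivalent can be sketched with Python's zipfile module; note that zipfile itself performs no end-of-line conversion, so the replacement is done manually (function and parameter names are my own):

```python
import zipfile

def extract_with_lf(zip_path, member, dest_path):
    # zipfile extracts bytes exactly as stored; convert CRLF ourselves
    with zipfile.ZipFile(zip_path) as zf:
        data = zf.read(member)
    with open(dest_path, 'wb') as f:
        f.write(data.replace(b'\r\n', b'\n'))
```

Unlike unzip -a, this makes no attempt to guess whether a member is text or binary, so only apply it to members you know are text.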
Other Useful links
How to Remove ^M in Linux & Unix
Remove CTRL-M characters from a file in UNIX
Convert DOS line endings to Linux line endings in vim
How to replace crlf with lf in a single file
How to remove Windows carriage returns for text files in Linux
I didn't try it, but I thought you could do the CR+LF -> LF conversion when outputting to the csv file. I looked at this link here
Scroll down to the section "Header row delimiter". It seems that if you choose {LF} as the row delimiter, the contents of your resulting .zip file will display correctly on your Linux box.
BTW, you probably know this, but I have to mention that ^M is the representation of CR on a Linux/Unix box.
BTW2, in most cases the ^M on Linux is not a problem, just an annoyance.
I hope I could help!

Python '\n' not working when writing to txt file on Amazon Linux

I have a python script which writes a to a txt log file using the following code:
with open('folder/log_{}.txt'.format(today), "a") as text_file:
    text_file.write(text_to_write)
The text_to_write is usually a string which includes '\n' to start a new line.
It works fine when running the script locally on my Windows machine. However, when I run it on an Amazon Linux instance it ignores the '\n' and all the text_to_write ends up on the same line. The '\n' isn't written to the log either; it is just ignored as if it wasn't there.
I can't find out why this is or how to resolve it so I can specify where to include a new line.
Many thanks
It’s really weird.
But anyway, since you want to write a text file, you should specify the character encoding:
text_to_write = "hello\n"
with open(path, mode="a", encoding="utf-8") as text_file:
    text_file.write(text_to_write)
But you can also use a logger. In your main function, you can initialize the logging configuration. For instance you can use a basic configuration:
import logging
logging.basicConfig(level=logging.INFO,
                    filename=path)
Of course, this configuration can be read from an INI file.
Then in every module you can define a logger and use it like this:
LOG = logging.getLogger(__name__)
LOG.info("hello")
The result is something like this:
INFO:module_name:hello
Edit
Clarification about newlines
For text stream, Python uses the concept of universal newline:
universal newlines
A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'.
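That behavior is easy to verify: a file mixing all three conventions reads back with plain \n everywhere when opened in text mode (the temporary path here is illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')

# Write all three line-ending conventions as raw bytes
with open(path, 'wb') as f:
    f.write(b'unix\nwindows\r\nmac\r')

# Text mode applies universal newlines on read: \r\n and \r become \n
with open(path, encoding='ascii') as f:
    print(repr(f.read()))  # 'unix\nwindows\nmac\n'
```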
If you want to write Windows specific text files on a Linux system, you need to force the newline to '\r\n', this can be done like this:
with open(path, mode="a", encoding="utf-8", newline="\r\n") as text_file:
    text_file.write(text_to_write)
But it is usually bad practice… you'll understand why below…
Notepad
Microsoft Notepad is not very clever: it cannot handle Unix text files correctly because it doesn't recognize '\n' as a newline, so every line is joined together and the text appears as a single line.
I recommend installing the famous Notepad++, which is a smart tool that can auto-detect the newline style when opening a file. You can also use it to change the style.
So, my advice is to use universal newline and install a good tool like Notepad++, or Sublime Text, etc.

Python 3: Issue writing new lines along side unicode to text file

I ran into an issue when writing the header of a text file in python 3.
I have a header that contains unicode AND new line characters. The following is a minimum working example:
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = 'regular ascii\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(bytes(str_2, 'UTF-8'))
The above works, except the output file does not have the new lines (it basically looks like I replaced '\n' with ''). Like the following:
██████████regular asciiregular ascii
I was expecting:
█████
█████
regular ascii
regular ascii
I have tried replacing '\n' with u'\u000A' and other characters based on similar questions - but I get the same result.
An additional, and probably related, question: I know I am making my life harder with the above encoding and byte methods. Still getting used to unicode in py3 so any advice regarding that would be great, thanks!
EDIT
Based on Ignacio's response and some more research: The following seems to produce the desired results (basically converting from '\n' to '\r\n' and ensuring the encoding is correct on all the lines):
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\r\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = '\r\nregular ascii\r\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(str_2.encode('utf8'))
Since you mentioned wanting advice using Unicode on Python 3...
You are probably using Windows since the \n isn't working correctly for you in binary mode. Linux uses \n line endings for text, but Windows uses \r\n.
Open the file in text mode and declare the encoding you want, then just write the Unicode strings. Below is an example that includes different escape codes for Unicode:
#coding:utf8
str_1 = '''\
\u2588\N{FULL BLOCK}\U00002588█
regular ascii'''
with open('my_log.txt', 'w', encoding='utf8') as my_file:
    my_file.write(str_1)
You can use a four-digit escape \uxxxx, eight-digit escape \Uxxxxxxxx, or the Unicode codepoint \N{codepoint_name}. The Unicode characters can also be directly used in the file as long as the #coding: declaration is present and the source code file is saved in the declared encoding.
Note that the default source encoding for Python 3 is utf8 so the declaration I used above is optional, but on Python 2 the default is ascii. The source encoding does not have to match the encoding used to open a file.
Use w or wt for writing text (t is the default). On Windows \n will translate to \r\n in text mode.
'wb'
The file is opened in binary mode, so \n isn't translated into the native newline format. If you open the file in a text editor that doesn't treat a lone LF as a line break character, all the text will appear on a single line in the editor. Either open the file in text mode with an appropriate encoding, or translate the newlines manually before writing.
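The difference between the two modes can be checked directly; the sketch below writes the same string both ways (the paths are illustrative):

```python
import os
import tempfile

text = 'line one\nline two\n'
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Binary mode: the bytes are written exactly as given, no translation
with open(path, 'wb') as f:
    f.write(text.encode('utf8'))
with open(path, 'rb') as f:
    assert f.read() == b'line one\nline two\n'

# Text mode with newline='\r\n': every '\n' is translated on write,
# regardless of the platform the code runs on
with open(path, 'w', encoding='utf8', newline='\r\n') as f:
    f.write(text)
with open(path, 'rb') as f:
    assert f.read() == b'line one\r\nline two\r\n'
```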