Replace CR by CR LF - text

I'm on Windows and I have an odd text file containing mostly CR+LF line endings. A few lines end with only CR. Which tool can I use to transform these odd lines into well-formatted (i.e. CR+LF terminated) lines?
I could use either GnuWin32 tools or Python to solve this.
The main problem I have is that I cannot open the file as a text file, since Python (like most other text processors, such as awk) doesn't recognize the mixed line endings. So I believe the solution must incorporate binary processing of the file.
Then again, I cannot just replace CR by CR LF, since there are also existing CR LF line endings that must not be touched.

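To replace the lone CRs you can use a regular expression:
\r(?!\n) finds a CR that is not followed by LF. (A plain \r+ would also match the CR inside every CR LF pair and turn it into CR LF LF.)
\r\n is the text you want as replacement text.
Regular expressions in Python:
import re
txt = 'text where you want to replace the linebreak'
# Match a CR only when no LF follows, so existing CR LF pairs stay untouched.
out = re.sub(r'\r(?!\n)', '\r\n', txt)
print(out)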
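Since the question asks about a file rather than a string, here is a minimal sketch of the same idea applied to raw bytes. The function names and file paths are hypothetical, and it assumes the file is small enough to read into memory at once.

```python
import re

# A CR that is not immediately followed by LF (bytes pattern for binary data).
LONE_CR = re.compile(rb"\r(?!\n)")


def normalize_lone_cr(data: bytes) -> bytes:
    """Turn every lone CR into CR LF; existing CR LF pairs are left alone."""
    return LONE_CR.sub(b"\r\n", data)


def fix_file(src_path: str, dst_path: str) -> None:
    """Hypothetical helper: rewrite src_path into dst_path with lone CRs fixed."""
    # Binary mode: Python performs no newline translation on read or write.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_lone_cr(data))


if __name__ == "__main__":
    sample = b"one\r\ntwo\rthree\r\n"
    print(normalize_lone_cr(sample))  # b'one\r\ntwo\r\nthree\r\n'
```

Working on bytes sidesteps the text-mode problem from the question entirely, because nothing interprets the line endings before the regular expression sees them.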


fs.readFileSync adds \r to the end of each string

I'm using let names = fs.readFileSync(namefile).toString().split("\n"). Whenever I do
for (const name of names) {
  console.log(`First Name: ${name.split(" ")[0]} Last Name: ${name.split(" ")[1]}`);
}
the last name part always has \r at the end, how do I make it not add the \r?
fs.readFileSync doesn't add anything to the end of lines; instead, the file that you're trying to read uses CRLF line endings, meaning that each line ends with the \r\n sequence.
Really your file looks something like this:
line1\r\nline2\r\nline3\r\n
But your text editor will hide these characters from you.
There are two different ways you can fix this problem.
Change the type of line endings used in your text editor
This is editor-specific, but if you use Visual Studio Code you can find the option in the status bar at the bottom right.
Clicking on it will allow you to change to LF line endings, where each line is followed by a single \n character.
Replace unwanted \r characters
Following on from your example we can use .replace to remove any \r characters.
let names = fs.readFileSync(namefile)
  .toString()
  .replace(/\r/g, "")
  .split("\n")

In Rust, is there a way to make literal newlines in r###"..."### using Windows convention?

I'm using Rust on Windows and I found
r###"abc
def"###
results in a string abc\ndef. Is there an easy way to make it abc\r\ndef? Or must I do the replacement manually?
The Rust compiler has converted all CRLF sequences to LF when reading source files since 2019 (see merge request, issue), and there is no way to change this behavior.
What you can do:
Use .replace("\n", "\r\n") at runtime to create a new String with CRLF line terminators.
Use regular instead of raw string literals and end your lines with \r, e.g.
"abc\r
def"
Use the std::include_str!() macro to include a file in UTF-8 format which contains the text with CRLF line terminators.

How to detect what kind of break line in a text file in python?

My problem is the following. I have a text file with a bunch of lines in it. The problem is this text might have been created by Windows or Unix or Mac.
I want to open this text in Python (as a string block) and split it on line breaks to get an array of all lines at the end. The problem is that I have only tested this with a Windows-created file, so I can split the string block easily on \n. But if I understand correctly, other environments use \r, \r\n, etc.
I want a general solution where I can detect what kind of line break is used in a file before I start splitting in order to split it correctly. Is that possible to do?
thanks;
UNIX_NEWLINE = '\n'
WINDOWS_NEWLINE = '\r\n'
MAC_NEWLINE = '\r'
These are the line endings that the different operating systems write to a file, and how they appear in Python.
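To actually detect which convention a file uses, one option is to count the three sequences in the raw bytes. The following is a minimal sketch; the function names are hypothetical, and it assumes an encoding such as ASCII or UTF-8 in which CR and LF are single bytes.

```python
from collections import Counter


def count_newlines(data: bytes) -> Counter:
    """Count CR LF pairs, lone LFs, and lone CRs in raw file bytes."""
    crlf = data.count(b"\r\n")
    return Counter({
        "\r\n": crlf,
        "\n": data.count(b"\n") - crlf,  # LFs that are not part of a CR LF
        "\r": data.count(b"\r") - crlf,  # CRs that are not part of a CR LF
    })


def dominant_newline(data: bytes):
    """Return the most frequent line ending, or None if there is none."""
    newline, hits = count_newlines(data).most_common(1)[0]
    return newline if hits else None


if __name__ == "__main__":
    sample = b"a\r\nb\nc\rd\r\n"
    print(count_newlines(sample))
    print(repr(dominant_newline(sample)))
```

If the goal is only to split the text into lines, detection may not be needed: opening a file in Python 3's default text mode translates all three endings to \n while reading.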

Python 3: Issue writing new lines along side unicode to text file

I ran into an issue when writing the header of a text file in python 3.
I have a header that contains unicode AND new line characters. The following is a minimum working example:
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = 'regular ascii\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(bytes(str_2, 'UTF-8'))
The above works, except the output file does not have the new lines (it basically looks like I replaced '\n' with ''). Like the following:
████████regular asciiregular ascii
I was expecting:
████
████
regular ascii
regular ascii
I have tried replacing '\n' with u'\u000A' and other characters based on similar questions - but I get the same result.
An additional, and probably related, question: I know I am making my life harder with the above encoding and byte methods. Still getting used to unicode in py3 so any advice regarding that would be great, thanks!
EDIT
Based on Ignacio's response and some more research: The following seems to produce the desired results (basically converting from '\n' to '\r\n' and ensuring the encoding is correct on all the lines):
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\r\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = '\r\nregular ascii\r\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(str_2.encode('utf8'))
Since you mentioned wanting advice using Unicode on Python 3...
You are probably using Windows since the \n isn't working correctly for you in binary mode. Linux uses \n line endings for text, but Windows uses \r\n.
Open the file in text mode and declare the encoding you want, then just write the Unicode strings. Below is an example that includes different escape codes for Unicode:
#coding:utf8
str_1 = '''\
\u2588\N{FULL BLOCK}\U00002588█
regular ascii'''
with open('my_log.txt', 'w', encoding='utf8') as my_file:
    my_file.write(str_1)
You can use a four-digit escape \uxxxx, an eight-digit escape \Uxxxxxxxx, or a named escape \N{character name}. The Unicode characters can also be used directly in the file as long as the #coding: declaration is present and the source code file is saved in the declared encoding.
Note that the default source encoding for Python 3 is utf8 so the declaration I used above is optional, but on Python 2 the default is ascii. The source encoding does not have to match the encoding used to open a file.
Use w or wt for writing text (t is the default). On Windows \n will translate to \r\n in text mode.
'wb'
The file is open in binary mode. As such \n isn't translated into the native newline format. If you open the file in a text editor that doesn't treat LF as a line break character then all the text will appear on a single line in the editor. Either open the file in text mode with an appropriate encoding or translate the newlines manually before writing.
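If the output should use CR LF regardless of the platform the script runs on, the newline argument of open() can request it explicitly in text mode. This is a minimal sketch; the temporary directory and the sample text are assumptions made so the example is self-contained.

```python
import os
import tempfile

text = "\u2588\u2588\u2588\nregular ascii\n"
path = os.path.join(tempfile.mkdtemp(), "my_log.txt")

# newline="\r\n" makes text mode write CR LF for every "\n" on any platform.
with open(path, "w", encoding="utf8", newline="\r\n") as my_file:
    my_file.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as raw:
    data = raw.read()

print(data)
```

With newline left at its default, the same code writes os.linesep instead, which is why the result differs between Windows and Linux.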

Carriage Return, Line Feed and New Line

What are the differences among Carriage Return, Line Feed and New line? Does it depend on OS? Why do we need to use all of them just for getting to next line?
Generally, a "new line" refers to any set of characters that is commonly interpreted as signaling a new line, which can include:
CR LF on DOS/Windows
CR on older Macs
LF on Unix variants, including modern Macs
CR is the Carriage Return ASCII character (Code 0x0D), usually represented as \r.
LF is the Line Feed character (Code 0x0A), usually represented as \n.
Original typewriter-based computers needed both of these characters, which do exactly what they say: CR returned the carriage to the left side of the paper, LF fed it through by one line. Windows kept this sequence unmodified, while Unix variants opted for more efficient character usage once they were only needed symbolically.
Make sure you look for a platform-agnostic new line symbol or function if you need to represent this sequence in code. If not, at least make sure that you account for the above three variants.
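As one illustration of a platform-agnostic function, Python's str.splitlines() handles all three variants in a single call, and os.linesep reports the convention of the current platform. This is only a sketch with made-up sample text; note that splitlines() also splits on a few other boundaries such as form feed and U+2028, which may or may not be wanted.

```python
import os

# One string mixing the three conventions described above.
text = "unix\nwindows\r\nold mac\rend"

# splitlines() treats LF, CR LF, and CR as line boundaries and drops them.
lines = text.splitlines()
print(lines)  # ['unix', 'windows', 'old mac', 'end']

# The line separator of the platform this script is running on.
print(repr(os.linesep))
```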
More on the history: The Great Newline Schism - Coding Horror
