When I save lines of data from Excel as tab-delimited .txt files and then open those files in Vim, what was a multi-line file in Excel shows up as a single line.
The "lines" can be separated in Vim using a substitution command:
%s/^M/\r\n/g
After this, the "lines" are now separated by an ^@.
I deal with it using another substitution command:
%s/^@//g
My questions are:
Why do my multi-line txt Excel files open as a single line in Vim?
What is ^@?
Is there a better way to 'fix' my txt files?
You can often fix problems like this by running the following command:
dos2unix FILE_NAME
This will re-format the file's newline characters in place (i.e. it will modify your file).
(This command is not installed by default on Mac OS X.)
If you don't have dos2unix (e.g. you're on a Mac), you can just use sed:
sed 's/^M$//' input.txt > output.txt
You can also use sed -i if you want to avoid creating a new file by performing the substitution in place.
You can enter ^M by typing Ctrl-V followed by Ctrl-M.
Try this command:
:%s/^M/\r/g
In the replacement part of a Vim substitution, \r inserts a line break. The ^M is a carriage return (CR) character that Vim displays literally.
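The same repair can be sketched from the shell. This assumes the exported file uses a lone CR as its line terminator (which would explain why Vim shows a single line); the file name export.txt is made up for illustration.

```shell
#!/bin/bash
# Hypothetical demo: build a CR-only file, as assumed for the Excel export.
workdir=$(mktemp -d)
printf 'a\tb\rc\td\r' > "$workdir/export.txt"

# With no LF bytes at all, line-oriented tools see one unterminated line.
lf_before=$(tr -cd '\n' < "$workdir/export.txt" | wc -c | tr -d ' ')

# Turn every CR into LF; write to a new file rather than onto the input.
tr '\r' '\n' < "$workdir/export.txt" > "$workdir/export.unix.txt"

lf_after=$(tr -cd '\n' < "$workdir/export.unix.txt" | wc -c | tr -d ' ')
echo "LF count before: $lf_before, after: $lf_after"
```

If the file turns out to use CR followed by LF instead, deleting the CR (tr -d '\r') is the matching fix.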
I need to read a file into an array and concatenate a string at the end of each line. Here is my bash script:
#!/bin/bash
IFS=$'\n' read -d '' -r -a lines < ./file.list
for i in "${lines[@]}"
do
tmp="$i"
tmp="${tmp}stuff"
echo "$tmp"
done
However, when I do this, an action of replace happens, instead of concatenation.
For example, in the file.list, we have:
http://www.example1.com
http://www.example2.com
What I need is:
http://www.example1.comstuff
http://www.example2.comstuff
But after executing the script above, I get things as below on the terminal:
stuff//www.example1.com
stuff//www.example2.com
Btw, my PC is Mac OS.
The problem also occurs while concatenating strings via awk, printf, and echo commands. For example echo $tmp"stuff" or echo "${tmp}""stuff"
The file ./file.list was most probably generated on a Windows system or, at least, saved using the Windows convention for end of line.
Windows uses a sequence of two characters to mark the end of lines in a text file. These characters are CR (\r) followed by LF (\n). Unix-like systems (Linux and macOS starting with version 10) use LF as end of line character.
The assignment IFS=$'\n' in front of read in your code tells read to use LF as line separator. read doesn't store the LF characters in the array it produces (lines[]) but each entry from lines[] ends with a CR character.
The line tmp="${tmp}stuff" does what it is supposed to do, i.e. it appends the word stuff to the content of the variable tmp (a line read from the file).
The first line read from the input file contains the string http://www.example1.com followed by the CR character. After the string stuff is appended, the content of variable tmp is:
http://www.example1.com$'\r'stuff
The CR character is not printable. It has a special interpretation when it is printed on the terminal: it sends the cursor to the start of the line (column 1) without changing the line.
When echo prints the line above, it prints (starting on a new line) http://www.example1.com, then the CR character that sends the cursor back to the start of the line, where it prints the string stuff. The stuff fragment overwrites the first 5 characters already printed on that line (http:) and the result, as it is visible on screen, is:
stuff//www.example1.com
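To see the hidden CR rather than its effect on the cursor, the value can be dumped byte by byte. A minimal sketch, using the same string as in the explanation above:

```shell
#!/bin/bash
# The string described above: the URL, a CR, then the appended word.
tmp=$'http://www.example1.com\rstuff'

# od -c prints each byte, so the CR shows up as \r instead of moving the cursor.
printf '%s\n' "$tmp" | od -c

# Count the CR bytes in the value.
cr_count=$(printf '%s' "$tmp" | tr -cd '\r' | wc -c | tr -d ' ')
echo "CR bytes: $cr_count"
```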
The solution is to get rid of the CR characters from the input file. There are several ways to accomplish this goal.
A simple way to remove the CR characters from the input file is to use the command:
sed -i.bak s/$'\r'//g file.list
It removes all the CR characters from the content of file file.list, saves the updated string back into the file.list file and stores the original file.list file as file.list.bak (a backup copy in case it doesn't produce the output you expect).
Another way to get rid of the CR character is to ask the shell to remove it in the command where stuff is appended:
tmp="${tmp/$'\r'/}stuff"
When a variable is expanded in a construct like ${tmp/a/b}, the first occurrence of a in $tmp is replaced with b (${tmp//a/b}, with a double slash, replaces all occurrences). In this case we replace \r with nothing.
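Putting the pieces together, one possible version of the loop reads the file line by line and drops a trailing CR before appending. This is a sketch, assuming bash and a file with CRLF line endings; the temporary file stands in for file.list.

```shell
#!/bin/bash
# Hypothetical input with Windows (CRLF) line endings.
list=$(mktemp)
printf 'http://www.example1.com\r\nhttp://www.example2.com\r\n' > "$list"

out=()
# The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
while IFS= read -r line || [ -n "$line" ]; do
    line=${line%$'\r'}        # drop one trailing CR, if present
    out+=("${line}stuff")
done < "$list"

printf '%s\n' "${out[@]}"
rm -f "$list"
```

Here ${line%$'\r'} only touches a CR at the very end of the line, so a CR elsewhere in the data would be left alone.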
I'm guessing it has something to do with the Carriage Return character.
Was your file.list created on Windows? If so, try to use dos2unix before running the script.
Edit
You can check your files using the file command.
Example:
file file.list
If you saved the file in Windows Notepad, then it will probably come up like this:
file.list: ASCII text, with no line terminators
You can use built in tools like iconv to convert the encodings. However for a simple use like this, you can just use a command that works for multiple encodings without any conversion necessary.
You could simply buffer the file through cat, and use a regular expression that applies to either:
Carriage return followed by line terminator, or
Line terminator on its own
Then append the string.
Example:
cat file.list | grep -E -v "^$" | sed -E -e "s/(\r?$)/stuff/g"
Will work with ASCII text, and ASCII text with no line terminators.
If you need to modify a stream to append a fixed string, you can use sed or awk, for instance:
sed 's/$/stuff/'
to append stuff to the end of each line.
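An awk variant can strip an optional trailing CR and append the suffix in one pass. This is only a sketch; append_suffix is a made-up helper name.

```shell
#!/bin/bash
# append_suffix is a hypothetical helper: strip an optional trailing CR
# from each input line, then append the given suffix.
append_suffix() {
    awk -v suffix="$1" '{ sub(/\r$/, ""); print $0 suffix }'
}

# One CRLF line and one LF line, to show both are handled.
result=$(printf 'http://www.example1.com\r\nhttp://www.example2.com\n' | append_suffix stuff)
printf '%s\n' "$result"
```

Note that awk -v interprets backslash escapes in the assigned value, so a suffix containing backslashes would need extra care.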
Running "dos2unix file.list" would also solve the problem.
I have a big text file (URL.txt) and I wish to perform the following using a single sed command:
Find and replace text 'google' with 'facebook' between line numbers 19 and 33.
Display the output on the terminal without altering the original file.
You can use sed addresses:
sed '19,33s/google/facebook/g' file
This will run the substitution on lines between and including 19 and 33.
The form of a sed command is as follows:
[address[,address]]function[arguments]
Here 19,33 are the addresses,
s (substitute) is the function,
and g (global) is one of its arguments.
The above answer ALMOST worked for me on Mac OS X.
sed '19,33s/google/facebook/' file
worked perfectly without braces.
sed '19,$s/google/facebook/' file
works until the end of the file as well.
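The address forms used in this thread can be tried on a small generated file. A sketch with a made-up five-line input, using lines 2 to 4 instead of 19 to 33:

```shell
#!/bin/bash
# Hypothetical five-line input; every line contains the word to replace.
input=$(mktemp)
for n in 1 2 3 4 5; do echo "$n google"; done > "$input"

# Substitute on lines 2 through 4 only.
ranged=$(sed '2,4s/google/facebook/g' "$input")
# Substitute from line 4 through the last line ($ is the last-line address).
to_end=$(sed '4,$s/google/facebook/g' "$input")

printf '%s\n---\n%s\n' "$ranged" "$to_end"
rm -f "$input"
```

Neither command uses -i, so the input file itself is left unchanged and the result only goes to standard output.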
I have a lot of files that end in the classical ^M, an artifact from my Windows times. As this is all source code, git actually thinks those files changed, so I want to remove those nasty lines once and for all.
Here is what I created:
sed -i 's/^M//g' file
But that does not work. Of course I did not type a literal ^M but rather ^V^M (Ctrl-V, Ctrl-M). In vim it works (:%s/^M//g) and if I modify it like this:
sed -i 's/^M/a/g' file
It also works, i.e. it ends every line with an 'a'. It also works to do this:
sed -i 's/random_string//g' file
Where random_string exists in the file. So I can replace ^M by any character and I can remove lines but I cannot remove ^M. Why?
Note: It is important that it is just removed, no replacing by another invisible char or something. I would also like to avoid double execution and adding an arbitrary string and removing it afterwards. I want to understand why this fails (but it does not report an error).
That character is matched with \r by sed. Use:
sed -e "s/\r//g" input-file
For my case, I had to do
sed -e "s/\r/\n/g" filename.csv
After that, wc -l filename showed the correct output instead of 0 lines.
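Since \r in a sed pattern is not specified by POSIX and GNU and BSD sed disagree on how -i takes its backup suffix, a tr-based loop is one way to sidestep both. This is a sketch; strip_cr is a made-up helper and the .c files are demo data.

```shell
#!/bin/bash
# strip_cr is a hypothetical helper: rewrite each named file without CR bytes.
strip_cr() {
    local f tmp
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # Go through a temp file; redirecting onto the input would truncate it.
        tr -d '\r' < "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}

demo=$(mktemp -d)
printf 'int x;\r\nint y;\r\n' > "$demo/a.c"
printf 'no cr here\n' > "$demo/b.c"
strip_cr "$demo"/*.c

remaining=$(cat "$demo"/*.c | tr -cd '\r' | wc -c | tr -d ' ')
echo "CR bytes left: $remaining"
```

tr -d '\r' removes every CR in the file, not only those at line ends, which matches the request that the character simply be removed.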
I have a csv file into which some ^M DOS line ends have crept, and I want to get rid of them, as well as the 16 spaces and 3 tabs which follow. That is, I have to merge that line with the next one down. Here's an offending record and a good one as a sample of what I mean:
"Mary had a ^M
little lamb", "Nursery Rhyme", 1878
"Mary, Mary quite contrary", "Nursery Rhyme", 1838
I can remove the ^M using sed as you can see, but I cannot work out how to remove the Unix line end to join the lines back up.
sed -e "s/^M$ //g" rhymes.csv > rhymes.csv
UPDATE
Then I read "However, the Microsoft CSV format allows embedded newlines within a double-quoted field. If embedded newlines within fields are a possibility for your data, you should consider using something other than sed to work with the data file." from:
http://sed.sourceforge.net/sedfaq4.html
So I am editing my question to ask: which tool should I be using?
With help from How can I replace a newline (\n) using sed?, I made this one:
sed -e ':a;N;$!ba;s/\r\n \{16\}\t\t\t/=/' -i rhymes.csv
<CR> <LF> <16 spaces> <3 tabs>
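As an alternative to slurping the whole file into sed's pattern space, awk can hold a line that ends in CR and glue the next line onto it. This is a sketch under the assumption that a CR only ever appears at a broken record, as in the sample; join_broken_records is a made-up name and this is not a general CSV parser.

```shell
#!/bin/bash
# join_broken_records is a hypothetical helper, not a general CSV parser.
join_broken_records() {
    awk '
        joining { sub(/^[ \t]+/, ""); $0 = held $0; joining = 0 }  # glue onto the held part
        /\r$/   { sub(/\r$/, ""); held = $0; joining = 1; next }   # hold a line ending in CR
                { print }
        END     { if (joining) print held }                        # input ended mid-record
    '
}

# The sample from the question: CR, LF, 16 spaces, 3 tabs inside the first record.
sample=$(printf '"Mary had a \r\n                \t\t\tlittle lamb", "Nursery Rhyme", 1878\n"Mary, Mary quite contrary", "Nursery Rhyme", 1838\n' | join_broken_records)
printf '%s\n' "$sample"
```

The leading-whitespace rule strips any run of spaces and tabs, so it does not depend on there being exactly 16 spaces and 3 tabs.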
If you just want to delete the CR, you could use:
tr -d "\r" < yourfile > yourfile.tmp && mv yourfile.tmp yourfile
(or, if the input and output files are different: tr -d "\r" < yourfile > output)
dos2unix file_name
to convert file, or
dos2unix -n old_file new_file
to create new file.
I've written a script that cleans up .csv files, removing some bad commas and bad quotes (bad, means they break an in house program we use to transform these files) using sed:
# remove all commas, and re-insert the good commas using clean.sed
sed -f clean.sed $1 > $1.1st
# remove all quotes
sed 's/\"//g' $1.1st > $1.tmp
# add the good quotes around good commas
sed 's/\,/\"\,\"/g' $1.tmp > $1.tmp1
# add leading quotes
sed 's/^/\"/' $1.tmp1 > $1.tmp2
# add trailing quotes
sed 's/$/\"/' $1.tmp2 > $1.tmp3
# remove utf characters
sed 's/<feff>//' $1.tmp3 > $1.tmp4
# replace original file with new stripped version and delete .tmp files
cp -rf $1.tmp4 quotes_$1
Here is clean.sed:
s/\",\"/XXX/g;
:a
s/,//g
ta
s/XXX/\",\"/g;
Then it removes the temp files and voilà, we have a new file that starts with the word "quotes" that we can use for our other processes.
My question is:
Why do I have to make a sed statement to remove the feff tag in that temp file? The original file doesn't have it, but it always appears in the replacement. At first I thought cp was causing this but if I put in the sed statement to remove before the cp, it isn't there.
Maybe I'm just missing something...
U+FEFF is the code point for a byte order mark. Your files most likely contain data saved in UTF-16 and the BOM has been corrupted by your 'cleaning process' which is most likely expecting ASCII. It's probably not a good idea to remove the BOM, but instead to fix your scripts to not corrupt it in the first place.
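Before deciding how to handle the mark, it may help to check which bytes the file actually starts with. The sketch below classifies a file by its first bytes; bom_kind is a made-up helper and the sample file is demo data.

```shell
#!/bin/bash
# bom_kind is a hypothetical helper: classify a file by its first bytes.
bom_kind() {
    local hex
    hex=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case $hex in
        efbbbf) echo "UTF-8 BOM" ;;
        fffe*)  echo "UTF-16 LE BOM" ;;   # a UTF-32 LE BOM also starts with FF FE
        feff*)  echo "UTF-16 BE BOM" ;;
        *)      echo "no BOM" ;;
    esac
}

f=$(mktemp)
printf '\357\273\277"a","b"\n' > "$f"   # octal escapes for the bytes EF BB BF
kind=$(bom_kind "$f")
echo "$kind"
rm -f "$f"
```

Running it on the original file and on each intermediate .tmp file would show at which step the mark starts to appear.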
To get rid of these in GNU emacs:
Open Emacs
Do a find-file-literally to open the file
Edit off the leading three bytes
Save the file
There is also a way to convert files with DOS line termination convention to Unix line termination convention.