Why do SED, GREP or AWK fail to remove blank lines from text files?

I am trying to remove blank lines from large text files. For some reason it seems that neither
sed "/^$/d" file.txt > trimmed.txt
nor
grep -v "^$" file.txt > trimmed.txt
nor
awk /./ file.txt > trimmed.txt
do anything. Any thoughts?
UPDATE
Thanks to the great comments by @fedorqui and @Sebastian Stigler, the problem was quickly identified as DOS/Windows carriage returns (^M$) at the end of each line.
While I appreciate Sebastian's suggestion to reformat the files using dos2unix, I would rather have a solution using tools generally available in most Linux distributions.
The solution that worked for me was an answer given by @Jeremy Stein to the question "Can't remove empty lines with sed regex":
sed -n '/[!-~]/p' file.txt > trimmed.txt

I just tried the commands with a toy example, and they work fine as long as file.txt has Unix newlines. If the file contains Windows newlines (CRLF), none of the commands remove the blank lines, because each "blank" line still contains a carriage return.
You can use the dos2unix tool to convert the newlines in file.txt to Unix newlines. If you need the output on a Windows system, you can use unix2dos to convert trimmed.txt back into a file with Windows newlines.
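The effect described in this answer can be reproduced with a small sketch. It assumes a POSIX shell; the helper name strip_blank_lines and the sample file crlf.txt are made up for illustration. It builds a three-line file with Windows line endings, shows that the plain /^$/d pattern keeps the "blank" line, and then removes the carriage returns first using only tr and grep.

```shell
#!/bin/sh
# Hypothetical helper: drop carriage returns, then delete lines that are
# empty or contain only whitespace. Reads stdin, writes Unix-style output.
strip_blank_lines() {
    tr -d '\r' | grep -v '^[[:space:]]*$'
}

workdir=$(mktemp -d)
# Three lines with CRLF endings; the middle one is "blank" (just CR LF).
printf 'first\r\n\r\nsecond\r\n' > "$workdir/crlf.txt"

# The plain pattern sees a CR on the blank line, so nothing is deleted.
before=$(sed '/^$/d' "$workdir/crlf.txt" | wc -l | tr -d ' ')
# Once the CRs are gone, the blank line matches and is dropped.
after=$(strip_blank_lines < "$workdir/crlf.txt" | wc -l | tr -d ' ')

echo "lines kept by sed '/^\$/d': $before"
echo "lines kept after stripping CR: $after"
rm -rf "$workdir"
```

Note that tr -d '\r' removes every carriage return, not only those at the end of a line, which is usually what you want for text files but is an assumption worth checking on your data.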

Related

sed not responding for me

I'm trying to use sed but can't get it to work. My system is Ubuntu 16.04.2 with GNU sed 4.2.2
I have a folder with numerous text files that I want to edit. From a terminal in that folder I have tried the following (separately - one command at a time).
To remove blank lines from all the files: sed -i '/^$/d' *txt
In case the lines include white spaces: sed -i '/^\s*$/d' *txt
To remove line 4: sed -i '4d' *.txt
In each case, there is no error message in the terminal but the changes do not happen. I've tried the same commands but for an individual file rather than all, so with the filename instead of *.txt, but still no changes achieved.
The only sign of sed being active at all is if I drop the -i and ask for a new file instead: sed '4d' fred.txt > fred2.txt
The new file fred2.txt was created - but with line 4 still there!
What am I missing? How can I get sed to actually carry out these commands?
Thanks for any help.

After looking at the files in more detail, and doing some experimenting, I discovered what the problem was: The text files I was trying to edit with sed had CR line-end characters. When I changed this to LF the sed commands worked fine.
Thank you to those who made suggestions.
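For anyone hitting the same symptom, here is a rough sketch of how the line-ending style could be checked and normalised with standard tools. The function names eol_style and to_lf are hypothetical, and the classification is only a heuristic based on counting CR and LF bytes. With CR-only endings sed sees the whole file as a single line, which would explain why '4d' appeared to do nothing.

```shell
#!/bin/sh
# Hypothetical helper: guess the line-ending style of a file by counting
# CR and LF bytes. This is a heuristic, not a full parser.
eol_style() {
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    if [ "$cr" -eq 0 ]; then echo "LF"
    elif [ "$lf" -eq 0 ]; then echo "CR"
    elif [ "$cr" -eq "$lf" ]; then echo "CRLF"
    else echo "mixed"
    fi
}

# Hypothetical helper: write the file to stdout with LF line endings.
to_lf() {
    case $(eol_style "$1") in
        CR) tr '\r' '\n' < "$1" ;;   # CR-only: turn each CR into LF
        *)  tr -d '\r' < "$1" ;;     # CRLF or mixed: just drop the CRs
    esac
}

demo=$(mktemp)
printf 'one\rtwo\rthree\rfour\rfive\r' > "$demo"
echo "style: $(eol_style "$demo")"
echo "lines sed sees before conversion: $(sed -n '$=' "$demo")"
echo "lines sed sees after conversion:  $(to_lf "$demo" | sed -n '$=')"
# Now deleting line 4 behaves as expected.
to_lf "$demo" | sed '4d'
rm -f "$demo"
```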

remove \n and keep space in linux

I have a file that contains a hidden \n at the end of each line:
input:
s3741206\n
s2561284\n
s4411364\n
s2516482\n
s2071534\n
s2074633\n
s7856856\n
s11957134\n
s682333\n
s9378200\n
s1862626\n
I want to remove the trailing \n from each line.
desired output:
s3741206
s2561284
s4411364
s2516482
s2071534
s2074633
s7856856
s11957134
s682333
s9378200
s1862626
However, when I try this:
tr -d '\n' < file1 > file2
everything is joined into a single line, without any spaces or newlines:
s3741206s2561284s4411364s2516482s2071534s2074633s7856856s11957134s682333s9378200s1862626
I also tried sed $'s/\n//g' -i file1, but it doesn't work on macOS.
Thank you.

This is a possible solution using sed:
sed 's/\\n/ /g'

With awk:
awk '{sub(/\\n/,"")} 1' < file1 > file2

What you are describing so far in your question+comments doesn't make sense. How can you have a multi-line file with a hidden newline character at the end of each line? What you show as your input file:
s3741206\n
s2561284\n
s4411364\n
etc.
where each "\n" above according to your comment is a single newline character "\n" is impossible. If those "\n"s were newline characters then your file would simply look like:
s3741206
s2561284
s4411364
etc.
There's really only 2 possibilities I can think of:
1. You are wrongly interpreting what you are seeing in your input file and/or using the wrong terminology and you actually DO have \r\n at the end of every line. Run cat -v file to see the \rs as ^Ms and run dos2unix or similar (e.g. sed 's/\r$//' file) to remove the \rs - you do not want to remove the \ns or you will no longer have a POSIX text file and so POSIX tools will exhibit undefined behavior when run on it. If that doesn't work for you then copy/paste the output of cat -v file into your question so we can see for sure what is in your file.
2. It's also entirely possible that your file is a perfectly fine POSIX text file as-is and you are incorrectly assuming you will have a problem for some reason, so also include in your question a description of the actual problem you are having, an example of the command you are executing on that input file, the output you are getting and the output you expected to get.
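The inspection step suggested here can be sketched as follows. This assumes a cat that supports -v (GNU, BSD and BusyBox do, although -v is not required by POSIX); the helper name strip_trailing_cr is invented for the example. Because not every sed understands \r (the asker mentions macOS), the sketch passes a literal carriage return to sed instead.

```shell
#!/bin/sh
# Hypothetical helper: remove one trailing CR per line and keep the newline,
# so the result is still a POSIX text file.
strip_trailing_cr() {
    cr=$(printf '\r')        # a literal CR, for seds that do not know "\r"
    sed "s/${cr}\$//"
}

sample=$(mktemp)
printf 's3741206\r\ns2561284\r\n' > "$sample"

echo "--- as seen by cat -v (each CR shows up as ^M):"
cat -v "$sample"

echo "--- after stripping the CRs:"
strip_trailing_cr < "$sample" | cat -v

rm -f "$sample"
```

If cat -v shows no ^M at all and the lines literally end in a backslash followed by n, the file contains those two characters as text, and the sed and awk answers that match \\n are the ones that apply.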

You could use bash-native string substitution:
$ cat /tmp/newline
s3741206\n
s2561284\n
s4411364\n
s2516482\n
s2071534\n
s2074633\n
s7856856\n
s11957134\n
s682333\n
s9378200\n
s1862626\n
$ while IFS= read -r LINE; do echo "${LINE%\\n}"; done < /tmp/newline
s3741206
s2561284
s4411364
s2516482
s2071534
s2074633
s7856856
s11957134
s682333
s9378200
s1862626

delete ';' at the end of each line

I have a huge (10+ GB) .csv file on a Linux server. The lines look something like this:
6;20000327;20000425;990099,0;20000327;LL;UBXO;7;-1;62;F;30;001;NO;NO;wgB;0;99;0002;5530;001;708;196;1;AA;N;N;100;53,81;0;0;0;1;1;;1;
6;20000327;20000425;990099,0;20000425;LL;OLD*;62;62;92;F;30;001;NO;NO;ueB;0;99;0002;XXXX;001;;;1;AA;N;N;;;0;0;1;0;0;;30;
I am searching for a fast script to do the following:
change any occurrence of <number>,<number> to <number>.<number>
delete the last semicolon of each line
I especially have problems with the second one, because the script should work whether the file has Linux or Windows line endings.
I tried to do it with sed but have failed thus far.
[edit]
I finally used a mix of Dennis Williams' and SiegeX's solutions:
sed 's/;\([0-9]*\),\([0-9]*\);/;\1.\2;/g;s/;\(\r\?\)$/\1/' inputfile
(the variant with s/;[[:blank:]]*$// didn't work on my file...)

sed 's/;\([0-9]*\),\([0-9]*\);/;\1.\2;/g;s/;[[:blank:]]*$//' ./infile

$ cat file
6;20000327;20000425;990099,0;20000327;LL;UBXO;7;-1;62;F;30;001;NO;NO;wgB;0;99;0002;5530;001;708;196;1;AA;N;N;100;53,81;0;0;0;1;1;;1;
6;20000327;20000425;990099,0;20000425;LL;OLD*;62;62;92;F;30;001;NO;NO;ueB;0;99;0002;XXXX;001;;;1;AA;N;N;;;0;0;1;0;0;;30;
$ perl -p -e 's/(\d+),(\d+)/$1.$2/g; s/;$//' file
6;20000327;20000425;990099.0;20000327;LL;UBXO;7;-1;62;F;30;001;NO;NO;wgB;0;99;0002;5530;001;708;196;1;AA;N;N;100;53.81;0;0;0;1;1;;1
6;20000327;20000425;990099.0;20000425;LL;OLD*;62;62;92;F;30;001;NO;NO;ueB;0;99;0002;XXXX;001;;;1;AA;N;N;;;0;0;1;0;0;;30
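Note: perl only translates line endings for you on platforms where its text-mode I/O does so (e.g. Windows). On Linux a trailing \r stays in the line, so use s/;(\r?)$/$1/ instead of s/;$// if the file may have Windows line endings.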

Give this a try:
sed 's/,/./g;s/;\r\?$//' inputfile
To preserve the carriage return if it's there:
sed 's/,/./g;s/;\(\r\?\)$/\1/' inputfile
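As a hedged sketch of the same idea in a form that does not rely on GNU sed's \r and \? extensions, the function below passes a literal carriage return to sed and uses the POSIX interval \{0,1\}. The name fix_csv_line_ends is made up for this example. Unlike s/,/./g it only rewrites commas that sit between two digits, which matches the <number>,<number> wording of the question but is an assumption about what the data needs.

```shell
#!/bin/sh
# Hypothetical helper: change <digit>,<digit> to <digit>.<digit> and drop the
# final ";" of each line, keeping a trailing CR if the line has one.
fix_csv_line_ends() {
    cr=$(printf '\r')
    sed -e 's/\([0-9]\),\([0-9]\)/\1.\2/g' \
        -e "s/;\(${cr}\{0,1\}\)\$/\1/"
}

# One line with a Unix ending and one with a Windows ending.
printf '6;990099,0;LL;53,81;;1;\n'   | fix_csv_line_ends | cat -v
printf '6;990099,0;LL;53,81;;1;\r\n' | fix_csv_line_ends | cat -v
```

For a 10+ GB file, run it as a filter (fix_csv_line_ends < inputfile > outputfile) so that nothing has to be held in memory.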

If you are handy with perl, you can use a perl one-liner to do these things. Here's an example of how you might do the number change:
perl -i -pe 's/(\d),(\d)/$1.$2/g' yourfile
Be very careful with the -i option, as it causes perl to operate on the existing file in place.

Best tool to remove dos line ends and join line back up again

I have a csv file into which some ^M DOS line ends have crept, and I want to get rid of them, as well as the 16 spaces and 3 tabs which follow. In other words, I have to merge that line with the next one down. Here's an offending record and a good one as a sample of what I mean:
"Mary had a ^M
little lamb", "Nursery Rhyme", 1878
"Mary, Mary quite contrary", "Nursery Rhyme", 1838
I can remove the ^M using sed as you can see, but I cannot work out how to remove the Unix line end to join the lines back up.
sed -e "s/^M$ //g" rhymes.csv > rhymes.csv
UPDATE
Then I read "However, the Microsoft CSV format allows embedded newlines within a double-quoted field. If embedded newlines within fields are a possibility for your data, you should consider using something other than sed to work with the data file." from:
http://sed.sourceforge.net/sedfaq4.html
So I am editing my question to ask: which tool should I be using?

With help from How can I replace a newline (\n) using sed?, I made this one:
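sed -e ':a;N;$!ba;s/\r\n \{16\}\t\{3\}/=/g' -i rhymes.csv
The pattern matches <CR><LF>, then 16 spaces, then 3 tabs, and replaces each occurrence with = (use an empty replacement to simply join the lines).
If you just want to delete the CRs, write to a temporary file and move it over the original, because reading from and writing to the same file in one pipeline can truncate it:
tr -d '\r' < yourfile > yourfile.tmp && mv yourfile.tmp yourfile
(or, if the input and output files are different: tr -d '\r' < yourfile > output)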
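An alternative that avoids loading the whole file into sed's pattern space is to join the lines with awk. The sketch below assumes that every carriage return in the file marks a broken record and that the file otherwise uses LF line endings; the function name join_broken_records is hypothetical. It strips the CR, prints the line without a newline, and removes the leading spaces and tabs from the continuation line, whatever their exact count.

```shell
#!/bin/sh
# Hypothetical helper: join a line that ends in CR with the line after it,
# dropping the CR and the indentation that follows the break.
join_broken_records() {
    awk '
        pending        { sub(/^[ \t]+/, ""); pending = 0 }  # strip indent after a break
        sub(/\r$/, "") { printf "%s", $0; pending = 1; next }  # hold the line open
                       { print }
    '
}

# The broken record from the question: CR LF, 16 spaces, 3 tabs.
printf '"Mary had a \r\n%16s\t\t\tlittle lamb", "Nursery Rhyme", 1878\n' '' |
    join_broken_records
```

This still treats the file as lines rather than as CSV records, so a quoted field that legitimately contains a newline would need a real CSV parser, as the sed FAQ quoted in the question warns.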

dos2unix file_name
to convert the file in place, or
dos2unix -n old_file new_file
to write the result to a new file.

Replacing a line in a csv file?

I have a set of 10 CSV files, which normally have entries of this kind:
a,b,c,d
d,e,f,g
Now, due to some error, entries in these files have become like this:
a,b,c,d
d,e,f,g
,,,
h,i,j,k
Now I want to remove the line with only commas in all the files. These files are on a Linux filesystem.
Can you recommend a command that replaces the erroneous lines in all the files?

It depends on what you mean by replace. If you mean 'remove', then a trivial variant on @wnoise's solution is:
grep -v '^,,,$' old-file.csv > new-file.csv
Note that this deletes just those lines with exactly three commas. If you want to delete malformed lines with any number of commas (including zero) and no other characters on the line, then:
grep -v '^,*$' ...
There are endless other variations on the regex that would deal with other scenarios. Dealing with full CSV data with commas inside quotes starts to need something other than a regex machine. It can be done, within broad limits, especially in more complex regex systems such as PCRE or Perl. But it requires more work.
Check out Mastering Regular Expressions.

sed 's/,,,/replacement/' < old-file.csv > new-file.csv
optionally followed by
mv new-file.csv old-file.csv

Replace or remove, your post is not clear... For replacement see wnoise's answer. For removing, you could use
awk '$0 !~ /,,,/ {print}' <old-file.csv > new-file.csv

What about trying to keep only lines which are matching the desired format instead of handling one exception ?
If the provided input is what you really want to match:
grep -E '[a-z],[a-z],[a-z],[a-z]' < oldfile.csv > newfile.csv
If the input is different, provide it, the regular expression should not be too hard to write.

Do you want to replace them with something, or delete them entirely? Either way, it can be done with sed. To delete:
sed -i -e '/^,\+$/ D' yourfile1.csv yourfile2.csv ...
To replace: well, see wnoise's answer, or if you don't want to create new files with the output,
sed -i -e '/^,\+$/ s//replacement/' yourfile1.csv yourfile2.csv ...
or
sed -i -e '/^,\+$/ c\
replacement' yourfile1.csv yourfile2.csv ...
(that should be entered exactly as is, including the line break). Of course, you can also do this with awk or perl or, if you're only deleting lines, even grep:
grep -E -v '^,+$' < oldfile.csv > newfile.csv
I tested these to make sure they work, but I'd advise you to do the same before using them (just in case). You can omit the -i option from sed, in which case it'll print out the results (rather than writing them back to the file), or omit the output redirection >newfile.csv from grep.
EDIT: It was pointed out in a comment that some features of these sed commands only work on GNU sed. As far as I can tell, these are the -i option (which can be replaced with shell redirection, sed ... <infile >outfile ) and the \+ modifier (which can be replaced with \{1,\} ).
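Putting the portability notes above together, a sketch for cleaning a whole directory of CSV files without GNU-only features might look like this. The function name remove_comma_only_lines is invented for the example, mktemp is assumed to be available (it is on most Linux and BSD systems but is not part of POSIX), and the optional trailing carriage return is an extra assumption in case the files came from Windows.

```shell
#!/bin/sh
# Hypothetical helper: delete lines made up only of commas (optionally
# followed by a CR) from every .csv file in the given directory.
remove_comma_only_lines() {
    cr=$(printf '\r')
    for f in "$1"/*.csv; do
        [ -f "$f" ] || continue       # no matches: the glob stays literal
        tmp=$(mktemp) || return 1
        # \{1,\} is the portable spelling of GNU sed's \+
        sed "/^,\{1,\}${cr}\{0,1\}\$/d" "$f" > "$tmp" &&
            cat "$tmp" > "$f"         # overwrite in place, keeping permissions
        rm -f "$tmp"
    done
}

# Usage sketch: build a sample file, clean it, and show the result.
dir=$(mktemp -d)
printf 'a,b,c,d\nd,e,f,g\n,,,\nh,i,j,k\n' > "$dir/sample.csv"
remove_comma_only_lines "$dir"
cat "$dir/sample.csv"
rm -rf "$dir"
```

Try it on copies of the files first, since the originals are overwritten.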

Most simply:
$ grep -v ',,,' oldfile > newfile
$ mv newfile oldfile

Yes, awk or grep are very good options if you are working on a Linux platform. On other platforms you can use Perl regular expressions together with its join and split functions.
