perl output messed up in fedora, ubuntu - linux

I wrote a perl script for mapping two data sets. When I run the program using the Linux terminal, the output is messed up. It seems like the output is overlapping. I am using Fedora 25. I have tried the code on Windows and it works fine.
The same problem occurs on Ubuntu as well.
DESIRED:
ADAM 123 JOHN 321
TOM 473 BENTLY 564
and so on....
OUTPUT that I am getting:
ADAM 123N 321
TOM 473TLY 564
and so on......
I have tested the code on Windows and it works perfectly fine, but the same problem remains on Ubuntu 16.04 LTS.
Please help.
code:
use warnings;
open F, "friendship_network_wo_weights1.txt" or die "Cannot open network file: $!";
open G, "username_gender_1.txt" or die "Cannot open gender file: $!";
while (<G>) {
    chomp $_;
    my @a = split /\t/, $_;
    $list{$a[0]} = $a[1];
}
close G;
while (<F>) {
    chomp $_;
    my @b = split /\t/, $_;
    if ((exists $list{$b[0]}) && (exists $list{$b[1]})) {
        $get = "$b[0]\t${list{$b[0]}}\t$b[1]\t${list{$b[1]}}\n";
        $get =~ s/\r//g;
        print "$get";
    }
}
close F;

The problem is that on Windows the newline is \r\n, while on Unix-like systems it's \n. Assuming these files were created on Windows, when you read them on Unix each line will still have a trailing \r after the chomp.
\r is the "carriage return" character. On an old typewriter you had to move the whole typehead back to the left side at the end of a line, and computer terminals started out as fancy typewriters called teleprinters. When you print \r, the cursor moves back to the beginning of the line, and anything you print after that overwrites what was there. Here's a simple example.
print "foo\rbar\r\n";
What you'll see is bar. This is because it prints...
foo
\r sends the cursor back to the start of the line
bar overwrites foo
\r sends the cursor back to the start of the line
\n goes to the start of the next line (doesn't matter where the cursor is)
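chomp will only remove whatever is in $/ off the end of the string, and $/ defaults to \n. On Windows, Perl's default :crlf layer translates \r\n to \n as the file is read, which is why the same script works there. On Unix no such translation happens, so the \r is left behind.
There are a number of ways to solve this. One of the safest is to manually remove newlines of both types with a regex.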
# \015 is octal character 015 which is carriage return.
# \012 is octal character 012 which is newline
$line =~ s{\015?\012$}{};
That says to remove maybe a \r and definitely a \n at the end of the line.
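As a minimal sketch, the same substitution can be wrapped in a helper and used in place of chomp. The name strip_eol and the sample records are hypothetical, and \z is used here as a slightly stricter end-of-string anchor than $.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending, either LF or CRLF.
sub strip_eol {
    my ($line) = @_;
    $line =~ s{\015?\012\z}{};   # optional CR, then LF, at the very end
    return $line;
}

# Sample records as they would look when a Windows file is read on Linux,
# followed by one with a plain Unix line ending.
for my $raw ("ADAM\t123\015\012", "TOM\t473\012") {
    my @fields = split /\t/, strip_eol($raw);
    print join('|', @fields), "\n";
}
```

Because the carriage return is optional in the pattern, the same code handles files from either platform.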

Related

Perl splits string incorrectly when loaded from file

I am probably missing something because I started Perl today, so please excuse me if it's something very obvious.
I would like to load a string from a file and then split it character by character.
I have done the following:
use strict;
open my $fh, "<", "hello.txt" || die "Cannot open file!\n";
my $data = do { local $/ ; <$fh>};
print $data;
print join( ', ',(split( //, $data)));
close $fh;
When I execute this script the first print statement prints $data without problem, however the second print prints only the join string.
Hello, world!
,
I am running on Windows 7 machine with Strawberry Perl, I don't have access to Unix/Linux machine at the moment so I could not test it elsewhere.
This is probably an issue with the carriage return character "\r" – Windows line endings are \r\n, and a \r on its own moves back to the start of the line, overwriting what you have already written.
You could chomp $data first to remove the line ending, though this will only remove the last line ending.
You can also have Perl convert the Windows \r\n line endings to Unix \n line endings when reading in the file, by applying the :crlf IO layer:
open my $fh, "<:crlf", "hello.txt" or die "Cannot open file!\n";
(Note that it must be open … or die … or open(…) || die … but not open … || die …, because of operator precedence rules.)
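A minimal sketch of the :crlf layer in action, assuming a temporary file stands in for hello.txt (the temp file and the variable names are illustrative only). It writes Windows line endings untranslated, reads them back through :crlf, and counts any carriage returns that survive.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small file with Windows line endings; :raw prevents any translation.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print {$out} "Hello\015\012world\015\012";
close $out or die "Cannot close $path: $!";

# Low-precedence "or" applies to the whole open call, so a failure is caught.
open my $fh, '<:crlf', $path or die "Cannot open $path: $!";
my $data = do { local $/; <$fh> };   # slurp the whole file
close $fh;

# Count the carriage returns that made it through the :crlf layer.
my $cr_count = () = $data =~ /\015/g;
print "carriage returns left: $cr_count\n";

# Replace newlines with spaces so the joined output stays on one line.
(my $flat = $data) =~ tr/\n/ /;
print join(', ', split //, $flat), "\n";
```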

Why can't vim insert newlines with s? [duplicate]

I'm trying to replace each , in the current file by a new line:
:%s/,/\n/g
But it inserts what looks like a ^@ instead of an actual newline. The file is not in DOS mode or anything.
What should I do?
If you are curious, like me, check the question Why is \r a newline for Vim? as well.
Use \r instead of \n.
Substituting by \n inserts a null character into the text. To get a newline, use \r. When searching for a newline, you’d still use \n, however. This asymmetry is due to the fact that \n and \r do slightly different things:
\n matches an end of line (newline), whereas \r matches a carriage return. On the other hand, in substitutions \n inserts a null character whereas \r inserts a newline (more precisely, it’s treated as the input CR). Here’s a small, non-interactive example to illustrate this, using the Vim command line feature (in other words, you can copy and paste the following into a terminal to run it). xxd shows a hexdump of the resulting file.
echo bar > test
(echo 'Before:'; xxd test) > output.txt
vim test '+s/b/\n/' '+s/a/\r/' +wq
(echo 'After:'; xxd test) >> output.txt
more output.txt
Before:
0000000: 6261 720a bar.
After:
0000000: 000a 720a ..r.
In other words, \n has inserted the byte 0x00 into the text; \r has inserted the byte 0x0a.
Here's the trick:
First, set your Vi(m) session to allow pattern matching with special characters (e.g. newline). It's probably worth putting this line in your .vimrc or .exrc file:
:set magic
Next, do:
:s/,/,^M/g
To get the ^M character, type Ctrl + V and hit Enter. Under Windows, do Ctrl + Q, Enter. The only way I can remember these is by remembering how little sense they make:
A: What would be the worst control-character to use to represent a newline?
B: Either q (because it usually means "Quit") or v because it would be so easy to type Ctrl + C by mistake and kill the editor.
A: Make it so.
In the syntax s/foo/bar, \r and \n have different meanings, depending on context.
Short:
For foo:
\r == "carriage return" (CR / ^M)
\n == matches "line feed" (LF) on Linux/Mac, and CRLF on Windows
For bar:
\r == produces LF on Linux/Mac, CRLF on Windows
\n == "null byte" (NUL / ^@)
When editing files on Linux (e.g. on a web server) that were initially created in a Windows environment and uploaded (e.g. via FTP/SFTP), all the ^M's you see in Vim are the CRs, which Linux does not translate because it uses only LFs to mark a line break.
Longer (with ASCII numbers):
NUL == 0x00 == 0 == Ctrl + @ == ^@ shown in vim
LF == 0x0A == 10 == Ctrl + J
CR == 0x0D == 13 == Ctrl + M == ^M shown in vim
Here is a list of the ASCII control characters. Insert them in Vim via Ctrl + V,Ctrl + ---key---.
In Bash or the other Unix/Linux shells, just type Ctrl + ---key---.
Try Ctrl + M in Bash. It's the same as hitting Enter, as the shell realizes what is meant, even though Linux systems use line feeds for line delimiting.
To insert literals in Bash, prepending them with Ctrl + V will also work.
Try in Bash:
echo ^[[33;1mcolored.^[[0mnot colored.
This uses ANSI escape sequences. Insert the two ^['s via Ctrl + V, Esc.
You might also try Ctrl + V,Ctrl + M, Enter, which will give you this:
bash: $'\r': command not found
Remember the \r from above? :>
This ASCII control characters list differs from a complete ASCII symbol table in that it lists the control characters, which are inserted into a console/pseudoterminal/Vim via the Ctrl key (haha).
In C and most other languages, by contrast, you usually use the octal codes to represent these 'characters'.
If you really want to know where all this comes from: The TTY demystified. This is the best link you will come across about this topic, but beware: There be dragons.
TL;DR
Usually foo = \n, and bar = \r.
You need to use:
:%s/,/^M/g
To get the ^M character, press Ctrl + v followed by Enter.
\r can do the work here for you.
With Vim on Windows, use Ctrl + Q in place of Ctrl + V.
This is the best answer for the way I think, but it would have been nicer in a table:
Why is \r a newline for Vim?
So, rewording:
You need to use \r to use a line feed (ASCII 0x0A, the Unix newline) in a regex replacement, but that is peculiar to the replacement - you should normally continue to expect to use \n for line feed and \r for carriage return.
This is because Vim uses \n in a replacement to mean the NUL character (ASCII 0x00). You might have expected NUL to be \0 instead, freeing \n for its usual use as line feed, but \0 already has a meaning in regex replacements, so NUL was shifted to \n. The newline was then shifted in turn from \n to \r (which in a regex pattern is the carriage return character, ASCII 0x0D).
Character                | ASCII code | C representation | Regex match | Regex replacement
-------------------------+------------+------------------+-------------+------------------
NUL                      | 0x00       | \0               | \0          | \n
line feed (Unix newline) | 0x0a       | \n               | \n          | \r
carriage return          | 0x0d       | \r               | \r          | <unknown>
NB: ^M (Ctrl + V Ctrl + M on Linux) inserts a newline when used in a regex replacement rather than a carriage return as others have advised (I just tried it).
Also note that Vim will translate the line feed character when it saves to file based on its file format settings and that might confuse matters.
Files coming from Eclipse can have ^M characters embedded in a line, and you may want to convert them to newlines.
:s/\r/\r/g
But if one has to substitute, then the following thing works:
:%s/\n/\r\|\-\r/g
In the above, every newline is replaced with a newline, then |-, then another newline. This is used in wiki tables.
If the text is as follows:
line1
line2
line3
It is changed to
line1
|-
line2
|-
line3
Here's the answer that worked for me, quoting "Use the vi editor to insert a newline char in replace":
Something else I have to do and cannot remember and then have to look up.
In vi, to insert a newline character in a search and replace, do the following:
:%s/look_for/replace_with^M/g
The command above would replace all instances of “look_for” with “replace_with\n” (with \n meaning newline).
To get the “^M”, enter the key combination Ctrl + V, and then after that (release all keys) press the Enter key.
If you need to do it for a whole file, it was also suggested to me that you could try from the command line:
sed 's/\\n/\n/g' file > newfile
In the Vim editor, the following command successfully replaced the literal text \n with a new line:
:%s/\\n/\r/g
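Outside Vim the \n versus \r asymmetry does not apply: in a Perl substitution, \n in the replacement is an ordinary line feed. A minimal sketch, using a made-up sample string, of the comma-to-newline replacement from the question, with a hex dump to confirm which byte was inserted:

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be read from a file.
my $text = "alpha,beta,gamma";

# In Perl, \n in the replacement is a line feed (0x0A), not a NUL.
(my $split = $text) =~ s/,/\n/g;
print "$split\n";

# Show the bytes to confirm what was inserted in place of each comma.
print join(' ', map { sprintf '%02x', ord } split //, $split), "\n";
```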

Using Perl Win7 to write a file for Linux and having only Linux line endings

This Perl script is running on Win7, modifying a Clearcase config spec that will be read on a Linux machine. Clearcase is very fussy about its line endings: they must be precisely and only \n (0x0A). However, try as I may, I cannot get Perl to emit only \n endings; they usually come out as \r\n (0x0D 0x0A).
Here's the Perl snippet, running over an array of config spec elements and converting element /somevob/... bits into element /vobs/somevob/... and printing to a file handle.
$fh = new FileHandle;
foreach my $line (@cs_array)
{
    # (element|load) is an alternation; [element|load] would be a character class
    $line =~ s/(element|load)(\s+\/)(.+)/$1$2vobs\/$3/g;
    $line =~ s/[\r\n]/\n/g; # tried many things here
    $fh->print($line);
}
$fh->close();
Sometimes the elements in the array are multi-line and separated by \n
element /vob1/path\nelement\n/vob2/path\nload /vob1/path\n element\n
/vob3/path
load /vob3/path
When I look into the file written on Win7 in a binary viewer there is always a 0x0D 0x0A newline sequence which Clearcase on Linux complains about. This appears to come from the print.
Any suggestions? I thought this would be a 10 minute job...
Try
$fh->binmode;
Otherwise you're probably in text mode, and for Windows this means that \n is translated to \r\n.
You are running afoul of the :crlf IO Layer that is the default for Perl on Windows.
You can use binmode after the fact to remove this layer, or you can open the filehandle with :raw (the default layer for *nix) or some other appropriate IO layer in the first place.
Sample:
$fh = FileHandle->new($FileName, '>:raw');
Check perldoc open for more details on IO Layers.
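A minimal sketch of the :raw approach, assuming a temporary file and two made-up config spec lines in place of the real array. It uses the built-in open rather than FileHandle, writes through :raw, then reads the bytes back untranslated to check for carriage returns.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical config spec lines standing in for the real array.
my @cs_array = ("element /vob1/path\n", "load /vob1/path\n");

# Reserve a temporary file name; the handle itself is not needed.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# '>:raw' disables the :crlf translation, so "\n" is written as a single 0x0A.
open my $fh, '>:raw', $path or die "Cannot open $path: $!";
print {$fh} $_ for @cs_array;
close $fh or die "Cannot close $path: $!";

# Read the bytes back untranslated and look for carriage returns.
open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;
print $bytes =~ /\015/ ? "found CR\n" : "LF only\n";
```

On Unix the result is the same with or without :raw; the layer only changes behaviour on platforms where :crlf is the default, such as Windows.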

How to clean a data file from binary junk?

I have this data file, which is supposed to be a normal ASCII file. However, it has some junk at the end of the first line. It only shows when I look at it with vi or less:
y mon d h XX11 XX22 XX33 XX44 XX55 XX66^@
2011 6 6 10 14.0 15.5 14.3 11.3 16.2 16.1
grep is also saying that it's a binary file: Binary file data.dat matches
This is causing some trouble in my parsing script. I'm splitting each line and putting the pieces into an array. The last element (XX66) of the first array is somehow corrupted because of the junk, and I can't match against it.
How can I clean that line or the array? I have tried running dos2unix on the file and cleaning array members with s/\s+$//. What is that junk anyway? Unfortunately I have no control over the data; it comes from a third party.
Any ideas?
Grep is trying to be smart and, when it sees an unprintable character, switches to "binary" mode. Add "-a" or "--text" to force grep to stay in "text" mode.
As for sed, try sed -e 's/\([^ -~]*\)//g', which says, "change everything not between space and tilde (chars 0x20 and 0x7E, respectively) into nothing". That'll strip tabs, too, but you can insert a tab character before the space to include them (or any other special character).
The "^@" is one way to represent a NUL (aka "ascii(0)" or "\0"). Some programs may also see that as an end-of-file if they were implemented in a naive way.
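Since the parsing script is already splitting lines into arrays, the same cleanup can be done there before the split. A minimal Perl sketch, assuming a header line with a trailing NUL like the one in the question (the sample string is made up); tr with the c and d flags deletes every character outside the listed set.

```perl
use strict;
use warnings;

# Hypothetical header line with a trailing NUL, as in the question.
my $line = "y mon d h XX11 XX22 XX33 XX44 XX55 XX66\0\n";

# Delete everything except tab, LF, and printable ASCII (space through tilde).
(my $clean = $line) =~ tr/\t\n\x20-\x7E//cd;

# Split on whitespace; the last field no longer carries the NUL.
my @fields = split ' ', $clean;
print "last field: [$fields[-1]]\n";
```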
If it's always the same codes (e.g. ^@ or related) then you can find/replace them.
In Vim for example:
:%s/^@//g in edit mode will clear out any of those characters.
To enter a character such as ^@, press and hold down the Ctrl button, press 'v' and then press the character you need - in the above case, remember to hold shift down to get the @ key. The Ctrl key should be held down til the end.
The ^@ looks like it's a control character. I can't figure out what character it should be, but I guess that's not important.
You can use s/^@//g to get rid of them, but you have to actually COPY the character, just putting ^ and @ together won't do it.
I created this small script to remove all binary, non-ASCII and some annoying characters from a file. Note that the character codes are in octal:
#!/usr/bin/perl
use strict;
use warnings;
my $filename = $ARGV[0];
open my $fh,  '<', $filename    or die "File not found: $!";
open my $fh2, '>', 'report.txt' or die "Cannot write report.txt: $!";
binmode($fh);
my ($xdr, $buffer) = ("", "");
# read 1 byte at a time until end of file ...
while (read ($fh, $buffer, 1) != 0) {
    # append the buffer value to xdr variable
    $xdr .= $buffer;
    # drop control characters (keeping LF and CR), "!" through "-", "~" and DEL
    if (!($xdr =~ /[\0-\11]/) and (!($xdr =~ /[\13-\14]/))and (!($xdr =~ /[\16-\37]/)) and (!($xdr =~ /[\41-\55]/)) and (!($xdr =~ /[\176-\177]/))) {
        print $fh2 $xdr;
    }
    $xdr = "";
}
# close the handles so report.txt is fully flushed before it is read again
close $fh;
close $fh2 or die "Cannot close report.txt: $!";
# finally, clean all the characters that are not ASCII.
system("perl -plne 's/[^[:ascii:]]//g' report.txt > $filename.clean.txt");
Stripping individual characters using sed is going to be very slow, perhaps several minutes for a 100 MB file.
As an alternative, if you know the format/structure of the file, e.g. a log file where the "good" lines of the file start with a timestamp, then you can grep out the good lines and redirect those to a new file.
For example, if we know that all good lines start with a timestamp with the year 2021, we can use this expression to only output those lines to a new file:
grep -a "^2021" mylog.log > mylog2.log
Note that you must use the -a or --text option with grep to force grep to output lines when it detects that the file is binary.
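The same filter can be sketched in Perl, which has no binary-file heuristic to work around. This assumes made-up log content held in a string and read through an in-memory filehandle; with a real file you would open its path instead.

```perl
use strict;
use warnings;

# Hypothetical log content: two good lines and one line of binary junk.
my $log = "2021-03-01 started\n\x00\x01\x02junk\n2021-03-02 stopped\n";

# Read from an in-memory filehandle so the sketch needs no real file.
open my $in, '<', \$log or die "Cannot open in-memory log: $!";

# Keep only the lines that start with the expected timestamp prefix.
my @good = grep { /^2021/ } <$in>;
close $in;

print @good;
```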

Resources