How to clean a data file from binary junk?


I have this data file, which is supposed to be a normal ASCII file. However, it has some junk at the end of the first line. It only shows when I look at it with vi or less:
y mon d h XX11 XX22 XX33 XX44 XX55 XX66^#
2011 6 6 10 14.0 15.5 14.3 11.3 16.2 16.1
grep is also saying that it's a binary file: Binary file data.dat matches
This is causing some trouble in my parsing script. I'm splitting each line and putting the fields into an array. The last element (XX66) of the first array is somehow corrupted because of the junk, and I can't match against it.
How can I clean that line or the array? I have tried running dos2unix on the file and substituting array members with s/\s+$//. What is that junk anyway? Unfortunately I have no control over the data; it's third-party data.
Any ideas?

Grep is trying to be smart and, when it sees an unprintable character, switches to "binary" mode. Add "-a" or "--text" to force grep to stay in "text" mode.
As for sed, try sed -e 's/\([^ -~]*\)//g', which says, "change everything not between space and tilde (chars 0x20 and 0x7E, respectively) into nothing". That'll strip tabs, too, but you can insert a tab character before the space to include them (or any other special character).
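The "^#" is caret notation for a control character. A NUL (aka "ascii(0)" or "\0") is normally displayed as "^@" in that notation, so it is worth checking the exact byte with a hex dump. Some programs may also treat a NUL as an end-of-file if they were implemented in a naive way.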
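To see which byte the junk really is and to strip it, something like the following may help. This is only a sketch: the sample file and the choice of byte 003 (octal) are made up for illustration, and tr is swapped in for the sed expression above because its octal ranges are easy to read.

```shell
# Hypothetical sample: a header line ending in a stray control byte (octal 003).
tmp=$(mktemp -d)
printf 'y mon d h XX55 XX66\003\n2011 6 6 10 16.2 16.1\n' > "$tmp/data.dat"

# Dump the first line byte by byte; the junk shows up as an octal escape.
head -n 1 "$tmp/data.dat" | od -c

# Keep only tab (011), newline (012) and printable ASCII (040-176).
# -c complements the set and -d deletes everything in the complement.
LC_ALL=C tr -cd '\011\012\040-\176' < "$tmp/data.dat" > "$tmp/data.clean"

cat "$tmp/data.clean"
```

Unlike the sed expression, this keeps tabs without any extra work, because tab is listed explicitly in the set.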

If it's always the same codes (e.g. ^# or related), then you can find and replace them.
In Vim for example:
:%s/^#//g, typed from normal mode, will clear out any of those characters.
To enter a character such as ^#, hold down Ctrl, press 'v', and then press the character you need. In the above case, remember to hold Shift as well to get the # key. Keep Ctrl held down until the end.

The ^# looks like it's a control character. I can't figure out what character it should be, but I guess that's not important.
You can use s/^#//g to get rid of them, but you have to actually COPY the character, just putting ^ and # together won't do it.

I created this small script to remove all binary, non-ASCII and some annoying characters from a file. Note that the character ranges are given in octal:
#!/usr/bin/perl
use strict;
use warnings;
my $filename = $ARGV[0];
open my $fh, '<', $filename or die "File not found: $!";
open my $fh2, '>', 'report.txt' or die "Can't write report.txt: $!";
binmode($fh);
my $buffer = "";
# read 1 byte at a time until end of file ...
while (read($fh, $buffer, 1) != 0) {
    # keep the byte only if it falls outside every unwanted octal range
    if ($buffer !~ /[\0-\11\13-\14\16-\37\41-\55\176-\177]/) {
        print $fh2 $buffer;
    }
}
# flush and close the output before another process reads it
close $fh2;
close $fh;
# finally, clean all the characters that are not ASCII.
system("perl -plne 's/[^[:ascii:]]//g' report.txt > $filename.clean.txt");

Stripping individual characters using sed is going to be very slow, perhaps several minutes for a 100 MB file.
As an alternative, if you know the format/structure of the file, e.g. a log file where the "good" lines of the file start with a timestamp, then you can grep out the good lines and redirect those to a new file.
For example, if we know that all good lines start with a timestamp with the year 2021, we can use this expression to only output those lines to a new file:
grep -a "^2021" mylog.log > mylog2.log
Note that you must use the -a or --text option with grep to force grep to output lines when it detects that the file is binary.
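A small, self-contained illustration of that filter is sketched below. The log contents are invented, and the year prefix is just the example pattern from above.

```shell
tmp=$(mktemp -d)
# Hypothetical log: two good lines around a line of junk that contains a NUL byte.
printf '2021-03-01 started\n\000\001\002garbage\n2021-03-02 stopped\n' > "$tmp/mylog.log"

# -a treats the file as text, so matching lines are printed instead of
# the "Binary file ... matches" notice.
grep -a '^2021' "$tmp/mylog.log" > "$tmp/mylog2.log"

cat "$tmp/mylog2.log"
```

Only the two timestamped lines end up in the new file; the junk line is dropped because it does not start with the expected prefix.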

Related

vim - why will search find it but search and replace not? (this escaped special char pattern)

I want to search and replace in Vim. The / search finds the pattern, but :s%//g will not replace it.
I have a script that monitors software RAID (if interested, check it out: https://dwaves.org/2019/09/06/linux-server-monitor-software-raid-mail-notification-on-failure/)
echo "=== smart status of all drives ==="| tee -a /scripts/monitor/raid_status_mail.log
# want to search and replace the /path/to/file.sh with $LOGFILE
# searching for the pattern works like charm
/\/scripts\/monitor\/raid_status_mail.log
# but replacing it won't
:s%/\/scripts\/monitor\/raid_status_mail\.log/\$LOGFILE/g
# what does one do wrong?
The command should replace /scripts/monitor/raid_status_mail.log with $LOGFILE.

The substitution operation needs to be prefixed with %s and not the other way around as s%. So doing
%s/\/scripts\/monitor\/raid_status_mail\.log/\$LOGFILE/g
should work as expected. Or run the same substitution non-interactively with ex, Vim's line-editor mode, from the command line:
printf '%s\n' "%s/\/scripts\/monitor\/raid_status_mail\.log/\$LOGFILE/g" w q | ex -s file

You inverted the beginning, s%. Use %s instead.
Also, you use / as the separator between the fields. It works, but all the escaping makes the command less readable. You can use almost any other character as the separator, for example a colon:
%s:/scripts/monitor/raid_status_mail.log:$LOGFILE:g
One last tip: install vim-over.
It highlights your matches live while you type a substitution in Vim.
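The same delimiter trick works outside Vim as well. The sketch below uses sed rather than Vim, on a temporary file made up for the example that holds one line from the question's script.

```shell
tmp=$(mktemp -d)
# One line taken from the question's script.
printf '%s\n' 'echo "=== smart status of all drives ==="| tee -a /scripts/monitor/raid_status_mail.log' > "$tmp/monitor.sh"

# ':' as the delimiter means the slashes in the path need no escaping.
# Single quotes keep the shell from expanding $LOGFILE, and sed treats
# '$' in the replacement text literally.
result=$(sed 's:/scripts/monitor/raid_status_mail\.log:$LOGFILE:g' "$tmp/monitor.sh")

printf '%s\n' "$result"
```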

sed command working on command line but not in perl script

I have a file in which I have to replace all the words like $xyz. The substitutions look like these:
$xyz with ${xyz}
$abc_xbs with ${abc_xbs}
$ab,$cd with ${ab},${cd}
The file also has some words like ${abcd}, which I don't have to change.
I am using this command:
sed -i 's?\$([A-Z_]+)?\${\1}?g' file
It works fine on the command line but not inside a Perl script as
sed -i 's?\$\([A-Z_]\+\)?\$\{\1\}?g' file;
What am I missing?
I think adding some backslashes would help. I tried adding some, but with no success.
Thanks

In a Perl script you need valid Perl language, just like you need valid C text in a C program. In the terminal sed.. is understood and run by the shell as a command but in a Perl program it is just a bunch of words, and that line sed.. isn't valid Perl.
You would need this inside qx() (backticks) or system() so that it is run as an external command. Then you'd indeed need "some backslashes," which is where things get a bit picky.
But why run a sed command from a Perl script? Do the job with Perl
use warnings;
use strict;
use File::Copy 'move';
my $file = 'filename';
my $out_file = 'new_' . $file;
open my $fh, '<', $file or die "Can't open $file: $!";
open my $fh_out, '>', $out_file or die "Can't open $out_file: $!";
while (<$fh>)
{
s/\$( [^{] [a-z_]* )/\${$1}/gix;
print $fh_out $_;
}
close $fh_out;
close $fh;
move $out_file, $file or die "Can't move $out_file to $file: $!";
The regex uses a negated character class, [^...], to match any character other than { following $, thus excluding already braced words. Then it matches a sequence of letters or underscore, as in the question (possibly none, since the first non-{ already provides at least one).
With 5.14+ you can use the non-destructive /r modifier
print $fh_out s/\$([^{][a-z_]*)/\${$1}/gir;
with which the changed string is returned (and original is unchanged), right for the print.
The output file, in the end moved over the original, should be made using File::Temp. Overwriting the original this way changes $file's inode number; if that's a concern see this post for example, for how to update the original inode.
A one-liner (command-line) version, to readily test
perl -wpe's/\$([^{][a-z_]*)/\${$1}/gi' file
This only prints to console. To change the original add -i (in-place), or -i.bak to keep backup.
A reasonable question of "Isn't there a shorter way" came up.
Here is one, using the handy Path::Tiny for a file that isn't huge so we can read it into a string.
use warnings;
use strict;
use Path::Tiny;
my $file = 'filename';
my $new_content = path($file)->slurp =~ s/\$([^{][a-z_]*)/\${$1}/gir;
path($file)->spew( $new_content );
The first line reads the file into a string, on which the replacement runs; the changed text is returned and assigned to a variable. Then that variable with new text is written out over the original.
The two lines can be squeezed into one, by putting the expression from the first instead of the variable in the second. But opening the same file twice in one (complex) statement isn't exactly solid practice and I wouldn't recommend such code.
However, since module's version 0.077 you can nicely do
path($file)->edit_lines( sub { s/\$([^{][a-z_]*)/\${$1}/gi } );
or use edit to slurp the file into a string and apply the callback to it.
So this cuts it to one nice line after all.
I'd like to add that shaving off lines of code mostly isn't worth the effort while it sure can lead to trouble if it disturbs the focus on the code structure and correctness even a bit. However, Path::Tiny is a good module and this is legitimate, while it does shorten things quite a bit.

perl output messed up in fedora, ubuntu

I wrote a perl script for mapping two data sets. When I run the program using the Linux terminal, the output is messed up. It seems like the output is overlapping. I am using Fedora 25. I have tried the code on Windows and it works fine.
The same problem occurs on Ubuntu as well.
DESIRED:
ADAM 123 JOHN 321
TOM 473 BENTLY 564
and so on....
OUTPUT that I am getting:
ADAM 123N 321
TOM 473TLY 564
and so on......
The problem also occurs on Ubuntu 16.04 LTS.
Please help.
code:
use warnings;
open F, "friendship_network_wo_weights1.txt", or die;
open G, "username_gender_1.txt", or die;
while (<G>){
    chomp $_;
    my @a = split /\t/, $_;
    $list{$a[0]} = $a[1];
}
close G;
while (<F>){
    chomp $_;
    my @b = split /\t/, $_;
    if ((exists $list{$b[0]}) && (exists $list{$b[1]})){
        $get = "$b[0]\t${list{$b[0]}}\t$b[1]\t${list{$b[1]}}\n";
        $get =~ s/\r//g;
        print "$get";
    }
}
close F;

The problem is on Windows the newline is \r\n. On everything else it's \n. Assuming these files were created on Windows, when you read them on Unix each line will still have a trailing \r after the chomp.
\r is the "carriage return" character. On an old typewriter you had to move the whole typehead back to the left side at the end of a line, and early computer terminals were essentially fancy typewriters called teleprinters. When you print \r, the cursor moves back to the beginning of the line. Anything you print after that overwrites what was there. Here's a simple example.
print "foo\rbar\r\n";
What you'll see is bar. This is because it prints...
foo
\r sends the cursor back to the start of the line
bar overwrites foo
\r sends the cursor back to the start of the line
\n goes to the start of the next line (doesn't matter where the cursor is)
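chomp will only remove whatever is in $/ off the end of the string, and that is \n by default. On Windows, Perl's text-mode I/O translates \r\n to \n as the file is read, so nothing is left behind there. On Unix no translation happens, and the \r stays.
There are a number of ways to solve this. One of the safest is to manually remove newlines of both types with a regex.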
# \015 is octal character 015 which is carriage return.
# \012 is octal character 012 which is newline
$line =~ s{\015?\012$}{};
That says to remove maybe a \r and definitely a \n at the end of the line.

Linux command to replace string in HUGE file with another string

I have a huge file (8 GB). I want to replace the string LATIN1 with UTF-8 on the first 30 lines. What is the most efficient method? That is, is there a way to use sed, perhaps, and quit after parsing the first 30 lines?
Vim was not able to save the file in 3 hours.

The problem is that in the event of a replacement, all programs will make a copy of the file with the substitution in place in order to replace the original file ultimately -- they don't want to risk losing the original for obvious reasons.
With perl, you can do this in a one-liner, but that doesn't make it any faster (well, it probably is faster than vim, since vim preserves history in yet another file, which perl doesn't):
perl -pi -e 's,\bLATIN1\b,UTF-8,g if $. <= 30' thefile

With sed, you can quit using q (note that the output then contains only the first 30 lines):
sed -e 's/LATIN1/UTF-8/g' -e 30q
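The difference between quitting early and restricting the substitution to a line range can be seen on a tiny file. This is a sketch with an invented four-line sample, using 2 in place of 30 to keep it short.

```shell
tmp=$(mktemp -d)
printf 'LATIN1 a\nLATIN1 b\nLATIN1 c\nLATIN1 d\n' > "$tmp/big.txt"

# Quit after line 2: sed stops reading, so only two lines are written.
sed -e 's/LATIN1/UTF-8/g' -e 2q "$tmp/big.txt" > "$tmp/head_only.txt"

# Address range: substitute on lines 1-2 only, copy the rest unchanged.
sed '1,2s/LATIN1/UTF-8/g' "$tmp/big.txt" > "$tmp/full.txt"

wc -l "$tmp/head_only.txt" "$tmp/full.txt"
```

The range form still reads and writes the whole file, but it skips the substitution attempt on every line after the range.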

Untested, but I think ed will edit the file in place without writing to a temp file.
ed yourBigFile << END
1,30s/LATIN1/UTF-8/g
w
q
END

How do I write a sed script to grep information from a text file

I'm trying to do my homework that is restricted to only using sed to filter an input file to a certain format of output. Here is the input file (named stocks):
Symbol;Name;Volume
================================================
BAC;Bank of America Corporation Com;238,059,612
CSCO;Cisco Systems, Inc.;28,159,455
INTC;Intel Corporation;22,501,784
MSFT;Microsoft Corporation;23,363,118
VZ;Verizon Communications Inc. Com;5,744,385
KO;Coca-Cola Company (The) Common;3,752,569
MMM;3M Company Common Stock;1,660,453
================================================
And the output needs to be:
BAC, CSCO, INTC, MSFT, VZ, KO, MMM
I did come up with a solution, but it's not efficient. Here is my sed script (named try.sed):
/.*;.*;[0-9].*/ { N
N
N
N
N
N
s/\(.*\);.*;.*\n\(.*\);.*;.*\n\(.*\);.*;.*\n\(.*\);.*;.*\n\(.*\);.*;.*\n\(.*\);.*;.*\n\(.*\);.*;.*/\1, \2, \3, \4, \5, \6, \7/gp
}
The command that I run on shell is:
$ sed -nf try.sed stocks
My question is, is there a better way of using sed to get the same result? The script I wrote only works with 7 lines of data. If the data is longer, I need to re-modify my script. I'm not sure how I can make it any better, so I'm here asking for help!
Thanks for any recommendations.

One more way using sed:
sed -ne '/^====/,/^====/ { /;/ { s/;.*$// ; H } }; $ { g ; s/\n// ; s/\n/, /g ; p }' stocks
Output:
BAC, CSCO, INTC, MSFT, VZ, KO, MMM
Explanation:
-ne # Process each input line without printing and execute next commands...
/^====/,/^====/ # For all lines between these...
{
/;/ # If line has a semicolon...
{
s/;.*$// # Remove characters from first semicolon until end of line.
H # Append content to 'hold space'.
}
};
$ # In last input line...
{
g # Copy content of 'hold space' to 'pattern space' to work with it.
s/\n// # Remove first newline character.
s/\n/, /g # substitute the rest with output separator, comma in this case.
p # Print to output.
}

Edit: I've edited my algorithm, since I had neglected to consider the header and footer (I thought they were just for our benefit).
sed, by its design, accesses every line of an input file, and then performs expressions on ones that match some specification (or none). If you're tailoring your script to a certain number of lines, you're definitely doing something wrong! I won't write you a script since this is homework, but the general idea for one way to go about it is to write a script that does the following. Think of the ordering as the order things should be in a script.
1. Skip the first three lines using d, which deletes the pattern space and immediately moves on to the next line.
2. For each line that isn't a blank line, do the following steps. (This would all be in a single set of curly braces.)
   a. Replace everything after and including the first semicolon (;) with a comma-and-space (", ") using the s (substitute) command.
   b. Append the current pattern space into the hold buffer (look at H).
   c. Delete the pattern space and move on to the next line, like in step 1.
3. For each line that gets to this point in the script (should be the first blank line), retrieve the contents of the hold space into the pattern space. (This would be after the curly braces above.)
4. Substitute all newlines in the pattern space with nothing.
5. Next, substitute the last comma-and-space in the pattern space with nothing.
6. Finally, quit the program so you don't process any more lines. My script worked without this, but I'm not 100% sure why.
That being said, that's just one way to go about it. sed often offers varying ways of varying complexity to accomplish a task. A solution I wrote with this method is 10 lines long.
As a note, I don't bother suppressing printing (with -n) or manually printing (with p); each line is printed by default. My script runs like this:
$ sed -f companies.sed companies
BAC, CSCO, INTC, MSFT, VZ, KO, MMM
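For readers who want to check their own attempt, one possible script that follows the steps in this answer is sketched below. It is not the answerer's script, and the input is a shortened, hypothetical copy of the data that keeps the blank lines around the data rows.

```shell
tmp=$(mktemp -d)

# Shortened copy of the input, including the blank lines around the data.
cat > "$tmp/companies" <<'EOF'
Symbol;Name;Volume
================================================

BAC;Bank of America Corporation Com;238,059,612
CSCO;Cisco Systems, Inc.;28,159,455
MMM;3M Company Common Stock;1,660,453

================================================
EOF

# Skip three lines; on non-blank lines cut at the first ';', append to
# the hold space and delete; on the first blank line fetch the hold
# space, drop the newlines and the trailing ", ", then quit.
cat > "$tmp/companies.sed" <<'EOF'
1,3d
/./{
s/;.*/, /
H
d
}
g
s/\n//g
s/, $//
q
EOF

result=$(sed -f "$tmp/companies.sed" "$tmp/companies")
printf '%s\n' "$result"
```

With this particular input, the q is not strictly needed: the only line after the blank one is non-blank, so it is appended to the hold space and deleted without producing any output.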

This sed command should produce your required output:
sed -rn '/[0-9]+$/{s/^([^;]*).*$/\1/p;}' file.txt
OR on Mac:
sed -En '/[0-9]+$/{s/^([^;]*).*$/\1/p;}' file.txt

This might work for you:
sed '1d;/;/{s/;.*//;H};${g;s/.//;s/\n/, /g;q};d' stocks
We don't want the headings so let's delete them. 1d
All data items are delimited by ;'s so let's concentrate on those lines. /;/
Of the things above, delete everything from the first ; to the end of the line and then stuff it away in the hold space (HS). {s/;.*//;H}
When you get to the last line, overwrite it with the HS using the g command, delete the first newline (generated by the H command), replace all subsequent newlines with a comma and a space and print out what's left. ${g;s/.//;s/\n/, /g;q}
Delete everything else d
Here's a terminal session showing the incremental refinement of building a sed command:
cat <<! >stock # paste the file into a here doc and pass it on to a file
> Symbol;Name;Volume
> ================================================
>
> BAC;Bank of America Corporation Com;238,059,612
> CSCO;Cisco Systems, Inc.;28,159,455
> INTC;Intel Corporation;22,501,784
> MSFT;Microsoft Corporation;23,363,118
> VZ;Verizon Communications Inc. Com;5,744,385
> KO;Coca-Cola Company (The) Common;3,752,569
> MMM;3M Company Common Stock;1,660,453
>
> ================================================
> !
sed '1d;/;/!d' stock # delete headings and everything but data lines
BAC;Bank of America Corporation Com;238,059,612
CSCO;Cisco Systems, Inc.;28,159,455
INTC;Intel Corporation;22,501,784
MSFT;Microsoft Corporation;23,363,118
VZ;Verizon Communications Inc. Com;5,744,385
KO;Coca-Cola Company (The) Common;3,752,569
MMM;3M Company Common Stock;1,660,453
sed '1d;/;/{s/;.*//p};d' stock # delete all non essential data
BAC
CSCO
INTC
MSFT
VZ
KO
MMM
sed '1d;/;/{s/;.*//;H};${g;l};d' stock # use the l command to see what's really there!
\nBAC\nCSCO\nINTC\nMSFT\nVZ\nKO\nMMM$
sed '1d;/;/{s/;.*//;H};${g;s/.//;s/\n/, /g;l};d' stock # refine refine
BAC, CSCO, INTC, MSFT, VZ, KO, MMM$
sed '1d;/;/{s/;.*//;H};${g;s/.//;s/\n/, /g;q};d' stock # all done!
BAC, CSCO, INTC, MSFT, VZ, KO, MMM
