How to split the data in a Unix file

I have a file on a Unix (Solaris) system with data like below:
[TYPEA]:/home/typeb/file1.dat
[TYPEB]:/home/typeb/file2.dat
[TYPEB]:/home/typeb/file3.dat
[TYPE_C]:/home/type_d/file4.dat
[TYPE_C]:/home/type_d/file5.dat
[TYPE_C]:/home/type_d/file6.dat
I want to separate the headings like below
[TYPEA]
/home/typeb/file1.dat
[TYPEB]
/home/typeb/file2.dat
/home/typeb/file3.dat
[TYPE_C]
/home/type_d/file4.dat
/home/type_d/file5.dat
/home/type_d/file6.dat
Files of the same type have to come under one heading.
Please help me with any logic to achieve this without hardcoding.

Assuming the input is sorted by type like in your example,
awk -F : '$1 != prev { print $1 } { print $2; prev=$1 }' file
If there are more than 2 fields you will need to adjust the second clause.
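For example, if the paths may themselves contain colons, one way to adjust it is to split on the first colon only. This is a sketch, untested on Solaris awk (you may need /usr/xpg4/bin/awk there):
awk '{
    key = $0; sub(/:.*/, "", key)        # everything before the first ":"
    path = substr($0, length(key) + 2)   # everything after it
    if (key != prev) print key
    print path; prev = key
}' file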

sed 'H;$ !b
x
s/\(\(\n\)\(\[[^]]\{1,\}]\):\)/\1\2\1/g
:cycle
s/\(\n\[[^]]\{1,\}]\)\(.*\)\1/\1\2/g
t cycle
s/^\n//' YourFile
POSIX sed version, a bit unreadable due to the presence of [ in the pattern:
- allows : in the label or in the file/path
- fails if lines with the same label are separated by a line with another label (the sample seems ordered)
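If the input might not already be grouped by label, a preliminary sort on the first field sidesteps that last caveat (a sketch; feed the result to the script above):
sort -t : -k 1,1 YourFile > sorted.txt   # sorted.txt: hypothetical name for the grouped copy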

If you can use Perl, you can make use of a hash to create a simple data structure:
#! /usr/bin/perl
use warnings;
use strict;
my %h;
while (<>) {
    chomp;
    my ($key, $value) = split /:/, $_, 2;
    $h{$key} = [] unless exists $h{$key};
    push @{$h{$key}}, $value;
}
foreach my $key (sort keys %h) {
    print "$key\n";
    foreach my $value (@{$h{$key}}) {
        print "$value\n";
    }
}
In action:
perl script.pl file
[TYPEA]
/home/typeb/file1.dat
[TYPEB]
/home/typeb/file2.dat
/home/typeb/file3.dat
[TYPE_C]
/home/type_d/file4.dat
/home/type_d/file5.dat
/home/type_d/file6.dat
If you like it, there is a whole tutorial built around solving this simple problem. It's worth reading.

How to use Regex in Perl

I need some help. I have output from a command and need to extract only the time, i.e. "10:57:09", from it.
The command is: tail -f /var/log/sms
command output:
Thu 2016/08/04 10:57:09 gammu-smsd[48014]: Read 0 messages
How could I do this in Perl and put the result into a variable?
Thank you
Normally, we'd expect you to show some evidence of trying to solve the problem yourself before giving an answer.
You use the match operator (m/.../) to check if a string matches a regular expression. The m is often omitted so you'll see it written as /.../. By default, it matches against the variable $_ but you can change that by using the binding operator, =~. If a regex includes parentheses ((...)) then whatever is matched by that section of the regex is stored in $1 (and $2, $3, etc for subsequent sets of parentheses). Those "captured" values are also returned by the match operator when it is evaluated in list context.
It's always a good idea to check the return value from the match operator, as you'll almost certainly want to take different actions if the match was unsuccessful.
See perldoc perlop for more details of the match operator and perldoc perlre for more details of Perl's regex support.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
$_ = 'Thu 2016/08/04 10:57:09 gammu-smsd[48014]: Read 0 messages';
if (my ($time) = /(\d\d:\d\d:\d\d)/) {
    say "Time is '$time'";
} else {
    say 'No time found in string';
}
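Run it and you should see:
Time is '10:57:09'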
And to get the data from your external process...
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
open my $tail_fh, 'tail -f /var/log/sms |' or die $!;
while (<$tail_fh>) {
    if (my ($time) = /(\d\d:\d\d:\d\d)/) {
        say "Time is '$time'";
    } else {
        say 'No time found in string';
    }
}
Perl code:
$txt = "Thu 2016/08/04 10:57:09 gammu-smsd[48014]: Read 0 messages";
$txt =~ /(\d{2}:\d{2}:\d{2})/;
print $1; # result of regex
print "\n"; # new line
And it prints:
10:57:09
The result goes into the variable $1, thanks to the capturing parentheses. Had there been more capturing parentheses, their captured text would have been put into $2, $3, etc.
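A quick demo of that numbering with more than one set of parentheses (a sketch, run from the shell):
perl -e '$_ = "Thu 2016/08/04 10:57:09";
         print "y=$1 m=$2 d=$3\n" if m{(\d+)/(\d+)/(\d+)};'
which prints y=2016 m=08 d=04.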
EDIT
To read the line from console, use in the above script:
$txt = <STDIN>;
Now, suppose the script is called myscript.pl, execute tail like so:
tail -f /var/log/sms | myscript.pl
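Since $txt = <STDIN> reads only a single line, a one-liner that keeps printing as tail emits new lines might look like this (a sketch):
tail -f /var/log/sms | perl -ne 'print "$1\n" if /(\d{2}:\d{2}:\d{2})/'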

Make a change on the string based on a mapping

I have the following string format
str="aaa.[any_1].bbb.[any_2].ccc"
I have the following mapping
map1:
any_1 ==> 1
cny_1 ==> 2
map2
any_2 ==> 1
bny_2 ==> 2
cny_2 ==> 3
What's the best command to execute on str, taking the above mapping into account, in order to get:
$ command $str
aaa.1.bbb.1.ccc
Turn your map files into sed scripts:
sed 's%^%s/%;s% ==> %/%;s%$%/g%' map?
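Assuming the mappings are stored in files named map1 and map2 (as in the question), this generates a script along these lines:
s/any_1/1/g
s/cny_1/2/g
s/any_2/1/g
s/bny_2/2/g
s/cny_2/3/g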
Apply the resulting script to the input string. You can do it directly by process substitution:
sed 's%^%s/%;s% ==> %/%;s%$%/g%' map? | sed -f- <(echo "$str")
Output:
aaa.[1].bbb.[1].ccc
Update: I now think that I didn't understand the question correctly, and my solution therefore is wrong. I'm leaving it in here because I don't know if parts of this answer will be helpful to your question, but I encourage you to look at the other answers first.
Not sure what you mean. But here's something:
any_1="1"
any_2="2"
str="aaa.${any_1}.bbb.${any_2}.ccc"
echo $str
The curly brackets tell the interpreter where the variable name ends and the normal string resumes. Result:
aaa.1.bbb.2.ccc
You can loop this:
for any_1 in {1..2}; do
    for any_2 in {1..3}; do
        echo aaa.${any_1}.bbb.${any_2}.ccc
    done
done
Here {1..3} expands to the numbers 1, 2, and 3. Result:
aaa.1.bbb.1.ccc
aaa.1.bbb.2.ccc
aaa.1.bbb.3.ccc
aaa.2.bbb.1.ccc
aaa.2.bbb.2.ccc
aaa.2.bbb.3.ccc
{
echo "${str}"
cat Map1
cat Map2
} | sed -n '1h;1!H;$!d
x
s/[[:space:]]*==>[[:space:]]*/ /g
:a
s/\[\([^]]*\)\]\(.*\)\n\1 \([^[:cntrl:]]*\)/\3\2/
ta
s/\n.*//p'
You can use several mappings, not limited to 2 (you could even use find to cat every mapping file found).
This is based on the fact that the alias and the value contain no spaces (it can be adapted if they do).
I have upvoted @chw21's answer as it promotes the right tool for the problem scenario. However, you can devise a Perl-based command based on the following.
#!/usr/bin/perl
use strict;
use warnings;
my $text = join '',<DATA>;
my %myMap = (
    'any_1' => '1',
    'any_2' => '2',
);
$text =~ s/\[([^]]+)\]/replace($1)/ge;
print $text;

sub replace {
    my ($needle) = @_;
    return "[$needle]" unless exists $myMap{lc $needle};
    return $myMap{lc $needle};
}
__DATA__
aaa.[any_1].bbb.[any_2].ccc
The only thing that may require a bit of explanation is the regex: it matches text that comes between square brackets and sends that text to the replace routine, which looks up the corresponding value in the map.
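Saved as, say, script.pl (a hypothetical name), it prints:
$ perl script.pl
aaa.1.bbb.2.ccc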
$ cat tst.awk
BEGIN {
    FS=OFS="."
    m["any_1"]=1; m["cny_1"]=2
    m["any_2"]=1; m["bny_2"]=2; m["cny_2"]=3
    for (i in m) map["["i"]"] = m[i]
}
{
    for (i=1; i<=NF; i++) {
        $i = ($i in map ? map[$i] : $i)
    }
    print
}
$ awk -f tst.awk <<<'aaa.[any_1].bbb.[any_2].ccc'
aaa.1.bbb.1.ccc

I need to alter lines of a Unix file that contain a certain string

I have a file with lines of a set format, using delimiters. I want to identify all the lines with the string "password" in the 3rd field and edit them out. As in, put a "# " at the beginning of them.
I would also like to remove the existing value for the 4th field.
I can't quite work out how to do this. It looks like it should be doable in two steps, but I cannot work them out. I am using the Unix shell, so sed, awk, etc.
Sample lines of the file are:
database2|~|t1||~|${topuser.username}|~|topuser
database2|~|t1||~|${topuser.password}|~|H4rdt0Gu3ss
database2|~|t1||~|${loguser.username}|~|LOG
database2|~|t1||~|${loguser.password}|~|Ih4v3n01d34y0utry
# database2|~|t1||~|${open.var1}|~|connect
database2|~|t1||~|${tablespace}|~|gis_tbs1
Some lines may already be edited out, and the delimiter is "|~|".
Any help is appreciated.
Here is a Perl script to achieve your goal:
#!/usr/bin/perl
use warnings;
use strict;
while (<DATA>) {
    chomp;
    my @fields = split /\|~\|/;
    if ($fields[2] =~ /password/) {
        $fields[0] = "# $fields[0]";
        $fields[3] = '';
    }
    print join("|~|", @fields), "\n";
}
__DATA__
database2|~|t1||~|${topuser.username}|~|topuser
database2|~|t1||~|${topuser.password}|~|H4rdt0Gu3ss
database2|~|t1||~|${loguser.username}|~|LOG
database2|~|t1||~|${loguser.password}|~|Ih4v3n01d34y0utry
# database2|~|t1||~|${open.var1}|~|connect
database2|~|t1||~|${tablespace}|~|gis_tbs1
Here is the one-liner version:
perl -F'/\|~\|/' -ane '$"="|~|"; if ($F[2] =~ /password/) { $F[0]="# $F[0]"; $F[3] = "\n"; } print "@F";' datafile
If I understood your question correctly, then the following should work:
$ awk 'BEGIN{FS=OFS="|"}$6~/password/{$6="# "$6;$NF=""}1' file
database2|~|t1||~|${topuser.username}|~|topuser
database2|~|t1||~|# ${topuser.password}|~|
database2|~|t1||~|${loguser.username}|~|LOG
database2|~|t1||~|# ${loguser.password}|~|
# database2|~|t1||~|${open.var1}|~|connect
database2|~|t1||~|${tablespace}|~|gis_tbs1
If you mean edit them out by putting # at the start of the line, then you can do:
$ awk 'BEGIN{FS=OFS="|"}$6~/password/{$NF="";$0="# "$0}1' file
database2|~|t1||~|${topuser.username}|~|topuser
# database2|~|t1||~|${topuser.password}|~|
database2|~|t1||~|${loguser.username}|~|LOG
# database2|~|t1||~|${loguser.password}|~|
# database2|~|t1||~|${open.var1}|~|connect
database2|~|t1||~|${tablespace}|~|gis_tbs1

Efficient way to replace strings in one file with strings from another file

I searched for similar problems and could not find anything that suits my needs exactly.
I have a very large HTML file scraped from multiple websites, and I would like to replace all
class="key->from 2nd file"
with
style="xxxx"
At the moment I use sed; it works well, but only with small files:
while read key; do
    sed -i "s/class=\"$key\"/style=\"xxxx\"/g" file_to_process
done < keys
When I try to process something larger, it takes ages.
Example:
keys - Count: 1233 lines
file_to_process - Count: 1946 lines
It takes about 40 s to complete only 1/10 of the processing I need:
real 0m40.901s
user 0m8.181s
sys 0m15.253s
Untested since you didn't provide any sample input and expected output:
awk '
NR==FNR { keys = keys sep $0; sep = "|"; next }
{ gsub("class=\"(" keys ")\"","style=\"xxxx\"") }
1' keys file_to_process > tmp$$ &&
mv tmp$$ file_to_process
I think it's time to Perl (untested):
my $keyfilename = 'somekeyfile'; # or pick up from script arguments
open KEYFILE, '<', $keyfilename or die("Could not open key file $keyfilename\n");
my %keys = map { chomp; $_ => 1 } <KEYFILE>; # construct a hash for lookup speed
close KEYFILE;
my $htmlfilename = 'somehtmlfile'; # or pick up from script arguments
open HTMLFILE, '<', $htmlfilename or die("Could not open html file $htmlfilename\n");
my $newchunk = qq/style="xxxx"/;
while (my $line = <HTMLFILE>) {
    my $newline = $line;
    while ($line =~ m/(class="([^"]+)")/g) {
        if (defined($keys{$2})) {
            $newline =~ s/\Q$1\E/$newchunk/g;
        }
    }
    print $newline;
}
close HTMLFILE;
This uses a hash for lookups of keys, which should be reasonably fast, and does this only on the key itself when the line contains a class statement.
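Assuming it is saved as replace.pl (a hypothetical name) and the two file names at the top point at your real key and HTML files, you would run it as:
perl replace.pl > file_processed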
Try to generate a very long sed script with all sub commands from the keys file, something like:
s/class=\"key1\"/style=\"xxxx\"/g; s/class=\"key2\"/style=\"xxxx\"/g ...
and use this file.
This way you will read the input file only once.
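A minimal sketch of that generation step, assuming the keys sit one per line in keys and contain no sed metacharacters:
sed 's|.*|s/class="&"/style="xxxx"/g|' keys > replace.sed
sed -f replace.sed file_to_process > file_processed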
Here's one way using GNU awk:
awk 'FNR==NR { array[$0]++; next } { for (i in array) { a = "class=\"" i "\""; gsub(a, "style=\"xxxx\"") } }1' keys.txt file.txt
Note that the keys in keys.txt are taken as whole lines, including whitespace. If leading or trailing whitespace could be a problem, use $1 instead of $0. Unfortunately I cannot test this properly without some sample data. HTH.
First convert your keys file into a sed or-pattern which looks like this: key1|key2|key3|.... This can be done using the tr command. Once you have this pattern, you can use it in a single sed command.
Try the following:
sed -i -r "s/class=\"($(tr '\n' '|' < keys | sed 's/|$//'))\"/style=\"xxxx\"/g" file

Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to get it down to basically one field per line, as a list.
So the goal is to keep all lines of file A whose 4th field (it says KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, it'd be OK.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: Ok I give up. I just force killed it.
Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
    array[$0]++
    next
}
{
    line = $4
    sub(/\.[0-9]+$/, "", line)
    if (line in array) {
        print
    }
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
    gsub(/[^[:alnum:]]/,"",$12)
    array[$12]++
    next
}
{
    line = $4
    sub(/\.[0-9]+$/, "", line)
    if (line in array) {
        print
    }
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of the basic shell tools. Assuming about 40 characters per line, File A has about 400,000 lines in it and File B about 1,200,000. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 billion lines you're parsing through. Unix tools are surprisingly quick, but even something fast done 480 billion times adds up.
You would be better off using a full programming scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A, check to see if that fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something like this, off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
    chomp $line;
    $index{$line} = $line; # Or however you do it...
}
close $file_b;

#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
    chomp $line;
    my @fields = split /\s+/, $line;
    if (exists $index{$fields[3]}) {
        say "Line: $line";
    }
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium-sized pattern files (< 1 MB). I guess it tries every pattern for each line of the input stream.
A solution that was faster for me was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating over the larger file multiple times.
while read line; do
    grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation, use sort -u, but that might be slower by quite a bit. You have to try.
while read line; do
    grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
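One thing worth trying before abandoning plain grep: if the cut-down patterns are fixed strings rather than regexes, telling grep so with -F can speed up -f considerably while preserving the line order:
grep -F -f fileBcutdown fileA > outputfile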
Use the below command:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA
