Perl cutting string - string

I have an array of numbers and a file of strings, shown below, and I have written a Perl script called "string cutter". I can get the cut strings, but I can't get the first "n" characters specified by the array of numbers. Any idea? I do not know why substr does not work.
string Cutter
file 1:
1234567890
0987654321
1234546789
ABCDEFGHIJ
JIHGFEDCBA
file 2: array of given length
2, 3, 4, 2, 1
Current Result:
34567890
7654321
546789
CDEFGHIJ
IHGFEDCBA
Supposed to be Result (perhaps \t delimited):
12 34567890
098 7654321
1234 546789
AB CDEFGHIJ
J IHGFEDCBA
My code:
#!/usr/bin/perl
use warnings;
use strict;
if (@ARGV != 2) {
    die "Invalid usage\n"
      . "Usage: perl program.pl [num_list] [string_file]\n";
}
my ($number_f, $string_f) = @ARGV;
open my $LIST, '<', $number_f or die "Cannot open $number_f: $!";
my @numbers = split /, */, <$LIST>;
close $LIST;
open my $DATA, '<', $string_f or die "Cannot open $string_f: $!";
while (my $string = <$DATA>) {
    substr $string, 0, shift @numbers, q(); # Replace the first n characters with an empty string.
    print $string;
}
Many thanks

perldoc -f substr:
Extracts a substring out of EXPR and returns it
The four-argument form also returns the part it removed, so capture that return value instead of discarding it:
my $prefix = substr $string, 0, shift @numbers, q();
print $prefix . " " . $string;
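To see the return value of four-argument substr in isolation, here is a minimal sketch that uses the sample data from the question in memory instead of reading files. The helper name cut_prefix is hypothetical and only there for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split $string into its first $n characters and the rest.
sub cut_prefix {
    my ($string, $n) = @_;
    # Four-argument substr replaces the first $n characters of the local
    # copy with '' and returns the characters it removed.
    my $prefix = substr $string, 0, $n, q();
    return ($prefix, $string);
}

# Sample data copied from the question (assumed to line up one-to-one).
my @strings = qw(1234567890 0987654321 1234546789 ABCDEFGHIJ JIHGFEDCBA);
my @numbers = (2, 3, 4, 2, 1);

for my $i (0 .. $#strings) {
    my ($prefix, $rest) = cut_prefix($strings[$i], $numbers[$i]);
    print "$prefix\t$rest\n";    # tab-delimited, as the question suggests
}
```

Because the helper works on a copy, the caller's strings are left untouched; the original loop modifies $string in place, which is fine there because each line is read fresh.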

Related

How to remove values from text file from different position

I have a file containing different values:
30,-4,098511E-02
30,05,-4,098511E-02
41,9,15,54288
I need to remove values from this file but from different position, for example:
30
30,05
41,9
I tried to do it with sed to remove the last value but my problem is when I encounter the 41,9,15,54288 it does not work. Any idea if there is a way to do it?
I tried this
echo "30,-4,098511E-02" | sed 's/,.*/,/'
Using sed
$ sed -E 's/(([0-9]+,?){1,2}),[0-9-].*/\1/' input_file
30
30,05
41,9
I would do it using perl, like this:
#!/usr/bin/perl
use strict;
use warnings;
my $inputPath  = '';  # e.g. '/Users/myuser/Desktop/inputs/a.txt'
my $outputPath = '';  # e.g. '/Users/myuser/Desktop/outputs/a_result.txt'
if ($inputPath eq "") {
    print "Enter the full path of your input file: ";
    $inputPath = <STDIN>;
    chomp $inputPath;
}
if ($outputPath eq "") {
    print "Enter the full path of your output file: ";
    $outputPath = <STDIN>;
    chomp $outputPath;
}
open my $info, '<', $inputPath or die "Could not open $inputPath: $!";
open FH, '>', $outputPath or die "Could not open $outputPath: $!";
while (my $line = <$info>) {
    chomp $line;
    # Example line: 30,05,-4,098511E-02
    # [0-9]+   one or more digits                               30
    # ,?       optionally followed by a comma                   30,
    # {1,2}    that digits-plus-comma group, once or twice      30,05
    # ,        the comma that ends the part we keep
    # [0-9-]   a digit or a dash starts the part we drop        -
    # .*       the rest of the line (greedy)                    4,098511E-02
    if ($line =~ m{^((?:[0-9]+,?){1,2}),[0-9-].*}) {
        print FH "$1\n"; # $1 holds only the leading part we keep
    } else {
        print "not found!\n";
    }
}
close $info;
close FH;
I wrote the explanations of my regex in the comments of my code.
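As a quick check of the pattern without any file handling, here is a minimal sketch that runs it against the three sample lines from the question. The function name leading_fields is hypothetical; the pattern is the same one used in the sed answer, anchored at the start of the line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the leading one or two numeric fields of a
# line, or undef when the line does not have the expected shape.
sub leading_fields {
    my ($line) = @_;
    # Group 1: one or two runs of digits (with an optional comma between
    # them), followed by a comma and then a digit or a dash.
    return $line =~ m{^((?:[0-9]+,?){1,2}),[0-9-]} ? $1 : undef;
}

for my $line ('30,-4,098511E-02', '30,05,-4,098511E-02', '41,9,15,54288') {
    my $kept = leading_fields($line);
    print defined $kept ? "$kept\n" : "not found!\n";
}
```

The backtracking matters here: the group first grabs the trailing comma greedily, then gives it back so that the literal comma after the group can match.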

How to create new string by replacing with string from another file in awk/linux

I have two files as shown below:
FileA:
AGCTTTTCATTC...
FileB:
POS ID REF ALT
2 . G C
8 . C -
12 . - T
The POS column in FileB gives the string position in FileA; i.e. 2 under POS means the second letter, "G", in the sequence in FileA.
Based on the POS column in FileB, FileA should be modified to produce FileA' and FileB' as shown below.
FileA' is formed by inserting the string under the REF column at the respective position given in the POS column.
FileB' is formed by replacing with the string under the ALT column at the respective position given in the POS column.
FileA':
AGCTTTTCATT-TC...
FileB':
ACCTTTT-ATTTTC...
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
open my $DNA, '<', 'FileA' or die $!;
open my $POS, '<', 'FileB' or die $!;
my $dna = <$DNA>;
my $ref_out = $dna;
my $alt_out = $dna;
while (<$POS>) {
my ($pos, $id, $ref, $alt) = split;
next unless $pos =~ /^\d+$/; # Skip the header.
substr $ref_out, $pos - 1, 1, $ref;
substr $alt_out, $pos - 1, 1, $alt;
}
open my $REF, '>', "FileA'" or die $!;
print {$REF} $ref_out;
close $REF;
open my $ALT, '>', "FileB'" or die $!;
print {$ALT} $alt_out;
close $ALT;
You just keep two copies of the input string and modify each one via substr for each line of FileB.
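To make the effect of those substr calls concrete, here is a minimal in-memory sketch using the ALT column from the question. The helper name apply_edits and the list-of-pairs input are assumptions for illustration, not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: overwrite one character per edit.
# Each edit is a [1-based position, replacement character] pair.
sub apply_edits {
    my ($sequence, @edits) = @_;
    for my $edit (@edits) {
        my ($pos, $char) = @$edit;
        # Length 1 overwrites the character at $pos, so later positions
        # stay valid; a length of 0 would insert instead and shift them.
        substr $sequence, $pos - 1, 1, $char;
    }
    return $sequence;
}

my $dna = 'AGCTTTTCATTC';
my @alt = ([2, 'C'], [8, '-'], [12, 'T']);    # POS and ALT columns of FileB
print apply_edits($dna, @alt), "\n";
```

Note that this overwrites the base at each position. If the intent is to insert, as the expected FileA' in the question seems to suggest, the length argument and the position bookkeeping would need to change.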

Perl: append numbers from one file to strings of second file

I would like to append numbers attached to (Unicode) strings in one file, to matched strings in a second file. Somehow I can't wrap my head around how to do this. Here is what my two files look like.
File 1:
दौरा, 2
प्रोत्साहन, 1
प्रगति, 4
File 2:
दौरा
dorA
प्रोत्साहन
prua2ts3Ahan
prua2ts2Ahan
prua2tsAhan
prua2t2s3Ahan
prua2t2s2Ahan
prua2t2sAhan
prOts3Ahan
prOts2Ahan
prOtsAhan
prOt2s3Ahan
prOt2s2Ahan
prOt2sAhan
प्रगति
praGat2I
praGatI
pragat2I
pragatI
The desired result would look like this:
Output:
dorA, 2
prua2ts3Ahan, 1
prua2ts2Ahan, 1
prua2tsAhan, 1
prua2t2s3Ahan, 1
prua2t2s2Ahan, 1
prua2t2sAhan, 1
prOts3Ahan, 1
prOts2Ahan, 1
prOtsAhan, 1
prOt2s3Ahan, 1
prOt2s2Ahan, 1
prOt2sAhan, 1
praGat2I, 4
praGatI, 4
pragat2I, 4
pragatI, 4
I have a hash created from File 1 that has the strings as keys and the numbers as values. Now it's a matter of matching these keys in File 2, collecting all the lines that follow each match, and appending the values to those lines. Can someone point me in the right direction?
Your description of the solution is correct. Now just translate it to code:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
open my $F1, '<:encoding(UTF-8)', 'file.1' or die $!;
while (<$F1>) {
chomp;
my ($word, $num) = split /, /;
$hash{$word} = $num;
}
open my $F2, '<:encoding(UTF-8)', 'file.2' or die $!;
my $word;
while (<$F2>) {
chomp;
if (exists $hash{$_}) {
$word = $_;
} elsif ($_) {
print "$_, $hash{$word}\n";
} else {
print "\n";
}
}
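The same "remember the last header seen" idea can be tried without files. This is a minimal sketch on a shortened copy of the sample data; the function name append_counts is hypothetical, and the guard for lines that appear before any header is an addition the answer above does not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the source of this sketch contains Devanagari literals

# Hypothetical helper: tag every non-header line with the number that
# belongs to the most recent header line.
sub append_counts {
    my ($counts, @lines) = @_;
    my ($current, @out);
    for my $line (@lines) {
        if (exists $counts->{$line}) {
            $current = $line;                       # a header from File 1
        } elsif (length $line && defined $current) {
            push @out, "$line, $counts->{$current}";
        }
    }
    return @out;
}

my %counts = ('दौरा' => 2, 'प्रगति' => 4);
my @lines  = ('दौरा', 'dorA', 'प्रगति', 'praGat2I', 'praGatI');
print "$_\n" for append_counts(\%counts, @lines);
```

Only the ASCII transliterations are printed, so no output encoding layer is needed here; printing the Devanagari keys themselves would call for binmode on STDOUT.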

Perl Inserting a string from a file after every occurence of a slash in a url

I have the following URLs:
FILE1.txt
http://www.stackoveflow.com/dog/cat/rabbit/hamster/
192.168.192.168/lion/tiger/elephant/
FILE2.txt
HELLO
GOODBYE
The output I am trying to achieve:
http://www.stackoveflow.com/dogHELLO/cat/rabbit/hamster/
http://www.stackoveflow.com/dog/catHELLO/rabbit/hamster/
http://www.stackoveflow.com/dog/cat/rabbitHELLO/hamster/
http://www.stackoveflow.com/dog/cat/rabbit/hamsterHELLO/
http://www.stackoveflow.com/dog/cat/rabbit/hamster/HELLO
http://www.stackoveflow.com/dogGOODBYE/cat/rabbit/hamster/
http://www.stackoveflow.com/dog/catGOODBYE/rabbit/hamster/
http://www.stackoveflow.com/dog/cat/rabbitGOODBYE/hamster/
http://www.stackoveflow.com/dog/cat/rabbit/hamsterGOODBYE/
http://www.stackoveflow.com/dog/cat/rabbit/hamster/GOODBYE
192.168.192.168/lionHELLO/tiger/elephant/
192.168.192.168/lion/tigerHELLO/elephant/
192.168.192.168/lion/tiger/elephantHELLO/
192.168.192.168/lion/tiger/elephant/HELLO
192.168.192.168/lionGOODBYE/tiger/elephant/
192.168.192.168/lion/tigerGOODBYE/elephant/
192.168.192.168/lion/tiger/elephantGOODBYE/
192.168.192.168/lion/tiger/elephant/GOODBYE
As you can see the strings HELLO and GOODBYE are inserted after every slash, and if there is already a string after the slash it will append the HELLO and GOODBYE after that (e.g http://www.stackoveflow.com/dogHELLO/cat/rabbit/hamster/ and so on).
What I have tried
use strict;
use warnings;
my @f1 = do {
    open my $fh, '<', 'FILE1.txt';
    <$fh>;
};
chomp @f1;
my @f2 = do {
    open my $fh, '<', 'FILE2.txt';
    <$fh>;
};
chomp @f2;
for my $f1 (@f1) {
    my @fields = $f1 =~ m{[^/]+}g;
    for my $f2 (@f2) {
        for my $i (0 .. $#fields) {
            my @new = @fields;
            $new[$i] .= $f2;
            print qq{/$_/\n}, for join '/', @new;
        }
        print "\n\n";
    }
}
#courtesy of Borodin
However, this code does not cater for URLs that have slashes in the http:// part; these are turned into http:HELLO/ when they should be left alone.
It also does not put HELLO or GOODBYE after a slash if there is no string already there, e.g. http://www.stackoveflow.com/dog/cat/rabbit/hamster/<--SHOULD PUT HELLO AFTER THIS SLASH AS WELL BUT DOESN'T
It appears that this code removes the slashes and then re-inserts them along with the strings from FILE2.txt, as opposed to inserting HELLO and GOODBYE in the correct place to start with.
My question
Is there a better method of going about achieving the output I require or is there something I can do to my existing code to cater for the problems described above?
Your help is much appreciated, many thanks
Here is the algorithm in prose:
Open File2.txt. Read in all lines, removing the newline. We call the array @words.
Open File1.txt. We call the file handle $fh.
As long as we can read a $line from $fh:
    Remove the newline, remove starting and ending slashes.
    Split the $line at every slash, call the array @animals.
    Loop through the @words, calling each element $word:
        Loop through the indices of the @animals, calling each index $i:
            Make a @copy of the @animals.
            Append the $word to the $i-th element of @copy.
            Join the @copy with slashes, surround it with slashes, and print with newline.
        Print an empty line.
This program will do what you ask.
use strict;
use warnings;
use autodie;
my @f1 = do {
    open my $fh, '<', 'FILE1.txt';
    <$fh>;
};
chomp @f1;
my @f2 = do {
    open my $fh, '<', 'FILE2.txt';
    <$fh>;
};
chomp @f2;
for my $f1 (@f1) {
    my @fields = $f1 =~ m{[^/]+}g;
    for my $f2 (@f2) {
        for my $i (0 .. $#fields) {
            my @new = @fields;
            $new[$i] .= $f2;
            print qq{/$_/\n}, for join '/', @new;
        }
        print "\n\n";
    }
}
output
/dogHELLO/cat/rabbit/hamster/
/dog/catHELLO/rabbit/hamster/
/dog/cat/rabbitHELLO/hamster/
/dog/cat/rabbit/hamsterHELLO/
/dogGOODBYE/cat/rabbit/hamster/
/dog/catGOODBYE/rabbit/hamster/
/dog/cat/rabbitGOODBYE/hamster/
/dog/cat/rabbit/hamsterGOODBYE/
/lionHELLO/tiger/elephant/
/lion/tigerHELLO/elephant/
/lion/tiger/elephantHELLO/
/lionGOODBYE/tiger/elephant/
/lion/tigerGOODBYE/elephant/
/lion/tiger/elephantGOODBYE/
Rather than splitting the line on every slash, you can do it all with a regex.
Updated version:
#!/usr/bin/perl
use strict;
use warnings;
my @insert_words = qw/HELLO GOODBYE/;
while (<DATA>)
{
    chomp;
    foreach my $word (@insert_words)
    {
        my $repeat = 1;
        while ((my $match=$_) =~ s|(?<!/)(?:/(?!/)[^/]*){$repeat}[^/]*\K|$word|)
        {
            print "$match\n";
            $repeat++;
        }
        print "\n";
    }
}
__DATA__
/dog/cat/rabbit/hamster/
http://www.stackoverflow.com/dog/cat/rabbit/hamster/
The key is the substitution operator: s|(?<!/)(?:/(?!/)[^/]*){$repeat}[^/]*\K|$word|.
(?<!/) and (?!/) are negative look-behind and look-ahead, respectively. They ensure that we are only matching a single /, thus ignoring http://.
(?:/(?!/)[^/]*){$repeat} is a non-capturing group that must match a specified number of times, and we increase that number until it no longer matches.
I had to use [^/]* instead of [^/]+ to meet your requirement of matching at the end of the string. That is why both the look-behind and the look-ahead are needed.
\K means "match everything up to this point, but don't include it in the match itself." Thus we don't have to worry about including the whole beginning of the string that matched in the replacement.
Note: The r option is another way to perform substitution without modifying the original string. However, it requires Perl 5.14 or later (thanks Amon). Thus I removed it from the example.
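For reuse, the substitution loop can be wrapped in a function that returns the variants instead of printing them. This is only a sketch: the function name insert_after_each_segment is hypothetical, the regex is copied unchanged from the answer above, and \K needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return one copy of $url per path position, with
# $word appended to that segment (or placed after the final slash).
sub insert_after_each_segment {
    my ($url, $word) = @_;
    my @variants;
    my $repeat = 1;
    # Work on a fresh copy each time; stop when there is no $repeat-th slash.
    while ((my $copy = $url) =~ s|(?<!/)(?:/(?!/)[^/]*){$repeat}[^/]*\K|$word|) {
        push @variants, $copy;
        $repeat++;
    }
    return @variants;
}

for my $word (qw(HELLO GOODBYE)) {
    print "$_\n" for insert_after_each_segment('192.168.192.168/lion/tiger/elephant/', $word);
    print "\n";
}
```

Returning a list keeps the regex logic separate from the file handling, so the two input files can be read however is convenient.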

unix - breakdown of how many records in file with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many records (as defined by a specific EOL such as "^%!") had how many occurrences of a specific character sequence, sorted descending by the number of occurrences.
For example, with this sample file:
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
Suggested input: delimiter EOL and filename as arguments.
bash/perl some_script_name ",|" "^%!" samplefile
Desired output:
occs count
3 1
2 1
1 2
0 2
This is because the 1st record had one delimiter, 2nd record had 2, 3rd record had 0, 4th record had 3, 5th record had 1, 6th record had 0.
Bonus pts if you can make the delimiter and EOL argument accept hex input (ie 2C7C) or normal character input (ie ,|) .
Script:
#!/usr/bin/perl
use strict;
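my ($delim, $eol, $file) = @ARGV;
$/ = $eol;                            # read one record per EOL marker
open my $fh, '<', $file or die $!;
my @records = grep { /\S/ } <$fh>;    # drop the whitespace-only tail after the last EOL
close $fh;
my %counts;
for (@records) {
    my $n = () = /\Q$delim\E/g;       # number of delimiter occurrences in this record
    $counts{$n}++;
}
print "$_\t$counts{$_}\n" for sort { $b <=> $a } keys %counts;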
Test:
perl script.pl ',|' '^%!' samplefile
Output:
3 1
2 1
1 2
0 2
This is what perl lives for:
#!perl -w
use 5.12.0;
my ($delim, $eol, $file) = @ARGV;
open my $fh, '<', $file or die "error opening $file: $!";
$/ = $eol; # input record separator
my %counts;
while (<$fh>) {
my $matches = () = $_ =~ /(\Q$delim\E)/g; # "goatse" operator
$counts{$matches}++;
}
say "occs\tcount";
foreach my $num (reverse sort keys %counts) {
say "$num\t$counts{$num}";
}
(If you haven't got 5.12, remove the "use 5.12.0" line and replace each say with print, adding "\n" to the end of the string.)
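The counting idiom can be checked on the sample data without touching $/ or the filesystem. The sketch below splits an in-memory string on the EOL marker instead; the function names count_occurrences and tally are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: how many times does $delim occur in $record?
sub count_occurrences {
    my ($record, $delim) = @_;
    # Assigning the match list to () and then to a scalar yields the
    # number of matches; \Q...\E makes the delimiter literal.
    my $n = () = $record =~ /\Q$delim\E/g;
    return $n;
}

# Hypothetical helper: map "occurrences" to "number of records".
sub tally {
    my ($data, $delim, $eol) = @_;
    my %counts;
    # split drops the empty field after the final EOL marker.
    $counts{ count_occurrences($_, $delim) }++ for split /\Q$eol\E/, $data;
    return %counts;
}

# The sample file from the question, joined without a trailing newline.
my $data = join "\n",
    'jdk,|ljn^%!dk,|sn,|fgc^%!',
    'ydfsvuyx^%!67ds5,|bvujhy,|s6d75',
    'djh,|sudh^%!nhjf,|^%!fdiu^%!';

my %counts = tally($data, ',|', '^%!');
print "occs\tcount\n";
print "$_\t$counts{$_}\n" for sort { $b <=> $a } keys %counts;
```

Sorting with { $b <=> $a } keeps the order numeric, which matters once a record has ten or more occurrences.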
A solution in awk:
BEGIN {
RS="\\^%!"
FS=",\\|"
max_occ = 0
}
{
if(match($0, "^ *$")) { # This is here to deal with the final separator.
next
}
if(NF - 1 > max_occ) {
max_occ = NF - 1
}
count[NF - 1]=count[NF - 1] + 1
}
END {
printf("occs count\n")
for(i = 0; i <= max_occ; i++) {
printf("%s %s\n", i, count[i])
}
}
Well, there's one more empty record at the end of the file which has 0. So, here's a script to do what you wanted. Adding headers and otherwise tweaking the printf output is left as an exercise for you. :)
Basically, read the whole file in, split it into records, and for each record, use a /g regex to count the sub-delimiters. Since /g returns a list of all matches, use @{[]} to make an arrayref, then deref that in scalar context to get a count. There has to be a more elegant solution to that particular part of the problem, but whatever; it's perl line noise. ;)
user@host[/home/user]
$ ./test.pl ',|' '^%!' test.in
3 1
2 1
1 2
0 3
user@host[/home/user]
$ cat test.in
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
user@host[/home/user]
$ cat test.pl
#!/usr/bin/perl
my( $subdelim, $delim, $in,) = @ARGV;
$delim = quotemeta $delim;
$subdelim = quotemeta $subdelim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
foreach( split(/$delim/, join(q{}, <F>)) ){
    $counts{ scalar(@{[m/.*?($subdelim)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
And here's a modified version which only keeps fields which contain at least one non-space character. That removes the last field, but also has the consequence of removing any other empty fields. It also uses $/ and \Q\E to avoid a couple of explicit function calls (thanks, Alex). And, like the previous one, it works with strict + warnings:
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
foreach ( grep(/\S/, <F>) ){
    $counts{ scalar(@{[m/.*?(\Q$subdelim\E)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
If you really only want to remove the last record unconditionally, I'm partial to using pop:
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
my @lines = <F>;
pop @lines;
$counts{ scalar(@{[m/.*?(\Q$subdelim\E)/g]}) }++ foreach (@lines);
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);

Resources