Perl script: Removing self-duplicate rows - linux

I want to process some Twitter data sets with a Perl script. The file is in CSV format.
I want to remove self-addressed mentions, i.e. rows where a user mentions themselves.
For example, the CSV columns and data look like this:
user, mention(user), message
vims789, vnjuei234, yea this is good
dfion, youwen12, this is win
don234, don234, this is green
wen123, tileas, this is blue
The row "don234, don234" is a user mentioning themselves, so that line should be deleted. The result should be:
user, mention(user), message
vims789, vnjuei234, yea this is good
dfion, youwen12, this is win
wen123, tileas, this is blue

Maybe something like this:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
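# allow_whitespace strips the blanks that follow each comma in the sample data;
# without it " don234" (leading space) would not compare equal to "don234"
my $csv = Text::CSV->new( { allow_whitespace => 1 } );
while ( my $row = $csv->getline( \*DATA ) ) {
    my ( $user, $mention, $message ) = @$row;
    # print the whole row, not only the message, unless the user mentions themselves
    print join( ', ', @$row ), "\n" unless $user eq $mention;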
}
__DATA__
user, mention(user), Message
vims789, vnjuei234, yea this is good
dfion, youwen12, this is win
don234, don234, this is green
wen123, tileas, this is blue

You can do this very quickly with a back-reference. Since you want to find something, a comma, some space, and then that something again, assuming that the string will be all word characters, this should work:
my $regex
    = qr{ ^     # beginning of the line
          (\w+) # A "word"
          ,     # A comma
          \s+   # space
          \1    # a back reference to the first capture.
          \b    # demand that it end the sequence of word characters.
        }x;
my @filtered_lines = grep { !m/$regex/ } @lines;
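The back-reference filter can be tried on the sample data from the question with a minimal, self-contained sketch like the one below. The helper name drop_self_mentions is made up for illustration, and the sketch assumes that user names consist only of word characters and that the mention is followed by another comma.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the lines whose first two fields differ.
# Assumes user names contain only word characters (\w).
sub drop_self_mentions {
    my @lines = @_;
    # first field, comma, optional space, the same field again, then a comma
    my $self_mention = qr{ ^ (\w+) , \s* \1 \s* , }x;
    return grep { !/$self_mention/ } @lines;
}

my @lines = (
    'user, mention(user), message',
    'vims789, vnjuei234, yea this is good',
    'dfion, youwen12, this is win',
    'don234, don234, this is green',
    'wen123, tileas, this is blue',
);

# Prints every line except the "don234, don234" one.
print "$_\n" for drop_self_mentions(@lines);
```

Requiring the trailing comma after the back-reference keeps a row such as "don, don234, ..." from being treated as a self-mention.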

Related

How can I split my data in small enough chunks to feed to Seq?

I am working on a bioinformatics project where I am looking at very large genomes. Seg only reads 135 lines at a time, so when we feed the genomes in it gets overloaded. I am trying to create a Perl command that will split the data into 135-line sections. The character limit would be 10,800 since there are 80 columns. This is what I have so far:
#!/usr/bin/perl
use warnings;
use strict;
my $str =
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';
substr($str,17) = "";
print "$str";
It splits at the 17th character but only prints that section; I want it to continue printing the rest of the data. How do I add a command that shows the rest of the data? It should split at every 17th character and keep going (then of course I can go back in and scale it up to the size I actually need).
I assume that the "very large genome" is stored in a very large file, and that it is fine to collect data by number of lines (and not by number of characters) since this is the first mentioned criterion.
Then you can read the file line by line and assemble lines until there are 135 of them. Then hand them off to a program or routine that processes them, empty your buffer, and keep going:
use warnings;
use strict;
use feature 'say';
my $file = shift || 'default_filename.txt';
my $num_lines_to_process = 135;
open my $fh, '<', $file or die "Can't open $file: $!";
my $line_counter = 0; # start at 0 so the first comparison doesn't warn
my @buffer;
while (<$fh>) {
    chomp;
    if ($line_counter == $num_lines_to_process)
    {
        process_data(\@buffer);
        @buffer = ();
        $line_counter = 0;
    }
    push @buffer, $_;
    ++$line_counter;
}
process_data(\@buffer) if @buffer; # last batch
sub process_data {
    my ($rdata) = @_;
    say for @$rdata; say '---'; # print data for a test
}
If your processing application/routine wants a string, you can append to a string every time instead of adding to an array, $buffer .= $_; and clear that by $buffer = ''; as needed.
If you need to pass a string but there is also some use of an array while collecting data (intermediate checks/pruning/processing?), then collect lines into an array and use as needed, and join into a string before handing it off, my $data = join '', @buffer;
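The array-then-join variant could look roughly like the sketch below. The helper name batch_lines is hypothetical, the lines are deliberately not chomped so that the joined string keeps its newlines, and the demonstration reads from an in-memory file with a batch size of 2 rather than 135.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from a filehandle and return them
# grouped into strings of at most $size lines each (newlines kept).
sub batch_lines {
    my ($fh, $size) = @_;
    my (@batches, @buffer);
    while (my $line = <$fh>) {
        push @buffer, $line;
        if (@buffer == $size) {
            push @batches, join('', @buffer);
            @buffer = ();
        }
    }
    push @batches, join('', @buffer) if @buffer;   # last, shorter batch
    return @batches;
}

# Demonstration on an in-memory file with a batch size of 2 instead of 135.
my $data = ">AATTCCGG\nTTCCGGAA\nCCGGTTAA\nAAGGTTCC\n>AATTCCGG\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
print "$_---\n" for batch_lines($fh, 2);
close $fh;
```

Returning all batches at once is only reasonable for a small demonstration; for a very large genome file, handing each batch to the processing routine as soon as it is full (as in the loop above) keeps memory use flat.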
You can also make use of the $. variable and the modulo operator (%)
while (<$fh>) {
    chomp;
    push @buffer, $_;
    if ($. % $num_lines_to_process == 0) # every $num_lines_to_process
    {
        process_data(\@buffer);
        @buffer = ();
    }
}
process_data(\@buffer) if @buffer; # last batch
In this case we need to store a line first and then check its number, since $. (the number of the last line read from a filehandle, see perlvar) starts from 1 (not 0).
substr returns the removed part of a string; you can just run it in a loop:
while (length $str) {
my $substr = substr $str, 0, 17, "";
print $substr, "\n";
}
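As an alternative to the destructive four-argument substr loop, a global regex match can produce the same pieces while leaving the original string untouched. This is a different technique from the one in the answer above, shown only as a sketch; the helper name chunk_string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into pieces of at most $width characters.
# Call it in list context to get the pieces back.
sub chunk_string {
    my ($str, $width) = @_;
    # /s lets . match newlines too; /g in list context returns every match.
    return $str =~ /(.{1,$width})/gs;
}

my $str = ">AATTCCGG\nTTCCGGAA\nCCGGTTAA\nAAGGTTCC\n>AATTCCGG";
print "[$_]\n" for chunk_string($str, 17);
```

Note that newlines count as characters here, just as they do with substr, so a piece may contain a line break.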

Perl: String in Substring or Substring in String

I'm working with DNA sequences in a file, and this file is formatted something like this, though with more than one sequence:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
I need to be able to tell if a variable (which is also a sequence) matches any of the sequences in the file, and what the name of the sequence it matches, if any, is. Because of the nature of these sequences, my entire variable could be contained in a line of the file, or a line of the file could be a part of my variable.
Right now my code looks something like this:
use warnings;
use strict;
my $filename = "/users/me/file/path/file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open file, "<$filename" or die "Can't find file";
my @Name;
my @Sequence;
my $inx = 0;
while (<file>){
$Name[$inx] = <file>;
$Sequence[$inx] = <file>;
$indx++;
}unless(index($Sequence[$inx], $exampleentry) != -1 || index($exampleentry, $Sequence[$inx]) != -1){
$returnval = "The sequence matches: ". $Name[$inx];
}
print $returnval;
However, even when I purposely set $exampleentry as a match from the file, I still return The sequence does not match any in the file. Also, when running the code, I get Use of uninitialized value in index at thiscode.pl line 14, <file> line 3002. as well as Use of uninitialized value within @Name in concatenation (.) or string at thiscode.pl line 15, <file> line 3002.
How can I perform this search?
I will assume that the purpose of this script is to determine if $exampleentry matches any record in the file file.txt. Here, a record describes a DNA sequence and corresponds to three consecutive lines in the file. The variable $exampleentry matches the sequence if it matches the third line of the record. A match means that either
$exampleentry is a substring of $line, or
$line is a substring of $exampleentry,
where $line refers to the corresponding line in the file.
First, consider the input file file.txt:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
In the program you try to read these two lines using three calls to readline. Accordingly, the last call to readline returns undef, since there are no more lines to read.
It therefore seems reasonable that the two last lines in file.txt are malformed, and the correct format should be:
>name of sequence
EXAMPLESEQUENCE
ATCGATCGATCG
If I now understand you correctly, I hope this could solve your problem:
use feature qw(say);
use strict;
use warnings;
my $filename = "file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open (my $fh, '<', $filename ) or die "Can't find file: $!";
my @name;
my @sequence;
my $inx = 0;
while (<$fh>) {
    chomp ($name[$inx] = <$fh>);
    chomp ($sequence[$inx] = <$fh>);
    if (
        index($sequence[$inx], $exampleentry) != -1
        || index($exampleentry, $sequence[$inx]) != -1
    ) {
        $returnval = "The sequence matches: ". $name[$inx];
        last;
    }
}
say $returnval;
Notes:
I have changed variable names to follow snake_case convention. For example the variable @Name is better written using all lower case as @name.
I changed the open() call to follow the new recommended 3-parameter style, see Don't Open Files in the old way for more information.
Used feature say instead of print
Added a chomp after each readline to avoid storing newline characters in the arrays.
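If the file really uses the two-line layout shown in the question (a ">name" line followed by one sequence line), a reader along the following lines might be closer to what is needed. This is only a sketch under that assumption; the helper name find_match and the sample data are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the first record whose sequence
# contains $query or is contained in it, or undef when nothing matches.
# Assumes records are exactly two lines: ">name" followed by the sequence.
sub find_match {
    my ($fh, $query) = @_;
    while (defined(my $name = <$fh>)) {
        my $sequence = <$fh>;
        last unless defined $sequence;      # truncated final record
        chomp($name, $sequence);
        next unless length $sequence;       # an empty line would match everything
        return $name
            if index($sequence, $query) != -1
            || index($query, $sequence) != -1;
    }
    return;
}

# Invented sample data in the assumed two-line format.
my $data = ">first\nGGGGCCCC\n>second\nEXAMPLESEQUENCEATCGATCGATCG\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my $name = find_match($fh, 'ATCG');
print defined $name ? "The sequence matches: $name\n"
                    : "The sequence does not match any in the file\n";
```

Reading the name and the sequence with one readline each per iteration keeps the two in step, which avoids the uninitialized-value warnings that appear when a loop consumes more lines per record than the file provides.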

perl program for reading file contents

I want to write a Perl program that opens a file, reads its content, and prints the number of lines, words and characters it contains. I also want to print the number of times a specific word appears in the file. Here is what I have done:
#! /usr/bin/perl
open( FILE, "test1.txt" ) or die "could not open file $!";
my ( $line, $words, $chars ) = ( 0, 0, 0 );
while (<FILE>) {
$line++;
$words += scalar( split( /\s+/, $_ ) );
$chars += length($_);
print $_;
}
$chars -= $words;
print(
"Total number of lines in the file:= $line \nTotal number of words in the file:= $words \nTotal number of chars in the file:= $chars\n"
);
As you can clearly see, I don't have any provision for taking user input of the words whose occurrence is to be counted. Because I don't know how to do it. Please help with counting of the number of occurrence part. Thank you
I guess you're doing this for learning purposes, so here is a readable version of a solution (there might be a thousand others, because it's Perl). If not, there's wc on the Linux command line.
Note that I'm using three-argument open; it's generally better to do that.
For counting single words you'll most probably need a hash. I also used <<HERE docs, because they are nicer for formatting. If you have any doubts, just look in the perldoc and ask your questions.
#!/usr/bin/env perl
use warnings; # Always use this
use strict; # ditto
my ($chars,$word_count ,%words);
{
open my $file, '<', 'test.txt'
or die "couldn't open `test.txt':\n$!";
while (<$file>){
foreach (split){
$word_count++;
$words{$_}++;
$chars += length;
}
}
} # $file is now closed
print <<THAT;
Total number of lines: $.
Total number of words: $word_count
Total number of chars: $chars
THAT
# Now to your questioning part:
my $prompt= <<PROMPT.'>>';
Please enter the words you want the occurrences for. (CTRL+D ends the program)
PROMPT
print $prompt;
while(<STDIN>){
chomp; # get rid of the newline
print "$_ ".(exists $words{$_}?"occurs $words{$_} times":"doesn't occur")
." in the file\n",$prompt;
}
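The hash-based counting at the heart of that program can be isolated into a small sketch. The helper name count_words is made up for illustration, and, unlike the program above, this version folds words to lower case before counting.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each whitespace-separated word
# occurs in a string, ignoring case. Returns a hash reference.
sub count_words {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for split ' ', $text;
    return \%count;
}

my $count = count_words("the cat saw The dog\nthe end\n");
for my $word (qw(the dog bird)) {
    # a word that never occurred has no hash entry, so default to 0
    printf "%s occurs %d time(s)\n", $word, $count->{$word} // 0;
}
```

Punctuation stays attached to words in this sketch ("end." and "end" would be counted separately), which may or may not be what you want.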

Perl: Removing characters up to certain point.

I've tried searching through questions already asked, but can't seem to find anything. I'm sure it's incredibly simple to do, but I am completely new to Perl.
What I am trying to do is remove characters in a string up to a certain point. For example, I have:
Parameter1 : 0xFFFF
and what I would like to do is remove the "Parameter1:" and be left with just the "0xFFFF". If anyone can help and give a simple explanation of the operators used, that'd be great.
Sounds like you need the substr function.
#!/usr/bin/perl
use strict;
use warnings;
my $string = 'Parameter1 : 0xFFFF';
my $fragment = substr $string, 13; # skip the 13 characters of "Parameter1 : "
print " string: <$string>\n";
print "fragment: <$fragment>\n";
s/.*:\s*//;
or
$s =~ s/.*:\s*//;
This deletes everything up to and including the last occurrence of : on the line (.* is greedy), along with any whitespace that follows it. With $s =~ it's applied to $s; without it, it's applied to $_.
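The effect of the greedy .* only shows up when the value itself contains a colon. A small sketch, using an invented input line, compares it with a pattern that stops at the first colon:

```perl
use strict;
use warnings;

# Invented example line whose value contains a colon.
my $line = 'Parameter1 : 12:30';

(my $greedy = $line) =~ s/.*:\s*//;       # .* is greedy: cuts at the LAST colon
(my $first  = $line) =~ s/^[^:]*:\s*//;   # [^:]* cannot cross a colon: cuts at the FIRST

print "greedy: <$greedy>\n";   # greedy: <30>
print "first:  <$first>\n";    # first:  <12:30>
```

For input like "Parameter1 : 0xFFFF", which has a single colon, both substitutions give the same result.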
Have you considered using something like Config::Std?
Here is how to parse a configuration file like that by hand:
#!/usr/bin/perl
use strict; use warnings;
my %params;
while ( my $line = <DATA> ) {
if ($line =~ m{
^
(?<param> Parameter[0-9]+)
\s*? : \s*?
(?<value> 0x[[:xdigit:]]+)
}x ) {
$params{ $+{param} } = $+{value};
}
}
use YAML;
print Dump \%params;
__DATA__
Parameter1 : 0xFFFF
Parameter3 : 0xFAFF
Parameter4 : 0xCAFE
With Config::Std:
#!/usr/bin/perl
use strict; use warnings;
use Config::Std;
my $config = do { local $/; <DATA> };
read_config \$config, my %params;
use YAML;
print Dump \%params;
__DATA__
Parameter1 : 0xFFFF
Parameter3 : 0xFAFF
Parameter4 : 0xCAFE
Of course, in real life, you'd pass a file name to read_config instead of slurping it.
I like split for these parameter/value pairs.
my $str = "Parameter1 : 0xFFFF";
my ($param, $value) = split /\s*:\s*/, $str, 2;
Note the use of LIMIT in the split, which limits the split to two fields (in case of additional colons in the value).
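What the LIMIT buys you can be seen with a value that contains colons of its own. The following is a small sketch; the helper name parse_param and the Timestamp line are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split "name : value" into its two parts.
sub parse_param {
    my ($str) = @_;
    my ($param, $value) = split /\s*:\s*/, $str, 2;
    return ($param, $value);
}

my ($param, $value) = parse_param('Timestamp : 12:30:45');
print "$param => $value\n";            # Timestamp => 12:30:45

# Without the limit the value is cut at every colon.
my @fields = split /\s*:\s*/, 'Timestamp : 12:30:45';
print scalar(@fields), " fields\n";    # 4 fields
```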

How can I parse people's full names into user names in Perl?

I need to convert a name in the format Parisi, Kenneth into the format kparisi.
Does anyone know how to do this in Perl?
Here is some sample data that is abnormal:
Zelleb, Charles F.,,IV
Eilt, John,, IV
Woods, Charles R.,,III
Welkt, Craig P.,,Jr.
These specific names should end up as czelleb, jeilt, cwoods, cwelkt, etc.
I have one more condition that is ruining my name builder:
O'Neil, Paul
So far, Vinko Vrsalovic's answer is working the best when weird/corrupt names are in the mix, but the example above would come out as "pneil"... I'd be damned below Judas if I can't get that o between the p and the n.
vinko@parrot:~$ cat genlogname.pl
use strict;
use warnings;
my @list;
push @list, "Zelleb, Charles F.,,IV";
push @list, "Eilt, John,, IV";
push @list, "Woods, Charles R.,,III";
push @list, "Welkt, Craig P.,,Jr.";
for my $name (@list) {
    print gen_logname($name)."\n";
}
sub gen_logname {
    my $n = shift;
    #Filter out unneeded characters
    $n =~ s/['-]//g;
    #This regex will grab the lastname a comma, optionally a space (the
    #optional space is my addition) and the first char of the name,
    #which seems to satisfy your condition
    $n =~ m/(\w+), ?(.)/;
    return lc($2.$1);
}
vinko@parrot:~$ perl genlogname.pl
czelleb
jeilt
cwoods
cwelkt
I would start by filtering the abnormal data so you only have regular names. Then something like this should do the trick
$t = "Parisi, Kenneth";
$t =~ s/(.+),\s*(.).*/\l$2\l$1/;
Try:
$name =~ s/(\w+),\s(\w)/$2$1/;
$name = lc $name;
\w here matches an alphanumerical character. If you want to be more specific, you could also use [a-z] instead, and pass the i flag (case insensitive):
$name =~ s/([a-z]+),\s([a-z])/$2$1/i;
Here's a one line solution, assuming you store all the names in a file called "names" (one per line) and you will do duplicated name detection somehow later.
cat names | perl -e 'while(<>) {/^\s*(\S*)?,\s*(\S)/; print lc "$2$1\n";}' | sed s/\'//g
It looks like your input data is comma-separated. To me, the clearest way to do this would be split into components, and then generate the login names from that:
while (<>) {
chomp;
my ($last, $first) = split /,/, lc $_;
$last =~ s/[^a-z]//g; # strip out nonletters
$first =~ s/[^a-z]//g; # strip out nonletters
my $logname = substr($first, 0, 1) . $last;
print $logname, "\n";
}
$rowfetch =~ s/['-]//g; #All chars inside the [ ] will be filtered out.
$rowfetch =~ m/(\w+), ?(.)/;
$rowfetch = lc($2.$1);
This is how I ended up using Vinko Vrsalovic's solution. It's inside a while loop that goes through a SQL query result. Thanks again, Vinko.
This should do what you need
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
    say abbreviate($_);
}
sub abbreviate {
    for ( @_ ) {
        s/[-']+//g;
        tr/A-Z/a-z/;
        tr/a-z/ /c;
        return "$2$1" if /([a-z]+)\s+([a-z])/;
    }
}
__DATA__
Zelleb, Charles F.,,IV
Eilt, John,, IV
Woods, Charles R.,,III
Welkt, Craig P.,,Jr.
O'Neil, Paul
output
czelleb
jeilt
cwoods
cwelkt
poneil
