At the beginning I simply used the following to count the length of each line:
while (<FH>) {
    chomp;
    $length = length($_);
}
but when I compared my result with the one produced by the Linux command wc, I found a problem: Perl treats every tab character as one character, whereas wc counts it as eight, so I made the following modification:
while (<FH>) {
    chomp;
    my $length = length($_);
    my $tabCount = tr/\t/\t/;   # tr in its counting idiom: returns the number of tabs in $_
    my $lineLength = $length - $tabCount + ($tabCount * 8);
}
The above code now works for almost all cases, except one: wc does not count every tab as a full eight characters, only the part needed to reach the next tab stop. For example, if at the start of a line I type 1234 and then press Tab, wc does not count that tab as eight characters, but the above code does. Is there any way I could solve this issue? Thanks
Solved it; I used tab expansion. Here is the code:
1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
$length=length($string);
If anyone could give an explanation, that would be awesome; I tested it and it works, but I don't quite understand it. Anyway, thanks for all the help.
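This is the tab-expansion idiom from perlfaq4: each pass replaces the leftmost run of tabs with enough spaces to reach the next multiple-of-8 column (length($&) is the number of tabs in the run, length($`) is the column where the run starts, and /e evaluates the replacement as Perl code), and the 1 while loop repeats until no tabs are left. A minimal standalone sketch, using the "1234 plus Tab" case from the question:
#!/usr/bin/perl
use strict;
use warnings;
# "1234" followed by a tab: the tab should only advance the column
# from 4 to 8, i.e. count as 4 spaces, not 8.
my $string = "1234\tnext";
# perlfaq4 tab-expansion idiom
1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
print length($string), " [$string]\n";   # 12 [1234    next]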
I don't think tabs are your problem; wc doesn't count a tab as eight characters. I think your problem is that you're stripping the line endings, which wc counts. Also, you're not accumulating the lengths; you're just tracking the length of the last line. This:
while (<FH>) {
    chomp;
    $length = length($_);
}
Should be more like this:
my $length = 0;
while (<FH>) {
    $length += length($_);
}
# $length now has the total number of characters
How about just calling wc from within Perl?
$result = `wc -l /path/to/file`;
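Note that wc -l counts lines; for the character totals discussed above, wc -c (bytes) or wc -m (characters) would be the flag to use. A minimal sketch of capturing and parsing wc's output (the file path is a placeholder):
#!/usr/bin/perl
use strict;
use warnings;
my $path    = '/path/to/file';           # placeholder path
my $output  = `wc -c $path`;             # e.g. "  1234 /path/to/file"
my ($count) = $output =~ /^\s*(\d+)/;    # grab the leading number
print "$count\n";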
Related
Question:
I have 2 files. File 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC
File 2 is a FASTA file (hg19.fasta) that looks like this. Although it breaks across lines, each run of A, C, G, and T characters represents one continuous string (i.e. a chromosome). This file is the entire human reference genome, build 19, so the pattern shown here (a > header followed by sequence lines) occurs 23 times, once for each of the 23 chromosomes:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC
>2 dna:chromosome chromosome:GRCh37:1:1:2492300:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
I want to 1) find out how many times each 23bp sequence appears in the second file, without overlapping any others and including sequences that break across lines, and 2) append this number as a new column next to the sequence, so the new file looks like this:
Desired Output:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
My attempt:
I imagine solving the first part will be some variation on grep, so far I've managed:
grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""
which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!
Any help is appreciated as always! =)
Updates and Edits:
The actual size of these files is relatively massive (30 MB or ~500,000 lines for each TSV/BED file), and the FASTA file is the entire human reference genome build 19, which is ~60,000,000 lines. The Perl solution proposed by @choroba works, but doesn't scale well to these sizes.
Unfortunately, because of the need to identify matches across lines, the awk and bash/grep solutions mentioned below won't work.
I want multiple non-overlapping hits in the same chromosome to count as the actual number of hits, i.e. if you search for a sequence and get 2 hits in a single chromosome and 1 in another chromosome, the total count should be 3.
Ted Lyngmo is very kindly helping me develop a solution in C++ that allows this to be run in a realistic timeframe; there's more detail in his post in this thread, and the link to the GitHub for this is here =)
If the second file is significantly bigger than the first one, I would try this awk script:
awk 'v==1 {a[$7]; next}    # Get the patterns from the first file into the array a
     v==2 {                # For each line of the second file
         for (i in a) {                    # loop through all patterns
             a[i] += split($0, b, i) - 1   # accumulate the number of matches on this line
         }
     }
     v==3 {print $0, a[$7]}   # Re-read the first file to append the number of matches
' v=1 file1 v=2 file2 v=3 file1
I'd reach for a programming language like Perl.
#!/usr/bin/perl
use warnings;
use strict;

my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed,   '<', $bed_file   or die "$bed_file: $!";

my $seq;
while (<$fasta>) {
    $seq .= "\n", next if /^>/;   # chromosome boundary: a newline keeps matches from spanning it
    chomp;
    $seq .= $_;
}

while (<$bed>) {
    chomp;
    my $short_seq = (split /\t/, $_)[-1];
    my $count = () = $seq =~ /\Q$short_seq\E/g;
    print "$_\t$count\n";
}
To count overlapping sequences, change the regex to a lookahead.
my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;
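A toy example (not from the original data) shows the difference between the two: in "AAAA", the pattern "AA" occurs twice without overlap but three times with overlap.
#!/usr/bin/perl
use strict;
use warnings;
my $seq       = 'AAAA';   # toy sequence
my $short_seq = 'AA';     # toy pattern
# Plain match: each match consumes its characters, so matches cannot overlap.
my $plain = () = $seq =~ /\Q$short_seq\E/g;
# Lookahead: zero-width, so the engine advances one character per match
# and overlapping occurrences are all counted.
my $overlap = () = $seq =~ /(?=\Q$short_seq\E)/g;
print "non-overlapping: $plain, overlapping: $overlap\n";   # 2, 3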
Since grep -c seems to give you the correct count (matching lines, not counting multiple occurrences on the same line), you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:
#!/bin/bash
# read the fields into the array `v`:
while read -ra v
do
    # print the first 7 elements in the array + the output from `grep -c`:
    echo "${v[@]:0:7}" "$(grep -Fc "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile
outfile will now contain
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
Benchmarks
This table compares the three solutions presented as answers here, timing how long they take to process different numbers of TSV/BED records against the full hg19.fa file (excluding records containing only Ns); hg19.fa contains 57,946,726 such records. As a baseline I've used two versions of a C++ program (called hgsearch/hgsearchmm). hgsearch reads the whole hg19.fa file into memory and then searches it in parallel; hgsearchmm uses a memory-mapped file instead and searches that (also in parallel).
search \ beds          1            2          100         1000        10000
awk             1m0.606s   17m19.899s            -            -            -
perl             13.263s      15.618s    4m48.751s   48m27.267s            -
bash/grep         2.088s       3.670s    3m27.378s   34m41.129s            -
hgsearch          8.776s       9.425s      30.619s    3m56.508s   38m43.984s
hgsearchmm        1.942s       2.146s      21.715s    3m28.265s   34m56.783s
The tests were run on an Intel Core i9 (12 cores / 24 hardware threads) in WSL/Ubuntu 20.04 (SSD disk).
The sources for the scripts and baseline programs used can be found here.
I'm parsing a log file that is space-delimited for the first 7 elements, after which a log message or sentence follows. I know just enough to get around in PowerShell, and I'm learning more each day, so I'm not sure this is the best way to do this; apologies if I'm not leveraging a more efficient means that would be second nature to you. I'm using -split(' ')[n] to extract each field of the log file line by line. I'm able to extract the first parts fine, as they are space-delimited, but I'm not sure how to get the rest of the elements up to the end of the line.
$logFile=Get-Content $logFilePath
$dateStamp=$logfile -split(' ')[0]
$timeStamp=$logfile -split(' ')[1]
$requestID=$logfile -split(' ')[3]
$binaryID=$logfile -split(' ')[4]
$logID=$logfile -split(' ')[5]
$action=$logfile -split(' ')[6]
$logMessage=$logfile -split(' ')[?]
This is not a CSV that I can import. I'm more familiar with string manipulation in bash, so there I am able to successfully replace the spaces in the first 7 elements, and at the end, with ",":
#!/bin/bash
inputFile="/cygdrive/c/Temp/logfile.log"
outputFile="/cygdrive/c/Temp/test_log.csv"
echo "\"DATE\",\"TIME\",\"HYPEN\",\"REQUESTID\",\"BINARY\",\"PROC_NUMBER\",\"MESSAGE\"" > $outputFile
while read -a line
do
    arrLength=${#line[@]}
    echo \"${line[0]}\",\"${line[1]}\",\"${line[2]}\",\"${line[3]}\",\"${line[4]}\",\"${line[5]}\",\"${line[@]:6:$arrLength}\"
done < $inputFile >> $outputFile
Can you help with either printing the array elements from position n to the end, or replacing the spaces appropriately in PS so I have a CSV that I can import? I'm just trying to avoid the two-step process of converting it in bash and then importing it in PS, but I'm still researching. I did find this post, Parsing Text file and placing contents into an Array Powershell, for importing the file assuming it's space-delimited; that works for the first 7 elements, but I'm not sure about everything after that.
Of course I welcome any other PS solutions, such as one of those [something]::SOMETHING things I've seen while googling, that might do all this much more seamlessly.
You can specify the maximum number of substrings in which the string is split like this:
$splittedRow = $logfile.split(' ',8)
$dateStamp=$splittedRow[0]
$timeStamp=$splittedRow[1]
$requestID=$splittedRow[3]
$binaryID=$splittedRow[4]
$logID=$splittedRow[5]
$action=$splittedRow[6]
$logMessage=$splittedRow[7]
As an addition to Viktor Be's answer:
$data = "111 22222 333 4444444 5 6 77 888888 9999999 0" # this is the content of the file below, for testing purposes
#$data = Get-Content -Path C:\temp\mytest.txt
foreach ($line in $data){
    $splitted = $line.split(' ',8)
    $line_output = ""
    for ($i = 0; $i -lt 7; $i++){
        $line_output += "$($splitted[$i]);"
    }
    $line_output += $splitted[7]
    $line_output | Out-File "C:\temp\MyCsvThatPowershellCanRead.csv" -Append
}
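As a side note (not part of either PowerShell answer above), the same split-with-a-field-limit trick exists in Perl, the language used elsewhere in this thread; a small sketch with a made-up log line:
#!/usr/bin/perl
use strict;
use warnings;
# Limit split to 8 fields so the 8th keeps all of its internal spaces.
my $line = "2024-01-01 12:00:00 - req42 bin7 log9 GET this is the message";
my @f = split ' ', $line, 8;   # at most 8 fields; the rest stays joined
print "action:  $f[6]\n";      # GET
print "message: $f[7]\n";      # this is the message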
You should be able to iterate over each line in the logfile and get the information you need the way you are doing it. However, it's easy to grab the message field, which could include any number of spaces, with a regular expression.
The following regex should work for you. Assuming $line is the current line you are on:
$line -match '(?<=(\S+\s+){6}).*'
$logMessage = $matches[0]
The way this expression works is that it looks for .* (which means any character, 0 or more times) that comes after 6 occurrences of non-whitespace characters each followed by whitespace characters. The .* in this expression should match your log message.
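As an aside, Perl's regex engine (unlike .NET's, which PowerShell uses) historically does not support variable-length lookbehind, so the equivalent idea there is usually written with \K, which also turns up later in this thread; a sketch with a made-up line:
#!/usr/bin/perl
use strict;
use warnings;
# Match the first six space-separated fields, then \K discards them
# from the final match, leaving only the message in $&.
my $line = "a b c d e f this is the log message";
if ($line =~ /^(?:\S+\s+){6}\K.*/) {
    print "$&\n";   # this is the log message
}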
I'm trying to find the locations of all instances of a string in a particular file; however, the code I'm currently running only returns the location of the first instance and then stops there. Here is what I'm currently running:
str=$(cat temp1.txt)
tmp="${str%%<C>*}"
if [ "$tmp" != "$str" ]; then
    echo ${#tmp}
fi
The file is only one line of text. I would display it, but the format that questions need to be in won't allow me to add the proper number of spaces between each character.
I am not sure about many details of your requirements; however, here is an awk one-liner:
awk -vRS='<C>' '{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' temp1.txt
Let’s test it with an actual line of input:
$ awk -vRS='<C>' \
'{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' \
<<<" <C> <C> "
4:14:20:
This means: the first <C> is at byte 4, the second <C> is at byte 14 (including the three bytes of the first <C>), and the whole line is 20 bytes long (including the final newline).
Is this what you want?
Explanation
We set (-v) the record separator (RS) to <C>. Then we keep a variable a with the count of all bytes processed so far. For each “line” (i.e., each <C>-separated substring) we add the length of the current line to a, printf it with a suitable format ("%u:"), and then increase a by the length of the separator which ended the current line. Since no output so far included newlines, at the END we print an empty string, which is an idiom to output a final newline.
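For comparison, the same offsets can be collected in Perl with pos() inside a /g match loop; a minimal sketch on a made-up string:
#!/usr/bin/perl
use strict;
use warnings;
# Report the 0-based offset of every '<C>' in a sample string.
my $str = 'ab<C>cd<C>';
while ($str =~ /<C>/g) {
    # pos() is the offset just past the current match; subtracting the
    # match length gives the offset where this occurrence starts.
    print pos($str) - length('<C>'), "\n";   # prints 2, then 7
}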
Look at what is basically the same question asked here. In particular, your question may be answered for multiple instances thanks to user JRFerguson's response using Perl.
EDIT: I found another solution that might just do the trick here. (The main question and response post is found here.)
I changed the shell from ksh to bash, changed the searched-for string to include multiple <C>'s to better demonstrate an answer to the question, and named it "tester":
#!/bin/bash
printf '%s\n' '<C>abc<C>xyz<C>123456<C>zzz<C>' | awk -v s="$1" '
{ d = ""
for(i = 1; x = index(substr($0, i), s); i = i + x + length(s) - 1) {
printf("%s%d", d, i + x - 1)
d = ":"
}
print ""
}'
This is how I ran it:
$ tester '<C>'
1:7:13:22:28
I haven't figured the code out (I like to know why things work), but it seems to do the job! It would be nice to get an explanation and an elegant way to feed your string into this script. Cheers.
I have an assignment asking me to print x iterations of a string for each character in that string. So if the string input is "Gum", then it should print out:
Gum
Gum
Gum
Right now my code is
my $string = <>;
my $length = length($string);
print ($string x $length, "\n");
And I'm getting gum printed five times as my output.
Those who have said you will get CR + LF at the end of the line on a Windows system are mistaken. Perl will convert the native line ending to a simple newline \n on any platform.
You must bear this in mind whether you are reading from the terminal or from a file.
The built-in chomp function will remove the line terminator character from the end of a string variable. If the string doesn't end with a line terminator then it will have no effect.
So when you type Gum followed by Enter, you are setting $string to "Gum\n", and length will show that it has four characters.
You are seeing it five times on your screen because the first line is what you typed in yourself. The following four are printed by the program.
After a chomp, $string is just "Gum" with a length of three characters, which is what you want.
To output this on separate lines you have to print a newline after each line, so you can write
my $string = <>;
chomp $string;
my $length = length $string;
print ("$string\n" x $length);
or perhaps
print $string, "\n" for 1 .. $length;
I hope that helps
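A two-line demonstration of the effect of chomp on the length, assuming the same "Gum" input:
#!/usr/bin/perl
use strict;
use warnings;
my $string = "Gum\n";          # what <> returns after typing Gum and Enter
print length($string), "\n";   # 4 -- the trailing newline counts
chomp $string;                 # remove the line terminator, if present
print length($string), "\n";   # 3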
As you are simply using the input string, it still contains the newline at the end, which is also counted as a character. On my system, it outputs Gum\n four times.
chomp($string) will remove the line ending, but the output will then also run together, resulting in GumGumGum\n
When you insert input and press Enter afterwards, you don't enter "Gum" but "Gum\r\n", which is a string of length 5. You should do trimming.
Your code is working fine. See this: http://ideone.com/AsPFh3
Possibility 1: It might be that you're adding 2 spaces while giving input from the command line; that's why the length comes out to be 5 and it prints 5 times. Something like this: http://ideone.com/fsvnrd
In that case my $string=<>; will give you my $string = "gum  "; so the length will be 5.
Possibility 2:
Another possibility is that if you use Windows, it will add a carriage return (\r) and a newline (from pressing Enter, \n) at the end of the string, making the length 5.
Edit: To print on separate lines, use the code below.
#!/usr/bin/perl
# your code goes here
chomp(my $string=<>);
my $length = length($string);
print ("$string\n" x $length);
Demo
Edit 2: To remove \r\n use the below:
$string=~ s/\r|\n//g; Read more here.
I want to remove n characters from each line using Perl.
For example, I have the following input:
catbathatxx (length 11; 11%3=2 characters) (Remove 2 characters from this line)
mansunsonx (length 10; 10%3=1 character) (Remove 1 character from this line)
#!/usr/bin/perl -w
open FH, "input.txt";
@array = <FH>;
foreach $tmp (@array)
{
    $b = length($tmp) % 3;
    my $c = substr($tmp, 0, length($tmp) - $b);
    print "$c\n";
}
I want to output the final string (after the characters have been removed).
However, this program is not giving the correct result. Can you please guide me on what the mistake is?
Thanks a lot. Please let me know if there are any doubts/clarifications.
I am assuming trailing whitespace is not significant.
#!/usr/bin/env perl
use strict; use warnings;
use constant MULTIPLE_OF => 3;
while (my $line = <DATA>) {
$line =~ s/\s+\z//;
next unless my $length = length $line;
my $chars_to_remove = $length % MULTIPLE_OF;
$line =~ s/.{$chars_to_remove}\z//;
print $line, "\n";
}
__DATA__
catbathatxx
mansunsonx
0123456789
012345678
The \K regex sequence makes this a lot clearer; it was introduced in Perl v5.10.0.
The code looks like this
use 5.10.0;
use warnings;
for (qw/ catbathatxx mansunsonx /) {
(my $s = $_) =~ s/^ (?:...)* \K .* //x;
say $s;
}
output
catbathat
mansunson
In general you would want to post the result you are getting. That being said...
Each line in the file has a \n (or \r\n on Windows) at the end of it that you're not accounting for. You need to chomp() the line.
Edit to add: My Perl is getting rusty from non-use, but if memory serves me correctly you can actually chomp() the entire array after reading the file: chomp(@array)
You should use chomp() on your array, like this:
@array = <FH>;
chomp(@array);
perl -plwe 'chomp; $c = length($_) % 3; chop while $c--' < /tmp/zock.txt
Look up the options in perlrun. Note that line endings are characters too: get them out of the way using chomp, and re-add them on output using the -l option. Use chop to efficiently remove characters from the end of a string.
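A non-one-liner sketch of the same chop idea, applied to one of the sample lines from the question:
#!/usr/bin/perl
use strict;
use warnings;
my $line = "catbathatxx";   # length 11, and 11 % 3 == 2
my $c = length($line) % 3;  # number of characters to remove
chop $line while $c--;      # chop removes the last character of a string
print "$line\n";            # catbathat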
Reading your code, you are trying to print just the first 3n characters of each line, for the largest possible value of n.
The following code does this using a simple regular expression.
For each line, it first removes the line ending, then greedily matches as many .{3} as it can (. matches any character; {3} asks for exactly 3 of them).
The memory requirement of this approach (compared with using an array the size of your file) is fixed. Not too important if your file is small compared with your free memory, but sometimes files are gigabytes, and sometimes memory is very small.
It's always worth using variable names that reflect the purpose of the variable, rather than things like $a or #array. In this case I used only one variable, which I called $line.
It's also good practice to close files as soon as you have finished with them.
#!/usr/bin/perl
use strict;
use warnings;   # This will apply warnings even if you use the perl command to run it

open FH, '<', 'input.txt';   # Use the three-part open - single quotes where no interpolation is required.
while (my $line = <FH>) {    # read one line at a time, so memory use stays fixed
    chomp($line);
    $line =~ s/((.{3})*).*/$1\n/;
    print $line;
}
close FH;