I am trying to manipulate a file (about 50 columns) from which I basically want to select some columns (some 7 or 10). However, some of those columns have empty entries. I am guessing something like this is a minimal working example:
A B C D E    #note: these are 5 tab-separated columns
this that something something more the end
this.line is very incomplete #column E empty
but this is v.very complete
whereas this is not #column B empty
As you can see, the 3rd line is empty in the last position.
I want to find a way of efficiently replacing all empty fields of my interest by a string, say "NA".
Of course, I could do it in the following way, but it is not very elegant to do this for all the 10 columns that I have in my real data:
#!/usr/local/bin/perl
use strict;
use warnings;
open my $data, "<", "$path\\file.txt"; # with correct path
my @selecteddata;
my ($blankB, $blankE);
while (<$data>) {
    chomp;
    my @line = split /\t/;
    if (not defined $line[4]) {
        $blankE = "NA";
    } else {
        $blankE = $line[4];
    }
    if (not defined $line[1]) {
        $blankB = "NA";
    } else {
        $blankB = $line[1];
    }
    push @selecteddata, "$line[0]\t$blankB\t$line[2]\t$line[3]\t$blankE\n";
}
close $data;
Alternatively, I can pre-process the file and replace all undefined entries by "NA", but I would like to avoid this.
So the main question is this: is there a more elegant way to replace the blank entries, only in the columns I am interested in, by some word?
Thank you!
The trick to not ignoring trailing tabs is to specify a negative LIMIT as the third argument to split (kudos ikegami).
map makes light work of setting the "NA" values:
my @selected_data;
while (<$data>) {
    chomp;
    my @fields = split /\t/, $_, -1;
    @fields = map { length($_) ? $_ : 'NA' } @fields; # transform @fields
    my $updated = join("\t", @fields) . "\n";
    push @selected_data, $updated;
}
In one-liner mode:
$ perl -lne 'print join "\t", map { length ? $_ : "NA" } split /\t/, $_, -1' input > output
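For comparison, here is the same fill-the-blanks transformation sketched in Python (the `fill_blanks` helper name is illustrative, not from the answer). Python's `str.split` with an explicit separator already keeps trailing empty fields, so no LIMIT trick is needed:

```python
def fill_blanks(line, na="NA"):
    # split("\t") keeps empty trailing fields, much like Perl's
    # split with a negative LIMIT
    fields = line.rstrip("\n").split("\t")
    return "\t".join(f if f else na for f in fields)
```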
I would say that using split and join is undoubtedly the clearest approach, since you'll likely need to be doing that for other parsing as well. However, this can also be solved with lookaround assertions.
Basically, the boundary between fields is either a tab or the start or end of the string, so if those conditions hold on both sides of a position, we have an empty field:
use strict;
use warnings;
while (<DATA>) {
    s/(?:^|(?<=\t))(?=\t|$)/NA/g;
    print;
}
__DATA__
a b c d e
a b c d e
a b d e
b c d e
a b
a b d
a e
Outputs:
a b c d e
a b c d e
a b NA d e
NA b c d e
a b NA NA NA
a b NA d NA
a NA NA NA e
Turning this into a one-liner is trivial, but I will point out that this could also be done using \K, saving 2 characters: s/(?:\t|^)\K(?=\t|$)/NA/g;
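The same lookaround substitution also works in Python's re module, for anyone porting this (a sketch; the helper name is illustrative):

```python
import re

def fill_blanks(line):
    # same lookaround idea: an empty field is a position that is both
    # preceded by (tab or start-of-string) and followed by (tab or end)
    return re.sub(r"(?:^|(?<=\t))(?=\t|$)", "NA", line)
```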
I'm not sure if just using a sequence of substitutions looking for adjacent tabs (or tabs at a line boundary) would catch everything, but it's quick and easy if you have a lazy brain ;-)
perl -pe '1 while s/\t\t/\tNA\t/; s/^\t/NA\t/; s/\t$/\tNA/' col_data-undef.txt
I'm not sure if in a neater scriptish format it looks less or more yucky :-)
#!/usr/bin/env perl
# read_cols.pl - munge tab separated data with empty "cells"
use strict;
use warnings;
while (<>) {
    1 while s/\t\t/\tNA\t/; # repeat for runs of consecutive empty fields
    s/^\t/NA\t/;
    s/\t$/\tNA/;
    print;
}
Here's how to run it, with vim buffers of the input and output showing tabs as ^I in red ;-)
./read_cols.pl col_data-undef.txt > col_data-NA.txt
Is everything in the correct order? Would it work on 50 columns ?!?
Sometimes lazy is good, but sometimes you need @ikegami ... :-)
I have two files, A and B, with the columns separated by \.
Column 2 of file A is exactly the same as column 1 of file B.
I want to merge these two files, keeping file B the same and adding a new column based on the matching fields between the two files and a partial match between column 1 of file A and column 2 of file B.
By partial match I mean something like this:
File A (column 1)   File B (column 2)   A=B?
A                   A?                  True
A                   Asd                 True
B                   B                   True
C                   c                   True
C                   CA                  True
D                   A                   False
If there are values with the same column 1 and 2 in file A, they must be added to file B separated by ;
File A
A\2022.10.10\note a
A\2022.10.10\note b
B\2022.10.14\note c
A\2022.10.14\note d
C\2022.10.15\note e
File B
2022.10.10\A?
2022.10.14\B?
2022.10.14\a
2022.10.15\C
2022.10.15\D
Desired output
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\
How can I do this with awk?
An awk script like this might work, depending on the details of your requirements (e.g. whether the keys are case-sensitive or not):
function make_key(k)
{
    # either return k or an uppercase version for case-insensitive keys
    # return k;
    return toupper(k);
}
BEGIN {
    FS="\\";
}
NR==FNR {
    key=make_key($2 "\\" $1);
    if (key in notes) {
        notes[key]=notes[key] ";" $3
    }
    else {
        notes[key]=$3
    }
}
NR!=FNR {
    for (k in notes) {
        pos=index(make_key($0), k);
        if (pos==1) {
            printf "%s%s%s%s\n", $0, FS, notes[k], FS;
            next;
        }
    }
    print $0 FS;
}
You would use it like this:
awk -f script.awk file_A file_B
In the function make_key you can configure case sensitivity by either returning k or an uppercase version.
The NR==FNR block runs while reading the first file (file_A); here the notes are stored under a key made from the second and first columns. Notes are appended if the key already exists.
In the NR!=FNR block the second file (file_B) is read. Here we print the line of file_B together with the matching notes, by checking for every key whether the line of file_B starts with that key. If no key matches, only the line of file_B is printed.
The BEGIN block just sets up the field separator.
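The same two-pass idea can be sketched in Python (the `merge` helper and in-memory line lists are illustrative, not part of the awk answer):

```python
def merge(file_a_lines, file_b_lines, sep="\\"):
    # pass 1: key the notes on "date<sep>name", uppercased for
    # case-insensitive matching (mirrors make_key in the awk script)
    notes = {}
    for line in file_a_lines:
        name, date, note = line.rstrip("\n").split(sep)
        key = (date + sep + name).upper()
        notes[key] = notes[key] + ";" + note if key in notes else note
    # pass 2: a key matches when it is a prefix of the file_B line
    out = []
    for line in file_b_lines:
        line = line.rstrip("\n")
        for key, note in notes.items():
            if line.upper().startswith(key):
                out.append(line + sep + note + sep)
                break
        else:
            out.append(line + sep)
    return out
```

Note this assumes the notes themselves contain no separator character.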
To process the first file, we can use the condition NR==FNR. While processing the first file, I used two hash tables: pattern_ht to store column 1 values in lower case, and note_ht to store column 3 values. Since the date string by itself is not unique enough to serve as a key, we use the concatenation of columns 2 and 3 as the key for both hash tables.
When processing the second file, first check whether the date string matches column 1 exactly after splitting the key string. If it matches, perform the partial match of column 2 against the pattern_ht value after converting to lower case. On the first match, record the note_ht value in third_col; on subsequent matches, append it with ";". Finally, print accordingly:
awk -F'\\' 'NR==FNR {pattern_ht[$2";"$3] = tolower($1); note_ht[$2";"$3] = $3; next}
{
    third_col = ""
    for (key in pattern_ht) {
        split(key, date_str, ";")
        if (date_str[1] == $1) {
            if (tolower($2) ~ pattern_ht[key]) {
                if (length(third_col) == 0)
                    third_col = note_ht[key]
                else
                    third_col = third_col ";" note_ht[key]
            }
        }
    }
    if (length(third_col) != 0)
        print $1"\\"$2"\\"third_col"\\"
    else
        print $1"\\"$2"\\"
}' fileA fileB
Here is the output:
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\
Let us take 2 strings a and b, concatenate them with +, and print the result with the print() function.
a = 'Hello'
b = 'World'
print(a + b, sep = ' ')
# prints HelloWorld
print(a + ' ' + b)
# prints Hello World
I have 2 questions:
a) Can I use sep to add a space between the concatenated strings a and b?
b) If not, then, Is there any other way to add a space between the concatenated strings a and b?
If you really want to use the plus sign to concatenate strings with a delimiter, you can use plus to build a list first and then apply something that joins the list items with a delimiter.
# So, let's make the list first
str_list = [a] + [b] # you could also do [a, b], but we wanted to use the plus sign.
# now we can, for example, pass this to print and unpack it with *. print separates by space by default.
print(*str_list) # same as print(str_list[0], str_list[1]) or print(a, b), but those would not use the plus sign.
# Or you could use join to concatenate the strings.
" ".join(str_list)
Okay, I hope you learned some new things today. But please don't do it like this; this is not how it's meant to be done.
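To answer (a) directly: sep only separates the arguments you pass to print, so it applies when a and b are passed as separate arguments rather than pre-concatenated:

```python
a = "Hello"
b = "World"
# sep separates print's arguments; a + b is already a single string,
# so sep has nothing to separate there
print(a, b)           # sep defaults to " "
print(a, b, sep=" ")  # same thing, explicit
# to build the spaced string itself:
spaced = a + " " + b
also_spaced = " ".join([a, b])
```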
Hello, my question is how do I keep the format of a string that has had .split() run on it. What I want:
$test="a.b.c.d.e"
$test2="abc"
#split test
#append split to test2
#desired output
abc
a
b
c
d
e
I know that if I perform split on a string such as
$test="a.b.c.d.e"
$splittest=$test.split(".")
$splittest
#output
a
b
c
d
e
However, when I try to append the above split result to a string
$test2="abc"
$test2+$splittest
#output
abca b c d e
while
$splittest+$test2
#output
a
b
c
d
e
abc
Is there a way to append the split string to another string while keeping this split format, or will I have to foreach-loop through the split string and append it to the $test2 string one element at a time?
foreach ($line in $splittest)
{
$test2 = "$($test2)`n$($line)"
}
I would prefer not to use the foreach method as it seems to slow down a script i am working on which requires text to be split and appended over 500k times on the small end.
What you're seeing is the effect of how PowerShell resolves its operator overloads.
When PowerShell sees +, it needs to decide whether + means sum (1 + 1 = 2), concatenate (1 + 1 = "11"), or add (1 + 1 = [1,1]) in the given context.
It does so by looking at the type of the left hand side argument, and attempts to convert the right hand side argument to a type that the chosen operator overload expects.
When you use + in the order you need, the string value is to the left, and so it results in a string concatenation operation.
There are multiple ways of prepending the string to the existing array:
# Convert scalar to array before +
$newarray = @($abc) + $splittest
# Flatten items inside an array subexpression
$newarray = @($abc; $splittest)
Now all you have to do is join the strings by a newline:
$newarray -join [System.Environment]::NewLine
Or you can change the output field separator ($OFS) to a newline and have it joined implicitly:
$OFS = [System.Environment]::NewLine
"$newarray"
Finally, you could pipe the array to Out-String, but that will add a trailing newline to the entire string:
@($abc; $splittest) | Out-String
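For comparison, the same prepend-then-join idea in Python (variable names mirror the question's and are illustrative):

```python
abc = "abc"
splittest = "a.b.c.d.e".split(".")
# prepend the scalar to the list, then join once with newlines,
# instead of appending element by element in a loop
result = "\n".join([abc] + splittest)
```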
I have a data frame with character strings in column1 and ID in column2. The string contains A,T,G or C.
I would like to print the lines that have an A at position 1.
Then I would like to print the lines that have A at position 2 and so on and save them in separate files.
So far I have used biostrings in R for similar analysis, but it won't work for this problem exactly. I would like to use perl.
Sequence ID
TATACAAGGGCAAGCTCTCTGT mmu-miR-381-3p
TCGGATCCGTCTGAGCT mmu-miR-127-3p
ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p
......
600 more lines
Biostrings will work perfectly, and will be pretty fast. Let's call your DNA stringset mydata
HasA <- sapply(mydata,function(x) as.character(x[2]) == "A")
Now you have a vector of TRUE or FALSE indicating which sequence has an A at position 2. You can make that into a nice data frame like this
HasA.df <- data.frame("SeqName" = names(mydata), "A_at_2" = HasA)
Not sure about the expected result,
mydata <- read.table(text="Sequence ID
TATACAAGGGCAAGCTCTCTGT mmu-miR-381-3p
TCGGATCCGTCTGAGCT mmu-miR-127-3p
ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p",sep="",header=T,stringsAsFactors=F)
mCh <- max(nchar(mydata[,1])) #gives the maximum number of characters in the first column
sapply(seq(mCh), function(i) substr(mydata[,1],i,i)=="A") #gives the index
You can use which to get the index of the row that satisfies the condition for each position
res <- stack(setNames(sapply(seq(mCh),
function(i) which(substr(mydata[,1],i,i)=="A")),1:mCh))[,2:1]
tail(res, 5) # for the 13th position, the 1st and 3rd rows of the sequence are TRUE
#    ind values
# 11  13      1
# 12  13      3
# 13  14      2
# 14  15      3
# 15  20      3
Use the index values to extract the rows. For the 1st position:
mydata[res$values[res$ind==1],]
# Sequence ID
# 3 ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p
Using a perl one-liner:
perl -Mautodie -lane '
    BEGIN { ($f) = @ARGV }
    next if $. == 1;
    my @c = split //, $F[0];
    for my $i (grep { $c[$_] eq "A" } (0 .. $#c)) {
        open my $fh, ">>", "$f.$i";
        print $fh $_;
    }
' file
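A Python sketch of the same grouping, collecting matching lines per position in memory instead of appending each one to "file.<i>" (the helper name is illustrative):

```python
def positions_with_a(lines):
    # map position index -> lines whose sequence has an 'A' there
    out = {}
    for line in lines:
        seq = line.split()[0]  # first column is the sequence
        for i, ch in enumerate(seq):
            if ch == "A":
                out.setdefault(i, []).append(line)
    return out
```

Writing each list to its own file would then be a simple loop over the dict.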
I found the following AWK program on the internet and tweaked it slightly to look at column $2:
{ a[$2,NR]=$0; c[$2]++ }
END {
    for (k in a) {
        split(k, b, SUBSEP)
        t=c[b[1]]                  # added this bit to capture count
        if (b[1] in c && t>1) {    # added && t>1: only print if count more than 1
            print RS "TIMES ID" RS c[b[1]] " " b[1] RS
            delete c[b[1]]
        }
        for (i=1; i<=NR; i++) if (a[b[1],i]) {
            if (t>1) { print a[b[1],i] }   # added if(t>1): only print lines if count more than 1
            delete a[b[1],i]
        }
    }
}
Given the following file:
abc,2,3
def,3,4
ghi,2,3
jkl,5,9
mno,3,2
The output is as follows when the command is run:
Command: awk -F, -f find_duplicates.awk duplicates
Output:
TIMES ID
2 2
abc,2,3
ghi,2,3
TIMES ID
2 3
def,3,4
mno,3,2
This is fine.
I would like to understand what is happening in the AWK program.
I understand that the first line is loading each line into a multidimensional array?
So the first line of the file would be a['2','1']='abc,2,3' and so on.
However I'm a bit confused as to what c[$2]++ does, and also what the significance of split(k,b,SUBSEP) is.
Would appreciate it if someone could explain line by line what is going on in this AWK program.
Thanks.
The increment operator simply adds one to the value of the referenced variable. So c[$2]++ takes the value for c[$2] and adds one to it. If $2 is a and c["a"] was 3 before, its value will be 4 after this. So c keeps track of how many of each $2 value you have seen.
for (k in a) loops over the keys of a. If the value of $2 on the first line was "a", the first value of k will be "a","1" (with 1 being the line number). The next time, it will be the combination of the value of $2 from the second line and the line number 2, etc.
The split(k,b,SUBSEP) will create a new array b from the compound value in k, i.e. basically reconstruct the parts of the compound key that went into a. The value in b[1] will now be the value which was in $2 when the corresponding value in a was created, and the value in b[2] will be the corresponding line number.
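The same bookkeeping can be restated with plain Python dicts, which may make c[$2]++ and the SUBSEP compound keys easier to see (a sketch with made-up sample rows):

```python
SUBSEP = "\x1c"  # awk's default SUBSEP character
a, c = {}, {}
rows = ["abc,2,3", "def,3,4", "ghi,2,3"]
for nr, line in enumerate(rows, start=1):
    col2 = line.split(",")[1]
    a[col2 + SUBSEP + str(nr)] = line  # a[$2,NR] = $0
    c[col2] = c.get(col2, 0) + 1       # c[$2]++
# splitting a key on SUBSEP recovers ($2, NR), like split(k, b, SUBSEP)
```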
The final loop is somewhat inefficient; it loops over all possible line numbers, then skips immediately to the next one if an entry for that ID and line number did not exist. Because this runs inside the outer loop for (k in a) it will be repeated a large number of times if you have a large number of inputs (it will loop over all input line numbers for each input line). It would be more efficient, at the expense of some additional memory, to just build a final output incrementally, then print it all after you have looped over all of a, by which time you have processed all input lines anyway. Perhaps something like this:
END {
    for (k in a) {
        split(k, b, SUBSEP)
        if (c[b[1]] > 1) {
            if (! o[b[1]]) o[b[1]] = c[b[1]] " " b[1] RS
            o[b[1]] = o[b[1]] RS a[k]
        }
        delete a[k]
    }
    for (q in o) print o[q] RS
}
Update: Removed the premature deletion of c[b[1]].