bash join multiple files with empty replacement (-e option) - linux

I have the following code to join multiple files together. It works fine, but I want to replace the empty values with 0, so I used -e "0". But it doesn't work.
Any ideas?
for k in `ls file?`
do
if [ -a final.results ]
then
join -a1 -a2 -e "0" final.results $k > tmp.res
mv tmp.res final.results
else
cp $k final.results
fi
done
example:
file1:
a 1
b 2
file2:
a 1
c 2
file3:
b 1
d 2
Results:
a 1 0 1 0
b 2 1 0
c 2
d 2
expected:
a 1 1 0
b 2 0 1
c 0 2 0
d 0 0 2

An aside, the GNU version of join supports -o auto. The -e and -o cause enough frustration to turn people to learning awk. (See also How to get all fields in outer join with Unix join?). As cmh said: it's [not] documented, but when using join the -e option only works in conjunction with the -o option.
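To make the -e/-o interaction concrete, here is a minimal sketch, assuming GNU coreutils join (-o auto is a GNU extension) and two scratch files created only for this demo. The variable names are illustrative.

```shell
# Scratch directory with two small sorted inputs (demo data only).
tmpdir=$(mktemp -d)
printf 'a 1\nb 2\n' > "$tmpdir/file1"
printf 'a 1\nc 2\n' > "$tmpdir/file2"

# Without -o, -e is silently ignored: unpaired lines stay short.
without_o=$(join -a1 -a2 -e 0 "$tmpdir/file1" "$tmpdir/file2")

# With -o auto, the field layout is inferred from the first line of each
# file and every missing field is filled with the -e string.
with_o=$(join -a1 -a2 -e 0 -o auto "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$with_o"
rm -r "$tmpdir"
```

With these inputs the second command should print a 1 1, b 2 0 and c 0 2, while the first leaves b 2 and c 2 unpadded.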
General solution:
cut -d ' ' -f1 file? | sort -u > tmp.index
for k in file?; do join -a1 -e '0' -o '2.2' tmp.index $k > tmp.file.$k; done
paste -d " " tmp.index tmp.file.* > final.results
rm tmp*
Bonus: how do I compare multiple branches in git?
for k in pmt atc rush; do git ls-tree -r $k | cut -c13- > ~/tmp-branch-$k; done
cut -f2 ~/tmp-branch-* | sort -u > ~/tmp-allfiles
for k in pmt atc rush; do join -a1 -e '0' -t$'\t' -11 -22 -o '2.2' ~/tmp-allfiles ~/tmp-branch-$k > ~/tmp-sha-$k; done
paste -d " " ~/tmp-allfiles ~/tmp-sha-* > final.results
egrep -v '(.{40}).\1.\1' final.results # these files are not the same everywhere

It's poorly documented, but when using join the -e option only works in conjunction with the -o option. The order string needs to be amended each time around the loop. The following code should generate your desired output.
i=3
orderl='0,1.2'
orderr=',2.2'
for k in $(ls file?)
do
if [ -a final.results ]
then
join -a1 -a2 -e "0" -o "$orderl$orderr" final.results $k > tmp.res
orderl="$orderl,1.$i"
i=$((i+1))
mv tmp.res final.results
else
cp $k final.results
fi
done
As you can see, it starts to become messy. If you need to extend this much further it might be worth deferring to a beefier tool such as awk or python.
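If GNU join is available, the order-string bookkeeping can probably be avoided with -o auto. The following is only a sketch under that assumption; merge_with_auto is a hypothetical helper name, and it relies on the first line of every input having the full number of fields, because -o auto infers the layout from that line.

```shell
merge_with_auto() {
    # Arguments: input files, each sorted on the first field.
    local out tmp f
    out=$(mktemp)
    tmp=$(mktemp)
    cp "$1" "$out"
    shift
    for f in "$@"; do
        # Full outer join; missing fields on either side become 0.
        join -a1 -a2 -e 0 -o auto "$out" "$f" > "$tmp"
        mv "$tmp" "$out"
    done
    cat "$out"
    rm -f "$out" "$tmp"
}
```

Usage would be something like merge_with_auto file? > final.results.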

Assuming there are no duplicate keys in a single file and the keys do not contain whitespace, you could use gawk and a sorted glob of files. This approach would be quite quick for large files and would use only a relatively small amount of memory compared to a glob of all of the data. Run like:
gawk -f script.awk $(ls -v file*)
Contents of script.awk:
BEGINFILE {
c++
}
z[$1]
$1 in a {
a[$1]=a[$1] FS ($2 ? $2 : "0")
next
}
{
for(i=1;i<=c;i++) {
r = (r ? r FS : "") \
(i == c ? ($2 ? $2 : "0") : "0")
}
a[$1]=r; r=""
b[++n]=$1
}
ENDFILE {
for (j in a) {
if (!(j in z)) {
a[j]=a[j] FS "0"
}
}
delete z
}
END {
for (k=1;k<=n;k++) {
print b[k], a[b[k]]
}
}
Test input / Results of grep . file*:
file1:a 1
file1:x
file1:b 2
file2:a 1
file2:c 2
file2:g
file3:b 1
file3:d 2
file5:m 6
file5:a 4
file6:x
file6:m 7
file7:x 9
file7:c 8
Results:
a 1 1 0 4 0 0
x 0 0 0 0 0 9
b 2 0 1 0 0 0
c 0 2 0 0 0 8
g 0 0 0 0 0 0
d 0 0 2 0 0 0
m 0 0 0 6 7 0

I gave up on join and wrote my script another way:
keywords=`cat file? | awk '{print $1}' | sort | uniq | xargs`
files=`ls file? | xargs`
for p in $keywords
do
x=`echo $p`
for k in $files
do
if grep -q ^$p $k
then
y=`cat $k | grep ^$p | awk '{print $2}'`
x=`echo $x $y`
else
echo $p $k
x=`echo $x 0`
fi
done
echo $x >> final.results
done

Related

if array element not present in a file need to print that as array element as zero

Consider an array:
args=("a" "b" "c")
Now I need to look up each array element in a file.
Consider the file:
file:
a 10
c 30
grep ${args[@]/#/-e } file
Output:
a 10
c 30
Expected Output:
a 10
b 0
c 30
I don't think there is an option in grep to print a string if no matches are found.
I would do this with the script below:
for i in "${args[@]}"; do
    grep "$i" file.txt
    if [ $? -ne 0 ]; then
        echo "$i 0"
    fi
done
Using awk and process substitution this is much simpler:
args=("a" "b" "c")
awk 'FNR==NR{a[$1]=$0; next} {
print (($1 in a) ? a[$1] : $1 " 0")}' file <(printf "%s\n" "${args[@]}")
a 10
b 0
c 30
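For comparison, the same lookup can be sketched in pure bash with an associative array. This assumes bash 4 or later and one "key value" pair per line; fill_missing is a hypothetical helper name, not something from the answers above.

```shell
fill_missing() {
    # $1: data file with "key value" lines; remaining arguments: keys to report.
    local file=$1
    shift
    local -A seen=()
    local key value
    # Load every key/value pair from the file into the map.
    while read -r key value; do
        seen[$key]=$value
    done < "$file"
    # Print each requested key, defaulting to 0 when it was not in the file.
    for key in "$@"; do
        printf '%s %s\n' "$key" "${seen[$key]:-0}"
    done
}
```

It would be called as fill_missing file "${args[@]}".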

Random selection of columns using linux command

I have a flat file (.txt) with 606,347 columns and I want to extract 50,000 RANDOM columns, with exception of the first column, which is sample identification. How can I do that using Linux commands?
My file looks like:
ID SNP1 SNP2 SNP3
1 0 0 2
2 1 0 2
3 2 0 1
4 1 1 2
5 2 1 0
It is TAB delimited.
Thank you so much.
Cheers,
Paula.
awk to the rescue!
$ cat shuffle.awk
function shuffle(a,n,k) {
for(i=1;i<=k;i++) {
j=int(rand()*(n-i))+i
if(j in a) a[i]=a[j]
else a[i]=j
a[j]=i;
}
}
BEGIN {srand()}
NR==1 {shuffle(ar,NF,ncols)}
{for(i=1;i<=ncols;i++) printf "%s", $(ar[i]) FS; print ""}
general usage
$ echo $(seq 5) | awk -f shuffle.awk -v ncols=5
3 4 1 5 2
in your special case you can print $1 and start the function loop from 2.
i.e. change
for(i=1;i<=k;i++) to a[1]=1; for(i=2;i<=k;i++)
Try this:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | xargs -d '\n' | tr ' ' ',' | xargs -I {} cut -d $'\t' -f {} file
Update:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | sed 's/.*/&p/' | sed -nf - <(tr '\t' '\n' <file) | tr '\n' '\t'
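With 50,000 column numbers, a single comma-separated list may exceed the per-argument size limit on Linux, so it can help to hand the numbers to awk on stdin instead of on the command line. A rough sketch, assuming GNU shuf and a TAB-delimited file with a header row; pick_random_columns is a hypothetical name and the function expects to be asked for at least one column.

```shell
pick_random_columns() {
    # $1: TAB-delimited file, $2: how many random columns to keep besides column 1.
    local file=$1 n=$2 total
    # Count the columns from the header line.
    total=$(head -n 1 "$file" | awk -F'\t' '{print NF}')
    # Draw n distinct column numbers from 2..total, keep them in file order,
    # and feed them to awk as its first input ("-" means stdin).
    shuf -i "2-$total" -n "$n" | sort -n |
        awk -F'\t' -v OFS='\t' '
            NR == FNR { keep[++k] = $1; next }   # first input: chosen column numbers
            {
                line = $1                        # always keep the ID column
                for (i = 1; i <= k; i++) line = line OFS $(keep[i])
                print line
            }' - "$file"
}
```

For the file in the question this would be run as pick_random_columns file 50000 > subset.txt.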
@karakfa's answer is great, but the NF value can't be obtained in the BEGIN{} part of the awk script. Refer to: How to get number of fields in AWK prior to processing
I edited the code as:
head -4 10X.txt | awk '
function shuffle(a,n,k){
for(i=1;i<=k;i++) {
j=int(rand()*(n-i))+i
if(j in a) a[i]=a[j]
else a[i]=j
a[j]=i;
}
}
BEGIN{
FS=" ";OFS="\t"; ncols=10;
}NR==1{shuffle(tmp_array,NF,ncols);
for(i=1;i<=ncols;i++){
printf "%s", $(tmp_array[i]) OFS;
}
print "";
}NR>1{
printf "%s", $1 OFS;
for(i=1;i<=ncols;i++){
printf "%s", $(tmp_array[i]+1) OFS;
}
print "";
}'
Because I am processing single-cell gene expression profiles, the first column holds gene names from the second row onward.
My output is:
D4-2_3095 D6-1_3010 D16-2i_1172 D4-1_337 iPSCs-2i_227 D4-2_170 D12-serum_1742 D4-1_1747 D10-2-2i_1373 D4-1_320
Sox17 0 0 0 0 0 0 0 0 0 0
Mrpl15 0.987862442831866 1.29176904082314 2.12650693025845 0 1.33257747910871 0 1.58815046312948 1.18541326956528 1.12103842107813 0.656789854017254
Lypla1 0 1.29176904082314 0 0 0.443505832809852 0.780385141793088 0.57601629238987 0 0 0.656789854017254

array length in ksh always return 1 and why array is not lines

I need to echo information of a process for a UID in ksh:
#!/bin/ksh
read userid
arr=$(ps -elf | nawk -v pattern=${userid} '{if ($3==pattern) print}')
arrlen=${#arr[@]}
echo $arrlen
for f in "${arr[@]}"; do
echo $f
done
arr should be an array of the processes for this UID, but arrlen is always equal to 1.
Why?
My second question:
When I echo all the elements in arr, the output is
0
S
s157759
22594
1
0
50
20
?
2:06
/usr/lib/firefox/firefox-bin
instead of
0 S s157759 22594 1 0 50 20 ? 62628 ? 11:14:06 ? 2:06 /usr/lib/firefox/firefox-bin
in one line
I want to create an array with lines, not words.
You aren't creating an array; you're creating a string with newline-separated values. Replace
arr=$(ps -elf | nawk -v pattern=${userid} '{if ($3==pattern) print}')
with
arr=( $(ps -elf | nawk -v pattern=${userid} '{if ($3==pattern) print}') )
However, this still leaves you with the problem that the array will treat each field of each line from ps as a separate element. A better solution is to read directly from ps using the read built-in:
ps -elf | while read -A fields; do   # ksh; bash spells this read -a
    # skip processes that belong to other users
    if [[ ${fields[2]} != "$userid" ]]; then
        continue
    fi
    echo "${fields[@]}"
done
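If an actual array with one element per line is wanted, one option is to split the command output on newlines only. This is a sketch that should behave the same in ksh93 and bash; the sample text stands in for real ps -elf output, and the variable names are made up for the example.

```shell
userid=alice
# Stand-in for `ps -elf` output so the example is reproducible.
sample_ps_output='0 S alice 101 /bin/sh
0 S bob 102 /bin/ksh
0 S alice 103 /usr/bin/vi'

old_ifs=$IFS
IFS='
'            # split on newlines only, so each line stays intact
set -f       # disable globbing while the unquoted expansion is split
arr=( $(printf '%s\n' "$sample_ps_output" | awk -v pattern="$userid" '$3 == pattern') )
set +f
IFS=$old_ifs

echo "${#arr[@]}"
for line in "${arr[@]}"; do
    printf '%s\n' "$line"
done
```

Here the array ends up with two elements, one per matching line. In bash alone, mapfile -t would be another way to get the same result.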

sorting a "key/value pair" array in bash

How do I sort a "python dictionary-style" array e.g. ( "A: 2" "B: 3" "C: 1" ) in bash by the value? I think this code snippet will make my question a bit clearer.
State="Total 4 0 1 1 2 0 0"
W=$(echo $State | awk '{print $3}')
C=$(echo $State | awk '{print $4}')
U=$(echo $State | awk '{print $5}')
M=$(echo $State | awk '{print $6}')
WCUM=( "Owner: $W;" "Claimed: $C;" "Unclaimed: $U;" "Matched: $M" )
echo ${WCUM[@]}
This will simply print the array: Owner: 0; Claimed: 1; Unclaimed: 1; Matched: 2
How do I sort the array (or the output), eliminating any pair with a "0" value, so that the result looks like this:
Matched: 2; Claimed: 1; Unclaimed: 1
Thanks in advance for any help or suggestions. Cheers!!
Quick and dirty idea would be (this just sorts the output, not the array):
echo ${WCUM[@]} | sed -e 's/; /;\n/g' | awk -F: '!/ 0;?/ {print $0}' | sort -t: -k 2 -r | xargs
echo -e ${WCUM[@]} | tr ';' '\n' | sort -r -k2 | egrep -v ": 0$"
Sorting and filtering are independent steps, so if you only want to filter out 0 values, it is much easier.
Append an
| tr '\n' ';'
to get it to a single line again in the end.
nonull=$(for n in ${!WCUM[@]}; do echo ${WCUM[n]} | egrep -v ": 0;"; done | tr -d "\n")
I don't see a good reason to end $W, $C and $U with a semicolon but not $M, so instead of adapting my code to this distinction I would eliminate the special case. If that is not possible, I would temporarily append a semicolon to $M and remove it at the end.
Another attempt, using some of the bash features, but still needs sort, that is crucial:
#! /bin/bash
State="Total 4 1 0 4 2 0 0"
string=$State
for i in 1 2 ; do # remove unnecessary fields
string=${string#* }
string=${string% *}
done
# Insert labels
string=Owner:${string/ /;Claimed:}
string=${string/ /;Unclaimed:}
string=${string/ /;Matched:}
# Remove zeros
string=(${string[@]//;/; })
string=(${string[@]/*:0;/})
string=${string[@]}
# Format
string=${string//;/$'\n'}
string=${string//:/: }
# Sort
string=$(sort -t: -nk2 <<< "$string")
string=${string//$'\n'/;}
echo "$string"
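The filter-then-sort idea from the answers above can also be written as one small pipeline. This is only a sketch: sort_pairs is a hypothetical name, each element is assumed to look like "Label: N" without a trailing semicolon, and -s (stable sort) is assumed to be supported, as it is in GNU and BSD sort. It sorts the output rather than the array itself.

```shell
sort_pairs() {
    # Arguments: strings of the form "Label: N".
    printf '%s\n' "$@" |
        awk -F': ' '$2 != 0' |        # drop pairs whose value is 0
        sort -s -t: -k2,2nr |         # highest value first, ties keep input order
        paste -sd';' - |              # back onto one line
        sed 's/;/; /g'                # restore the space after each semicolon
}

sort_pairs "Owner: 0" "Claimed: 1" "Unclaimed: 1" "Matched: 2"
```

For the values in the question this prints Matched: 2; Claimed: 1; Unclaimed: 1.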

BASH: how to perform arithmetic on numbers in a pipe

I am getting a stream of numbers in a pipe, and would like to perform some operations before passing them on to the next section, but I'm a little lost about how I would go about it without breaking the pipe.
for example
> echo "1 2 3 4 5" | some command | cat
1 4 9 16 25
>
Would you have any ideas on how to make something like this work? The actual operation I want to perform is simply adding one to every number.
echo 1 2 3 4 5|{
read line;
for i in $line;
do
echo -n "$((i * i)) ";
done;
echo
}
The {} creates a grouping. You could instead create a script for that.
I'd write:
echo "1 2 3 4 5" | {
for N in $(cat); do
echo $((N ** 2))
done | xargs
}
We can think of it as a "map" (functional programming). There are a lot of ways of writing a "map" function in bash (using stdin, function args, ...), for example:
map_stdin() {
local FUNCTION=$1
while read LINE; do
$FUNCTION $LINE
done
}
square() { echo "$(($1 * $1))"; }
$ echo "1 2 3 4 5" | xargs -n1 | map_stdin square | xargs
1 4 9 16 25
Or..
echo "1 2 3 4 5" | xargs -n 1 | while read number
do
echo $((number * number))
done
echo 1 2 3 4 5 | xargs -n 1 expr 1 +
echo 1 2 3 4 5 | xargs -n 1 bash -c 'echo $(($1*$1))' args
Using awk is another solution, which also works with floats
echo "1 2 3 4 5" | xargs -n1 | awk '{print $1^2}' | xargs
or use a loop
for x in 1 2 3 4 5; do echo $((x**2)); done | xargs
for x in $(echo "1 2 3 4 5"); do echo $x^2 | bc; done | xargs # alternative solution
for x in $(seq 5); do python -c "print($x**2)"; done | xargs # alternative solution but slower than the above
# or make it neat by defining a function to do basic math in bash, e.g.:
calc() { awk "BEGIN{print $*}"; }
for x in $(seq 5); do calc $x^2; done | xargs
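Since the question actually asks for adding one to every number, here is a sketch of an awk filter that does that per field and keeps each input line intact, so it can sit in the middle of a pipe without xargs. add_one is a hypothetical name, and whitespace-separated numeric fields are assumed.

```shell
add_one() {
    # Increment every whitespace-separated field on every input line.
    awk '{ for (i = 1; i <= NF; i++) $i = $i + 1; print }'
}

echo "1 2 3 4 5" | add_one | cat
```

This prints 2 3 4 5 6 on a single line.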
Or you can pipe the expression to bc:
echo "1 2 3 4 5" | (
read line;
for i in $line;
do
echo $i^2 | bc;
done;
echo
)
If you prefer Python:
#!/bin/python
num = input()
while num:
    print(int(num) + 1)  # Whatever manipulation you want
    try:
        num = input()
    except EOFError:
        break
xargs, xargs, xargs
echo 1 2 3 4 5 | xargs -n1 echo | xargs -I NUMBER expr NUMBER \* NUMBER | xargs
Or, go parallel:
squareit () { expr $1 \* $1; }
export -f squareit
echo 1 2 3 4 5 | xargs -n1 | parallel --gnu squareit | xargs
Which would be way simpler if you passed your pipe as a standard set of args:
parallel --gnu "expr {} \* {}" ::: $(echo 1 2 3 4 5) | xargs
Or even:
parallel --gnu "expr {} \* {}" ::: 1 2 3 4 5 | xargs
Really worth taking a look at the examples in the doc: https://www.gnu.org/software/parallel/man.html
You might like something like this:
echo "1 2 3 4 5" | perl -ne 'print $_ ** 2, " " for split / /, $_'
or even like this:
echo "1 2 3 4 5" | perl -ne 'print join " ", map {$_ ** 2} split / /, $_'
