How to split values in a column into separate column

How to split values in a column into separate column - linux

My tab-delimited file looks like this:
ID Pop snp1 snp2 snp3 snp4 snp5
AD62 1 0/1 1/1 . 1/1 0/.
AD75 1 0/0 1/1 . ./0 1/0
AD89 1 . 1/0 1/1 0/0 1/.
I want to separate the columns (starting from column 3) so that the values separated by the "/" character are delimited into a column of its own. However there are also columns whereby the values are missing (they only contain the "." character) and I want this to be treated as though it was "./." so that the two "." characters are then divided into their own columns. For example:
ID Pop snp1 snp2 snp3 snp4 snp5
AD62 1 0 1 1 1 . . 1 1 0 .
AD75 1 0 0 1 1 . . . 0 1 0
AD89 1 . . 1 0 1 1 0 0 1 .
Thanks

You can use sed:
sed -e 's/ \. /\.\t\. /g' -e 's/\//\t/g' <your_file>

Tried this and works well, you can tweak this as per your requirement.
Assuming data is in data.txt file.
cat data.txt | sed 1d | tr '/' '\t'| sed 's/\./.\t./g'
This gives the output, but you need to get a work around for the spaces and tab that are getting messed up.

This might work for you (GNU sed):
sed ''1s/\t/&&/3g;s/\t\.\t/\t.\t.\t/g;y/\//\t/' file

A fairly robust way, using awk and a few if statements:
awk '{ for (i = 1; i <= NF; i++) if (i >= 3 && i < NF && NR == 1) printf "%s\t\t", $i; else if (i == NF && NR == 1) print $i; else if ($i == "." && NR >= 2) printf ".\t.\t", $i; else { sub ("/", "\t", $i); if (i == NF) printf "%s\n", $i; else { printf "%s\t", $i; } } }' file.txt
Broken out on multiple lines:
awk '{ for (i = 1; i <= NF; i++)
if (i >= 3 && i < NF && NR == 1) printf "%s\t\t", $i;
else if (i == NF && NR == 1) print $i;
else if ($i == "." && NR >= 2) printf ".\t.\t", $i;
else {
sub ("/", "\t", $i);
if (i == NF) printf "%s\n", $i;
else {
printf "%s\t", $i;
}
}
}' file.txt
HTH

Related

Group the consecutive numbers in shell

$ foo="1,2,3,6,7,8,11,13,14,15,16,17"
In shell, how to group the numbers in $foo as 1-3,6-8,11,13-17

Given the following function:
build_range() {
local range_start= range_end=
local -a result
end_range() {
: range_start="$range_start" range_end="$range_end"
[[ $range_start ]] || return
if (( range_end == range_start )); then
# single number; just add it directly
result+=( "$range_start" )
elif (( range_end == (range_start + 1) )); then
# emit 6,7 instead of 6-7
result+=( "$range_start" "$range_end" )
else
# larger span than 2; emit as start-end
result+=( "$range_start-$range_end" )
fi
range_start= range_end=
}
# use the first number to initialize both values
range_start= range_end=
result=( )
for number; do
: number="$number"
if ! [[ $range_start ]]; then
range_start=$number
range_end=$number
continue
elif (( number == (range_end + 1) )); then
(( range_end += 1 ))
continue
else
end_range
range_start=$number
range_end=$number
fi
done
end_range
(IFS=,; printf '%s\n' "${result[*]}")
}
...called as follows:
# convert your string into an array
IFS=, read -r -a numbers <<<"$foo"
build_range "${numbers[#]}"
...we get the output:
1-3,6-8,11,13-17

awk solution for an extended sample:
foo="1,2,3,6,7,8,11,13,14,15,16,17,19,20,33,34,35"
awk -F',' '{
r = nxt = 0;
for (i=1; i<=NF; i++)
if ($i+1 == $(i+1)){ if (!r) r = $i"-"; nxt = $(i+1) }
else { printf "%s%s", (r)? r nxt : $i, (i == NF)? ORS : FS; r = 0 }
}' <<<"$foo"
The output:
1-3,6-8,11,13-17,19-20,33-35

As an alternative, you can use this awk command:
cat series.awk
function prnt(delim) {
printf "%s%s", s, (p > s ? "-" p : "") delim
}
BEGIN {
RS=","
}
NR==1 {
s = $1
}
p < $1-1 {
prnt(RS)
s = $1
}
{
p = $1
}
END {
prnt(ORS)
}
Now run it as:
$> foo="1,2,3,6,7,8,11,13,14,15,16,17"
$> awk -f series.awk <<< "$foo"
1-3,6-8,11,13-17
$> foo="1,3,6,7,8,11,13,14,15,16,17"
$> awk -f series.awk <<< "$foo"
1,3,6-8,11,13-17
$> foo="1,3,6,7,8,11,13,14,15,16,17,20"
$> awk -f series.awk <<< "$foo"
1,3,6-8,11,13-17,20
Here is an one-liner for doing the same:
awk 'function prnt(delim){printf "%s%s", s, (p > s ? "-" p : "") delim}
BEGIN{RS=","} NR==1{s = $1} p < $1-1{prnt(RS); s = $1} {p = $1}END {prnt(ORS)}' <<< "$foo"
In this awk command we keep 2 variables:
p for storing previous line's number
s for storing start of the range that need to be printed
How it works:
When NR==1 we set s to first line's number
When p is less than (current_number -1) or $1-1 that indicates we have a break in sequence and we need to print the range.
We use a function prnt for doing the printing that accepts only one argument that is end delimiter. When prnt is called from p < $1-1 { ...} block then we pass RS or comma as end delimiter and when it gets called from END{...} block then we pass ORS or newline as delimiter.
Inside p < $1-1 { ...} we reset s (start range) to $1
After processing each line we store $1 in variable p.
prnt uses printf for formatted output. It always prints starting number s first. Then it checks if p > s and prints hyphen followed by p if that is the case.

Random selection of columns using linux command

I have a flat file (.txt) with 606,347 columns and I want to extract 50,000 RANDOM columns, with exception of the first column, which is sample identification. How can I do that using Linux commands?
My file looks like:
ID SNP1 SNP2 SNP3
1 0 0 2
2 1 0 2
3 2 0 1
4 1 1 2
5 2 1 0
It is TAB delimited.
Thank you so much.
Cheers,
Paula.

awk to the rescue!
$ cat shuffle.awk
function shuffle(a,n,k) {
for(i=1;i<=k;i++) {
j=int(rand()*(n-i))+i
if(j in a) a[i]=a[j]
else a[i]=j
a[j]=i;
}
}
BEGIN {srand()}
NR==1 {shuffle(ar,NF,ncols)}
{for(i=1;i<=ncols;i++) printf "%s", $(ar[i]) FS; print ""}
general usage
$ echo $(seq 5) | awk -f shuffle.awk -v ncols=5
3 4 1 5 2
in your special case you can print $1 and start the function loop from 2.
i.e. change
for(i=1;i<=k;i++) to a[1]=1; for(i=2;i<=k;i++)

Try this:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | xargs -d '\n' | tr ' ' ',' | xargs -I {} cut -d $'\t' -f {} file
Update:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | sed 's/.*/&p/' | sed -nf - <(tr '\t' '\n' <file) | tr '\n' '\t'

#karakfa 's answer is great, but the NF value can't be obtained in the BEGIN{} part of the awk script. Refer to: How to get number of fields in AWK prior to processing
I edited the code as:
head -4 10X.txt | awk '
function shuffle(a,n,k){
for(i=1;i<=k;i++) {
j=int(rand()*(n-i))+i
if(j in a) a[i]=a[j]
else a[i]=j
a[j]=i;
}
}
BEGIN{
FS=" ";OFS="\t"; ncols=10;
}NR==1{shuffle(tmp_array,NF,ncols);
for(i=1;i<=ncols;i++){
printf "%s", $(tmp_array[i]) OFS;
}
print "";
}NR>1{
printf "%s", $1 OFS;
for(i=1;i<=ncols;i++){
printf "%s", $(tmp_array[i]+1) OFS;
}
print "";
}'
Because I am processing the single-cell gene expression profiles, so from the second row, the first column will be gene names.
My output is:
D4-2_3095 D6-1_3010 D16-2i_1172 D4-1_337 iPSCs-2i_227 D4-2_170 D12-serum_1742 D4-1_1747 D10-2-2i_1373 D4-1_320
Sox17 0 0 0 0 0 0 0 0 0 0
Mrpl15 0.987862442831866 1.29176904082314 2.12650693025845 0 1.33257747910871 0 1.58815046312948 1.18541326956528 1.12103842107813 0.656789854017254
Lypla1 0 1.29176904082314 0 0 0.443505832809852 0.780385141793088 0.57601629238987 0 0 0.656789854017254

How to define two variables with one-to-one relationship in a loop in Shell

I have a file whose 3rd column is
ifile.txt
2
1
4
3
5
I need to assign these values in a loop with one-to-one relationship. i.e.
for i in 20 45 50 68 90; do
if [ $i == 20 ]; then j=$(awk 'NR==1 {print $3}' ifile.txt)
if [ $i == 45 ]; then j=$(awk 'NR==2 {print $3}' ifile.txt)
if [ $i == 50 ]; then j=$(awk 'NR==3 {print $3}' ifile.txt)
if [ $i == 68 ]; then j=$(awk 'NR==4 {print $3}' ifile.txt)
if [ $i == 90 ]; then j=$(awk 'NR==5 {print $3}' ifile.txt)
fi
Can anybody suggest me a simpler way to do it.

I've sometimes used a technique like this:
for code in 20/1 45/2 50/3 68/4 90/5
do
i=${code%/*}
j=$(awk -v line=${code#*/} 'NR == line { print $3 }' ifile.txt)
# …use $i and $j…
echo "i = $i, j = $j"
done
This works for erratic orders or gaps etc. In this case, the if code has the indexes going from 1 to 5, so an array would also work:
array=('' 20 45 50 68 90) # Indexing from zero
for line in $(seq 1 $((${#array[#]}-1)) )
do
i=${array[$line]}
j=$(awk -v line=$line 'NR == line { print $3 }' ifile.txt)
# …use $i and $j…
echo "i = $i, j = $j"
done
Given data file ifile.txt:
A B 2
A B 1
A B 4
A B 3
A B 5
Both the scripts shown produce:
i = 20, j = 2
i = 45, j = 1
i = 50, j = 4
i = 68, j = 3
i = 90, j = 5

Find minimum value using awk

I am trying to find the minimum value from a file.
input.txt
1
2
4
5
6
4
This is the code that I am using:
awk '{sum += $1; min = min < $1 ? min : $1} !(FNR%6){print min;sum=min = ""}' input.txt
But it is not working. Can anybody see the error in my code?

use below scrip to find min value in txt file.
awk 'min=="" || $1 < min {min=$1} END {print min}' input.txt

Set min to $1 on the first line
awk 'NR == 1 {min = $1} {sum += $1; min = min < $1 ? min : $1} !(FNR%6){print min;sum=min = ""}' input.txt
output:
1
Note that sum isn't used, you could simplify to this:
awk 'NR == 1 {min = $1;} {min = min < $1 ? min : $1} !(FNR%6){print min;}' input.txt
To allow any number of lines:
awk 'NR == 1 {min = $1;} {min = min < $1 ? min : $1} END{print min;}' input.txt

calculation of average with linux shell and searching for maximum with its label

I have an input like following
*KEYWORD
$TIME_VALUE = 9.9999993e-004
$STATE_NO = 2
$Output for State 2 at time = 0.001
*END
$NODAL_RESULTS
$RESULT OF Resultant Displacement
721810 1.7188E-2
721812 6.1973E-2
721825 1.1481E+0
721827 1.0962E+0
721852 5.1831E-1
721854 1.3085E-2
721867 1.1077E+0
. .
. .
. .
I need to find the maximum of the value in column 2 and also its average. Then I also need to output the
number which stands in the first column for the maximum value.
I used the following code for calculation of maximum and the average however a division by zero came.
awk: cmd. line:5: fatal: division by zero attempted
The code is as follows
# 1.k is the input file name.
sed -n '/^[0-9]\{1\}/p' 1.k > 2.k # delete all lines not starting with number
mv 2.k 1.k
sed -i -e '/^$/d' 1.k # delete all lines that are empty
#sed -i -e 's/^[ \t]*//;s/[ \t]*$//' 1.k
awk 'BEGIN{min=999}
{a[NR]=$0;if($2<min){min=$2;m[1]=NR;}if($2>max){max=$2;m[2]=NR;}m[2]+=$2;}
END{print "Min:"a[m[1]];
print "Max:"a[m[2]];
print "Number Of Nodes:" NR;
print "Avg:"m[3]/NR}' 1.k
Can anybody help me with this problem?
regards,

calculate.awk:
{
sum += $2
if (NR == 1) {
min = max = $2
minv= maxv= $1
}
if (min > $2) { min = $2; minv = $1 }
if (max < $2) { max = $2; maxv = $1 }
}
END {
print "Min: " minv ", " min
print "Avg: " sum / NR
print "Max: " maxv ", " max
print "# Nodes: " NR
}

If you first filter out the non-numeric info, then this awk script should do:
awk 'BEGIN{max=-999}\
{\
col1[NR]=$1;\
col2[NR]=$2;\
if($2>max){max=$2;imax=NR};\
sum+=$2\
}\
END{print col1[imax]" "col2[imax]" average: "sum/NR}' yourinputfile

After trying and trying, I found a working but I think not an optimumum solution to the problem.
sed -i -e 's/^[ \t]*//;s/[ \t]*$//' 1.k
sed -n '/^[0-9]\{1\}/p' 1.k > 2.k
mv 2.k 1.k
sed -i -e '/^$/d' 1.k
awk 'BEGIN{min=999}
{a[NR]=$0;if($2<min){min=$2;m[1]=NR;}if($2>max){max=$2;m[2]=NR;}m[3]+=$2;}
END{ print "Max:"a[m[2]];
print "Min:"a[m[1]];
print "Number Of Calls:" NR;
print "Avg:"m[3]/NR}' 1.k > result
Thanks for your valueable suggestion folks.

Perl solution:
<1.k perl -ne 'next unless ($key, $val) = /^([0-9]+)\s+([-+E.0-9]+)/;# Only process the important lines.
if ($val > $max) { # New maximum.
$max = $val;
#maxk = $key;
} elsif ($max == $val) { # The maximum appears more than once.
push #maxk, $key;
}
$sum += $val;
$count++;
} {
print "MAX: $max at #maxk, AVG: ", $sum / $count ,"\n"; '

Try simple-r solution:
perl -walne 'print $F[1] if /^\d/' | r summary -
Simple-r is an R wrapper for fast statistical analysis in command line. It can be found at:
https://code.google.com/p/simple-r/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to split values in a column into separate column - linux

You can use sed: sed -e 's/ \. /\.\t\. /g' -e 's/\//\t/g' <your_file>

Tried this and works well, you can tweak this as per your requirement. Assuming data is in data.txt file. cat data.txt | sed 1d | tr '/' '\t'| sed 's/\./.\t./g' This gives the output, but you need to get a work around for the spaces and tab that are getting messed up.

This might work for you (GNU sed): sed ''1s/\t/&&/3g;s/\t\.\t/\t.\t.\t/g;y/\//\t/' file

Related

Group the consecutive numbers in shell

Random selection of columns using linux command

How to define two variables with one-to-one relationship in a loop in Shell

Find minimum value using awk

calculation of average with linux shell and searching for maximum with its label

Categories

Resources