Split one column into "n" columns with one character each - linux

I have a file with a single column and 10 rows. Every row has the same number of characters (5). From this file I would like to get a file with 10 rows and 5 columns, where each column holds exactly one character. I have no idea how to do that in Linux. Any help? Would AWK do this?
The real data has many more rows (>4K) and characters (>500K), though. Here is a short version of the real data:
31313
30442
11020
12324
00140
34223
34221
43124
12211
04312
Desired output:
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2
Thanks!

I think that this does what you want:
$ awk -F '' '{ $1 = $1 }1' file
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2
The input field separator is set to the empty string, so every character is treated as a field. $1 = $1 means that awk "touches" every record, causing it to be reformatted, inserting the output field separator (a space) between every character. 1 is the shortest "true" condition, causing awk to print every record.
Note that the behaviour of setting the field separator to an empty string isn't well-defined, so may not work on your version of awk. You may find that setting the field separator differently e.g. by using -v FS= works for you.
Alternatively, you can do more or less the same thing in Perl:
perl -F -lanE 'say "@F"' file
-a splits each input record into the special array @F. -F followed by nothing sets the input field separator to the empty string. The quotes around @F mean that the list separator (a space by default) is inserted between each element.

You can use this sed as well:
sed 's/./& /g; s/ $//' file
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2

Oddly enough, this isn't trivial to do with most standard Unix tools (update: except, apparently, with awk). I would use Python:
python -c 'import sys; sys.stdout.writelines(" ".join(line.rstrip("\n")) + "\n" for line in sys.stdin)' < in.txt > new.txt
(This isn't the greatest idiomatic Python, but it suffices for a simple one-liner.)
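For a file with very long rows, a small script may be easier to maintain than a one-liner. The following is only a sketch of the same idea; the function names split_chars and split_file are made up for illustration, and it assumes the input is plain text with one record per line.

```python
def split_chars(line: str, sep: str = " ") -> str:
    """Return the characters of one line joined by sep, newline removed."""
    return sep.join(line.rstrip("\n"))


def split_file(src_path: str, dst_path: str) -> None:
    """Stream src_path line by line and write the split rows to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(split_chars(line) + "\n")


print(split_chars("31313\n"))  # 3 1 3 1 3
```

Reading the file one line at a time keeps only a single row in memory, which should matter more for the real data than for the 10-row sample.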

Another Unix toolchain for this task:
$ while read -r line; do echo "$line" | fold -w1 | xargs; done < file
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2

Related

Encoding Dataframe features numerically

I have a dataframe with a number of features. One particular feature is entirely dynamic, and I want to encode it. I cannot use one-hot encoding because the number of unique values can change. LabelEncoder might be of use, but can the number of classes/target labels be changed?
Consider an example (the Name feature):
index | A | B | Name
------+---+---+-----
1 5 6 abc
2 4 7 abc
2 3 0 def
2 3 0 ghi
3 3 0 abc
3 3 0 def
I wish to encode it as
index | A | B | Name
------+---+---+-----
1 5 6 1
2 4 7 1
2 3 0 2
2 3 0 3
3 3 0 1
3 3 0 2
Also, if a value different from all of these comes up later, it should automatically be stored in the encoder under the next successive code. For example, if the next input row is
index | A | B | Name
------+---+---+-----
1 5 6 xyz
it gets encoded and used as
index | A | B | Name
------+---+---+-----
1 5 6 4
And how do I get the original value back?
You can try factorize
df.Name=df.Name.factorize()[0]+1
You can use astype('category') and then the category accessor .cat to get the assigned codes (categories are numbered in sorted order, which here happens to match the order of appearance):
df['Name'] = df['Name'].astype('category').cat.codes + 1
Output:
index A B Name
0 1 5 6 1
1 2 4 7 1
2 2 3 0 2
3 2 3 0 3
4 3 3 0 1
5 3 3 0 2
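Neither answer shows how to handle values that arrive later or how to decode. One possible approach, sketched here in plain Python, is to keep the mapping yourself. The class name IncrementalEncoder and its methods are hypothetical, not part of pandas or scikit-learn.

```python
class IncrementalEncoder:
    """Assign successive integer codes to labels in order of first appearance."""

    def __init__(self, start: int = 1) -> None:
        self._start = start
        self._codes = {}   # label -> code
        self._labels = []  # index i holds the label for code start + i

    def encode(self, label):
        # Unseen labels get the next free code; known labels keep theirs.
        if label not in self._codes:
            self._codes[label] = self._start + len(self._labels)
            self._labels.append(label)
        return self._codes[label]

    def decode(self, code):
        idx = code - self._start
        if not 0 <= idx < len(self._labels):
            raise KeyError(code)
        return self._labels[idx]


enc = IncrementalEncoder()
names = ["abc", "abc", "def", "ghi", "abc", "def"]
print([enc.encode(n) for n in names])  # [1, 1, 2, 3, 1, 2]
print(enc.encode("xyz"))               # 4
print(enc.decode(4))                   # xyz
```

With a dataframe, something like df['Name'].map(enc.encode) should apply the same encoder to a column, and later batches can reuse the enc object so that existing codes stay stable.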

Average from different columns in shell script

I have a datafile with 10 columns as given below
ifile.txt
2 4 4 2 1 2 2 4 2 1
3 3 1 5 3 3 4 5 3 3
4 3 3 2 2 1 2 3 4 2
5 3 1 3 1 2 4 5 6 8
I want to add an 11th column showing the average of each row across the 10 columns, i.e. AVE(2 4 4 2 1 2 2 4 2 1) and so on. My script below works well, but I would like to make it simpler and shorter. Thanks in advance for any help or suggestions.
awk '{for(i=1;i<=NF;i++){s+=$i;ss+=$i}m=s/NF;$(NF+1)=ss/NF;s=ss=0}1' ifile.txt
This should work
awk '{for(i=1;i<=NF;i++)x+=$i;$(NF+1)=x/NF;x=0}1' file
For each field, the loop adds the value to x.
Next, field 11 is set to the sum in x divided by the number of fields NF.
x is reset to zero for the next line.
1 evaluates to true and triggers the default action in awk, which is to print the line.
Does this help?
awk '{for(i=1;i<=NF;i++)s+=$i;print $0,s/NF;s=0}' ifile.txt
or
awk '{for(i=1;i<=NF;i++)ss+=$i;$(NF+1)=ss/NF;ss=0}1' ifile.txt
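For comparison, the same per-row calculation can be sketched in Python. The function name append_mean is made up for illustration, and the sketch assumes every field is numeric and whitespace-separated.

```python
def append_mean(line: str) -> str:
    """Return the line with the mean of its numeric fields appended."""
    values = [float(tok) for tok in line.split()]
    if not values:  # leave blank lines untouched
        return line.rstrip("\n")
    mean = sum(values) / len(values)
    # The "g" format uses up to 6 significant digits, like awk's default.
    return f"{line.rstrip()} {mean:g}"


rows = [
    "2 4 4 2 1 2 2 4 2 1",
    "3 3 1 5 3 3 4 5 3 3",
    "4 3 3 2 2 1 2 3 4 2",
    "5 3 1 3 1 2 4 5 6 8",
]
for row in rows:
    print(append_mean(row))
```

For the four sample rows this appends 2.4, 3.3, 2.6 and 3.8.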

How to calculate standard deviation from different columns in shell script

I have a datafile with 10 columns as given below
ifile.txt
2 4 4 2 1 2 2 4 2 1
3 3 1 5 3 3 4 5 3 3
4 3 3 2 2 1 2 3 4 2
5 3 1 3 1 2 4 5 6 8
I want to add an 11th column showing the standard deviation of each row across the 10 columns, i.e. STDEV(2 4 4 2 1 2 2 4 2 1) and so on.
I am able to do it by taking the transpose, then using the following command, and taking the transpose again:
awk '{x[NR]=$0; s+=$1} END{a=s/NR; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/NR); print sd}'
Can anybody suggest a simpler way to do it directly along each row?
You can do the same with one pass as well.
awk '{for(i=1;i<=NF;i++){s+=$i;ss+=$i*$i}m=s/NF;$(NF+1)=sqrt(ss/NF-m*m);s=ss=0}1' ifile.txt
Do you mean something like this?
awk '{for(i=1;i<=NF;i++)s+=$i;M=s/NF;
for(i=1;i<=NF;i++)sd+=(($i-M)^2);$(NF+1)=sqrt(sd/NF);M=sd=s=0}1' file
2 4 4 2 1 2 2 4 2 1 1.11355
3 3 1 5 3 3 4 5 3 3 1.1
4 3 3 2 2 1 2 3 4 2 0.916515
5 3 1 3 1 2 4 5 6 8 2.13542
You just use the fields instead of transposing and using the rows.
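Both awk answers divide by NF, so they compute the population standard deviation rather than the sample one. As a cross-check, here is a minimal Python sketch using statistics.pstdev from the standard library. The function name append_pstdev is hypothetical, and all fields are assumed to be numeric.

```python
import statistics


def append_pstdev(line: str) -> str:
    """Return the line with the population standard deviation appended."""
    values = [float(tok) for tok in line.split()]
    if not values:  # leave blank lines untouched
        return line.rstrip("\n")
    sd = statistics.pstdev(values)  # divides by N, like the awk versions
    return f"{line.rstrip()} {sd:g}"


print(append_pstdev("2 4 4 2 1 2 2 4 2 1"))  # 2 4 4 2 1 2 2 4 2 1 1.11355
```

If the sample standard deviation (dividing by N-1) is what you actually need, statistics.stdev would be the function to use instead, and the numbers would differ from the awk output above.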

Reverse sort order of a multicolumn file in BASH

I've the following file:
1 2 3
1 4 5
1 6 7
2 3 5
5 2 1
and I want the file to be sorted by the second column, but from the largest number (in this case 6) to the smallest. I've tried
sort +1 -2 file.dat
but it sorts in ascending order (rather than descending).
The results should be:
1 6 7
1 4 5
2 3 5
5 2 1
1 2 3
sort -nrk 2,2
does the trick.
n for numeric sorting, r for reverse order and k 2,2 for the second column.
Have you tried -r? From the man page:
-r, --reverse
reverse the result of comparisons
As mentioned, most versions of sort have the -r option; if yours doesn't, try tac:
$ sort -nk 2,2 file.dat | tac
1 6 7
1 4 5
2 3 5
5 2 1
1 2 3
$ sort -nrk 2,2 file.dat
1 6 7
1 4 5
2 3 5
5 2 1
1 2 3
tac - concatenate and print files in reverse
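If sort is not available at all, the same ordering can be sketched in Python. The function name sort_by_second_desc is made up for illustration. Breaking ties on the whole line is an assumption intended to imitate sort's last-resort comparison, using plain string comparison rather than locale collation.

```python
def sort_by_second_desc(lines):
    """Sort lines by their second field, numerically, largest first."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    # Ties on the second field fall back to comparing the whole line,
    # and reverse=True flips that comparison as well.
    return sorted(rows, key=lambda ln: (float(ln.split()[1]), ln), reverse=True)


data = ["1 2 3", "1 4 5", "1 6 7", "2 3 5", "5 2 1"]
for row in sort_by_second_desc(data):
    print(row)
```

For the sample file this prints the rows in the same order as the sort -nrk 2,2 output above.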

Convert column pattern

I have this kind of file:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
8 0 1
10 0 1
11 0 1
The RS separator is an empty line by default.
If there is a double blank line, we have to substitute one of them with the pattern $1 0 0, where $1 means the increased "number" before the $1 0 * record.
If the separator is empty line + 1 empty line we have to increase the $1 by 1.
If the separator is empty line + 2 empty line we have to increase the $1 by 2.
...
and I need to get this output:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 0 1
11 0 1
Thanks in advance!
awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
Output
$ awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 0 1
11 0 1
Explanation
NF{f=0;n=$1;print;next}: if the current line has data, unset flag f, save the number in the first field to n, print the line and skip the rest of the script
{print;f=1}: We only reach this action if the current line is blank. If so, print the line and set the flag f
f{print ++n " 0 0"}: We only execute this action if the flag f is set which only happens if the previous line was blank. If we enter this action, print the missing fields with an incremented n
You can try something like this. The benefit of this way is that your input file need not have an empty line for the missing numbers.
awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN{getline; col=$1;line=$0;print line}
$1==col{print $0;next }
($1==col+1){print $0;col=$1;next}
{x=$1;y=$0; col++; while (col < x) {print col" 0 0";col++};print y;next}' file
Input File:
[jaypal:~/Temp] cat file
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
8 0 1
10 0 1
11 0 1
Script Output:
[jaypal:~/Temp] awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN{getline; col=$1;line=$0;print line}
$1==col{print $0;next }
($1==col+1){print $0;col=$1;next}
{x=$1;y=$0; col++; while (col < x) {print col" 0 0";col++};print y;next}' file
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 0 1
11 0 1
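The gap-filling idea from the second answer can also be sketched in Python. The function name fill_gaps is hypothetical. The sketch assumes the first field is an integer that never decreases, and it drops blank lines instead of reproducing the record separators.

```python
def fill_gaps(lines):
    """Insert 'n 0 0' rows for first-column numbers missing from the input."""
    out = []
    prev = None
    for ln in lines:
        fields = ln.split()
        if not fields:  # skip blank separator lines
            continue
        cur = int(fields[0])
        if prev is not None:
            # Emit one placeholder row per number skipped since the last row.
            for missing in range(prev + 1, cur):
                out.append(f"{missing} 0 0")
        out.append(" ".join(fields))
        prev = cur
    return out


sample = ["5 0 1", "", "8 0 1", "10 0 1", "11 0 1"]
print("\n".join(fill_gaps(sample)))
```

Like the second awk script, this does not depend on how many blank lines separate the records, only on the jump in the first column.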
