Obtaining UNIQUE VALUE occurrences count in a set of COLUMNS using AWK - linux

Ignoring columns 1 and 2 (that is, looking only at the remaining columns), I would like to count the unique even values (ignoring odd ones) in the following set of data.
I have tried:
awk '{ a[$3, $4, $5, $6, $7]++ } END { for (b in a) { cnt+=1 } {print cnt}}' file
I obtain 76 but I don’t expect this value.
> 0 0
> 1 0 0
> 2 0 2
> 3 0 0 6
> 4 0 0 8
> 5 0 0 10
> 6 0 2 14
> 7 0 2 16
> 8 0 0 6 20
> 9 0 0 8 24
> 10 0 0 8 26
> 11 0 0 10 32
> 12 0 0 10 34
> 13 0 2 14 40
> 14 0 2 16 42
> 15 0 0 8 24 48
> 16 0 0 8 24 50
> 17 0 0 8 26 56
> 18 0 0 10 32 60
> 19 0 0 10 34 64
> 20 0 0 10 34 66
> 21 0 2 14 40 72
> 22 0 0 8 24 48 76
> 23 0 0 8 24 50 82
> 24 0 0 8 26 56 88
> 25 0 0 8 26 56 90
> 26 0 0 10 32 60 96
> 27 0 0 10 32 60 98
> 28 0 0 10 34 64 104
> 29 0 0 10 34 64 106
> 30 0 0 10 34 66 112
> 31 0 0 10 34 66 114
> 0 1
> 1 1 2 5
> 2 1 2
> 3 1 2 12 23 19
> 4 1 2 12 23
> 5 1 2 12
> 6 1 2 12 28
> 7 1 2 12 28 36
> 8 1 2 12 30 47 45
> 9 1 2 12 30 47
> 10 1 2 12 30
> 11 1 2 12 30 52
> 12 1 2 12 28 38
> 13 1 2 12 28 38 62
> 14 1 2 12 28 38 62 68
> 15 1 2 12 30 54 75
> 16 1 2 12 30 54
> 17 1 2 12 30 54 78
> 18 1 2 12 30 54 78 84
> 19 1 2 12 30 54 78 84 92
> 20 1 2 12 28 38 62 70
> 21 1 2 12 28 38 62 70 108
> 22 1 2 12 30 54 80
> 23 1 2 12 30 54 78 86
> 24 1 2 12 30 54 78 86 120
> 25 1 2 12 30 54 78 84 94
> 26 1 2 12 30 54 78 84 94 124
> 27 1 2 12 30 54 78 84 92 102
> 28 1 2 12 30 54 78 84 92 102 128
> 29 1 2 12 28 38 62 70 110
> 30 1 2 12 28 38 62 70 110 130
> 31 1 2 12 28 38 62 70 108 116
> 0 2
> 1 2 2 5
> 2 2 2
> 3 2 2 5 6
> 4 2 2 5 6 18
> 5 2 2 5 6 18 22
> 6 2 2 14
> 7 2 2 16
> 8 2 2 5 6 20
> 9 2 2 5 6 20 44
> 10 2 2 5 6 18 26
> 11 2 2 5 6 18 22 32
> 12 2 2 5 6 18 22 32 58
> 13 2 2 14 40
> 14 2 2 16 42
> 15 2 2 5 6 20 44 50 75
> 16 2 2 5 6 20 44 50
> 17 2 2 5 6 18 26 56
> 18 2 2 5 6 18 22 32 60
> 19 2 2 14 40 72 109 101
> 20 2 2 14 40 72 109
> 21 2 2 14 40 72
> 22 2 2 5 6 20 44 50 80
> 23 2 2 5 6 20 44 50 80 118
> 24 2 2 5 6 20 44 50 80 118 120
> 25 2 2 5 6 20 44 50 80 118 120 122
> 26 2 2 14 40 72 109 101 102 127
> 27 2 2 14 40 72 109 101 102
> 28 2 2 14 40 72 109 101 104
> 29 2 2 14 40 72 116 133 131
> 30 2 2 14 40 72 116 133
> 31 2 2 14 40 72 116
> 0 3
> 1 3 0
> 2 3 0 4
> 3 3 0 6
> 4 3 0 6 18
> 5 3 0 6 18 22
> 6 3 0 4 16 37
> 7 3 0 4 16
> 8 3 0 6 20
> 9 3 0 6 18 26 47
> 10 3 0 6 18 26
> 11 3 0 6 18 22 32
> 12 3 0 6 18 22 32 58
> 13 3 0 4 16 42 69
> 14 3 0 4 16 42
> 15 3 0 6 18 26 47 48
> 16 3 0 6 18 26 47 48 74
> 17 3 0 6 18 26 56
> 18 3 0 6 18 22 32 60
> 19 3 0 6 18 22 32 58 64
> 20 3 0 6 18 22 32 58 66
> 21 3 0 6 18 22 32 58 66 108
> 22 3 0 6 18 26 47 48 76
> 23 3 0 6 18 26 56 86
> 24 3 0 6 18 26 56 88
> 25 3 0 6 18 26 56 90
> 26 3 0 6 18 22 32 60 96
> 27 3 0 6 18 22 32 60 98
> 28 3 0 6 18 22 32 58 64 104
> 29 3 0 6 18 22 32 58 64 106
> 30 3 0 6 18 22 32 58 66 112
> 31 3 0 6 18 22 32 58 66 114
> 0 4
> 1 4 0
> 2 4 2
> 3 4 0 6
> 4 4 0 8
> 5 4 0 10
> 6 4 2 16 37
> 7 4 2 16
> 8 4 0 8 24 45
> 9 4 0 8 24
> 10 4 0 8 26
> 11 4 0 8 26 52
> 12 4 2 16 37 38
> 13 4 2 16 42 69
> 14 4 2 16 42
> 15 4 0 8 24 48
> 16 4 0 8 24 50
> 17 4 0 8 26 56
> 18 4 0 8 26 52 60
> 19 4 2 16 37 38 64
> 20 4 2 16 42 69 70
> 21 4 2 16 42 69 72
> 22 4 0 8 24 48 76
> 23 4 0 8 24 50 82
> 24 4 0 8 26 56 88
> 25 4 0 8 26 52 60 94
> 26 4 0 8 26 52 60 96
> 27 4 0 8 26 52 60 98
> 28 4 2 16 37 38 64 104
> 29 4 2 16 42 69 70 110
> 30 4 2 16 42 69 70 112
> 31 4 2 16 42 69 70 114

You can try this awk command to count unique rows (the combination of all remaining columns) while ignoring the 1st and 2nd columns:
awk '{$1=$2=""; !seen[$0]++} END{print length(seen)}' file
130
If you want to count unique individual values, excluding the 1st and 2nd columns and ignoring odd numbers, then use:
awk '{for (i=3; i<=NF; i++) !($i%2) && !seen[$i]++} END{print length(seen)}' file
63
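For comparison, the same counting logic can be sketched in Python. This is only an illustration under stated assumptions: the function name `count_unique_even`, the `skip` parameter, and the five-line sample are hypothetical, and the sample lines omit the leading `>` quote markers shown in the data above.

```python
def count_unique_even(lines, skip=2):
    """Count distinct even values found after the first `skip` columns."""
    seen = set()
    for line in lines:
        # Drop the first `skip` whitespace-separated fields of each line.
        for field in line.split()[skip:]:
            value = int(field)
            if value % 2 == 0:
                seen.add(value)
    return len(seen)


# Hypothetical sample in the same layout as the data in the question.
sample = ["0 0", "1 0 0", "2 0 2", "3 0 0 6", "1 1 2 5"]
print(count_unique_even(sample))
```

Like the second awk command, a set only grows when an even value is seen for the first time, so its final size is the number of distinct even values.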

Related

I would like to find consecutive numbers in column A and column B in python (pandas)

I would like to find runs of consecutive numbers in column A and column B in Python. Within a run, column A should be ascending while column B is descending. I am attaching an example file.
Input file
nucleotide Pos_A Pos_B
Connection_Pos20_Pos102 20 102
Connection_Pos19_Pos102 19 102
Connection_Pos20_Pos101 20 101
Connection_Pos18_Pos102 18 102
Connection_Pos19_Pos101 19 101
Connection_Pos20_Pos100 20 100
Connection_Pos17_Pos102 17 102
Connection_Pos18_Pos101 18 101
Connection_Pos19_Pos100 19 100
Connection_Pos20_Pos99 20 99
Connection_Pos16_Pos102 16 102
Connection_Pos17_Pos101 17 101
Connection_Pos18_Pos100 18 100
Connection_Pos19_Pos99 19 99
Connection_Pos20_Pos98 20 98
Connection_Pos15_Pos102 15 102
Connection_Pos16_Pos101 16 101
Connection_Pos17_Pos100 17 100
Connection_Pos18_Pos99 18 99
Connection_Pos19_Pos98 19 98
Connection_Pos20_Pos97 20 97
Connection_Pos14_Pos102 14 102
Connection_Pos15_Pos101 15 101
Connection_Pos16_Pos100 16 100
Output:
nucleotide Pos_A Pos_B Consecutive_ID Consecutive_Number_(Size)
Connection_Pos20_Pos102 20 102 101 1
Connection_Pos19_Pos102 19 102 100 2
Connection_Pos20_Pos101 20 101 100 2
Connection_Pos18_Pos102 18 102 99 3
Connection_Pos19_Pos101 19 101 99 3
Connection_Pos20_Pos100 20 100 99 3
Connection_Pos17_Pos102 17 102 98 4
Connection_Pos18_Pos101 18 101 98 4
Connection_Pos19_Pos100 19 100 98 4
Connection_Pos20_Pos99 20 99 98 4
Connection_Pos16_Pos102 16 102 97 5
Connection_Pos17_Pos101 17 101 97 5
Connection_Pos18_Pos100 18 100 97 5
Connection_Pos19_Pos99 19 99 97 5
Connection_Pos20_Pos98 20 98 97 5
Connection_Pos15_Pos102 15 102 96 6
Connection_Pos16_Pos101 16 101 96 6
Connection_Pos17_Pos100 17 100 96 6
Connection_Pos18_Pos99 18 99 96 6
Connection_Pos19_Pos98 19 98 96 6
Connection_Pos20_Pos97 20 97 96 6
Connection_Pos14_Pos102 14 102 95 7
Connection_Pos15_Pos101 15 101 95 7
Connection_Pos16_Pos100 16 100 95 7
Connection_Pos17_Pos99 17 99 95 7
Connection_Pos18_Pos98 18 98 95 7
Connection_Pos19_Pos97 19 97 95 7
Connection_Pos20_Pos96 20 96 95 7
For Consecutive ID, a new run starts wherever the previous Pos_B minus the current Pos_B is not 1. Each such row should lower the ID by 1, so we mark those rows as -1 with ne(1).mul(-1), take the cumulative sum, and add the first Pos_B value:
df['ID'] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]
For Consecutive Number, a new run starts wherever the previous Pos_A minus the current Pos_A is not -1. Each such row should raise the number by 1, so we mark those rows as 1 and take the cumulative sum again:
df['Number'] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()
Result:
nucleotide Pos_A Pos_B ID Number
0 Connection_Pos20_Pos102 20 102 101 1
1 Connection_Pos19_Pos102 19 102 100 2
2 Connection_Pos20_Pos101 20 101 100 2
3 Connection_Pos18_Pos102 18 102 99 3
4 Connection_Pos19_Pos101 19 101 99 3
5 Connection_Pos20_Pos100 20 100 99 3
6 Connection_Pos17_Pos102 17 102 98 4
7 Connection_Pos18_Pos101 18 101 98 4
8 Connection_Pos19_Pos100 19 100 98 4
9 Connection_Pos20_Pos99 20 99 98 4
10 Connection_Pos16_Pos102 16 102 97 5
11 Connection_Pos17_Pos101 17 101 97 5
12 Connection_Pos18_Pos100 18 100 97 5
13 Connection_Pos19_Pos99 19 99 97 5
14 Connection_Pos20_Pos98 20 98 97 5
15 Connection_Pos15_Pos102 15 102 96 6
16 Connection_Pos16_Pos101 16 101 96 6
17 Connection_Pos17_Pos100 17 100 96 6
18 Connection_Pos18_Pos99 18 99 96 6
19 Connection_Pos19_Pos98 19 98 96 6
20 Connection_Pos20_Pos97 20 97 96 6
21 Connection_Pos14_Pos102 14 102 95 7
22 Connection_Pos15_Pos101 15 101 95 7
23 Connection_Pos16_Pos100 16 100 95 7
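The two cumulative sums above can be checked on a small frame. The sketch below assumes only the first six rows of the input file and uses the hypothetical helper names `new_run_a` and `new_run_b`; it also uses `.iloc[0]` instead of `[0]` so that it does not depend on the index labels.

```python
import pandas as pd

# First six rows of the example input (Pos_A and Pos_B only).
df = pd.DataFrame({
    "Pos_A": [20, 19, 20, 18, 19, 20],
    "Pos_B": [102, 102, 101, 102, 101, 100],
})

# A run continues only while Pos_B drops by exactly 1 from the previous row.
# The first row compares against NaN, so it always starts a run.
new_run_b = df["Pos_B"].shift().sub(df["Pos_B"]).ne(1)
df["ID"] = df["Pos_B"].iloc[0] - new_run_b.cumsum()

# A run continues only while Pos_A rises by exactly 1 from the previous row.
new_run_a = df["Pos_A"].shift().sub(df["Pos_A"]).ne(-1)
df["Number"] = new_run_a.cumsum()

print(df)
```

Subtracting the cumulative count of run starts is equivalent to the `mul(-1).cumsum()` form; it is written this way here only to make the direction of the count explicit.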
Alternatively, compute a run identifier for each column separately, then combine the two with groupby and ngroup:
s1 = df.Pos_A.diff().le(0).cumsum()
s2 = df.Pos_B.diff().ge(0).cumsum()
df['out'] = df.groupby([s1,s2]).ngroup()+1
Out[452]:
0 1
1 2
2 2
3 3
4 3
5 3
6 4
7 4
8 4
9 4
10 5
11 5
12 5
13 5
14 5
15 6
16 6
17 6
18 6
19 6
20 6
21 7
22 7
23 7
24 7
25 7
26 7
27 7
dtype: int64
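A minimal, self-contained sketch of the groupby approach, again assuming only the first six rows of the input file. Note that this variant starts a new run whenever Pos_A fails to increase or Pos_B fails to decrease, rather than requiring a step of exactly 1, so the two approaches could differ on data with larger jumps.

```python
import pandas as pd

# First six rows of the example input (Pos_A and Pos_B only).
df = pd.DataFrame({
    "Pos_A": [20, 19, 20, 18, 19, 20],
    "Pos_B": [102, 102, 101, 102, 101, 100],
})

# Run counter for Pos_A: increments whenever the value does not increase.
s1 = df["Pos_A"].diff().le(0).cumsum()
# Run counter for Pos_B: increments whenever the value does not decrease.
s2 = df["Pos_B"].diff().ge(0).cumsum()

# ngroup() numbers each distinct (s1, s2) pair from 0; add 1 to start at 1.
df["out"] = df.groupby([s1, s2]).ngroup() + 1

print(df)
```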

how to shift column labels to left python

I have a dataframe and I want to shift the column names to the left, starting from a specific column. The original dataframe has many columns, so I cannot do this by renaming the columns by hand.
import pandas as pd

df = pd.DataFrame({'A': [1,3,4,7,8,11,1,15,20,15,16,87],
                   'H': [1,3,4,7,8,11,1,15,78,15,16,87],
                   'N': [1,3,4,98,8,11,1,15,20,15,16,87],
                   'p': [1,3,4,9,8,11,1,15,20,15,16,87],
                   'B': [1,3,4,6,8,11,1,19,20,15,16,87],
                   'y': [0,0,0,0,1,1,1,0,0,0,0,0]})
print(df)
A H N p B y
0 1 1 1 1 1 0
1 3 3 3 3 3 0
2 4 4 4 4 4 0
3 7 7 98 9 6 0
4 8 8 8 8 8 1
5 11 11 11 11 11 1
6 1 1 1 1 1 1
7 15 15 15 15 19 0
8 20 78 20 20 20 0
9 15 15 15 15 15 0
10 16 16 16 16 16 0
11 87 87 87 87 87 0
Here I want to remove the label N first. The dataframe after removing label N:
A H p B y
0 1 1 1 1 1 0
1 3 3 3 3 3 0
2 4 4 4 4 4 0
3 7 7 98 9 6 0
4 8 8 8 8 8 1
5 11 11 11 11 11 1
6 1 1 1 1 1 1
7 15 15 15 15 19 0
8 20 78 20 20 20 0
9 15 15 15 15 15 0
10 16 16 16 16 16 0
11 87 87 87 87 87 0
Required output:
A H P B y
0 1 1 1 1 1 0
1 3 3 3 3 3 0
2 4 4 4 4 4 0
3 7 7 98 9 6 0
4 8 8 8 8 8 1
5 11 11 11 11 11 1
6 1 1 1 1 1 1
7 15 15 15 15 19 0
8 20 78 20 20 20 0
9 15 15 15 15 15 0
10 16 16 16 16 16 0
11 87 87 87 87 87 0
Here the last column can be ignored.
Note: the original dataframe has many columns, so I cannot rename them manually; I need some automatic method to shift the column names left.
You can replace the label with an empty string and then move the empty label to the end with a stable sort:
df.columns=sorted(df.columns.str.replace('N',''),key=lambda x : x=='')
df
A H p B y
0 1 1 1 1 1 0
1 3 3 3 3 3 0
2 4 4 4 4 4 0
3 7 7 98 9 6 0
4 8 8 8 8 8 1
5 11 11 11 11 11 1
6 1 1 1 1 1 1
7 15 15 15 15 19 0
8 20 78 20 20 20 0
9 15 15 15 15 15 0
10 16 16 16 16 16 0
11 87 87 87 87 87 0
Replace the columns with your own custom list.
>>> cols = list(df.columns)
>>> cols.remove('N')
>>> df.columns = cols + ['']
Output
>>> df
A H p B y
0 1 1 1 1 1 0
1 3 3 3 3 3 0
2 4 4 4 4 4 0
3 7 7 98 9 6 0
4 8 8 8 8 8 1
5 11 11 11 11 11 1
6 1 1 1 1 1 1
7 15 15 15 15 19 0
8 20 78 20 20 20 0
9 15 15 15 15 15 0
10 16 16 16 16 16 0
11 87 87 87 87 87 0
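The list-based approach can be wrapped in a small helper. This is a sketch under assumptions: `shift_labels_left` is a hypothetical name, and the frame is a two-row cut of the example data. The helper removes the label by exact match, which avoids touching other column names that merely contain the letter N.

```python
import pandas as pd


def shift_labels_left(columns, label):
    """Drop `label` from the column names and pad the end with empty names."""
    kept = [name for name in columns if name != label]
    # Pad so that the number of labels still matches the number of columns.
    return kept + [""] * (len(columns) - len(kept))


# Two-row cut of the example frame from the question.
df = pd.DataFrame({"A": [1, 7], "H": [1, 7], "N": [1, 98],
                   "p": [1, 9], "B": [1, 6], "y": [0, 0]})

df.columns = shift_labels_left(list(df.columns), "N")
print(df)
```

Only the labels move; the data stays where it is, so the values that used to sit under N are now under p, as in the required output.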

printing a string like a matrix

Trying to let the user input a number, and print a table according to the square of its size. Here's an example.
Size--> 3
0 1 2
3 4 5
6 7 8
Size--> 4
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
Size--> 6
 0  1  2  3  4  5
 6  7  8  9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
Size--> 9
 0  1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62
63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80
Here is the code that I have tried.
length=int(input('Size--> '))
size=length*length
biglist=[]
for i in range(size):
    biglist.append(i)
biglist = [str(i) for i in biglist]
for i in range(0, len(biglist), length):
    print(' '.join(biglist[i: i+length]))
But instead, here is what I got:
Size--> 3
0 1 2
3 4 5
6 7 8
Size--> 4
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Size--> 6
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
As you can see the rows are not aligned properly like the example.
What's the simplest way of presenting it in a proper alignment? Thx :)
Use a format specification with right alignment (written here as an f-string).
strlen is the number of characters needed for the largest number, plus one for the separating space.
length = int(input('Size--> '))
size = length*length
biglist = []
for i in range(size):
    biglist.append(i)
biglist = [str(i) for i in biglist]
strlen = len(str(length**2-1))+1
for i in range(0, len(biglist), length):
    # print(' '.join(biglist[i: i+length]))
    for x in biglist[i: i+length]:
        print(f"{x:>{strlen}}", end='')
    print()
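The same right-alignment idea can be written as a function that returns the rows instead of printing cell by cell. This is a sketch, not the answerer's code: `format_grid` is a hypothetical name, and it uses `str.rjust` with a single space between cells, so unlike the version above it emits no leading padding column.

```python
def format_grid(n):
    """Return the n x n grid of 0..n*n-1 as right-aligned text rows."""
    # Width of the widest number, e.g. 2 for n = 4 (largest value 15).
    width = len(str(n * n - 1))
    rows = []
    for start in range(0, n * n, n):
        cells = [str(value).rjust(width) for value in range(start, start + n)]
        rows.append(" ".join(cells))
    return rows


for row in format_grid(4):
    print(row)
```

Returning the rows as strings keeps the formatting separate from the input and output handling, which makes the alignment easy to check.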

How to divide 1 column into 5 segments with pandas and python?

I have a list with 1 column and 50 rows.
I want to divide it into 5 segments, and each segment has to become a column of a dataframe. I do not want NaN values to appear (Figure 2). How can I solve that?
This is what I tried:
df = pd.DataFrame(result_list)
AWA=df[:10]
REM=df[10:20]
S1=df[20:30]
S2=df[30:40]
SWS=df[40:50]
result = pd.concat([AWA, REM, S1, S2, SWS], axis=1)
result
Figure2
You can use numpy's reshape function:
import numpy as np
import pandas as pd

result_list = [i for i in range(50)]
# order='F' fills the 10x5 array column by column
pd.DataFrame(np.reshape(result_list, (10, 5), order='F'))
Out:
0 1 2 3 4
0 0 10 20 30 40
1 1 11 21 31 41
2 2 12 22 32 42
3 3 13 23 33 43
4 4 14 24 34 44
5 5 15 25 35 45
6 6 16 26 36 46
7 7 17 27 37 47
8 8 18 28 38 48
9 9 19 29 39 49
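The NaN values in the original attempt most likely come from index alignment: each slice keeps its original row labels (0-9, 10-19, and so on), and `pd.concat(..., axis=1)` aligns on those labels. The sketch below shows the reshape approach next to a slicing variant that resets each slice's index first; it assumes `result_list` holds the numbers 0 to 49 and reuses the segment names from the question as column names.

```python
import numpy as np
import pandas as pd

result_list = list(range(50))  # assumed stand-in for the real data
names = ["AWA", "REM", "S1", "S2", "SWS"]

# Fortran order fills column by column: rows 0-9 go to the first column, etc.
by_reshape = pd.DataFrame(np.reshape(result_list, (10, 5), order="F"),
                          columns=names)

# Slicing variant: reset each slice's index so concat aligns on rows 0..9.
df = pd.DataFrame(result_list)
parts = [df[i:i + 10].reset_index(drop=True) for i in range(0, 50, 10)]
by_concat = pd.concat(parts, axis=1)
by_concat.columns = names

print(by_concat)
```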

Incorrect logical indexing?

For the code:
dataset = pd.read_csv("/Users/Akshita/Desktop/EE660/donor_raw_data_medmean.csv", header=None, names=None)
# Separate data and label
X_label = dataset[1:19373][0]
X_data = dataset[1:19373]
print(X_data[X_label==1])
I get the following output (there are actually about 4,000 samples with label=1):
0 1 2 3 4 5 6 7 8 9 ... 51 52 53 54 55 56 57 58 \
16386 1 17 60 0 1 0 0 0 0 1 ... 0 20 20 20 5 10 15 15
16396 1 137 60 0 1 0 0 0 0 1 ... 15 25 10 15 6 14 16 120
16399 1 89 54 0 1 0 0 0 0 1 ... 10 15 5 15 6 14 16 79
16402 1 89 75 0 1 0 0 0 0 1 ... 25 35 10 35 6 13 15 79
..
..
19356 1 101 80 1 0 0 1 0 0 2 ... 25 30 5 28 7 16 18 101
19363 1 65 70 1 0 0 1 0 0 1 ... 7 12 5 10 4 8 20 63
19372 1 29 70 0 0 0 1 0 0 2 ... 0 25 25 25 4 9 24 24
..
[859 rows x 61 columns]
and for
print(X_data[X_label==0])
I get the following output (there are about 15,000 samples with label=0):
0 1 2 3 4 5 6 7 8 9 ... 51 52 53 54 55 56 57 58 \
16384 0 17 74 0 1 0 0 0 0 1 ... 0 15 15 15 4 10 17 17
16385 0 17 60 0 1 0 0 0 0 2 ... 0 15 15 15 4 11 17 17
16387 0 29 67 0 1 0 0 0 0 1 ... 0 20 20 20 5 11 23 28
16388 0 53 60 0 1 0 0 0 0 1 ... 5 30 25 30 5 11 26 52
16389 0 65 49 0 1 0 0 0 0 1 ... 30 35 5 27 6 13 16 56
..
..
19369 0 137 77 1 0 1 0 0 0 1 ... 9 10 1 10 6 13 21 130
19370 0 29 60 1 0 0 1 0 0 1 ... 0 15 15 15 3 9 23 23
19371 0 129 78 1 0 0 1 0 0 2 ... 20 25 5 25 7 24 8 129
What can I be doing wrong?
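The boolean-indexing pattern itself can be checked on a toy frame. The sketch below also illustrates one situation in which such a mask matches fewer rows than expected: a label column that holds a mix of strings and integers. Whether that applies to the CSV file above is an assumption and is not confirmed by the post; the frame and the variable names here are hypothetical.

```python
import pandas as pd

# Toy frame: column 0 is the label. The first three labels are strings and
# the last three are integers (an assumed scenario, not the real data).
frame = pd.DataFrame({0: ["1", "0", "1", 1, 0, 1],
                      1: [10, 20, 30, 40, 50, 60]})

# Comparing with the integer 1 matches only the integer entries.
mixed_hits = frame[frame[0] == 1]

# Converting the labels to numbers first makes every label comparable.
numeric_labels = pd.to_numeric(frame[0])
clean_hits = frame[numeric_labels == 1]

print(len(mixed_hits), len(clean_hits))
```

If the real label column turns out to have object dtype (for example, `dataset[0].dtype` reports `object`), converting it with `pd.to_numeric` before building the mask would be one thing to try.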
