Find common values in multiple files in awk - linux

I would like to find the common values across multiple files, and the corresponding counts of appearance, using awk. I have, say, four files: input1, input2, input3, input4:
input1:  input2:  input3:  input4:
AA       AB       AA       AC
AB       AC       AC       AF
AC       AF       AF       AD
AD       AG       AH       AH
AF       AH       AK       AK
AI
I would like the answer to be:
Variable: Count
AA 2
AB 2
AC 4
AD 2
AF 4
AH 3
AK 2
AI 1
Any comments, please!

awk '{a[$0]++}END{for(x in a)print x,a[x]}' input*
With your inputs, the output would be:
AA 2
AB 2
AC 4
AD 2
AF 4
AG 1
AH 3
AI 1
AK 2
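For readability, the same one-liner can also be written as a standalone script (a sketch; note that for (x in a) iterates in an unspecified order, so pipe through sort if you want the sorted listing shown above):

# count.awk -- expanded version of the one-liner above
{ count[$0]++ }                  # tally every whole line across all input files
END {
    for (value in count)         # iteration order is unspecified in awk
        print value, count[value]
}

Run it as: awk -f count.awk input* | sort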

Related

Fill 0s with mean values of previous and next in a Pandas DataFrame

Fill the 0s with the following conditions:
If there is a single 0, take a simple average of the values before and after the 0. In this scenario the 0 is replaced by the mean of A and B.
If there are two consecutive 0s, fill the first missing value with the data from the previous period and take a simple average of the values before and after the second 0. The first 0 is replaced by A and the second by the mean of A and B.
If there are three consecutive 0s, replace the first and second 0 with A and the third with the mean of A and B.
The ID (ticker) is an identifier and is common for every block (it can be ignored). The entire table is 1000 rows long, and in no case do consecutive 0s exceed 3. I am unable to manage scenarios 2 and 3.
ID  asset
AA  34861000
AA  1607498
AA  0
AA  3530000000
AA  3333000000
AA  3179000000
AA  4053000000
AA  4520000000
AB  15250209
AB  0
AB  14691049
AB  0
AB  5044421
CC  5609212
CC  0
CC  0
CC  3673639
CC  132484747
CC  0
CC  0
CC  0
CC  141652646
You can use interpolate per group, on the reversed Series, with a limit of 1:
df['asset'] = (df
    .assign(asset=df['asset'].replace(0, float('nan')))            # treat 0 as missing
    .groupby('ID')['asset']                                        # one series per ID
    .transform(lambda s: s[::-1].interpolate(limit=1).bfill())     # reverse, fill at most one NaN per gap, back-fill the rest
)
output:
ID asset
0 AA 3.486100e+07
1 AA 1.607498e+06
2 AA 1.765804e+09
3 AA 3.530000e+09
4 AA 3.333000e+09
5 AA 3.179000e+09
6 AA 4.053000e+09
7 AA 4.520000e+09
8 AB 1.525021e+07
9 AB 1.497063e+07
10 AB 1.469105e+07
11 AB 9.867735e+06
12 AB 5.044421e+06
13 CC 5.609212e+06
14 CC 5.609212e+06
15 CC 4.318830e+06
16 CC 3.673639e+06
17 CC 1.324847e+08 # X
18 CC 1.324847e+08 # filled X
19 CC 1.324847e+08 # filled X
20 CC 1.393607e+08 # (X+Y)/2
21 CC 1.416526e+08 # Y
OK, compiling the answer here with help from @jezrael and @mozway (m and mask are not defined in this snippet; a self-contained sketch follows below):
df['asset'] = df['asset'].replace(0, float('nan'))
df.loc[mask, 'asset'] = df.loc[mask | ~m, 'asset'].groupby(df['ID']).transform(lambda x: x.interpolate())
df = df.ffill()
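For completeness, here is a self-contained sketch of that compiled approach. The definitions of m and mask are my reconstruction, not taken from the post: m flags the rows that were 0, and mask flags the last 0 of each consecutive run.

import pandas as pd

# Small reproduction using the CC block from the question.
df = pd.DataFrame({
    'ID':    ['CC'] * 9,
    'asset': [5609212, 0, 0, 3673639, 132484747, 0, 0, 0, 141652646],
})

df['asset'] = df['asset'].replace(0, float('nan'))
m = df['asset'].isna()                        # rows that were 0
mask = m & ~m.shift(-1, fill_value=False)     # last NaN of each consecutive run

# Interpolate only the last NaN of each run, per ID, so it becomes the mean
# of the surrounding non-missing values.
df.loc[mask, 'asset'] = (
    df.loc[mask | ~m, 'asset']
      .groupby(df['ID'])
      .transform(lambda x: x.interpolate())
)

# The earlier NaNs of a run simply take the previous value.
df['asset'] = df.groupby('ID')['asset'].ffill()
print(df)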

Compare 2 sets of 2 columns in Excel with a lookup

Hi all, I need help with the following formula. I have looked up ways to compare different datasets in Excel, but this particular case is a little different from the examples I've seen. Say I have the following data set:
A   B   C   D   E     F
AB  75  AB  75  Bob
AC  56  AC  68  Fre
AB  75  AB  75  Jill
I need a formula that compares columns A:B with C:D and prints out E where F is.
For example, the result would look like this: since A:B and C:D are equal, the name is printed.
A   B   C   D   E     F
AB  75  AB  75  Bob   Bob, Jill
AC  56  AC  68  Fre   Fre
AB  75  AB  75  Jill
Give the formula below a try:
=TEXTJOIN(", ",TRUE,FILTER($E$1:$E$3,MMULT(($A$1:$B$3=A1:B1)*($C$1:$D$3=C1:D1),TRANSPOSE({1,1}))))
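For the first row, the pieces of that formula evaluate like this (my own working through the arrays, for illustration):

($A$1:$B$3=A1:B1)*($C$1:$D$3=C1:D1)   ->  {1,1; 0,0; 1,1}
MMULT({1,1;0,0;1,1}, TRANSPOSE({1,1}))  ->  {2; 0; 2}
FILTER($E$1:$E$3, {2;0;2})              ->  {"Bob"; "Jill"}   (any non-zero counts as TRUE)
TEXTJOIN(", ", TRUE, {"Bob";"Jill"})    ->  "Bob, Jill"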

The `column` utility without the -t option

After reading the man pages for column and trying a few examples, I wonder: what does this command do when it is not supplied the -t option?
It takes lines and puts them in separate cells of a table, filling columns before rows:
$ seq 40 | column
1 4 7 10 13 16 19 22 25 28 31 34 37 40
2 5 8 11 14 17 20 23 26 29 32 35 38
3 6 9 12 15 18 21 24 27 30 33 36 39
It's similar to ls output, except that the separator between columns is a tab, while ls uses spaces:
$ ls
a ab ad af ah aj al an ap ar at av ax az c e g i k m o q s u w y
aa ac ae ag ai ak am ao aq as au aw ay b d f h j l n p r t v x z
$ printf "%s\n" * | column
a ac af ai al ao ar au ax b e h k n q t w z
aa ad ag aj am ap as av ay c f i l o r u x
ab ae ah ak an aq at aw az d g j m p s v y
If you have some newline-separated data that you want to represent in a condensed form in a nicely indented table, column is the way to go.
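Two quick checks, in case you want to see the fill order and the tab separator for yourself (a sketch; the exact layout depends on your terminal width, and the options assume util-linux column and GNU cat):

$ seq 40 | column -x          # --fillrows: 1 2 3 ... run across each row instead of down each column
$ seq 40 | column | cat -A    # cat -A makes the tab separators visible as ^I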

Combining two pandas dataframes/tables of unequal lengths

I have two data frames of different lengths:
df1:                     df2:
  Column1 Column2          ColumnA ColumnG ColumnSG
0 ab1     bc1              ab1     A       AA
1 ab2     ab5              bc1     B       BB
2 ab3     bc4              ab3     C       CC
3 ab4     ab5              ab1     D       DD
4 ab5     ab1              ab5     E       EE
                           bc4     F       FF
                           ab2     G       GG
                           ab4     H       HH
I would like an output that looks something like this:
df1:
OUTPUT
What I tried so far:
for row in df1, df2:
    if df1['Column1'] == df2['ColumnA']:
        df1['ColumnG1'] = df2['ColumnG']
        df1['ColumnSG1'] = df2['ColumnSG']
But, this gave me an error saying:
ValueError: Can only compare identically-labeled Series objects
How can I solve this?
You can merge twice:
(df1.merge(df2.rename(columns={'ColumnA': 'Column1'}), on='Column1', how='left')
    .merge(df2.rename(columns={'ColumnA': 'Column2'}), on='Column2', how='left',
           suffixes=['1', '2'])
)
Output:
Column1 Column2 ColumnG1 ColumnSG1 ColumnG2 ColumnSG2
0 ab1 bc1 A AA B BB
1 ab1 bc1 D DD B BB
2 ab2 ab5 G GG E EE
3 ab3 bc4 C CC F FF
4 ab4 ab5 H HH E EE
5 ab5 ab1 E EE A AA
6 ab5 ab1 E EE D DD
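Note that the output has seven rows rather than five because ab1 appears twice in ColumnA of df2 (once with G/SG of A/AA and once with D/DD), so the left merge duplicates the matching df1 rows. If only one match per key is wanted, one option (an assumption about the desired behaviour, not something stated in the question) is to drop duplicate keys from df2 before merging:

df2_unique = df2.drop_duplicates(subset='ColumnA', keep='first')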

How to count all combinations of 1 column against multiple columns

I have CSV exports of data for individual pieces in a collection that each have two initial columns: A & B.
Column A has topics and column B has 0+ tags for those topics separated by commas.
There exists a master list of all possible singular combinations of topics/column A and tags/column B, but each CSV export may contain one, many, or none of any particular combination, and there can be many total combinations.
In the actual master list, there are about 20 topics and anywhere from 2 to maybe 50 tags per topic.
Master list example:
Topics/A Tags/B
AA XX
AA XY
AA XZ
AB VV
AB VW
AB VX
AB VY
AB VZ
AC YY
AC YZ
Individual piece CSV example:
Topics/A Tags/B
AA
AA XZ
AA XZ, XX
AA XZ, XX
AB VV, VY
AB VY
AB VX, VV, VZ
AB VY
AB VZ, VW, VV, VY, VX
AC YY
AC YY
AC YY
AC YY, YZ
I want the final result to be a count of all combinations of topics/A and tags/B.
Final result for individual piece example from above (option 1):
Topics/A Tags/B Count
AA none 1
AA XX 2
AA XY 0
AA XZ 3
AB none 0
AB VV 3
AB VW 1
AB VX 2
AB VY 4
AB VZ 2
AC none 0
AC YY 4
AC YZ 1
Final result for individual piece example from above (option 2a, as seen below, or option 2b with columns and rows swapped):
AA (none) XX XY XZ AB (none) VV VW VX VY VZ AC (none) YY YZ
AA 1 2 0 3
AB 0 3 1 2 4 2
AC 0 4 1
I'm assuming I have to separate out the tags/B column so that it's something like:
AA
AA XZ
AA XZ XX
AA XZ XX
AB VV VY
AB VY
AB VX VV VZ
AB VY
AB VZ VW VV VY VX
AC YY
AC YY
AC YY
AC YY YZ
But, after this, I'm pretty stuck on what to do.
I tried looking up how to unpivot the above, but I would like some sort of formula or method that would be universal, since the number of tags applied to a topic in any single instance is generally unknown.
I've seen formulas that "flatten" the data, but I think they're designed for a fixed number of possible "tag" columns, which doesn't work for me.
I don't want to resort to manually entering formulas to flatten out the delimited matrix like:
Topics (flattened) Tags (flattened)
=A2 =B2
=A2 =C2
=A2 =D2
=A2 =E2
=A2 =F2
=A3 =B3
=A3 =C3
=A3 =D3
=A3 =E3
=A3 =F3
=A4 =B4
... ...
Please help.
Thank you.
Here's the formula method. There's one part you might need a VBA script for - let me know.
Replace all blanks in your CSV with "none" (highlight the range > Ctrl+H > replace nothing with "none").
Add "none" rows to your master list (let me know if you need me to write a macro that does this).
Use this formula, entered with Ctrl+Shift+Enter:
=SUM((A2=$A$17:$A$29)*(ISNUMBER(FIND(B2,$B$17:$B$29))))
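If you are open to doing the tally outside the spreadsheet, pandas can produce the same counts. This is an alternative sketch, not part of the formula method above; the file name piece.csv and the column names Topic and Tags are assumptions about the export:

import pandas as pd

# Alternative pandas sketch (assumed file name and column names).
df = pd.read_csv('piece.csv')                        # columns: Topic, Tags
df['Tags'] = df['Tags'].fillna('none').str.split(',')  # blank tags become 'none'
counts = (
    df.explode('Tags')                               # one row per topic/tag pair
      .assign(Tags=lambda d: d['Tags'].str.strip())
      .groupby(['Topic', 'Tags'])
      .size()
      .reset_index(name='Count')
)
print(counts)

Left-merging the master list (with its "none" rows) against counts and filling missing counts with 0 would then give the zero rows shown in option 1.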
