I have CSV exports of data for individual pieces in a collection that each have two initial columns: A & B.
Column A has topics and column B has 0+ tags for those topics separated by commas.
There is a master list of every possible single combination of topic (column A) and tag (column B), but each CSV export may contain one, several, or none of any particular combination, and there can be many combinations in total.
In the actual master list, there are about 20 topics and anywhere from 2 to maybe 50 tags per topic.
Master list example:
Topics/A Tags/B
AA XX
AA XY
AA XZ
AB VV
AB VW
AB VX
AB VY
AB VZ
AC YY
AC YZ
Individual piece CSV example:
Topics/A Tags/B
AA
AA XZ
AA XZ, XX
AA XZ, XX
AB VV, VY
AB VY
AB VX, VV, VZ
AB VY
AB VZ, VW, VV, VY, VX
AC YY
AC YY
AC YY
AC YY, YZ
I want the final result to be a count of all combinations of topics/A and tags/B.
Final result for individual piece example from above (option 1):
Topics/A Tags/B Count
AA none 1
AA XX 2
AA XY 0
AA XZ 3
AB none 0
AB VV 3
AB VW 1
AB VX 2
AB VY 4
AB VZ 2
AC none 0
AC YY 4
AC YZ 1
Final result for individual piece example from above (option 2a, as seen below, or option 2b with columns and rows swapped):
AA (none) XX XY XZ AB (none) VV VW VX VY VZ AC (none) YY YZ
AA 1 2 0 3
AB 0 3 1 2 4 2
AC 0 4 1
I'm assuming I have to separate out the tags/B column so that it's something like:
AA
AA XZ
AA XZ XX
AA XZ XX
AB VV VY
AB VY
AB VX VV VZ
AB VY
AB VZ VW VV VY VX
AC YY
AC YY
AC YY
AC YY YZ
But, after this, I'm pretty stuck on what to do.
I tried looking up how to unpivot the above, but I would like some sort of formula or method that works universally when the number of tags applied to a topic in any single row is generally unknown.
I've seen formulas that "flatten" the data, but I think they're designed for a fixed number of possible "tag" columns, which doesn't work for me.
I don't want to resort to manually entering formulas to flatten out the delimited matrix like:
Topics (flattened) Tags (flattened)
=A2 =B2
=A2 =C2
=A2 =D2
=A2 =E2
=A2 =F2
=A3 =B3
=A3 =C3
=A3 =D3
=A3 =E3
=A3 =F3
=A4 =B4
... ...
Please help.
Thank you.
Here's the formula method. There's one part you might need a VBA script for - let me know.
Replace all blanks in your CSV with "none" (select the tag column, press Ctrl+H, and replace empty cells with none)
Add "none" rows to your master list (let me know if you need me to write a macro that does this)
Enter this as an array formula (Ctrl+Shift+Enter) next to the first master-list row and fill it down. It assumes the master list starts in A2:B2 and the piece data sits in A17:B29:
=SUM((A2=$A$17:$A$29)*(ISNUMBER(FIND(B2,$B$17:$B$29))))
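If a script is acceptable instead of a worksheet formula, the same count can be sketched in pandas. This is only an illustration under assumptions: the column names topic, tags and tag, and the helper count_combinations, are made up here, and the sample frames simply retype the master list and piece example from the question. Unlike FIND, which matches substrings, this splits the tag cell and compares whole tags.

```python
import pandas as pd

# Master list and one piece, retyped from the question (column names are assumptions).
master = pd.DataFrame({
    "topic": ["AA"] * 3 + ["AB"] * 5 + ["AC"] * 2,
    "tag": ["XX", "XY", "XZ", "VV", "VW", "VX", "VY", "VZ", "YY", "YZ"],
})
piece = pd.DataFrame({
    "topic": ["AA"] * 4 + ["AB"] * 5 + ["AC"] * 4,
    "tags": ["", "XZ", "XZ, XX", "XZ, XX",
             "VV, VY", "VY", "VX, VV, VZ", "VY", "VZ, VW, VV, VY, VX",
             "YY", "YY", "YY", "YY, YZ"],
})


def count_combinations(piece, master):
    """Count every (topic, tag) pair, including pairs that never occur."""
    # Split the comma-separated cell into a list, then emit one row per tag.
    long = piece.assign(tag=piece["tags"].fillna("").str.split(",")).explode("tag")
    # Trim spaces left by ", " and label empty cells as "none".
    long["tag"] = long["tag"].str.strip().replace("", "none")
    counts = long.groupby(["topic", "tag"]).size()
    # All pairs to report: one "none" row per topic plus the master list.
    none_rows = pd.DataFrame({"topic": master["topic"].unique(), "tag": "none"})
    combos = pd.concat([none_rows, master[["topic", "tag"]]], ignore_index=True)
    index = pd.MultiIndex.from_frame(combos)
    return counts.reindex(index, fill_value=0).rename("count").reset_index()


result = count_combinations(piece, master)
print(result)
```

Because explode produces one row per tag no matter how many tags a cell holds, no fixed number of helper columns is needed.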
Hi all, I need help with the following formula. I have looked up ways to compare different datasets in Excel, but this particular case is a little different from the examples I've seen. Say I have the following data set:
A   B   C   D   E     F
AB  75  AB  75  Bob
AC  56  AC  68  Fre
AB  75  AB  75  Jill
I need a formula that compares (AB with CD) and prints E in column F.
For example, the result for the data above would look like this, since AB & CD are equal, so print the name:
A   B   C   D   E     F
AB  75  AB  75  Bob   Bob, Jill
AC  56  AC  68  Fre   Fre
AB  75  AB  75  Jill
Give the formula below a try.
=TEXTJOIN(", ",TRUE,FILTER($E$1:$E$3,MMULT(($A$1:$B$3=A1:B1)*($C$1:$D$3=C1:D1),TRANSPOSE({1,1}))))
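For comparison, a rough pandas sketch of the same idea follows. It assumes the intent is to join the names of all rows whose A to D values are identical, which is stricter than the MMULT test in the formula; the frame simply retypes the sample data, and every matching row gets the joined names.

```python
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    "A": ["AB", "AC", "AB"],
    "B": [75, 56, 75],
    "C": ["AB", "AC", "AB"],
    "D": [75, 68, 75],
    "E": ["Bob", "Fre", "Jill"],
})

# For each group of rows sharing the same A-D values, join the names in E
# and broadcast the joined string back onto every row of that group.
df["F"] = df.groupby(["A", "B", "C", "D"])["E"].transform(", ".join)
print(df)
```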
I have two data frames of different lengths:
df1: df2:
Column1 Column2 ColumnA ColumnG ColumnSG
0 ab1 bc1 ab1 A AA
1 ab2 ab5 bc1 B BB
2 ab3 bc4 ab3 C CC
3 ab4 ab5 ab1 D DD
4 ab5 ab1 ab5 E EE
bc4 F FF
ab2 G GG
ab4 H HH
I would like an output that looks something like this:
df1:
OUTPUT
What I tried so far?
for row in df1, df2:
if df1[Column1] == df2[ColumnA]:
df1[ColumnG1] = df2[ColumnG]
df1[ColumnSG1] = df2[ColumnSG]
But, this gave me an error saying:
ValueError: Can only compare identically-labeled Series objects
How can I solve this?
IIUC, you can merge twice, once on each column:
(df1.merge(df2.rename(columns={"ColumnA":'Column1'}), on='Column1',how='left')
.merge(df2.rename(columns={'ColumnA':'Column2'}), on='Column2',how='left',
suffixes=['1','2'])
)
Output:
Column1 Column2 ColumnG1 ColumnSG1 ColumnG2 ColumnSG2
0 ab1 bc1 A AA B BB
1 ab1 bc1 D DD B BB
2 ab2 ab5 G GG E EE
3 ab3 bc4 C CC F FF
4 ab4 ab5 H HH E EE
5 ab5 ab1 E EE A AA
6 ab5 ab1 E EE D DD
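To try the double merge without setting anything up, here is a self-contained sketch that retypes the two frames from the question. The helper name lookup is made up for illustration. Note that ab1 occurs twice in ColumnA, which is why a left merge returns more rows than df1 has.

```python
import pandas as pd

# Frames retyped from the question.
df1 = pd.DataFrame({
    "Column1": ["ab1", "ab2", "ab3", "ab4", "ab5"],
    "Column2": ["bc1", "ab5", "bc4", "ab5", "ab1"],
})
df2 = pd.DataFrame({
    "ColumnA": ["ab1", "bc1", "ab3", "ab1", "ab5", "bc4", "ab2", "ab4"],
    "ColumnG": list("ABCDEFGH"),
    "ColumnSG": ["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH"],
})


def lookup(key):
    """Return df2 with its key column renamed so it can be merged on `key`."""
    return df2.rename(columns={"ColumnA": key})


out = (
    df1.merge(lookup("Column1"), on="Column1", how="left")
       # The second merge collides on ColumnG/ColumnSG, so suffixes tell them apart.
       .merge(lookup("Column2"), on="Column2", how="left", suffixes=("1", "2"))
)
print(out)
```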
I have dataframe that looks like this:
title answer
0 aa zz
1 bb xx
2 cc yy
3 dd ll
I want to transpose the frame so that the title values become the column headers, keep only two of the rows (selected by index position), and reset the index, like this:
bb cc
0 xx yy
How do I do this?
I tried Transpose:
df[['title','answer']].T
but that puts integer values in the column headers, and I'm not sure how to select by index number.
IIUC, we can set title as the index and then transpose. You get integers because you are transposing the default integer index:
df_new = df.set_index('title').T
print(df_new)
title aa bb cc dd
answer zz xx yy ll
if you want to get rid of the index as well :
df_new = df.set_index('title').T.reset_index(drop=True)
df_new.columns.name = ''
print(df_new)
aa bb cc dd
0 zz xx yy ll
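The question also asks for only the bb and cc columns, selected by index number. One way to sketch that, assuming the two rows wanted are at positions 1 and 2, is to slice with iloc before transposing:

```python
import pandas as pd

# Frame retyped from the question.
df = pd.DataFrame({
    "title": ["aa", "bb", "cc", "dd"],
    "answer": ["zz", "xx", "yy", "ll"],
})

# Keep rows at positions 1 and 2, turn titles into headers, drop the old index.
subset = df.iloc[1:3].set_index("title").T.reset_index(drop=True)
subset.columns.name = None  # remove the leftover "title" label on the columns
print(subset)
```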
We have below pandas dataframe
and we need to convert into below dataframe
When using the pd.wide_to_long command we get the error below:
ValueError: stubname can't be identical to a column name
This is the command being used:
pd.wide_to_long(df,['Org','City'],i=['First Name','Middle Name','Last Name','Years'],j='drop').reset_index(level=[0,1])
Your solution works for me; I only added the stubnames parameter name. It may be a bug in an older pandas version, so you can try upgrading pandas to the latest version:
df = pd.wide_to_long(df, stubnames=['Org','City'],
i=['First Name','Middle Name','Last Name','Years'],
j='drop').reset_index().drop(columns='drop')
print (df)
First Name Middle Name Last Name Years Org City
0 aa cc dd 2019 v n
1 aa cc dd 2019 m m
2 aa cc dd 2019 d n
3 aa cc dd 2019 p j
4 zz yy xx 2018 p n
5 zz yy xx 2018 q n
6 zz yy xx 2018 i d
7 zz yy xx 2018 NaN NaN
EDIT: If duplicates are possible in the data, create a default index with reset_index and add the index column to the i variables:
print (df)
First Name Middle Name Last Name Years Org0 Org1 Org2 Org3 City0 City1 \
0 aa cc dd 2019 v m d p n m
1 zz yy xx 2018 p q i NaN n n
City2 City3
0 n j
1 d NaN
df = pd.wide_to_long(df.reset_index(), stubnames=['Org','City'],
i=['index','First Name','Middle Name','Last Name','Years'],
j='drop').reset_index().drop(columns=['drop', 'index'])
print (df)
First Name Middle Name Last Name Years Org City
0 aa cc dd 2019 v n
1 aa cc dd 2019 m m
2 aa cc dd 2019 d n
3 aa cc dd 2019 p j
4 zz yy xx 2018 p n
5 zz yy xx 2018 q n
6 zz yy xx 2018 i d
7 zz yy xx 2018 NaN NaN
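A self-contained sketch of the above follows, with the wide frame retyped from the printed example. It also shows one way the reported ValueError can arise: pandas raises it when a column is named exactly like a stub (here a hypothetical frame where Org0 was renamed to Org). Whether that was the cause in the original data is an assumption, since the original frame is not shown. The names ID_COLS and to_long are made up for illustration.

```python
import numpy as np
import pandas as pd

ID_COLS = ["First Name", "Middle Name", "Last Name", "Years"]

# Wide frame retyped from the printed example above.
wide = pd.DataFrame({
    "First Name": ["aa", "zz"], "Middle Name": ["cc", "yy"],
    "Last Name": ["dd", "xx"], "Years": [2019, 2018],
    "Org0": ["v", "p"], "Org1": ["m", "q"], "Org2": ["d", "i"], "Org3": ["p", np.nan],
    "City0": ["n", "n"], "City1": ["m", "n"], "City2": ["n", "d"], "City3": ["j", np.nan],
})


def to_long(frame):
    """Reshape Org0..OrgN / City0..CityN into single Org and City columns."""
    return (
        pd.wide_to_long(frame.reset_index(), stubnames=["Org", "City"],
                        i=["index"] + ID_COLS, j="drop")
        .reset_index()
        .drop(columns=["drop", "index"])
    )


long_df = to_long(wide)
print(long_df)

# Hypothetical clash: a column named exactly "Org" collides with the stub name.
clashing = wide.rename(columns={"Org0": "Org"})
message = None
try:
    to_long(clashing)
except ValueError as err:
    message = str(err)
print(message)
```

If the real frame does contain a column named Org or City, renaming it (for example to Org0) before reshaping should avoid the clash.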
I would like to find the common values across multiple files, and how many times each one appears, using awk. I have, say, four files: input1, input2, input3, input4:
input1: input2: input3: input4
AA AB AA AC
AB AC AC AF
AC AF AF AD
AD AG AH AH
AF AH AK AK
AI
I would like the answer to be:
Variable: Count
AA 2
AB 2
AC 4
AD 2
AF 4
AH 3
AK 2
AI 1
Any comments, please !!
awk '{a[$0]++}END{for(x in a)print x,a[x]}' input*
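With your inputs, the output would be (the order of a for-in loop over an awk array is unspecified, so the lines may come out in a different order):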
AA 2
AB 2
AC 4
AD 2
AF 4
AG 1
AH 3
AI 1
AK 2
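For readers who would rather script this than use awk, a rough Python equivalent of the one-liner is sketched below. The function names count_values and count_files are made up for illustration. One difference to be aware of: this version strips surrounding whitespace and skips blank lines, whereas the awk command counts every line exactly as it appears.

```python
from collections import Counter


def count_values(lines):
    """Count non-empty lines after stripping surrounding whitespace."""
    stripped = (line.strip() for line in lines)
    return Counter(value for value in stripped if value)


def count_files(paths):
    """Accumulate line counts across several files."""
    counts = Counter()
    for path in paths:
        with open(path) as handle:
            counts.update(count_values(handle))
    return counts
```

Calling count_files on the paths matched by input* (for example via pathlib.Path(".").glob("input*")) and printing the sorted items gives one value and its count per line.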