Combining two pandas dataframes/tables of unequal lengths - python-3.x

I have two data frames of different lengths:
df1:
  Column1 Column2
0     ab1     bc1
1     ab2     ab5
2     ab3     bc4
3     ab4     ab5
4     ab5     ab1
df2:
ColumnA ColumnG ColumnSG
    ab1       A       AA
    bc1       B       BB
    ab3       C       CC
    ab1       D       DD
    ab5       E       EE
    bc4       F       FF
    ab2       G       GG
    ab4       H       HH
I would like an output that looks something like this:
df1:
OUTPUT
What have I tried so far?
for row in df1, df2:
    if df1['Column1'] == df2['ColumnA']:
        df1['ColumnG1'] = df2['ColumnG']
        df1['ColumnSG1'] = df2['ColumnSG']
But, this gave me an error saying:
ValueError: Can only compare identically-labeled Series objects
How can I solve this?

One way is to merge twice, once for each column of df1:
(df1.merge(df2.rename(columns={'ColumnA': 'Column1'}), on='Column1', how='left')
    .merge(df2.rename(columns={'ColumnA': 'Column2'}), on='Column2', how='left',
           suffixes=('1', '2'))
)
Output:
  Column1 Column2 ColumnG1 ColumnSG1 ColumnG2 ColumnSG2
0     ab1     bc1        A        AA        B        BB
1     ab1     bc1        D        DD        B        BB
2     ab2     ab5        G        GG        E        EE
3     ab3     bc4        C        CC        F        FF
4     ab4     ab5        H        HH        E        EE
5     ab5     ab1        E        EE        A        AA
6     ab5     ab1        E        EE        D        DD
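For anyone who wants to reproduce this, here is a self-contained sketch of the double merge. The frames are rebuilt from the tables in the question. Because ab1 appears twice in df2, every df1 row that refers to ab1 is duplicated, which is why the result has 7 rows rather than 5.

```python
import pandas as pd

# Frames rebuilt from the question's tables.
df1 = pd.DataFrame({
    "Column1": ["ab1", "ab2", "ab3", "ab4", "ab5"],
    "Column2": ["bc1", "ab5", "bc4", "ab5", "ab1"],
})
df2 = pd.DataFrame({
    "ColumnA": ["ab1", "bc1", "ab3", "ab1", "ab5", "bc4", "ab2", "ab4"],
    "ColumnG": list("ABCDEFGH"),
    "ColumnSG": ["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH"],
})

out = (
    # First lookup: match df1.Column1 against df2.ColumnA.
    df1.merge(df2.rename(columns={"ColumnA": "Column1"}), on="Column1", how="left")
    # Second lookup: match df1.Column2 against df2.ColumnA.
    # The clashing ColumnG/ColumnSG names get the suffixes 1 and 2.
    .merge(df2.rename(columns={"ColumnA": "Column2"}), on="Column2", how="left",
           suffixes=("1", "2"))
)
print(out)
```

If you need exactly one row per df1 row, you would have to decide first which of the duplicate df2 keys to keep, for example with df2.drop_duplicates('ColumnA').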

Related

Index Match 2 columns and find lowest Value

I'm struggling to get a formula to work for the following; any help appreciated.
E.g. AB DD matches with DD AB. The lowest value of the matched pair is DD AB at 1.29, so that row gets 1 in Column 4.
The Pairs are always opposing.
| Column 1 | Column 2 | Column 3 | Column 4 |
| AB       | DD       | 2.34     | 0        |
| XC       | TT       | 0.34     | 1        |
| ST       | HU       | 3.57     | 0        |
| DD       | AB       | 1.29     | 1        |
| TT       | XC       | 1.01     | 0        |
Assuming data in A1:C5, in D1:
=N(C1=MIN(MINIFS(C$1:C$5,A$1:A$5,CHOOSE({1,2},A1,B1),B$1:B$5,CHOOSE({1,2},B1,A1))))
and copied down as required.
Note that this also assumes that a 'pair' exists for every entry.
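If the data lives in pandas rather than a worksheet, the same logic can be sketched roughly as below. This is only an illustration: the column names c1, c2, value and flag are placeholders of mine, and I am assuming that a row without an opposing partner (ST HU in the example) should get 0, as in the expected Column 4.

```python
import pandas as pd

# Sample data from the question; column names are placeholders.
df = pd.DataFrame({
    "c1": ["AB", "XC", "ST", "DD", "TT"],
    "c2": ["DD", "TT", "HU", "AB", "XC"],
    "value": [2.34, 0.34, 3.57, 1.29, 1.01],
})

# Order-independent key, so that AB/DD and DD/AB land in the same group.
key = pd.Series(
    ["|".join(sorted(pair)) for pair in zip(df["c1"], df["c2"])],
    index=df.index,
)
grouped = df.groupby(key)["value"]

# Flag a row when it holds the group minimum and the group really is a pair.
is_min = df["value"] == grouped.transform("min")
has_partner = grouped.transform("count") > 1
df["flag"] = (is_min & has_partner).astype(int)
print(df)
```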

How to recalculate DataFrame column values based on condition dict (Pandas Python)

Let's say I have the following DataFrame:
    A      B
0  aa   4.32
1  aa   7.00
2  bb   8.00
3  dd  74.00
4  cc  30.00
5  bb   2.00
And let's say I have the following dict, whose keys give the condition for column A and whose values give the multiplier for column B:
dict1={'aa':-1, 'bb':2}
All I want is to multiply the values in column B by the values from dict1 wherever the value in column A equals a dict1 key.
So the output should be:
    A      B
0  aa  -4.32
1  aa  -7.00
2  bb  16.00
3  dd  74.00
4  cc  30.00
5  bb   4.00
Thanks
Use pd.Series.map:
print (df["A"].map(dict1).fillna(1)*df["B"])
0    -4.32
1    -7.00
2    16.00
3    74.00
4    30.00
5     4.00
dtype: float64
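The line above only prints the product. A minimal sketch that writes the result back into column B, using the data from the question, could look like this. Keys of column A that are missing from dict1 map to NaN, so fillna(1) leaves those rows unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["aa", "aa", "bb", "dd", "cc", "bb"],
    "B": [4.32, 7.00, 8.00, 74.00, 30.00, 2.00],
})
dict1 = {"aa": -1, "bb": 2}

# Look up a multiplier per row; rows whose key is not in dict1 get 1.
multiplier = df["A"].map(dict1).fillna(1)
df["B"] = df["B"] * multiplier
print(df)
```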

How to count all combinations of 1 column against multiple columns

I have CSV exports of data for individual pieces in a collection that each have two initial columns: A & B.
Column A has topics and column B has 0+ tags for those topics separated by commas.
There exists a master list of all possible singular combinations of topics/column A and tags/column B, but each CSV export may contain one, more, or none of any particular combination, and there can be many total combinations.
In the actual master list, there are about 20 topics and anywhere from 2 to maybe 50 tags per topic.
Master list example:
Topics/A Tags/B
AA XX
AA XY
AA XZ
AB VV
AB VW
AB VX
AB VY
AB VZ
AC YY
AC YZ
Individual piece CSV example:
Topics/A Tags/B
AA
AA XZ
AA XZ, XX
AA XZ, XX
AB VV, VY
AB VY
AB VX, VV, VZ
AB VY
AB VZ, VW, VV, VY, VX
AC YY
AC YY
AC YY
AC YY, YZ
I want the final result to be a count of all combinations of topics/A and tags/B.
Final result for individual piece example from above (option 1):
Topics/A Tags/B Count
AA none 1
AA XX 2
AA XY 0
AA XZ 3
AB none 0
AB VV 3
AB VW 1
AB VX 2
AB VY 4
AB VZ 2
AC none 0
AC YY 4
AC YZ 1
Final result for individual piece example from above (option 2a, as seen below, or option 2b with columns and rows swapped):
    AA (none)  XX  XY  XZ  AB (none)  VV  VW  VX  VY  VZ  AC (none)  YY  YZ
AA  1          2   0   3
AB                         0          3   1   2   4   2
AC                                                        0          4   1
I'm assuming I have to separate out the tags/B column so that it's something like:
AA
AA XZ
AA XZ XX
AA XZ XX
AB VV VY
AB VY
AB VX VV VZ
AB VY
AB VZ VW VV VY VX
AC YY
AC YY
AC YY
AC YY YZ
But, after this, I'm pretty stuck on what to do.
I tried looking up how to unpivot the above, but I would like some sort of formula or method that stays universal when the number of tags applied to a topic in a single instance is generally unknown.
I've seen formulas that "flatten" the data, but I think they're designed for a fixed number of possible "tag" columns, which doesn't work for me.
I don't want to resort to manually entering formulas to flatten out the delimited matrix like:
Topics (flattened) Tags (flattened)
=A2 =B2
=A2 =C2
=A2 =D2
=A2 =E2
=A2 =F2
=A3 =B3
=A3 =C3
=A3 =D3
=A3 =E3
=A3 =F3
=A4 =B4
... ...
Please help.
Thank you.
Here's a formula method. There's one part you might need a VBA script for; let me know.
Replace all blanks in your CSV with "none" (select the range, press Ctrl+H, and replace nothing with none).
Add "none" rows to your master list (let me know if you need me to write a macro that does this).
Enter this formula as an array formula with Ctrl+Shift+Enter (it references A17:B29 for the piece data; adjust the ranges to your sheet):
=SUM((A2=$A$17:$A$29)*(ISNUMBER(FIND(B2,$B$17:$B$29))))
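If scripting is an option, the same count can be sketched in pandas without knowing the number of tags per row in advance. This is only a sketch under my own assumptions: the frames are typed in from the examples in the question, and the column names Topic, Tags, Tag and Count are placeholders. With a real export you would load the file with pd.read_csv instead.

```python
import pandas as pd

# Master list and one piece, copied from the examples in the question.
master = pd.DataFrame({
    "Topic": ["AA"] * 3 + ["AB"] * 5 + ["AC"] * 2,
    "Tag": ["XX", "XY", "XZ", "VV", "VW", "VX", "VY", "VZ", "YY", "YZ"],
})
piece = pd.DataFrame({
    "Topic": ["AA"] * 4 + ["AB"] * 5 + ["AC"] * 4,
    "Tags": ["", "XZ", "XZ, XX", "XZ, XX",
             "VV, VY", "VY", "VX, VV, VZ", "VY", "VZ, VW, VV, VY, VX",
             "YY", "YY", "YY", "YY, YZ"],
})

# One row per (topic, tag), however many tags a cell holds.
# fillna("") covers blank cells that read_csv would load as NaN.
long = (
    piece.assign(Tag=piece["Tags"].fillna("").str.split(","))
    .explode("Tag")
    .reset_index(drop=True)
)
long["Tag"] = long["Tag"].str.strip().replace("", "none")
counts = long.groupby(["Topic", "Tag"]).size()

# Every combination that should be reported, with a "none" row per topic.
none_rows = pd.DataFrame({"Topic": master["Topic"].unique(), "Tag": "none"})
grid = pd.concat([none_rows, master], ignore_index=True)
grid = grid.sort_values("Topic", kind="mergesort")  # stable: "none" stays first

# Combinations that never occur in the piece get a count of 0.
result = (
    counts.reindex(pd.MultiIndex.from_frame(grid), fill_value=0)
    .rename("Count")
    .reset_index()
)
print(result)
```

The result is in the long layout of option 1. Unlike the FIND-based formula, this compares whole tags rather than substrings.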

How to replace the pandas column value based on others dataframe columns

I have 2 pandas dataframes as below
df1:-
col1 col2 col3
aa b c
aa d c
bb d t
bb b g
cc e c
dd g c
and 2nd dataframe:-
col1 col2
aa b
cc e
bb d
I want to change the value of col3 in the first dataframe to 'cc' wherever col1 and col2 match a row of the second dataframe, like below:
col1 col2 col3
aa b cc
aa d c
bb d cc
bb b g
cc e cc
dd g c
In short, I want to map 2nd dataframe columns(col1,col2) with 1st dataframe of columns(col1,col2) and change the column(col3) of 1st dataframe where it matches.
Use DataFrame.merge with a left join and indicator=True to get a helper column, compare it with 'both' using Series.eq (the method form of ==), and finally set the values with DataFrame.loc:
m = df1.merge(df2, on=['col1','col2'],indicator=True, how='left')['_merge'].eq('both')
df1.loc[m, 'col3'] = 'cc'
print (df1)
col1 col2 col3
0 aa b cc
1 aa d c
2 bb d cc
3 bb b g
4 cc e cc
5 dd g c
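A self-contained version of this approach, with the frames rebuilt from the question, might look as follows. The drop_duplicates call and the conversion to a NumPy array are my own additions: the first keeps the merged frame the same length as df1 if df2 ever contains repeated pairs, and the second avoids index alignment problems if df1 does not have a default index.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": ["aa", "aa", "bb", "bb", "cc", "dd"],
    "col2": ["b", "d", "d", "b", "e", "g"],
    "col3": ["c", "c", "t", "g", "c", "c"],
})
df2 = pd.DataFrame({
    "col1": ["aa", "cc", "bb"],
    "col2": ["b", "e", "d"],
})

# Left join keeps one row per df1 row; _merge says where each row was found.
merged = df1.merge(df2.drop_duplicates(), on=["col1", "col2"],
                   how="left", indicator=True)

# True where the (col1, col2) pair exists in both frames.
mask = merged["_merge"].eq("both").to_numpy()
df1.loc[mask, "col3"] = "cc"
print(df1)
```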
You can also use pd.concat and drop_duplicates after assigning a value for 'col3' in df2 (note that the row order changes):
df = pd.concat([df2.assign(col3='cc'), df1]).drop_duplicates(['col1','col2']).reset_index(drop=True)
df
Output:
col1 col2 col3
0 aa b cc
1 cc e cc
2 bb d cc
3 aa d c
4 bb b g
5 dd g c

How to pass 2 or more array criteria in SUMIFS formula?

I have the table below
A B C
a aa 1
a aa 1
a dd 1
a aa 1
b aa 1
b bb 1
b aa 1
b bb 1
c cc 1
c bb 1
c bb 1
c cc 1
d cc 1
d aa 1
d bb 1
d cc 1
When I enter the formula
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"}))
it returns 12
However, when I enter
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,{"aa","bb"}))
it returns only 5
Can anyone help me with this? I don't want to use multiple formulas like
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,"aa")) + SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,"bb"))
I got the answer that I was expecting from another site.
Thanks http://www.excelforum.com/members/30486.html
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,{"aa";"bb"}))
I would use this one:
=SUMPRODUCT(ISNUMBER(MATCH(A1:A16,{"a","b","c"},0)*MATCH(B1:B16,{"aa","bb"},0))*(C1:C16))
or, if C1:C16 always contains only 1, simply:
=SUMPRODUCT(1*ISNUMBER(MATCH(A1:A16,{"a","b","c"},0)*MATCH(B1:B16,{"aa","bb"},0)))
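As far as I can tell, the original formula returns 5 because two horizontal array constants are paired element by element (a with aa, b with bb), whereas the semicolon in {"aa";"bb"} makes the second array vertical, so every combination is evaluated. The sketch below mimics both behaviours in pandas to make the difference visible. The helper name sumifs is my own and only imitates the worksheet function for this example.

```python
import pandas as pd

# Table from the question.
df = pd.DataFrame({
    "A": list("aaaabbbbccccdddd"),
    "B": ["aa", "aa", "dd", "aa", "aa", "bb", "aa", "bb",
          "cc", "bb", "bb", "cc", "cc", "aa", "bb", "cc"],
    "C": [1] * 16,
})
a_keys = ["a", "b", "c"]
b_keys = ["aa", "bb"]

def sumifs(a, b):
    """Sum C where A equals a and B equals b (hypothetical helper)."""
    return int(df.loc[(df["A"] == a) & (df["B"] == b), "C"].sum())

# Element-wise pairing: only (a, aa) and (b, bb) are evaluated.
pairwise = sum(sumifs(a, b) for a, b in zip(a_keys, b_keys))

# Full grid: every A key combined with every B key.
cross = sum(sumifs(a, b) for a in a_keys for b in b_keys)

# Equivalent single mask, similar in spirit to the MATCH-based formula.
mask_total = int(df.loc[df["A"].isin(a_keys) & df["B"].isin(b_keys), "C"].sum())

print(pairwise, cross, mask_total)
```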
