I have the pandas dataframe as below
A B C
Apple 20 A1
Apple 30 A2
Apple 40 A3
Kiwi 20 K1
Kiwi 30 K2
Kiwi 10 K3
I want the output as
A B C
Apple 20 A1
30 A2
40 A3
Kiwi 20 K1
30 K2
10 K3
Use Groupby.cumcount with df.loc and Series.ne:
In [938]: df.loc[df.groupby('A').cumcount().ne(0), 'A'] = ''
In [939]: df
Out[939]:
A B C
0 Apple 20 A1
1 30 A2
2 40 A3
3 Kiwi 20 K1
4 30 K2
5 10 K3
Related
have four columns:
Name A1
Count B1
Name C1
Count D1
Audi
4
MB
10
Bmw
7
Toyota
15
MB
15
Audi
12
Toyota
12
vW
3
VW
10
BMW
8
What formula can I use to subtract the number of D1 from the number of column B1 according to the name of C1?
example
Name A1
Count B1
Name C1
Count D1
Difference
Audi
4
MB
10
10 - 15 = -5
Bmw
7
Toyota
15
15 - 12 = 3
MB
15
Audi
12
12 - 4 = 8
Toyota
12
vW
3
3 - 10 = -7
VW
10
BMW
8
8 - 7 = 1
You can use this formula:
=D2-INDEX($B$2:$B$6,MATCH(C2,$A$2:$A$6,0))
Using SUMPRODUCT()
• Formula used in cell E2
=D2-SUMPRODUCT((C2=$A$2:$A$6)*($B$2:$B$6))
I've 2 related Data Frames, Is there any easy way to combine into multi-indexes dataframe?
import pandas as pd
df = pd.DataFrame( [[
1,0,1,0],
[1,1,0,0],
[1,0,0,1],
[0,1,0,1]], columns=["c1","c2","c3", "c4"]
)
idx= pd.Index(['p1','p2','p3','p4'])
df = df.set_index(idx)
df output is:
c1 c2 c3 c4
p1 1 0 1 0
p2 1 1 0 0
p3 1 0 0 1
p4 0 1 0 1
df2 = pd.DataFrame( [[
0,10,30,0],
[20,10,0,0],
[0,10,0,6],
[15,0,18,5]], columns=["c1","c2","c3", "c4"]
)
idx2= pd.Index(['a1','a2','a3','a4'])
df2 = df2.set_index(idx2)
df2 output is:
c1 c2 c3 c4
a1 0 10 30 0
a2 20 10 0 0
a3 0 10 0 6
a4 15 0 18 5
The final dataframe is multi-indexing (p,c,a) single column (value):
value
p1 c1 a2 20
a4 15
c3 a1 30
a4 18
p2 c1 a2 20
a4 15
c2 a1 10
a2 10
a3 10
p3 c1 a2 20
a4 15
c4 a3 6
a4 5
p4 c2 a1 10
a2 10
a3 10
c4 a3 6
a4 5
You can reshape an merge:
(df.reset_index().melt('index')
.loc[lambda x: x.pop('value').eq(1)]
.merge(df2.reset_index().melt('index').query('value != 0'),
on='variable')
.set_index(['index_x', 'variable', 'index_y'])
.rename_axis([None, None, None])
)
output:
value
p1 c1 a2 20
a4 15
p2 c1 a2 20
a4 15
p3 c1 a2 20
a4 15
p2 c2 a1 10
a2 10
a3 10
p4 c2 a1 10
a2 10
a3 10
p1 c3 a1 30
a4 18
p3 c4 a3 6
a4 5
p4 c4 a3 6
a4 5
If order matters:
(df.stack().reset_index()
.loc[lambda x: x.pop(0).eq(1)]
.set_axis(['index', 'variable'], axis=1)
.merge(df2.reset_index().melt('index').query('value != 0'),
on='variable', how='left')
.set_index(['index_x', 'variable', 'index_y'])
.rename_axis([None, None, None])
)
output:
value
p1 c1 a2 20
a4 15
c3 a1 30
a4 18
p2 c1 a2 20
a4 15
c2 a1 10
a2 10
a3 10
p3 c1 a2 20
a4 15
c4 a3 6
a4 5
p4 c2 a1 10
a2 10
a3 10
c4 a3 6
a4 5
With Pandas I'm trying to rename unnamed columns in dataframe with values on the first ligne of data.
My dataframe:
id
store
unnamed: 1
unnamed: 2
windows
unnamed: 3
unnamed: 4
0
B1
B2
B3
B1
B2
B3
1
2
c
12
15
15
14
2
4
d
35
14
14
87
My wanted result:
id
store_B1
store_B3
store_B2
windows_B1
windows_B2
windows_B3
0
B1
B2
B3
B1
B2
B3
1
2
c
12
15
15
14
2
4
d
35
14
14
87
I don't know how I can match the column name with the value in my data. Thanks for your help. Regards
You can use df.columns.where to make unnamed: columns NaN, then convert it to a Series and use ffill:
df.columns = pd.Series(df.columns.where(~df.columns.str.startswith('unnamed:'))).ffill() + np.where(~df.columns.isin(['id','col2']), ('_' + df.iloc[0].astype(str)).tolist(), '')
Output:
>>> df
id store_B1 store_B2 store_B3 windows_B1 windows_B2 windows_B3
0 0 B1 B2 B3 B1 B2 B3
1 1 2 c 12 15 15 14
2 2 4 d 35 14 14 87
I am attempting merge and fill in missing values in one data frame from another one. Hopefully this isn't too long of an explanation i have just been wracking my brain around this for too long. I am working with 2 huge CSV files so i made a small example here. I have included the entire code at the end in case you were curious to assist. THANK YOU SO MUCH IN ADVANCE. Here we go!
print(df1)
A B C D E
0 1 B1 D1 E1
1 C1 D1 E1
2 1 B1 D1 E1
3 2 B2 D2 E2
4 B2 C2 D2 E2
5 3 D3 E3
6 3 B3 C3 D3 E3
7 4 C4 D4 E4
print(df2)
A B C F G
0 1 C1 F1 G1
1 B2 C2 F2 G2
2 3 B3 F3 G3
3 4 B4 C4 F4 G4
I would essentially like to merge df2 into df1 by 3 different columns. i understand that you can merge on multiple column names but it seems to not give me the desired result. I would like to KEEP all data in df1, and fill in the data from df2 so i use how='left'.
I am fairly new to python and have done a lot of research but have hit a stuck point. Here is what i have tried.
data3 = df1.merge(df2, how='left', on=['A'])
print(data3)
A B_x C_x D E B_y C_y F G
0 1 B1 D1 E1 C1 F1 G1
1 C1 D1 E1 B2 C2 F2 G2
2 1 B1 D1 E1 C1 F1 G1
3 2 B2 D2 E2 NaN NaN NaN NaN
4 B2 C2 D2 E2 B2 C2 F2 G2
5 3 D3 E3 B3 F3 G3
6 3 B3 C3 D3 E3 B3 F3 G3
7 4 C4 D4 E4 B4 C4 F4 G4
As you can see here it sort of worked with just A, however since this is a csv file with blank values. the blank values seem to merge together. which i do not want. because df2 was blank in row 2 it filled in the data where it saw blanks, which is not what i want. it should be NaN if it could not find a match.
whenever i start putting additional rows into my "on=['A', 'B'] it does not do anything different. in-fact, A no longer merges.
data3 = df1.merge(df2, how='left', on=['A', 'B'])
print(data3)
A B C_x D E C_y F G
0 1 B1 D1 E1 NaN NaN NaN
1 C1 D1 E1 NaN NaN NaN
2 1 B1 D1 E1 NaN NaN NaN
3 2 B2 D2 E2 NaN NaN NaN
4 B2 C2 D2 E2 C2 F2 G2
5 3 D3 E3 NaN NaN NaN
6 3 B3 C3 D3 E3 F3 G3
7 4 C4 D4 E4 NaN NaN NaN
Rows A, B, and C are the values i want to correlate and merge on. Using both data frames it should know enough to fill in all the gaps. my ending df should look like:
print(desired_output):
A B C D E F G
0 1 B1 C1 D1 E1 F1 G1
1 1 B1 C1 D1 E1 F1 G1
2 1 B1 C1 D1 E1 F1 G1
3 2 B2 C2 D2 E2 F2 G2
4 2 B2 C2 D2 E2 F2 G2
5 3 B3 C3 D3 E3 F3 G3
6 3 B3 C3 D3 E3 F3 G3
7 4 B4 C4 D4 E4 F4 G4
even though A, B, and C have repeating rows i want to keep ALL the data and just fill in the data from df2 where it might fit, even if it is repeat data. i also do not want to have all of the _x and _y the suffix's from merging. i know how to rename but doing 3 different merges and merging those merges starts to get really complicated really fast with repeated rows and suffix's...
long story short, how can i merge both data-frames by A, and then B, and then C? order in which it happens is irrelevant.
Here is a sample of actual data. I have my own data that has additional data and i relate it to this data by certain identifiers. basically by MMSI, Name and IMO. i want to keep duplicates because they aren't actually duplicates, just additional data points for each vessel
MMSI BaseDateTime LAT LON VesselName IMO CallSign
366940480.0 2017-01-04T11:39:36 52.48730 -174.02316 EARLY DAWN 7821130 WDB7319
366940480.0 2017-01-04T13:51:07 52.41575 -174.60041 EARLY DAWN 7821130 WDB7319
273898000.0 2017-01-06T16:55:33 63.83668 -174.41172 MYS CHUPROVA NaN UAEZ
352844000.0 2017-01-31T22:51:31 51.89778 -176.59334 JACHA 8512920 3EFC4
352844000.0 2017-01-31T23:06:31 51.89795 -176.59333 JACHA 8512920 3EFC4
I have a dataset which looks like:
Product Metrics C1 C2 C3
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
Now I want to sort the values based on metric Q1 - From smallest to largest (comparing against the product -A1,A2) then the final dataset should look like,
Product Metrics C1 C2 C3
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
hope this gives a clear picture. Thanks in advance guys
The way I would probably do it is transpose your columns and rows so that you have columns for Q1, Q2, Q3, Q4.
Like this:
Product Metrics Q1 Q2 Q3 Q4
A1 C1 20 213123 454 3
A1 C2 30 2312 65 4
A1 C3 10 32123 45 6
A2 C1 10 123 454 3
A2 C2 5 13 65 45
A2 C3 1 23 45 6
Then you can sort by Q1 using Data>Sort & Filter
CBRF23 already pointed in the right direction but I believe you have to go even a little bit further and flatten each product related sub-array into a single row like
A | B C D | E F G | H I J | K L M
---| Q1 --------| Q2 ------------ | Q3 ------- | Q4 -------
Pr | C1 C2 C3 | C1 C2 C3 | C1 C2 C3 | C1 C2 C3
A1 | 20 30 10 | 213123 2312 32123 | 454 65 45 | 3 4 6
A2 | 10 5 1 | 123 13 23 | 454 65 45 | 3 4 6
A3 | 18 6 3 | 123 13 23 | 454 65 45 | 3 4 6
(The first row just shows the Excel columns, second row the flattened Q1,Q2,Q3 and Q4 sections and third row the sub-headers for each column)
Now you can safely sort by column B. In case you want to sort by the sum of all Q1 metrics you could introduce another column N being the sum of B,C and D and use that for sorting.
Update:
To get your desired output format back there are basically to possibilities:
If the number of records is known and fixed you can set-up a "results" page in your excel folder with a list of small "sub-tables". The fields of each sub-array then directly reference the "transposed" fields in a line of the sorted master results array.
If the number of results is variable you will have to construct/reconstruct the results page mentioned above using a suitable vba script. The vba generated page can of course also consist of the sorted values directly rather than referencing the values in the sorted master array.