How to use sqldf to group rows and paste values

I have data that looks like this:
V1 V2 V3 V4
a 1 2 name1
b 3 4 name2
b 3 4 name3
c 2 5 name4
and I want to get this result by grouping on V1, V2, V3 using sqldf:
V1 V2 V3 V4
a 1 2 name1
b 3 4 name2+name3
c 2 5 name4
I am thinking of using something like
sqldf("select *, paste(V4) as V5 from table group by V1,V2,V3")
but I am having trouble finding the right function to use in place of "paste" above. I wrote a complicated loop to solve this problem, but I am wondering if there is a simpler way. Could someone help me out? Any input would be very appreciated! Thank you for your time!
Thanks,
Raine

Try this:
sqldf("select V1, V2, V3, group_concat(V4) V4 from DF group by V1, V2, V3")
giving:
V1 V2 V3 V4
1 a 1 2 name1
2 b 3 4 name2,name3
3 c 2 5 name4
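Note that group_concat uses a comma as its default separator, which is why the output shows name2,name3. SQLite's group_concat also accepts an optional second argument for the separator, so group_concat(V4, '+') should give name2+name3 as in the desired output.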

Related

Optimize groupby->pd.DataFrame->.reset_index->.rename(columns)

I am very new at this, so bear with me please.
I do this:
example=
index Date Column_1 Column_2
1 2019-06-17 Car Red
2 2019-08-10 Car Yellow
3 2019-08-15 Truck Yellow
4 2020-08-12 Truck Yellow
data = example.groupby([pd.Grouper(freq='Y', key='Date'), 'Column_1']).nunique()
df1 = pd.DataFrame(data)
df2 = df1.reset_index(level=['Column_1', 'Date'])
df2 = df2.rename(columns={'Date': 'interval_year', 'Column_2': 'Sum'})
In order to get this:
df2=
index interval_year Column_1 Sum
1 2019-12-31 Car 2
2 2019-12-31 Truck 1
3 2020-12-31 Truck 1
I get the expected result, but my code gives me a lot of headache. I create two additional DataFrames, and sometimes, when I end up with two columns of the same name (one as the index), the code becomes even more complicated.
Any solution to make this more efficient?
Thank you
You can use pd.NamedAgg to do some renaming for you in the groupby like this:
example.groupby([pd.Grouper(key='Date', freq='Y'),'Column_1']).agg(sum=('Date','nunique')).reset_index()
Output:
Date Column_1 sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1
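Note that this counts unique Date values per group, which happens to match the expected numbers here; to count unique Column_2 values, as the original code does, you can point the named aggregation at Column_2 instead (a sketch of the same call with the target column swapped, which also capitalizes the column as Sum):
example.groupby([pd.Grouper(key='Date', freq='Y'), 'Column_1']).agg(Sum=('Column_2', 'nunique')).reset_index()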
To reduce visual noise and avoid creating the intermediate data and df1 frames, I suggest method chaining.
Try this:
df2 = (
    example
    .assign(Date=lambda d: pd.to_datetime(d['Date']))
    .groupby([pd.Grouper(freq='Y', key='Date'), 'Column_1'])
    .nunique()
    .reset_index()
    .rename(columns={'Date': 'interval_year', 'Column_2': 'Sum'})
)
# Output:
print(df2)
interval_year Column_1 Sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1

Updating a Value of a Pandas Dataframe with a Function

I have a function which updates a dataframe that I have passed in:
import pandas as pd

def update_df(df, x, i):
    for i in range(x):
        row = ['name' + str(i), i + 2, i - 1]  # 'row' avoids shadowing the built-in list
        df.loc[i] = row
    return df, i

df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
i = 0
df, i = update_df(df, 6, i)
What I would like to do is modify the dataframe after running it through the update_df(df, x, i) function I created, but I am not having much luck. As an example, I am trying to concatenate the values of the last two rows in the lib column:
temp = df.loc[i][0]+df.loc[i-1][0]
print(temp)
df.loc[i][0] = temp
print(df)
The following is the output I get:
name5name4
lib qty1 qty2
0 name0 2 -1
1 name1 3 0
2 name2 4 1
3 name3 5 2
4 name4 6 3
5 name5 7 4
But what I hope to get is:
name5name4
lib qty1 qty2
0 name0 2 -1
1 name1 3 0
2 name2 4 1
3 name3 5 2
4 name4 6 3
5 name5name4 7 4
This is part of a larger project where I will be constantly writing to a dataframe and then eventually write the dataframe to a file. Periodically I will want to update the last row of the dataframe which is where I am looking to figure out how to update it appropriately. I would like to avoid making a copy of the dataframe and just stick with one that I update throughout my code.
If all you want is to concatenate the last two values in the lib column, and reassign the last row's lib column to that value:
df.loc[df.index[-1], "lib"] = df[-2:]["lib"].sum()
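As a side note, the original attempt likely failed because df.loc[i][0] = temp is chained indexing, which can assign into a temporary copy rather than the dataframe itself. A single .loc call with both the row and column labels avoids that; a minimal sketch, preserving the name5name4 order from the question:
df.loc[df.index[-1], 'lib'] = df.loc[df.index[-1], 'lib'] + df.loc[df.index[-2], 'lib']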

Sumif with only first value of each group in column

I have a dataset similar to this, but really extensive:
Row  Levels  Level 1  Size  Department
1    1       AA       2.0   Dept 1
2    2       AA       0.8   Dept 1
3    3       AA       1.5   Dept 1
4    2       BB       3.0   Dept 1
5    3       BB       2.0   Dept 1
6    3       BB       2.5   Dept 2
7    2       CC       5.0   Dept 2
8    3       CC       1.5   Dept 2
9    3       DD       0.5   Dept 2
10   3       DD       3.0   Dept 2
11   2       EE       4.0   Dept 2
12   3       EE       2.0   Dept 2
What I need is a total Size per Department; however, I want to sum only the first row for each Level 1 value within each department, i.e.:
Department 1 would be 2.0 (row 1) + 3.0 (row 4) = 5.0
Department 2 would be 2.5 (row 6) + 5.0 (row 7) + 0.5 (row 9) + 4.0 (row 11) = 12.0
Does anyone have any idea how to accomplish this in Excel?
An alternate solution to the same problem:
=SUM(XLOOKUP(UNIQUE(FILTER(C:C,(ROW(C:C)>1)*(E:E=$F$2#))&$F$2#),C:C&E:E,D:D))
Where F2 holds =UNIQUE(FILTER(E:E,(ROW(E:E)>1)*(E:E<>"")))
If you have Excel 365, you could try something like this:
=LET(FilteredLevel,FILTER(C$2:C$13,E$2:E$13=H2),
SUM(XLOOKUP(UNIQUE(FilteredLevel),FilteredLevel,FILTER(D$2:D$13,E$2:E$13=H2))))
Note
You can also use full-column references if you wish
=LET(FilteredLevel,FILTER(C:C,E:E=H2),
SUM(XLOOKUP(UNIQUE(FilteredLevel),FilteredLevel,FILTER(D:D,E:E=H2))))
SUMIFS() will not do what you want here. Use SUMPRODUCT() with some boolean logic:
=SUMPRODUCT($C$2:$C$13*($D$2:$D$13=F2)*(COUNTIFS(OFFSET($B$2,0,0,ROW($B$2:$B$13)-1),$B$2:$B$13,OFFSET($D$2,0,0,ROW($B$2:$B$13)-1),F2)=1))
One note: the use of OFFSET() makes this a volatile formula, meaning it will recalculate with every change made to the workbook. If there are too many volatile formulas, they will slow down Excel's responsiveness.
To do it without the volatility we need a helper column. In E2 put:
=COUNTIFS($D$2:D2,D2,$B$2:B2,B2)=1
And copy down. Then we can use SUMIFS():
=SUMIFS(C:C,D:D,F2,E:E,TRUE)
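Since the rest of this page leans on pandas, here is a sketch of the same first-match-per-group logic there, assuming the table above is loaded as a DataFrame (the column names below are my own):
import pandas as pd

df = pd.DataFrame({
    'Level1': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'CC', 'CC', 'DD', 'DD', 'EE', 'EE'],
    'Size': [2.0, 0.8, 1.5, 3.0, 2.0, 2.5, 5.0, 1.5, 0.5, 3.0, 4.0, 2.0],
    'Department': ['Dept 1'] * 5 + ['Dept 2'] * 7,
})

# keep only the first row per (Department, Level 1) pair, then total Size per department
totals = (
    df.drop_duplicates(subset=['Department', 'Level1'])
      .groupby('Department')['Size']
      .sum()
)
print(totals)  # Dept 1: 5.0, Dept 2: 12.0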

How to convert elements of a column to a list in pandas

I have a dataframe df like
A B C
1 2 {'id':1}
3 3 {'id':2}
5 4 {'id':3}
I want an output like this.
A B C
1 2 [{'id':1}]
3 3 [{'id':2}]
5 4 [{'id':3}]
Any help would be appreciated. Thanks!
Try with:
df['C'] = df['C'].apply(lambda x: [x])
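If you prefer to avoid apply, an equivalent sketch wraps each element with a list comprehension:
df['C'] = [[x] for x in df['C']]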

pandas: do not count nan in an aggregate function

I have the following code:
data_agg_df = data_df.groupby("team", as_index=False).player.agg({"player_set": lambda x: set(list(x)), "player_count": "nunique"})
Then my results look like:
team player_set player_count
-------------------------------------------------
A {John, Mary} 2
B {nan} 0
C {Dave,nan} 1
I am wondering how to not show the NaN in the player_set? I.e., I want the resulting data frame to look like:
team player_set player_count
-------------------------------------------------
A {John, Mary} 2
B {} 0
C {Dave} 1
Thanks!
Replace
set(list(x))
with
set(list(i for i in x if pd.notnull(i)))
to take out the NaNs.
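As a side note, the set(list(...)) wrapper is redundant, and recent pandas versions may reject the dict-renaming form of .agg with a SpecificationError. A minimal sketch of the whole aggregation using named aggregation and dropna(), assuming team and player columns as in the question:
import numpy as np
import pandas as pd

data_df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'C', 'C'],
    'player': ['John', 'Mary', np.nan, 'Dave', np.nan],
})

# dropna() removes the NaNs before building the set; nunique already ignores them
data_agg_df = data_df.groupby('team', as_index=False).agg(
    player_set=('player', lambda x: set(x.dropna())),
    player_count=('player', 'nunique'),
)
print(data_agg_df)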
