How to group by one column of a DataFrame, collecting the matching rows of another column into a list and prefixing each group with its row count? - python-3.x

Imagine we have a DataFrame with 2 columns: col1 holds unique numbers while col2 holds repeated numbers, as in the setup below. I want it to end up like the output shown below.

Try:
# Setup
import pandas as pd

df = pd.DataFrame({'col1': {0: 89, 1: 53, 2: 97, 3: 106, 4: 115, 5: 56, 6: 55, 7: 105, 8: 71, 9: 70, 10: 110},
                   'col2': {0: 205, 1: 205, 2: 205, 3: 203, 4: 203, 5: 203, 6: 202, 7: 201, 8: 200, 9: 200, 10: 198}})
df_new = df.groupby('col2', sort=False)['col1'].apply(list).reset_index()
df_new['col2'] = df_new['col1'].str.len().astype(str) + '*' + df_new.pop('col2').astype(str)
print(df_new)
[out]
             col1   col2
0    [89, 53, 97]  3*205
1  [106, 115, 56]  3*203
2            [55]  1*202
3           [105]  1*201
4        [71, 70]  2*200
5           [110]  1*198
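Roughly the same result can also be produced in a single named-aggregation pass (a sketch; column names as above, pandas >= 0.25 assumed):
# Count and collect in one groupby call
out = (df.groupby('col2', sort=False)
         .agg(col1=('col1', list), n=('col1', 'size'))
         .reset_index())
out['col2'] = out.pop('n').astype(str) + '*' + out.pop('col2').astype(str)
print(out)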

Related

Is there a Pandas way to group dates that are less than 2 minutes apart in a dataframe?

I would like to do a groupby on my data to put together dates that are close (less than 2 minutes apart).
Here is an example of what I have:
datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'],
         ['A', 51, 'id2', '2020-05-27 05:51:08.347'],
         ['B', 45, 'id3', '2020-05-24 17:23:55.142'],
         ['B', 45, 'id4', '2020-05-24 17:23:30.141'],
         ['C', 34, 'id5', '2020-05-23 17:31:10.341']]

df = pd.DataFrame(datas, columns=['col1', 'col2', 'cold_id', 'dates'])
The first 2 rows have close dates, the same goes for the 3rd and 4th rows, and the 5th row stands alone.
I would like to get something like this:
datas = [['A', 51, 'id1 id2', 'date_1'],
         ['B', 45, 'id3 id4', 'date_2'],
         ['C', 34, 'id5', 'date_3']]

df = pd.DataFrame(datas, columns=['col1', 'col2', 'col_id', 'dates'])
Doing it in a plain pythonic way is not that hard, but I have to do it on a big dataframe, so a pandas approach using the groupby method would be much more efficient.
After applying a datetime conversion to the dates column I tried:
df.groupby([df['dates'].dt.date]).agg(','.join)
but the .dt.date method buckets the rows by calendar day, not by 2-minute proximity.
Do you have a solution?
Thank you
Looking at the expected output, it seems we are trying to group the dates by a 2-minute frequency together with col1 and col2.
Code
df['dates'] = pd.to_datetime(df.dates)
(df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1', 'col2'])
   .agg(','.join)
   .reset_index()
   .sort_values('col1')
   .reset_index(drop=True))
Output
                dates col1  col2  cold_id
0 2020-05-27 05:50:00    A    51  id1,id2
1 2020-05-24 17:22:00    B    45  id3,id4
2 2020-05-23 17:30:00    C    34      id5
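One caveat: pd.Grouper(freq='2 min') creates fixed, clock-aligned 2-minute bins rather than grouping by the gap between consecutive rows, so two timestamps that are less than 2 minutes apart can still land in different bins. A tiny illustration (assumed example; dt.floor bins the same way):
ts = pd.Series(pd.to_datetime(['2020-05-27 05:51:55', '2020-05-27 05:52:15']))
print(ts.dt.floor('2 min'))  # 05:50:00 and 05:52:00 -> different bins, only 20 s apart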
Using only the dates, this is what I did to classify your rows.
First of all, convert the dates to timestamps so they are easy to compare:
from datetime import datetime
import time

df["dates"] = df["dates"].apply(
    lambda x: int(time.mktime(datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f").timetuple())))
Then, sort them by date:
df = df.sort_values("dates")
Finally, using this answer, I create a group column in order to identify close dates. The first line puts a 1 in the group column whenever the gap to the previous row exceeds 120 seconds, i.e. whenever a new group starts. The second line fills in the group column, removing the NaN values and assigning the group numbers:
df.loc[df.dates.shift() < df.dates - 120, "group"] = 1
df['group'] = df['group'].cumsum().ffill().fillna(0)
This gives me:
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4 1590333810 1.00
2 B 45 id3 1590333835 1.00
0 A 51 id1 1590551443 2.00
1 A 51 id2 1590551468 2.00
Now, to concatenate your cold_id values, group by the group column and join the cold_id of each group using transform:
df["cold_id"] = df.groupby(["group"],as_index=False)["cold_id"].transform(lambda x: ','.join(x))
df = df.drop_duplicates(subset=["cold_id"])
This finally gives you this dataframe:
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4,id3 1590333810 1.00
0 A 51 id1,id2 1590551443 2.00
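A variant of the same idea that skips the epoch-second conversion and works directly on datetimes (a minimal sketch, assuming the original df from the question; the 120-second threshold becomes a Timedelta):
df['dates'] = pd.to_datetime(df['dates'])
df = df.sort_values('dates')
# a gap larger than 2 minutes starts a new group
df['group'] = df['dates'].diff().gt(pd.Timedelta(seconds=120)).cumsum()
df_out = (df.groupby(['group', 'col1', 'col2'], as_index=False)['cold_id']
            .agg(' '.join))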

In pandas dataframe, how to make one column act on all the others?

Consider the small following dataframe:
import pandas as pd
value1 = [15, 20, 50, 70]
value2 = [15, 80, 45, 30]
base = [175, 150, 200, 125]
df = pd.DataFrame({"val1": value1, "val2": value2, "base": base})
df
val1 val2 base
0 15 15 175
1 20 80 150
2 50 45 200
3 70 30 125
Actually, there are many more rows and many more val*** columns...
I would like to express the figures in the val*** columns as a percentage of their corresponding base (in the same row); for example, 70 (last in val1) should become (70/125)*100, which is 56, and 30 (last in val2) should become (30/125)*100, which is 24; and so on for every figure.
I am sure the solution lies in a correct use of assign or apply and a lambda, but I can't figure out how to do it...
We can filter the val-like columns, then divide these columns by the base column along axis=0, and finally multiply by 100 to calculate the percentage:
df.filter(like='val').div(df['base'], axis=0).mul(100).add_suffix('%')
val1% val2%
0 8.571429 8.571429
1 13.333333 53.333333
2 25.000000 22.500000
3 56.000000 24.000000
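If the percentages should replace the original values rather than live in a new frame, the same division can be assigned back in place (a minimal sketch, assuming the df above):
val_cols = df.filter(like='val').columns
df[val_cols] = df[val_cols].div(df['base'], axis=0).mul(100)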

Using groupby and filters on a dataframe

I have a dataframe with both string and integer values.
Attaching a sample data dictionary to understand the dataframe that I have:
data = {
'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}
I need to extract the data as follows:
The max value from col4
Grouped by col1
With rows filtered out of the result if col3 is Y
With col5 filtered to show only values not more than 5.
So I tried a few things and faced the following problems.
1 - I used the following method to find the max value in the data, but I am not able to find the max value of each group.
print(dataframe['col4'].max()) #this worked to get one max value
print(dataframe.groupby('col1').max()) #this doesn't work
The second one doesn't work for me because it returns the maximum of col2 as well; I need the result to keep the col2 value of the max row in each group.
2 - I am not able to apply a filter on both col3 (str) and col5 (int) in one command. Is there a way to do that?
print(dataframe[dataframe['col3'] != 'Y' & dataframe['col5'] < 6]) #generates an error
The output that I am expecting through this is:
col1 col2 col3 col4 col5
0 A 10 X 45 3
3 B 10 X 56 4
6 C 10 X 87 4
10 D 20 X 43 4
#
# 78 is max in group A, but ignored as col5 is 6 (we need < 6)
# Similarly, 89 is max in group D, but ignored as col3 is Y.
I apologize if I am doing something wrong. I am quite new to this.
Thank you.
I'm not a Python developer, but in my opinion you are going about it the wrong way.
You should have a list of structures instead of a structure of lists.
Then you can start working on such a list.
This is an example solution, so it could probably be done in a much smoother way:
data = {
    'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
    'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
    'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}

# Rebuild the data as a list of records (one dict per row)
newData = []
for i in range(len(data['col1'])):
    newData.append({'col1': data['col1'][i], 'col2': data['col2'][i],
                    'col3': data['col3'][i], 'col4': data['col4'][i],
                    'col5': data['col5'][i]})

withoutY = list(filter(lambda d: d['col3'] != 'Y', newData))   # drop rows where col3 == 'Y'
atMost5 = list(filter(lambda d: d['col5'] <= 5, withoutY))     # keep rows with col5 not more than 5
values = set(map(lambda d: d['col1'], atMost5))                # distinct group keys
grouped = [[d1 for d1 in atMost5 if d1['col1'] == d2] for d2 in values]

result = []
for group in grouped:
    result.append(max(group, key=lambda g: g['col4']))         # row with max col4 per group

sortedResult = sorted(result, key=lambda r: r['col1'])
print(sortedResult)
result:
[
{'col1': 'A', 'col2': 10, 'col3': 'X', 'col4': 45, 'col5': 3},
{'col1': 'B', 'col2': 10, 'col3': 'X', 'col4': 56, 'col5': 4},
{'col1': 'C', 'col2': 10, 'col3': 'X', 'col4': 87, 'col5': 4},
{'col1': 'D', 'col2': 20, 'col3': 'X', 'col4': 43, 'col5': 4}
]
Ok, I didn't actually notice. So I tried something like this:
# fd is the filtered data
fd = data.query('col3 != "Y"').query('col5 < 6')
# or: fd = data[(data.col3 != 'Y') & (data.col5 < 6)]
# m is the max of col4 grouped by col1
m = fd.groupby('col1')['col4'].max()
This will group by col1 and get the max of col4, but the result has only 2 columns (col1 and col4).
I don't know exactly what you want to achieve.
If you want to keep the whole rows, here is the code:
result = fd[fd['col4'] == fd['col1'].map(m)]
You need to be careful, because you will not always get exactly one row per "col1" value.
E.g. for the data
data = pd.DataFrame({
'col1': ['A','A','A','A','B','B','B','B','C','C','C','D','D','D'],
'col2': [20,10,20,30,10,20,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,45,23,78,45,56,12,34,87,54,43,89,43,12],
'col5': [1,3,4,6,1,4,3,2,4,3,5,3,4,6]})
Result will be:
col1 col2 col3 col4 col5
0 A 20 X 45 1
1 A 10 X 45 3
5 B 20 X 56 4
8 C 10 X 87 4
12 D 20 X 43 4
Additionally, if you want to have a normal index instead of ..., 8, 12, you could use "where" instead of "query".
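If only one row per group is wanted even when the maximum is tied, groupby plus idxmax keeps just the first matching row (a minimal sketch, reusing the filtered fd and the per-group maxima from above):
result = fd.loc[fd.groupby('col1')['col4'].idxmax()]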

Combine data from two columns into one without affecting the data values

I have two columns in a data frame. I want to combine those columns into a single column.
df = pd.DataFrame({'a': [500, 200, 13, 47], 'b':['$', '€', .586,.02]})
df
Out:
a b
0 500 $
1 200 €
2 13 .586
3 47 .02
I want to merge those two columns without altering the data values.
Expected output:
df
Out:
a
0 500$
1 200€
2 13.586
3 47.02
Please help me with this...
I tried the solutions below, but they do not work for me:
df.b=np.where(df.b,df.b,df.a)
df.loc[df['b'] == '', 'b'] = df['a']
The first solution works by converting both columns to strings and joining them with +, and finally converting the Series to a one-column DataFrame; but it only works if the numbers in column b are less than 1:
df1 = df.astype(str)
df = (df1.a + df1.b.str.replace(r'^0', '', regex=True)).to_frame('a')
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
Or, if you want mixed values (numeric for the last 2 rows and strings for the first 2 rows), use:
m = df.b.apply(lambda x: isinstance(x, str))
df.loc[m, 'a'] = df.loc[m, 'a'].astype(str) + df.b
df.loc[~m, 'a'] += df.pop('b')
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
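Another option is to branch on the type of each value in b: format floats without their leading 0 and keep strings as they are (a sketch assuming the df above; fmt is just a helper defined here, not a pandas function):
def fmt(v):
    return format(v, 'g').lstrip('0') if isinstance(v, float) else str(v)

df['a'] = df['a'].astype(str) + df['b'].map(fmt)
df = df.drop(columns='b')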

How to add column name to cell in pandas dataframe?

How do I take a normal data frame, like the following:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
and produce a dataframe where the column name is added to the cell in the frame, like the following:
d = {'col1': ['col1=1', 'col1=2'], 'col2': ['col2=3', 'col2=4']}
df = pd.DataFrame(data=d)
df
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
Any help is appreciated.
Make a new DataFrame containing the col*= strings, then add it to the original df with its values converted to strings. You get the desired result because addition concatenates strings:
>>> pd.DataFrame({col:str(col)+'=' for col in df}, index=df.index) + df.astype(str)
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can use apply to fetch the column names for each row and then join them with '=' to the string values:
df.apply(lambda x: x.index+'=', axis=1)+df.astype(str)
Out[168]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can try this
df.ne(0).mul(df.columns)+'='+df.astype(str)
Out[1118]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
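A plain per-column loop gives the same result and also works when the frame contains zeros (a minimal sketch, assuming the df above):
out = df.astype(str)
for col in out.columns:
    out[col] = f'{col}=' + out[col]
print(out)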
