pandas random shuffling dataframe with constraints - python-3.x

I have a dataframe that I need to randomise in a very specific way with a particular rule, and I'm a bit lost. A simplified version is here:
idx type time
1 a 1
2 a 1
3 a 1
4 b 2
5 b 2
6 b 2
7 a 3
8 a 3
9 a 3
10 b 4
11 b 4
12 b 4
13 a 5
14 a 5
15 a 5
16 b 6
17 b 6
18 b 6
19 a 7
20 a 7
21 a 7
If we consider this as containing seven "bunches", I'd like to randomly shuffle by those bunches, i.e. retaining the time column. However, the constraint is that after shuffling, a particular bunch type (a or b in this case) cannot appear more than n (e.g. 2) times in a row. So an example correct result looks like this:
idx type time
21 a 7
20 a 7
19 a 7
7 a 3
8 a 3
9 a 3
17 b 6
16 b 6
18 b 6
6 b 2
5 b 2
4 b 2
2 a 1
3 a 1
1 a 1
14 a 5
13 a 5
15 a 5
12 b 4
11 b 4
10 b 4
I was thinking I could create a separate "order" array from 1 to 7, np.random.shuffle() it, and then sort the dataframe by time in that order, which will probably work. I can think of ways to do that part, but I'm especially struggling with the rule restricting the number of repeats.
Roughly, I know I should use a while loop: shuffle as above, loop over the frame tracking the number of consecutive types, and if a run exceeds my n, break out and reshuffle, repeating until a pass completes without breaking out. But my attempt got so messy that it didn't work.
Any ideas?

See if this works.
import pandas as pd
import numpy as np

data = [['a', 1], ['a', 1], ['a', 1],
        ['b', 2], ['b', 2], ['b', 2],
        ['a', 3], ['a', 3], ['a', 3]]
df = pd.DataFrame(data, columns=['type', 'time'])
print(df)

# shuffle the unique bunch labels (the time values)
order = np.unique(np.array(df['time']))
print("Before Shuffling", order)
np.random.shuffle(order)
print("Shuffled", order)

# show the first n rows of each bunch in the shuffled order
n = 2
for i in order:
    print(df[df['time'] == i].iloc[0:n])
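The shuffle above reorders the bunches but never enforces the repeat rule. Below is a minimal rejection-sampling sketch along the lines the asker describes: reshuffle whole bunches until no type runs more than max_repeats times in a row (max_repeats and the run-counting loop are illustrative, not from the original answer):

# reuse df from above; one row per bunch: its time label and its type
bunches = df.drop_duplicates('time')[['time', 'type']].to_numpy()

max_repeats = 2  # a bunch type may not appear more than this many times in a row
while True:
    np.random.shuffle(bunches)  # shuffle whole bunches, keeping each intact
    types = [t for _, t in bunches]
    # find the longest run of identical consecutive types
    run = longest = 1
    for prev, cur in zip(types, types[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    if longest <= max_repeats:  # note: loops forever if no valid arrangement exists
        break

# rebuild the frame in the shuffled bunch order
shuffled = pd.concat([df[df['time'] == t] for t, _ in bunches], ignore_index=True)
print(shuffled)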

Related

Calculate mean value by interval coordinates in pandas

I have a dataframe such as:
Name Position Value
A 1 10
A 2 11
A 3 10
A 4 8
A 5 6
A 6 12
A 7 10
A 8 9
A 9 9
A 10 9
A 11 9
A 12 9
and I would like, for each interval of 3 positions, to calculate the mean of the Values,
and to create a new df with start and end coordinates (so intervals of length 3) and a Mean_value column.
Name Start End Mean_value
A 1 3 10.33 <---- here this is (10+11+10)/3 = 10.33
A 4 6 8.7
A 7 9 9.3
A 10 12 9
Does someone have an idea using pandas, please?
Solution to take each 3 rows (if they exist) per Name group: first build a counter with GroupBy.cumcount and integer division, then pass it to named aggregations:
g = df.groupby('Name').cumcount() // 3
df = (df.groupby(['Name', g])
        .agg(Start=('Position', 'first'),
             End=('Position', 'last'),
             Value=('Value', 'mean'))
        .droplevel(1)
        .reset_index())
print(df)
Name Start End Value
0 A 1 3 10.333333
1 A 4 6 8.666667
2 A 7 9 9.333333
3 A 10 12 9.000000
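For reference, the counter g labels every block of 3 rows within each Name, and those labels become the second grouping key; a quick check on the sample data, not part of the original answer:

print(g.tolist())
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]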

Trouble Changing csv row based on column values in python3 and pandas

I am having issues trying to modify a value if columnA contains, let's say, 0, 2, 12, 44, 75, or 81 (I'm looking at 33 different numerical values in columnA). If columnA contains one of the 33 values defined, I need to change the same row of columnB to "approved".
I have tried this:
df[(df.columnA == '0|2|12|44|75|81')][df.columnB] = 'approved'
I get an error saying that none of the index values are in the columns, but I know that isn't correct from looking at my csv file data. I must be searching the wrong way or using the wrong syntax.
As others have said, if you need the value in column A to exactly match a value in the list, then you can use .isin. If you need a partial match, then you can convert to string dtype and use .contains.
setup:
nums_to_match = [0,2,9]
df
A B
0 6 e
1 1 f
2 3 b
3 6 g
4 6 i
5 0 f
6 9 a
7 6 i
8 6 a
9 2 h
Exact match:
df.loc[df.A.isin(nums_to_match), 'B'] = 'approved'
df:
A B
0 6 e
1 1 f
2 3 b
3 6 g
4 6 i
5 0 approved
6 9 approved
7 6 i
8 6 a
9 2 approved
Partial match (demonstrated on a fresh random frame, so the values differ from the setup above):
nums_to_match_str = list(map(str, nums_to_match))
df['A'] = df['A'].astype(str)
df.loc[df.A.str.contains('|'.join(nums_to_match_str), case=False, na=False), 'B'] = 'approved'
df
A B
0 1 h
1 4 c
2 6 i
3 7 d
4 3 d
5 9 approved
6 5 i
7 1 c
8 0 approved
9 5 d
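One caveat the answer does not mention: joining the numbers with '|' matches anywhere in the string, so a pattern like '0' would also flag '10' or '100'. If exact tokens are needed in the partial-match style, word boundaries in the pattern avoid the over-match (a hedged sketch; pattern is an illustrative name):

import re

# anchor each number so '0' does not also match inside '10'
pattern = '|'.join(rf'\b{re.escape(s)}\b' for s in nums_to_match_str)
df.loc[df['A'].str.contains(pattern, na=False), 'B'] = 'approved'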

Iterating through a data frame and grouping values in a range

I have a python data frame of weekly data like this:
Week Val
1 11
2 11
3 11
4 11
5 9
6 9
7 9
8 9
I would like to create an output table like this:
Week 1 Week 2 Val
1 4 11
5 8 9
Apologies, I am quite new to python and its iterative tools, and I am not sure how to solve this problem.
I tried matching against the previous row's values, but I don't know how to go further:
df['Match'] = df['Val'].eq(df['Val'].shift(-1))
You want to groupby the consecutive blocks of Val. So you can use cumsum on the non-zero differences to get the block:
blocks = df['Val'].ne(df['Val'].shift(1)).cumsum()
(df.groupby(blocks, as_index=False)
   .agg(Week1=('Week', 'min'), Week2=('Week', 'max'), Val=('Val', 'first'))
)
Or you can chain:
(df.groupby(df['Val'].ne(df['Val'].shift(1)).cumsum(), as_index=False)
   .agg(Week1=('Week', 'min'), Week2=('Week', 'max'), Val=('Val', 'first'))
)
Output:
Week1 Week2 Val
0 1 4 11
1 5 8 9
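For reference, blocks labels each consecutive run of equal Val with its own integer, which is exactly what the groupby keys on; a quick check on the sample data, not part of the original answer:

print(blocks.tolist())
# [1, 1, 1, 1, 2, 2, 2, 2]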

pandas groupby extract top percent n data(descending)

I have some data like this
df = pd.DataFrame({'class':['a','a','b','b','a','a','b','c','c'],'score':[3,5,6,7,8,9,10,11,14]})
df
class score
0 a 3
1 a 5
2 b 6
3 b 7
4 a 8
5 a 9
6 b 10
7 c 11
8 c 14
I want to use the groupby function to extract the top n% of the data (descending by score). I know nlargest can do it, but the size of every group is different, so I don't know how to apply it here.
I tried this:
top_n = 0.5
g = (df.groupby(['class'])['score']
       .apply(lambda x: x.nlargest(int(round(top_n * len(x))), keep='all'))
       .reset_index())
g
class level_1 score
0 a 5 9
1 a 4 8
2 b 6 10
3 b 3 7
4 c 8 14
but it cannot deal with big data (more than 10 million rows); it is very slow. How do I speed it up? Thank you!
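No answer is included for this one, but the slowness comes mostly from running a Python lambda per group. A vectorized sketch using groupby.rank, which keeps ties roughly like keep='all' and rounds the cutoff the way int(round(...)) does above (variable names are illustrative):

top_n = 0.5

# rank scores within each class, highest score first; 'min' gives ties the same rank
ranks = df.groupby('class')['score'].rank(method='min', ascending=False)
# how many rows to keep per class (rounded, as in the apply version)
cutoff = (df.groupby('class')['score'].transform('size') * top_n).round()
g = df[ranks <= cutoff].sort_values(['class', 'score'], ascending=[True, False])
print(g)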

How to select last 5 rows of each unique records in pandas

Using python 3, I am trying, for each unique value in the column 'Name', to get the last 5 records from the column 'Number'. How exactly can this be done in python?
My df looks like this:
Name Number
a 5
a 6
b 7
b 8
a 9
a 10
b 11
b 12
a 9
b 8
I saw similar examples (like this one: Get sum of last 5 rows for each unique id) in SQL, but that is time-consuming and I would like to learn how to do it in python.
My expected output df would be like this:
Name 1 2 3 4 5
a 5 6 9 10 9
b 7 8 11 12 8
I think you need something like this:
df_out = df.groupby('Name').tail(5)
df_out.set_index(['Name', df_out.groupby('Name').cumcount() + 1])['Number'].unstack()
Output:
1 2 3 4 5
Name
a 5 6 9 10 9
b 7 8 11 12 8
Looks like you need pivot after a groupby.cumcount():
df1 = df.groupby('Name').tail(5)
final = (df1.assign(k=df1.groupby('Name').cumcount() + 1)
            .pivot(index='Name', columns='k', values='Number')
            .reset_index()
            .rename_axis(None, axis=1))
print(final)
Name 1 2 3 4 5
0 a 5 6 9 10 9
1 b 7 8 11 12 8
