Iterating through a data frame and grouping values in a range - python-3.x

I have a python data frame of weekly data like this :
Week Val
1 11
2 11
3 11
4 11
5 9
6 9
7 9
8 9
I would like create an output table like this:
Week 1 Week 2 Val
1 4 11
5 8 9
Apologies, I am quite new to python and its iterative tools. I am not sure how to solve this problem.
I tried to match using the previous row columns but I do not think how to go further:
df['Match'] = df['Val'].eq(df['Val'].shift(-1))

You want to groupby the consecutive blocks of Val. So you can use cumsum on the non-zero differences to get the block:
blocks = df['Val'].ne(df['Val'].shift(1)).cumsum()
(df.groupby(blocks, as_index=False)
.agg(Week1=('Week','min'), Week2=('Week','max'), Val=('Val', 'first'))
)
Or you can chain:
(df.groupby(df['Val'].ne(df['Val'].shift(1)).cumsum(), as_index=False)
.agg(Week1=('Week','min'), Week2=('Week','max'),Val=('Val', 'first'))
)
Output:
Week1 Week2 Val
0 1 4 11
1 5 8 9

Related

Calculate mean value by interval coordinates in pandas

I have a dataframe such as :
Name Position Value
A 1 10
A 2 11
A 3 10
A 4 8
A 5 6
A 6 12
A 7 10
A 8 9
A 9 9
A 10 9
A 11 9
A 12 9
and I woulde like for each interval of 3 position, to calculate the mean of Values.
And create a new df with start and end coordinates (of length 3 then), with the Mean_value column.
Name Start End Mean_value
A 1 3 10.33 <---- here this is (10+11+10)/3 = 10.33
A 4 6 8.7
A 7 9 9.3
A 10 13 9
Does someone have an idea using pandas please ?
Solution for get each 3 rows (if exist) per Name groups - first get counter by GroupBy.cumcount with integer division and pass it to named aggregations:
g = df.groupby('Name').cumcount() // 3
df = df.groupby(['Name',g]).agg(Start=('Position','first'),
End=('Position','last'),
Value=('Value','mean')).droplevel(1).reset_index()
print (df)
Name Start End Value
0 A 1 3 10.333333
1 A 4 6 8.666667
2 A 7 9 9.333333
3 A 10 12 9.000000

compare two data frames and update value in one data frame by comparing another data frame value

I have two data frames. Examples:
df1:
A B C
5 7 6
8 1 1
1 0 7
3 4 9
5 7 4
9 2 0
df2:
A B C
3 2 1
6 5 7
9 7 9
1 1 2
6 4 5
0 8 6
Both data frames have same index.
What I want is , wherever df1's value is less than 5,
I want to update df2's value to 0, else keep it same.
I tried the following code:
df2[df1<5]=0
but when I am printing df2, its showing same values as original df2.
I know I am missing something really simple.
Please help me.
Thank you.

pandas random shuffling dataframe with constraints

I have a dataframe that I need to randomise in a very specific way with a particular rule, and I'm a bit lost. A simplified version is here:
idx type time
1 a 1
2 a 1
3 a 1
4 b 2
5 b 2
6 b 2
7 a 3
8 a 3
9 a 3
10 b 4
11 b 4
12 b 4
13 a 5
14 a 5
15 a 5
16 b 6
17 b 6
18 b 6
19 a 7
20 a 7
21 a 7
If we consider this as containing seven "bunches", I'd like to randomly shuffle by those bunches, i.e. retaining the time column. However, the constraint is that after shuffling, a particular bunch type (a or b in this case) cannot appear more than n (e.g. 2) times in a row. So an example correct result looks like this:
idx type time
21 a 7
20 a 7
19 a 7
7 a 3
8 a 3
9 a 3
17 b 6
16 b 6
18 b 6
6 b 2
5 b 2
4 b 2
2 a 1
3 a 1
1 a 1
14 a 5
13 a 5
15 a 5
12 b 4
11 b 4
10 b 4
I was thinking I could create a separate "order" array from 1 to 7 and np.random.shuffle() it, then sort the dataframe by time in that order, which will probably work - I can think of ways to do that part, but I'm especially struggling with the rule of restricting the number of repeats.
I know roughly that I should use a while loop, shuffle it in that way, loop over the frame and track the number of consecutive types, if it exceeds my n then break out and start the while loop again until it completes without breaking out, in which case set a value to end the while loop. But this got so messy and didn't work.
Any ideas?
See if this works.
import pandas as pd
import numpy as np
n = [['a',1],['a',1],['a',1],
['b',2],['b',2],['b',2],
['a',3],['a',3],['a',3]]
df = pd.DataFrame(n)
df.columns = ['type','time']
print(df)
order = np.unique(np.array(df['time']))
print("Before Shuffling",order)
np.random.shuffle(order)
print("Shuffled",order)
n =2
for i in order:
print(df[df['time']==i].iloc[0:n])

How to select last 5 rows of each unique records in pandas

Using python 3 am trying for each uniqe row in the column 'Name' to get the last 5 records from the column 'Number'. How exactly can this be done in python?
My df looks like this:
Name Number
a 5
a 6
b 7
b 8
a 9
a 10
b 11
b 12
a 9
b 8
I saw same exmples(like this one Get sum of last 5 rows for each unique id ) in SQL but that is time consuming and I would like to learn how to do it in python.
My expected output df would be like this:
Name 1 2 3 4 5
a 5 6 9 10 9
b 7 8 11 12 8
I think you need something like this:
df_out = df.groupby('Name').tail(5)
df_out.set_index(['Name', df_out.groupby('Name').cumcount() +1])['Number'].unstack()
Output:
1 2 3 4 5
Name
a 5 6 9 10 9
b 7 8 11 12 8
Looks like you need pivot after a groupby.cumcount()
df1=df.groupby('Name').tail(5)
final=(df1.assign(k=df1.groupby('Name').cumcount()+1)
.pivot(index='Name', columns='k', values='Number')
.reset_index().rename_axis(None, axis=1))
print(final)
Name 1 2 3 4 5
0 a 5 6 9 10 9
1 b 7 8 11 12 8

Pandas DataFrame: how do we keep columns based on the index name?

I seem to run into some python or enumerate bugs that I am not quite sure how to fix it (See here for more details).
Long story short, I desire to see multiple data sets that has a column name of 0,4,6,8,10,12,14.
0 4 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
But my current data looks like the following
0 4 2 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
Therefore, I would like to add a code that keeps columns based on the index number (including only 0,4,6,8,10,12).
Is there a pandas function that can help with this?

Resources