Create Multiple rows for each value in given column in pandas df - python-3.x

I have a dataframe with points given in two columns x and y.
Thing x y length_x length_y
0 A 1 3 1 2
1 B 2 3 2 1
These (x,y) points are situated in the middle of one of the sides of a rectangle with side lengths length_x and length_y. For each of these points, I want the coordinates of the corners of the rectangle it lies on. For example, the coordinates for Thing A would be:
(1+1*0.5, 3), (1-1*0.5,3), (1+1*0.5,3-2*0.5), (1-1*0.5, 3-2*0.5)
The factor of one half arises because each given point is the midpoint of a side, so half the side length is the distance from that point to a corner of the rectangle.
Hence my desired output is:
Thing x y Corner_x Corner_y length_x length_y
0 A 1 3 1.5 2.0 1 2
1 A 1 3 1.5 1.0 1 2
2 A 1 3 0.5 2.0 1 2
3 A 1 3 0.5 1.0 1 2
4 A 1 3 1.5 2.0 1 2
5 B 2 3 3.0 3.0 2 1
6 B 2 3 3.0 2.5 2 1
7 B 2 3 1.0 3.0 2 1
8 B 2 3 1.0 2.5 2 1
9 B 2 3 3.0 3.0 2 1
I tried to do this by defining a lambda that returns two values, but failed. I even tried creating multiple columns and then stacking them, but it's really dirty.
bb = []
for thing in list_of_things:
    new_df = df[df['Thing']=='{}'.format(thing)]
    df = df.sort_values('x',ascending=False)
    df['corner 1_x'] = df['x']+df['length_x']/2
    df['corner 1_y'] = df['y']
    df['corner 2_x'] = df['x']+df['length_x']/2
    df['corner 2_y'] = df['y']-df['length_y']/2
    .........
Note also that the first corner's coordinates need to be repeated, as I later want to use geopandas to transform each of these sets of coordinates into a POLYGON.
What I am looking for is a fast and clean way to generate these rows.

You can use apply to create the corner coordinates as lists, then explode them into four rows per point.
Finally, join the output to the original dataframe:
df.join(df.apply(lambda r: pd.Series({'corner_x': [r['x']+r['length_x']/2, r['x']-r['length_x']/2],
                                      'corner_y': [r['y']+r['length_y']/2, r['y']-r['length_y']/2],
                                     }), axis=1).explode('corner_x').explode('corner_y'),
        how='right')
output:
Thing x y length_x length_y corner_x corner_y
0 A 1 3 1 2 1.5 4
0 A 1 3 1 2 1.5 2
0 A 1 3 1 2 0.5 4
0 A 1 3 1 2 0.5 2
1 B 2 3 2 1 3 3.5
1 B 2 3 2 1 3 2.5
1 B 2 3 2 1 1 3.5
1 B 2 3 2 1 1 2.5
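If the closed outline is needed (first corner repeated, as the question asks for when building polygons later), a vectorized sketch along the following lines may work. It assumes the same plus/minus half-length convention as the answer above; the `signs` array, its corner ordering, and the `out` name are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Thing": ["A", "B"],
    "x": [1, 2],
    "y": [3, 3],
    "length_x": [1, 2],
    "length_y": [2, 1],
})

# Sign pairs that walk around the rectangle and return to the first corner
# (hypothetical ordering; adjust it to the geometry you actually need).
signs = np.array([(1, 1), (1, -1), (-1, -1), (-1, 1), (1, 1)])

# Repeat every input row once per corner, then tile the signs to match.
out = df.loc[df.index.repeat(len(signs))].reset_index(drop=True)
sx = np.tile(signs[:, 0], len(df))
sy = np.tile(signs[:, 1], len(df))

out["corner_x"] = out["x"] + sx * out["length_x"] / 2
out["corner_y"] = out["y"] + sy * out["length_y"] / 2
print(out)
```

Because the corners come out in ring order, each group of five rows can be passed on as one closed coordinate sequence.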

Related

Pandas dataframe how to add column of distance from previous row?

I have a dataframe of locations:
df = X Y
1 1
2 1
2 1
2 2
3 3
5 5
5.5 5.5
I want to add a column with the distance to the previous point:
So it will be:
df = X Y Distance
1 1 0
2 1 1
2 1 0
2 2 1
3 3 2
5 5 2
5.5 5.5 1
What is the best way to do so?
You can use the pd.Series.diff method.
For instance, to compute the Euclidean distance, combine it with np.sqrt like this:
import numpy as np
df["Distance"] = np.sqrt(df.X.diff()**2 + df.Y.diff()**2)
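As a self-contained sketch of the same idea on the question's data: the first row has no predecessor, so `diff()` yields NaN there, and filling it with 0 is an assumption about the desired result. Diagonal steps come out as values such as sqrt(2), roughly 1.41, rather than the rounded figures in the question's table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 2, 2, 3, 5, 5.5],
                   "Y": [1, 1, 1, 2, 3, 5, 5.5]})

# diff() gives the step from the previous row; the first row has no
# predecessor, so its distance is NaN and is filled with 0 here.
df["Distance"] = np.sqrt(df["X"].diff() ** 2 + df["Y"].diff() ** 2).fillna(0)
print(df)
```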

Calculate column by condition

I have table in df:
X1 X2
1 1
1 2
2 2
2 2
3 3
3 3
I want to calculate Y, where Y = Yprevious + 1 if X1 = X1previous and X2 = X2previous, else 0. Y on the first line is 0. Example:
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
Not a duplicate... Previously, the question was simpler - addition with a value in a specific line. Now the term appears in the calculation process. I need some cumulative calculation
What I need, more example:
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
3 3 1
2 2 0
What I get from the linked duplicate:
X1 X2 Y
1 1 0
2 2 0
2 2 1
2 2 2
2 2 3
3 3 0
3 3 1
2 2 4
Use GroupBy.cumcount with new columns by consecutive values:
df1 = df[['X1','X2']].ne(df[['X1','X2']].shift()).cumsum()
df['Y'] = df.groupby([df1['X1'], df1['X2']]).cumcount()
print (df)
X1 X2 Y
0 1 1 0
1 2 2 0
2 2 2 1
3 2 2 2
4 2 2 3
5 3 3 0
6 3 3 1
7 2 2 0
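A variation on the same idea, shown here as a sketch rather than as the answer's exact code: build a single run identifier that increases whenever either column changes, then count within each run. The names `cols`, `new_run`, and `run_id` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 2, 2, 2, 3, 3, 2],
                   "X2": [1, 2, 2, 2, 2, 3, 3, 2]})

cols = ["X1", "X2"]
# A new run starts whenever either column differs from the previous row
# (the first row always starts a run because it is compared with NaN).
new_run = df[cols].ne(df[cols].shift()).any(axis=1)
run_id = new_run.cumsum()

# Number the rows inside each run, starting at 0.
df["Y"] = df.groupby(run_id).cumcount()
print(df)
```

Because the run identifier never repeats, the final `2 2` row starts again at 0 instead of continuing the earlier `2 2` run.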

Pandas how to turn each group into a dataframe using groupby

I have a dataframe that looks like this:
A B
1 2
1 3
1 4
2 5
2 6
3 7
3 8
If I call df.groupby('A'), how do I turn each group into a sub-dataframe, so that it looks like this for A=1:
A B
1 2
1 3
1 4
for A=2,
A B
2 5
2 6
for A=3,
A B
3 7
3 8
By using get_group
g=df.groupby('A')
g.get_group(1)
Out[367]:
A B
0 1 2
1 1 3
2 1 4
You are close; you need to convert the groupby object to a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('A')))
print (dfs[1])
A B
0 1 2
1 1 3
2 1 4
print (dfs[2])
A B
3 2 5
4 2 6
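The same conversion can be written as a dict comprehension, which makes the (key, sub-DataFrame) pairs explicit. This is a minimal sketch on the question's data; the name `dfs` follows the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 3, 3],
                   "B": [2, 3, 4, 5, 6, 7, 8]})

# Iterating a GroupBy object yields (key, sub-DataFrame) pairs.
dfs = {key: group for key, group in df.groupby("A")}

print(dfs[3])
```

Each sub-DataFrame keeps its original index labels; call `reset_index(drop=True)` on a group if a fresh index is preferred.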

Conditional cumulative sum in Python/Pandas

Consider my dataframe, df:
data data_binary sum_data
2 1 1
5 0 0
1 1 1
4 1 2
3 1 3
10 0 0
7 0 0
3 1 1
How can I calculate the cumulative sum of data_binary within groups of contiguous 1 values?
The first group of 1's had a single 1 and sum_data has only a 1. However, the second group of 1's has 3 1's and sum_data is [1, 2, 3].
I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but that returns
array([1, 0, 2, 3, 4, 0, 0, 5])
Which is not what I want.
You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.
b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
Output
0 1
1 0
2 1
3 2
4 3
5 0
6 0
7 1
Name: data_binary, dtype: int64
Explanation
Let's start by looking at each step side by side
cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
b, c,
c.mask(b != 0),
c.mask(b != 0).ffill(),
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
], axis=1, keys=cols))
Output
data_binary cumulative_sum nan_non_zero forward_fill final_result
0 1 1 NaN NaN 1
1 0 1 1.0 1.0 0
2 1 2 NaN 1.0 1
3 1 3 NaN 1.0 2
4 1 4 NaN 1.0 3
5 0 4 4.0 4.0 0
6 0 4 4.0 4.0 0
7 1 5 NaN 4.0 1
The problem with cumulative_sum is that the rows where data_binary is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I slice the cumulative sum where data_binary is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.
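The reset trick described above can be reproduced end to end with the following sketch. It uses `where(b == 0)`, which is equivalent to `mask(b != 0)`, and an explicit `fillna(0)` in place of `fill_value=0`; the `baseline` name is illustrative.

```python
import pandas as pd

b = pd.Series([1, 0, 1, 1, 1, 0, 0, 1], name="data_binary")
c = b.cumsum()

# Running total as it stood at the most recent zero (0 before any zero).
baseline = c.where(b == 0).ffill().fillna(0)

# Subtracting the baseline restarts the count after every zero.
result = (c - baseline).astype(int)
print(result.tolist())
```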
I think you can use groupby with cumsum on the Series: first compare each value with the shifted column for inequality (ne), then create group ids with cumsum. Finally, use mask to set the result to 0 wherever data_binary is 0:
print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
Name: data_binary, dtype: int32
df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum()).cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
data data_binary sum_data sum_data1
0 2 1 1 1
1 5 0 0 0
2 1 1 1 1
3 4 1 2 2
4 3 1 3 3
5 10 0 0 0
6 7 0 0 0
7 3 1 1 1
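A compact, runnable sketch of this run-based approach on the question's data follows; the names `s` and `run_id` are illustrative. For a 0/1 column, runs of zeros already sum to zero within their own group, so the final mask step acts as a safeguard rather than changing the result here.

```python
import pandas as pd

df = pd.DataFrame({"data_binary": [1, 0, 1, 1, 1, 0, 0, 1]})
s = df["data_binary"]

# The id increases by one every time the value changes, so each run of
# equal consecutive values gets its own group.
run_id = s.ne(s.shift()).cumsum()

# Cumulative sum restarted at the beginning of every run.
df["sum_data1"] = s.groupby(run_id).cumsum()
print(df)
```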
If you want piRSquared's excellent answer as a single statement:
df['sum_data'] = df[['data_binary']].apply(
    lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int),
    axis=0)
Note that the double square brackets on the right-hand side are necessary to select a one-column DataFrame instead of a Series, so that apply can be used with the axis argument (which Series.apply does not accept).

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new to Python.
I am trying to compute a cumulative sum for each client to count the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1s therefore needs to be reset whenever a 0 occurs. It also needs to be reset when a new client starts. See the example below, where a is the client column and b holds the dates.
After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume I need to combine them.
Adapting the code from 'Cumsum reset at NaN' so that it resets at 0 works:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply and, within each group, label runs of contiguous values with cumsum. Then use groupby.cumcount to number the rows within each run, adding 1 so the count starts at 1.
Multiplying by the original column acts as a logical AND: it zeroes out the rows where the flag is 0 and keeps the count only where it is 1.
df['d'] = df.groupby('a')['c'] \
.apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way is to apply a function after Series.expanding on the groupby object, which evaluates the function on the values from the first index of each group up to the current index.
Then use reduce to apply a two-argument function cumulatively to those values, reducing them to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
I think you need a custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
    # .ix was removed in pandas 1.0; .loc is the label-based replacement
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    x.e = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return (x)
print (df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4
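Since calculation time matters for a large dataset, a fully vectorized variant that avoids a Python-level function per group may be worth trying. The sketch below is an alternative formulation, not any of the answers' exact code: it starts a new run whenever the flag or the client changes, and the names `new_run` and `run_id` are illustrative. It uses the modified test data above, where row 6 is 1, so the reset at the client boundary is exercised.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1] * 7 + [2] * 7,
    "c": [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
})

# A new run starts when the flag changes or when a new client begins.
new_run = df["c"].ne(df["c"].shift()) | df["a"].ne(df["a"].shift())
run_id = new_run.cumsum()

# Position inside the run (1-based), zeroed out where the flag is 0.
df["d"] = df["c"] * (df.groupby(run_id).cumcount() + 1)
print(df)
```

This assumes the rows are already ordered by client and date, as in the example data.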
