How to expand Python Pandas Dataframe in linearly spaced increments - python-3.x

Beginner question:
I have a pandas dataframe that looks like this:
x1 y1 x2 y2
0 0 2 2
10 10 12 12
and I want to expand that dataframe by half units along the x and y coordinates to look like this:
x1 y1 x2 y2 Interpolated_X Interpolated_Y
0 0 2 2 0 0
0 0 2 2 0.5 0.5
0 0 2 2 1 1
0 0 2 2 1.5 1.5
0 0 2 2 2 2
10 10 12 12 10 10
10 10 12 12 10.5 10.5
10 10 12 12 11 11
10 10 12 12 11.5 11.5
10 10 12 12 12 12
Any help would be much appreciated.

The cleanest way I know to expand rows like this is groupby.apply. Something like itertuples may be faster, but the code will be a little more complicated (keep that in mind if your dataset is large).
Group by the index, which sends each row to the apply function (the index has to be unique for each row; if it isn't, just run reset_index). The function can return a DataFrame, which is how one row expands into multiple rows.
Caveat: the x2-x1 and y2-y1 distances must be the same or this won't work.
import pandas as pd
import numpy as np
def expand(row):
    row = row.iloc[0]  # apply passes a DataFrame, so this gets a reference to the first and only row
    xdistance = (row.x2 - row.x1)
    ydistance = (row.y2 - row.y1)
    xsteps = np.arange(row.x1, row.x2 + .5, .5)  # create steps arrays
    ysteps = np.arange(row.y1, row.y2 + .5, .5)
    # you can expand lists in python by multiplying like this: [val] * 3 = [val, val, val]
    return (pd.DataFrame([row] * len(xsteps))
            .assign(int_x = xsteps, int_y = ysteps))
(df.groupby(df.index)  # "group" on each row
   .apply(expand)  # send row to expand function
   .reset_index(level=1, drop=True))  # groupby gives us an extra index we don't want
starting df
x1 y1 x2 y2
0 0 2 2
10 10 12 12
ending df
x1 y1 x2 y2 int_x int_y
0 0 0 2 2 0.0 0.0
0 0 0 2 2 0.5 0.5
0 0 0 2 2 1.0 1.0
0 0 0 2 2 1.5 1.5
0 0 0 2 2 2.0 2.0
1 10 10 12 12 10.0 10.0
1 10 10 12 12 10.5 10.5
1 10 10 12 12 11.0 11.0
1 10 10 12 12 11.5 11.5
1 10 10 12 12 12.0 12.0
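For a larger dataset, the same expansion can probably be done without apply. The following is only a minimal sketch, not the answer's method: it assumes a fixed step of 0.5, that x2-x1 equals y2-y1 and is a multiple of the step, and that the index is unique. The names step, counts and k are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [0, 10], "y1": [0, 10], "x2": [2, 12], "y2": [2, 12]})
step = 0.5  # assumed spacing between interpolated points

# Number of points to generate for each original row (both endpoints included).
counts = ((df["x2"] - df["x1"]) / step).round().astype(int) + 1

# Repeat every row counts[i] times; the original index labels are kept for now.
out = df.loc[df.index.repeat(counts.to_numpy())]

# Position of each repeated row within its original row: 0, 1, 2, ...
k = out.groupby(level=0).cumcount().to_numpy()

out = out.reset_index(drop=True)
out["Interpolated_X"] = out["x1"] + k * step
out["Interpolated_Y"] = out["y1"] + k * step
print(out)
```

This trades the per-row Python function for a few whole-column operations, which tends to matter once there are many rows.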

Related

Create Multiple rows for each value in given column in pandas df

I have a dataframe with points given in two columns x and y.
Thing x y length_x length_y
0 A 1 3 1 2
1 B 2 3 2 1
These (x,y) points are situated in the middle of one of the sides of a rectangle with side lengths length_x and length_y. What I wish to do is, for each of these points, give the coordinates of the corners of the rectangle it lies on. That is, the coordinates for Thing A would be:
(1+1*0.5, 3), (1-1*0.5,3), (1+1*0.5,3-2*0.5), (1-1*0.5, 3-2*0.5)
The half comes from the fact that the given points are midpoints of the object, so half the length is the distance from that point to the corner of the rectangle.
Hence my desired output is:
Thing x y Corner_x Corner_y length_x length_y
0 A 1 3 1.5 2.0 1 2
1 A 1 3 1.5 1.0 1 2
2 A 1 3 0.5 2.0 1 2
3 A 1 3 0.5 1.0 1 2
4 A 1 3 1.5 2.0 1 2
5 B 2 3 3.0 3.0 2 1
6 B 2 3 3.0 2.5 2 1
7 B 2 3 1.0 3.0 2 1
8 B 2 3 1.0 2.5 2 1
9 B 2 3 3.0 3.0 2 1
I tried to do this by defining a lambda returning two values, but failed. I even tried to create multiple columns and then stack them, but it's really dirty.
bb = []
for thing in list_of_things:
    new_df = df[df['Thing']=='{}'.format(thing)]
    df = df.sort_values('x',ascending=False)
    df['corner 1_x'] = df['x']+df['length_x']/2
    df['corner 1_y'] = df['y']
    df['corner 2_x'] = df['x']+1*df['length_x']/2
    df['corner 2_y'] = df['y']-df['length_y']/2
    .........
Note also that the first corner's coordinates need to be repeated, as I later want to use geopandas to transform each of these sets of coordinates into a POLYGON.
What I am looking for is a way to generate these rows in a fast and clean way.
You can use apply to create your corners as lists and explode them to the four rows per group.
Finally join the output to the original dataframe:
df.join(df.apply(lambda r: pd.Series({'corner_x': [r['x']+r['length_x']/2, r['x']-r['length_x']/2],
                                      'corner_y': [r['y']+r['length_y']/2, r['y']-r['length_y']/2],
                                      }), axis=1).explode('corner_x').explode('corner_y'),
        how='right')
output:
Thing x y length_x length_y corner_x corner_y
0 A 1 3 1 2 1.5 4
0 A 1 3 1 2 1.5 2
0 A 1 3 1 2 0.5 4
0 A 1 3 1 2 0.5 2
1 B 2 3 2 1 3 3.5
1 B 2 3 2 1 3 2.5
1 B 2 3 2 1 1 3.5
1 B 2 3 2 1 1 2.5
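The output above gives four corners per Thing but does not repeat the first one, which the question asks for. Here is a rough sketch of one way to get a closed ring of five rows. It assumes, like the answer above, that the rectangle is centered on (x, y); the question's own geometry puts the point on a side, so the sign lists would need adjusting for that. The names signs_x and signs_y are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Thing": ["A", "B"], "x": [1, 2], "y": [3, 3],
                   "length_x": [1, 2], "length_y": [2, 1]})

# Corner offsets walked around the rectangle; the first corner is repeated
# at the end so that the five points form a closed ring.
signs_x = [+1, +1, -1, -1, +1]
signs_y = [+1, -1, -1, +1, +1]

# Five rows per original row, then a fresh 0..n-1 index.
out = df.loc[df.index.repeat(len(signs_x))].reset_index(drop=True)

# Assumption: (x, y) is the center of the rectangle.
out["corner_x"] = out["x"] + out["length_x"] / 2 * np.tile(signs_x, len(df))
out["corner_y"] = out["y"] + out["length_y"] / 2 * np.tile(signs_y, len(df))
print(out)
```

Each group of five consecutive rows can then be passed on as one polygon's coordinate sequence.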

Pandas dataframe how to add column of distance from previous row?

I have a dataframe of locations:
df = X Y
1 1
2 1
2 1
2 2
3 3
5 5
5.5 5.5
I want to add a column with the distance to the previous point:
So it will be:
df = X Y Distance
1 1 0
2 1 1
2 1 0
2 2 1
3 3 2
5 5 2
5.5 5.5 1
What is the best way to do so?
You can use the pd.Series.diff method.
For instance, to compute the Euclidean distance, also using np.sqrt, you would do this:
import numpy as np
df["Distance"] = np.sqrt(df.X.diff()**2 + df.Y.diff()**2)
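Note that diff leaves NaN in the first row, since there is no previous point, while the question wants 0 there. A small self-contained sketch using the question's data; np.hypot is swapped in for the explicit square root, and fillna(0) is an assumption about how the first row should be handled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 2, 2, 3, 5, 5.5],
                   "Y": [1, 1, 1, 2, 3, 5, 5.5]})

# hypot(dx, dy) == sqrt(dx**2 + dy**2); the first row has no predecessor,
# so its NaN is replaced by 0.
df["Distance"] = np.hypot(df["X"].diff(), df["Y"].diff()).fillna(0)
print(df)
```

For the diagonal moves this yields the Euclidean values (for example about 1.41 from (2, 2) to (3, 3)), which differ from the rounded numbers shown in the question.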

Replace a column value with number using pandas

For the following dataset, I can replace column 1 with the numeric value easily.
df['1'].replace(['A', 'B', 'C', 'D'], [0, 1, 2, 3], inplace=True)
But if a column has 3600 or more distinct values, how can I replace them with numeric values without writing out every value of the column?
Please let me know. I don't understand how to do that. If anybody has any solution please share with me.
Thanks in advance.
import pandas as pd
df = pd.DataFrame({1:['A','B','C','C','D','A'],
2:[0.6,0.9,5,4,7,1,],
3:[0.3,1,0.7,8,2,4]})
print(df)
1 2 3
0 A 0.6 0.3
1 B 0.9 1.0
2 C 5.0 0.7
3 C 4.0 8.0
4 D 7.0 2.0
5 A 1.0 4.0
np.where makes it easy.
import numpy as np
df[1] = np.where(df[1]=="A", "0",
np.where(df[1]=="B", "1",
np.where(df[1]=="C","2",
np.where(df[1]=="D","3",np.nan))))
print(df)
1 2 3
0 0 0.6 0.3
1 1 0.9 1.0
2 2 5.0 0.7
3 2 4.0 8.0
4 3 7.0 2.0
5 0 1.0 4.0
But if you have a lot of categories, you might want to think about other ways.
import string
upper=list(string.ascii_uppercase)
a=pd.DataFrame({'Alp':upper})
print(a)
Alp
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
.
.
19 T
20 U
21 V
22 W
23 X
24 Y
25 Z
for k in np.arange(0,26):
    a=a.replace(to_replace =upper[k],value =k)
print(a)
Alp
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
.
.
.
21 21
22 22
23 23
24 24
25 25
If there are many values to replace, you can use factorize:
df[1] = pd.factorize(df[1])[0] + 1
print (df)
1 2 3
0 1 0.6 0.3
1 2 0.9 1.0
2 3 5.0 0.7
3 3 4.0 8.0
4 4 7.0 2.0
5 1 1.0 4.0
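If you also need to know which number stands for which original value, factorize returns the unique values alongside the codes. A short sketch on the question's data; the names codes, uniques and mapping are just for illustration, and the codes here start at 0 rather than 1:

```python
import pandas as pd

df = pd.DataFrame({1: ['A', 'B', 'C', 'C', 'D', 'A'],
                   2: [0.6, 0.9, 5, 4, 7, 1],
                   3: [0.3, 1, 0.7, 8, 2, 4]})

# codes: one integer per row, numbered in order of first appearance.
# uniques: the original values, positioned so that uniques[code] is the label.
codes, uniques = pd.factorize(df[1])

# Lookup table from original value to its code.
mapping = {value: code for code, value in enumerate(uniques)}

df[1] = codes
print(df)
print(mapping)
```

Keeping mapping around makes it possible to translate the numbers back later, or to encode new data consistently.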
You could do something like
df.loc[df['1'] == 'A','1'] = 0
df.loc[df['1'] == 'B','1'] = 1
### Or
keys = df['1'].unique().tolist()
i = 0
for key in keys:
    df.loc[df['1'] == key,'1'] = i
    i = i+1

Cannot assign row in pandas.Dataframe

I am trying to calculate the mean of the rows of a DataFrame which have the same value on a specified column col. However I'm stuck at assigning a row of the pandas DataFrame.
Here's my code:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col .
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(np.zeros(shape = (rows, len(data.columns))), columns = data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean().to_frame().transpose()
        res[i:i+1] = e
    return res
The problem is that the code only works for the first row and puts NaN values in the following rows. I have checked the value of e and confirmed it to be good, so the problem is with the assignment res[i:i+1] = e. I have also tried res.iloc[i] = e, but I get "ValueError: Incompatible indexer with Series". Is there an alternate way to do this? It seems very straightforward and I'm baffled why it doesn't work...
E.g:
wdata
Out[78]:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 0 0 0.0 -2.320000e-07 -4.862400e-08
1 1 0 0 0.1 -1.000000e-04 1.000000e-04
2 1 0 0 0.2 -1.000000e-03 1.000000e-03
3 1 0 0 0.3 -1.000000e-02 1.000000e-02
4 1 1 1 0.0 3.554000e-07 -2.012000e-07
5 1 2 2 0.0 5.353000e-08 -1.684000e-07
6 1 3 3 0.0 9.369400e-08 -2.121400e-08
7 1 4 4 0.0 3.286200e-08 -2.093600e-08
8 1 5 5 0.0 8.978600e-08 -3.262000e-07
9 1 6 6 0.0 3.624800e-08 -2.507600e-08
10 1 7 7 0.0 2.957000e-08 -1.993200e-08
11 1 8 8 0.0 7.732600e-08 -3.773200e-08
12 1 9 9 0.0 9.300000e-08 -3.521200e-08
13 1 10 10 0.0 8.468000e-09 -6.990000e-09
14 1 11 11 0.0 1.434200e-11 -1.200000e-11
15 2 0 0 0.0 8.118000e-11 -5.254000e-11
16 2 1 1 0.0 9.322000e-11 -1.359200e-10
17 2 2 2 0.0 1.944000e-10 -2.409400e-10
18 2 3 3 0.0 7.756000e-11 -8.556000e-11
19 2 4 4 0.0 1.260000e-11 -8.618000e-12
20 2 5 5 0.0 7.122000e-12 -1.402000e-13
21 2 6 6 0.0 6.224000e-11 -2.760000e-11
22 2 7 7 0.0 1.133400e-08 -6.566000e-09
23 2 8 8 0.0 6.600000e-13 -1.808000e-11
24 2 9 9 0.0 6.861000e-08 -4.063400e-08
25 2 10 10 0.0 2.743800e-10 -1.336000e-10
Expected output:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
Instead, what i get is:
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.00074 0.00074
0 NaN NaN NaN NaN NaN NaN
For example, this code results in:
In[81]: wdata[wdata['Die'] == 2].mean().to_frame().transpose()
Out[81]:
Die Subsite Algorithm Vt1 It1 Ignd
0 2 5.5 5.5 0 6.792247e-09 -4.023330e-09
This works for me:
def code(data, col):
    """ Finds average value of all rows that have identical col values from column col .
    Returns new Pandas.DataFrame with the data
    """
    values = pd.unique(data[col])
    rows = len(values)
    res = pd.DataFrame(columns = data.columns)
    for i, v in enumerate(values):
        e = data[data[col] == v].mean()
        res.loc[i,:] = e
    return res
col = 'Die'
print (code(data, col))
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -0.000739957 0.000739939
1 2 5 5 0 7.34067e-09 -4.35482e-09
but groupby with the mean aggregation gives the same output:
print (data.groupby(col, as_index=False).mean())
Die Subsite Algorithm Vt1 It1 Ignd
0 1 4.4 4.4 0.04 -7.399575e-04 7.399392e-04
1 2 5.0 5.0 0.00 7.340669e-09 -4.354818e-09
A few minutes after I posted the question I solved it by adding a .values to e.
e = data[data[col] == v].mean().to_frame().transpose().values
However it turns out that what I wanted to do is already done by Pandas. Thanks MaxU!
df.groupby(col).mean()
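The NaN rows in the question are most likely an index-alignment effect: e always carries index label 0, so assigning it to row i only lines up when i is 0, which would also explain why .values fixes it. Either way, the loop is not needed. A tiny sketch of the groupby version on made-up data (the values below are invented for illustration, not taken from wdata):

```python
import pandas as pd

# Hypothetical miniature of the question's table.
data = pd.DataFrame({"Die": [1, 1, 1, 2, 2],
                     "Subsite": [0, 1, 2, 0, 4],
                     "Vt1": [0.0, 0.5, 1.0, 0.0, 0.0]})

# One row per distinct Die, every other column averaged;
# as_index=False keeps Die as a regular column instead of the index.
res = data.groupby("Die", as_index=False).mean()
print(res)
```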

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new to Python.
I am trying to compute a cumulative sum for each client to see the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1's therefore needs to be reset when we have a 0. The reset needs to happen as well when we have a new client. See the example below, where a is the column of clients and b holds the dates.
After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume that I need to put them together.
Adapting the code of 'Cumsum reset at NaN' to reset at 0 instead works:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply, and within each group label the runs of contiguous equal values with (x != x.shift()).cumsum(). Then groupby.cumcount gives the integer position within each run, to which 1 is added.
Multiplying by the original column acts as an AND, cancelling the zeros so that only the 1's are counted.
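# group_keys=False keeps the original index so the result aligns with df on recent pandas versions
df['d'] = df.groupby('a', group_keys=False)['c'] \
    .apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))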
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way would be to apply a function after Series.expanding on the groupby object, which computes values on the series from the first index up to the current index.
reduce is then used to apply a function of two arguments cumulatively to the items of the iterable, so as to reduce it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
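Since calculation time matters here, it may be worth trying a version with no apply at all. This is only a sketch, assuming the rows are already ordered by client and date and that c contains only 0 and 1; the name block is made up:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'c': [1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]})

# A new block starts at every 0 and at every change of client.
block = (df['c'].eq(0) | df['a'].ne(df['a'].shift())).cumsum()

# Cumulative sum inside each block: the leading 0 contributes nothing,
# so the running count of 1's restarts after every reset point.
df['d'] = df['c'].groupby(block).cumsum()
print(df['d'].tolist())
```

Everything here is a whole-column operation, so no Python function is called per group or per row. I have not timed it against the two approaches above.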
I think you need a custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
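def f(x):
    # .ix was removed from pandas; .loc is the label-based equivalent
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    x.e = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return (x)
# group_keys=False keeps the original row index in the result
print (df.groupby('a', group_keys=False).apply(f))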
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4
