assign a new column in pd.dataframe (pandas) by existing column and defining its starting position - python-3.x

assign a new column in pd.dataframe (pandas) by existing column and defining its starting position. (to shift the values of a col;umn up or down)
For example i cretae a new panda dataframe (pd.dataframe), and then from this column, i want to make a new column, that will start from its e.g. 2nd, 3rd, 4rth element, and the rest values to be replaced with a specific value / string. Can this be done easily, in one line?
# Creating a dataframe with 10 ones:
ww=pd.DataFrame(10*[1])
# then try to assign a new column, that will start from its 2nd element
# however, for soem reason (can you expalin it why?), it places a "nan"
# as the first element in that new column.
ww=ww.assign(ww=ww[1:])
ww
the output given is that:
0 ww
0 1 NaN
1 1 1.0
2 1 1.0
3 1 1.0
4 1 1.0
5 1 1.0
6 1 1.0
7 1 1.0
8 1 1.0
9 1 1.0
while the expected output will be:
0 ww
0 1 1.0
1 1 1.0
2 1 1.0
3 1 1.0
4 1 1.0
5 1 1.0
6 1 1.0
7 1 1.0
8 1 1.0
9 1 NaN

Related

Create Multiple rows for each value in given column in pandas df

I have a dataframe with points given in two columns x and y.
Thing x y length_x length_y
0 A 1 3 1 2
1 B 2 3 2 1
These (x,y) points are situated in the middle of one of the sides of a rectangle with vertex lengths length_x and length_y. What I wish to do is for each of these points give the coordinates of the rectangles they are on. That is: the following coordinated for Thing A would be:
(1+1*0.5, 3), (1-1*0.5,3), (1+1*0.5,3-2*0.5), (1-1*0.5, 3-2*0.5)
The half comes from the fact that the given lengths are the middle-points of an object so half the length is the distance from that point to the corner of the rectangle.
Hence my desired output is:
Thing x y Corner_x Corner_y length_x length_y
0 A 1 3 1.5 2.0 1 2
1 A 1 3 1.5 1.0 1 2
2 A 1 3 0.5 2.0 1 2
3 A 1 3 0.5 1.0 1 2
4 A 1 3 1.5 2.0 1 2
5 B 2 3 3.0 3.0 2 1
6 B 2 3 3.0 2.5 2 1
7 B 2 3 1.0 3.0 2 1
8 B 2 3 1.0 2.5 2 1
9 B 2 3 3.0 3.0 2 1
I tried to do this with defining a lambda returning two value but failed. Tried even to create multiple columns and then stack them, but it's really dirty.
bb = []
for thing in list_of_things:
new_df = df[df['Thing']=='{}'.format(thing)]
df = df.sort_values('x',ascending=False)
df['corner 1_x'] = df['x']+df['length_x']/2
df['corner 1_y'] = df['y']
df['corner 2_x'] = df['x']+1df['x_length']/2
df['corner 2_y'] = df['y']-df['length_y']/2
.........
Note also that the first corner's coordinates need to be repeated as I later what to use geopandas to transform each of these sets of coordinates into a POLYGON.
What I am looking for is a way to generate these rows is a fast and clean way.
You can use apply to create your corners as lists and explode them to the four rows per group.
Finally join the output to the original dataframe:
df.join(df.apply(lambda r: pd.Series({'corner_x': [r['x']+r['length_x']/2, r['x']-r['length_x']/2],
'corner_y': [r['y']+r['length_y']/2, r['y']-r['length_y']/2],
}), axis=1).explode('corner_x').explode('corner_y'),
how='right')
output:
Thing x y length_x length_y corner_x corner_y
0 A 1 3 1 2 1.5 4
0 A 1 3 1 2 1.5 2
0 A 1 3 1 2 0.5 4
0 A 1 3 1 2 0.5 2
1 B 2 3 2 1 3 3.5
1 B 2 3 2 1 3 2.5
1 B 2 3 2 1 1 3.5
1 B 2 3 2 1 1 2.5

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two table so that the margin will only be joined to the product with maximum sequence number, the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
.transform('max').ne(new_df['sequence']), 'margin'] = np.NAN
Another option would be to assign a temp column to both frames df_item with True where the value is maximal, and df_margin is True everywhere then merge outer and drop the temp column:
new_df = (
df_item.assign(
t=df_item
.groupby('item')['sequence']
.transform('max')
.eq(df_item['sequence'])
).merge(df_margin.assign(t=True), how='outer').drop('t', 1)
)
Both produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')

Pandas: GroupBy Shift And Cumulative Sum

I want to do groupby, shift and cumsum which seems pretty trivial task but still banging my head over the result I'm getting. Can someone please tell what am I doing wrong. All the results I found online shows the same or the same variation of what I am doing. Below is my implementation.
temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID')['X'].cumsum().shift()
print(temp)
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 3.0
4 b 1 1.0
5 b 1 2.0
6 c 1 3.0
7 c 1 1.0
This is wrong because the actual or what I am looking for is as below:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
Thanks a lot in advance.
You could use transform() to feed the separate groups that are created at each level of groupby into the cumsum() and shift() methods.
temp['transformed'] = \
temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
For more info on transform() please see here:
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#Transformation
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#transformation
You need using apply , since one function is under groupby object which is cumsum another function shift is for all df
temp['transformed'] = temp.groupby('ID')['X'].apply(lambda x : x.cumsum().shift())
temp
Out[287]:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
While working on this problem, as the DataFrame size grows, using lambdas on transform starts to get very slow. I found out that using some DataFrameGroupBy methods (like cumsum and shift instead of lambdas are much faster.
So here's my proposed solution, creating a 'temp' column to save the cumsum for each ID and then shifting in a different groupby:
df['temp'] = df.groupby("ID")['X'].cumsum()
df['transformed'] = df.groupby("ID")['temp'].shift()
df = df.drop(columns=["temp"])

python-3: how to create a new pandas column as subtraction of two consecutive rows of another column?

I have a pandas dataframe
x
1
3
4
7
10
I want to create a new column y as y[i] = x[i] - x[i-1] (and y[0] = x[0]).
So the above data frame will become:
x y
1 1
3 2
4 1
7 3
10 3
How to do that with python-3? Many thanks
Using .shift() and fillna():
df['y'] = (df['x'] - df['x'].shift(1)).fillna(df['x'])
To explain what this is doing, if we print(df['x'].shift(1)) we get the following series:
0 NaN
1 1.0
2 3.0
3 4.0
4 7.0
Which is your values from 'x' shifted down one row. The first row gets NaN because there is no value above it to shift down. So, when we do:
print(df['x'] - df['x'].shift(1))
We get:
0 NaN
1 2.0
2 1.0
3 3.0
4 3.0
Which is your subtracted values, but in our first row we get a NaN again. To clear this, we use .fillna(), telling it that we want to just take the value from df['x'] whenever a null value is encountered.

How to get the unique values from pandas dataframe based on same id of other column

I have pandas dataframe as follows:
user id
1 2
1 2
1 2
1 3
1 3
i want to group by values like this:
(1,1,1,2),(1,1,3)
I am using this and it is giving unique values of one column only
pd.unique(df[['id']].values.ravel())
but i want to groupby values with unique of id column using pandas.
One way, seems self-explanatory:
df = df.sort_values(['user', 'id'])
df['groups'] = (df.id!=df.id.shift()).cumsum() # pattern to number groups
df
Out[26]:
user id groups
0 1 2 1
1 1 2 1
2 1 2 1
3 1 3 2
4 1 3 2
df.id = df.id.drop_duplicates('last').reindex_like(df)
df
Out[28]:
user id groups
0 1 NaN 1
1 1 NaN 1
2 1 2.0 1
3 1 NaN 2
4 1 3.0 2
df.set_index('groups').stack()
Out[30]:
groups
1 user 1.0
user 1.0
user 1.0
id 2.0
2 user 1.0
user 1.0
id 3.0
dtype: float64
df.groupby(level=0).apply(tuple)
Out[36]:
groups
1 (1.0, 1.0, 1.0, 2.0)
2 (1.0, 1.0, 3.0)
dtype: object

Resources