How to get the unique values from a pandas dataframe based on the same id of another column

I have a pandas dataframe as follows:
user id
1 2
1 2
1 2
1 3
1 3
I want to group the values like this:
(1,1,1,2),(1,1,3)
I am using this, but it gives the unique values of one column only:
pd.unique(df[['id']].values.ravel())
What I want is to group the values, keeping only the unique id per group, using pandas.

One way; the steps are fairly self-explanatory:
df = df.sort_values(['user', 'id'])
df['groups'] = (df.id!=df.id.shift()).cumsum() # pattern to number groups
df
Out[26]:
user id groups
0 1 2 1
1 1 2 1
2 1 2 1
3 1 3 2
4 1 3 2
df.id = df.id.drop_duplicates(keep='last').reindex_like(df)
df
Out[28]:
user id groups
0 1 NaN 1
1 1 NaN 1
2 1 2.0 1
3 1 NaN 2
4 1 3.0 2
df.set_index('groups').stack()
Out[30]:
groups
1 user 1.0
user 1.0
user 1.0
id 2.0
2 user 1.0
user 1.0
id 3.0
dtype: float64
df.set_index('groups').stack().groupby(level=0).apply(tuple)
Out[36]:
groups
1 (1.0, 1.0, 1.0, 2.0)
2 (1.0, 1.0, 3.0)
dtype: object
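If you want to skip the intermediate NaN step, here is a more compact sketch of the same idea (my own variant, not part of the answer above): number consecutive runs of id, then build each tuple directly from the group.
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 1, 1, 1], 'id': [2, 2, 2, 3, 3]})
runs = (df['id'] != df['id'].shift()).cumsum()  # number consecutive id runs
out = df.groupby(runs).apply(lambda g: tuple(g['user']) + (g['id'].iat[0],))
print(out.tolist())  # [(1, 1, 1, 2), (1, 1, 3)]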

Related

Taking different records from groups using group by in pandas

Suppose I have a dataframe like this
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
Now I want all records from each group, by group id, except the last 3. That means I want to drop the last 3 records from every group. How can I do it using pandas groupby? This is dummy data.
Use GroupBy.cumcount with ascending=False to get a counter from the back of each group, then keep rows where the counter is greater than 2 with Series.gt (2 rather than 3, because Python counts from 0):
df = df[df.groupby('id').cumcount(ascending=False).gt(2)]
print (df)
id value
3 2 1
Details:
print (df.groupby('id').cumcount(ascending=False))
0 2
1 1
2 0
3 3
4 2
5 1
6 0
7 0
8 0
dtype: int64
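An equivalent sketch (my own alternative, not from the answer above): GroupBy.apply with positional slicing; iloc[:-3] naturally returns an empty frame for groups of 3 or fewer rows.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
# drop the last 3 rows of each id group; groups with <= 3 rows disappear
print(df.groupby('id', group_keys=False).apply(lambda g: g.iloc[:-3]))
#    id  value
# 3   2      1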

Create Python function to look for ANY NULL value in a group

I am trying to write a function that will check a specified column for nulls, within a group in a dataframe. The example dataframe has two columns, ID and VALUE. Multiple rows exist per ID. I want to know if ANY of the rows for a particular ID have a NULL value in VALUE.
I have tried building the function with iterrows().
df = pd.DataFrame({'ID':[1,2,2,3,3,3],
                   'VALUE':[50,None,30,20,10,None]})

def nullValue(col):
    for i, row in col.iterrows():
        if ['VALUE'] is None:
            return 1
        else:
            return 0

df2 = df.groupby('ID').apply(nullVALUE)
df2.columns = ['ID','VALUE','isNULL']
df2
I am expecting to retrieve a dataframe with three columns: ID, VALUE, and isNULL. If any row in a grouped ID has a null, all of the rows for that ID should have a 1 under isNULL.
Example:
ID VALUE isNULL
1 50.0 0
2 NaN 1
2 30.0 1
3 20.0 1
3 10.0 1
3 NaN 1
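For reference, here is a minimal corrected sketch of what the attempted function seems to be aiming for (my own fix; null_flag is a hypothetical name, and a whole-group check replaces the per-row iteration). The answer below avoids explicit loops entirely.
import pandas as pd

def null_flag(group):
    # flag every row of the group if ANY VALUE in it is null
    return group.assign(isNULL=int(group['VALUE'].isnull().any()))

df = pd.DataFrame({'ID':[1,2,2,3,3,3],
                   'VALUE':[50,None,30,20,10,None]})
df2 = df.groupby('ID', group_keys=False).apply(null_flag)
print(df2)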
A quick solution, borrowed partially from another answer, is to use groupby with transform:
df = pd.DataFrame({'ID':[1,2,2,3,3,3,3],
'VALUE':[50,None,None,30,20,10,None]})
df['isNULL'] = (df.VALUE.isnull().groupby([df['ID']]).transform('sum') > 0).astype(int)
Out[51]:
ID VALUE isNULL
0 1 50.0 0
1 2 NaN 1
2 2 NaN 1
3 3 30.0 1
4 3 20.0 1
5 3 10.0 1
6 3 NaN 1
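If your pandas version supports string aliases in transform, transform('any') expresses the same check a little more directly (my own variant, not part of the answer above):
# any() per group instead of summing and comparing to 0
df['isNULL'] = df['VALUE'].isnull().groupby(df['ID']).transform('any').astype(int)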

Pandas: GroupBy Shift And Cumulative Sum

I want to do groupby, shift and cumsum, which seems like a pretty trivial task, but I'm still banging my head over the result I'm getting. Can someone please tell me what I am doing wrong? All the results I found online show the same approach, or a variation, of what I am doing. Below is my implementation.
temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID')['X'].cumsum().shift()
print(temp)
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 3.0
4 b 1 1.0
5 b 1 2.0
6 c 1 3.0
7 c 1 1.0
This is wrong; the actual result I am looking for is below:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
Thanks a lot in advance.
You could use transform() to feed each group created by groupby into the cumsum() and shift() methods.
temp['transformed'] = \
temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
For more info on transform() please see here:
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#Transformation
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#transformation
You need apply here: in your code only cumsum runs inside the groupby, while the chained shift runs over the whole DataFrame. Keeping both calls inside the apply fixes that:
temp['transformed'] = temp.groupby('ID')['X'].apply(lambda x : x.cumsum().shift())
temp
Out[287]:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
While working on this problem, I found that as the DataFrame size grows, using lambdas with transform gets very slow. The built-in DataFrameGroupBy methods (like cumsum and shift) are much faster than lambdas.
So here's my proposed solution: create a 'temp' column to save the cumsum for each ID, then shift it in a second groupby:
df['temp'] = df.groupby("ID")['X'].cumsum()
df['transformed'] = df.groupby("ID")['temp'].shift()
df = df.drop(columns=["temp"])
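The same two steps can also be chained without the temporary column (my own sketch, rebuilding the temp frame from the question): cumsum within each ID, then shift that result within the same groups.
import pandas as pd

temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]],
                    columns=['ID','X'])
# cumulative sum per ID, then shift per ID, in one expression
temp['transformed'] = temp.groupby('ID')['X'].cumsum().groupby(temp['ID']).shift()
print(temp)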

How to change a value of a cell that contains nan to another specific value?

I have a dataframe that contains NaN values in a particular column. While iterating through the rows, if I come across a NaN (using the isnan() method), I need to change it to some other value (since I have some conditions). I tried using replace() and fillna() with the limit parameter, but they modify the whole column as soon as they come across the first NaN value. Is there any method that lets me assign a value to a specific NaN rather than changing all the values of a column?
Example: the dataframe looks like it:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect the result to be like
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I want to change the target_class values based on some conditions and assign values from the above list.
I believe you need to replace the NaN values with 1 only for the indexes specified in list idx:
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
EDIT:
From the comments, the solution is to assign values by index and column with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
I suppose your conditions for imputing the NaN values do not depend on the number of them in a column. In the code below I stored all the imputation rules in one function that receives as parameters the entire row (containing the NaN) and the column you are investigating. If you also need the whole dataframe for the imputation rules, just pass it through the replace_nan function. In the example I impute the col element with the mean of the other columns.
import pandas as pd
import numpy as np
def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row
df = pd.DataFrame(np.random.rand(5,3), columns = ['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
The only thing that you should do is make the right assignment; that is, assign values in the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in the 'label' column. Then, we load the dataset:
df = pd.read_csv('dataset.csv')
Now, we build the appropriate condition:
cond = df['label'].isnull()
Now, we make the assignment over these rows (I don't know your assignment logic, so I assign the value 1 to the NaNs):
df.loc[cond,'label'] = 1
There are other, more concise approaches; the fillna() method could also be used. You would need to share your assignment logic to take this further.
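For completeness, a hedged sketch of that fillna() route (my own illustration, reusing the dataset.csv and cond mask above): build a Series of fill values for the null rows and let index alignment do the assignment.
import pandas as pd

df = pd.read_csv('dataset.csv')
cond = df['label'].isnull()
fill = pd.Series(1.0, index=df.index[cond])  # replace 1.0 with your per-row logic
df['label'] = df['label'].fillna(fill)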

Multiple columns difference of 2 Pandas DataFrame

I am new to Python and pandas; can someone help me with the report below?
I want to report the difference of N columns and create new columns holding the difference values. Is it possible to make this dynamic, as I have more than 30 columns? (The number of columns is fixed; the row values can change.)
Columns A and B can be alphanumeric.
Use join with sub for the difference of the DataFrames:
#if the columns are strings, cast them first
df1 = df1.astype(int)
df2 = df2.astype(int)
#if first columns are not indices
#df1 = df1.set_index('ID')
#df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print (df)
A B sumA sumB
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Or similar:
df = df1.join(df2.sub(df1), rsuffix='sum')
print (df)
A B Asum Bsum
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Detail:
print (df2.sub(df1))
A B
ID
0 5 3.0
1 6 5.0
2 7 5.0
IIUC
df1[['C','D']]=(df2-df1)[['A','B']]
df1
Out[868]:
ID A B C D
0 0 10 2.0 5 3.0
1 1 11 3.0 6 5.0
2 2 12 4.0 7 5.0
df1.assign(B=0)
Out[869]:
ID A B C D
0 0 10 0 5 3.0
1 1 11 0 6 5.0
2 2 12 0 7 5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
A B C D
ID
0 10 0 5 3.0
1 11 0 6 5.0
2 12 0 7 5.0
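To handle 30+ columns dynamically, a possible generalization of the answers above (my own sketch, assuming both frames share the 'ID' index and identical column names): subtract all shared columns at once and join them back with a suffix.
import pandas as pd

df1 = pd.DataFrame({'A': [10, 11, 12], 'B': [2.0, 3.0, 4.0]},
                   index=pd.Index([0, 1, 2], name='ID'))
df2 = pd.DataFrame({'A': [15, 17, 19], 'B': [5.0, 8.0, 9.0]},
                   index=pd.Index([0, 1, 2], name='ID'))
# subtract every shared column in one shot and join back with a suffix
df = df1.join(df2.sub(df1).add_suffix('_diff'))
print(df)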
