How to aggregate information in one column of a dataframe in Python 3?

I have a dataframe:
import pandas as pd
d = {'user': ['bob', 'bob', 'peter', 'peter'],
     'item': ['s1', 's1', 's2', 's2'],
     'value': [1, 2, 5, 4]}
df = pd.DataFrame(data=d)
which is
    user item  value
0    bob   s1      1
1    bob   s1      2
2  peter   s2      5
3  peter   s2      4
I want to aggregate the values based on [user, item]. The new dataframe should be
    user item   value
0    bob   s1  [1, 2]
1  peter   s2  [5, 4]
where value is a list. How can I do that?

df.groupby(['user','item']).agg(list).reset_index()
Out[110]:
    user item   value
0    bob   s1  [1, 2]
1  peter   s2  [5, 4]
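If you prefer to spell out which column gets aggregated, a per-column dict is equivalent (a minimal sketch, assuming the same df as above):
out = df.groupby(['user', 'item'], as_index=False).agg({'value': list})
# same result: one row per (user, item), with value collected into a list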

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data:
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana'],
            'price': [1, 2, 1, 3, 2, 1, 3]}
df = pd.DataFrame(jsonDict)
    Fruit  price
0   apple      1
1  orange      2
2   apple      1
3  banana      3
4  orange      2
5   apple      1
6  banana      3
What I want to do is check if Fruit == banana; if yes, the code should scan the preceding as well as the next n rows from the index position of the 'banana' row for any instance where Fruit == apple. An example of the expected output is shown below, taking n = 2.
    Fruit  price
2   apple      1
5   apple      1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf = df.loc[((df.index).isin(position)) &
                  (((df['Fruit'].index + 2).isin(['apple'])) |
                   ((df['Fruit'].index - 2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2&m3]
output:
    Fruit  price
2   apple      1
5   apple      1
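To see why the rolling trick works, it can help to print the intermediate masks (a quick check, reusing m2 and m3 from the snippet above):
print(m2.tolist())
# [False, True, True, True, True, True, True]   <- within ±n rows of a banana
print(m3.tolist())
# [True, False, True, False, False, True, False] <- rows that are apples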

How to fill a dataframe column with tuples of other columns' values using np.where?

I have a dataframe as follows
id Domain City
1 DM Pune
2 VS Delhi
I want to create a new column containing a tuple of the id and Domain column values, e.g.
id Domain City New_Col
1 DM Pune (1,DM)
2 VS Delhi (2,VS)
I know I can create it easily using apply & lambda as follows:
df['New_Col'] = df.apply(lambda r: tuple(r[bkeys]), axis=1)  # here bkeys = ['id', 'Domain']
However, this takes a very long time for larger dataframes (> 100k records). Hence I want to use np.where like this:
df['New_Col'] = np.where(True, tuple(df[bkeys]), '')
But this doesn't work; it gives values like ('id', 'Domain').
Any suggestions?
Try this:
df.assign(new_col = df[['id','Domain']].agg(tuple, axis=1))
Output:
id Domain City new_col
0 1 DM Pune (1, DM)
1 2 VS Delhi (2, VS)
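For larger frames, a commonly used alternative (my sketch, not part of the answer above) is zip, which avoids the Python-level row iteration of apply(axis=1):
df['New_Col'] = list(zip(df['id'], df['Domain']))
# assigns [(1, 'DM'), (2, 'VS')] row by row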
Something is giving people the wrong idea of what np.where does; I've seen similar errors in other questions.
Let's make your dataframe:
In [2]: import pandas as pd
In [3]: df = pd.DataFrame([[1,'DM','Pune'],[2,'VS','Delhi']],columns=['id','Domain','City'])
In [4]: df
Out[4]:
id Domain City
0 1 DM Pune
1 2 VS Delhi
Your apply expression:
In [5]: bkeys = ['id','Domain']
In [6]: df.apply(lambda r:tuple(r[bkeys]),axis=1)
Out[6]:
0 (1, DM)
1 (2, VS)
dtype: object
What's happening here? apply is iterating over the rows of df; r is one row.
So the first row:
In [9]: df.iloc[0]
Out[9]:
id 1
Domain DM
City Pune
Name: 0, dtype: object
Index it with bkeys:
In [10]: df.iloc[0][bkeys]
Out[10]:
id 1
Domain DM
Name: 0, dtype: object
and make a tuple from that:
In [11]: tuple(df.iloc[0][bkeys])
Out[11]: (1, 'DM')
But what do we get when we index the whole dataframe?
In [12]: df[bkeys]
Out[12]:
id Domain
0 1 DM
1 2 VS
In [15]: tuple(df[bkeys])
Out[15]: ('id', 'Domain')
np.where is a function; it is not an iterator. The interpreter evaluates each of its arguments, and passes them to the function.
In [16]: np.where(True, tuple(df[bkeys]), '')
Out[16]: array(['id', 'Domain'], dtype='<U6')
This is what you tried to assign to the new column.
In [17]: df
Out[17]:
id Domain City New_Col
0 1 DM Pune id
1 2 VS Delhi Domain
This assignment only works because the tuple has 2 elements, and df has 2 rows. Otherwise you'd get an error.
np.where is not a magical way of speeding up a dataframe apply. It's a way of creating an array of values which, if it has the right size, can be assigned to a dataframe column (Series).
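For contrast, here is the kind of elementwise selection np.where is actually designed for (a minimal sketch on the same frame):
np.where(df['id'] > 1, 'big', 'small')
# array(['small', 'big'], dtype='<U5')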
We could create a numpy array from the selected columns:
In [31]: df[bkeys].to_numpy()
Out[31]:
array([[1, 'DM'],
       [2, 'VS']], dtype=object)
and from that get a list of lists, and assign that to a new column:
In [32]: df[bkeys].to_numpy().tolist()
Out[32]: [[1, 'DM'], [2, 'VS']]
In [33]: df['New_Col'] = _
In [34]: df
Out[34]:
id Domain City New_Col
0 1 DM Pune [1, DM]
1 2 VS Delhi [2, VS]
If you really want tuples, the sublists will have to be converted:
In [35]: [tuple(i) for i in df[bkeys].to_numpy().tolist()]
Out[35]: [(1, 'DM'), (2, 'VS')]
Another way of making a list of tuples (this works because array records convert to tuples):
In [42]: df[bkeys].to_records(index=False).tolist()
Out[42]: [(1, 'DM'), (2, 'VS')]

List of visited intervals

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5', '2015-3-6',
                               '2016-3-4', '2016-3-6', '2016-3-8'],
                   'name': ['John', 'John', 'John', 'Mary', 'Mary', 'Mary']})
df['visited'] = pd.to_datetime(df['visited'])
     visited  name
0 2015-03-01  John
1 2015-03-05  John
2 2015-03-06  John
3 2016-03-04  Mary
4 2016-03-06  Mary
5 2016-03-08  Mary
I wish to get the list of visit intervals for the two people; in this example, the outcome should be
avg_visited_interval name
0 [4,1] John
1 [2,2] Mary
How should I achieve this?
(e.g., for John there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4, 1])
Use a custom lambda function with Series.diff, remove the first value by position, and convert to integers and lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print (df)
   name intervals
0  John    [4, 1]
1  Mary    [2, 2]
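A variant without the lambda (a sketch on the same data; the per-name day gaps are computed first, then collected into lists):
s = df.groupby('name')['visited'].diff().dt.days.dropna().astype(int)
out = s.groupby(df['name']).agg(list).reset_index(name='intervals')
# same result: John -> [4, 1], Mary -> [2, 2]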

Getting all rows where the entry in column 'C' is larger than the preceding entry in column 'C'

How can I select all rows of a DataFrame where a condition on a column is met, where the condition involves the relationship between every two consecutive entries of that column? To give a specific example, let's say I have a DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4],
...                    'B': ['spam', 'ham', 'egg', 'foo'],
...                    'C': [4, 5, 3, 4]})
>>> df
   A     B  C
0  1  spam  4
1  2   ham  5
2  3   egg  3
3  4   foo  4
>>> df2 = df[ return every row of df where C[i] > C[i-1] ]  # pseudocode
>>> df2
   A     B  C
1  2   ham  5
3  4   foo  4
There is plenty of great information about slicing and indexing in the pandas docs and here, but this is a bit more complicated, I think. I could also be going about it wrong. What I'm looking for is the rows of data where the value stored in C is no longer monotonically decreasing.
Any help is appreciated!
Use boolean indexing, comparing against the shifted column values:
print (df[df['C'] > df['C'].shift()])
A B C
1 2 ham 5
3 4 foo 4
Detail:
print (df['C'] > df['C'].shift())
0 False
1 True
2 False
3 True
Name: C, dtype: bool
Alternatively, to select the rows where the value increased, compare the diff of the column:
print (df[df['C'].diff() > 0])
A B C
1 2 ham 5
3 4 foo 4
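For completeness, here is what diff produces on this column; the first row is NaN, so it can never be selected:
print (df['C'].diff())
0    NaN
1    1.0
2   -2.0
3    1.0
Name: C, dtype: float64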

Python/Pandas return column and row index of found string

I've searched previous answers relating to this but those answers seem to utilize numpy because the array contains numbers. I am trying to search for a keyword in a sentence in a dataframe ('Timeframe') where the full sentence is 'Timeframe for wave in ____' and would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding hard-coding it for dynamic reasons. Is there a way to return the [34, 0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To find the index you need str.contains with boolean indexing, but there are then three possible outcomes:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
A
0 Timeframe for wave in ____
1 a
2 c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        # only one value - return scalar
        return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
Old solution:
It seems you need numpy.where to check for the value 'Timeframe':
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 'Timeframe'],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
   A  B          C  D  E  F
0  a  4          7  1  5  a
1  b  5          8  3  3  a
2  c  4          9  5  6  a
3  d  5          4  7  9  b
4  e  5          2  1  2  b
5  f  4  Timeframe  0  4  b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
If you have multiple columns to search, you can use the following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask==True)
result = [find_result[0][0], find_result[1][0]]
Then output for df and result would be:
>>> df
0 1 2 3
0 1 2 3 4
1 a b Timeframe for wave in____ d
2 5 6 7 8
>>> result
[1, 2]
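Building on both answers, a small helper (the name find_positions and the blanket cast to string are my additions) that returns every (row, column) position containing a keyword:
import numpy as np
import pandas as pd

def find_positions(frame, keyword):
    # cast to string so numeric cells don't break .str.contains
    mask = frame.astype(str).apply(lambda s: s.str.contains(keyword, na=False))
    return list(zip(*np.where(mask)))

# find_positions(df, 'Timeframe') -> [(1, 2)] for the frame above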
