How to aggregate information in one column of a dataframe in Python 3?

I have a dataframe:
import pandas as pd
d = {'user': ['bob', 'bob', 'peter', 'peter'],
     'item': ['s1', 's1', 's2', 's2'],
     'value': [1, 2, 5, 4]}
df = pd.DataFrame(data=d)
which is
    user item  value
0    bob   s1      1
1    bob   s1      2
2  peter   s2      5
3  peter   s2      4
I want to aggregate the values based on [user, item]. The new dataframe should be
    user item   value
0    bob   s1  [1, 2]
1  peter   s2  [5, 4]
where value is a list. How can I do that?

df.groupby(['user','item']).agg(list).reset_index()
Out[110]:
    user item   value
0    bob   s1  [1, 2]
1  peter   s2  [5, 4]
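If you prefer to spell out which column gets aggregated, a per-column dict is equivalent (a minimal sketch, assuming the same df as above):
out = df.groupby(['user', 'item'], as_index=False).agg({'value': list})
# same result: one row per (user, item), with value collected into a list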

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data:
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana'],
            'price': [1, 2, 1, 3, 2, 1, 3]}
df = pd.DataFrame(jsonDict)
    Fruit  price
0   apple      1
1  orange      2
2   apple      1
3  banana      3
4  orange      2
5   apple      1
6  banana      3
What I want to do is check if Fruit == banana; if yes, the code should scan the preceding as well as the next n rows from the index position of the 'banana' row for any instance where Fruit == apple. An example of the expected output is shown below, taking n = 2.
    Fruit  price
2   apple      1
5   apple      1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf = df.loc[((df.index).isin(position)) &
                  (((df['Fruit'].index + 2).isin(['apple'])) |
                   ((df['Fruit'].index - 2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2&m3]
output:
    Fruit  price
2   apple      1
5   apple      1
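To see why the rolling trick works, it can help to print the intermediate masks (a quick check, reusing m2 and m3 from the snippet above):
print(m2.tolist())
# [False, True, True, True, True, True, True]   <- within ±n rows of a banana
print(m3.tolist())
# [True, False, True, False, False, True, False] <- rows that are apples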

How to fill a dataframe column with tuples of other columns' values using np.where?

I have a dataframe as follows
id Domain City
1 DM Pune
2 VS Delhi
I want to create a new column containing a tuple of the id and Domain column values, e.g.
id Domain City New_Col
1 DM Pune (1,DM)
2 VS Delhi (2,VS)
I know I can create it easily using apply & lambda as follows:
df['New_Col'] = df.apply(lambda r: tuple(r[bkeys]), axis=1)  # here bkeys = ['id', 'Domain']
However, this takes a very long time for larger dataframes (> 100k records). Hence I want to use np.where like this:
df['New_Col'] = np.where(True, tuple(df[bkeys]), '')
But this doesn't work; it gives values like ('id', 'Domain').
Any suggestions?
Try this:
df.assign(new_col = df[['id','Domain']].agg(tuple, axis=1))
Output:
id Domain City new_col
0 1 DM Pune (1, DM)
1 2 VS Delhi (2, VS)
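For larger frames, a commonly used alternative (my sketch, not part of the answer above) is zip, which avoids the Python-level row iteration of apply(axis=1):
df['New_Col'] = list(zip(df['id'], df['Domain']))
# assigns [(1, 'DM'), (2, 'VS')] row by row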
Something is giving people the wrong idea of what np.where does; I've seen similar errors in other questions.
Let's make your dataframe:
In [2]: import pandas as pd
In [3]: df = pd.DataFrame([[1,'DM','Pune'],[2,'VS','Delhi']],columns=['id','Domain','City'])
In [4]: df
Out[4]:
id Domain City
0 1 DM Pune
1 2 VS Delhi
Your apply expression:
In [5]: bkeys = ['id','Domain']
In [6]: df.apply(lambda r:tuple(r[bkeys]),axis=1)
Out[6]:
0 (1, DM)
1 (2, VS)
dtype: object
What's happening here? apply is iterating over the rows of df; r is one row.
So the first row:
In [9]: df.iloc[0]
Out[9]:
id 1
Domain DM
City Pune
Name: 0, dtype: object
Index it with bkeys:
In [10]: df.iloc[0][bkeys]
Out[10]:
id 1
Domain DM
Name: 0, dtype: object
and make a tuple from that:
In [11]: tuple(df.iloc[0][bkeys])
Out[11]: (1, 'DM')
But what do we get when we index the whole dataframe?
In [12]: df[bkeys]
Out[12]:
id Domain
0 1 DM
1 2 VS
In [15]: tuple(df[bkeys])
Out[15]: ('id', 'Domain')
np.where is a function; it is not an iterator. The interpreter evaluates each of its arguments, and passes them to the function.
In [16]: np.where(True, tuple(df[bkeys]), '')
Out[16]: array(['id', 'Domain'], dtype='<U6')
This is what you tried to assign to the new column.
In [17]: df
Out[17]:
id Domain City New_Col
0 1 DM Pune id
1 2 VS Delhi Domain
This assignment only works because the tuple has 2 elements, and df has 2 rows. Otherwise you'd get an error.
np.where is not a magical way of speeding up a dataframe apply. It's a way of creating an array of values which, if it has the right size, can be assigned to a dataframe column (Series).
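For contrast, here is the kind of elementwise selection np.where is actually designed for (a minimal sketch on the same frame):
np.where(df['id'] > 1, 'big', 'small')
# array(['small', 'big'], dtype='<U5')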
We could create a numpy array from the selected columns:
In [31]: df[bkeys].to_numpy()
Out[31]:
array([[1, 'DM'],
       [2, 'VS']], dtype=object)
and from that get a list of lists, and assign that to a new column:
In [32]: df[bkeys].to_numpy().tolist()
Out[32]: [[1, 'DM'], [2, 'VS']]
In [33]: df['New_Col'] = _
In [34]: df
Out[34]:
id Domain City New_Col
0 1 DM Pune [1, DM]
1 2 VS Delhi [2, VS]
If you really want tuples, the sublists will have to be converted:
In [35]: [tuple(i) for i in df[bkeys].to_numpy().tolist()]
Out[35]: [(1, 'DM'), (2, 'VS')]
Another way of making a list of tuples (this works because array records convert to tuples):
In [42]: df[bkeys].to_records(index=False).tolist()
Out[42]: [(1, 'DM'), (2, 'VS')]

List of visited intervals

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5', '2015-3-6',
                               '2016-3-4', '2016-3-6', '2016-3-8'],
                   'name': ['John', 'John', 'John', 'Mary', 'Mary', 'Mary']})
df['visited'] = pd.to_datetime(df['visited'])
     visited  name
0 2015-03-01  John
1 2015-03-05  John
2 2015-03-06  John
3 2016-03-04  Mary
4 2016-03-06  Mary
5 2016-03-08  Mary
I wish to get the list of visit intervals for the two people; in this example, the outcome should be
avg_visited_interval name
0 [4,1] John
1 [2,2] Mary
How should I achieve this?
(e.g., for John there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4, 1])
Use a custom lambda function with Series.diff, remove the first value by position, and convert to integers and lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print (df)
   name intervals
0  John    [4, 1]
1  Mary    [2, 2]
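A variant without the lambda (a sketch on the same data; the per-name day gaps are computed first, then collected into lists):
s = df.groupby('name')['visited'].diff().dt.days.dropna().astype(int)
out = s.groupby(df['name']).agg(list).reset_index(name='intervals')
# same result: John -> [4, 1], Mary -> [2, 2]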

Getting all rows where the entry in column 'C' is larger than the preceding entry in column 'C'

How can I select all rows of a DataFrame where a condition on a column is met, where the condition involves the relationship between every two consecutive entries of that column? To give a specific example, let's say I have a DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4],
...                    'B': ['spam', 'ham', 'egg', 'foo'],
...                    'C': [4, 5, 3, 4]})
>>> df
   A     B  C
0  1  spam  4
1  2   ham  5
2  3   egg  3
3  4   foo  4
>>> df2 = df[ return every row of df where C[i] > C[i-1] ]  # pseudocode
>>> df2
   A     B  C
1  2   ham  5
3  4   foo  4
There is plenty of great information about slicing and indexing in the pandas docs and here, but this is a bit more complicated, I think. I could also be going about it wrong. What I'm looking for is the rows of data where the value stored in C is no longer monotonically decreasing.
Any help is appreciated!
Use boolean indexing, comparing against the shifted column values:
print (df[df['C'] > df['C'].shift()])
A B C
1 2 ham 5
3 4 foo 4
Detail:
print (df['C'] > df['C'].shift())
0 False
1 True
2 False
3 True
Name: C, dtype: bool
Alternatively, to select the rows where the value increased, compare the diff of the column:
print (df[df['C'].diff() > 0])
A B C
1 2 ham 5
3 4 foo 4
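For completeness, here is what diff produces on this column; the first row is NaN, so it can never be selected:
print (df['C'].diff())
0    NaN
1    1.0
2   -2.0
3    1.0
Name: C, dtype: float64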

Python/Pandas return column and row index of found string

I've searched previous answers relating to this but those answers seem to utilize numpy because the array contains numbers. I am trying to search for a keyword in a sentence in a dataframe ('Timeframe') where the full sentence is 'Timeframe for wave in ____' and would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding hard-coding it for dynamic reasons. Is there a way to return the [34, 0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To find the index you need str.contains with boolean indexing, but there are then three possible outcomes:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
A
0 Timeframe for wave in ____
1 a
2 c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        # only one value - return scalar
        return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
Old solution:
It seems you need numpy.where to check for the value 'Timeframe':
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 'Timeframe'],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
   A  B          C  D  E  F
0  a  4          7  1  5  a
1  b  5          8  3  3  a
2  c  4          9  5  6  a
3  d  5          4  7  9  b
4  e  5          2  1  2  b
5  f  4  Timeframe  0  4  b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
If you have multiple columns to search, you can use the following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask==True)
result = [find_result[0][0], find_result[1][0]]
Then output for df and result would be:
>>> df
0 1 2 3
0 1 2 3 4
1 a b Timeframe for wave in____ d
2 5 6 7 8
>>> result
[1, 2]
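Building on both answers, a small helper (the name find_positions and the blanket cast to string are my additions) that returns every (row, column) position containing a keyword:
import numpy as np
import pandas as pd

def find_positions(frame, keyword):
    # cast to string so numeric cells don't break .str.contains
    mask = frame.astype(str).apply(lambda s: s.str.contains(keyword, na=False))
    return list(zip(*np.where(mask)))

# find_positions(df, 'Timeframe') -> [(1, 2)] for the frame above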
