I have the following data frame.
item_id price quantile
0 1 10 0.1
1 3 20 0.2
2 4 30 0.3
3 6 40 0.4
4 11 50 0.5
5 12 60 0.6
6 15 70 0.7
7 20 80 0.8
8 25 90 0.9
9 26 100 1.0
I would like to have a custom rank function, which starts from the record whose quantile is closest to 0.44, then goes down, then up, then down, then up, and so on.
The result should look like:
item_id price quantile customed_rank
0 1 10 0.1 6
1 3 20 0.2 4
2 4 30 0.3 2
3 6 40 0.4 1
4 11 50 0.5 3
5 12 60 0.6 5
6 15 70 0.7 7
7 20 80 0.8 8
8 25 90 0.9 9
9 26 100 1.0 10
Other than looping over the entire data frame, is there a more elegant way to achieve this? Thanks!
You want to rank by the absolute value of the difference between quantile and 0.44.
(df['quantile'] - 0.44).abs().rank()
0 7.0
1 5.0
2 3.0
3 1.0
4 2.0
5 4.0
6 6.0
7 8.0
8 9.0
9 10.0
Name: quantile, dtype: float64
A faster (but uglier) alternative is to argsort twice.
(df['quantile'] - 0.44).abs().values.argsort().argsort() + 1
array([ 7, 5, 3, 1, 2, 4, 6, 8, 9, 10])
Note that this solution is only faster if you work with NumPy array objects (through the values property) rather than Pandas Series objects.
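To attach this rank as an integer column, a minimal sketch (it rebuilds the example frame above; method='first' is an assumption about tie handling, though this data has no ties):
import pandas as pd

df = pd.DataFrame({'item_id': [1, 3, 4, 6, 11, 12, 15, 20, 25, 26],
                   'price': range(10, 101, 10),
                   'quantile': [x / 10 for x in range(1, 11)]})

# Rank rows by their distance from 0.44 and store as an integer column.
df['customed_rank'] = (df['quantile'] - 0.44).abs().rank(method='first').astype(int)
print(df)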
Input dataframe:
row name Min
11 AA 0.3
11 AA 0.2
11 BB 0.3
11 CC 0.2
12 AS 0.3
12 BE 0.3
12 BE 0.4
I need to generate a new column 'count' that holds the number of times each row-name combination occurs.
Expected Output
row name Min Count
11 AA 0.3 2
11 AA 0.2 2
11 BB 0.3 1
11 CC 0.2 1
12 AS 0.3 1
12 BE 0.3 2
12 BE 0.4 2
Use the code below to generate the count column using transform:
df['count'] = df.groupby(['row', 'name'])["Min"].transform("count")
row name Min count
0 11 AA 0.3 2
1 11 AA 0.2 2
2 11 BB 0.3 1
3 11 CC 0.2 1
4 12 AS 0.3 1
5 12 BE 0.3 2
6 12 BE 0.4 2
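For reference, a self-contained sketch that rebuilds the sample frame shown above and applies the same transform:
import pandas as pd

df = pd.DataFrame({'row': [11, 11, 11, 11, 12, 12, 12],
                   'name': ['AA', 'AA', 'BB', 'CC', 'AS', 'BE', 'BE'],
                   'Min': [0.3, 0.2, 0.3, 0.2, 0.3, 0.3, 0.4]})

# transform returns one value per original row, so the result can be
# assigned directly as a new column.
df['count'] = df.groupby(['row', 'name'])['Min'].transform('count')
print(df)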
Alternatively, merge in the counts:
df.merge(df.groupby(['row', 'name']).size().reset_index().rename(columns={0:'Count'}), on=['row','name'])
row name Min Count
0 11 AA 0.3 2
1 11 AA 0.2 2
2 11 BB 0.3 1
3 11 CC 0.2 1
4 12 AS 0.3 1
5 12 BE 0.3 2
6 12 BE 0.4 2
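A slightly tidier variant of the same idea names the size column up front instead of renaming column 0 afterwards:
counts = df.groupby(['row', 'name']).size().rename('Count').reset_index()
df.merge(counts, on=['row', 'name'])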
I have multiple datasets that have the same number of rows and columns. The first column is 0.1, 2, 3, 4, 5, 6, 7, 8.
For instance,
Data1
0.1 3
2 3
3 0.1
4 10
5 5
6 7
7 9
8 2
Data2
0.1 2
2 1
3 0.1
4 0.5
5 4
6 0.3
7 9
8 2
I want to combine the datasets by keeping the first column and adding the second column from each file side by side:
0.1 3 2
2 3 1
3 0.1 0.1
4 10 0.5
5 5 4
6 7 0.3
7 9 9
8 2 2
I prefer to use a Pandas DataFrame. Any clever way to go about this?
Assuming the first column is the index and the second is data:
df = Data1.join(Data2, lsuffix='_1', rsuffix='_2')
Or, using merge with the columns named 'A' and 'B':
pd.merge(df1, df2, on='A', suffixes=('_data1', '_data2'))
A B_data1 B_data2
0 0.1 3.0 2.0
1 2.0 3.0 1.0
2 3.0 0.1 0.1
3 4.0 10.0 0.5
4 5.0 5.0 4.0
5 6.0 7.0 0.3
6 7.0 9.0 9.0
7 8.0 2.0 2.0
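For completeness, a self-contained sketch of the join variant; the frames are built inline here to stand in for however the files are actually read, and setting the first column as the index is the stated assumption:
import pandas as pd

data1 = pd.DataFrame({'A': [0.1, 2, 3, 4, 5, 6, 7, 8],
                      'B': [3, 3, 0.1, 10, 5, 7, 9, 2]}).set_index('A')
data2 = pd.DataFrame({'A': [0.1, 2, 3, 4, 5, 6, 7, 8],
                      'B': [2, 1, 0.1, 0.5, 4, 0.3, 9, 2]}).set_index('A')

# join aligns on the shared index; the suffixes disambiguate the two B columns.
print(data1.join(data2, lsuffix='_1', rsuffix='_2'))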
I have data like this. The first column is the number of days from a starting point; the second column is the value generated on that day.
For example, after day 1 I get $5, after day 2 I get $3, and so on. Some days have no revenue, like day 4, so the day numbers are not consecutive.
data = pd.DataFrame({'day': [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 14, 15, 17, 18, 19],
                     'value': [5, 3, 7, 8, 9, 4, 6, 5, 2, 8, 6, 7, 9, 5, 2]})
I want to find the total value in every 7-day window.
The output should look like:
day value
7 36
14 27
21 23
I am using a loop to achieve this. Is there a better, more Pythonic way of doing it?
df = pd.DataFrame({})
sum_value = 0
for index, row in data.iterrows():
    sum_value += row['value']
    if row['day'] % 7 == 0:
        df = df.append(pd.DataFrame({'day': [row['day']], 'sum_value': [sum_value]}))
        sum_value = 0
print(df)
Also, how do I find the sum of the previous 7 days' values at each day (each row)?
Expected output:
day value
1 5
2 8
3 15
5 23
6 32
7 36
8 37
9 39
10 34
and so on...
I hope I did the calculation right. It is basically a running total of the previous 7 days of values. It would be easier if there were no missing numbers in the day column.
Use groupby with a helper Series: subtract 1 from day, then integer-divide by 7, and aggregate with sum and last:
df = data.groupby((data['day'] - 1) // 7, as_index=False).agg({'day':'last', 'value':'sum'})
print (df)
day value
0 7 36
1 14 27
2 19 23
Details:
print ((data['day'] - 1) // 7)
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
Name: day, dtype: int64
A similar solution if you need the day column as multiples of 7:
df = data.groupby((data['day'] - 1) // 7)['value'].sum().reset_index()
df['day'] = (df['day'] + 1) * 7
print (df)
day value
0 7 36
1 14 27
2 21 23
EDIT: For the running total you need rolling with sum, but first it is necessary to add the missing days with reindex; this requires unique values in the day column.
import numpy as np

idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.set_index('day').reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
value
day
1 5.0
2 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 37.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
If you get:
ValueError: cannot reindex from a duplicate axis
it means there are duplicate day values, and the solution is to aggregate with sum first:
# duplicated day 1
data = pd.DataFrame({'day': [1, 1, 3, 5, 6, 7, 8, 9, 10, 11, 14, 15, 17, 18, 19],
                     'value': [5, 3, 7, 8, 9, 4, 6, 5, 2, 8, 6, 7, 9, 5, 2]})
idx = np.arange(data['day'].min(), data['day'].max() + 1)
df = data.groupby('day')['value'].sum().reindex(idx).rolling(7, min_periods=1).sum()
df = df[df.index.isin(data['day'])]
print (df)
day
1 8.0
3 15.0
5 23.0
6 32.0
7 36.0
8 34.0
9 39.0
10 34.0
11 42.0
14 27.0
15 28.0
17 30.0
18 27.0
19 29.0
Name: value, dtype: float64
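If you want these rolling totals attached back to the original rows rather than kept as a separate object, a small follow-up sketch (the rolling_7d column name is just illustrative) maps the Series from the last block through the day column:
# df is the Series of 7-day rolling sums indexed by day (from the block above).
data['rolling_7d'] = data['day'].map(df)
print(data)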
I have two pandas df that look like this
df1
Amount Price
0 5 50
1 10 53
2 15 55
3 30 50
4 45 61
df2
Used amount
0 4.5
1 1.2
2 6.2
3 4.1
4 25.6
5 31
6 19
7 15
I am trying to insert a new column in df2 that provides the corresponding price from df1. The two frames have different sizes; df1 is smaller.
I am expecting something like this
df3
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31 61
6 19 50
7 15 55
I am thinking of solving this with something like this (incomplete) function:
def price_function(key, table):
    used_amounts = (row[0] for row in table)
    price = filter(lambda x: x < key, used_amounts)
Here is my own solution
1st approach:
from itertools import product
import pandas as pd
# Note: this approach assumes df2's column has been renamed to 'Usedamount'.
df2 = df2.reset_index()
DF = pd.DataFrame(list(product(df2.Usedamount, df1.Amount)), columns=['l1', 'l2'])
DF['DIFF'] = DF.l1 - DF.l2
DF = DF.loc[DF.DIFF <= 0]
DF = DF.sort_values(['l1', 'DIFF'], ascending=[True, False]).drop_duplicates(['l1'], keep='first')
df1.merge(DF, left_on='Amount', right_on='l2', how='left')\
   .merge(df2, left_on='l1', right_on='Usedamount', how='right')\
   .loc[:, ['index', 'Usedamount', 'Price']].set_index('index').sort_index()
Out[185]:
Usedamount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
2nd approach, using pd.merge_asof; I recommend this one. With direction='forward', each row of df2 is matched against the first row of df1 whose Amount is greater than or equal to its own:
df2 = df2.rename(columns={'Used amount': 'Amount'}).sort_values('Amount')
df2 = df2.reset_index()
pd.merge_asof(df2, df1, on='Amount', allow_exact_matches=True, direction='forward')\
  .set_index('index').sort_index()
Out[206]:
Amount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Using pd.IntervalIndex you can do:
In [468]: df1.index = pd.IntervalIndex.from_arrays(df1.Amount.shift().fillna(0),df1.Amount)
In [469]: df1
Out[469]:
Amount Price
(0.0, 5.0] 5 50
(5.0, 10.0] 10 53
(10.0, 15.0] 15 55
(15.0, 30.0] 30 50
(30.0, 45.0] 45 61
In [470]: df2['price'] = df2['Used amount'].map(df1.Price)
In [471]: df2
Out[471]:
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use cut or searchsorted to create bins.
Notice: the index of df1 has to be the default 0, 1, 2, ...
#create default index if necessary
df1 = df1.reset_index(drop=True)
#create bins
bins = [0] + df1['Amount'].tolist()
#get index values of df1 by values of Used amount
a = pd.cut(df2['Used amount'], bins=bins, labels=df1.index)
#assign output
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
a = df1['Amount'].searchsorted(df2['Used amount'])
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
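One caveat worth flagging for the searchsorted variant: it silently assumes df1['Amount'] is sorted ascending, as it is in this example. A minimal guard, as a sketch:
# searchsorted returns meaningless positions on unsorted data, so check first.
assert df1['Amount'].is_monotonic_increasing, "df1 must be sorted by Amount"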
You can use pd.DataFrame.reindex with method='bfill':
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill')
Price
Used amount
4.5 50
1.2 50
6.2 53
4.1 50
25.6 50
31.0 61
19.0 50
15.0 55
To add that to a new column we can use join:
df2.join(
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill'),
on='Used amount'
)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Or assign:
df2.assign(
Price=df1.set_index('Amount').reindex(df2['Used amount'], method='bfill').values)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
I have a dataframe that looks like this:
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 3671 3.0
3 2 10 4.0
4 2 17 5.0
5 3 60 3.0
6 3 110 4.0
7 3 247 3.5
8 4 10 4.0
9 4 112 5.0
10 5 3 4.0
11 5 39 4.0
12 5 104 4.0
I need to get a dataframe which has unique userId, number of ratings by the user and the average rating by the user as shown below:
userId count mean
0 1 3 2.83
1 2 2 4.5
2 3 3 3.5
3 4 2 4.5
4 5 3 4.0
Can someone help?
df1 = df.groupby('userId')['rating'].agg(['count','mean']).reset_index()
print(df1)
userId count mean
0 1 3 2.833333
1 2 2 4.500000
2 3 3 3.500000
3 4 2 4.500000
4 5 3 4.000000
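If you are on pandas 0.25 or newer, named aggregation makes the output columns explicit, and round(2) reproduces the rounded display from the question; a sketch:
df1 = df.groupby('userId', as_index=False).agg(count=('rating', 'count'),
                                               mean=('rating', 'mean'))
df1['mean'] = df1['mean'].round(2)
print(df1)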
Drop movieId since we're not using it, group by userId, and then apply the aggregation methods:
import pandas as pd
df = pd.DataFrame({'userId': [1,1,1,2,2,3,3,3,4,4,5,5,5],
'movieId':[31,1029,3671,10,17,60,110,247,10,112,3,39,104],
'rating':[2.5,3.0,3.0,4.0,5.0,3.0,4.0,3.5,4.0,5.0,4.0,4.0,4.0]})
df = df.drop('movieId', axis=1).groupby('userId').agg(['count','mean'])
print(df)
Which produces:
rating
count mean
userId
1 3 2.833333
2 2 4.500000
3 3 3.500000
4 2 4.500000
5 3 4.000000
Here's a NumPy-based approach, using the fact that the userId column appears to be sorted:
# unq holds the unique user ids, tags maps each row to its group,
# and count is the number of ratings per user.
unq, tags, count = np.unique(df.userId.values, return_inverse=True, return_counts=True)
# The weighted bincount sums the ratings per group; dividing by count gives the mean.
mean_vals = np.bincount(tags, df.rating.values) / count
df_out = pd.DataFrame(np.c_[unq, count], columns=['userID', 'count'])
df_out['mean'] = mean_vals
Sample run -
In [103]: df
Out[103]:
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 3671 3.0
3 2 10 4.0
4 2 17 5.0
5 3 60 3.0
6 3 110 4.0
7 3 247 3.5
8 4 10 4.0
9 4 112 5.0
10 5 3 4.0
11 5 39 4.0
12 5 104 4.0
In [104]: df_out
Out[104]:
userID count mean
0 1 3 2.833333
1 2 2 4.500000
2 3 3 3.500000
3 4 2 4.500000
4 5 3 4.000000