Sample dataframe with number of records sampled per hour predefined - python-3.x

I have to sample a dataframe (df1) and I have another dataframe (df2) that tells me how many records I should retrieve from each hour of the day.
For example,
df1:
   Hour Value
0    00     A
1    00     B
2    00     C
3    01     D
4    01     A
5    01     B
6    01     D
df2:
   Hour  Number
0    00       1
1    01       2
So that in the end I would get, for example, record 1 for midnight and records 3 and 5 for 1 am (or any other combination, as long as it respects the counts in df2).
The thing is that I need to write this in a function in order for me to call this inside another function.
So far I have
def sampling(frame):
    return np.random.choice(frame.index)
but I am failing to add the constraints from df2.
Could anybody help?

First we add the number of samples required as a new column using merge, then apply sample to each group of Hour values. Finally we remove the added column by returning all but the last column:
def sampling(df1, df2):
    return (df1.merge(df2, on='Hour').groupby('Hour')
            .apply(lambda x: x.sample(x['Number'].iloc[0]))
            .reset_index(level=0, drop=True).iloc[:, :-1])
df1 = pd.DataFrame({'Hour': [0,0,0,1,1,1,1], 'Value': list('ABCDABD')})
df2 = pd.DataFrame({'Hour': [0,1], 'Number': [1,2]})
sampling(df1, df2)
Result:
Hour Value
2 0 C
4 1 A
5 1 B
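One caveat as a follow-up: if df2 ever requests more rows than an hour actually contains, x.sample will raise a ValueError. A variant of the same function that caps the request at the group size (a sketch, same column names as above):
def sampling(df1, df2):
    merged = df1.merge(df2, on='Hour')
    return (merged.groupby('Hour', group_keys=False)
                  .apply(lambda x: x.sample(min(x['Number'].iloc[0], len(x))))
                  .drop(columns='Number'))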

Related

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of indices for rows (from dataframe B) for which position column falls in the interval (specified by start_coordinate and end_coordinate column) in dataframe A.
The result for this task will be:
lst = [0, 1]  # because row 0 of B falls in the interval of row 1 in A, and row 1 of B falls in the interval of row 3 of A.
Using the indices I get from task 1, I want to keep those rows from dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here
# interval bounds from A and positions from B, shaped for broadcasting
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
# keep the rows of B whose position falls inside any interval of A
dfB[((p >= s) & (p <= e)).any(axis=1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
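Task 1 also asked for the list of matching indices; the same boolean mask gives it directly (a small follow-up using the arrays built above):
mask = ((p >= s) & (p <= e)).any(axis=1)
lst = dfB.index[mask].tolist()   # [0, 1]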
You could use Pandas IntervalIndex to get the positions and, afterwards, use a boolean mask to pull the relevant rows from B:
Create IntervalIndex:
intervals = pd.IntervalIndex.from_tuples(
    [*zip(A['start_coordinate'], A['end_coordinate'])],
    closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
This should work. Less elegant but easier to comprehend.
import pandas as pd
data = [['Name.','gender', 'start_coordinate','end_coordinate','ID'],
['Peter','M',30,150,1],
['Hugo','M',4500,6000,2],
['Jennie','F',300,700,3]]
data2 = [['ID_sim.','position','string'],
['1',89,'aa'],
['4',568,'bb'],
['5',938437,'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
merged = pd.merge(df1, df2, left_index=True, right_index=True)
print (merged[(merged['position'] > merged['start_coordinate']) & (merged['position'] < merged['end_coordinate'])])

Create Multiple Dataframes using Loop & function

I have a df with over 1M rows, similar to this:
ID Date Amount
x May 1 10
y May 2 20
z May 4 30
x May 1 40
y May 1 50
z May 2 60
x May 1 70
y May 5 80
a May 6 90
b May 8 100
x May 10 110
I have to sort the data by date and then create new dataframes depending on how many times an ID appears with a value in the Amount column. So if x has made a purchase 3 times, I need it in 3 different dataframes. The first_purchase dataframe would have every ID that has purchased even once, irrespective of date or amount.
If an ID purchases 3 times, I need that ID to be in first_purchase, then second, and then third, with Date and Amount.
Doing it manually is easy with:
df = df.sort_values('Date')
first_purchase = df.drop_duplicates('ID')
after_1stpurchase = df[~df.index.isin(first_purchase.index)]
The second dataframe would be created with:
after_1stpurchase = after_1stpurchase.sort_values('Date')
second_purchase = after_1stpurchase.drop_duplicates('ID')
after_2ndpurchase = after_1stpurchase[~after_1stpurchase.index.isin(second_purchase.index)]
How do I create the loop to provide me with each dataframes?
IIUC, I was able to achieve what you wanted.
import pandas as pd
import numpy as np
# source data for the dataframe
data = {
"ID":["x","y","z","x","y","z","x","y","a","b","x"],
"Date":["May 01","May 02","May 04","May 01","May 01","May 02","May 01","May 05","May 06","May 08","May 10"],
"Amount":[10,20,30,40,50,60,70,80,90,100,110]
}
df = pd.DataFrame(data)
# convert the Date column to datetime and still maintain the format like "May 01"
df['Date'] = pd.to_datetime(df['Date'], format='%b %d').dt.strftime('%b %d')
# sort the values on ID and Date
df.sort_values(by=['ID', 'Date'], inplace=True)
df.reset_index(inplace=True, drop=True)
print(df)
Original Dataframe:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 40 May 01 x
4 70 May 01 x
5 110 May 10 x
6 50 May 01 y
7 20 May 02 y
8 80 May 05 y
9 60 May 02 z
10 30 May 04 z
# create a list of unique ids
list_id = sorted(list(set(df['ID'])))
# create an empty list that would contain dataframes
df_list = []
# count of iterations that must be separated out;
# for example, if we want to record 3 entries for
# each id, iter would be 3. This will create
# three new dataframes that will hold the transactions
# respectively.
iter = 3
for i in range(iter):
    df_list.append(pd.DataFrame())
for val in list_id:
    tmp_df = df.loc[df['ID'] == val].reset_index(drop=True)
    # consider only the top iter (=3) values to be distributed
    counter = np.minimum(tmp_df.shape[0], iter)
    for idx in range(counter):
        df_list[idx] = df_list[idx].append(tmp_df.loc[tmp_df.index == idx])
for df in df_list:
    df.reset_index(drop=True, inplace=True)
    print(df)
Transaction #1:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 50 May 01 y
4 60 May 02 z
Transaction #2:
Amount Date ID
0 40 May 01 x
1 20 May 02 y
2 30 May 04 z
Transaction #3:
Amount Date ID
0 70 May 01 x
1 80 May 05 y
Note that in your data there are four transactions for 'x'. If, let's say, you wanted to track the 4th iterative transaction as well, all you need to do is change the value of 'iter' to 4 and you will get a fourth dataframe with the following value:
Amount Date ID
0 110 May 10 x
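As a side note, the same split can be sketched without the explicit loops by ranking each ID's purchases with groupby().cumcount() after sorting (this is an alternative sketch, not the approach above, and assumes the same df):
import pandas as pd

# same source data as above
data = {
    "ID": ["x","y","z","x","y","z","x","y","a","b","x"],
    "Date": ["May 01","May 02","May 04","May 01","May 01","May 02","May 01","May 05","May 06","May 08","May 10"],
    "Amount": [10,20,30,40,50,60,70,80,90,100,110]
}
df = pd.DataFrame(data).sort_values(['ID', 'Date'])

# 0 for each ID's first purchase, 1 for its second, and so on
df['rank'] = df.groupby('ID').cumcount()

# one dataframe per purchase rank; purchases[0] corresponds to first_purchase
purchases = {r: g.drop(columns='rank').reset_index(drop=True)
             for r, g in df.groupby('rank')}
for r in sorted(purchases):
    print(f"Transaction #{r + 1}")
    print(purchases[r])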

Use Switch/Case Statement to build DF2, by Iterating Over Rows in DF1

I've loaded data from a tab-delimited file into a DF. The tab data is a form filled out with a template.
A critical concept is that a variable number of rows makes up one entry in the form. In DF1 below, every time the index is "A", a new record is starting. So the code will need to iterate through the rows to rebuild each record in DF2. Each record will be represented as one row in DF2.
Based on the fact that each "A" row in DF1 starts a new form entry (and corresponding row in DF2), we can see in DF1 below there are just two entries in my example, so there will be just two rows in DF2. Also important: there are a different number of pieces of data (columns) in each row. Z has 2 (then NAs), A has 3, B has 4.
All of this needs to be mapped to DF2 depending on the index letters Z, A, B (note there are more index letters but this is simplified for this example).
DF 1
- A B C D
Z xyz 5 NA NA
A COA aa bb NA
B RE 01 02 03
B DE 04 05 06
A COB dd ee NA
B RE 01 02 03
B DE 04 05 06
In the past I've done this type of thing in VBA and would have used a CASE statement to transform the data. I've found a good start using dictionaries in this thread:
Replacements for switch statement in Python?
One code example at the above thread suggests using a dictionary type case statement:
return {
    'a': 1,
    'b': 2,
}[x]
This seems like it would work, although I'm not certain how to execute it in practice. In addition, for each A, B, etc. above, I need to output multiple instructions depending on the index letter. For the most part, the instructions are where to map in DF2. For example, in my:
Index A:
Map column A to DF2.iloc[1]['B']
Map column B to DF2.iloc[1]['C']
Map column C to DF2.iloc[1]['D']
Index B:
Would have four instructions, similar to above.
DF2 would end up looking like so
- A B C D E F G H I J K L
1 xyz COA aa bb RE 01 02 03 DE 04 05 06
2 xyz COB dd ee RE 01 02 03 DE 04 05 06
So for each row in DF1, a different number of instructions is being performed depending on the "index letter." All instructions are telling the code where to put the data in DF2. The mapping instruction for each different index letter will always be the same for the columns, only the row will be changing (some type of counter as you move from one record group to the next in DF2).
How can I handle the different number of instructions for each type of index letter in a switch/case type format?
Thank you
I think you can use:
#filter only 2,3 index rows
df1 = df[df.index.isin([2,3])].copy()
#create new column for same value if 2 in index
df1['new'] = np.where(df1.index == 2, 'Z', df1.A)
#create groups by compare 2
df1['g'] = (df1.index == 2).cumsum()
#convert columns to index and reshape, then change order
df1 = (df1.set_index(['g','new']).unstack()
.swaplevel(0,1, axis=1)
.sort_index(axis=1, ascending=[False, True]))
#default columns names
df1.columns = range(len(df1.columns))
print (df1)
0 1 2 3 4 5 6 7 8 9 10 11
g
1 ABC aa bb cc R 01 02 NaN D NaN 03 04
2 DEF dd ee ff R 01 02 NaN D NaN 03 04
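For completeness, the dictionary-dispatch idea mentioned in the question can also be written directly. A minimal sketch, with hypothetical column labels and handler functions, and assuming (as in the expected DF2 above) that only the first field of a Z row carries into the output:
import pandas as pd

# hypothetical rebuild of DF1, with the index letters as the index
df1 = pd.DataFrame(
    [['Z', 'xyz', '5', None, None],
     ['A', 'COA', 'aa', 'bb', None],
     ['B', 'RE', '01', '02', '03'],
     ['B', 'DE', '04', '05', '06'],
     ['A', 'COB', 'dd', 'ee', None],
     ['B', 'RE', '01', '02', '03'],
     ['B', 'DE', '04', '05', '06']],
    columns=['idx', 'A', 'B', 'C', 'D']).set_index('idx')

# dispatch dictionary: each index letter maps to a handler that returns
# the fields that row contributes to the flattened record
handlers = {
    'Z': lambda row: row[['A']].tolist(),                # 1 field, shared by every record
    'A': lambda row: row[['A', 'B', 'C']].tolist(),      # 3 fields, starts a new record
    'B': lambda row: row[['A', 'B', 'C', 'D']].tolist()  # 4 fields
}

records, current, shared = [], None, []
for letter, row in df1.iterrows():
    fields = handlers[letter](row)
    if letter == 'Z':
        shared = fields                    # remember the header value(s)
    elif letter == 'A':
        if current is not None:
            records.append(current)
        current = shared + fields          # start a new record
    else:
        current = current + fields         # extend the current record
if current is not None:
    records.append(current)

df2 = pd.DataFrame(records)
df2.columns = list('ABCDEFGHIJKL')         # assumes 12 fields per record, as in the example
print(df2)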

Find a row matching multiple column criteria

I have a dataframe with 2M rows which is in the below format:
ID Number
1 30
1 40
1 60
2 10
2 30
3 60
I need to select the IDs that have both the numbers 30 and 40 present (in this case, the output should be 1).
I know we can create a new DF having only numbers 30 & 40 and then groupby to see which IDs have a count greater than 1. But is there a way to do both in the groupby statement?
My code:
a=df[(df['Number']==30) | (df['Number']==40) ]
b=a.groupby('ID')['Number'].nunique().to_frame(name='tt').reset_index()
b[b['tt'] > 1]
Use groupby filter and issubset
s = {30, 40}
df.groupby('ID').filter(lambda x: s.issubset(set(x.Number)))
Out[158]:
ID Number
0 1 30
1 1 40
2 1 60
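If only the qualifying IDs are needed (the expected output here is just 1), a short follow-up sketch keeps everything inside the groupby:
s = {30, 40}
has_both = df.groupby('ID')['Number'].apply(lambda x: s.issubset(set(x)))
has_both[has_both].index.tolist()
# [1]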
I find the fact that the describe() method of Groupby objects returns a dataframe to be extremely helpful.
Output temp1 = a.groupby("ID").describe() and temp2 = a.groupby("ID").describe()["Number"] to a Jupyter notebook to see what they look like, then the following code (which follows on from yours) should make sense.
summary = a.groupby("ID").describe()["Number"]
summary.loc[summary["count"] > 1].index
I would create a df for each condition and then inner join them:
df1 = df[df.Number == 30][['ID']]
df2 = df[df.Number == 40][['ID']]
df3 = df1.merge(df2, how='inner', on='ID')

Upsert function in Dataframe - Python

I am trying to update one dataframe with another dataframe with respect to the first column. If there is an extra row in the second dataframe, it should be inserted in the first dataframe. If there is a row with the same data in the first column but different data in the other columns, that row should be updated. Also, any row which has no value in the first column should be dropped.
Code used -
df = df_1.combine_first(df_2)\
.reset_index()\
.reindex(columns=df_1.columns)
df = df.drop_duplicates(subset='A', keep= 'last', inplace=False)
df.dropna(subset=['A'])
print ("Final Data")
print (df)
First Dataframe -
A B C
0 45 a b
1 98 c d
2 67 bn k
Second Dataframe -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4
Final should look like -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
The final dataframe that I get -
A B C
0 45.0 a b
1 98.0 c d
2 67.0 bn k
3 90.0 x z
4
So the data is not getting updated, nor is the row with null values being removed. What am I missing?
Based on my understanding of your question, your second dataframe basically supersedes the first if there is a matching index. If there isn't, then the difference is added to the first dataframe. I am also assuming that there are no duplicate keys in the first column, A.
Framing this requirement a little differently, the final output should contain all the rows in the second dataframe, as well as the values (since they are meant to overwrite the first dataframe if there's a match). Therefore, we will start off using the second dataframe as it is, and then add back the rows that exist in the first dataframe but not in the second. See the example below. (I'm also using a slightly different first dataframe to highlight the effects)
import pandas as pd
df1 = pd.DataFrame({'A':[45,98,67,91],'B':['a','c','bn','y'],'C':['b','d','k','oo']})
df2 = pd.DataFrame({'A':[45,98,67,90,''],'B':['a','c','bn','x',''],'C':['d','d','k','z','']})
# Remove rows with empty values in first column. This should be whatever conditions applicable to you i.e. checking for np.nan instead of str('')
df2 = df2.loc[df2['A'] != '']
df1.set_index('A', inplace=True)
df2.set_index('A', inplace=True)
# Find keys in dataframe 1 that are not in dataframe 2
idx_diff = df1.index.difference(df2.index)
# Append these rows to dataframe 2
df_ins = df1.loc[idx_diff]
df3 = df2.append(df_ins)
df3.reset_index(inplace=True)
>>>df3
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4 91 y oo
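One side note: DataFrame.append was removed in pandas 2.0, so on recent versions the last concatenation step would use pd.concat instead (same variable names as above):
# equivalent of df3 = df2.append(df_ins) on pandas >= 2.0
df3 = pd.concat([df2, df_ins])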
