Pandas: Subtracting Two Mismatched Dataframes - python-3.x

Forgive me if this is a repeat question, but I can't find the answer and I'm not even sure what the right terminology is.
I have two dataframes that don't have completely matching rows or columns. Something like:
Balances = pd.DataFrame({'Name': ['Alan', 'Barry', 'Carl', 'Debbie', 'Elaine'],
                         'Age Of Debt': [1, 4, 3, 7, 2],
                         'Balance': [500, 5000, 300, 100, 3000],
                         'Payment Due Date': [1, 1, 30, 14, 1]})
Payments = pd.DataFrame({'Name': ['Debbie', 'Alan', 'Carl'],
                         'Balance': [50, 100, 30]})
I want to subtract the Payments dataframe from the Balances dataframe based on Name, so essentially a new dataframe that looks like this:
pd.DataFrame({'Name': ['Alan', 'Barry', 'Carl', 'Debbie', 'Elaine'],
              'Age Of Debt': [1, 4, 3, 7, 2],
              'Balance': [400, 5000, 270, 50, 3000],
              'Payment Due Date': [1, 1, 30, 14, 1]})
I can imagine having to iterate over the rows of Balances, but when both dataframes are very large I don't think it's very efficient.

You can use .merge:
tmp = pd.merge(Balances, Payments, on="Name", how="outer").fillna(0)
Balances["Balance"] = tmp["Balance_x"] - tmp["Balance_y"]
print(Balances)
Prints:
     Name  Age Of Debt  Balance  Payment Due Date
0    Alan            1    400.0                 1
1   Barry            4   5000.0                 1
2    Carl            3    270.0                30
3  Debbie            7     50.0                14
4  Elaine            2   3000.0                 1
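For what it's worth, a left merge or a name-based lookup avoids pulling in rows for names that exist only in Payments; here is a minimal map-based sketch, assuming the two frames defined above:
# Look up each name's payment (0 if there is none) and subtract it in place.
payments_by_name = Payments.set_index('Name')['Balance']
Balances['Balance'] = Balances['Balance'] - Balances['Name'].map(payments_by_name).fillna(0)
With the sample data both approaches give the same result, since every name in Payments also appears in Balances.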

Related

How to map sales against purchases sequentially using python?

I have a transaction dataframe as follows:
  Item      Date  Code  Qty  Price   Value
0    A  01-01-01   Buy   10  100.5  1005.0
1    A  02-01-01   Buy    5  120.0   600.0
2    A  03-01-01  Sell   12  125.0  1500.0
3    A  04-01-01   Buy    9  110.0   990.0
4    A  04-01-01  Sell    1  100.0   100.0
# and so on... there are a million rows with about a thousand items (here just one item A)
What I want is to map each sell transaction against the buy transactions sequentially, FIRST IN FIRST OUT: the purchase that was made first is sold off first.
For this, I have added a new column bQty with an opening balance equal to the purchase quantity. Then I iterate through the dataframe and, for each sell transaction, set the sold quantity off against the purchase transactions made before that date.
df['bQty'] = df[df['Code'] == 'Buy']['Qty']
for _, sell in df[df['Code'] == 'Sell'].iterrows():
    for _, buy in df[(df['Code'] == 'Buy') & (df['Date'] <= sell['Date'])].iterrows():
        # code #
Now this requires me to go through the whole dataframe again and again for each sell transaction.
For 1,000 records it takes about 10 seconds to complete, so we can assume that for a million records this approach will take a very long time.
Is there any faster way to do this?
If you are only interested in the resulting final balance values per item, here is a fast way to calculate them:
Add two additional columns that contain the same absolute values as Qty and Value, but with a negative sign in the rows where Code is Sell. Then you can group by item and sum these columns to get the remaining quantity and the net money spent per item.
sale = df.Code == 'Sell'
df['Qty_signed'] = df.Qty.copy()
df.loc[sale, 'Qty_signed'] *= -1
df['Value_signed'] = df.Value.copy()
df.loc[sale, 'Value_signed'] *= -1
qty_remaining = df.groupby('Item')['Qty_signed'].sum()
print(qty_remaining)
money_spent = df.groupby('Item')['Value_signed'].sum()
print(money_spent)
Output:
Item
A 11
Name: Qty_signed, dtype: int64
Item
A 995.0
Name: Value_signed, dtype: float64
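As a side note, the two signed columns can also be built in a single step with numpy; a minimal sketch, assuming df is the transaction frame above:
import numpy as np

# Flip the sign of Qty and Value on Sell rows, then total per item.
sign = np.where(df['Code'].eq('Sell'), -1, 1)
totals = (df.assign(Qty_signed=df['Qty'] * sign,
                    Value_signed=df['Value'] * sign)
            .groupby('Item')[['Qty_signed', 'Value_signed']].sum())
print(totals)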

MultiIndexing based on row values

Trying to create a simple program that finds negative values in a pandas dataframe and combines them with their matching row. Basically I have data that looks like this:
LastName  DrugName  RxNumber  Amount  ClientTotalCost
ADAMS     Drug        100001      30            10.69
ADAMS     Drug        100001     -25            -8.95
...
The idea is that I need to match up fills and refunds, then combine them into a single row. So, in the example above we'd have one row that looks like this:
LastName  DrugName  RxNumber  Amount  ClientTotalCost
ADAMS     Drug        100001       5             1.74
Also, if a refund doesn't match to a fill row, I'm supposed to just delete it, which I've been accomplishing with .drop()
I'm imagining I can build a multiindex for this somehow, where each row that's a negative is marked as a refund and each fill row is marked as a fill. Then I just have some kind of for loop that goes through the list and attempts to match a certain number of times based on name/number of refunds.
Here's what I was trying:
pbm_negative_index = raw_pbm_data.loc[:, ['LastName', 'DrugName', 'RxNumber', 'ClientTotalCost']]
names = pbm_negative_index = raw_pbm_data.loc[:, 'LastName']
unique_names = pd.unique(pbm_negative_index)
for n in unique_names:
    edf["Refund"] = edf["ClientTotalCost"].shift(1, fill_value=edf["ClientTotalCost"].head(1)) < 0
This obviously doesn't work and I'd like to use the indexing tools in Pandas to achieve a similar result.
Your specification reduces to two simple steps:
aggregate +ve & -ve matching rows
drop remaining -ve rows after aggregation
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
ADAMS2 Drug. 100001 -5 -1.95
"""), sep=r"\s+")
# aggregate
dfa = df.groupby(["LastName","DrugName","RxNumber"],as_index=False).agg({"Amount":"sum","ClientTotalCost":"sum"})
# drop remaining -ve amounts
dfa = dfa.drop(dfa.loc[dfa.Amount.lt(0)].index)
  LastName DrugName  RxNumber  Amount  ClientTotalCost
0    ADAMS     Drug    100001       5             1.74
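The drop step can equivalently be written as a boolean filter, which some find easier to read; a small variant of the last line above:
# Keep only aggregated rows whose net Amount is non-negative; unmatched refunds net out negative.
dfa = dfa[dfa.Amount.ge(0)].reset_index(drop=True)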

Pandas Dataframe Entire Column to String Data Type

I know you can do this with a series, but I can't seem to do this with a dataframe.
I have the following:
    name                     note  age
0    jon   likes beer on tuesdays   10
1    jon   likes beer on tuesdays
2  steve  tonight we dine in heck   20
3  steve  tonight we dine in heck
I am trying to produce the following:
    name                     note  age
0    jon   likes beer on tuesdays   10
1    jon   likes beer on tuesdays   10
2  steve  tonight we dine in heck   20
3  steve  tonight we dine in heck   20
I know how to do this with string values using group by and join, but this only works on string values. I'm having issues converting the entire column of age to a string data type in the dataframe.
Any suggestions?
Use GroupBy.first with GroupBy.transform if you want to repeat the first value per group:
g = df.groupby('name')
df['note'] = g['note'].transform(' '.join)
df['age'] = g['age'].transform('first')
If you need to process multiple columns (all numeric columns with first and all string columns with join), you can build a dictionary mapping column names to functions, pass it to GroupBy.agg, and finally use DataFrame.join:
import numpy as np

cols1 = df.select_dtypes(np.number).columns
cols2 = df.columns.difference(cols1).difference(['name'])
d1 = dict.fromkeys(cols2, lambda x: ' '.join(x))
d2 = dict.fromkeys(cols1, 'first')
d = {**d1, **d2}
df1 = df[['name']].join(df.groupby('name').agg(d), on='name')
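If the only blocker is the dtype itself, the age column can also be cast to a string type up front; a minimal sketch, assuming df as above (a plain astype(str) would turn missing ages into the literal string 'nan', hence the nullable string dtype):
# Cast the whole column to pandas' nullable string dtype; missing values stay <NA>.
df['age'] = df['age'].astype('string')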

Using apply to modify different dataframe in pandas

I'm running into some issues regarding the use of apply in Pandas.
I have a dataframe where multiple measurements are made on certain days, at different measurement sites. To give an example, Site 1 has 2 measurements to make, every 7 or so days.
We know it has to be 2 measurements, so what I'm trying to do now is to check on which days not enough measurements were made.
   site  measurement  expected        date
0     1            2         1  01-01-2020
1     2            3         2  01-01-2020
2     3            4         2  01-01-2020
3     3            5         2  01-01-2020
4     2            1         2  08-01-2020
5     2            4         2  08-01-2020
I've made a sorted and aggregated DataFrame that aggregates the measurements, basically so I can iterate over them without going over the same days twice when there are multiple measurements.
   site  measurement  expected        date
0     1            2         1  01-01-2020
1     2            3         2  01-01-2020
2     3            9         2  01-01-2020
3     2            5         2  08-01-2020
I'm now using the function filter_amount(df_sorted, df, group), where group counts the number of measurements actually made: df.groupby(["site_id", "date"]).count()
   site  measurement        date
0     1            2  01-01-2020
1     2            1  01-01-2020
2     3            2  01-01-2020
3     2            2  08-01-2020
The current function basically goes like this:
def filter_amount(df_sorted, df, group):
    for i in df_sorted.index:
        # locate the number of measurements actually made for this day and site, in group
        # check how many measurements are expected
        # if not enough measurements:
        #     find all measurements in the normal df and drop them
So in this example, the measurements from site 2 on 01-01-2020 have to be dropped, because there are not enough measurements. The ones from 08-01-2020 are valid, because 2 are expected and 2 were made.
The problem is that this is extremely slow with over 500k rows.
I get the variables I need with .at[], but I'm trying to make it faster by using apply so I can parallelize the operations, and I can't figure it out. I'm applying on df_sorted and passing in the arguments it needs, but it's not actually dropping the measurements from the original df.
I have a feeling that it's possible to do it with some sort of groupby on the original df to save operations.
I hope it's clear enough; happy to elaborate on any questions.
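Not having the full code, here is one groupby-based sketch of the filtering step, assuming the column names shown above (site, measurement, expected, date) and that expected is constant within each site/date group:
# Count how many measurements were actually made per site and date,
# broadcast that count back onto every row, and keep only the rows
# whose group has at least the expected number of measurements.
made = df.groupby(['site', 'date'])['measurement'].transform('count')
df_valid = df[made >= df['expected']]
This avoids the row-by-row loop entirely, so it should scale to 500k rows without trouble.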

Pyspark: How do I get today's score and 30 day avg score in a single row

I have a use-case where I want to get the rank for today as well as the 30-day average as a column. The data has 30 days of data for a particular Id and Type. The data looks like:
Id      Type  checkInDate     avgrank
 1     ALONE   2019-04-24    1.333333
 1     ALONE   2019-03-31   34.057471
 2     ALONE   2019-04-17    1.660842
 1  TOGETHER   2019-04-13   19.500000
 1  TOGETHER   2019-04-08    5.481203
 2     ALONE   2019-03-29  122.449156
 3     ALONE   2019-04-07    3.375000
 1  TOGETHER   2019-04-01   49.179719
 5  TOGETHER   2019-04-17    1.391753
 2     ALONE   2019-04-22    3.916667
 1     ALONE   2019-04-15    2.459151
As my result I want to have output like:
Id      Type  TodayAvg    30DayAvg
 1     ALONE      30.0    9.333333
 1  TOGETHER       1.0   34.057471
 2     ALONE       7.8   99.660842
 2  TOGETHER       3     19.500000
.
.
The way I think I can achieve it is to have 2 dataframes, one filtering on today's date and the second averaging over 30 days, and then join the today dataframe with the averages on Id and Type:
rank = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank")
filtert_rank = Filter.apply(frame=rank, f=lambda x: (x["checkInDate"] == curr_dt))
rank_avg = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank_avg")
rank_avg_f = rank_avg.groupBy("id", "type").agg(F.mean("avgrank"))
rank_join = filtert_rank.join(rank_avg_f, ["id", "type"], how='inner')
Is there a simpler way to do it i.e. without reading the dataframe twice?
You can convert the dynamic frame to an Apache Spark data frame and perform regular SQL.
Check the documentation for toDF() and Spark SQL.
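Building on that, both numbers can be computed in a single pass with a conditional aggregation, so the table is only read once; a minimal sketch, assuming rank and curr_dt as defined in the question:
from pyspark.sql import functions as F

# Convert the Glue DynamicFrame to a Spark DataFrame once, then aggregate:
# the 30-day average over all rows, and today's value via a conditional
# expression that is null (and therefore ignored by avg) for other dates.
df = rank.toDF()
result = (df.groupBy("Id", "Type")
            .agg(F.avg(F.when(F.col("checkInDate") == curr_dt, F.col("avgrank"))).alias("TodayAvg"),
                 F.avg("avgrank").alias("30DayAvg")))
result.show()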
