MultiIndexing based on row values - python-3.x

Trying to create a simple program that finds negative values in a pandas dataframe and combines them with their matching row. Basically I have data that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
...
The idea is that I need to match up fills and refunds, then combine them into a single row. So, in the example above we'd have one row that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 5 1.74
Also, if a refund doesn't match to a fill row, I'm supposed to just delete it, which I've been accomplishing with .drop()
I'm imagining I can build a multiindex for this somehow, where each row that's a negative is marked as a refund and each fill row is marked as a fill. Then I just have some kind of for loop that goes through the list and attempts to match a certain number of times based on name/number of refunds.
Here's what I was trying:
pbm_negative_index = raw_pbm_data.loc['LastName','DrugName','RXNumber','ClientTotalCost']
names = pbm_negative_index = raw_pbm_data.loc[: , 'LastName']
unique_names = unique(pbm_negative_index)
for n in unique_names:
edf["Refund"] = edf["ClientTotalCost"].shift(1, fill_value=edf["ClientTotalCost"].head(1)) < 0
This obviously doesn't work and I'd like to use the indexing tools in Pandas to achieve a similar result.

Your specification reduces to two simple steps:
aggregate +ve & -ve matching rows
drop remaining -ve rows after aggregation
df = pd.read_csv(io.StringIO("""LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
ADAMS2 Drug. 100001 -5 -1.95
"""), sep="\s+")
# aggregate
dfa = df.groupby(["LastName","DrugName","RxNumber"],as_index=False).agg({"Amount":"sum","ClientTotalCost":"sum"})
# drop remaining -ve amounts
dfa = dfa.drop(dfa.loc[dfa.Amount.lt(0)].index)
LastName
DrugName
RxNumber
Amount
ClientTotalCost
0
ADAMS
Drug
100001
5
1.74

Related

How join two dataframes with multiple overlap in pyspark

Hi I have a dataset of multiple households where all people within households have been matched between two datasources. The dataframe therefore consists of a 'household' col, and two person cols (one for each datasource). However some people (like Jonathan or Peter below) where not able to be matched and so have a blank second person column.
Household
Person_source_A
Person_source_B
1
Oliver
Oliver
1
Jonathan
1
Amy
Amy
2
David
Dave
2
Mary
Mary
3
Lizzie
Elizabeth
3
Peter
As the dataframe is gigantic, my aim is to take a sample of the unmatched individuals, and then output a df that has all people within households where only sampled unmatched people exist. Ie say my random sample includes Oliver but not Peter, then I would only household 1 in the output.
My issue is I've filtered to take the sample and now am stuck making progress. Some combination of join, agg/groupBy... will work but I'm struggling. I add a flag to the sampled unmatched names to identify them which i think is helpful...
My code:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull()) & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)
# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched.withColumn('sample_flag', lit('1'))
As it pertains to your intent:
I just want to reduce my dataframe to only show the full households of
households where an unmatched person exists that has been selected by
a random sample out of all unmatched people
Using your existing approach you could use a join on the Household of the sample records
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull()) & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()
desired_df = df.join(df_unmatched_sample,["Household"],"inner")
Edit 1
In response to op's comment:
Is there a slightly different way that keeps a flag to identify the
sampled unmatched person (as there are some households with more than
one unmatched person)?
A left join on your existing dataset after adding the flag column to your sample may help you to achieve this eg:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull()) & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))
desired_df = (
df.alias("dfo").join(
df_unmatched_sample.alias("dfu"),
[
col("dfo.Household")==col("dfu.Household") ,
col("dfo.per_A")==col("dfu.per_A"),
col("dfo.per_B").isNull()
],
"left"
)
)

Pandas: Subtracting Two Mismatched Dataframes

Forgive me if this is a repeat question, but I can't find the answer and I'm not even sure what the right terminology is.
I have two dataframes that don't have completely matching rows or columns. Something like:
Balances = pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
'Age Of Debt':[1,4,3,7,2],
'Balance':[500,5000,300,100,3000],
'Payment Due Date':[1,1,30,14,1]})
Payments = pd.DataFrame({'Name':['Debbie','Alan','Carl'],
'Balance':[50,100,30]})
I want to subtract the Payments dataframe from the Balances dataframe based on Name, so essentially a new dataframe that looks like this:
pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
'Age Of Debt':[1,4,3,7,2],
'Balance':[400,5000,270,50,3000],
'Payment Due Date':[1,1,30,14,1]})
I can imagine having to iterate over the rows of Balances, but when both dataframes are very large I don't think it's very efficient.
You can use .merge:
tmp = pd.merge(Balances, Payments, on="Name", how="outer").fillna(0)
Balances["Balance"] = tmp["Balance_x"] - tmp["Balance_y"]
print(Balances)
Prints:
Name Age Of Debt Balance Payment Due Date
0 Alan 1 400.0 1
1 Barry 4 5000.0 1
2 Carl 3 270.0 30
3 Debbie 7 50.0 14
4 Elaine 2 3000.0 1

Pandas Dataframe Entire Column to String Data Type

I know you can do this with a series, but I can't seem to do this with a dataframe.
I have the following:
name note age
0 jon likes beer on tuesdays 10
1 jon likes beer on tuesdays
2 steve tonight we dine in heck 20
3 steve tonight we dine in heck
I am trying to produce the following:
name note age
0 jon likes beer on tuesdays 10
1 jon likes beer on tuesdays 10
2 steve tonight we dine in heck 20
3 steve tonight we dine in heck 20
I know how to do this with string values using group by and join, but this only works on string values. I'm having issues converting the entire column of age to a string data type in the dataframe.
Any suggestions?
Use GroupBy.first with GroupBy.transform if want repeat first values per groups:
g = df.groupby('name')
df['note'] = g['note'].transform(' '.join)
df['age'] = g['age'].transform('first')
If need processing multiple columns - it means all numeric with first and all strings by join you can generate dictionary by columns names with functions, pass to GroupBy.agg and last use DataFrame.join:
cols1 = df.select_dtypes(np.number).columns
cols2 = df.columns.difference(cols1).difference(['name'])
d1 = dict.fromkeys(cols2, lambda x: ' '.join(x))
d2 = dict.fromkeys(cols1, 'first')
d = {**d1, **d2}
df1 = df[['name']].join(df.groupby('name').agg(d), on='name')

Excel lookup value for multiple criteria and multiple columns

I am helping a friend with some data analysis in Excel.
Here's how our data looks like:
Car producer | Classification | Prices from 9 different vendors in 9 columns
AUDI | C | 100 200 300 400 500 600 700 800 900
AUDI | C | 100 900 800 200 700 300 600 400 500
AUDI | B | .. ..
Now, for each classification and each producer, we produced a list that shows which of the 9 vendors has offers the most lowest prices (in terms of count, so for example there are 2 cars from AUDI in the C class, so vendor A would offer the lowest price for both).
What we need: A way to calculate the average price for this vendor. So, if we see that the vendor A has the lowest price for AUDI cars in the C class, then we want to know the average price for vendor A for these cars.
I'm quite stumped since I can't use the "standard" index-match-small approach since the prices are stored in 9 different columns.
I've suggested to use a long if-chain like this: =if(vendor=A,averageif(enter the criteria and select the column of vendor A for average values),if(vendor=B,average(enter the criteria and select the column of vendor B for average values),... etc.).
But this method is obviously limited and does not scale well to higher dimensions.
We also would like to avoid using any addons.
You're going to need to create a separate table that has all unique classifications in the rows and all dealers in the columns (same as yours, but with duplicate rows removed). Then, in each cell, take the average price for that classification*vendor combination. This can be done by using a combination of sumif/countif. For example, if your second table had a column for classifications in cells M2:M[end], calculating the average price for the Audi C class offered by vendor 1 could be:
=sumif(C$2:C$[end],"="&$M2,$B$2:$B$[end])/countif($B$2:$B$[end],"="&$M2)
This would look something like this:
Then you could simply find the cheapest vendor by matching the min price. For example, the cheapest vendor for the audi C class in my example image would be:
=index($N$1:$V$1,match(min($N2:$V2),$N2:$V2,0))
A lot this could be done using PivotTables. If it is a one off thing, I would go that route, if it needs to be automated, then try using a multicondtional VLOOKUP (needs to be entered as a Matrix Formula: CTRL+ALT+SHIFT). This is simply an example, not based on your data:
{=VLOOKUP(A11&B11,CHOOSE({1\2},A2:A7&B2:B7,C2:C7),2,0)}
A better explanation is given here at chandoos site:http://chandoo.org/wp/2014/10/28/multi-condition-vlookup/

Count and average data inside categories under two types of conditions

I'm working on Excel with a lot of data and I'm having difficulty with knowing how to sort through it to get some important numbers. I have minimal Excel experience.
Right now I'm struggling with knowing how to get the average in the difference between two columns. The trick is that I have to get the average in difference when column A is less that column B and then, the same when it's more. And all that within a category.
So for example let's say I have 3 categories: Football, Soccer, and Basketball (these are just made up ones).
So in column A, I have: Soccer, Football, Basketball. Then, in column B and C, I have the scores for John and Adam for the last 3 months, respectively. Lastly, in column D, I have the differences between their scores.
So, for example:
Category John Adam Differences
Soccer 5 3 2
Soccer 6 2 4
Soccer 3 5 2
Soccer 4 0 4
I want to create a table for within each category I have a table like below:
NÂș of cases Avg. Difference between John and Adam
When John's score is >
When John's score is <
When they are equal
Is there some type of formula where I can say something like this:
If the category is Soccer (the category being in column A), take the difference between John's score (column B) and Adam's score (column C) when John's score is larger than Adam's score, then calculate the average of those differences? Then, I would use the same formula but tweak it when John's score is smaller.
Additionally, would there be a formula where I can also, calculate within the category Soccer, how many times John's score is bigger than Adam's?
My data is much larger and I can't do this manually.
A B C D
1 Sport John Adam Differences
2 Soccer 5 3 2
3 Soccer 6 2 4
4 Soccer 3 5 -2
5 Soccer 4 0 4
6 Basketball 20 15 5
7 Basketball 7 13 -6
8 Basketball 26 10 16
9 Basketball 8 11 -3
Type in D1:
=B1-C1
Drag the formula in Column D to all rows which there are values in columns A, B and C.
Create the PivotTable.
Drag Sport to "Row Label" field. Drag Differences to "Row Label" field under Sport.
Drag Differences to "Values" field as: Count of Differences (same way the previous question)
Drag Differences to "Values" field (below Count of Differences), and set the mathematical operation as "Average" of Differences (left-mouse click Differences, choose "Values fields settings" and select "Average").
Give a right-click mouse in cell A5 (see picture bellow) and select "Group" option.
Set "Starting at" = 0; "Ending at" = 1000; "By" = 1000 (as in the picture below). Click ok.
You will have in each Sport, the count (frequency) and average Differences values for two groups:
When the Difference B1-C1 is negative; and
When the Difference B1-C1 is zero or positive.
The average of Differences when the score is equal will be always zero.

Resources