Using a for loop to create additional columns that calculate percentages based on multiple conditions - python-3.x

I have a dataframe with survey data. The survey data assesses multiple factors like coaching, diversity, engagement etc. There are also several other columns which capture demographic data (e.g., age, department etc). I would like to add columns based on the columns that contain the ratings.
The purpose of adding the columns is to a) provide a count of Favourable responses, b) get the percentage of Favourable responses (number of Favourable responses / number of items in that factor), and c) get the percentage of Favourable responses at the factor level (with the condition that if there are missing responses for any item, the factor-level value would be NULL).
The table below shows the desired output with only the Coaching items and factor included. The actual table contains other rating columns, and the same logic should apply to factors like Diversity, Leadership, Engagement, etc.
Coach_q1 Coach_q2 Coach_q3 coach_fav_count coach_fav_perc coach_agg_perc
Favourable Neutral Favourable 2 66% 66%
Favourable Favourable Favourable 3 100% 100%
NaN Favourable NaN 1 33% NaN
Favourable NaN Favourable 2 66% NaN
The following code works in getting the _fav_count columns and the _fav_perc columns. The ratingcollist is used to apply the transformations only to columns with those prefixes. However, I am unable to get the factor-level column, which should give the percentage of favourable responses for the entire factor ONLY if all questions were answered for that particular factor (i.e., if there were missing responses in any of the items within a particular factor, then the factor would yield a NaN value).
I'd appreciate any help I can get, thank you.
ratingcollist = ['Coach_', 'Diversity_', 'Leadership_', 'Engagement_']

# create a for loop to get all the columns that match the column list keyword
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    # create 2 new columns for each factor, one for count of Favourable
    # responses and one for percentage of Favourable responses
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = (df[f'{rat.lower()}fav_count'] / len(cols)) * 100

You can add a mask that tests whether all values are non-missing, using DataFrame.notna with DataFrame.all, and assign the percentage column only for those rows with DataFrame.loc:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Coach_q1': ['Favourable', 'Favourable', np.nan, 'Favourable'],
                   'Coach_q2': ['Neutral', 'Favourable', 'Favourable', np.nan],
                   'Coach_q3': ['Favourable', 'Favourable', np.nan, 'Favourable']})

ratingcollist = ['Coach_', 'Diversity_', 'Leadership_', 'Engagement_']

# create a for loop to get all the columns that match the column list keyword
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    # mask of rows where every item of this factor was answered
    mask = df[cols].notna().all(axis=1)
    # create 2 new columns for each factor, one for count
    # of Favourable responses and one for percentage of Favourable responses
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = (df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)) * 100

print(df)
Coach_q1 Coach_q2 Coach_q3 coach_fav_count coach_fav_perc
0 Favourable Neutral Favourable 2 66.666667
1 Favourable Favourable Favourable 3 100.000000
2 NaN Favourable NaN 1 NaN
3 Favourable NaN Favourable 2 NaN
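If you also want to keep the unconditional percentage and add the factor-level column as a separate coach_agg_perc column (as in the desired output), one possible variation inside the same loop is Series.where, which keeps the value where the mask is True and puts NaN elsewhere. A sketch, reusing the names from the question:

# inside the if-block above, after the count column has been created
df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / len(cols) * 100
df[f'{rat.lower()}agg_perc'] = df[f'{rat.lower()}fav_perc'].where(mask)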

Related

How to map sales against purchases sequentially using python?

I have a transaction dataframe as under:
Item Date Code Qty Price Value
0 A 01-01-01 Buy 10 100.5 1005.0
1 A 02-01-01 Buy 5 120.0 600.0
2 A 03-01-01 Sell 12 125.0 1500.0
3 A 04-01-01 Buy 9 110.0 990.0
4 A 04-01-01 Sell 1 100.0 100.0
# and so on... there are a million rows with about a thousand items (here just one item, A)
What I want is to map each selling transaction against purchase transactions in a sequential, FIRST IN FIRST OUT manner. So, the purchase that was made first will be sold off first.
For this, I have added a new column bQty with the opening balance equal to the purchase quantity. Then I iterate through the dataframe for each sell transaction to set the sold quantity off against the purchase transactions before that date.
df['bQty'] = df.loc[df['Code'] == 'Buy', 'Qty']
for _, sell in df[df['Code'] == 'Sell'].iterrows():
    for _, buy in df[(df['Code'] == 'Buy') & (df['Date'] <= sell['Date'])].iterrows():
        #code#
Now this requires me to go through the whole dataframe again and again for each sell transaction.
For 1000 records it takes about 10 seconds to complete. So, we can assume that for a million records, this approach will take a lot of time.
Is there any faster way to do this?
If you are only interested in the resulting final balance values per item, here is a fast way to calculate them:
Add two additional columns that contain the same absolute values as Qty and Value, but with a negative sign in those rows where the Code value is Sell. Then you can group by item and sum these values for each item, to get the remaining number of items and the money spent for them on balance.
sale = df.Code == 'Sell'
df['Qty_signed'] = df.Qty.copy()
df.loc[sale, 'Qty_signed'] *= -1
df['Value_signed'] = df.Value.copy()
df.loc[sale, 'Value_signed'] *= -1
qty_remaining = df.groupby('Item')['Qty_signed'].sum()
print(qty_remaining)
money_spent = df.groupby('Item')['Value_signed'].sum()
print(money_spent)
Output:
Item
A 11
Name: Qty_signed, dtype: int64
Item
A 995.0
Name: Value_signed, dtype: float64
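As a minor variation on the code above, the two signed columns can also be built in one step with numpy.where (a sketch using the same column names):

import numpy as np

sign = np.where(df['Code'] == 'Sell', -1, 1)   # -1 for sales, +1 for purchases
df['Qty_signed'] = df['Qty'] * sign
df['Value_signed'] = df['Value'] * sign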

calculate percentage of occurrences in column pandas

I have a column with thousands of rows. I want to select the top significant ones. Let's say I want to select all the rows that would represent 90% of my sample. How would I do that?
I have a dataframe with 2 columns, one for product_id and one showing whether it was purchased or not (the value is 0 or 1)
product_id purchased
a 1
b 0
c 0
d 1
a 1
. .
. .
with df['product_id'].value_counts() I can have all my product-ids ranked by number of occurrences.
Let's say now I want to get the number of product_ids that I should consider in my future analysis that would represent 90% of the total occurrences.
Is there a way to do that?
If you want all product_id values whose cumulative normalized counts stay under 0.9, use:
s = df['product_id'].value_counts(normalize=True).cumsum()
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
Or, if you want all rows sorted by counts and to keep 90% of them:
s1 = df['product_id'].map(df['product_id'].value_counts()).sort_values(ascending=False)
df2 = df.loc[s1.index[:int(len(df) * 0.9)]]
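If the goal is just to count how many distinct product_ids are needed to cover 90% of the rows, one possible sketch building on the same cumulative sum (the variable names here are illustrative):

s = df['product_id'].value_counts(normalize=True).cumsum()
# products fully below 90%, plus the one that crosses the 90% line
n_products = int((s < 0.9).sum()) + 1
print(n_products)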

How to find columns/Features for which there are at least X percentage of rows with Identical Values? [Python]

Let us say I have an extremely large dataset with 'N' rows and 'M' features. I also have two inputs.
'm': defines the number of features to check (m < M)
'support' = identical rows / total rows for the chosen subset of 'm' features. This is basically the minimum percentage of identical rows when considering 'm' features at a time.
I need to return the groups of features for which the 'support' value is greater than a predefined value.
For Example, let us take this dataset:
import pandas as pd

d = {
    'A': [100, 200, 200, 400, 400],
    'B': [1, 2, 2, 4, 5],
    'C': ['2018-11-19', '2018-11-19', '2018-12-19', '2018-11-19', '2018-11-19']
}
df = pd.DataFrame(data=d)
A B C
0 100 1 2018-11-19
1 200 2 2018-11-19
2 200 2 2018-12-19
3 400 4 2018-11-19
4 400 5 2018-11-19
In the above example, let us say that
'm' = 2
'support' = 0.4
Then the function should return both ['A','B'] and ['A','C'], as each of these feature pairs, when considered together, has at least 2 identical rows out of a total of 5 rows (>= 0.4).
I realize that a naive solution would be to compare all combinations of 'm' features out of 'M' and check the percentage of identical rows. However, this gets incredibly expensive once the number of features reaches double digits, especially with thousands of rows. What would be an optimized way to tackle this problem?
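For reference, a hedged baseline of the naive approach described above, reading "support" as the share of rows that have at least one identical row over the chosen columns (this reading reproduces ['A','B'] and ['A','C'] on the sample df; the support helper name is mine):

from itertools import combinations

def support(frame, cols):
    # share of rows that share their values with at least one other row
    sizes = frame.groupby(list(cols)).size()
    return sizes[sizes > 1].sum() / len(frame)

m, min_support = 2, 0.4
result = [list(combo) for combo in combinations(df.columns, m)
          if support(df, combo) >= min_support]
print(result)   # [['A', 'B'], ['A', 'C']]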

How to join two dataframes for which column time values are within a certain range and are not datetime or timestamp objects?

I have two dataframes as shown below:
time browncarbon blackcarbon
181.7335 0.105270 NaN
181.3809 0.166545 0.001217
181.6197 0.071581 NaN
422 rows x 3 columns
start end toc
179.9989 180.0002 155.0
180.0002 180.0016 152.0
180.0016 180.0030 151.0
1364 rows x 3 columns
The first dataframe has a time column with instants every four minutes. The second dataframe has two time columns spaced every two minutes. The time columns do not start and end at the same time; however, they contain data collected over the same day. How could I make another dataframe containing:
time browncarbon blackcarbon toc
422 rows X 4 columns
There is a related answer on Stack Overflow, however, that is applicable only when the time columns are datetime or timestamp objects. The link is: How to join two dataframes for which column values are within a certain range?
Addendum 1: The multiple start and end rows that fall within one of the time rows should also correspond to one toc row, as they do right now; however, that value should be the average of the multiple toc rows, which is not the case presently.
Addendum 2: Merging two pandas dataframes with complex conditions
We create an artificial key column to do an outer merge and get the cartesian product back (all combinations of rows). Then we filter the rows where time falls within the range with .query.
note: I edited the value of one row so we can get a match (see row 0 in example dataframes on the bottom)
df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')\
.query('(time >= start) & (time <= end)')\
.drop(['key', 'start', 'end'], axis=1)
output
time browncarbon blackcarbon toc
1 180.0008 0.10527 NaN 152.0
Example dataframes used:
df1:
time browncarbon blackcarbon
0 180.0008 0.105270 NaN
1 181.3809 0.166545 0.001217
2 181.6197 0.071581 NaN
df2:
start end toc
0 179.9989 180.0002 155.0
1 180.0002 180.0016 152.0
2 180.0016 180.0030 151.0
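Addendum 1 asks for toc to be averaged when several (start, end) rows match the same time value; a minimal sketch of how that could be layered on top of this cross-join approach (assuming the df1/df2 above):

toc_by_time = (df1[['time']].assign(key=1)
                 .merge(df2.assign(key=1), on='key', how='outer')
                 .query('(time >= start) & (time <= end)')
                 .groupby('time', as_index=False)['toc'].mean())
result = df1.merge(toc_by_time, on='time', how='left')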
Since the start and end intervals are mutually exclusive, we may be able to create new columns in df2 containing all the integer values in the range floor(start) to floor(end). Then add another column in df1 as floor(time) and take a left outer join of df1 and df2. I think that should do it, except that you may have to remove NaN values and extra columns if required. If you send me the csv files, I may be able to send you the script. I hope I answered your question.
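A minimal sketch of that floor-key idea, as one possible interpretation of the description above (assumes pandas >= 0.25 for DataFrame.explode; df1/df2 named as in the question):

import numpy as np

# one key per integer between floor(start) and floor(end), so intervals that
# cross an integer boundary still get matched
df2e = df2.copy()
df2e['key'] = [list(range(int(np.floor(s)), int(np.floor(e)) + 1))
               for s, e in zip(df2e['start'], df2e['end'])]
df2e = df2e.explode('key')
df2e['key'] = df2e['key'].astype(int)

out = (df1.assign(key=np.floor(df1['time']).astype(int))
          .merge(df2e, on='key', how='left')
          .query('start <= time <= end')
          .drop(columns=['key', 'start', 'end']))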
Perhaps you could just convert your columns to Timestamps and then use the answer in the other question you linked
from pandas import Timestamp
from dateutil.relativedelta import relativedelta as rd

def to_timestamp(x):
    return Timestamp(2000, 1, 1) + rd(days=x)

df['start_time'] = df.start.apply(to_timestamp)
df['end_time'] = df.end.apply(to_timestamp)
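Presumably the first dataframe's time column would need the same conversion before applying the linked answer (assuming it lives in df1, as in the question; the new column name is illustrative):

df1['time_ts'] = df1.time.apply(to_timestamp)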
Your 2nd data frame is too short, so it wouldn't reflect a meaningful merge. So I modified it a little:
df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016, 181.3, 181.5, 181.7],
                    'end': [180.0002, 180.0016, 180.003, 181.5, 185.7, 181.8],
                    'toc': [155.0, 152.0, 151.0, 150.0, 149.0, 148.0]})

df1['Rank'] = np.arange(len(df1))
new_df = pd.merge_asof(df1.sort_values('time'), df2,
                       left_on='time',
                       right_on='start')
gives you:
time browncarbon blackcarbon Rank start end toc
0 181.3809 0.166545 0.001217 1 181.3 181.5 150.0
1 181.6197 0.071581 NaN 2 181.5 185.7 149.0
2 181.7335 0.105270 NaN 0 181.7 181.8 148.0
from which you can drop the extra columns and sort_values on Rank. For example:
new_df.sort_values('Rank').drop(['Rank','start','end'], axis=1)
gives:
time browncarbon blackcarbon toc
2 181.7335 0.105270 NaN 148.0
0 181.3809 0.166545 0.001217 150.0
1 181.6197 0.071581 NaN 149.0
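One caveat worth keeping in mind with pd.merge_asof: both key columns must be sorted, and the (default, backward) match picks the nearest start at or before each time, so it does not by itself enforce time <= end. If df2 were not already ordered, it would need sorting first, e.g.:

df2 = df2.sort_values('start')   # merge_asof requires the right key to be sorted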

Resampling on non-time related buckets

Say I have a df looking like this:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
There are no time intervals here. I need to calculate a Volume Weighted Average Price for every 50 items in quantity. Every row (index) in the output would represent 50 units (as opposed to, say, 5-min intervals), and the output column would be the volume weighted price.
Is there any neat way to do this using pandas, or numpy for that matter? I tried using a loop that splits every row into single-unit prices and then groups them like this:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
But it takes forever and I run out of memory. The df is a few million rows.
EDIT:
The output I want to see based on the above is:
vwap
0 101.20
1 102.12
2 103.36
3 101.00
Each 50 items gets a new average price.
I struck out on my first at-bat facing this problem. Here's my next plate appearance. Hopefully I can put the ball in play and score a run.
First, let's address some of the comments related to the expected outcome of this effort. The OP posted what he thought the results should be using the small sample data he provided. However, @user7138814 and I both came up with the same outcome that differed from the OP's. Let me explain how I believe the weighted average of exactly 50 units should be calculated using the OP's example. I'll use this worksheet as an illustration.
The first 2 columns (A and B) are the original values given by the OP. Given those values, the goal is to calculate a weighted average for each block of exactly 50 units. Unfortunately, the quantities are not evenly divisible by 50. Columns C and D represent how to create even blocks of 50 units by subdividing the original quantities as needed. The yellow shaded areas show how the original quantity was subdivided, and each of the green bounded cells sums to exactly 50 units. Once 50 units are determined, the weighted average can be calculated in column E. As you can see, the values in E match what @user7138814 posted in his comment, so I think we agree on the methodology.
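As a quick brute-force sanity check of that block logic on the sample data (my own illustration, not the worksheet; np.repeat would not scale to millions of rows):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, 102, 105, 99, 104, 103, 101],
                   'quantity': [20, 31, 25, 40, 10, 20, 55]})

# expand each row into one price per unit, then average complete blocks of 50
units = np.repeat(df['price'].values.astype(float), df['quantity'].values)
blocks = units[: len(units) // 50 * 50].reshape(-1, 50)
print(blocks.mean(axis=1))   # [101.2  102.06 101.76 101.  ]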
After much trial and error, the final solution is a function that operates on the numpy arrays of the underlying price and quantity series. The function is further optimized with the Numba decorator to jit-compile the Python code into machine-level code. On my laptop, it processed 3-million-row arrays in a second.
Here's the function.
import numpy as np
import numba

@numba.jit
def vwap50_jit(price_col, quantity_col):
    n_rows = len(price_col)
    assert len(price_col) == len(quantity_col)
    qty_cumdif = 50   # cum difference of quantity to track when 50 units are reached
    pq = 0.0          # cumsum of price * quantity
    vwap50 = []       # list of weighted averages
    for i in range(n_rows):
        price, qty = price_col[i], quantity_col[i]
        # if current qty will cause more than 50 units
        # divide the units
        if qty_cumdif < qty:
            pq += qty_cumdif * price
            # at this point, 50 units accumulated. calculate average.
            vwap50.append(pq / 50)
            qty -= qty_cumdif
            # continue dividing
            while qty >= 50:
                qty -= 50
                vwap50.append(price)
            # remaining qty and pq become starting
            # values for next group of 50
            qty_cumdif = 50 - qty
            pq = qty * price
        # process price, qty pair as-is
        else:
            qty_cumdif -= qty
            pq += qty * price
    return np.array(vwap50)
Results of processing the OP's sample data.
Out[6]:
price quantity
0 100 20
1 102 31
2 105 25
3 99 40
4 104 10
5 103 20
6 101 55
vwap50_jit(df.price.values, df.quantity.values)
Out[7]: array([101.2 , 102.06, 101.76, 101. ])
Notice that I use .values to pass the numpy arrays of the pandas series. That's one of the requirements of using numba: numba is numpy-aware and doesn't work on pandas objects.
It performs pretty well on 3 million row arrays, creating an output array of 2.25 million weighted averages.
df = pd.DataFrame({'price': np.random.randint(95, 150, 3000000),
                   'quantity': np.random.randint(1, 75, 3000000)})
%timeit vwap50_jit(df.price.values, df.quantity.values)
154 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vwap = vwap50_jit(df.price.values, df.quantity.values)
vwap.shape
Out[11]: (2250037,)
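If the result should look like the vwap frame sketched in the question, the returned array can simply be wrapped in a DataFrame (the column name here is illustrative):

vwap_df = pd.DataFrame({'vwap': vwap50_jit(df.price.values, df.quantity.values)})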
