Concatenate column values into a new string column based on the value in another column

I have a dataframe with the following format:
Item  Balance  Date
1     200000   1/1/2020
1     155000   2/1/2020
1     100000   3/1/2020
1     25000    4/1/2020
1     0        5/1/2020
2     100000   1/1/2020
2     15000    2/1/2020
2     0        3/1/2020
I would like to change the dataframe to the following format:
Item Cycle
1 4;2#01/01/2020;1000#02/01/2020;775#03/01/2020;500#04/01/2020;125#05/01/2020;0
2 2;2#01/01/2020;1000#02/01/2020;150#03/01/2020;0
The Cycle column is built as follows: the count of non-zero Balance values for the item (4 for item 1, 2 for item 2), followed by a ;, then a constant of 2, then a #, then the first date with an initial scaled value of 1000. Each subsequent entry is # + (the next date) + ; + (current balance of item / initial balance of item) * the initial scaled balance (1000), continuing until the item's balance reaches 0. When the balance is 0, the cycle closes with #(date in Date column);0. Please also note that dates take the form mm/dd/yyyy inside the Cycle value.
Thanks in advance for the help.

Assuming your Date column is already converted to datetime64:
def summarize(group):
    # The number of line items where Balance > 0
    count = (group['Balance'] > 0).sum()
    # Scale the data so that the initial balance = 1000
    scaled = pd.DataFrame({
        'Balance': group['Balance'] / group['Balance'].iloc[0] * 1000,
        'Date': group['Date'].dt.strftime('%m/%d/%Y')
    })
    # The lambda to produce a string like 01/01/2020;1000
    f = lambda row: f'{row["Date"]};{row["Balance"]:.0f}'
    # Join the balances together
    data = '#'.join(scaled.apply(f, axis=1))
    # The final string for each group
    return f'{count};2#{data}'

df.groupby('Item').apply(summarize)
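For reference, here is a minimal end-to-end sketch on the question's sample data (the result/Cycle variable names are illustrative only):
import pandas as pd

df = pd.DataFrame({
    'Item': [1, 1, 1, 1, 1, 2, 2, 2],
    'Balance': [200000, 155000, 100000, 25000, 0, 100000, 15000, 0],
    'Date': ['1/1/2020', '2/1/2020', '3/1/2020', '4/1/2020',
             '5/1/2020', '1/1/2020', '2/1/2020', '3/1/2020'],
})
df['Date'] = pd.to_datetime(df['Date'])       # convert to datetime64 first
result = df.groupby('Item').apply(summarize)  # Series of cycle strings, indexed by Item
result = result.rename('Cycle').reset_index() # back to the Item / Cycle layout
print(result)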


Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500 MB in size. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group, which denotes the sensor with which the parameter (column) was measured. The analysis therefore needs to iterate over each group and each month.
DF example
                     X  Y  Z  group
2019-02-01 09:30:07  1  2  1  'grp1'
2019-02-01 09:30:23  2  4  3  'grp2'
2019-02-01 09:30:38  3  6  5  'grp1'
...
Code (Functional, but slow)
This is the code that I used; the comments describe most lines. I recognize that the three for loops are causing this runtime issue, but I do not have the foresight to see a way around it. Does anyone know one?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M').mean()
# Store the monthly dates created in the last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on MultiIndex columns. Future note: use df[DATE, COL_NAME][UNIT] to access a mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0, 1, 1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through months
    for mnth in month_dates:
        # Make a mask for this month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df by divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
    print("###################", x.name, x['month'])
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregations, and divide the value by it
        print(column)
        mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
        print(mean_col)
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

normalize_cols = df.columns.tolist()
normalize_cols.remove('group')
#normalize_cols.remove('mode')
df2 = df.apply(find_norm, df_col_list=normalize_cols, axis=1)
The code runs perfectly for one iteration and then fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once; it then processes the same row again and fails. I see from the df.apply() documentation that the first row always runs twice. I'm just not sure why this fails the second time through.
Assuming the requirement is to normalize each column by its monthly mean within each group, here is another approach:
Step 1: Create new columns, month and year, from the index. df.index.month can be used for this, provided the index is of type DatetimeIndex:
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year
Step 2: Group over (grp, month, year) and aggregate to find the mean of every column. (year is included assuming the grouping occurs per grp per month per year; there is no need to add it if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Step 3: Use a function to calculate the normalized values, applied with apply() over the original dataframe:
def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregations, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

df2 = df.apply(find_norm, df_col_list=['A', 'B', 'C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
                     A  B  C  grp  month  year     normA  normB  normC
2019-02-01 09:30:07  1  2  3    1      2  2019  0.666667    0.8    1.5
2019-03-02 09:30:07  2  3  4    1      3  2019  1.000000    1.0    1.0
2019-02-01 09:40:07  2  3  1    2      2  2019  1.000000    1.0    1.0
2019-02-01 09:38:07  2  3  1    1      2  2019  1.333333    1.2    0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is what the code would look like:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year  # added year assuming the grouping occurs per grp per month per year
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregations, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

df2 = df.apply(find_norm, df_col_list=['A', 'B', 'C'], axis=1)
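Since runtime was the original concern, the join idea from the alternative to step 3 can also be done fully vectorized: groupby().transform('mean') broadcasts each (grp, month, year) mean back onto its rows in a single pass, avoiding the row-wise apply() entirely. A minimal sketch, assuming df has a DatetimeIndex and a grp column as in the example (normalize_by_monthly_mean is a hypothetical helper name):
import pandas as pd

def normalize_by_monthly_mean(df, cols):
    # Group keys: sensor group plus calendar month and year from the index
    keys = [df['grp'], df.index.month, df.index.year]
    for column in cols:
        # transform('mean') returns a Series aligned with df, holding each row's group mean
        monthly_mean = df.groupby(keys)[column].transform('mean')
        df['norm' + str(column)] = df[column] / monthly_mean
    return df

df2 = normalize_by_monthly_mean(df, ['A', 'B', 'C'])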

calculate percentage of occurrences in column pandas

I have a column with thousands of rows and I want to select the most significant ones. Let's say I want to select all the rows that would represent 90% of my sample. How would I do that?
I have a dataframe with 2 columns: one for product_id and one showing whether it was purchased or not (the value is 0 or 1).
product_id  purchased
a           1
b           0
c           0
d           1
a           1
.           .
.           .
With df['product_id'].value_counts() I can get all my product_ids ranked by number of occurrences.
Let's say now I want to get the number of product_ids that I should consider in my future analysis, i.e. those that represent 90% of the total occurrences.
Is there a way to do that?
If you want all product_id values whose cumulative relative frequency is under 0.9, use:
s = df['product_id'].value_counts(normalize=True).cumsum()
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
Or, if you want all rows sorted by counts and the top 90% of them:
s1 = df['product_id'].map(df['product_id'].value_counts()).sort_values(ascending=False)
df2 = df.loc[s1.index[:int(len(df) * 0.9)]]
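A quick check of the first option on a toy frame (hypothetical data, just to illustrate how the cumulative share filter behaves):
import pandas as pd

df = pd.DataFrame({'product_id': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'purchased':  [1, 1, 0, 0, 1, 1]})

# Relative frequency per product_id, most frequent first, then cumulated
s = df['product_id'].value_counts(normalize=True).cumsum()
# a    0.500000
# b    0.833333
# c    1.000000

# Keep only the products whose cumulative share stays under 0.9
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
print(df1['product_id'].unique())  # ['a' 'b']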

Calculate a value for 2 different set of IDs in excel

My Excel table has 5 columns: ID, ColA, ColB, Count and Test.
ID  A  B     Count  Test
2   a  low   5      -
2   b  high  6      -
2   c  low   7      -
2   d  high  8      -
2   e  low   9      -
1   a  low   1      =(1-5)
1   l  high  2      -
1   e  low   3      =(3-9)
I want to calculate the value of Test only for rows with ID = 1:
if the value of ColA for ID 1 = the value of ColA for ID 2 and
the value of ColB for ID 1 = the value of ColB for ID 2,
then calculate the difference between the Count values,
else
0.
The Excel table is connected to a SQL query, and every time I refresh it the table has a different number of rows.
I tried using VLOOKUP in the Test column where ID = 1 and specified the table array as the first 5 rows (only those with ID = 2), but that doesn't seem to work, because on the next refresh there may be only 2 rows for ID = 2.
I want the Test column value to be automatically calculated each time the table is refreshed. Thanks!
Use COUNTIFS to check whether a matching ID 2 row exists, and SUMIFS to return its Count value:
=IF(AND(A2=1,COUNTIFS(B:B,B2,C:C,C2,A:A,2)),D2-SUMIFS(D:D,B:B,B2,C:C,C2,A:A,2),0)

How do you search through a row (Row 1) of a CSV file, but search through the next row (Row 2) at the same time?

Imagine there are THREE columns and a certain number of rows in a dataframe. The first column holds random values, the second column Names, the third column Ages.
I want to search through every row of this dataframe and find where value 1 appears in the first column. Simultaneously, whenever value 1 does appear, I want to know whether value 2 appears in the SAME column in the next row.
If that is the case, copy that row's Value, Name and Age into an empty dataframe, for every row where the condition is met.
EmptyDataframe = pd.DataFrame(columns=['Name', 'Age'])
csvfile = pd.DataFrame(columns=['Value', 'Name', 'Age'])
row_for_csv_dataframe = next(csv.iterrows())
for index, row_for_csv_dataframe in csv.iterrows():
    if row_for_csv_dataframe['Value'] == '1':
        # How to code this:
        # if the NEXT row after row_for_csv_dataframe has 'Value' == 2,
        # then copy 'Age' and 'Name' from row_for_csv_dataframe into the empty DataFrame.
Assuming you have a dataframe data like this:
Value Name Age
0 1 Anne 10
1 2 Bert 20
2 3 Caro 30
3 2 Dora 40
4 1 Emil 50
5 1 Flip 60
6 2 Gabi 70
You could do something like this, although this is probably not the most efficient:
iterator1 = data.iterrows()
iterator2 = data.iterrows()
next(iterator2)  # advance the second iterator one row ahead
for current, ahead in zip(iterator1, iterator2):
    if current[1].Value == 1 and ahead[1].Value == 2:
        print(current[1].Value, current[1].Name, current[1].Age)
And you would get this result:
1 Anne 10
1 Flip 60
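A more idiomatic, vectorized alternative (a sketch against the same sample data) is to compare the Value column with itself shifted up one row, which avoids iterating entirely:
import pandas as pd

data = pd.DataFrame({
    'Value': [1, 2, 3, 2, 1, 1, 2],
    'Name':  ['Anne', 'Bert', 'Caro', 'Dora', 'Emil', 'Flip', 'Gabi'],
    'Age':   [10, 20, 30, 40, 50, 60, 70],
})

# True where this row's Value is 1 and the next row's Value is 2
mask = (data['Value'] == 1) & (data['Value'].shift(-1) == 2)
result = data.loc[mask, ['Value', 'Name', 'Age']]
print(result)
#    Value  Name  Age
# 0      1  Anne   10
# 5      1  Flip   60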

SUMPRODUCT with a conditional with two ranges to calculate

To calculate a margin (JAN) I need to calculate:
(sales(loja1)*margin(loja1) + sales(loja2)*margin(loja2) + sales(loja3)*margin(loja3))
/ SUM(sales(loja1); sales(loja2); sales(loja3))
but I need to make this using a SUMPRODUCT. I tried:
=SUMPRODUCT((B3:B11="sales")*(C3:C11);(B3:B11="margin")*C3:C11))/SUMPRODUCT((B3:B11="sales")*(C3:C11))
but it gave an error!
When SUMPRODUCT is used to select cells within a range by a text condition, each comparison evaluates to TRUE or FALSE. You need to convert these to 1's and 0's by placing '--' before the expression, so that when you multiply it by another range of cells you get the expected value.
SUMPRODUCT example: sum of column B where column A is equal to "Sales"
A B
1 | Sales 5
2 | Sales 6
3 | Margin 3
4 | Margin 2
Resulting Formula =SUMPRODUCT(--(A1:A4 = "Sales"),B1:B4)
How SUMPRODUCT works:
First, an array is returned that has TRUE for each value in A1:A4 that equals "Sales", and FALSE for each value that doesn't:
Sales  -> TRUE
Sales  -> TRUE
Margin -> FALSE
Margin -> FALSE
Then the double negative converts TRUE to 1 and FALSE to 0:
1
1
0
0
Next, the first array (now the one with 1's and 0's) is multiplied by your second array (B1:B4) to get a new array
1st 2nd New Array
1 * 5 = 5
1 * 6 = 6
0 * 3 = 0
0 * 2 = 0
Finally all the values in the new array are summed to get your result (5+6+0+0 = 11)
Step 1:
For your scenario, you're going to need to find the sales amount for each location and multiply it by the margin for the corresponding location:
location 1: sales * margin
=SUMPRODUCT(--(A3:A11="loja1"),--(B3:B11="sales"),(C3:C11)) * SUMPRODUCT(--(A3:A11="loja1"),--(B3:B11="margin"),(C3:C11))
You can do a similar formula for location 2 and 3 and then sum them all together.
Step 2:
To sum the sales for all locations, you can use a similar formula, again using the double negative ('--'):
SUMPRODUCT(--(B3:B11="sales"),(C3:C11))
The resulting formula will be a bit long, but when you divide Step 1 by Step 2, you'll get the desired result
