Add two columns to a pandas DataFrame based on condition - python-3.x

I am trying to add two new columns with different values based on two conditions.
Source sample data for left and right DataFrames
Left DataFrame:
id     rec_type  end_date
13759  U         20210113
23806  N         NaN
21347  U         20210113
36904  N         NaN

Right DataFrame:
id
23806
21347
Expected output:
id     rec_type  end_date  _merge     error_code  error_description
13759  U         20210113  left_only  601         update record not available in right table
23806  N         NaN       both       0           0
21347  U         20210113  both       0           0
36904  N         NaN       left_only  602         New record not available in right table
I am using np.select (NumPy) to do this, as in the code below, but I get a warning.
import pandas as pd
import numpy as np

merged_df = pd.merge(left_df, right_df,
                     how='outer',
                     on=['id'],
                     indicator=True)
merged_df = merged_df.query('_merge != "right_only"')
conditions = [((merged_df['_merge'] == "left_only") &
               (merged_df['rec_type'] == "U") &
               (merged_df['end_date'].notnull())),
              ((merged_df['_merge'] == "left_only") &
               (merged_df['rec_type'] == "N") &
               (merged_df['end_date'].isnull()))]
error_codes = dict()
error_codes['error_code'] = [601, 602]
error_codes['error_description'] = ['update record not available in right table',
                                    'New record not available in right table']
merged_df['error_code'] = np.select(conditions, error_codes['error_code'])
merged_df['error_description'] = np.select(conditions, error_codes['error_description'])
This is the warning I get; please share suggestions to resolve it.
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
validate_df['error_code'] = np.select(conditions,
error_codes['error_code'])
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
validate_df['error_description'] = np.select(conditions,
error_codes['error_description'])
Thanks,
Raghunath.
Note: Code is working fine with sample data but with more data, getting above error

I am able to resolve the issue by changing
merged_df = merged_df.query('_merge != "right_only"')
to below code
merged_df = merged_df[merged_df._merge != "right_only"]
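Another option that is often suggested for SettingWithCopyWarning, shown here as a minimal sketch and not as the poster's own fix, is to take an explicit .copy() of the filtered frame before adding columns. The sample frames below are rebuilt from the tables in the question; storing end_date as strings and using the string "0" as the default description are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (end_date kept as strings: an assumption).
left_df = pd.DataFrame({
    "id": [13759, 23806, 21347, 36904],
    "rec_type": ["U", "N", "U", "N"],
    "end_date": ["20210113", np.nan, "20210113", np.nan],
})
right_df = pd.DataFrame({"id": [23806, 21347]})

merged_df = pd.merge(left_df, right_df, how="outer", on=["id"], indicator=True)

# .copy() makes the filtered result an independent DataFrame, so adding
# columns to it afterwards is unambiguous.
merged_df = merged_df[merged_df["_merge"] != "right_only"].copy()

left_only = merged_df["_merge"] == "left_only"
conditions = [
    left_only & (merged_df["rec_type"] == "U") & merged_df["end_date"].notnull(),
    left_only & (merged_df["rec_type"] == "N") & merged_df["end_date"].isnull(),
]
descriptions = [
    "update record not available in right table",
    "New record not available in right table",
]

# Rows that match no condition get the default value.
merged_df["error_code"] = np.select(conditions, [601, 602], default=0)
# A string default keeps the choices and the default the same type.
merged_df["error_description"] = np.select(conditions, descriptions, default="0")

print(merged_df)
```

Since rows that exist only in the right table are discarded anyway, how='left' in the merge would probably give the same rows without the extra filter.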

Related

Why is pandas.to_csv dropping numbers when trying to preserve NaN?

Given a pandas dataframe
df = pd.DataFrame([(290122, 0.20, np.nan),
(1900, 1.20, "ABC")],
columns = ("number", "x", "class")
)
   number    x class
0  290122  0.2   NaN
1    1900  1.2   ABC
Then exporting it to a csv, I would like to keep the NaN, e.g. as "NULL" or "NaN",
df.to_csv("df.csv", encoding="utf-8", index=False, na_rep="NULL")
Yet, opening the csv I get the following:
That is, the last two digits of number in the first cell are dropped.
Here is the output opened in text editor:
number,x,class
2901,0.20,NULL
1900,1.20,ABC
As mentioned, when dropping the na_rep argument, I obtain the expected output:
number,x,class
290122,0.20,
1900,1.20,ABC
Yes, this is in fact a bug in pandas 1.0.0, fixed in 1.0.1. See the release notes and https://github.com/pandas-dev/pandas/issues/25099.
Depending on your data, a quick workaround could be:
import numpy as np
import pandas as pd

na_rep = 'NULL'
# Only pandas 1.0.0 needs the padded na_rep; other versions use it unchanged.
na_rep_wrk = 8 * na_rep if pd.__version__ == '1.0.0' else na_rep
data = [(290122, 0.20, 'NULL'), (2**40 - 1, 3.20, 'NULL'), (1900, 1.20, "ABC")]
df = pd.DataFrame(data, columns=("number", "x", "class"))
df.to_csv("df.csv", encoding="utf-8", index=False, na_rep=na_rep_wrk)
df2 = pd.read_csv('df.csv', keep_default_na=False)
assert np.all(df == df2)
This gives the csv file:
number,x,class
290122,0.2,NULL
1099511627775,3.2,NULL
1900,1.2,ABC
In CSV files written by pandas, np.nan is stored as '' (an empty string) by default, so don't use na_rep='NULL'; instead, when reading the file back with pandas, do this:
for col in df.columns:
    df[col] = df[col].apply(lambda x: np.nan if x == '' else x)
All empty strings are read as NaN by default anyway, but this is still useful just to be safe.
I have faced this issue myself before and did not find another way to fix it; if I do find one (saving NaN or NULL flawlessly), I'll update here.
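For comparison, here is a small round-trip sketch on a pandas version that includes the fix mentioned above (1.0.1 or later, which is an assumption about the reader's environment). It writes to an in-memory buffer instead of df.csv so it leaves no file behind, and it relies on 'NULL' being one of the strings read_csv treats as missing by default.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame([(290122, 0.20, np.nan),
                   (1900, 1.20, "ABC")],
                  columns=("number", "x", "class"))

# Write the CSV into memory, representing missing values as "NULL".
buffer = io.StringIO()
df.to_csv(buffer, index=False, na_rep="NULL")
csv_text = buffer.getvalue()
print(csv_text)

# Read it back; "NULL" is parsed as NaN by read_csv's default settings.
df2 = pd.read_csv(io.StringIO(csv_text))
print(df2)
```

If "NULL" has to survive as a literal string on the way back in, keep_default_na=False (as used in the workaround above) turns off that default parsing.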

Summarize non-zero values or any values from pandas dataframe with timestamps- From_Time & To_Time

I have a dataframe given below
I want to extract all the non-zero values from each column and summarize them like this
If a value repeats over a period of time, the start time of that run should go in the 'FROM' column and the end time in the 'TO' column, with the column name in the 'BLK-ASB-INV' column and the value in the 'Scount' column. I have started writing the code like this:
import pandas as pd

df = pd.read_excel("StringFault_Bagewadi_16-01-2020.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
for col in df.columns:
    ss=df[col].iloc[df[col].to_numpy().nonzero()[0]]
    .......
After that I cannot work out how to get the desired output. Is there a way to do this in Python? Thanks in advance for any help.
Finally I solved my problem; the code below works for me.
import pandas as pd

df = pd.read_excel("StringFault.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
# Note: DataFrame.append, used below, was removed in pandas 2.0.
for col in df.columns:
    # device collects [column name, value, FROM, TO] for the current run
    device = []
    for i in range(len(df[col])):
        if df[col][i] == 0:
            pass
        else:
            # The next value is the same: the run continues.
            if i < len(df[col])-1 and df[col][i]==df[col][i+1]:
                try:
                    if df[col].index[i] > device[2]:
                        continue
                except IndexError:
                    # No run recorded yet: start one here.
                    device.append(df[col].name)
                    device.append(df[col][i])
                    device.append(df[col].index[i])
                    continue
            else:
                if len(device)==3:
                    # Close the open run and store it.
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV':device[0], 'Scount':device[1], 'FROM':device[2], 'TO': device[3]}, ignore_index=True)
                    device=[]
                else:
                    # Isolated value: record a run of its own.
                    device.append(df[col].name)
                    device.append(df[col][i])
                    if i == 0:
                        device.append(df[col].index[i])
                    else:
                        device.append(df[col].index[i-1])
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV':device[0], 'Scount':device[1], 'FROM':device[2], 'TO': device[3]}, ignore_index=True)
                    device=[]
For reference, here is the output dataframe
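The same kind of summary can probably be produced with less index bookkeeping by labelling runs of consecutive equal values and grouping on that label. The sketch below is not the poster's code: summarize_runs and the sample frame are hypothetical names and data, the input is assumed to contain no NaN, and an isolated non-zero value gets the same timestamp for FROM and TO, which differs from the loop above (that loop uses the previous timestamp as FROM in that case).

```python
import pandas as pd


def summarize_runs(df):
    """Return one row per run of identical non-zero values in each column."""
    rows = []
    for col in df.columns:
        s = df[col]
        # A new run starts wherever the value differs from the previous one.
        run_id = (s != s.shift()).cumsum()
        for _, run in s.groupby(run_id):
            value = run.iloc[0]
            if value != 0:
                rows.append({"BLK-ASB-INV": col, "Scount": value,
                             "FROM": run.index[0], "TO": run.index[-1]})
    return pd.DataFrame(rows, columns=["BLK-ASB-INV", "Scount", "FROM", "TO"])


# Hypothetical sample: two devices sampled every five minutes.
idx = pd.date_range("2020-01-16 10:00", periods=6, freq="5min")
sample = pd.DataFrame({"A": [0, 2, 2, 0, 1, 1],
                       "B": [0, 0, 3, 0, 0, 0]}, index=idx)
result = summarize_runs(sample)
print(result)
```

Collecting the rows in a list and building the DataFrame once at the end also avoids repeated appends, which matters on longer files.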

How do I add a dynamic list of variable to the command pd.concat

I am using python3 and pandas to create a script that will:
Be dynamic across different dataset lengths(rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is that I am unable to build a list of the variables in use and pass them as the argument to pd.concat.
The sample dataset. The dataset may have more unique BrandFlavors or less which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
colarr = df.columns.values
arr = df[colarr[0]].unique()
for i in range(len(arr)):
    globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]
for i in range(len(arr)):
    if globals()['var%s' % i].empty:
        ''
    else:
        globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor':'Total',
            'This':globals()['var%s' % i]['This'].sum(),
            'Last':globals()['var%s' % i]['Last'].sum(),
            'Diff':globals()['var%s' % i]['Diff'].sum(),
            '% Chg':globals()['var%s' % i]['Diff'].sum()/globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
        globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
        globals()['var%s' % i].fillna(' ', inplace=True)
I have tried the code below; however, the list is a series of strings:
vararr = []
count = 0
for x in range(len(arr)):
    vararr.append('var' + str(count))
    count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tried to build a class with an arg defined but had the same issue.
The desired outcome would be a code snippet that generates a list of variables matching the ones created by lines 9/10, which could then be referenced by pd.concat([list, of, vars, here]). It must be dynamic. Thank you.
Just fixing the issue at hand: you shouldn't use globals() to create variables, as that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')

excel_file = 'testfile.xlsx'
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)

def good_dfs(dataframe):
    # Leave empty groups untouched.
    if dataframe.empty:
        return dataframe
    this = dataframe.This.sum()
    last = dataframe.Last.sum()
    diff = dataframe.Diff.sum()
    data = {
        'BrandFlavor': 'Total',
        'This': this,
        'Last': last,
        'Diff': diff,
        # '% Chg' matches the column name used in the question's data
        '% Chg': diff / last * 100
    }
    # Keep the result: adding a row returns a new DataFrame instead of
    # modifying the original in place.
    dataframe = pd.concat([dataframe, pd.DataFrame([data])], ignore_index=True)
    dataframe['% Chg'] = dataframe['% Chg'].fillna(0.0)
    return dataframe.fillna(' ')

colarr = df.columns.values
arr = df[colarr[0]].unique()
dfs = []
for i in range(len(arr)):
    temp = df.loc[df[colarr[0]] == arr[i]]
    dfs.append(temp)
final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
That said, there are far easier ways to accomplish what you want without doing all of this, but that can be a separate question.
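One possible shape for such an easier approach, offered as a sketch under assumptions and not as the answerer's intended solution, is to let groupby split the frame and to build the list for pd.concat in a comprehension. The sample data and the helper name with_total are hypothetical; the column names follow the question.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
sample = pd.DataFrame({
    "BrandFlavor": ["Cola", "Cola", "Lime"],
    "This": [10, 5, 8],
    "Last": [8, 4, 10],
    "Diff": [2, 1, -2],
    "% Chg": [25.0, 25.0, -20.0],
})


def with_total(group):
    """Return the group with a 'Total' row added at the bottom."""
    sums = group[["This", "Last", "Diff"]].sum()
    total = pd.DataFrame([{
        "BrandFlavor": "Total",
        "This": sums["This"],
        "Last": sums["Last"],
        "Diff": sums["Diff"],
        "% Chg": sums["Diff"] / sums["Last"] * 100,
    }])
    return pd.concat([group, total], ignore_index=True)


# groupby yields one sub-frame per unique BrandFlavor, however many there are,
# so no dynamically named variables are needed.
pieces = [with_total(g) for _, g in sample.groupby("BrandFlavor", sort=False)]
final_df = pd.concat(pieces, ignore_index=True)
print(final_df)
```

The list comprehension is the "dynamic list of variables": pd.concat only needs a list of DataFrame objects, not their names.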

Resample time series after removing top x percentile data

I have hourly time series data (say df, with date/time and value columns) where I want to:
Step 1: Remove the top 5 percentile of each day
Step 2: Get the max(Step 1) for each day
Step 3: Get the mean(Step 2) for each month
Here is what I have tried to implement the above logic:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
Even though I do not get any code error, the generated output is different to the expected result based on the above 3 steps (I always get a constant value)
Any help will be appreciated.
You are almost there. Your step_1 is a series of booleans with the same index as the original data, you can use it to filter your DataFrame, thus:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = df[step_1].resample('D').max()
step_3 = step_2.resample('M').mean()
Your first step is a boolean mask, so you need to add an additional step:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000),
                  index=pd.date_range(start='1/1/2019', periods=1000, freq='H'),
                  columns=['my_data'])
mask = df.resample('D').apply(lambda x: x < x.quantile(.95))
step_1 = df[mask]
# Each step must build on the previous one, not on the original df.
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
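The three steps can also be written with groupby(...).transform, which broadcasts each day's 95th percentile back onto the hourly index so the mask lines up with the data by construction. This is a sketch on synthetic data, not the answerers' code: the series name and the random values are made up, and the month-start alias 'MS' is used for the monthly step because newer pandas releases deprecate 'M' (the means are the same, only the labels differ).

```python
import numpy as np
import pandas as pd

# Synthetic hourly data covering 60 days (made up for illustration).
rng = np.random.default_rng(0)
index = pd.date_range("2019-01-01", periods=24 * 60, freq="h")
s = pd.Series(rng.normal(size=len(index)), index=index, name="value")

# Step 1: compute each day's 95th percentile, broadcast to every hour of that
# day, and keep only the values strictly below it.
daily_q95 = s.groupby(s.index.normalize()).transform(lambda x: x.quantile(0.95))
trimmed = s[s < daily_q95]

# Step 2: daily maximum of what remains.
daily_max = trimmed.resample("D").max()

# Step 3: monthly mean of the daily maxima.
monthly_mean = daily_max.resample("MS").mean()
print(monthly_mean)
```

Comparing daily_max with s.resample('D').max() is a quick sanity check that the top values were really removed before taking the maximum.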

Python3 - using pandas to group rows, where two colums contain values in forward or reverse order: v1,v2 or v2,v1

I'm fairly new to python and pandas, but I've written code that reads an excel workbook, and groups rows based on the values contained in two columns.
So where Col_1=A and Col_2=B, or Col_1=B and Col_2=A, both would be assigned a GroupID=1.
sample spreadsheet data, with rows color coded for ease of visibility
I've managed to get this working, but I wanted to know if there's a simpler, more efficient, or cleaner way to do this.
import pandas as pd

df = pd.read_excel('test.xlsx')
# get column values into a list
col_group = df.groupby(['Header_2','Header_3'])
original_list = list(col_group.groups)
# parse list to remove 'reverse-duplicates'
new_list = []
for a,b in original_list:
    if (b,a) not in new_list:
        new_list.append((a,b))
# iterate through each row in the DataFrame
# check to see if values in the new_list[] exist, in forward or reverse
for index, row in df.iterrows():
    for a,b in new_list:
        # if the values exist in forward direction
        if (a in df.loc[index, "Header_2"]) and (b in df.loc[index,"Header_3"]):
            # GroupID value given, where value is index in the new_list[]
            df.loc[index,"GroupID"] = new_list.index((a,b))+1
        # else check if value exists in the reverse direction
        if (b in df.loc[index, "Header_2"]) and (a in df.loc[index,"Header_3"]):
            df.loc[index,"GroupID"] = new_list.index((a,b))+1
# Finally write the DataFrame to a new spreadsheet
# (the with block saves and closes the file on exit)
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
I know of the pandas.groupby([columnA, columnB]) option, but I couldn't figure a way to create groups that contained both (v1, v2) and (v2,v1).
A boolean mask should do the trick:
import pandas as pd
df = pd.read_excel('test.xlsx')
mask = ((df['Header_2'] == 'A') & (df['Header_3'] == 'B') |
        (df['Header_2'] == 'B') & (df['Header_3'] == 'A'))
# Label each row in the original DataFrame with
# 1 if it matches the specified criteria, and
# 0 if it does not.
# This column can now be used in groupby operations.
df.loc[:, 'match_flag'] = mask.astype(int)
# Get rows that match the criteria
df[mask]
# Get rows that do not match the criteria
df[~mask]
EDIT: updated answer to address the groupby requirement.
I would do something like this.
import pandas as pd
df = pd.read_excel('test.xlsx')
#make the ordering consistent
df["group1"] = df[["Header_2","Header_3"]].max(axis=1)
df["group2"] = df[["Header_2","Header_3"]].min(axis=1)
#group them together
df = df.sort_values(by=["group1","group2"])
If you need to deal with more than two columns, I can write up a more general way to do this.
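To get an actual GroupID out of the "make the ordering consistent" idea, one option is to sort each row's pair of values and number the distinct sorted pairs. The sketch below uses a small in-memory frame instead of test.xlsx; the sample values and the helper column names lo and hi are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet data.
df = pd.DataFrame({
    "Header_2": ["A", "B", "C", "D", "A"],
    "Header_3": ["B", "A", "D", "C", "C"],
})

# Sort the two values within each row, so (A, B) and (B, A) become the same pair.
pair = np.sort(df[["Header_2", "Header_3"]].to_numpy(), axis=1)
pairs = pd.DataFrame(pair, columns=["lo", "hi"], index=df.index)

# Number each distinct pair in order of first appearance, starting at 1.
df["GroupID"] = pairs.groupby(["lo", "hi"], sort=False).ngroup() + 1
print(df)
```

Because np.sort works across any number of columns, the same lines should extend to more than two columns by listing them in the selection.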
