Pandas SettingWithCopyWarning only thrown in function

I am using a Jupyter notebook (not sure if this is relevant) and I am having trouble understanding why Pandas is throwing the SettingWithCopyWarning but only under a certain condition.
Here is the first version of my code:
def read_in_yearly_data(file_name):
    print('\t%s' % file_name)
    df = pd.read_csv('../' + file_name + '.csv', header=None)
    # Drop the first 4 columns, and set names of the remaining 4 columns
    for i in range(4):
        del df[i]
    df.columns = ['Part_Name', 'Date', 'Description', 'Units']
    return df
yearly_data = []
for year in ['09', '10', '11']:
    yearly_data.append(read_in_yearly_data('20' + year + ' data'))
all_data = pd.concat(yearly_data, axis=0, join='outer')
parts_to_exclude = misc.PART_NAMES
all_data = all_data.query('Part_Name not in @parts_to_exclude')
all_data['Units'] = all_data['Units'].apply(lambda x: hf.to_int(x))
And the second version (wraps the code in a function):
def read_in_yearly_data(file_name):
    print('\t%s' % file_name)
    df = pd.read_csv('../' + file_name + '.csv', header=None)
    # Drop the first 4 columns, and set names of the remaining 4 columns
    for i in range(4):
        del df[i]
    df.columns = ['Part_Name', 'Date', 'Description', 'Units']
    return df
def myfunc():
    yearly_data = []
    for year in ['09', '10', '11']:
        yearly_data.append(read_in_yearly_data('20' + year + ' data'))
    all_data = pd.concat(yearly_data, axis=0, join='outer')
    parts_to_exclude = misc.PART_NAMES
    all_data = all_data.query('Part_Name not in @parts_to_exclude')
    all_data['Units'] = all_data['Units'].apply(lambda x: hf.to_int(x))
myfunc()
The call hf.to_int(x) takes the 'Units' value and converts it to an integer (some of the units are stored as strings with commas, e.g., '2,000.0').
The first version of the code DOES NOT produce the warning while the second version does.
Even changing the last line to
all_data.loc[:, 'Units'] = all_data['Units'].apply(lambda x: hf.to_int(x))
in the second version does not change anything, and I am struggling to understand why.

The warning should go away if you change
all_data = all_data.query('Part_Name not in @parts_to_exclude')
to:
all_data = all_data.query('Part_Name not in @parts_to_exclude').copy()
I'm still not quite clear why it doesn't warn when the code isn't wrapped in a function.
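A minimal sketch of the pattern (made-up data; whether the warning actually fires depends on pandas' internal copy tracking, so it can vary by version):
import pandas as pd

df = pd.DataFrame({'Part_Name': ['a', 'b', 'c'], 'Units': ['1,000', '2,000', '3']})

# Assigning into a frame produced by query() can trigger the warning,
# because pandas cannot tell whether it is a view of df or a copy
sub = df.query("Part_Name != 'b'")
sub['Units'] = sub['Units'].str.replace(',', '')  # may raise SettingWithCopyWarning

# An explicit copy severs the link to df, so the assignment is unambiguous
sub = df.query("Part_Name != 'b'").copy()
sub['Units'] = sub['Units'].str.replace(',', '')  # no warning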

Related

Get the id of each row in a dataframe

I have written the code below. It computes (x1-x), and 'outlier_reference_x' holds those differences. Now, how can I find out which x1 and x values produced each difference? I want to know which x and x1 pairs have a difference exceeding 3 * sd.
In other words: which rows (iloc) of df_reference and df_test do the values in 'outlier_reference_x' refer to? I then want to delete those rows.
import pandas as pd

df_reference = pd.read_csv(
    "reference.txt",
    delim_whitespace=True,  # any whitespace separates data
    names=["x", "y"],       # column names
    index_col=False         # no index
)
df_test = pd.read_csv(
    "test.txt",
    delim_whitespace=True,  # any whitespace separates data
    names=["x1", "y1"],     # column names
    index_col=False         # no index
)
frames = [df_reference, df_test]
df = pd.concat(frames, axis=1)
df.to_csv('dataset.txt', sep='\t', header=True)
df_ = df[['x', 'x1']].copy()
df_['x1-x'] = df_['x1'] - df_['x']
set_mean_X = df_.loc[:, 'x1-x'].mean()
set_std_X = df_.loc[:, 'x1-x'].std()
outlier_reference_x = [x for x in df_['x1-x'] if (x > 3 * set_std_X)]
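The question goes unanswered above, but since a boolean mask preserves the row labels (unlike the list comprehension, which drops them), one way to locate and drop the offending rows might look like this (a sketch reusing the threshold from the code above):
# A mask keeps the index, so the outlier rows stay identifiable
mask = df_['x1-x'] > 3 * set_std_X
outlier_idx = df_[mask].index

# Which x and x1 produced each outlier
print(df_.loc[outlier_idx, ['x', 'x1']])

# df_reference and df_test share this index (they were concatenated
# column-wise), so the same labels can be dropped from both
df_reference = df_reference.drop(outlier_idx)
df_test = df_test.drop(outlier_idx)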

Concat the columns in Pandas Dataframe with separator

I have a dataframe say df_dt_proc with 35 columns.
I want to add a column to the dataframe, df_dt_proc['procedures'], which should hold all the columns concatenated (except the column at index 0), separated by ', '.
I am able to achieve the result by the following script:
df_dt_proc['procedures'] = np.nan
_len = len(df_dt_proc.columns[1:-1])
for i in range(len(df_dt_proc)):
    res = ''
    for j in range(_len):
        try:
            res += df_dt_proc[j][i] + ', '
        except:
            break
    df_dt_proc['procedures'][i] = res
However, there must be a more pythonic way to achieve this.
Use a custom lambda function that removes NaNs and Nones and converts the values to strings; to select all columns except the first and last, use DataFrame.iloc:
f = lambda x: ', '.join(x.dropna().astype(str))
df_dt_proc['procedures'] = df_dt_proc.iloc[:, 1:-1].agg(f, axis=1)
Try this with agg:
df_dt_proc['procedures'] = df_dt_proc[df_dt_proc.columns[1:-1]].astype(str).agg(', '.join, axis=1)
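For illustration, a tiny made-up frame (the column names are hypothetical) shows what the agg approach produces:
import numpy as np
import pandas as pd

df_dt_proc = pd.DataFrame({
    'id': [1, 2],                    # index 0, excluded by iloc[:, 1:-1]
    'proc_a': ['x', np.nan],
    'proc_b': ['y', 'z'],
    'procedures': [np.nan, np.nan],  # last column, also excluded
})
f = lambda x: ', '.join(x.dropna().astype(str))
df_dt_proc['procedures'] = df_dt_proc.iloc[:, 1:-1].agg(f, axis=1)
print(df_dt_proc['procedures'].tolist())  # ['x, y', 'z']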

df.to_excel capture only last request of iteration pandas

I am currently trying to iterate through a large dataset and use to_excel() to write my dataframe to Excel.
My code:
writer = pd.ExcelWriter(r'report.xlsx')
for x in range(3):
    sql = ("select date_added, fruit_id from market")
    data = pd.read_sql(sql, c)
    df = pd.DataFrame(data)
    df.to_excel(writer)
writer.save()
When this is run, I am only capturing the 3rd request in my range. Is there a different method that would allow me to capture all 3 requests in my range?
There does not appear to be an ExcelWriter.append method. Instead, build a list of the dataframes and pd.concat them at the end.
writer = pd.ExcelWriter(r'report.xlsx')
dfs = []
for x in range(3):
    sql = ("select date_added, fruit_id from market")
    data = pd.read_sql(sql, c)
    df = pd.DataFrame(data)
    dfs.append(df)
df = pd.concat(dfs)
df.to_excel(writer)
writer.save()
Alternatively, pd.DataFrame.to_excel does have a startrow argument that could be used to append.
writer = pd.ExcelWriter(r'report.xlsx')
row = 0
for x in range(3):
    sql = ("select date_added, fruit_id from market")
    data = pd.read_sql(sql, c)
    df = pd.DataFrame(data)
    df.to_excel(writer, startrow=row)
    row += len(df) + 1  # +1 accounts for the header row each to_excel call writes
writer.save()
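If the three results should stay distinguishable rather than stacked, each iteration could also be written to its own sheet (the sheet_name values here are made up):
writer = pd.ExcelWriter(r'report.xlsx')
for x in range(3):
    sql = ("select date_added, fruit_id from market")
    data = pd.read_sql(sql, c)
    # one sheet per query instead of overwriting the default sheet
    pd.DataFrame(data).to_excel(writer, sheet_name='run_%d' % x)
writer.save()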

How do I add a dynamic list of variables to the command pd.concat

I am using python3 and pandas to create a script that will:
Be dynamic across different dataset lengths (rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is that I have been unable to find a way to build a list of the dataframes in use and pass them as arguments to pd.concat.
The sample dataset may have more or fewer unique BrandFlavors, which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
colarr = df.columns.values
arr = df[colarr[0]].unique()
for i in range(len(arr)):
    globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]
for i in range(len(arr)):
    if globals()['var%s' % i].empty:
        ''
    else:
        globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor': 'Total',
            'This': globals()['var%s' % i]['This'].sum(),
            'Last': globals()['var%s' % i]['Last'].sum(),
            'Diff': globals()['var%s' % i]['Diff'].sum(),
            '% Chg': globals()['var%s' % i]['Diff'].sum() / globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
        globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
        globals()['var%s' % i].fillna(' ', inplace=True)
I have tried the following, however the result is only a list of strings:
vararr = []
count = 0
for x in range(len(arr)):
    vararr.append('var' + str(count))
    count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tried to build a class with an arg defined but had the same issue.
The desired outcome would be a code snippet that generates a list of the dataframes created by the globals() loop above, so they could be passed as pd.concat([list, of, vars, here]). It must be dynamic. Thank you
Just to fix the issue at hand: you shouldn't use globals() to create variables; that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')

excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)

def good_dfs(dataframe):
    if not dataframe.empty:
        this = dataframe.This.sum()
        last = dataframe.Last.sum()
        diff = dataframe.Diff.sum()
        data = {
            'BrandFlavor': 'Total',
            'This': this,
            'Last': last,
            'Diff': diff,
            'Pct Change': diff / last * 100
        }
        # append returns a new frame, so reassign it
        dataframe = dataframe.append(data, ignore_index=True)
        dataframe['Pct Change'].fillna(0.0, inplace=True)
        dataframe.fillna(' ', inplace=True)
    return dataframe

colarr = df.columns.values
arr = df[colarr[0]].unique()
dfs = []
for i in range(len(arr)):
    temp = df.loc[df[colarr[0]] == arr[i]]
    dfs.append(temp)
final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
Although I will say, there are far easier ways to accomplish what you want without doing all of this; that can be a separate question.
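One such easier way, sketched here under the assumption that the columns really are BrandFlavor, This, Last, and Diff, is to let groupby produce the per-brand frames instead of slicing on unique values:
import pandas as pd

df = pd.read_excel('testfile.xlsx').sort_values(by='This', ascending=False)

pieces = []
for brand, group in df.groupby('BrandFlavor', sort=False):
    total = group[['This', 'Last', 'Diff']].sum()
    total['BrandFlavor'] = 'Total'
    total['Pct Change'] = total['Diff'] / total['Last'] * 100
    # append the Total row under each brand's block
    pieces.append(group.append(total, ignore_index=True))

final_df = pd.concat(pieces, ignore_index=True)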

Import and parse .data file

There is a file I tried to import and save as a pandas df. At first sight it looks like it is already ordered into columns and rows, but in the end I had to do a bunch of work to create the pandas df. Could you please check whether there is a much faster way to manage it?
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
My way of doing it is:
import requests
import pandas as pd

r = requests.get(url)
file = r.text
step_1 = file.split('\n')
for n in range(len(step_1)):  # remove empty strings
    if bool(step_1[n]) == False:
        del(step_1[n])
step_2 = [i.split('\t') for i in step_1]
cars_names = [i[1] for i in step_2]
step_3 = [i[0].split(' ') for i in step_2]
for e in range(len(step_3)):  # remove empty strings in each sublist
    step_3[e] = [item for item in step_3[e] if item != '']
mpg = [i[0] for i in step_3]
cylinders = [i[1] for i in step_3]
disp = [i[2] for i in step_3]
horsepower = [i[3] for i in step_3]
weight = [i[4] for i in step_3]
acce = [i[5] for i in step_3]
year = [i[6] for i in step_3]
origin = [i[7] for i in step_3]
list_cols = [cars_names, mpg, cylinders, disp, horsepower, weight, acce, year, origin]
# list_labels written manually:
list_labels = ['car name', 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin']
zipped = list(zip(list_labels, list_cols))
data = dict(zipped)
df = pd.DataFrame(data)
Once you have replaced \t with a space, you can use read_csv to read it. But you need to wrap your text in a buffer, because the first parameter of read_csv is filepath_or_buffer, which expects an object with a read() method (such as a file handle or StringIO). Your question then reduces to: read_csv doesn't read the column names correctly on this file?
import requests
import pandas as pd
from io import StringIO

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
r = requests.get(url)
file = r.text.replace("\t", " ")
# list_labels written manually:
list_labels = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
df = pd.read_csv(StringIO(file), sep="\s+", header=None, names=list_labels)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
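Since pd.read_csv accepts a URL directly and sep="\s+" already treats tabs as whitespace, the requests and replace steps may not be needed at all; assuming the quoted car names parse as single fields (they should, because a regex separator falls back to the quote-aware Python engine), this reduces to:
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
list_labels = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
               'acceleration', 'model year', 'origin', 'car name']
# read_csv fetches the URL itself; \s+ matches both the spaces and the tab
df = pd.read_csv(url, sep=r'\s+', header=None, names=list_labels)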
