pandas SettingWithCopyWarning only inside function - python-3.x

With a dataframe like
import pandas as pd
df = pd.DataFrame(
    ["2017-01-01 04:45:00", "2017-01-01 04:45:00removeMe"], columns=["col"]
)
why do I get a SettingWithCopyWarning here
def test_fun(df):
    df = df[~df["col"].str.endswith("removeMe")]
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df

df = test_fun(df)
but not if I run it without the function?
df = df[~df["col"].str.endswith("removeMe")]
df.loc[:, "col"] = pd.to_datetime(df["col"])
And what should my function look like?

In the function, the caller's df still exists while you work on it: the boolean indexing produces a new frame that pandas still links back to that parent, and when you then assign into it with .loc, pandas can't tell whether you also meant to modify the original, so it raises the warning. Without the function, `df = df[...]` rebinds the only reference to the original frame, so the link to a parent dies with it and the subsequent assignment is unambiguous.
I would write it like this instead, either way:
def test_fun(df):
    df["col"] = pd.to_datetime(df["col"], errors='coerce')
    return df[~pd.isna(df["col"])]
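If you want to keep the original filter-then-convert shape, the usual fix is an explicit .copy() after the boolean indexing; a minimal sketch:
def test_fun(df):
    # .copy() detaches the filtered frame from its parent, so the .loc
    # assignment below can no longer be mistaken for a write into the original
    df = df[~df["col"].str.endswith("removeMe")].copy()
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df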

Found the trick:
def test_fun(df):
    df.loc[:] = df[~df["col"].str.endswith("removeMe")]  # <------- I added the `.loc[:]`
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df
Don't do df = ... in the function.
Instead do df.loc[:] = ... !

Related

How to alter a dataframe in the cycle inside a function

I'm trying to make a function that takes the column list cols and performs get_dummies for each.
The thing is that the loop works fine on its own, but if I wrap it in a function, the dataframe remains unchanged.
The function:
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
The usage:
cols = ['column1', 'column2']
cols_to_dummies(df, cols)
Concatenating the two frames rebinds the df variable inside the function, so it loses the reference to the outside df. A simple way to fix your code is to return the frame:
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
    return df
df = cols_to_dummies(df, ['num', 'user'])
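As an aside, pandas can do the concat-and-drop in a single call; a minimal equivalent sketch:
import pandas as pd

# columns= tells get_dummies to encode just these columns and drop the originals
df = pd.get_dummies(df, columns=['num', 'user'])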

looping through list of pandas dataframes and make it empty dataframe

I have multiple pandas dataframes. I want to empty each of them, like below:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
Instead of doing it individually, is there any way to do it in one line of code?
If I understood correctly, this will work:
df_list = []
for i in range(0, 10):
    df = pd.DataFrame()
    df_list.append(df)
print(df_list[0].head())
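And if you really want it on one line, a comprehension sketch that builds the same list:
# one-line equivalent of the loop above
df_list = [pd.DataFrame() for _ in range(10)]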

Filter dataframe based on groupby sum()

I want to filter my dataframe based on a groupby sum(). I am looking for rows where the amounts for a specific date sum to zero.
I have solved this by creating a for loop. I suspect this will hurt performance if the dataframe is large.
It also seems clunky.
newdf = pd.DataFrame()
newdf['name'] = ('leon','eurika','monica','wian')
newdf['surname'] = ('swart','swart','swart','swart')
newdf['birthdate'] = ('14051981','198001','20081012','20100621')
newdf['tdate'] = ('13/05/2015','14/05/2015','15/05/2015', '13/05/2015')
newdf['tamount'] = (100.10, 111.11, 123.45, -100.10)
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
df2 = df.loc[df["tamount"] == 0, "tdate"]
df3 = pd.DataFrame()
for i in df2:
    # note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df3 = df3.append(newdf.loc[newdf["tdate"] == i])
print(df3)
The code above outputs the two rows that net to zero on tamount when combined:
name surname birthdate tdate tamount
0 leon swart 1981-05-14 13/05/2015 100.1
3 wian swart 2010-06-21 13/05/2015 -100.1
Just use basic numpy :)
import numpy as np

df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
dates = df['tdate'][np.where(df['tamount'] == 0)[0]]
newdf[np.isin(newdf['tdate'], dates)]
Hope this helps; let me know if you have any questions.
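A pure-pandas alternative sketch, using groupby().transform so no intermediate frames or loops are needed:
# keep the rows whose per-date total is zero
df3 = newdf[newdf.groupby('tdate')['tamount'].transform('sum') == 0]
print(df3)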

PySpark: Replace Punctuations with Space Looping Through Columns

I have the following code running successfully in PySpark:
import pyspark.sql.functions as F

def pd(data):
    df = data
    df = df.select('oproblem')
    text_col = ['oproblem']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
But when I add a second column in and try to loop it, it doesn't work:
def pd(data):
    df = data
    df = df.select('oproblem', 'lca')
    text_col = ['oproblem', 'lca']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
Below is the error I get:
TypeError: 'Column' object is not callable
I think it should be df = df.select(['oproblem', 'lca']) instead of df = df.select('oproblem', 'lca').
Better yet, for code quality purposes, have the select statement use the text_col variable, so you only have to change one line of code if you need to do this with more columns or if your column names change. E.g.,
def pd(data):
    df = data
    text_col = ['oproblem', 'lca']
    df = df.select(text_col)
    ....
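A minimal sketch of the full function under that change, keeping the question's regex and loop:
import pyspark.sql.functions as F

def pd(data):
    text_col = ['oproblem', 'lca']
    df = data.select(text_col)
    for i in text_col:
        # lowercase, then replace the listed punctuation with a space
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df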

PySpark: Search For substrings in text and subset dataframe

I am brand new to PySpark and want to translate my existing pandas / Python code to PySpark.
I want to subset my dataframe so that only rows whose 'original_problem' field contains the specific keywords I'm looking for are returned.
Below is the Python code I tried in PySpark:
def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df
When I try to run the above, I get the following error:
AnalysisException: u"Can't extract value from original_problem#207:
need struct type but got string;"
In pyspark, try this:
df = df[df['original_problem'].rlike('|'.join(searchfor))]
Or equivalently:
import pyspark.sql.functions as F
df.where(F.col('original_problem').rlike('|'.join(searchfor)))
Alternatively, you could go for a udf (note the test has to check substrings, not the whole value, to match the search above):
import pyspark.sql.functions as F
searchfor = ['cat', 'dog', 'frog', 'fleece']
# flag the row if any keyword occurs as a substring of the value
check_udf = F.udf(lambda x: x if any(s in x for s in searchfor) else 'Not_present')
df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
But the DataFrame methods are preferred because they will be faster.
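Putting the preferred rlike approach back into the question's function, a minimal sketch:
import pyspark.sql.functions as F

def pilot_discrep(input_file):
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    # rlike treats the joined keywords as a regex alternation
    return input_file.where(F.col('original_problem').rlike('|'.join(searchfor)))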
