Can we use keyword arguments in UDF - apache-spark

My question is: can we use keyword arguments with a UDF in PySpark, as I did below? The conv method has a keyword argument conv_type which by default is assigned to a specific formatter, but in some places I want to specify a different format. This does not get through to the udf because of the keyword argument. Is there a different approach to using a keyword argument here?
from datetime import datetime as dt, timedelta as td, date

tpid_date_dict = {'69': '%d/%m/%Y', '62': '%Y/%m/%d', '70201': '%m/%d/%y', '66': '%d.%m.%Y', '11': '%d-%m-%Y', '65': '%Y-%m-%d'}

def date_formatter_based_on_id(column, date_format):
    val = dt.strptime(str(column), '%Y-%m-%d').strftime(date_format)
    return val

def generic_date_formatter(column, date_format):
    val = dt.strptime(str(column), date_format).strftime('%Y-%m-%d')
    return val

def conv(column, id, conv_type=date_formatter_based_on_id):
    try:
        date_format = tpid_date_dict[id]
    except KeyError as e:
        print("Key value not found!")
    val = None
    if column:
        try:
            val = conv_type(column, date_format)
        except Exception as err:
            val = column
    return val

conv_func = functions.udf(conv, StringType())
date_formatted = renamed_cols.withColumn("check_in_std",
    conv_func(functions.col("check_in"), functions.col("id"),
              generic_date_formatter))
So the problem is with the last statement:

date_formatted = renamed_cols.withColumn("check_in_std",
    conv_func(functions.col("check_in"), functions.col("id"),
              generic_date_formatter))

since the third argument, generic_date_formatter, is meant for the keyword argument.
On trying this I get the following error:
AttributeError: 'function' object has no attribute '_get_object_id'

Unfortunately you cannot use udf with keyword arguments. UserDefinedFunction.__call__ is defined with positional arguments only:
def __call__(self, *cols):
    judf = self._judf
    sc = SparkContext._active_spark_context
    return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
but the problem you have is not really related to keyword arguments. You get the exception because generic_date_formatter is not a Column object but a function.
You can create the udf dynamically:
def conv(conv_type=date_formatter_based_on_id):
    def _(column, id):
        try:
            date_format = tpid_date_dict[id]
        except KeyError as e:
            print("Key value not found!")
        val = None
        if column:
            try:
                val = conv_type(column, date_format)
            except Exception as err:
                val = column
        return val
    return udf(_, StringType())
which can be called:

conv(generic_date_formatter)(functions.col("check_in"), functions.col("id"))
Check Passing a data frame column and external list to udf under withColumn for details.
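The factory approach can be exercised without a Spark session. This is a minimal sketch in plain Python (with the udf wrapping noted in a comment, and a trimmed-down format map), showing how the closure fixes the formatter before any columns are involved:

```python
from datetime import datetime as dt

# Trimmed-down format map from the question (subset for illustration).
tpid_date_dict = {'69': '%d/%m/%Y', '65': '%Y-%m-%d'}

def date_formatter_based_on_id(column, date_format):
    return dt.strptime(str(column), '%Y-%m-%d').strftime(date_format)

def conv(conv_type=date_formatter_based_on_id):
    # Factory: the formatter is fixed when conv() is called, and the
    # returned function takes only Column-compatible arguments.
    def _(column, id):
        try:
            date_format = tpid_date_dict[id]
        except KeyError:
            return None
        if not column:
            return None
        try:
            return conv_type(column, date_format)
        except Exception:
            return column
    return _  # in Spark this would be: return udf(_, StringType())

print(conv()('2020-01-31', '69'))  # -> 31/01/2020
```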

Related

Python: Subclassing a dict to have two keys and a defaultvalue

Following the two very readable tutorials 1 and 2, I would like to create a dictionary with two keys that gives a default value in case the key pair does not exist.
I managed to fulfill the first condition with
from collections import defaultdict

class DictX(dict):
    def __getattr__(self, key1=None, key2=None):
        try:
            return self[(key1, key2)]
        # This is an idea of how to implement the defaultdict, but it does not seem to work:
        # except KeyError as k:
        #     self[(key1, key2)] = 0.
        #     return self[(key1, key2)]
        ## or just return 0
        except KeyError as k:
            raise AttributeError(k)

    def __setattr__(self, key1, key2, value):
        self[(key1, key2)] = value

    def __delattr__(self, key):
        try:
            del self[key]
        except KeyError as k:
            raise AttributeError(k)

    def __repr__(self):
        return '<DictX ' + dict.__repr__(self) + '>'

sampledict = DictX()
sampledict[3, 5] = 5
sampledict[1, 4] = 4
print("Checking the dict ", sampledict[1, 4])
# This line is going to throw an error
print("Checking the default dict ", sampledict[3, 6])
How do I code the defaultvalue behaviour?
Pro-Question:
If I just give one value sampledict[1,] or sampledict[1,:], I would like to get a list of all key - value pairs that start with 1. Is that possible?
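A sketch of one way to get the asked-for default-value behaviour, using dict.__missing__ instead of __getattr__ (note that the startswith helper and the 0.0 default are made-up names for illustration, not from the question):

```python
class DictX(dict):
    """Dict keyed by (key1, key2) tuples with a default for missing pairs."""
    def __init__(self, default=0.0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default = default

    def __missing__(self, key):
        # dict.__getitem__ calls __missing__ when the key is absent,
        # so unknown pairs yield the default instead of a KeyError.
        return self.default

    def startswith(self, key1):
        # All (key, value) pairs whose first key component equals key1.
        return [(k, v) for k, v in self.items() if k[0] == key1]

sampledict = DictX()
sampledict[3, 5] = 5
sampledict[1, 4] = 4
print(sampledict[3, 6])          # -> 0.0 (default instead of KeyError)
print(sampledict.startswith(1))  # -> [((1, 4), 4)]
```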

How to properly write multiple functions depending on one another in a python class

I am currently learning Python classes and trying to wrap my head around how to use them efficiently. I am creating a class that initializes a path to a CSV file and has two functions that do the following:
load_data loads the CSV and returns a df.
processTimeColumn takes the loaded df, infers which column could be turned into a datetime, converts it, and sets it as the index.
My challenge is how to use the return value of the load_data function inside the processTimeColumn function. When I try to run it as below:
data= prep_data("./datasets/AirPassengers.csv")
data= data.load_data()
df= data.processTimeColumn()
I get this error
AttributeError: 'DataFrame' object has no attribute 'processTimeColumn'
Below are my code snippet and some of my trials and errors. I'd appreciate some guidance.
class prep_data():
    def __init__(self, path):
        assert path.split(".")[-1] == "csv", "Your file should be in CSV format"
        self.path = path

    def load_data(self):
        """load_data loads target data and returns a df

        Args:
            path (str): path of the file (CSV)
        """
        self.df = pd.read_csv(self.path)
        return self.df

    def processTimeColumn(self, column_index=None):
        """processTimeColumn takes the date column in string format, turns it into a datetime format and sets it as the index.

        If no column_index argument is passed, it tries to infer which column can be transformed into a datetime format.

        Args:
            column_index (int, optional): column index of a user-identified datetime column. Defaults to None.
        """
        if column_index == None:
            # loop through the columns, try to convert each to datetime
            # and set the datetime column as index
            for column in self.df.columns:
                if self.df[column].dtype == "object":
                    try:
                        self.df[column] = pd.to_datetime(self.df[column])
                        self.df.set_index(column, inplace=True)
                    except:
                        pass
                else:
                    pass
        else:
            self.df.iloc[:, column_index] = pd.to_datetime(self.df.iloc[:, column_index], format="%Y-%m")
            self.df.index = self.df.iloc[:, column_index]
            self.df.drop(columns=self.df.columns[column_index], axis=1, inplace=True)
        return self.df
I have tried to create a self.df = None attribute in __init__ so that I can refer to it as an instance variable:
def __init__(self, path):
    assert path.split(".")[-1] == "csv", "Your file should be in CSV format"
    self.path = path
    self.df = None
I have tried to use df instead of self.df as below, but it still gives the same error:
def load_data(self):
    """load_data loads target data and returns a df

    Args:
        path (str): path of the file (CSV)
    """
    df = pd.read_csv(self.path)
    return df

def processTimeColumn(self, column_index=None):
    """processTimeColumn takes the date column in string format, turns it into a datetime format and sets it as the index.

    If no column_index argument is passed, it tries to infer which column can be transformed into a datetime format.

    Args:
        column_index (int, optional): column index of a user-identified datetime column. Defaults to None.
    """
    if column_index == None:
        # loop through the columns, try to convert each to datetime
        # and set the datetime column as index
        for column in df.columns:
            if df[column].dtype == "object":
                try:
                    df[column] = pd.to_datetime(df[column])
                    df.set_index(column, inplace=True)
                except:
                    pass
            else:
                pass
    else:
        df.iloc[:, column_index] = pd.to_datetime(df.iloc[:, column_index], format="%Y-%m")
        df.index = df.iloc[:, column_index]
        df.drop(columns=df.columns[column_index], axis=1, inplace=True)
    return df
I would appreciate any guidance on classes or ways to improve my coding.
Thank you.
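The error happens because data = data.load_data() rebinds data to the returned DataFrame, so the next call looks for processTimeColumn on a DataFrame. One possible fix is to have load_data return self so that calls stay on the instance. A minimal sketch, with stand-in strings instead of pandas objects (PrepData and process_time_column are illustrative names):

```python
class PrepData:
    """Stand-in for the question's class; strings replace pandas objects."""
    def __init__(self, path):
        assert path.split(".")[-1] == "csv", "Your file should be in CSV format"
        self.path = path
        self.df = None

    def load_data(self):
        self.df = "loaded:" + self.path  # stand-in for pd.read_csv(self.path)
        return self  # return self, not self.df, so chaining stays on the class

    def process_time_column(self):
        self.df = "processed:" + self.df  # stand-in for the datetime conversion
        return self.df

df = PrepData("./datasets/AirPassengers.csv").load_data().process_time_column()
print(df)  # -> processed:loaded:./datasets/AirPassengers.csv
```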

multiple nested functions output

I'm trying to apply multiple functions, written as nested functions, to a dataframe.
For example, two functions:
def carr(df):
    df['carr'] = df[['end_value_carr','arr']].max(axis=1)
    return df

def date(df):
    df['date_id'] = pd.to_datetime(df['date_id']).dt.date
    df['renewal_date'] = pd.to_datetime(df['renewal_date']).dt.date
    df['next_renewal_date'] = pd.to_datetime(df['next_renewal_date']).dt.date
    return df
When I use each one separately I get the right output
However, trying to have them nested in one function gives me a NoneType:
def cleanup(data):
    df = data.copy()
    def carr(df):
        df['carr'] = df[['end_value_carr','arr']].max(axis=1)
        return df
    def date(df):
        df['date_id'] = pd.to_datetime(df['date_id']).dt.date
        df['renewal_date'] = pd.to_datetime(df['renewal_date']).dt.date
        df['next_renewal_date'] = pd.to_datetime(df['next_renewal_date']).dt.date
        return df
    return df
Appreciate your help!
Thanks
Define all three functions separately
def carr(df):
    df['carr'] = df[['end_value_carr','arr']].max(axis=1)
    return df

def date(df):
    df['date_id'] = pd.to_datetime(df['date_id']).dt.date
    df['renewal_date'] = pd.to_datetime(df['renewal_date']).dt.date
    df['next_renewal_date'] = pd.to_datetime(df['next_renewal_date']).dt.date
    return df
Call the first two functions in your third one.
def cleanup(data):
    df = data.copy()
    df = carr(df)
    df = date(df)
    return df
Then you can call your cleanup function, which will call carr and date on its own.
df = cleanup(df)
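The same call-each-piece structure can be tried without pandas. In this sketch plain dicts stand in for the DataFrame, and a string slice stands in for pd.to_datetime(...).dt.date:

```python
def carr(d):
    d['carr'] = max(d['end_value_carr'], d['arr'])
    return d

def date(d):
    d['date_id'] = d['date_id'][:10]  # stand-in for pd.to_datetime(...).dt.date
    return d

def cleanup(data):
    d = dict(data)  # stand-in for data.copy()
    d = carr(d)     # the broken version only *defined* these functions;
    d = date(d)     # they have to be *called* for their effect to land in d
    return d

row = {'end_value_carr': 3, 'arr': 7, 'date_id': '2021-05-01T00:00:00'}
print(cleanup(row))  # -> {'end_value_carr': 3, 'arr': 7, 'date_id': '2021-05-01', 'carr': 7}
```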

Tkinter search query in database: TypeError: 'NoneType' object is not iterable

I'm unable to solve a problem with a search query in the database (sqlite3) in Tkinter. Parts of my code:
front.py
# Entries
self.name_text = tk.StringVar()
self.entry_name = tk.Entry(self.parent, textvariable=self.name_text)
self.entry_name.grid(row=3, column=1)

self.color_text = tk.StringVar()
self.combobox2 = ttk.Combobox(self.parent, textvariable=self.color_text)
self.combobox2["values"] = ('red', 'blue', 'white')
self.labelCombobox = ttk.Label(self.parent, textvariable=self.color_text)
self.combobox2.grid(row=4, column=1)
self.parent.bind('<Return>', lambda e: refresh())

def search_command(self):
    self.listBox.delete(0, tk.END)
    for row in backend.database.search(self.name_text.get(), self.color_text.get()):
        self.listBox.insert(tk.END, row)
backend.py

class database:
    def search(name="", color=""):
        try:
            connect = sqlite3.connect("color.db")
            cur = connect.cursor()
            sql = "SELECT * FROM color WHERE name=? OR color=?"
            values = (self, name_text.get(), color_text.get())
            cur.execute(sql, values)
            rows = cur.fetchall()
            name_text.set(rows[1])
            color_text.set(rows[2])
            entry_name.configure('disabled')
            combobox2.configure('disabled')
            connect.close()
        except:
            messagebox.showinfo('nothing found!')
I also tried to add a self in another version of backend.py. This gives the same error.
def search(self, name="", color=""):
    try:
        self.connect = sqlite3.connect("color.db")
        self.cur = self.connect.cursor()
        self.sql = "SELECT * FROM color WHERE name=? OR color=?"
        self.values = (self, name_text.get(), color_text.get())
        self.cur.execute(sql, values)
        self.rows = self.cur.fetchall()
        self.name_text.set(rows[1])
        self.color_text.set(rows[2])
        self.entry_name.configure('disabled')
        self.combobox2.configure('disabled')
        self.connect.close()
    except:
        messagebox.showinfo('nothing!')
Please help solve the error:
for row in backend.database.search(self.name_text.get(),self.color_text.get()):
TypeError: 'NoneType' object is not iterable
There are a few issues in the backend.database.search() function:
name_text and color_text are undefined
passed arguments name and color should be used in values instead
it does not return any result (this is the cause of the error)
Below is a modified search() function:
def search(name="", color=""):
    rows = ()  # assume no result in case of exception
    try:
        connect = sqlite3.connect("color.db")
        cur = connect.cursor()
        sql = "SELECT * FROM color WHERE name=? OR color=?"
        values = (name, color)  # use arguments name and color instead
        cur.execute(sql, values)
        rows = cur.fetchall()
        connect.close()
    except Exception as e:
        print(e)  # better to see what is wrong
        messagebox.showinfo('nothing found!')
    return rows  # return result
The error TypeError: 'NoneType' object is not iterable means that search returned None instead of a sequence of rows.
That is at least partly because of this code:
sql = "SELECT * FROM color WHERE name=? OR color=?"
values = (self, name_text.get(), color_text.get())
cur.execute(sql, values)
Because values has three elements but the SQL has only two placeholders, sqlite3 raises an error about the number of bindings, which the bare except then swallows. You need to remove self: your SQL uses two parameters, so you need to send it exactly two values.
The other problem appears to be that you're iterating over the results of search, but search doesn't return anything. You need to add a return statement to the search function.
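The root cause is easy to reproduce in isolation; a function with no return statement returns None, which cannot be iterated:

```python
def search_without_return(name="", color=""):
    rows = [("red",), ("blue",)]  # result is computed...
    # ...but never returned, so the caller gets None

def search_with_return(name="", color=""):
    rows = [("red",), ("blue",)]
    return rows

print(search_without_return())    # -> None
for row in search_with_return():  # iterating works once rows is returned
    print(row)
```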

How to pass a Pandas dataframe as argument to function through apply

I have a custom function as below to do something.
def f(x):
    x['A'] = '123'
    return x

df = df.groupby(level=0).apply(f)
Now, I would like to change the function as
def f(x):
    x['A'] = '123'
    df2['name'] = 'ABC'
    return x
How to pass the dataframe df2 as an argument to apply?
Does it work? df = df.groupby(level=0).apply(f, args=df2)
No, df = df.groupby(level=0).apply(f, args=df2) gives the error TypeError: f() got an unexpected keyword argument 'args'.
The correct solution is to remove args and pass the dataframe as a plain positional argument, which solves the error:
df = df.groupby(level=0).apply(f, df2)
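A minimal runnable sketch of that fix, with made-up frames standing in for the question's df and df2 (here f reads from the second frame rather than mutating it):

```python
import pandas as pd

# Hypothetical frames standing in for the question's df and df2.
df = pd.DataFrame({'A': ['a', 'b']})
df2 = pd.DataFrame({'name': ['ABC']})

def f(x, other):
    # Extra positional arguments given to apply are forwarded to f.
    x = x.copy()
    x['A'] = '123'
    x['name'] = other['name'].iloc[0]
    return x

out = df.groupby(level=0).apply(f, df2)
print(out)
```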
