UDF reload in PySpark - apache-spark

I am using PySpark (inside a Jupyter Notebook which connects to a Spark cluster) together with some UDFs. One UDF takes a list as an additional parameter, and I construct it like this:
my_udf = F.udf(partial(my_normal_fn, list_param=list), StringType())
Everything works fine as far as executing the function goes, but I noticed that the UDF is never updated.
To clarify: when I update the list, for example by altering an element, the UDF is not updated. The old version with the old list is still used, even if I execute the whole notebook again.
I have to restart the Jupyter kernel in order to use the new version of the list, which is really annoying...
Any thoughts?

I found the solution.
My my_normal_fn had the following signature:
def my_normal_fn(x, list_param=[]):
    dosomestuffwith_x_and_list_param
Changing it to
def my_normal_fn(x, list_param):
    dosomestuffwith_x_and_list_param
did the trick. See here for more information.
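For illustration, here is a minimal sketch of the corrected pattern (the names follow the question; my_list stands in for the list that gets updated between runs):
from functools import partial
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def my_normal_fn(x, list_param):
    # no mutable default argument; the list is always passed in explicitly
    return str(x) + ":" + str(len(list_param))

my_list = ["a", "b", "c"]
# re-create the UDF after every change to my_list so the current value is captured
my_udf = F.udf(partial(my_normal_fn, list_param=my_list), StringType())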
Thanks to user Drjones78 of the SparkML-Slack channel.

Related

Is there any way I can fix the error "string indices must be integers"?

I tried to make graphs for my CSV dataset in Jupyter Notebook, using this code:
bank['marital'].value_counts().plot(kind='pie',autopct='%.2f')
plt.show()
However, the system returns "string indices must be integers".
I have tried many different approaches, like changing the string to a number, but nothing really worked.
I tried to reproduce it and it worked fine, so it's not something wrong with the code itself.
I suggest experimenting with the following:
restart Jupyter Notebook
play with a tiny synthetic dataset
cut the real dataset down until it works
attach the failing dataset's contents to the question
Attaching my results:
[input.csv]
name,smth
Maria,12
Anton,2
Maria,3
...
import pandas as pd

df = pd.read_csv('input.csv')
df['name'].value_counts().plot(kind='pie', autopct='%.2f')

Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()

Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK there is unfortunately no way to do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions.
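Roughly, the per-partition write I have in mind looks like this (just a sketch; pymysql, the connection details, and the table/column names are placeholders, and df stands for the DataFrame already read from MySQL):
import pymysql

def upsert_partition(rows):
    # open one connection per partition rather than per row
    conn = pymysql.connect(host="db-host", user="user", password="pw", database="mydb")
    with conn.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO target (id, val) VALUES (%s, %s) "
                "ON DUPLICATE KEY UPDATE val = VALUES(val)",
                (row["id"], row["val"]),
            )
    conn.commit()
    conn.close()

df.foreachPartition(upsert_partition)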
For now I'm writing a script and testing with the local gluepyspark shell, Spark version 3.1.1-amzn-0.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

def f(p):
    pass

sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))
When I try to import this simple script in the gluepyspark shell, it raises an error saying "SparkContext should only be created and accessed on the driver."
However, there are some conditions under which it works:
It works if I run the script via gluesparksubmit.
It works if I use a lambda expression instead of a function declaration (see the sketch after this list).
It works if I declare the function within the REPL and pass it as an argument.
It does not work if I put both the def func(): () declaration and the .foreachPartition(func) call in the same script.
Moving the function declaration to another module also seems to work, but that isn't an option because I need to pack everything into one job script.
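For reference, the lambda variant that works within the same script would look something like this (a sketch based on the description above, with the no-op body inlined instead of referencing a module-level function):
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: None)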
Could anyone please help me understand:
why the error is thrown
why the error is NOT thrown in other cases
Complete error log: https://justpaste.it/37tj6

Invalid date: Error while importing CSV to Cassandra using pySpark

I'm using a Jupyter Notebook to run pySpark code that imports a CSV file into Cassandra v3.11.3, and I'm getting the error below.
The stack trace (ending in "... 1 more") and the pySpark code were attached as screenshots rather than as text.
Any inputs?
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method; we would really need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "mode" -> "DROPMALFORMED" is not a valid C* connector option. DataFrameWriter and DataFrameReader options are source specific, so you are unfortunately unable to mix and match them.
This makes me think that the data being written actually has a malformed date string or two, and this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
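For example, a minimal sketch of that approach (the schema, column names, date format, file path, and keyspace/table names are all assumptions, and it presumes the spark-cassandra-connector package is on the classpath):
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema = StructType([
    StructField("id", StringType(), False),
    StructField("event_date", DateType(), True),
])

# parse dates while reading the CSV and drop rows that fail to parse
df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .option("dateFormat", "yyyy-MM-dd")
      .schema(schema)
      .csv("input.csv"))

# write the cleaned rows to Cassandra
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(table="my_table", keyspace="my_keyspace")
   .mode("append")
   .save())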

Name error when calling defined function in Jupyter

I am following a tutorial over at https://blog.patricktriest.com/analyzing-cryptocurrencies-python/ and I've got a bit stuck. I am trying to define, then immediately call, a function.
My code is as follows:
def merge_dfs_on_column(dataframes, labels, col):
    '''merge a single column of each dataframe on to a new combined dataframe'''
    series_dict = {}
    for index in range(len(dataframes)):
        series_dict[labels[index]] = dataframes[index][col]
    return pd.DataFrame(series_dict)

# Merge the BTC price dataseries into a single dataframe
btc_usd_datasets = merge_dfs_on_column(list(exchange_data.values()), list(exchange_data.keys()), 'Weighted Price')
I can clearly see that I have defined the merge_dfs_on_column function, and I think the syntax is correct; however, when I call the function on the last line, I get the following error:
NameError Traceback (most recent call last)
<ipython-input-22-a113142205e3> in <module>()
1 # Merge the BTC price dataseries into a single dataframe
----> 2 btc_usd_datasets= merge_dfs_on_column(list(exchange_data.values()),list(exchange_data.keys()),'Weighted Price')
NameError: name 'merge_dfs_on_column' is not defined
I have Googled for answers and carefully checked the syntax, but I can't see why that function isn't recognised when called.
Your function definition isn't getting executed by the Python interpreter before you call the function.
Double-check what is getting executed and when. In Jupyter it's possible to run cells out of input order, which seems to be what you are accidentally doing (perhaps try 'Run All').
If you typed the function definition yourself, check whether you actually copied and pasted it directly from somewhere on the web; it might contain characters that you can't see.
Define the function by typing it out by hand, give it a pass body, comment out the other code, and see whether it works.
"run all" does not work.
Shutting down the kernel and restarting does not help either.
If I write:
def whatever(a):
return a*2
whatever("hallo")
in the next cell, this works.
I have also experienced this kind of problem frequently in Jupyter Notebook.
After replacing %% with %%time the error was resolved, though I didn't know why.
After some browsing I found that this is not a Jupyter Notebook issue but an IPython issue; here is the issue, and this problem is also answered in this Stack Overflow question.

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to run Spark SQL queries on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
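Equivalently, the column can be referenced with pyspark.sql.functions.col, which also avoids attribute-style access (a small sketch using the column name from the question):
from pyspark.sql import functions as F

records.filter(F.col('field_i') == 3)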
What I did was upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering form now works, although I still can't explain why 1.3.0 has problems with it.
