SparkContext conflict with spark udf - apache-spark

Good morning
When running:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class ETL:
    def addone(x):
        return x + 1

    def job_run():
        df = spark.sql('SELECT 1 one').withColumn('AddOne', udf_addone(F.col('one')))
        df.show()

if (__name__ == '__main__'):
    udf_addone = F.udf(lambda x: ETL.addone(x), returnType=IntegerType())
    ETL.job_run()
I get the following error message:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I have reviewed the answers given at ERROR: SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063 and at Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation, with no success. I'd like to stick to using a Spark UDF in my script.
Any help on this is appreciated.
Many thanks!
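No answer is included in this excerpt, but here is a minimal sketch of one possible restructuring, under the assumption that the serialized UDF closure is dragging driver-side objects (the SparkSession/SparkContext) along with it: create the session under the main guard and build the UDF right next to the code that uses it, so the closure captures only a plain Python function.

from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

def addone(x):
    return x + 1

def job_run(spark):
    # Build the UDF here; its closure captures only the plain addone function,
    # never the SparkSession or SparkContext.
    udf_addone = F.udf(addone, returnType=IntegerType())
    df = spark.sql('SELECT 1 one').withColumn('AddOne', udf_addone(F.col('one')))
    df.show()

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
    job_run(spark)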

Related

spark repartition assigning same container to every element in the rdd

For some reason Spark's repartition is assigning the exact same YARN container to every element of the RDD. I do not know what the possible reason could be. The intriguing part is that if I run the same code twice without restarting the session, it is then able to partition the data properly and I see distribution over all the containers. Could you please help me understand this behavior?
I am using the following session:
import socket
import os
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
    config("spark.dynamicAllocation.enabled", False).\
    config("spark.executor.cores", "3").\
    config("spark.executor.instances", "5").\
    config("spark.executor.memory", "6g").\
    config("spark.sql.adaptive.enabled", False).\
    getOrCreate()
And, the following code:
df = spark.sparkContext.parallelize(range(240000)).repartition(4)

def f(x):
    return os.getenv("CONTAINER_ID"), socket.gethostname()

df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')}]
Use the exact same code again without restarting the session:
df = spark.sparkContext.parallelize(range(2400000)).repartition(4)

def f(x):
    return os.getenv("CONTAINER_ID"), socket.gethostname()

df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000004', 'monsoon-spark-w-0')},
{('container_1676564785882_0047_01_000005', 'monsoon-spark-sw-ppqw')},
{('container_1676564785882_0047_01_000001', 'monsoon-spark-sw-m2t7')}]
(A snapshot of the same behavior was attached to the original question.)
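There is no answer quoted for this one, but one plausible explanation (an assumption on my part, not something the post confirms) is that the first job is submitted before all five configured executors have registered with the driver, so every partition is scheduled on the lone executor that is already up; by the second run the remaining executors have joined and the partitions spread out. A rough diagnostic sketch that waits for the expected executor count before submitting work, reaching getExecutorMemoryStatus through the private _jsc handle:

import time

def wait_for_executors(spark, expected, timeout=60):
    """Poll until `expected` executors have registered, or the timeout expires."""
    deadline = time.time() + timeout
    while True:
        # getExecutorMemoryStatus also counts the driver, hence the -1.
        registered = spark.sparkContext._jsc.sc().getExecutorMemoryStatus().size() - 1
        if registered >= expected or time.time() > deadline:
            return registered
        time.sleep(1)

# e.g. wait for the five configured executor instances before the first parallelize
wait_for_executors(spark, expected=5)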

Where to write the setup and teardown code for Locust tests?

I've been exploring Locust for our load-testing requirements for Spark but am stuck on some very basic tasks; the documentation also seems very limited.
I'm stuck on how/where to write my setup & teardown code that needs to run only once, regardless of the number of users. I tried the sample below from the docs, but the code written under events.test_start doesn't seem to run, as I'm unable to use the attribute 'sc' anywhere in the SparkJob class. Any idea how to access the Spark instances created in the on_test_start method from my SparkJob class?
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        pass  # sample spark job

class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]

@events.test_start.add_listener
def on_test_start(**kw):
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    # spark = SparkSession(sc)
    return sc

@events.test_stop.add_listener
def on_test_stop(**kw):
    # spark.stop()
    sc.stop()
I don't know anything about Spark, but making sc or spark a global variable should work for you. So something like:
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

spark: SparkSession = None

class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        # sample spark job
        spark.do_stuff()

class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]

@events.test_start.add_listener
def on_test_start(**kw):
    global spark
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    spark = SparkSession(sc)

@events.test_stop.add_listener
def on_test_stop(**kw):
    spark.stop()
You can read more about Python global variables. In short, you only need the global statement if you're going to assign to or modify the variable; otherwise Python will resolve the global name for you. You can be explicit and add it in each place, though.
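A tiny illustration of that point in plain Python (nothing here is Locust- or Spark-specific):

counter = None

def read_it():
    # Reading a module-level name needs no global declaration.
    print(counter)

def write_it():
    # Assigning without `global` would create a new local variable instead.
    global counter
    counter = 42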

Error while using dataframe show method in pyspark

I am trying to read data from BigQuery using pandas and pyspark. I am able to get the data, but I get the error below while converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
Following are the environment details:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
from google.cloud import bigquery

# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """

# spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()

print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY, project=project_id)

# Wait for query execution
query_job.result()

df = SQLContext(sc).read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .option('table', query_job.destination.table_id) \
    .option("type", "direct") \
    .load()

df.show()
I am looking for some help to solve this issue.
I managed to find a better solution by referencing this link; below is my working code:
Install the pandas_gbq package in your Python environment before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)
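One follow-up note on the pandas-to-Spark step: if schema inference from the pandas frame ever causes type problems, an explicit schema can be passed to createDataFrame. The column names below are purely hypothetical placeholders for whatever the query actually returns.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical columns; replace with the real columns of testSchema.testTable.
schema = StructType([
    StructField("athlete_name", StringType(), True),
    StructField("score", LongType(), True),
])
sparkDF = spark.createDataFrame(athletes, schema=schema)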

NameError: name 'SparkSession' is not defined

I'm new to Cask CDAP and the Hadoop environment.
I'm creating a pipeline and I want to use a PySpark program. I have the full script of the Spark program, and it works when I test it from the command line; however, it doesn't work when I copy-paste it into a CDAP pipeline.
It gives me an error in the logs:
NameError: name 'SparkSession' is not defined
My script starts in this way:
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import trim, to_date, year, month
sc = SparkContext()
How can I fix it?
Spark connects to the running Spark cluster through a SparkContext. A better explanation can be found here: https://stackoverflow.com/a/24996767/5671433.
To initialise a SparkSession, a SparkContext has to be initialized.
One way to do that is to write a function that initializes all your contexts and a spark session.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession

def init_spark(app_name, master_config):
    """
    :params app_name: Name of the app
    :params master_config: eg. local[4]
    :returns SparkContext, SQLContext, SparkSession:
    """
    conf = (SparkConf().setAppName(app_name).setMaster(master_config))
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")

    sql_ctx = SQLContext(sc)
    spark = SparkSession(sc)

    return (sc, sql_ctx, spark)
This can then be called as
sc, sql_ctx, spark = init_spark("App_name", "local[4]")
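As an aside (an alternative to the helper above, not part of the original answer): on Spark 2.x and later the builder can do all of this in one step, and the SparkContext is available from the resulting session.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("App_name") \
    .master("local[4]") \
    .getOrCreate()

sc = spark.sparkContext  # the underlying SparkContext, if still needed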

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official document website:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
However, the example above wouldn't run and gave me the following errors:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-aaffcd1239c9> in <module>()
1 from pyspark import *
2 data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
----> 3 df = spark.createDataFrame(data, ["features"])
4 kmeans = KMeans(k=2, seed=1)
5 model = kmeans.fit(df)
NameError: name 'spark' is not defined
What additional configuration/variable needs to be set to get the example running?
You can add
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
to the beginning of your code to define a SparkSession; then spark.createDataFrame() should work.
The answer by ηŽ‡ζ€€δΈ€ is good and will work the first time. But the second time you try it, it will throw the following exception:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at <ipython-input-3-786525f7559f>:10
There are two ways to avoid it.
1) Using SparkContext.getOrCreate() instead of SparkContext():
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
2) Calling sc.stop() at the end, or before you start another SparkContext.
Since you are calling createDataFrame(), you need to do this:
df = sqlContext.createDataFrame(data, ["features"])
instead of this:
df = spark.createDataFrame(data, ["features"])
Here spark stands in for the sqlContext. In general, some people have it as sc, so if that didn't work, you could try:
df = sc.createDataFrame(data, ["features"])
If you are using Python, you have to import spark as follows; it will create a Spark session. Bear in mind that this is an older method, though it still works.
from pyspark.shell import spark
If it gives you an error about another open session, do this:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
scraped_data=spark.read.json("/Users/reihaneh/Desktop/nov3_final_tst1/")
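Putting the pieces together, here is a minimal self-contained version of the docs example, assuming Spark 2.x or later (where pyspark.ml uses pyspark.ml.linalg.Vectors):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.master("local[*]").getOrCreate()

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
print(model.clusterCenters())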
