I've been exploring Locust for our Spark load-testing requirements, but I'm stuck on some very basic tasks; the documentation also seems very limited.
I'm stuck on how and where to write setup and teardown code that needs to run only once, regardless of the number of users. I tried the sample below from the docs, but the code under events.test_start doesn't seem to run, as I'm unable to use the attribute 'sc' anywhere in the SparkJob class. Any idea how to access the Spark instances created in the on_test_start method from my SparkJob class?
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        # sample spark job
class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]
@events.test_start.add_listener
def on_test_start(**kw):
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    #spark = SparkSession(sc)
    return sc
@events.test_stop.add_listener
def on_test_stop(**kw):
    #spark.stop()
    sc.stop()
I don't know anything about Spark, but making sc or spark a global variable should work for you. So something like:
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
spark: SparkSession = None
class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        # sample spark job
        spark.do_stuff()
class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]
@events.test_start.add_listener
def on_test_start(**kw):
    global spark
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    spark = SparkSession(sc)
@events.test_stop.add_listener
def on_test_stop(**kw):
    spark.stop()
You can read more about Python global variables. In short, you only need the global statement in a function that assigns to the variable; a function that only reads it finds the module-level name automatically. You can be explicit and add global in each place, though.
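To make the scoping rule concrete, here is a minimal pure-Python sketch with no Spark or Locust involved. The names shared_resource, start_hook, run_job and broken_start_hook are made up for illustration; start_hook stands in for on_test_start and run_job for a task.

```python
# Module-level name shared by every function in this file.
shared_resource = None


def start_hook():
    # Assigning to the module-level name requires `global`;
    # without it, Python would bind a new local variable instead.
    global shared_resource
    shared_resource = {"name": "demo-session", "jobs_run": 0}


def run_job():
    # Reading the name (and mutating the object in place) needs no `global`.
    shared_resource["jobs_run"] += 1
    return shared_resource["jobs_run"]


def broken_start_hook():
    # No `global` here: this only binds a local name, so the
    # module-level shared_resource is left untouched.
    shared_resource = {"name": "never-visible"}
    return shared_resource


start_hook()
run_job()
run_job()
broken_start_hook()
print(shared_resource)  # {'name': 'demo-session', 'jobs_run': 2}
```

The broken variant is what happens when the listener creates sc as a local variable: the object exists only inside that function, so nothing else in the module can see it.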
Related
For some reason, Spark repartition assigns the exact same YARN container to every element of the RDD. I do not know what the reason could be. The intriguing part is that if I run the same code twice without restarting the session, the second run partitions the data properly and I see it distributed over all the containers. Could you please help me understand this behavior?
I am using the following session:
import socket
import os
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
    config("spark.dynamicAllocation.enabled",False).\
    config("spark.executor.cores","3").\
    config("spark.executor.instances","5").\
    config("spark.executor.memory","6g").\
    config("spark.sql.adaptive.enabled", False).\
    getOrCreate()
And, the following code:
df = spark.sparkContext.parallelize(range(240000)).repartition(4)
def f(x):
    return os.getenv("CONTAINER_ID"), socket.gethostname()
df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')}]
Running the code again without restarting the session:
df = spark.sparkContext.parallelize(range(2400000)).repartition(4)
def f(x):
    return os.getenv("CONTAINER_ID"), socket.gethostname()
df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000004', 'monsoon-spark-w-0')},
{('container_1676564785882_0047_01_000005', 'monsoon-spark-sw-ppqw')},
{('container_1676564785882_0047_01_000001', 'monsoon-spark-sw-m2t7')}]
Here is a snapshot for the same
I have the following code snippet in pyspark:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe
def validate_data():
    conf = SparkConf().setAppName("app")
    spark = SparkContext(conf=conf)
    config = {
        "val_path" : "s3://forecasting/data/validation.csv"
    }
    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())
if __name__ == "__main__":
    validate_data()
This code works fine when run in a Jupyter notebook on SageMaker (connecting to EMR),
but when we run it as a Python script from the terminal, it throws this error:
Error message
AttributeError: 'SparkContext' object has no attribute 'read'
We have to automate these notebooks, so we are trying to convert them to Python scripts.
You can only call read on a Spark Session, not on a Spark Context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("app")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Or you can convert the Spark context to a Spark session
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
Good morning
When running:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
class ETL:
    def addone(x):
        return x + 1
    def job_run():
        df = spark.sql('SELECT 1 one').withColumn('AddOne', udf_addone(F.col('one')))
        df.show()
if (__name__ == '__main__'):
    udf_addone = F.udf(lambda x: ETL.addone(x), returnType=IntegerType())
    ETL.job_run()
I get the following error message:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I have reviewed the answers given at ERROR:SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063 and at Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion with no success. I'd like to stick to using spark udf in my script.
Any help on this is appreciated.
Many thanks!
I am trying to read data from BigQuery using pandas and PySpark. I am able to get the data, but I get the error below when converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
The environment details are as follows:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession , SQLContext
from google.cloud import bigquery
# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """
#spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()
print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY,project=project_id)
# Wait for query execution
query_job.result()
df = SQLContext(sc).read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .option('table', query_job.destination.table_id) \
    .option("type", "direct") \
    .load()
df.show()
I am looking for some help to solve this issue.
I managed to find a better solution by referencing this link; below is my working code:
Install the pandas_gbq package before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)
I'm new to Cask CDAP and the Hadoop environment.
I'm creating a pipeline and I want to use a PySpark program. I have the full script of the Spark program and it works when I test it from the command line, but it fails when I copy and paste it into a CDAP pipeline.
It gives me an error in the logs:
NameError: name 'SparkSession' is not defined
My script starts in this way:
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import trim, to_date, year, month
sc= SparkContext()
How can I fix it?
Spark connects to the locally running Spark cluster through SparkContext. A better explanation can be found here https://stackoverflow.com/a/24996767/5671433.
To initialise a SparkSession, a SparkContext has to be initialized.
One way to do that is to write a function that initializes all your contexts and a spark session.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
def init_spark(app_name, master_config):
    """
    :params app_name: Name of the app
    :params master_config: eg. local[4]
    :returns SparkContext, SQLContext, SparkSession:
    """
    conf = (SparkConf().setAppName(app_name).setMaster(master_config))
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")
    sql_ctx = SQLContext(sc)
    spark = SparkSession(sc)
    return (sc, sql_ctx, spark)
This can then be called as
sc, sql_ctx, spark = init_spark("App_name", "local[4]")