Spark repartition assigning the same container to every element in the RDD - apache-spark

For some reason, Spark's repartition is assigning the exact same YARN container to every element of the RDD, and I do not know what the possible reason could be. The intriguing part is that if I run the same code a second time without restarting the session, the data is partitioned properly and I see a distribution over all the containers. Could you please help me understand this behavior?
I am using the following session:
import socket
import os
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", False) \
    .config("spark.executor.cores", "3") \
    .config("spark.executor.instances", "5") \
    .config("spark.executor.memory", "6g") \
    .config("spark.sql.adaptive.enabled", False) \
    .getOrCreate()
And the following code:
df = spark.sparkContext.parallelize(range(240000)).repartition(4)

def f(x):
    return os.getenv("CONTAINER_ID"), socket.gethostname()

df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')}]
Running the same code again without restarting the session:
df = spark.sparkContext.parallelize(range(2400000)).repartition(4)

def f(x):
    return os.getenv("CONTAINER_ID"), socket.gethostname()

df = df.map(f)
[set(i) for i in df.glom().collect()]
output:
[{('container_1676564785882_0047_01_000002', 'monsoon-spark-sw-009d')},
{('container_1676564785882_0047_01_000004', 'monsoon-spark-w-0')},
{('container_1676564785882_0047_01_000005', 'monsoon-spark-sw-ppqw')},
{('container_1676564785882_0047_01_000001', 'monsoon-spark-sw-m2t7')}]
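For reference, a quick check between the two runs (a hedged diagnostic sketch using the session above): on a YARN cluster, sparkContext.defaultParallelism is derived from the total cores of the executors that have registered so far (unless spark.default.parallelism is set explicitly), so it hints at how much of the cluster the driver actually knows about when a job is submitted.
# With spark.executor.instances=5 and spark.executor.cores=3, a fully
# registered cluster should report 15; a smaller value right after startup
# would mean that not all executors had joined yet.
print(spark.sparkContext.defaultParallelism)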

Related

Unable to read images simultaneously [in parallel] using pyspark

I have 10 JPEG images in a directory.
I want to read them all simultaneously using pyspark.
I tried the following:
import glob

import numpy as np
from PIL import Image
from pyspark import SparkContext, SparkConf

conf = SparkConf()
spark = SparkContext(conf=conf)

files = glob.glob("E:\\tests\\*.jpg")
files_ = spark.parallelize(files)

arrs = []
for fi in files_.toLocalIterator():
    im = Image.open(fi)
    data = np.asarray(im)
    arrs.append(data)

img = np.array(arrs)
print(img.shape)
The code ended without error and printed out img.shape; however, it did not run in parallel.
Could you help me?
You can use rdd.map to load and transform the pictures in parallel and then collect the rdd into a Python list:
files = glob.glob("E:\\tests\\*.jpg")
file_rdd = spark.parallelize(files)

def image_to_array(path):
    im = Image.open(path)
    data = np.asarray(im)
    return data

array_rdd = file_rdd.map(lambda f: image_to_array(f))
result_list = array_rdd.collect()
result_list is now a list with 10 elements; each element is a numpy.ndarray.
The function image_to_array will be executed on different Spark executors in parallel. If you have a multi-node Spark cluster, you have to make sure that all nodes can access E:\\tests\\.
After collecting the arrays, processing can continue with
img = np.array(result_list, dtype=object)
My solution follows the same idea as werner's, but uses only Spark libraries:
from pyspark.ml.image import ImageSchema
import numpy as np

df = (spark
      .read
      .format("image")
      .option("pathGlobFilter", "*.jpg")
      .load("your_data_path"))
df = df.select('image.*')

# Pre-caching the required schema. If you remove this line an error will be raised.
ImageSchema.imageFields

# Transforming images to np.array
arrays = df.rdd.map(ImageSchema.toNDArray).collect()
img = np.array(arrays)
print(img.shape)

Where to write the setup and teardown code for Locust tests?

I've been exploring Locust for our Spark load-testing requirements but am stuck on some very basic tasks; the documentation also seems very limited.
I'm stuck on how/where to write my setup and teardown code that needs to run only once, regardless of the number of users. I tried the sample below from the docs, but the code under events.test_start doesn't seem to run, as I'm unable to use the attribute 'sc' anywhere in the SparkJob class. Any idea how to access the Spark instances created in the on_test_start method from my SparkJob class?
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        pass  # sample spark job

class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]

@events.test_start.add_listener
def on_test_start(**kw):
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    # spark = SparkSession(sc)
    return sc

@events.test_stop.add_listener
def on_test_stop(**kw):
    # spark.stop()
    sc.stop()
I don't know anything about Spark, but making sc or spark a global variable should work for you. So something like:
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

spark: SparkSession = None

class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        # sample spark job
        spark.do_stuff()

class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]

@events.test_start.add_listener
def on_test_start(**kw):
    global spark
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    spark = SparkSession(sc)

@events.test_stop.add_listener
def on_test_stop(**kw):
    spark.stop()
You can read more about Python global variables. In short, you only need the global statement if you're going to assign to or rebind the name; for plain reads, Python resolves the global for you. You can be explicit and add it in each place, though.
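A minimal illustration of that rule, independent of Locust and Spark (the counter name is just for demonstration):
counter = 0

def read_counter():
    # Reading a module-level name needs no declaration;
    # Python resolves it from the global scope automatically.
    return counter

def bump_counter():
    # Rebinding the name requires the global statement; without it,
    # the assignment would create a new local variable instead.
    global counter
    counter += 1

bump_counter()
print(read_counter())  # prints 1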

SparkContext conflict with spark udf

Good morning
When running:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class ETL:
    def addone(x):
        return x + 1

    def job_run():
        df = spark.sql('SELECT 1 one').withColumn('AddOne', udf_addone(F.col('one')))
        df.show()

if __name__ == '__main__':
    udf_addone = F.udf(lambda x: ETL.addone(x), returnType=IntegerType())
    ETL.job_run()
I get the following error message:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I have reviewed the answers given at "ERROR: SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063" and at "Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation", with no success. I'd like to stick to using a Spark UDF in my script.
Any help on this is appreciated.
Many thanks!

Error while using dataframe show method in pyspark

I am trying to read data from BigQuery using pandas and pyspark. I am able to get the data, but I get the error below while converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
Following are the environment details:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
from google.cloud import bigquery

# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """

# spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()

print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY, project=project_id)

# Wait for query execution
query_job.result()

df = SQLContext(sc).read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .option('table', query_job.destination.table_id) \
    .option("type", "direct") \
    .load()

df.show()
I am looking for some help to solve this issue.
I managed to find a better solution by referencing this link; below is my working code.
Install the pandas_gbq package before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)

How to distribute classes with PySpark and Jupyter

I have an annoying problem using Jupyter Notebook with Spark.
I need to define a custom class inside the notebook and use it to perform some map operations:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext

conf = SparkConf().setMaster("spark://192.168.10.11:7077") \
    .setAppName("app_jupyter/") \
    .set("spark.cores.max", "10")
sc = SparkContext(conf=conf)

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

class demo(object):
    def __init__(self, value):
        self.test = value + 10

distData.map(lambda x: demo(x)).collect()
It gives the following error:
PicklingError: Can't pickle <class '__main__.demo'>: attribute lookup __main__.demo failed
I know what this error is about, but I couldn't figure out a solution.
I have tried:
Defining a demo.py Python file outside the notebook. It works, but it is such an ugly solution ...
Creating a dynamic module like this, and then importing it afterwards... This gives the same error.
What would be a solution? I want everything to work in the same notebook.
Is it possible to change something in:
The way Spark works, maybe some pickle configuration?
Something in the code... use some static magic approach?
There is no reliable and elegant workaround here, and this behavior is not particularly related to Spark. It is all about the fundamental design of Python's pickle:
pickle can save and restore class instances transparently; however, the class definition must be importable and live in the same module as when the object was stored.
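A minimal local sketch of that by-reference behavior (plain pickle, no Spark; the class name is illustrative):
import pickle

class Demo(object):
    def __init__(self, value):
        self.test = value + 10

d = Demo(1)
# dumps() records a reference to "__main__.Demo" plus the instance data,
# not the class definition itself.
blob = pickle.dumps(d)
# loads() works here only because __main__.Demo can be looked up again.
restored = pickle.loads(blob)
print(restored.test)  # 11
In the Spark case the class lives in the notebook's __main__, which the worker processes cannot import, which is why the lookup fails.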
Theoretically you could define a custom cell magic which would:
Write the content of a cell to a module.
Import it.
Call SparkContext.addPyFile to distribute the module.
from IPython.core.magic import register_cell_magic
import importlib

@register_cell_magic
def spark_class(line, cell):
    module = line.strip()
    f = "{0}.py".format(module)
    with open(f, "w") as fw:
        fw.write(cell)
    globals()[module] = importlib.import_module(module)
    sc.addPyFile(f)
In [2]: %%spark_class foo
   ...: class Foo(object):
   ...:     def __init__(self, x):
   ...:         self.x = x
   ...:     def __repr__(self):
   ...:         return "Foo({0})".format(self.x)
   ...:
In [3]: sc.parallelize([1, 2, 3]).map(lambda x: foo.Foo(x)).collect()
Out[3]: [Foo(1), Foo(2), Foo(3)]
but it is a one-time deal. Once a file is marked for distribution, it cannot be changed or redistributed. Moreover, there is the problem of reloading imports on remote hosts. I can think of some more elaborate schemes, but this is simply more trouble than it is worth.
The answer from zero323 is solid: there's no one "right" way to solve this problem. You could indeed use Jupyter magic, as proposed. Another way is to use Jupyter's %%writefile to keep your code inline in a Jupyter cell but write it to disk as a Python file. Then you can both import the file into your Jupyter kernel session and ship it with your PySpark job (via addPyFile(), as noted in the other answer). Note that if you make changes to the code but don't restart your PySpark session, you'll have to get the updated code to PySpark somehow.
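A rough sketch of that pattern (the module name demo_mod is made up, and sc is assumed to be an existing SparkContext). In one notebook cell, write the class to a real module file on the driver:
%%writefile demo_mod.py
class Foo(object):
    def __init__(self, x):
        self.x = x
Then, in the next cell, import it into the kernel and ship the same file to the executors:
import demo_mod

sc.addPyFile("demo_mod.py")
print(sc.parallelize([1, 2, 3]).map(lambda x: demo_mod.Foo(x).x).collect())  # [1, 2, 3]
If you edit the class afterwards, you have to rewrite and re-import the module yourself, and, as the paragraph above notes, the copy already shipped to the executors will not pick up the change within the same session.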
Can we make this easier? I wrote a blogpost about this topic as well as a PySpark Session wrapper (oarphpy.spark.NBSpark) to help automate a lot of the tricky stuff. See the Jupyter Notebook embedded in that post for a working example. The overall pattern looks like this:
import os
import sys

CUSTOM_LIB_SRC_DIR = '/tmp/'
os.chdir(CUSTOM_LIB_SRC_DIR)
!mkdir -p mymodule
!touch mymodule/__init__.py

%%writefile mymodule/foo.py
class Zebra(object):
    def __init__(self, name):
        self.name = name

sys.path.append(CUSTOM_LIB_SRC_DIR)
from mymodule.foo import Zebra

# Create Zebra() instances in the notebook
herd = [Zebra(name=str(i)) for i in range(10)]

# Now send those instances to PySpark!
from oarphpy.spark import NBSpark

NBSpark.SRC_ROOT = os.path.join(CUSTOM_LIB_SRC_DIR, 'mymodule')
spark = NBSpark.getOrCreate()
rdd = spark.sparkContext.parallelize(herd)

def get_name(z):
    return z.name

names = rdd.map(get_name).collect()
Additionally, if you make any changes to the mymodule files on disk (via %%writefile or otherwise), NBSpark will automatically ship those changes to the active PySpark session.
