How to debug MemoryError in PySpark - apache-spark

I want to download some XML files (50 MB each, about 3000 files, roughly 150 GB in total), process them, and upload the results to BigQuery using PySpark. For development I was using a Jupyter notebook and a small number of files (10). I wrote fairly complex code and set up a cluster on Dataproc. My Dataproc cluster has 6 TB of HDFS, 10 nodes (4 cores each) and 120 GB of RAM.
def context():
    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
    import pyspark
    conf = pyspark.SparkConf()
    conf = (conf.setMaster('local[*]')
            .set('spark.executor.memory', '4G')
            .set('spark.driver.memory', '45G')
            .set('spark.driver.maxResultSize', '10G')
            .set("spark.python.profile", "true"))
    sc = pyspark.SparkContext(conf=conf)
    return sc
def job(sc):
    print("Job started")
    RDDread = sc.wholeTextFiles("s3a://custom-bucket/*/*.gz")
    models = RDDread.flatMap(process_xmls).groupByKey()
    tracking_all = (models.filter(lambda x: x[0] == TrackInformation)
                    .flatMap(lambda x: x[1])
                    .map(lambda model: (model.flight_ref, model))
                    .groupByKey())
    tracking_merged = tracking_all.map(lambda x: x[1]).map(merge_ti)
    flight_plans = (models.filter(lambda x: x[0] == FlightPlan)
                    .flatMap(lambda x: x[1])
                    .map(lambda fp: (fp.flight_ref, fp)))
    fps_tracking = tracking_merged.union(flight_plans).groupByKey().filter(lambda x: len(x[1]) == 2)
    in_bq_batch = 1000
    n = fps_tracking.count()
    parts = ceil(n / in_bq_batch)
    many_n = fps_tracking.repartition(parts).mapPartitions(upload_fpm2)
    print("Job ended")
    return fps_tracking, tracking_merged, flight_plans, models, many_n
After about 200 messages of org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.gz] I'm getting two errors: java.lang.OutOfMemoryError and MemoryError, mostly MemoryError. I thought that I had just 2 partitions after RDDread, so I changed the call to sc.wholeTextFiles("s3a://custom-bucket/*/*.gz", minPartitions=40), and it broke even faster. I also tried adding persist() with the DISK_ONLY storage level in some random places.
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 684, in loads
return s.decode("utf-8") if self.use_unicode else s
MemoryError
19/05/20 14:09:23 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.gz]
19/05/20 14:09:30 ERROR org.apache.spark.util.Utils: Uncaught exception in thread stdout writer for /opt/conda/default/bin/python
java.lang.OutOfMemoryError: Java heap space
What am I doing wrong, and how can I debug my code?

You seem to be running Spark in local mode (local[*]). This means that you are using a single JVM with 45 GB of RAM (spark.driver.memory) and that all your worker threads run within that JVM. The spark.executor.memory option has no effect in local mode (see "What does setMaster `local[*]` mean in Spark?").
You should point your Spark master either at the YARN scheduler or, if you have no YARN, use standalone mode: https://spark.apache.org/docs/latest/spark-standalone.html.
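A minimal sketch of that change (the memory and core values below are placeholders, not recommendations; tune them to the 10-node, 4-core cluster described above), rewriting the context() helper from the question to run against YARN instead of local[*]:

import pyspark

def context():
    conf = (pyspark.SparkConf()
            .setMaster('yarn')                         # let YARN schedule executors on the workers
            .set('spark.submit.deployMode', 'client')
            .set('spark.executor.memory', '8G')        # now takes effect: one JVM per executor
            .set('spark.executor.cores', '4')
            .set('spark.driver.memory', '8G')
            .set('spark.driver.maxResultSize', '2G'))
    return pyspark.SparkContext(conf=conf)

On Dataproc, YARN is already the default resource manager, so an even simpler option is to drop setMaster entirely and pass these settings as properties when submitting the job to the cluster.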

Related

Unable to read images simultaneously [in parallel] using pyspark

I have 10 JPEG images in a directory.
I want to read all of them simultaneously using PySpark.
I tried the following:
import glob

import numpy as np
from PIL import Image
from pyspark import SparkContext, SparkConf

conf = SparkConf()
spark = SparkContext(conf=conf)

files = glob.glob("E:\\tests\\*.jpg")
files_ = spark.parallelize(files)

arrs = []
for fi in files_.toLocalIterator():
    im = Image.open(fi)
    data = np.asarray(im)
    arrs.append(data)

img = np.array(arrs)
print(img.shape)
The code ended without error and printed out img.shape; however, it did not run in parallel.
Could you help me?
You can use rdd.map to load and transform the pictures in parallel and then collect the rdd into a Python list:
files = glob.glob("E:\\tests\\*.jpg")
file_rdd = spark.parallelize(files)

def image_to_array(path):
    im = Image.open(path)
    data = np.asarray(im)
    return data

array_rdd = file_rdd.map(lambda f: image_to_array(f))
result_list = array_rdd.collect()
result_list is now a list with 10 elements; each element is a numpy.ndarray.
The function image_to_array will be executed on different Spark executors in parallel. If you have a multi-node Spark cluster, you have to make sure that all nodes can access E:\\tests\\.
After collecting the arrays, processing can continue with
img = np.array(result_list, dtype=object)
My solution follows the same idea as werner's, but uses only Spark libraries:
from pyspark.ml.image import ImageSchema
import numpy as np

df = (spark
      .read
      .format("image")
      .option("pathGlobFilter", "*.jpg")
      .load("your_data_path"))
df = df.select('image.*')

# Pre-caching the required schema. If you remove this line an error will be raised.
ImageSchema.imageFields

# Transforming images to np.array
arrays = df.rdd.map(ImageSchema.toNDArray).collect()

img = np.array(arrays)
print(img.shape)

Erratic occurrence of "Container killed by YARN for exceeding memory limits."

ErrorMessage': 'An error occurred while calling o103.pyWriteDynamicFrame.
Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most
recent failure: Lost task 0.3 in stage 5.0
(TID 131, ip-1-2-3-4.eu-central-1.compute.internal, executor 20):
ExecutorLostFailure (executor 20 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of
5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
The job is doing this (pseudo code):
Reads CSV into DynamicFrame dynf
dynf.toDF().repartition(100)
Map.apply(dynf, tf)  # tf being a function applied to every row
dynf.toDF().coalesce(10)
Writes dynf as Parquet to S3
This job has been executed successfully dozens of times with an identical Glue setup (Standard worker with MaxCapacity of 10.0), and re-running it on a CSV it failed on usually succeeds without any adjustments. Meaning: it works. Not just that: the job even ran successfully on much larger CSVs than the ones it failed on.
That's what I mean by erratic. I don't see a pattern like "if the CSV is larger than X, then I need more workers" or something like that.
Does somebody have an idea what might cause this error, which occurs somewhat randomly?
The relevant part of the code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# s3://bucket/path/object
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'SOURCE_BUCKET',  # "bucket"
    'SOURCE_PATH',    # "path/"
    'OBJECT_NAME',    # "object"
    'TARGET_BUCKET',  # "bucket"
    'TARGET_PATH',    # "path/"
    'PARTS_LOAD',
    'PARTS_SAVE'
])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

data_DYN = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": [
            "s3://{sb}/{sp}{on}".format(
                sb=args['SOURCE_BUCKET'],
                sp=args['SOURCE_PATH'],
                on=args['OBJECT_NAME']
            )
        ]
    },
    format_options={
        "withHeader": True,
        "separator": ","
    }
)

data_DF = data_DYN.toDF().repartition(int(args["PARTS_LOAD"]))
data_DYN = DynamicFrame.fromDF(data_DF, glueContext, "data_DYN")

def tf(rec):
    # functions applied to elements of rec
    return rec

data_DYN_2 = Map.apply(data_DYN, tf)

cols = [
    'col1', 'col2', ...
]
data_DYN_3 = SelectFields.apply(data_DYN_2, cols)

data_DF_3 = data_DYN_3.toDF().cache()
data_DF_4 = data_DF_3.coalesce(int(args["PARTS_SAVE"]))
data_DYN_4 = DynamicFrame.fromDF(data_DF_4, glueContext, "data_DYN_4")

datasink = glueContext.write_dynamic_frame.from_options(
    frame=data_DYN_4,
    connection_type="s3",
    connection_options={
        "path": "s3://{tb}/{tp}".format(tb=args['TARGET_BUCKET'], tp=args['TARGET_PATH']),
        "partitionKeys": ["col_x", "col_y"]
    },
    format="parquet",
    transformation_ctx="datasink"
)

job.commit()
I would suspect .coalesce(10) to be the culprit, due to the 100 -> 10 reduction in the number of partitions without rebalancing the data across them. Doing .repartition(10) instead might fix it, at the expense of an extra shuffle.
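A quick sketch of that swap in the job above (same PARTS_SAVE argument as before); repartition performs a full shuffle, but it spreads the rows evenly across the output partitions instead of stitching together whatever the previous 100 partitions happened to contain:

# repartition shuffles the data evenly instead of merging existing partitions,
# so no single output partition (and its executor) ends up disproportionately large
data_DF_4 = data_DF_3.repartition(int(args["PARTS_SAVE"]))
data_DYN_4 = DynamicFrame.fromDF(data_DF_4, glueContext, "data_DYN_4")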

sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper() TypeError: 'JavaPackage' object is not callable when using

I am learning how to integrate Spark with Kafka. I created a virtualenv and installed the pyspark and py4j packages.
I also configured these environment variables:
PYSPARK_PYTHON : C:\learn_new\learn_utils\venv\Scripts\python.exe
SPARK_HOME : C:\spark-2.4.3-bin-hadoop2.7
Then I want to run the example Python source code at C:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\streaming\direct_kafka_wordcount.py
The script code is this:
from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
The command line I use to run the Python code inside the virtualenv is:
python --default --client --host localhost --port 60614 c:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\streaming\direct_kafka_wordcount.py kafka_host_name:9092 topic_name
Then I got this error:
File "c:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\venv\lib\site-packages\pyspark\streaming\kafka.py", line 138, in createDirectStream
helper = KafkaUtils._get_helper(ssc._sc)
File "c:\spark-2.4.3-bin-hadoop2.7\examples\src\main\python\venv\lib\site-packages\pyspark\streaming\kafka.py", line 217, in _get_helper
return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable
What is causing this issue?
Thanks very much.
I mainly want to debug the code locally, so I do not want to use spark-submit with --jars or --packages parameters to run it.
But it really needs the spark-streaming-kafka-0-8-assembly_2.11-2.4.3.jar package (change the package version according to your Spark version).
So I downloaded the package and saved it to C:\spark-2.4.3-bin-hadoop2.7\jars (change this to your Spark installation path and locate its jars folder).
Then the issue was solved. I hope it helps other people.
I had a similar problem and added the jar to two places: first, where Spark keeps all of its jars; second, into the jars folder of the pyspark package installed in the current Python environment. And it worked.
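If copying jars around is undesirable, another option is a sketch based on the approach used in the first question on this page (assuming the machine can reach Maven repositories): let PySpark fetch the Kafka integration itself by setting PYSPARK_SUBMIT_ARGS before the SparkContext is created.

import os

# Must be set before the SparkContext (and thus the JVM) is created.
# The coordinates follow the Spark 2.4 / Kafka 0.8 integration docs;
# adjust the version to match your Spark installation.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.3 '
    'pyspark-shell'
)

from pyspark import SparkContext
sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")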

Spark on Google Cloud Dataproc job failures on last stages

I work with a Spark cluster on Dataproc and my job fails at the end of processing.
My data source is text log files in CSV format on Google Cloud Storage (total volume 3.5 TB, 5000 files).
The processing logic is as follows:
read the files into a DataFrame (schema ["timestamp", "message"]);
group all messages into windows of 1 second;
apply a pipeline [Tokenizer -> HashingTF] to every grouped message to extract words and their frequencies and build feature vectors;
save the feature vectors with their timelines to GCS.
The issue I'm having is that processing works well on a small subset of data (like 10 files), but when I run it on all the files it fails at the very end with an error like "Container killed by YARN for exceeding memory limits. 25.0 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead."
My cluster has 25 workers with n1-highmem-8 machines. So I googled this error and increased the "spark.yarn.executor.memoryOverhead" parameter to 6500 MB.
Now my Spark job still fails, but with the error "Job aborted due to stage failure: Total size of serialized results of 4293 tasks (1920.0 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)".
I'm new to Spark and I believe I'm doing something wrong, either at the configuration level or in my code. If you can help me clear these things up, that would be great!
Here is my code for the spark task:
import logging
import string
from datetime import datetime

import pyspark
import re
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, TimestampType, ArrayType
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
NOW = datetime.now().strftime("%Y%m%d%H%M%S")
START_DATE = '2016-01-01'
END_DATE = '2016-03-01'

sc = pyspark.SparkContext()
spark = SparkSession\
    .builder\
    .appName("LogsVectorizer")\
    .getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', 10000)

logger.info("Start log processing at {}...".format(NOW))

# Filenames to read/write locations
logs_fn = 'gs://databucket/csv/*'
vectors_fn = 'gs://databucket/vectors_out_{}'.format(NOW)
pipeline_fn = 'gs://databucket/pipeline_vectors_out_{}'.format(NOW)
model_fn = 'gs://databucket/model_vectors_out_{}'.format(NOW)

# CSV data schema to build DataFrame
schema = StructType([
    StructField("timestamp", StringType()),
    StructField("message", StringType())])

# Helpers to clean strings in log fields
def cleaning_string(s):
    try:
        # Remove ids (like: app[2352] -> app)
        s = re.sub('\[.*\]', 'IDTAG', s)
        if s == '':
            s = 'EMPTY'
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s

def normalize_string(s):
    try:
        # Remove punctuation
        s = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', s)
        # Remove digits
        s = re.sub('\d*', '', s)
        # Remove extra spaces
        s = ' '.join(s.split())
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s

def line_splitter(line):
    line = line.split(',')
    timestamp = line[0]
    full_message = ' '.join(line[1:])
    full_message = normalize_string(cleaning_string(full_message))
    return [timestamp, full_message]

# Read line from csv, split to date|message
# Read CSV to DataFrame and clean its fields
logger.info("Read CSV to DF...")
logs_csv = sc.textFile(logs_fn)
logs_csv = logs_csv.map(lambda line: line_splitter(line)).toDF(schema)

# Keep only lines for our date interval
logger.info("Filter by dates...")
logs_csv = logs_csv.filter((logs_csv.timestamp > START_DATE) & (logs_csv.timestamp < END_DATE))
logs_csv = logs_csv.withColumn("timestamp", logs_csv.timestamp.cast("timestamp"))

# Helpers to join messages into window and convert sparse to dense
join_ = F.udf(lambda x: "| ".join(x), StringType())
asDense = F.udf(lambda v: v.toArray().tolist())

# Agg by time window
logger.info("Group log messages by time window...")
logs_csv = logs_csv.groupBy(F.window("timestamp", "1 second"))\
    .agg(join_(F.collect_list("message")).alias("messages"))

# Turn message to hashTF
tokenizer = Tokenizer(inputCol="messages", outputCol="message_tokens")
hashingTF = HashingTF(inputCol="message_tokens", outputCol="tokens_counts", numFeatures=1000)
pipeline_tf = Pipeline(stages=[tokenizer, hashingTF])

logger.info("Fit-Transform ML Pipeline...")
model_tf = pipeline_tf.fit(logs_csv)
logs_csv = model_tf.transform(logs_csv)

logger.info("Sparse vectors to Dense list...")
logs_csv = logs_csv.sort("window.start").select(["window.start", "tokens_counts"])\
    .withColumn("tokens_counts", asDense(logs_csv.tokens_counts))

# Save to disk
# Save Pipeline and Model
logger.info("Save models...")
pipeline_tf.save(pipeline_fn)
model_tf.save(model_fn)

# Save to GCS
logger.info("Save results to GCS...")
logs_csv.write.parquet(vectors_fn)
spark.driver.maxResultSize is an issue with the size of results returned to your driver, which in Dataproc runs on the master node.
By default, 1/4 of the master's memory is given to the driver, and 1/2 of that is set as spark.driver.maxResultSize (the largest amount of data Spark will let you .collect()).
I'm guessing Tokenizer or HashingTF are moving "metadata" through the driver that is the size of your keyspace. To increase the allowable size you can increase spark.driver.maxResultSize, but you might also want to increase spark.driver.memory and/or use a larger master as well. Spark's configuration guide has more information.
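As a rough sketch (the values are placeholders sized against an assumed master machine, not recommendations), both settings can be raised when building the context, or passed as properties when submitting the job to Dataproc:

import pyspark
from pyspark.sql import SparkSession

# Placeholder sizes; they must fit within the master node's RAM.
conf = (pyspark.SparkConf()
        .set('spark.driver.memory', '20g')
        .set('spark.driver.maxResultSize', '8g'))
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.appName("LogsVectorizer").getOrCreate()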

Kafka integration with spark

I want to set up a streaming application using Apache Kafka and Spark Streaming. Kafka is running on a separate Unix machine (version 0.9.0.1) and Spark v1.6.1 is part of a Hadoop cluster.
I have started ZooKeeper and the Kafka server, and I want to stream messages from a log file using the console producer and consume them in a Spark Streaming application using the direct approach (no receivers). I have written the code in Python and I'm executing it with the command below:
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar streamingDirectKafka.py
I'm getting the error below:
/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o38.createDirectStreamWithoutMessageHandler.
: java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
Could you please help?
Thanks!!
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    conf = SparkConf().setAppName("StreamingDirectKafka")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)

    topic = ['test']
    kafkaParams = {"metadata.broker.list": "apsrd7102:9092"}
    lines = (KafkaUtils.createDirectStream(ssc, topic, kafkaParams)
             .map(lambda x: x[1]))
    counts = (lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a+b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
It looks like you are using an incompatible version of Kafka. From the documentation, as of Spark 2.0, Kafka 0.8.x is supported:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
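For reference, a hedged sketch of the documented submission pattern for Spark 1.6.x with the 0.8-based integration (the coordinates are inferred from the name of the assembly jar used above; verify them against your cluster's Scala and Spark versions):

spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 streamingDirectKafka.py

The 1.6.x integration is built against the Kafka 0.8 client classes, so conflicting Kafka jars already on the cluster classpath can surface as exactly the BrokerEndPoint/Broker ClassCastException shown in the question.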
