I have been trying to get a PySpark job to work which creates an RDD from a bunch of binary files and then uses a flatMap operation to process the binary data into a bunch of rows. This has led to a series of out-of-memory errors, and after playing around with memory settings for a while I decided to get the simplest thing possible working, which is just counting the number of files in the RDD.
This also fails with an OOM error. So I opened up both the spark-shell and PySpark and ran the commands in the REPL/shell with default settings; the only additional parameter was --master yarn. The spark-shell version works, while the PySpark version shows the same OOM error.
Is there that much overhead to running PySpark? Or is this a problem with binaryFiles being new? I am using Spark version 2.2.0.2.6.4.0-91.
The difference:
Scala will load records as PortableDataStream - this means the process is lazy, and unless you call toArray on the values, it won't load the data at all.
Python will call the Java backend, but load the data as a byte array. This part will be eager-ish, and therefore might fail on both sides.
Additionally, PySpark will use at least twice as much memory - one copy on the Java side and one on the Python side.
Finally, binaryFiles (same as wholeTextFiles) is very inefficient and doesn't perform well if individual input files are large. In a case like this it is better to implement a format-specific Hadoop input format.
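If the files can be processed record by record, one way to keep the Python side from building large intermediate lists is to do the parsing inside the flatMap function with a generator, so rows are yielded one at a time (each file is still loaded whole as bytes, which binaryFiles cannot avoid). A minimal sketch, where the input path and the fixed-width record layout are assumptions:
def parse_records(pair):
    path, content = pair                            # content is the whole file as a bytes object
    record_size = 16                                # hypothetical fixed-width record
    for offset in range(0, len(content), record_size):
        yield (path, content[offset:offset + record_size])

rdd = sc.binaryFiles("hdfs:///data/binary/")        # hypothetical input directory
rows = rdd.flatMap(parse_records)                   # rows are produced lazily, one file at a time
print(rows.count())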
Since you are reading multiple binary files with binaryFiles(), and since starting with Spark 2.1 the minPartitions argument of binaryFiles() is ignored:
1. Try to repartition the input files along the following lines:
rdd = sc.binaryFiles(<path to the binary files>, minPartitions=<...>).repartition(<...>)
2. You may try reducing the partition size to 64 MB or less, depending on the size of your data, using the configs below:
spark.files.maxPartitionBytes, default 128 MB
spark.files.openCostInBytes, default 4 MB
spark.default.parallelism
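As a sketch, these could be set when building the session; the 64 MB and parallelism values below are placeholders to illustrate the knobs, not recommendations:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("binary-files-job")                                  # hypothetical app name
         .config("spark.files.maxPartitionBytes", str(64 * 1024 * 1024))
         .config("spark.files.openCostInBytes", str(4 * 1024 * 1024))
         .config("spark.default.parallelism", "100")                   # hypothetical value
         .getOrCreate())
sc = spark.sparkContext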
Related
I want to monitor the number of files that Spark generates, and maybe raise an exception if it is generating a lot of files. Is there any way to see this?
Well, it depends on how you are doing the write operation. Assuming you are writing the content of a DataFrame or RDD as output, the easiest way is to check the number of partitions in your final DataFrame/RDD; each partition is written out as a separate file.
Assuming you are using Scala, this should give you the number of partitions:
df.rdd.getNumPartitions
Instead of raising an exception and causing the job to fail, I would suggest using the coalesce function to repartition the DataFrame with a value that suits your needs. For example, if the output is not too large (1 GB or less) I use coalesce(1) and write only one file.
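The same idea in PySpark, as a rough sketch (the threshold of 10 files and the output path are hypothetical):
max_files = 10                                   # hypothetical upper bound on output files
num_parts = df.rdd.getNumPartitions()            # each partition becomes one output file
print("partitions to be written:", num_parts)
if num_parts > max_files:
    df = df.coalesce(max_files)                  # reduce the partition count without a full shuffle
df.write.parquet("s3://bucket/output/")          # hypothetical output path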
I am working on a project where I have to read S3 files (each about 3 MB, zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate 2 types of output data, which are written back to S3. The pyspark script uses the 'xmltodict' Python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster v5.28 running with 1 master and 1 core node. This might be excessive, but it is not my main concern right now.
Questions:
1. How do I know IF I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What are the criteria that drive partitioning - number of rows, columns, data types, actions taken in the script, etc. in the source data file? I read the source file into an RDD, convert it to a DF, and perform various operations by adding columns, grouping data, counting data, etc. How does Spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node, as I have decided to stick with the standalone cluster manager. 'xmltodict' is installed on this node, but not on the core node. It doesn't seem like it needs to be installed, or even python3 configured, on the core node, since I am not seeing any errors. Is that correct, and can somebody shed some light on this confusion? I tried to install the Python libraries via a shell file as a bootstrap action when I created the cluster, but it failed, and quite frankly after trying a few times I gave up.
3. Related to partitioning, I am slightly confused about whether to use coalesce() or collect(). Again, the question is when to use which, and when not to.
Sorry, too many questions. Now that I have the pyspark script written, I am trying to work on its efficiency.
Thanks
Partitioning is the mechanism by which data is divided into optimally sized chunks, and based on that multiple tasks are run, each processing one piece of the data. As you can see, this is the core of parallelism, and without it there is no significant benefit to Spark (or any big data processing framework). Most file formats are splittable, and some remain splittable when compressed, like Avro, Parquet, ORC, etc. Some file formats are not splittable when compressed, like zip, gzip, etc. Based on the size of the files being processed and their ability to be split, Spark automatically creates multiple partitions and processes the data in parallel. In your case, the data being zip, one file will be one partition, and no more than one CPU can work on it at once. If the zip is small that is OK, but if it is big its processing will be slow.
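To make this concrete, a small PySpark sketch (the S3 path and the partition count of 8 are hypothetical): reading a single compressed, non-splittable file typically yields one partition, and an explicit repartition spreads the downstream work across the cluster.
df = spark.read.json("s3://bucket/input/data.json.gz")  # hypothetical path; gzip is not splittable
print(df.rdd.getNumPartitions())                        # typically 1 for a single gzipped file
df = df.repartition(8)                                  # hypothetical count; later stages now run in parallel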
I am doing some benchmarking on a cluster using Spark. Among other things, I want to get a good approximation of the average size reduction achieved by serialization and compression. I am running in client deploy mode with the local master, and I tried both the 1.6 and 2.2 versions of the Spark shell.
I want to do that by calculating the in-memory size and then the size on disk, so the ratio should be my answer. I obviously have no problem getting the on-disk size, but I am really struggling with the in-memory one.
Since my RDD is made of doubles and they occupy 8 bytes each in memory, I tried counting the number of elements in the RDD and multiplying by 8, but that leaves out a lot of things.
The second approach was using SizeEstimator (https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$), but it is giving me crazy results! In Spark 1.6 it is either 30, 130 or 230 randomly (47 MB on disk); in Spark 2.2 it starts at 30 and every time I execute it, it increases by 0 or by 1. I know it says it's not super accurate, but I can't even find a bit of consistency! I even tried setting the persistence level to memory only
rdd.persist(StorageLevel.MEMORY_ONLY)
but still, nothing changed.
Is there any other way I can get the in-memory size of the RDD? Or should I try another approach? I am writing to disk with rdd.saveAsTextFile, and generating the RDD via RandomRDDs.uniformRDD.
EDIT
sample code:
write
val rdd = RandomRDDs.uniformRDD(sc, nBlocks, nThreads) // nBlocks uniform doubles spread over nThreads partitions
rdd.persist(StorageLevel.MEMORY_ONLY_SER)              // cache serialized in memory
println("RDD count: " + rdd.count)                     // action to force the RDD to be materialized and cached
rdd.saveAsObjectFile("file:///path/to/folder")
read
val rdd = sc.wholeTextFiles(name,nThreads)
rdd.count() //action so I'm sure the file is actually read
Web UI
Try caching the RDD as you mentioned and check the Storage tab of the Spark web UI.
By default an RDD is stored deserialized in memory. If you want it serialized, explicitly use persist with the MEMORY_ONLY_SER option; the memory consumption will be lower. On disk, data is always stored in serialized form.
Check the Spark UI once.
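If you want those numbers programmatically rather than by eyeballing the Storage tab, the Spark monitoring REST API exposes the same storage information. A small sketch, assuming the driver UI is reachable on the default port 4040 and the requests library is available:
import requests

app_id = sc.applicationId                       # the application id shown in the UI
base = "http://localhost:4040/api/v1"           # assumption: default driver UI host and port
rdds = requests.get("{}/applications/{}/storage/rdd".format(base, app_id)).json()
for info in rdds:                               # one entry per cached RDD
    print(info["name"], info["storageLevel"], info["memoryUsed"], info["diskUsed"])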
I wasn't really sure what to title this question -- happy to take a suggestion for a better summary.
I'm beating my head against the wall trying to figure out why a dead-simple Spark job works fine from Jupyter, but from the command line is left with too few executors to make progress.
What I'm trying to do: I have a large amount of data (<1TB) from which I need to extract a small amount of data (~1GB) and save as parquet.
The problem I have: when my dead-simple code is run from the command line, I only get as many executors as I have final partitions, which is ideally one given that the output is small. The exact same code works just fine in Jupyter, on the same cluster, where it farms out >10k tasks across my entire cluster. The command-line version never progresses. Since it doesn't produce any logs beyond reporting the lack of progress, I'm not sure where else to dig.
I have tried both python3 mycode.py and spark-submit mycode.py with lots of variations to no avail. My cluster has dynamicAllocation configured.
import findspark
findspark.init('/usr/lib/spark/')
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.read.parquet(<datapath>).select(<fields>)
subset = [<list of items>]
spark.sparkContext.broadcast(subset)
data.filter(field.isin(subset)).coalesce(1).write.parquet("output")
** edit: original version mistakenly had repartition(1) instead of coalesce.
In this case, run from the command line, my process will get one executor.
In my logs, the only real hint I get is
WARN TaskSetManager: Stage 1 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
which makes sense given the lack of resources being allocated.
I have tried manually forcing the number of executors using spark-submit runtime settings. In that case it starts with my initial setting and then immediately brings the executor count down until there is only one, and nothing progresses.
Any ideas? Thanks.
I ended up phoning a friend on this one...
The code that was running fine in JupyterHub, but not via the command line, was essentially:
read parquet,
filter on some small field,
coalesce(1)
write parquet
I had assumed that coalesce(1) and repartition(1) should have the same results -- even though coalesce(N) and repartition(N) do not -- given that they all go to one partition.
According to my friend, Spark sometimes optimizes coalesce(1) down to a single task, pulling the upstream read and filter into that one task as well, which was the behavior I saw. Changing it to repartition(1) forces a shuffle, so the upstream stages stay parallel and everything works fine.
I still have no idea why it works fine in JupyterHub -- having done >20 experiments -- and never on the command line -- also >20 experiments.
But, if you want to take your data lake to a data puddle this way, use repartition(1) or repartition(n), where n is small, instead of coalesce.
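A rough PySpark sketch of the difference (the paths, the column name field, and the subset values are all hypothetical):
from pyspark.sql import functions as F

data = spark.read.parquet("hdfs:///datalake/big_table")      # hypothetical input
subset = ["a", "b", "c"]                                     # hypothetical small list
filtered = data.filter(F.col("field").isin(subset))          # hypothetical column name

# coalesce(1) may be pushed upstream, collapsing the read and filter into a single task:
# filtered.coalesce(1).write.parquet("hdfs:///output/puddle")

# repartition(1) adds a shuffle, so the read and filter stay parallel:
filtered.repartition(1).write.parquet("hdfs:///output/puddle")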
I'm getting java.lang.OutOfMemoryError with my Spark job, even though only 20% of the total memory is in use.
I've tried several configurations:
1x n1-highmem-16 + 2x n1-highmem-8
3x n1-highmem-8
My dataset consists of 1.8M records, read from a local JSON file on the master node. The entire dataset in JSON format is 7 GB. The job I'm trying to execute involves a simple computation followed by a reduceByKey. Nothing extraordinary. The job runs fine on my single home computer with only 32 GB of RAM (Xmx28g), although it requires some caching to disk.
The job is submitted through spark-submit, locally on the server (SSH).
Stack trace and Spark config can be viewed here: https://pastee.org/sgda
The code
val rdd = sc.parallelize(Json.load()) // load everything
.map(fooTransform) // apply some trivial transformation
.flatMap(_.bar.toSeq) // flatten results
.map(c => (c, 1)) // count
.reduceByKey(_ + _)
.sortBy(_._2)
log.v(rdd.collect.map(_.toString).mkString("\n"))
The root of the problem is that you should try to offload more I/O to the distributed tasks instead of shipping it back and forth between the driver program and the worker tasks. While it may not be obvious at times which calls are driver-local and which ones describe a distributed action, rules of thumb include avoiding parallelize and collect unless you absolutely need all of the data in one place. The amount of data you can Json.load() and then parallelize will max out at whatever the largest available machine type allows, whereas calls like sc.textFile theoretically scale to hundreds of TBs or even PBs without a problem.
The short-term fix in your case would be to try passing spark-submit --conf spark.driver.memory=40g ... or something in that range. Dataproc defaults allocate less than a quarter of the machine to driver memory because commonly the cluster must support running multiple concurrent jobs, and also needs to leave enough memory on the master node for the HDFS namenode and the YARN resource manager.
Longer term, you might want to experiment with loading the JSON data as an RDD directly instead of loading it on a single driver and using parallelize to distribute it. That way tasks load the data in parallel, which can dramatically speed up input reading (and also gets rid of the warning Stage 0 contains a task of very large size, which is likely related to shipping large data from your driver to the worker tasks).
Similarly, instead of collect and then finishing things up on the driver program, you can do things like rdd.saveAsTextFile to save in a distributed manner, without ever bottlenecking through a single place.
Reading the input with sc.textFile would assume line-separated JSON, which you can parse inside a map task, or you can try using sqlContext.read.json. For debugging purposes, it's often enough to call take(10) instead of collect() to peek at a few records without shipping all of them to the driver.
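Putting that together, a rough PySpark sketch of the distributed version (the input path, the line-delimited JSON layout, and the 'bar' field are assumptions based on the original snippet; the original job is in Scala, so this only shows the shape of the pipeline):
import json

def foo_transform(rec):
    return rec                                          # placeholder for the original trivial transformation

counts = (sc.textFile("hdfs:///data/records.jsonl")     # hypothetical path; executors read the input in parallel
            .map(json.loads)                            # parse each JSON line on the executors
            .map(foo_transform)
            .flatMap(lambda rec: rec.get("bar", []))    # flatten the 'bar' field, mirroring the original job
            .map(lambda c: (c, 1))
            .reduceByKey(lambda a, b: a + b)
            .sortBy(lambda kv: kv[1]))

counts.saveAsTextFile("hdfs:///output/bar_counts")      # distributed write instead of collect on the driver
print(counts.take(10))                                  # peek at a few records for debugging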