How does Dataproc work with Google Cloud Storage? - apache-spark

I am trying to understand how Google Dataproc works with GCS. I am using PySpark on Dataproc, and data is read from and written to GCS, but I am unable to figure out the best machine types for my use case. Questions:
1) Does Spark on Dataproc copy data to local disk? e.g. if I am processing 2 TB of data, is it OK to use 4 worker nodes with 200 GB HDD each, or should I provide at least enough disk to hold the input data?
2) If the local disk is not used at all, is it OK to use high-memory, low-disk instances?
3) If the local disk is used, which instance type is good for processing 2 TB of data with the minimum possible number of nodes? I mean, is it good to use SSDs?
Thanks
Manish

Spark will read data directly into memory and/or onto disk, depending on whether you use RDDs or DataFrames. You should have at least enough disk to hold all of your data. If you are performing joins, the amount of disk necessary grows to handle shuffle spill.
This equation changes if you discard a significant amount of data through filtering.
Whether you use pd-standard, pd-ssd, or local-ssd comes down to cost and whether your application is CPU-bound or IO-bound.
Disk IOPS is proportional to disk size, so very small disks are inadvisable. Keep in mind that disk (relative to CPU) is cheap.
Same advice goes for network IO: more CPUs = more bandwidth.
Finally, the default Dataproc settings are a reasonable place to start before experimenting and tweaking.
Source: https://cloud.google.com/compute/docs/disks/performance
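To make the read/write path concrete, here is a minimal PySpark sketch of the GCS-in, GCS-out pattern discussed above. The bucket names, paths, and column are placeholders; Dataproc ships with the GCS connector, so gs:// URIs behave like any other Hadoop-compatible filesystem, and data flows into executor memory (spilling to disk as needed) rather than being staged wholesale on the boot disk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-example").getOrCreate()

# Read directly from GCS; "my-input-bucket" and the "status" column are hypothetical.
df = spark.read.parquet("gs://my-input-bucket/events/")

result = df.filter(df["status"] == "OK")

# Write the result straight back to GCS.
result.write.mode("overwrite").parquet("gs://my-output-bucket/filtered/")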

Related

How to free space used by block pool in Dataproc

I have started a Spark Streaming job which streams data from Kafka. I have assigned only 2 worker nodes with 15 GB of disk each for testing. Within 2 hours the disks are full, the status of these nodes shows as unhealthy on the YARN Resource Manager web interface, and the HDFS web interface shows the Block Pool has used 95% of disk space.
The problem is that I am not storing any data on the nodes, just reading from Kafka, processing, and storing to MongoDB.
The Dataproc base image takes at least a few GB of space, so you're left with, let's say, 10 GB per worker.
There are two main uses of disk space I can think of:
1) If you've enabled checkpointing (e.g. ssc.checkpoint(dir), see the sketch after this answer): https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing. That's probably on HDFS.
If you think HDFS is the issue, you can ssh into the master node, and run hdfs dfs -ls -R / to find which files are taking up space.
2) Ephemeral shuffle data gets written to disk between stages
This is less likely in a streaming job, but it's worth checking if HDFS isn't using much space. You can run du to find the directory taking up space, and I bet it's in nm-local-dirs: https://linuxhint.com/disk_space_directory_command_line/
All of that being said, 15GB is a really, really small disk size. PD is relatively cheap compared to compute, and I would suggest just using a larger boot disk size. If you're looking to cut down on cost, consider using e2 machine types.
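If checkpointing does turn out to be the culprit, one option (a sketch, assuming a DStream-based job and a bucket you own) is to point the checkpoint directory at GCS instead of HDFS so it stays off the small boot disks:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="kafka-streaming")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second batches, adjust as needed

# Checkpoint data lands in GCS rather than the cluster's HDFS block pool.
# "my-checkpoint-bucket" is a placeholder.
ssc.checkpoint("gs://my-checkpoint-bucket/streaming-checkpoints")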

Is the smallest unit in Apache Spark the block?

If it is the block, then what is the desired block size? What is recommended? Is there some standard size? If not, how do I know what block size I should keep?
Let's say sc = streaming context.
Is sc.awaitTermination() used in production?
Is awaiting termination the only way?
Let's say a block gets corrupted. Since there is fault tolerance, will the block be recovered by taking it from a replicated copy on some other executor?
Can different executors have different memory sizes?
If so, say there are 3 executors
ex1 = 10gb
ex2 = 10gb
ex3 = 5gb
Assume a replication factor of 2 is configured.
How will replication work in this case? If an RDD of, let's say, 8 GB needs to be replicated, won't it fail? An 8 GB RDD on ex1 cannot be replicated to executor ex3 due to its low memory, so how is fault tolerance achieved? Does Spark know where to replicate what? Does it check by size whether the data can be replicated to a particular node, and replicate only if it can? And if node ex1 itself then fails, is there no fault tolerance and everything is lost? How is this scenario handled?
Block size is the concern of the storage system, like HDFS, GFS, S3, Azure Blob Storage, etc. Spark is a processing engine, and it simply accommodates the block size of the storage system when creating partitions. It creates 1 partition per block so that large files can be processed in parallel (see the sketch below). All storage systems have a default block size, so you needn't worry about setting it, though you can certainly override it.
Similarly, data replication is also a concern of the storage layer. It copies a block of data into, say, 3 (the replication factor) blocks, and in case of failure it can recover the data from the other copies.
Spark fits into this equation as the processing engine operating on top of this distributed file system. It is responsible for distributing the compute workload to where the data is located, and it can recover from failure on a given node, which is similar to data recovery but a different beast.
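As a rough illustration of the one-partition-per-block point (a PySpark sketch with a made-up HDFS path), the partition count of a freshly read file tracks the number of storage blocks, and minPartitions only lets you ask for more splits, not fewer:

from pyspark import SparkContext

sc = SparkContext(appName="partitions-demo")

rdd = sc.textFile("hdfs:///data/large_file.txt")           # hypothetical path
print(rdd.getNumPartitions())                              # roughly file_size / block_size

rdd_finer = sc.textFile("hdfs:///data/large_file.txt", minPartitions=64)
print(rdd_finer.getNumPartitions())                        # at least 64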

pyspark java.lang.OutOfMemoryError: GC overhead limit exceeded

I'm trying to process 10 GB of data using Spark and it is giving me this error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Laptop configuration: 4 CPUs, 8 logical cores, 8 GB RAM
Spark configuration while submitting the Spark job:
spark = SparkSession.builder.master('local[6]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", 1)
spark.conf.set("spark.executor.cores", 5)
After searching the internet about this error, I have a few questions.
If answered, that would be a great help.
1) Spark is an in-memory computing engine; to process 10 GB of data, should the system have 10+ GB of RAM? Does Spark load the 10 GB of data into 10+ GB of RAM and then do the job?
2) If point 1 is correct, how are big companies processing 100s of TBs of data? Are they processing 100 TB of data by clustering multiple systems to form 100+ TB of RAM?
3) Is there no other way to process 50 GB of data with 8 GB RAM and 8 cores by setting proper Spark configurations? If there is, what is the way and what should the Spark configuration be?
4) What should the ideal Spark configuration be if the system has 8 GB RAM and 8 cores, for processing 8 GB of data?
Spark configuration to be defined in the Spark config:
spark = SparkSession.builder.master('local[?]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", ?)
spark.conf.set("spark.executor.cores", ?)
spark.executor.cores = ?
spark.executor.memory = ?
spark.yarn.executor.memoryOverhead =?
spark.driver.memory =?
spark.driver.cores =?
spark.executor.instances =?
No.of core instances =?
spark.default.parallelism =?
I hope the following will help, if not clarify everything.
1) Spark is an in-memory computing engine, for processing 10 GB of data, the system should have 10+gb of RAM. Spark loads 10gb of data into 10+ GB RAM memory and then do the job?
Spark, being an in-memory computation engine, takes its input/source from an underlying data lake or distributed storage system. The 10 GB file will be broken into smaller blocks (128 MB or 256 MB block size for a Hadoop-based data lake) and the Spark driver will get many executors/cores to read them from the cluster's worker nodes. If you try to load 10 GB of data on a laptop or on a single node, it will certainly run out of memory. All the data has to be held either on one machine or across many worker nodes before it is processed.
2) If point 1 is correct, how big companies are processing 100s of TBs of data, are they processing 100TB of data by clustering multiple systems to form 100+TB RAM and then process 100TB of data?
Large data processing projects design the storage and access layer with a lot of design patterns. They simply don't dump GBs or TBs of data into a file system like HDFS. They use partitions (e.g. sales transaction data partitioned by month/week/day), and for structured data there are different file formats available (especially columnar) which help to load only those columns which are required for processing. So the right file format, partitioning, and compaction are the key attributes for large files.
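A small sketch of the partitioning and columnar-format point above (paths and column names are made up): writing Parquet partitioned by date columns means a later job only touches the partitions and columns it actually needs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

sales = spark.read.json("hdfs:///raw/sales/")              # hypothetical source

# Partition on coarse date columns; Parquet stores data column by column.
sales.write.partitionBy("year", "month").parquet("hdfs:///warehouse/sales_parquet/")

# A reader that filters on the partition columns and selects a few columns
# reads only a small fraction of the files and bytes.
jan = (spark.read.parquet("hdfs:///warehouse/sales_parquet/")
            .where("year = 2020 AND month = 1")
            .select("order_id", "amount"))                 # hypothetical columns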
3) Is there no other way to process 50 GB of data with 8 GB RAM and 8 cores by setting proper Spark configurations? If there is, what is the way and what should the Spark configuration be?
Very unlikely if there is no partitioning, but there are ways. It also depends on what kind of file it is. You can create a custom streaming file reader that reads logical blocks and processes them. However, enterprises don't read a 50 GB file as one single unit. Even if you load a 10 GB Excel file on your machine via an Office tool, it will run out of memory.
4) What should be ideal spark configuration if the system properties are 8gb RAM and 8 Cores? for processing 8gb of data
Leave 1 core & 1-Gb or 2-GB for OS and use the rest of them for your processing. Now depends on what kind of transformation is being performed, you have to decide the memory for driver and worker processes. Your driver should have 2Gb of RAM. But laptop is primarily for the playground to explore the code syntax and not suitable for large data set. Better to build your logic with dataframe.sample() and then push the code to bigger machine to generate the output.

Will Spark load data into memory if the data is 10 GB and RAM is 1 GB

If I have a cluster of 5 nodes, each node having 1 GB of RAM, and my data file is 10 GB distributed across all 5 nodes, let's say 2 GB on each node, now if I trigger
val rdd = sc.textFile("filepath")
rdd.collect
will Spark load the data into RAM, and how will Spark deal with this scenario?
Will it straight away refuse, or will it process it?
Let's understand the question you are asking first, #intellect_dp. You have a cluster of 5 nodes (here by the term "node" I am assuming a machine, which generally includes a hard disk, RAM, a 4-core CPU, etc.); each node has 1 GB of RAM and you have a 10 GB data file which is distributed such that 2 GB of data resides on the hard disk of each node. Let's assume that you are using HDFS and that your block size on each node is 2 GB.
Now let's break this down:
each block size = 2GB
RAM size of each node = 1GB
Due to lazy evaluation in Spark, it will load your data into RAM and execute further only when an action API gets triggered.
Here you are using "collect" as the action API. The problem now is that the RAM size is less than your block size, so if you process it with all of Spark's default configuration (1 block = 1 partition), and considering that no further nodes are going to be added, it will give you an out-of-memory exception.
Now the question: is there any way Spark can handle this kind of large data with the given hardware provisioning?
Answer: yes. First you need to set the default minimum number of partitions:
val rdd = sc.textFile("filepath",n)
Here n will be the default minimum number of partitions per block. As we have only 1 GB of RAM, we need to keep each partition smaller than that, so let's take n = 4.
Now, as your block size is 2 GB and the minimum number of partitions per block is 4:
each partition size will be 2GB/4 = 500 MB;
Spark will process this 500 MB first and convert it into an RDD partition; when the next chunk of 500 MB comes, the first one will get spilled to the hard disk (given that you have set the storage level to MEMORY_AND_DISK).
In this way it will process your whole 10 GB data file with the given cluster hardware configuration.
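Here is a PySpark version of the same idea (a sketch; the path and the exact numbers depend on your data and cluster). More partitions keep each chunk well under the available RAM, and the MEMORY_AND_DISK storage level is what lets Spark spill partitions to local disk instead of failing:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="small-ram-demo")

# "hdfs:///data/10gb_file.txt" is a placeholder path.
rdd = sc.textFile("hdfs:///data/10gb_file.txt", minPartitions=40)
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())   # partitions that do not fit in RAM are spilled to local disk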
Now, I personally would not recommend the given hardware provisioning for such a case.
It will definitely process the data, but there are a few disadvantages:
firstly, it will involve multiple I/O operations, making the whole process very slow;
secondly, if any lag occurs in reading from or writing to the hard disk, your whole job will get discarded, and you will get frustrated with such a hardware configuration. In addition, you will never be sure that Spark will be able to process your data and give a result as the data grows.
So try to keep I/O operations to a minimum, and
utilize the in-memory computation power of Spark, with the addition of a few more resources, for faster performance.
When you use collect, all the data is collected as an array on the driver node only.
From this point on, Spark's distribution and the other nodes don't play a part. You can think of it as a pure Java application running on a single machine.
You can set the driver's memory with spark.driver.memory and ask for 10G.
From this moment, if you do not have enough memory for the array, you will probably get an OutOfMemory exception.
On the other hand, if we do so, performance will be impacted; we will not get the speed we want.
Also, Spark stores only results in RDDs, so I can say the result would not be the complete data; in the worst case, if we are doing select * from tablename, it will give the data in chunks, whatever it can afford...
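A short sketch of the collect() caveat (paths are placeholders): either raise spark.driver.memory before the session starts, or, better, avoid pulling the whole data set to the driver at all.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.memory", "10g")   # only helps if the machine actually has the RAM
         .appName("collect-demo")
         .getOrCreate())

rdd = spark.sparkContext.textFile("hdfs:///data/big_file.txt")   # hypothetical path

preview = rdd.take(100)                       # brings back 100 records, not the whole data set
rdd.saveAsTextFile("hdfs:///data/output/")    # writes happen on the executors, nothing hits the driver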

How does Spark behave without enough memory (RAM) to create an RDD

When I do sc.textFile("abc.txt")
Spark creates RDD in RAM (memory).
So should the cluster's collective memory be greater than the size of the file “abc.txt”?
My worker nodes have disk space, so could I use disk space while reading the text file to create the RDD? If so, how do I do that?
How do I work on big data which doesn't fit into memory?
When I do sc.textFile("abc.txt") Spark creates RDD in RAM (memory).
The above point is not necessarily true. In Spark, there are things called transformations and things called actions. sc.textFile("abc.txt") is a transformation operation and it does not load the data straight away; the data is loaded only when you trigger an action, e.g. count().
To give you a collective answer to all your questions, I would urge you to understand how Spark execution works. There are things called logical and physical plans. As part of the physical plan, it does a cost calculation (a calculation of the available resources across the cluster(s)) before it starts the jobs. If you understand these, you will get a clear idea on all your questions.
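A quick illustration of the transformation-versus-action point (the path is a placeholder): textFile() and filter() by themselves read nothing; data is pulled in, partition by partition, only when an action such as count() runs.

from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")

rdd = sc.textFile("hdfs:///data/abc.txt")            # transformation: no data loaded yet
errors = rdd.filter(lambda line: "ERROR" in line)    # still nothing loaded

print(errors.count())   # action: partitions are now read, processed, and discarded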
Your first assumption is incorrect:
Spark creates RDD in RAM (memory).
Spark doesn't create RDDs "in-memory". It uses memory but it is not limited to in-memory data processing. So:
So should the cluster's collective memory be greater than the size of the file “abc.txt”?
No
My worker nodes have disk space, so could I use disk space while reading the text file to create the RDD? If so, how do I do that?
No special steps are required.
How to work on big data which doesn’t fit into memory?
See above.
