Can we create a Hadoop Cluster on Dataproc with 0%-2% of HDFS?

Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need the scratch on HDFS which explains the need for minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with removing the DataNode daemon from Dataproc's worker nodes, thereby using the primary workers as compute-only nodes? The rationale is that Dataproc currently doesn't allow a mix of preemptible and non-preemptible secondary workers. So I want to check whether we can repurpose primary workers as compute-only non-PVM nodes while the secondary workers are compute-only PVM nodes.

I am starting a GCP project and am well versed in Azure, less so in AWS, though I know enough there from having done a DDD setup.
What you describe is similar to an AWS setup, and I looked recently here:
My impression is that you can run without HDFS here as well, i.e. 0%. The key point is that performance with a suite of jobs will benefit, as it does on AWS and Azure, from writing to and reading from ephemeral HDFS, since it is faster than Google Cloud Storage. I cannot see stability issues; I can use Spark now without HDFS if I really want.
On the 2nd question, stick to what they have engineered. Why try and force things? On AWS we lived with the limitations on scaling down with Spark.
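For the first question, note that dfs.datanode.du.reserved is specified in bytes per volume, not as a percentage, so the "95%" has to be converted. A minimal sketch of that arithmetic follows; the disk sizes are hypothetical, and the `hdfs:` prefix is the one Dataproc uses for hdfs-site.xml cluster properties.

```python
GIB = 1024 ** 3


def reserved_bytes(disk_size_gib: int, reserved_percent: int) -> int:
    """Bytes to reserve for non-HDFS use on one data volume."""
    if not 0 <= reserved_percent <= 100:
        raise ValueError("reserved_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return disk_size_gib * GIB * reserved_percent // 100


def dataproc_property(disk_size_gib: int, reserved_percent: int) -> str:
    """Render the value as a Dataproc cluster property string."""
    value = reserved_bytes(disk_size_gib, reserved_percent)
    return f"hdfs:dfs.datanode.du.reserved={value}"


# Hypothetical 500 GiB worker disk with 95% kept away from HDFS.
print(dataproc_property(500, 95))
```

The resulting string could be passed to the `--properties` flag when creating the cluster.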


In which scenario should one prefer to create Spark cluster on EC2 machines instead of using Elastic Map Reduce?

Between processing realtime data using Spark cluster on EC2 machines and using Elastic map reduce, some of the differences are:
In Elastic Map Reduce, one would not have to manage the infrastructure and cluster as compared to Spark cluster on EC2 machines where one has to create the cluster and manage it.
In case of Spark cluster on EC2, one has more control over the cluster as compared to Elastic Map Reduce which is a PAAS component.
I went through the below related link:
Hadoop on EC2 vs Elastic Map Reduce
I understand that going with Elastic Map Reduce gives the advantage of not having to manage the infrastructure and cluster. What I want to know is when one should prefer the other option, that is, creating a Spark cluster on EC2 machines instead of using Elastic Map Reduce. Thanks.
You and the answer you shared have pretty much summed up the advantages and disadvantages of both, but I would like to mention a few things.
Someone mentioned in a comment on the answer you shared (and people do in fact have this impression) that EMR adds some cost on top of the EC2 nodes (the underlying master/compute nodes of Spark) and provides just the cluster in return, which isn't the case.
What EMR focuses on is the elasticity and scalability part, meaning scalability for your jobs, where scalability is not just the number of nodes in the cluster but also things like:
Dynamically resizing the cluster while jobs are running.
Reduced and optimized spin-up time, efficient resubmission of steps, and options such as automatic termination on step completion.
Configuration, management and update time. As a small example, release versions automatically handle the Spark/Hadoop/other application versions, giving you an easy way to update versions, which you would have to do manually on EC2.
Ecosystem availability. The EMR ecosystem is growing. This may not matter when you start, but when your requirements grow (for example, when you start to integrate other systems, such as stream processing with Flink), it is much easier to just select Flink, Pig, Hive and many more at launch time.
There are already libraries in the AWS SDKs, like boto3 in Python, that help you submit steps, poll for completion, etc., which are very helpful when you need to scale. EMR also integrates with orchestration frameworks like Airflow, where you can sense the state, resubmit, and spin up the cluster with one command within the pipeline.
Expanding on the previous point, EMR Notebooks, for example, give you a quick and interactive way to submit Spark jobs from a Jupyter notebook and see the results and job progress immediately, which can boost your productivity.
This point is the most important in my experience: sometimes scaling up a job with more nodes saves you more money than a long-running job on a small number of nodes, because the cost of the added nodes can be lower than the normalized hours you would otherwise spend on EC2 or a small EMR cluster. To share my experience, we had a job that used to run for 3 days; we started running it on a bigger EMR cluster, which reduced it to 6-8 hours, and it still cost the same, in fact a bit less.
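The last point comes down to node-hours. A rough sketch of the arithmetic, using made-up node counts and a made-up hourly rate (none of these figures come from the job described above):

```python
def cluster_cost(nodes: int, hours: float, hourly_rate_per_node: float) -> float:
    """Cost of a cluster billed per node-hour."""
    return nodes * hours * hourly_rate_per_node


# Hypothetical numbers: same per-node rate, very different cluster sizes.
RATE = 0.50  # assumed price per node-hour

small = cluster_cost(nodes=4, hours=72, hourly_rate_per_node=RATE)   # 3 days
large = cluster_cost(nodes=40, hours=7, hourly_rate_per_node=RATE)   # ~7 hours

print(f"small cluster: {small:.2f}, large cluster: {large:.2f}")
```

With these assumed figures the larger cluster uses 280 node-hours against 288 for the small one, so it finishes roughly ten times sooner for slightly less money. Real jobs rarely scale this linearly, so the comparison is worth measuring rather than assuming.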

While creating Azure HDInsight cluster for Starburst Presto, can I create Spark Cluster?

While creating infrastructure for big data, I wanted to use Azure HDInsight with a Presto installation. Azure HDInsight comes in different flavors, such as Hadoop, Spark, etc. The documentation recommends using a Hadoop cluster, but I want to use the Spark one.
Is it possible to use spark cluster with Starburst's Presto distribution?
It looks like you want to use both Presto and Spark at the same time.
If you run them on a single cluster, you would need to configure them appropriately to make sure the JVMs for different processes can co-exist. This is possible, but hard to do in practice (you need to know how JVM allocates memory beyond -Xmx setting), so it's definitely not recommended.
I can imagine that in some on-premises installations, where provisioning new hardware is hard, you might want to colocate services on one cluster. In the cloud, however, it's much more convenient to provision two separate clusters, each appropriately sized for your particular needs and workload. For example, you could have one cluster with Presto for interactive analytics, dashboarding and ad-hoc queries, and another one with Spark for your machine learning or ETL workloads.
Please refer to the Starburst Presto on Azure documentation for detailed configuration instructions.

What should be the Hadoop configurations to be used for 100 GB of CSV files for analysis in Spark

I have around 100 GB of data in CSV format on which I intend to do some transformations, like aggregation and data splitting, and after that do some clustering using the ML package of Apache Spark.
I have tried uploading the data to MySQL and automating the process in Python, but it's taking too much time to build any solution.
What configuration do I need to set up, and how should I get started with Spark?
I am new to Spark. I am planning to use cloud services.
I'm going to recommend you learn to use Spark locally with a small subset of the data; you can run it standalone with a few tens of MB, moving up to hundreds. It's limited, but you can learn the tooling without paying. Your first Spark dataframe query could be sampling the source data and saving it into a more efficient query format.
CSV isn't a great format for big data; Spark likes Parquet (and, for 2.3+, ORC). Embrace them for better perf.
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts. Spark is very interactive & working with small datasets is an easy way to learn fast.
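As a sketch of that first query (sample the source CSV, save it as Parquet), something like the following could work. The function name, paths and sample fraction are placeholders, and it assumes Spark 2.3+ and a CSV with a header row.

```python
def sample_to_parquet(spark, src_csv, dst_parquet, fraction=0.01, seed=42):
    """Read a CSV, keep a random sample of rows, and write it as Parquet.

    `spark` is an existing SparkSession; all paths are placeholders.
    """
    # inferSchema costs an extra pass over the data, fine for learning.
    df = spark.read.csv(src_csv, header=True, inferSchema=True)
    sample = df.sample(fraction=fraction, seed=seed)
    sample.write.mode("overwrite").parquet(dst_parquet)
    return sample
```

Locally, the session could be created with `SparkSession.builder.master("local[*]").getOrCreate()` and the function called on a small extract of the data.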
There are many ways to do that, but it depends on your case. As far as I know, HDFS with the default configuration (without any specific tuning) works fine. The majority of Hadoop tuning guides focus on the YARN side. So, let me make a plan like below:
Generally speaking, you can put your (raw) data in HDFS and load them in Apache Spark and save them in Parquet/ORC like below:
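from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.getOrCreate()
myschema = StructType([StructField("FirstName", StringType(), True), StructField("LastName", StringType(), True)])
mydf = spark.read.format("com.databricks.spark.csv").option("header", "true").schema(myschema).option("delimiter", ",").load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv")
# Write the data out as Parquet, then read it back from the Parquet copy
mydf.write.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
Finally, compare mydf.count() with newdf.count(). The count on the Parquet copy will run faster than the one on the raw CSV. In addition, your data size will decrease from 100GB to ~24GB.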
If you are new to Hadoop and Spark and interested in setting up a Hadoop environment in the cloud, I would suggest going with Elastic Map Reduce (EMR) from AWS. You can create an on-demand Spark cluster with a user-defined configuration to process a wide range of data sets.
You can also set up a Hadoop cluster on top of EC2 instances, or on any cloud platform, with the required number of nodes and sufficient RAM and CPU. Storage-optimized instances are preferred here for analyzing a large data set.
We do not need to worry about storage cost: storage-optimized instances come with ephemeral (instance store) data disks at no extra charge, 1 - 2TB depending on instance size.
Note: Data in ephemeral storage is lost when the instance is stopped or terminated (it survives a plain reboot). We can persist the processed data in S3 at low cost.
When it comes to cluster configuration, here is the list of things to check:
Spark on YARN is preferred
Set the minimum and maximum cores and memory in the YARN NodeManager container settings for your Spark executors.
Enable dynamic allocation in Spark (spark.dynamicAllocation.enabled).
Set the container size and the Spark memory fraction as high as practical to avoid repeated shuffling, frequent spilling and eviction of cached data.
Use kryo serialization to get high performance.
Enable compression for map outputs before shuffling.
Enable spark web UI to track your application tasks and its stages.
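The checklist above maps onto Spark configuration properties roughly as follows. This is only a sketch: the property names are standard Spark settings, but every value is a placeholder to be sized for the actual cluster, and the helper name is made up for illustration.

```python
# Placeholder values; tune them to the node size and workload.
SPARK_CONF = {
    "spark.executor.memory": "8g",             # assumed executor size
    "spark.executor.cores": "4",               # assumed cores per executor
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",   # needed by dynamic allocation on YARN
    "spark.memory.fraction": "0.6",            # Spark's default, raise with care
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.shuffle.compress": "true",          # compress map outputs
    "spark.ui.enabled": "true",
}


def to_submit_args(conf):
    """Turn a config dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(["spark-submit", "--master", "yarn"] + to_submit_args(SPARK_CONF)))
```

The YARN container limits themselves (for example yarn.nodemanager.resource.memory-mb) live in yarn-site.xml rather than in the Spark configuration.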
Apache Spark Config Reference:

Auto scale Spark cluster

I have a Spark streaming job running on a cluster. The Spark job pulls messages from Kafka and does the required processing before dumping the processed data to a database. I have sized my cluster for the current load, but this load may go up or down in the future. I want to know the techniques that enable this auto scaling without restarting the job. Scaling becomes more complicated if Kafka is being used (as in my case), as I would not like the partitions to be moved around in stateful streaming. Currently the cluster is completely in house, but I won't mind migrating to the cloud if that helps the scaling use case.
This is not an answer, just some notes.
"in stateful streaming". What did you mean by that? All state in Spark is distributed, and you should not rely on the local system: if a task fails, it can be sent to any other executor.
Are you talking about increasing the size of the cluster, or the resources dedicated to your Spark job within the cluster?
If the first one, you need to monitor each node (memory, CPU) and, when it's time (some threshold is hit), add more nodes.
If the second one: we didn't find a nice solution. Spark provides an 'autoscaling' feature (dynamic allocation); however, it doesn't work properly with Kafka streaming.
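For the first case, the threshold check could look something like the sketch below. The metric source, the thresholds and the step size are all assumptions; how the nodes are actually added depends on the cluster manager and is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStats:
    cpu: float  # utilization between 0.0 and 1.0
    mem: float  # utilization between 0.0 and 1.0


def nodes_to_add(stats: List[NodeStats],
                 cpu_threshold: float = 0.8,
                 mem_threshold: float = 0.8,
                 step: int = 1) -> int:
    """Return how many nodes to add, based on average utilization."""
    if not stats:
        return 0
    avg_cpu = sum(s.cpu for s in stats) / len(stats)
    avg_mem = sum(s.mem for s in stats) / len(stats)
    # Scale out as soon as either resource is above its threshold.
    if avg_cpu > cpu_threshold or avg_mem > mem_threshold:
        return step
    return 0


# Hypothetical readings from a two-node cluster under CPU pressure.
print(nodes_to_add([NodeStats(0.90, 0.50), NodeStats(0.95, 0.40)]))
```

A real setup would usually also require the threshold to be exceeded for some time before acting, to avoid adding nodes on short spikes.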

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools like NiFi, Gobblin, etc., I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.
We have a Spark (Scala) based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs later.
Now that we are planning to make our application available to the client, we are expecting any type and number of files [mainly CSV, JSON, XML, etc.] from any data source [FTP, SFTP, any relational and NoSQL database] of huge size [ranging from GB to PB].
Keeping this in mind, we are looking for options that could be used for data on-boarding and data sanity checks before pushing data into HDFS.
Options which we are looking for based on priority:
1) Spark for data ingestion and sanity: As our application is written for and running on a Spark cluster, we are planning to use the same for the data ingestion and sanity tasks as well.
We are a bit worried about Spark's support for many data sources, file types, etc. Also, if we try to copy data from, say, an FTP/SFTP server, will all workers write data to HDFS in parallel? Is there any limitation while using it? Is there any audit trail maintained by Spark during this data copy?
2) NiFi in clustered mode: How good would NiFi be for this purpose? Can it be used for any data source and for any size of file? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GB - PB of data and perform certain sanity checks on that data before pushing it to HDFS?
3) Gobblin in clustered mode: We would like to hear answers to the same questions as for NiFi.
4) Is there any other good option available for this purpose with lower infra/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
After doing some R&D, and considering that using NiFi or Gobblin would demand more infrastructure cost, I have started testing Spark for data on-boarding.
So far I have tried using a Spark job to import data [present at a remote staging area/node] into my HDFS, and I am able to do that by mounting that remote location on all my Spark cluster worker nodes. Doing this made that location local to those workers, hence the Spark job ran properly and the data was on-boarded to my HDFS.
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra. So far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost [where cost could be a major constraint], go for a Spark job for data on-boarding.
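A minimal sketch of such an on-boarding job is below. It assumes the staging location is mounted at the same path on every worker, as described above; the function name, paths, formats and the single sanity rule (dropping rows that are entirely null) are illustrative only, not what was actually run.

```python
def onboard(spark, src_path, dst_path, fmt="csv", **read_options):
    """Copy files from a mounted staging path into HDFS as Parquet.

    `spark` is an existing SparkSession. `src_path` must be readable at
    the same location on every worker (for example a file:// mount).
    """
    df = spark.read.format(fmt).options(**read_options).load(src_path)
    # Very basic sanity step: discard rows where every column is null.
    clean = df.na.drop(how="all")
    clean.write.mode("append").parquet(dst_path)
    return clean
```

A call might look like `onboard(spark, "file:///mnt/staging/in", "hdfs:///data/onboarded", fmt="csv", header="true")`, with both paths being placeholders.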
