Writing a Spark DataFrame in Azure Databricks

I am new to Azure Databricks. I have two input files and a Python AI model. I clean the input files and apply the AI model to them to get final probabilities. Reading the files, loading the model, cleaning the data, preprocessing it, and displaying the output with probabilities takes me only a few minutes.
But when I try to write the result to a table or a Parquet file, it takes more than 4-5 hours. I have tried various combinations of repartition/partitionBy/saveAsTable, but none of them is fast enough.
My output Spark DataFrame consists of three columns and 120,000,000 rows. My shared cluster has 9 nodes with 56 GB of memory each.
My questions are:
1.) Is slow writing expected behavior in Azure Databricks?
2.) Is it true that we can't tune Spark configurations in Azure Databricks, and that Azure Databricks tunes itself based on the available memory?

The performance depends on multiple factors. To investigate further, could you please share the details below:
What is the size of the data?
What is the worker type and size?
Can you share the code you are running?
I would suggest you go through the articles below, which help to improve performance:
Optimize performance with caching
7 Tips to Debug Apache Spark Code Faster with Databricks
Azure Databricks Performance Notes

I have used Azure Databricks to write data to Azure storage, and it has been fast.
Also, Databricks on Azure works the same way it does on AWS, so all Spark configurations can be set.
As Pradeep asked, what are the data size and the number of partitions? You can get the partition count with df.rdd.getNumPartitions().
Have you tried a repartition before the write (see the sketch below)? Thanks.
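A minimal sketch of that, assuming your results DataFrame is called result_df and writing to a placeholder Parquet path on Azure storage; the partition count is something to tune, not a recommendation:

num_partitions = 200  # placeholder: aim for output files of roughly 100-200 MB each

(result_df
    .repartition(num_partitions)
    .write
    .mode("overwrite")
    .parquet("wasbs://container@account.blob.core.windows.net/output/probabilities"))  # placeholder path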

Related

How can Spark speed up bulk loading to JanusGraph?

I need to load lots of vertices and edges into JanusGraph with a Cassandra backend from other storage. I've read about bulk loading and configuring Spark (https://docs.janusgraph.org/advanced-topics/bulk-loading/ and https://docs.janusgraph.org/advanced-topics/hadoop/).
It's clear how to configure JanusGraph for Spark usage, but I'm still not sure how to use Spark afterwards, or whether Spark can actually help speed up inserting into the graph.
Please give some use cases and code examples of using Hadoop MapReduce or Spark to speed up bulk loading data into JanusGraph (Java or Python preferred). Any help is welcome!
I worked on a POC project recently to bulk load data into JanusGraph using Apache Spark. We were getting pretty good performance loading data into JanusGraph with Spark. Setup and sample code are provided in the articles below.
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-ace7d146af05
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-part-2-ca946db26582
Alternatively, you can write a Kafka consumer application to load data from Kafka into JanusGraph. The amount of parallelism will be restricted by the number of partitions of the source/input topic from which your application is reading. Each instance of the application will be single-threaded, but you can scale out to as many instances as there are input partitions. Each instance can open a connection and write to JanusGraph using a transaction, and you can batch writes with some batch size to spread the load. A sketch of this approach follows.
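As a rough sketch of that consumer (hosts, topic name, and message schema are all assumptions, not the author's code): it supposes a hypothetical topic graph-events whose messages are JSON vertex records with label and name fields, uses kafka-python for the consumer and gremlinpython for the JanusGraph connection, and flushes writes in batches; error handling and the final partial batch are omitted for brevity.

import json

from kafka import KafkaConsumer  # pip install kafka-python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

BATCH_SIZE = 100  # batch writes to spread the load on JanusGraph

# Remote traversal source pointing at the JanusGraph server's Gremlin endpoint.
connection = DriverRemoteConnection("ws://janusgraph-host:8182/gremlin", "g")  # placeholder host
g = traversal().withRemote(connection)

consumer = KafkaConsumer(
    "graph-events",                       # hypothetical input topic
    bootstrap_servers="kafka-host:9092",  # placeholder broker address
    group_id="janusgraph-loader",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= BATCH_SIZE:
        # Chain the whole batch into a single traversal and execute it once.
        t = g
        for record in batch:
            t = t.addV(record["label"]).property("name", record["name"])
        t.iterate()
        batch = []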

What Hadoop configuration should be used for 100 GB of CSV files for analysis in Spark

I have around 100 GB of data in CSV format on which I intend to do some transformations like aggregation and data splitting, and after that do some clustering using the ML package of Apache Spark.
I have tried uploading the data to MySQL and automating the process in Python, but it's taking too much time to build any solution.
What configuration do I need to set up, and how should I start with Spark?
I am new to Spark. I am planning to use cloud services.
I'm going to recommend you learn to use Spark locally with a small subset of the data; you can run it standalone with a few tens to hundreds of MB. It's limited, but you can learn the tooling without paying. Your first Spark DataFrame query could be sampling the source data and saving it into a more efficient query format, as in the sketch below.
CSV isn't a great format for big data; Spark likes Parquet and, for 2.3+, ORC. Embrace them for better performance.
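A hedged sketch of that first query, assuming a local CSV at a placeholder path; it samples roughly 1% of the rows and saves them as Parquet for faster follow-up experiments:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-and-convert").getOrCreate()

# Read the raw CSV, letting Spark infer the schema for this small experiment.
raw = spark.read.option("header", "true").option("inferSchema", "true").csv("data/big.csv")

# Take a ~1% sample and persist it in a more efficient columnar format.
sample = raw.sample(fraction=0.01, seed=42)
sample.write.mode("overwrite").parquet("data/big_sample.parquet")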
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts of data. Spark is very interactive, and working with small datasets is an easy way to learn fast.
There are many ways to do that, but it depends on your case. As far as I know, HDFS with the default configuration (without any specific tuning) works fine; the majority of Hadoop tuning guides focus on the YARN side. So let me lay out a plan like the one below.
Generally speaking, you can put your raw data in HDFS, load it into Apache Spark, and save it in Parquet/ORC like this:
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema up front so Spark does not have to infer it.
myschema = StructType([StructField("FirstName", StringType(), True),
                       StructField("LastName", StringType(), True)])
# On Spark 2+, the built-in csv source replaces com.databricks.spark.csv.
mydf = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(myschema).load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv")
mydf.count()
# Rewrite the data as Parquet, split into 6 partitions.
mydf.repartition(6).write.format("parquet").save("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
# Re-read the Parquet copy and count again to compare speed.
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf.count()
Finally, compare mydf.count() with newdf.count(): the count over Parquet will run faster than over the raw format. In addition, your data size will decrease from 100 GB to roughly 24 GB.
If you are new to Hadoop and Spark and are interested in setting up a Hadoop environment in the cloud, I would suggest you go with Elastic MapReduce (EMR) from AWS. You can create an on-demand Spark cluster with a user-defined configuration to process a wide range of data sets.
https://aws.amazon.com/emr/
https://aws.amazon.com/emr/details/spark/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html
Or
You can also set up a Hadoop cluster on top of EC2 instances, or on any cloud platform, with the required number of nodes and sufficient RAM and CPU. Storage-optimized instances are preferred here for analyzing a large data set.
We do not need to worry much about storage cost: for storage-optimized instances, AWS offers free ephemeral storage data disks of 1-2 TB, depending on instance size.
Note: data in the ephemeral storage will be lost when the VM is rebooted. We can persist the processed data in S3 at low cost.
When it comes to cluster configuration, here is the list of things to check (a sample configuration sketch follows the list):
Spark on YARN is preferred.
Set minimum and maximum cores and memory in the YARN NodeManager container settings for your Spark executors.
Enable dynamic resource allocation in Spark.
Set the container size to the maximum and the Spark memory fraction to the maximum to avoid repeated shuffling, frequent spilling, and cached-data eviction.
Use Kryo serialization for higher performance.
Enable compression for map outputs before shuffling.
Use the Spark web UI to track your application's tasks and their stages.
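As a sketch of how several of the items above map onto concrete Spark settings when building a session in PySpark; the memory and core values are placeholders to match your own YARN container limits, not recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-analysis")
    # Dynamic allocation (requires the external shuffle service on YARN).
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    # Kryo serialization for better performance.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Compress map outputs before shuffling.
    .config("spark.shuffle.compress", "true")
    # Executor sizing: placeholder values.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)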
Apache Spark Config Reference: https://spark.apache.org/docs/2.1.0/configuration.html

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools like [Nifi, Gobblin etc.], I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.
We have a Spark [Scala] based application running on YARN. So far we have worked on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs.
Now that we are planning to make our application available to clients, we expect any type and number of files [mainly CSV, JSON, XML etc.] from any data source [FTP, SFTP, any relational or NoSQL database] of huge size [ranging from GB to PB].
Keeping this in mind, we are looking for options that could be used for data on-boarding and data sanity checks before pushing data into HDFS.
Options we are looking at, in order of priority:
1) Spark for data ingestion and sanity checks: As our application is written for and runs on a Spark cluster, we are planning to use the same cluster for the data ingestion and sanity tasks as well.
We are a bit worried about Spark's support for the many data sources/file types/etc. Also, if we try to copy data from, let's say, an FTP/SFTP server, will all workers write data to HDFS in parallel? Are there any limitations in using it this way? Does Spark maintain any audit trail of such a data copy?
2) Nifi in clustered mode: How good would Nifi be for this purpose? Can it be used for any data source and for any size of file? Will it maintain an audit trail? Would Nifi be able to handle such large files? How large a cluster would be required if we try to copy GB-PB of data and perform certain sanity checks on it before pushing it to HDFS?
3) Gobblin in clustered mode: I would like answers to the same questions as for Nifi.
4) Is there any other good option available for this purpose with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing some R&D, and considering that using Nifi or Gobblin would demand more infrastructure cost, I started testing Spark for data on-boarding.
So far I have tried using a Spark job to import data [present at a remote staging area/node] into my HDFS, and I was able to do it by mounting that remote location on all of my Spark cluster's worker nodes. Doing this made the location look local to those workers, so the Spark job ran properly and the data was on-boarded to my HDFS.
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra. So far it is going well, so I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost [where cost is a major constraint], use a Spark job for data on-boarding. A sketch of such a job is below.
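As a minimal sketch of such a job (all paths are placeholders; it assumes the remote staging area is mounted at /mnt/staging on every worker node):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("onboarding").getOrCreate()

# The mount point looks like a local path to each worker, so a file:// URI works.
df = spark.read.option("header", "true").csv("file:///mnt/staging/incoming/*.csv")

# Run any sanity checks here, then land the data in HDFS in a columnar format.
df.write.mode("append").parquet("hdfs://namenode:8020/data/onboarded")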

How data distribution is achieved in Azure HDInsight during processing

One of the selling points of Hadoop is that the data sits with the compute. How does that work with WASB?
When a MapReduce job is processed, the map and reduce tasks are executed where the blocks of data reside; this is how data locality is achieved.
But in the case of HDInsight, the data is stored in WASB. So when the MapReduce job executes, is the data copied from WASB to each of the compute nodes and then processed? If so, then the single channel for copying data to the compute nodes would be a bottleneck.
Can anyone explain to me how data is stored in WASB and how the data is handled during processing?
Just like with any Hadoop system, the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from Azure storage accounts instead of from local disks. Given the way the Azure data center backbones are built, the performance is generally the same as with disks locally attached to the VMs.
HDInsight clusters can be located in any of Azure's regions, but a cluster can only read from storage accounts in the same region, to avoid high latency. Azure has done a lot of work on its data centers so that the performance is comparable.
If you want to learn more, Ashish's quote comes from this article:
https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/

How to share data from Spark RDD between two applications

What is the best way to share Spark RDD data between two Spark jobs?
I have a case where job 1, a Spark sliding-window streaming app, consumes data at regular intervals and creates an RDD. We do not want to persist this to storage.
Job 2 is a query job that accesses the same RDD created in job 1 and generates reports.
I have seen a few answers suggesting Spark Job Server, but as it is an open-source project, I am not sure whether it is a viable solution; any pointers will be of great help.
Thank you!
The short answer is that you can't share RDDs between jobs. The only way to share the data is to write it to HDFS and then pull it in from the other job; a sketch of this hand-off is below. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which allows very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
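A minimal sketch of that HDFS hand-off (paths and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report-job").getOrCreate()

# Job 1 (the streaming app) would persist each window's results, e.g.:
#   window_rdd.toDF(["key", "value"]).write.mode("append") \
#       .parquet("hdfs://namenode:8020/shared/windows")

# Job 2 (this job) reads whatever job 1 has written so far and builds reports.
shared_df = spark.read.parquet("hdfs://namenode:8020/shared/windows")
shared_df.groupBy("key").count().show()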
You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction for sharing RDDs, through which applications can access RDDs created by other applications. In addition, Ignite has support for SQL indexes, whereas native Spark doesn't.
Please refer https://ignite.apache.org/features/igniterdd.html for more details.
Note also what the official documentation says:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save the data to a global temporary view. The view will be available to other sessions in the same Spark application until the application that created it stops; a short sketch follows.
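A short sketch of that approach (names are placeholders); the view lives in the reserved global_temp database:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sharing-demo").getOrCreate()
df = spark.range(5)  # stand-in for the real DataFrame

# Register a global temporary view, visible to every session of this application.
df.createOrReplaceGlobalTempView("shared_results")

# Another session of the same application can query it via global_temp.
spark.newSession().sql("SELECT * FROM global_temp.shared_results").show()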
