I'm fairly new to Spark (using pyspark.sql) and have two questions related to the .createDataFrame() method.
I read that the .createDataFrame() method only creates a DataFrame locally. What is the reason for doing so, rather than pushing the DataFrame out to the EMR cluster?
What significance does "storing locally" actually have? In other words, what calculations / manipulations would I NOT be able to do as a result of the dataframe not being pushed to the EMR cluster?
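To make the question concrete, here is a small sketch of what I mean (the SparkSession setup, sample values, and the storage path are purely illustrative, not from a real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("createDataFrame-question").getOrCreate()

# Data passed to .createDataFrame() comes from ordinary Python objects on the driver.
local_df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["id", "letter"],
)
local_df.show()

# By contrast, reading from distributed storage (hypothetical path) references
# data that already lives out on the cluster:
# cluster_df = spark.read.parquet("s3://my-bucket/some-table/")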
I am new to Spark and am trying to read data from a Parquet file, apply some transformations, and return the result to a web UI in a paginated way. Everything works; no issues there.
Now I want to improve the performance of my application. After some Google and Stack Overflow searching, I found out about PySpark parallelism.
What I know is that:
PySpark parallelism works by default, and it creates parallel processes based on the number of cores the system has.
Also, for this to work, the data should be partitioned.
Please correct me if my understanding is not right.
Questions/doubt:
I am reading data from a single Parquet file, so my data is not partitioned, and using the .repartition() method on my DataFrame is expensive. So how should I use PySpark parallelism here?
Also, I could not find any simple example of PySpark parallelism that explains how to use it.
In a Spark cluster, one core reads one partition, so if you are on a multi-node Spark cluster,
then you also need to leave some memory for the existing resource manager, such as YARN.
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
You can use repartition and specify the number of partitions:
df.repartition(n)
where n is the number of partitions. Repartitioning is for parallelism; it will be less expensive than processing your single file without any partitioning.
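For example, here is a minimal PySpark sketch of that idea (the file path, partition count, and column name are placeholders, not from the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-parallelism").getOrCreate()

# Reading a single Parquet file may produce only a few partitions.
df = spark.read.parquet("/data/events.parquet")  # placeholder path
print(df.rdd.getNumPartitions())

# Repartition so the work can be spread across the available cores/executors.
df = df.repartition(8)  # choose n based on executors * cores per executor

# Subsequent transformations now run on 8 partitions in parallel.
df.groupBy("some_column").count().show()  # placeholder transformation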
I am new to the Spark community. Please ignore if this question doesn't make sense.
My PySpark DataFrame takes only a fraction of the time (in ms) for the 'sorting' itself, but moving the data is much more expensive (> 14 sec).
Explanation:
I have a huge collection of Arrow RecordBatches which is evenly distributed across my worker nodes' memory (in the plasma_store). Currently, I am collecting all those RecordBatches back on my master node, merging them, and converting them to a single Spark DataFrame. Then I apply a sorting function on that DataFrame.
A Spark DataFrame is a cluster-distributed data collection.
So my question is:
Is it possible to create a Spark DataFrame from all those already-distributed Arrow RecordBatch collections in the worker nodes' memory, so that the data remains in the respective worker nodes' memory (instead of bringing it to the master node, merging it, and then creating a distributed DataFrame)?
Thanks!
Yes, you can store the data in Spark's cache; whenever you try to get the data again, it is served from the cache rather than from the source.
Please use the links below to understand more about caching:
https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
where does df.cache() is stored
https://unraveldata.com/to-cache-or-not-to-cache/
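As a minimal sketch of what caching looks like (the source path and column name here are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

df = spark.read.parquet("/data/events.parquet")  # placeholder source

# Mark the DataFrame for caching; the cache is populated lazily
# the first time an action runs on it.
df.cache()

df.count()                           # first action: reads the source and fills the cache
df.filter(df["id"] > 100).count()    # later actions reuse the cached data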
I have recently started working on Spark. We always use CLUSTER BY to optimize tables before joining, but I wanted to know: is there any scenario where we would prefer DISTRIBUTE BY over the CLUSTER BY clause?
The only difference between CLUSTER BY and DISTRIBUTE BY is:
DISTRIBUTE BY only repartitions the data based on the expression, while CLUSTER BY first repartitions the data and then sorts it by the key within each partition.
The equivalent representations of DISTRIBUTE BY and CLUSTER BY in the DataFrame API are as follows:
distribute by
df.repartition(2, $"key")
cluster by
df.repartition(2, $"key").sortWithinPartitions($"key")
Both involve a shuffle operation, but CLUSTER BY has an extra sorting operation.
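For reference, the same two clauses can also be written directly in Spark SQL; here is a small PySpark sketch (the view and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cluster-vs-distribute").getOrCreate()
spark.read.parquet("/data/events.parquet").createOrReplaceTempView("events")  # placeholder

# DISTRIBUTE BY: repartition by key, no ordering within partitions.
distributed = spark.sql("SELECT * FROM events DISTRIBUTE BY key")

# CLUSTER BY: repartition by key, then sort by key within each partition.
clustered = spark.sql("SELECT * FROM events CLUSTER BY key")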
I want to create a small dataframe with just 10 rows. And I want to force this dataframe to be distributed to two worker nodes. My cluster has only two worker nodes. How do I do that?
Currently, whenever I create such a small dataframe, it gets persisted in only one worker node.
I know Spark is built for Big Data and this question does not make much sense. However, conceptually, I just wanted to know whether it is at all feasible or possible to force a Spark DataFrame to be split across all the worker nodes (given a very small DataFrame with only 10-50 rows).
Or is it completely impossible, and do we have to rely on the Spark master for how this DataFrame is distributed?
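A small sketch of what I am trying (the column names are made up): after repartitioning into 2 partitions, spark_partition_id() shows how the rows are split, although which worker each partition lands on is still up to the scheduler.

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("tiny-df-distribution").getOrCreate()

df = spark.createDataFrame([(i, "row_" + str(i)) for i in range(10)], ["id", "value"])

# Force the 10 rows into 2 partitions; with 2 executors these partitions
# can be scheduled onto different worker nodes.
df2 = df.repartition(2)

df2.withColumn("partition", spark_partition_id()).show()
print(df2.rdd.getNumPartitions())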
Original title: Besides HDFS, what other DFS does Spark support (and which are recommended)?
I am happily using spark and elasticsearch (with elasticsearch-hadoop driver) with several gigantic clusters.
From time to time, I would like to pull the entire cluster of data out, process each doc, and put all of them into a different Elasticsearch (ES) cluster (yes, data migration too).
Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with spark + elasticsearch-hadoop, because that would involve swapping the SparkContext behind the RDD. So I would like to write the RDDs to object files and later read them back into RDDs with different SparkContexts.
However, here comes the problem: I then need a DFS (Distributed File System) to share the big files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that Spark supports?
Update:
Thanks to @Daniel Darabos's answer below, I can now read and write data from/into different Elasticsearch clusters using the following Scala code:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Read from the source ES cluster...
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)
val allDataRDD = sc.esRDD("some/lovelydata")
// ...and write to the target ES cluster by overriding es.nodes for this operation.
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
Spark uses the hadoop-common library for file access, so whatever file systems Hadoop supports will work with Spark. I've used it with HDFS, S3 and GCS.
I'm not sure I understand why you don't just use elasticsearch-hadoop. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments, so you can use two ES clusters with the same SparkContext without any problems.
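For completeness, a rough PySpark sketch of the same idea with one SparkContext and two configuration dicts (this assumes the elasticsearch-hadoop jar is on the classpath; the cluster and index names simply mirror the question, and the write side goes through the es.input.json option as JSON strings):

import json
from pyspark import SparkContext

sc = SparkContext(appName="es-migration-sketch")

# Read from the source cluster with one configuration...
read_conf = {"es.nodes": "from.escluster.com", "es.resource": "some/lovelydata"}
docs = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=read_conf)

# ...and write to the target cluster with a different configuration,
# all within the same SparkContext. Each doc comes back as a Python dict.
write_conf = {"es.nodes": "to.escluster.com",
              "es.resource": "clone/lovelydata",
              "es.input.json": "yes"}
docs.map(lambda kv: ("key", json.dumps(dict(kv[1])))).saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=write_conf)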