According to the official Azure Guide, using native Spark caching, even with disk persistence, won't take advantage of the local SSD. I suspect that to benefit from it we need to use the OFF_HEAP option when persisting RDDs. But then how do I configure it so that it uses the local SSD (mounted as sdb1 under /mnt) and Alluxio for the in-memory part? I know the switches
--conf spark.memory.offHeap.enabled="true" \
--conf spark.memory.offHeap.size=10G \
I'm asking about datasets generated through a set of operations, rather than datasets read straight from input files (which would be easy: only an "hdfs://" prefix is needed).
To persist data from Spark to shared external storage that can manage the SSD resource, you can use Alluxio. Spark can save and load RDDs or DataFrames to Alluxio easily:
// Save RDD to Alluxio as Text File
scala> rdd.saveAsTextFile("alluxio://master:19998/myRDD")
// Load the RDD back from Alluxio as Text File
scala> sc.textFile("alluxio://master:19998/myRDD")
// Save DataFrame to Alluxio as Parquet files
scala> df.write.parquet("alluxio://master:19998/path")
// Load DataFrame back from Alluxio as Parquet files
scala> val df = sqlContext.read.parquet("alluxio://master:19998/path")
Maybe they meant storing data explicitly in Alluxio or HDFS directly?
e.g. instead of:
df.cache()
use write and read:
df.write.parquet("alluxio://master:19998/out.parquet")
val df2 = spark.read.parquet("alluxio://master:19998/out.parquet")
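For example, a DataFrame derived through a chain of operations could be materialized the same way (a sketch only; the output path and the "key" column are placeholders):
// Write the derived result to Alluxio instead of caching it,
// then read it back so downstream stages reuse the materialized copy
val derived = df.groupBy("key").count()
derived.write.mode("overwrite").parquet("alluxio://master:19998/derived.parquet")
val reloaded = spark.read.parquet("alluxio://master:19998/derived.parquet")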
P.S. Sorry for a stupid answer; I wanted to write it as a comment but didn't have enough reputation.
Related
This question is almost a replica of the requirement here: Writing files to local system with Spark in Cluster mode
but my query has a twist. The page above writes files from HDFS directly to the local filesystem using Spark, but only after converting the data to an RDD.
I'm looking for options that work with just the DataFrame; converting huge data to an RDD takes a toll on resource utilisation.
You can use the syntax below to write a DataFrame directly to the HDFS filesystem.
df.write.format("csv").save("path in hdfs")
Refer to the Spark docs for more details: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#generic-loadsave-functions
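A slightly fuller sketch (the output path and options are illustrative, not required):
// Write the DataFrame straight to HDFS as CSV with a header, replacing any existing output
df.write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("hdfs:///user/output/my_csv")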
When I use Spark to read multiple files from S3 (e.g. a directory with many Parquet files) -
Does the logical partitioning happen at the beginning, then each executor downloads the data directly (on the worker node)?
Or does the driver download the data (partially or fully) and only then partitions and sends the data to the executors?
Also, will the partitioning default to the same partitions that were used for write (i.e. each file = 1 partition)?
Data on S3 is external to HDFS obviously.
You can read from S3 by providing a path (or paths), or via the Hive Metastore if you have registered the data there by creating DDL for an external S3 table, adding partitions with MSCK, or using ALTER TABLE table_name RECOVER PARTITIONS for Hive on EMR.
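For illustration, the Metastore route could look roughly like this (a sketch assuming Hive support is enabled; the table name, schema, and bucket are made up):
// Register an external table over the S3 data, then discover its partitions
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION 's3a://my-bucket/events/'
""")
spark.sql("MSCK REPAIR TABLE events")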
If you use:
val df = spark.read.parquet("/path/to/parquet/file.../...")
then there is no guarantee on partitioning; it depends on various settings. See Does Spark maintain parquet partitioning on read?, noting that the APIs evolve and improve.
But, this:
val df = spark.read.parquet("/path/to/parquet/file.../.../partitioncolumn=*")
will distribute partitions over the executors in a manner that follows your saved partition structure, a bit like Spark's bucketBy.
The Driver only gets the metadata when S3 is specified directly.
In your terms:
"... each executor downloads the data directly (on the worker node)? " YES
Metadata is obtained, with Driver coordination and other system components, for the file/directory locations on S3, but the data is not first downloaded to the Driver; that would be a big design folly. How the APIs respond also depends on the form of the statement.
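To see this in practice, a small sketch (the bucket, layout, and column name are hypothetical):
import org.apache.spark.sql.functions.col
// Partition discovery picks up the dt=*/ subdirectories under the prefix
val events = spark.read.parquet("s3a://my-bucket/events/")
println(events.rdd.getNumPartitions)                 // how many input partitions Spark created
events.filter(col("dt") === "2020-01-01").explain()  // partition pruning shows up in the physical plan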
I am trying to understand which of the two options below would be the better one, especially in a Spark environment:
Loading the Parquet file directly into a DataFrame and accessing the data (a 1 TB table)
Using a database to store and access the data.
I am working on a data pipeline design and trying to understand which of the above two options will result in the more optimized solution.
Loading the Parquet file directly into a DataFrame and accessing the data is more scalable than reading an RDBMS like Oracle through a JDBC connector. I handle more than 10 TB of data, though I prefer the ORC format for better performance. I suggest reading the data directly from files; the reason is data locality - if you run your Spark executors on the same hosts where the HDFS data nodes are located, they can effectively read data into memory without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.
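To make the contrast concrete, here is a sketch of the two access paths (all paths, table names, and connection details are made up):
// 1) Read the files directly: each executor reads its own splits, ideally data-local
val fromFiles = spark.read.parquet("hdfs:///warehouse/big_table")
// 2) Read the same data through JDBC: throughput is bounded by the database
//    (requires the Oracle JDBC driver on the classpath)
val fromJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "BIG_TABLE")
  .option("user", "app_user")
  .option("password", "secret")
  .option("partitionColumn", "ID")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "16")
  .load()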
I have a few questions around Spark RDDs. Can someone enlighten me, please?
I can see that RDDs are distributed across nodes. Does that mean the distributed RDD is cached in the memory of each node, or will the RDD data reside on the HDFS disks? Or does the RDD data get cached in memory only when an application runs?
My understanding is that when I create an RDD based on a file stored in HDFS blocks, the RDD will read the data from the blocks the first time (an I/O operation) and then cache it persistently. At least once it has to read the data from disk; is that true?
Is there any way I can cache external data directly into an RDD instead of storing the data in HDFS first and then loading it into an RDD from the HDFS blocks? The concern here is that storing data in HDFS first and then loading it into memory introduces latency.
RDDs are data structures similar to arrays and lists. When you create an RDD (for example, by loading a file), in local mode the data stays on your laptop; if you are using HDFS, it stays in HDFS. Remember: ON DISK.
If you want to store it in the cache (in RAM), you can use the cache() function.
Hopefully the first point answers your second question as well.
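For example (a minimal sketch; the HDFS path is a placeholder):
val rdd = sc.textFile("hdfs:///data/sample.txt")  // nothing is read yet, only the lineage is recorded
rdd.cache()                                       // mark the RDD to be kept in memory once computed
rdd.count()                                       // first action: reads from HDFS, then materializes the cache
rdd.count()                                       // second action: served from the in-memory cache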
Yes, you can load the data directly from your laptop without putting it into HDFS first.
val newfile = sc.textFile("file:///home/user/sample.txt")
Specify the file path.
By default Spark assumes HDFS as the storage; you can change that with the line above.
Don't forget to include the three slashes:
file:///
Is it possible to create an RDD using data from the master or a worker? I know there is the option sc.textFile(), which sources the data from the local system (the driver); similarly, can we use something like "master:file://input.txt"? I am asking because I am accessing a remote cluster, my input data is large, and I cannot log in to the remote cluster.
I am not looking for S3 or HDFS. Please suggest if there is any other option.
Data in an RDD is always controlled by the Workers, whether it is in memory or located in a data-source. To retrieve the data from the Workers into the Driver you can call collect() on your RDD.
You should put your file on HDFS or a filesystem that is available to all nodes.
The best way to do this is, as you stated, to use sc.textFile. To do that you need to make the file available on all nodes in the cluster. Spark provides an easy way to do this via the --files option of spark-submit: simply pass the option followed by the path of the file that you need copied.
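For example (a sketch; the file name and the job logic are hypothetical). After submitting with --files /home/user/lookup.txt (or calling sc.addFile with the same path), each task can resolve its own local copy:
import org.apache.spark.SparkFiles
val rdd = sc.parallelize(1 to 10).mapPartitions { iter =>
  val localPath = SparkFiles.get("lookup.txt")               // path of the copy shipped to this executor
  val lines = scala.io.Source.fromFile(localPath).getLines().toList
  iter.map(i => (i, lines.length))                           // e.g. pair each element with the lookup size
}
rdd.collect()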
You can access the Hadoop filesystem by creating a Hadoop configuration.
import org.apache.spark.deploy.SparkHadoopUtil
import java.io.{File, FileInputStream, FileOutputStream, InputStream}
// Reuse the Hadoop configuration that Spark is running with
val hadoopConfig = SparkHadoopUtil.get.conf
// Get a FileSystem handle for the scheme of the given URI (hdfs://, file://, s3a://, ...)
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(fileName), hadoopConfig)
val fsPath = new org.apache.hadoop.fs.Path(fileName)
Once you get the path you can copy, delete, move or perform any operations.
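For example (paths are illustrative):
// Copy a local file to the path resolved above
fs.copyFromLocalFile(new org.apache.hadoop.fs.Path("file:///home/user/input.txt"), fsPath)
// ...or test for and remove an existing path (false = non-recursive delete)
// if (fs.exists(fsPath)) fs.delete(fsPath, false)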