I am running a Spark program on my Windows machine. I want to load data from SQL into HDFS. I was able to load the data into a data frame and am using the query below to write it to HDFS (which lives on my Cloudera VM). Below is the syntax I found. Can someone please guide me on where to find my Cloudera cluster details in the VM, i.e. what to pass after hdfs://? I expect /user/hdfs/test/ in the URL below to represent the directory structure.
df.write.csv("hdfs://cluster/user/hdfs/test/example.csv")
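For reference, a minimal sketch of what the full call might look like, assuming the NameNode address the Cloudera QuickStart VM typically uses (quickstart.cloudera on port 8020); the authoritative value is whatever fs.defaultFS reports on the VM:

// Hedged sketch: the host and port are assumptions. Check the real value on the VM with:
//   hdfs getconf -confKey fs.defaultFS
df.write
  .option("header", "true")
  .csv("hdfs://quickstart.cloudera:8020/user/hdfs/test/")  // the path is a directory; Spark writes part-* files into it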
I have a project that deals with processing data with Spark on EMR.
From what I've read, people usually store their input data on some file system (HDFS, S3, or locally), and then operate on that. If the data is very large, we don't want to store that locally.
My question is, if I generate a bunch of data, how do you even store that data remotely on S3 or whichever cloud file system there is in the first place? Don't I need to have the data stored locally before I can store it on the cloud?
I ask this because currently, I'm using a service that has a method that returns a Spark Dataset object to me. I'm not quite sure how the workflow goes between calling that method and processing it via Spark on EMR.
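As a hedged sketch (the bucket name is made up, and spark.range stands in for the Dataset your service returns), a Dataset produced on the cluster can be written straight to S3 by the executors; nothing needs to be staged locally first:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("write-to-s3").getOrCreate()

// Stand-in for "a method that returns a Spark Dataset object"
val ds = spark.range(0, 1000000000L).toDF("id")

// Each executor uploads its own partitions directly to the bucket
ds.write.parquet("s3://my-bucket/generated-data/")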
The object store connectors tend to write data in blocks; for each partition, the worker creates a file through the Hadoop FS APIs, with a path like s3://bucket/dest/__temporary/0/task_0001/part-0001.csv, and gets back an output stream into which it writes. That's it.
I don't know about the closed-source EMR S3 connector, but the ASF S3A one is up there for you to examine.
Data is buffered up to the value of fs.s3a.blocksize; default = 32M, i.e. 32 MB.
Buffering is to disk (default), heap (byte arrays), or off-heap byte buffers; see S3ADataBlocks.
When you write data, once the buffer threshold is reached, that block is uploaded in a separate thread and a new block buffer is created; see S3ABlockOutputStream.write.
When the stream's close() method is called, any outstanding data is PUT to S3, and then the thread blocks until it has all been uploaded; see S3ABlockOutputStream.close.
The uploads happen in a separate thread, so even if the network is slow you can generate data slightly faster than it is uploaded, with any blocking happening at the end. The amount of disk/RAM you need is as much as all the outstanding blocks from all workers uploading data. The thread pool for the upload is shared and of limited size, so you can tune the params to limit these values (see the sketch after the list below), though that's normally only needed if you try to buffer in memory.
When the queue fills up, the worker threads writing to the S3 output stream block (via the SemaphoredDelegatingExecutor).
The amount of local storage you need then depends on:
number of spark worker threads
rate of data they generate
number of threads/http connections you have to upload the data
bandwidth from VM to S3 (the ultimate limit)
any throttling S3 does with many clients writing to same bit of a bucket
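To make those tuning knobs concrete, here is a hedged sketch using the standard S3A options (buffer mechanism, queued blocks per stream, size of the shared upload thread pool); the values are illustrative, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-upload-tuning")
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")      // disk | array | bytebuffer
  .config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "4")  // blocks a single stream may queue
  .config("spark.hadoop.fs.s3a.threads.max", "10")               // shared pool of upload threads
  .getOrCreate()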
That's with the S3A connector; the EMR s3 one will be different, but again, upload bandwidth will be the bottleneck. I assume it too has something to block workers which create more data than the network can handle.
Anyway: for Spark and the hadoop code it uses underneath, all the source is there for you to explore. Don't be afraid to do so!
When dealing with Spark and any distributed storage, keep in mind that a Spark cluster consists of a number of nodes.
While Dataset transformations are orchestrated from a single node of the cluster, named the driver, common practice is that the processed data is never collected on any single node of such a cluster. Each node with the executor role operates on a fraction of the whole data as it is ingested into Spark, processed, and stored back to some kind of storage.
With such an approach, the limits of a single node do not limit the volume of data the cluster can process.
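A small sketch of the point above (paths are placeholders): the write is performed by the executors in parallel, each emitting files for its own partitions; nothing forces the data through a single node unless you explicitly collect it.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("distributed-write").getOrCreate()

val df = spark.read.json("hdfs:///data/input/")   // reading is already split across executors

df.write.parquet("hdfs:///data/output/")          // each executor writes its own part-* files
// df.collect()                                   // avoid for large data: pulls everything onto the driver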
I have log files on different servers (5 servers connected through a LAN) and I need to process them and get the result.
Each node has 4 TB of log files, and I'm using HDFS to load all the log files into Spark.
Every time a request comes in, Spark loads all the files (5 * 4 TB) and then runs the query with Spark SQL.
What if I load all the log files into Cassandra and then query (it can be preloaded)? Which is the faster way?
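Roughly, the flow described above looks like the sketch below (paths and the query are placeholders); the "preloaded" variant would add a cache() so the files are not re-read on every request:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("log-query").getOrCreate()

val logs = spark.read.text("hdfs:///logs/")   // all 5 * 4 TB of log files
logs.createOrReplaceTempView("logs")
logs.cache()                                  // optional "preload" so repeated requests skip the reload

val errors = spark.sql("SELECT value FROM logs WHERE value LIKE '%ERROR%'")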
HDFS and Cassandra each have their own advantages.
If you need to process all the log files entirely, HDFS is the better choice because it is a file system and has been designed to store massive amounts of data and process them in batches.
Now, if you only need to process a portion of the log files, a datastore like Cassandra is the better choice because you can filter data by primary key, get faster access, and skip scanning through all the files.
Cassandra is designed for OLTP workloads, whereas HDFS and the like are designed for OLAP workloads.
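A hedged sketch of the difference, assuming the DataStax spark-cassandra-connector is on the classpath; the keyspace, table, column, and host names are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-vs-hdfs")
  .config("spark.cassandra.connection.host", "cassandra-host")
  .getOrCreate()

// Cassandra: a predicate on the primary key is pushed down, so only the
// matching partition is read instead of every file being scanned.
val oneDay = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "logs_ks", "table" -> "logs_by_day"))
  .load()
  .filter("log_date = '2016-01-01'")

// HDFS: an equivalent query has to scan all of the files before filtering.
val oneDayFromFiles = spark.read.text("hdfs:///logs/").filter("value LIKE '2016-01-01%'")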
I am using Spark 1.5 without HDFS in cluster mode to build an application. I was wondering, when there is a save operation, e.g.,
df.write.parquet("...")
which data is stored where? Is all the data stored at the master, or is each worker storing its local data?
Generally speaking, all worker nodes will perform writes to their local file systems, with the driver writing only a _SUCCESS file.
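So with a plain local path and no shared filesystem, the sketch below (the path is an example, and spark.range stands in for the question's DataFrame) ends up with part files scattered across the workers' local disks, which is why a shared store such as HDFS or S3 is normally used for output:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("local-write").getOrCreate()
val df = spark.range(0, 1000).toDF("id")   // stand-in for the DataFrame being saved

// Each executor writes its own part-* files under /tmp/output/ on its own
// machine; the driver contributes only the _SUCCESS marker.
df.write.parquet("file:///tmp/output/")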
I would like to know how to write the output of an SVM algorithm to a CSV file after running it. I am hosting my Spark cluster on AWS EMR, so any files I access have to be saved to and read from S3 only. When I use the saveAsTextFile command and specify an AWS path, I don't see the output file(s) being stored in S3. Any suggestions in this regard?
You can use Spark's saveAsTextFile action to write the results to a file.
An example is available here.
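For completeness, a hedged sketch (the bucket and paths are placeholders, and the predictions are stand-in values rather than real SVM output) of writing results from EMR straight to S3 with both the RDD and DataFrame APIs; note that the path names a directory into which part files are written, not a single CSV file:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("svm-output").getOrCreate()
import spark.implicits._

// Stand-in for the model's (prediction, label) pairs
val predictions = spark.sparkContext.parallelize(Seq((1.0, 0.0), (0.0, 0.0)))

predictions.saveAsTextFile("s3://my-bucket/svm-output-text/")                          // RDD API
predictions.toDF("prediction", "label").write.csv("s3://my-bucket/svm-output-csv/")    // DataFrame API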