I'm working on Spark 1.6.1.
I have a DataFrame that is distributed and is for sure bigger than any single node I have in my cluster.
What will happen if I bring it all onto one node?
df.coalesce(1)
Will the job fail?
Thanks
It will fail for sure, as the data will not fit in memory on a single node.
If you want a single file as output, you can instead merge the HDFS part files afterwards using hdfs dfs -getmerge.
You can also use the utility from the Git project below to merge multiple files into one file:
https://github.com/gopal-tiwari/hdfs-file-merge
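A minimal sketch of the getmerge-style approach (Spark 1.6 / Hadoop 2.x, assuming `df` is the DataFrame from the question): write it with its normal partitioning, then merge the resulting part files into one file with FileUtil.copyMerge, the programmatic equivalent of hdfs dfs -getmerge. The output paths here are placeholders.

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val partsDir   = "hdfs:///tmp/df_parts"       // directory of part-* files (placeholder)
val mergedFile = "hdfs:///tmp/df_merged.json" // single merged output file (placeholder)

// Write with the DataFrame's existing partitioning (no coalesce(1) needed).
df.write.json(partsDir)

// Merge the part files into one file on HDFS.
val hadoopConf = df.sqlContext.sparkContext.hadoopConfiguration
val fs = FileSystem.get(hadoopConf)
FileUtil.copyMerge(fs, new Path(partsDir), fs, new Path(mergedFile),
  /* deleteSource = */ false, hadoopConf, null)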
I have a bunch of data (on S3) that I am copying to a local HDFS (on Amazon EMR). Right now I'm doing that using org.apache.hadoop.fs.FileUtil.copy, but it's not clear whether this distributes the file copy to the executors. There's certainly nothing showing up in the Spark History Server.
Hadoop DistCp seems like the right tool (note I'm on S3, so it's actually supposed to be s3-dist-cp, which is built on top of DistCp), except that it's a command-line tool. I'm looking for a way to invoke this from a Scala script (i.e. from JVM code).
Any ideas / leads?
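If simply shelling out is acceptable, a hedged sketch of invoking s3-dist-cp from Scala could look like the following. The --src/--dest flags are the ones documented for EMR's s3-dist-cp, and the paths are placeholders.

import scala.sys.process._

val src  = "s3://my-bucket/input/"   // hypothetical source
val dest = "hdfs:///data/input/"     // hypothetical destination

// Run s3-dist-cp as an external process and check its exit code.
val exitCode = Seq("s3-dist-cp", "--src", src, "--dest", dest).!
if (exitCode != 0) sys.error(s"s3-dist-cp failed with exit code $exitCode")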
cloudcp is an example of using Spark to do the copy: the list of files is turned into an RDD, with each row corresponding to one copy. That design is optimised for upload from HDFS, as it tries to schedule the upload close to the files in HDFS.
For download, you want to do the following (a rough sketch follows this list):
- use listFiles(path, recursive) for maximum performance when listing an object store;
- randomise the list of source files so that you don't get throttled by AWS;
- randomise the placement across the HDFS cluster so that the blocks end up scattered evenly round the cluster.
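A rough sketch of that download approach (not the actual cloudcp code): list the source objects with listFiles(recursive), shuffle the list to avoid throttling, spread the copies across the cluster as an RDD, and copy each file with FileUtil.copy. The bucket, destination path and parallelism level are placeholders.

import scala.util.Random
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-to-hdfs-copy"))

val srcRoot  = new Path("s3a://my-bucket/input/")   // hypothetical source
val destRoot = "hdfs:///data/input/"                // hypothetical destination

// List source files on the driver (listFiles(recursive = true) is the fast
// path for object stores), then randomise the order.
val srcFs = srcRoot.getFileSystem(sc.hadoopConfiguration)
val files = {
  val it  = srcFs.listFiles(srcRoot, true)
  val buf = scala.collection.mutable.ArrayBuffer[String]()
  while (it.hasNext) buf += it.next().getPath.toString
  Random.shuffle(buf.toSeq)
}

// One copy per RDD element; the shuffle plus Spark's scheduling spread the
// writes across the cluster.
sc.parallelize(files, numSlices = 100).foreachPartition { paths =>
  val conf  = new org.apache.hadoop.conf.Configuration()
  val dstFs = FileSystem.get(new java.net.URI(destRoot), conf)
  paths.foreach { p =>
    val src = new Path(p)
    FileUtil.copy(src.getFileSystem(conf), src,
      dstFs, new Path(destRoot, src.getName), false, conf)
  }
}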
The question is regarding Spark 1.6.
When a dataframe is written to HDFS in SaveMode.APPEND mode, I want to know which files were created new.
One way to do this is to keep track of the files in HDFS before and after the job; is there a better way?
Also, MapReduce prints job statistics at the end; do we have something similar for every Spark action?
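A minimal sketch of the "track files before and after" approach mentioned above (Spark 1.6, assuming `df` is the DataFrame being appended; the output path is a placeholder): list the directory before and after the append and diff the two sets.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val outPath = new Path("hdfs:///data/output/")   // hypothetical output directory
val fs      = outPath.getFileSystem(df.sqlContext.sparkContext.hadoopConfiguration)

def listFileNames(p: Path): Set[String] =
  if (fs.exists(p)) fs.listStatus(p).map(_.getPath.getName).toSet
  else Set.empty[String]

val before = listFileNames(outPath)
df.write.mode(SaveMode.Append).parquet(outPath.toString)
val after  = listFileNames(outPath)

val newFiles = after -- before   // files created by this append
newFiles.foreach(println)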
I have a daemon process which dumps data as files into HDFS. I need to create an RDD over the new files, de-duplicate them and store them back in HDFS. The file names should be preserved when writing back to HDFS.
Any pointers to achieve this?
I am open to achieving this with or without Spark Streaming.
I tried creating a Spark Streaming process which processes the data directly (using Java code on the worker nodes) and pushes it into HDFS without creating an RDD.
However, this approach fails for larger files (greater than 15 GB).
I am looking into Spark Streaming's file stream API (JavaStreamingContext.fileStream) now.
Any pointers would be a great help.
Thanks and Regards,
Abhay Dandekar
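A hedged sketch of one batch (non-streaming) approach: read each new file with wholeTextFiles so the original path is kept, de-duplicate lines within each file (adjust to whatever "de-duplicate" means for your data), and write the result back under the same file name in a different directory. The directory names are placeholders, and note that very large single files can still be a problem because wholeTextFiles loads a whole file as one record.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dedup-new-files"))

val inDir  = "hdfs:///incoming/"   // hypothetical: where the daemon dumps files
val outDir = "hdfs:///deduped/"    // hypothetical: where deduped files go

sc.wholeTextFiles(inDir).foreach { case (path, content) =>
  val dedupedLines = content.split("\n").distinct      // de-duplicate lines
  val fileName     = new Path(path).getName            // keep the original name

  val conf = new org.apache.hadoop.conf.Configuration()
  val fs   = FileSystem.get(new java.net.URI(outDir), conf)
  val out  = fs.create(new Path(outDir, fileName), /* overwrite = */ true)
  try out.write(dedupedLines.mkString("\n").getBytes("UTF-8"))
  finally out.close()
}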
I have a Spark job that currently pulls data from HDFS and transforms the data into flat files to load into Cassandra.
The Cassandra table has essentially 3 columns, but the last two are map collections, so a "complex" data structure.
Right now I use the COPY command and get about 3k rows/sec, but that's extremely slow given that I need to load about 50 million records.
I see I can convert the CSV file to SSTables, but I don't see an example involving map collections and/or lists.
Can I use the Spark connector to Cassandra to load data with map collections and lists, and get better performance than just the COPY command?
Yes, the Spark Cassandra Connector can be much, much faster for files already in HDFS. Using Spark you'll be able to read and write into C* in a distributed fashion.
Even without Spark, using a Java-based loader like https://github.com/brianmhess/cassandra-loader will give you a significant speed improvement.
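A minimal sketch with the Spark Cassandra Connector (1.6.x line): Scala Maps are written straight into CQL map columns, so no SSTable conversion is needed. The keyspace, table, column names, connection host and the flat-file layout below are all placeholders; adapt the parsing to your data.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hdfs-to-cassandra")
  .set("spark.cassandra.connection.host", "10.0.0.1")   // hypothetical host

val sc = new SparkContext(conf)

// Parse the HDFS flat files into (key, map1, map2) tuples.
// Hypothetical line layout: id \t k1=v1,k2=v2 \t k3=v3
val rows = sc.textFile("hdfs:///data/flatfiles/").map { line =>
  val fields = line.split("\t")
  def toMap(s: String): Map[String, String] =
    s.split(",").filter(_.nonEmpty).map { kv =>
      val Array(k, v) = kv.split("=", 2); k -> v
    }.toMap
  (fields(0), toMap(fields(1)), toMap(fields(2)))
}

// "id", "attrs", "tags" stand in for the 3 columns of the C* table,
// the last two defined as map<text, text>.
rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "attrs", "tags"))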
I'm coming from a Hadoop background. In Hadoop, if we have an input directory that contains lots of small files, each mapper task picks one file at a time and operates on that single file (we can change this behaviour and have each mapper pick more than one file, but that's not the default behaviour). I'd like to know how that works in Spark. Does each Spark task pick files one by one, or..?
Spark behaves the same way as Hadoop when working with HDFS, since Spark in fact uses the same Hadoop InputFormats to read the data from HDFS.
But your statement isn't quite right: Hadoop will take files one by one only if each of your files is smaller than a block size, or if the files are text compressed with a non-splittable codec (like gzip-compressed CSV files); larger splittable files are divided into multiple input splits.
So Spark does the same: for each small input file it creates a separate partition, and the first stage executed over your data has the same number of tasks as the number of input files. This is why, for lots of small files, it is recommended to use the wholeTextFiles function, as it creates far fewer partitions.
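A small sketch of the difference (Spark 1.6, assuming `sc` is an existing SparkContext and the input path is a placeholder): textFile gives roughly one partition per small file, while wholeTextFiles packs files together into far fewer partitions.

val dir = "hdfs:///data/many-small-files/"   // hypothetical directory of small files

val perFile = sc.textFile(dir)
println(s"textFile partitions:       ${perFile.partitions.length}")   // roughly one per file

// wholeTextFiles returns (fileName, fileContent) pairs and uses
// CombineFileInputFormat under the hood, so it groups small files together.
val combined = sc.wholeTextFiles(dir, minPartitions = 8)
println(s"wholeTextFiles partitions: ${combined.partitions.length}")  // much smaller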