How does saveToCassandra() work? - apache-spark

I want to know, when I use rdd.saveToCassandra(), whether this function saves all elements of the current RDD into the Cassandra table in a single write, or saves them element by element, similar to how the map function processes each element of an RDD and returns a new parsed element.
Thanks

Neither the first option nor the second one. It writes data after grouping it into batches of a configured size (by default 1024 bytes per batch and 1000 batches per Spark task). If you are interested in the details, it's open source, so check RDDFunctions and TableWriter for a start.
Updated as a response to comments: you may split your RDD into multiple RDDs and save each one using saveToCassandra. RDD splitting is not a standard feature of Spark as of now, so you need a third-party library such as Silex. Check the documentation for flatMuxPartitions here.
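As an illustration, here is a minimal sketch of such a save, assuming the DataStax spark-cassandra-connector; the keyspace/table names are made up, and the batch-size setting shown is an assumption whose exact key may vary across connector versions:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("saveToCassandra-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Assumed setting: controls how many bytes are grouped into one write batch.
  .set("spark.cassandra.output.batch.size.bytes", "1024")

val sc = new SparkContext(conf)

val rows = sc.parallelize(Seq(("cat", 30), ("dog", 40)))

// Rows are written in batches behind the scenes: not one by one,
// and not as a single write of the whole RDD.
rows.saveToCassandra("test_keyspace", "word_count", SomeColumns("word", "count"))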

Related

Should we always use rdd.count() instead of rdd.collect().size

rdd.collect().size will first move all the data to the driver; if the dataset is large, this could result in an OutOfMemoryError.
So, should we always use rdd.count() instead?
Or, in other words, in what situation would people prefer rdd.collect().size?
collect causes the data to be processed and then fetched to the driver node.
For count you don't need:
Full processing - some columns may not need to be fetched or calculated, e.g. because they are not included in any filter. You don't need to load, process, or transfer the columns that don't affect the count.
Fetching to the driver node - each worker node can count its own rows, and the counts can be summed up.
I see no reason for calling collect().size.
Just for general knowledge, there is another way to work around #2; however, for this case it is redundant and won't prevent #1: rdd.mapPartitions(p => Iterator(p.size)).sum()
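A quick sketch of the three variants discussed here, assuming a spark-shell session where sc is available:

val rdd = sc.parallelize(1 to 1000000)

// Preferred: rows are counted on the workers; only per-partition totals
// reach the driver.
val total = rdd.count()

// The workaround above: each partition emits a single number, so almost no
// data is transferred, but it is redundant when count() is available.
val totalViaPartitions = rdd.mapPartitions(p => Iterator(p.size.toLong)).sum()

// Avoid on large datasets: collect() moves every row to the driver just to
// read .size, which can cause an OutOfMemoryError.
val totalViaCollect = rdd.collect().size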
Assuming you're using the Scala size function on the array returned by rdd.collect(), I don't see any advantage in collecting the whole RDD just to get its number of rows.
This is the point of RDDs: to work on chunks of data in parallel so that transformations stay manageable. Usually the result is smaller than the original dataset, because the given data is somehow transformed/filtered/synthesized.
collect usually comes at the end of data processing, and if you run an action you might also want to save the data, since it might have required some expensive computations and the collected data is presumably interesting/valuable.

Spark 1.6 Dataframe cache not working correctly

My understanding is that if I have a dataframe, and I cache() it and trigger an action like df.take(1) or df.count(), it should compute the dataframe and save it in memory. And whenever that cached dataframe is called in the program, it should use the already computed dataframe from the cache.
But that is not how my program is working.
I have a dataframe like the one below, which I cache and then immediately run a df.count action on.
val df = inputDataFrame.select(...).where(...).withColumn("newcol", lit("")).cache()
df.count
When I run the program, in the Spark UI I see that the first line runs for 4 min, and when it comes to the second line it again runs for 4 min. So the first line is basically computed twice?
Shouldn't the first line be computed and cached when the second line triggers the action?
How can I resolve this behavior? I am stuck, please advise.
My understanding is that if I have a dataframe, and I cache() it and trigger an action like df.take(1) or df.count(), it should compute the dataframe and save it in memory.
That is not correct. A simple cache and count (take wouldn't work on an RDD either) is a valid method for RDDs, but it is not the case with Datasets, which use much more advanced optimizations. With the query:
df.select(...).where(...).withColumn("newcol", lit("")).count()
any column that is not used in the where clause can be ignored.
There is an important discussion about this on the developer list; quoting Sean Owen:
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is
df.registerAsTempTable("df")
sqlContext.sql("CACHE TABLE df")
which is eager, but it is no longer documented (Spark 2 and forward) and should be avoided.
No, if you call cache on a DataFrame it is not cached at that moment; it is only "marked" for potential future caching. The actual caching happens only when an action follows later. You can also see your cached DataFrame in the Spark UI under "Storage".
Another problem in your code is that count on a DataFrame does not compute the entire DataFrame, because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evaluation (see How to force DataFrame evaluation in Spark).
The question is why your first operation takes so long, even though no action is called. I think this is related to the caching logic (e.g. size estimations, etc.) being computed when calling cache (see e.g. Why is rdd.map(identity).cache slow when rdd items are big?).
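A minimal sketch of that suggestion; inputDataFrame is taken from the question, and the added column is purely illustrative:

import org.apache.spark.sql.functions.lit

// cache() only marks the DataFrame; nothing is materialized yet.
val df = inputDataFrame.withColumn("newcol", lit("")).cache()

// Going through the underlying RDD forces every column of every row to be
// computed, so the cached copy is complete.
df.rdd.count()

// Later actions can reuse the cache; the DataFrame should now show up
// under "Storage" in the Spark UI.
df.count()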

How to Identify the list of available RDDs?

I am using the command below to get the list of available registered temp tables:
sqlContext.sql("show tables").collect().foreach(println)
Is there any similar command to get the list of available RDDs?
Here is my requirement (using Scala):
1. Need to create some RDD on the fly
2. Identify list of available RDDs
3. remove/delete/clear the unwanted RDDs and move forward
How to delete an RDD in PySpark for the purpose of releasing resources?
An additional note: I went through this link, but it doesn't answer all my questions... Also, I tried the code below but I don't see any difference before and after unpersist, so I am not sure how to confirm that my RDD has released its memory.
val tempRDD1 = RDD1.reduceByKey((acc,value)=> acc+value)
tempRDD1.collect.foreach(println)
tempRDD1.unpersist()
tempRDD1.collect.foreach(println)
The RDD data is not saved until it is 1. persisted (cached) and 2. an action occurs to force the preceding transformations to run. If either of these does not happen, no data will be stored. Any RDD that appears to be "created" will just build an execution plan to produce the data if it is needed later. This model is called lazy evaluation.
In your example, no RDD is ever cached, so no data will ever be stored in memory, and the unpersist call has no effect.
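A sketch of how to make the example actually hold data in memory and then release it, assuming the pair RDD RDD1 from the question and a spark-shell session where sc is available; one way to inspect what is currently persisted is sc.getPersistentRDDs:

// Mark the RDD for caching; nothing is stored yet.
val tempRDD1 = RDD1.reduceByKey((acc, value) => acc + value).cache()

// The action forces computation; only now is the data materialized in memory.
tempRDD1.collect().foreach(println)

// Persisted RDDs are listed here and under "Storage" in the Spark UI.
println(sc.getPersistentRDDs.keys)

// Release the cached blocks; a later action would recompute from scratch.
tempRDD1.unpersist()
println(sc.getPersistentRDDs.keys)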

Spark process file in chunks

I would like to process chunks of data (from a CSV file) and then do some analysis within each partition/chunk.
How do I do this and then process these multiple chunks in parallel? I'd like to run map and reduce on each chunk.
I don't think you can read only part of a file. Also, I'm not quite sure whether I understand your intent correctly, or whether you have understood the concept of Spark correctly.
If you read a file and apply a map function to the Dataset/RDD, Spark will automatically process the function in parallel on your data.
That is, each worker in your cluster will be assigned a partition of your data, i.e. it will process "n%" of the data. Which data items end up in the same partition is decided by the partitioner. By default, Spark uses a hash partitioner.
(As an alternative to map, you can apply mapPartitions; see the sketch after the list below.)
Here are some thoughts that came to my mind:
Partition your data using the partitionBy method and create your own partitioner. This partitioner could, for example, put the first n rows into partition 1, the next n rows into partition 2, etc.
If your data is small enough to fit on the driver, you can read the whole file, collect it into an array, skip the desired number of rows (in the first run, no row is skipped), take the next n rows, and then create an RDD again from these rows.
You can preprocess the data, create the partitions somehow (i.e. each containing its n%), and then store the result again. This will create different files on your disk/HDFS: part-00000, part-00001, etc. Then in your actual program you can read just the desired part file, one after the other...
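A sketch of the "let Spark handle the chunks" approach with mapPartitions, assuming a spark-shell session and a hypothetical data.csv with numeric values in its second column and no header row:

// Each partition of the file is handled by a worker in parallel.
val lines = sc.textFile("data.csv")

// Run an analysis per partition/chunk; here, a simple sum of the second column.
val perChunk = lines.mapPartitions { rows =>
  val values = rows.map(_.split(",")(1).toDouble)
  Iterator(values.sum)
}

// Reduce the per-chunk results into a final answer.
val total = perChunk.reduce(_ + _)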

Apache Spark RDD value lookup

I loaded data from HBase, did some operations on that data, and a paired RDD was created. I want to use the data of this RDD in my next function. I have half a million records in the RDD.
Can you please suggest a performance-effective way of reading data by key from the paired RDD?
Do the following:
rdd2 = rdd1.sortByKey()
rdd2.lookup(key)
This will be fast.
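A sketch of that idea in spark-shell, with a made-up pair RDD; sorting by key installs a range partitioner, so lookup can go straight to the partition that holds the key instead of scanning all of them:

// A hypothetical pair RDD, e.g. the result of loading and transforming HBase data.
val pairs = sc.parallelize(Seq(("k1", 10), ("k2", 20), ("k1", 30)))

// sortByKey repartitions the data by key range.
val sorted = pairs.sortByKey()

// Returns all values for the key, collected on the driver, e.g. Seq(10, 30).
val values = sorted.lookup("k1")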
That is a tough use case. Can you use some datastore and index it?
Check out Splice Machine (Open Source).
Only from the driver, you can use rdd.lookup(key) to return all values associated with the provided key.
You can use
rddName.take(5)
where 5 is the number of topmost elements to be returned. You can change the number accordingly.
Also, to read the very first element, you can use
rddName.first
