Consistent view of data in SPARK Data Frame - apache-spark

Considering lazy evaluation, actions, etc., my understanding from others is that:
if I make repeated access to a dataframe,
that was built from, say, a Hive table,
that (the Hive table) is subject to mutation,
then this changed data will show up on every dataframe operation that is issued subsequently.
How can I get a consistent dataframe then a la ORACLE's read consistency model, other than copying to a separate non-mutable Hive table?
I am assuming that a TempView will solve the problem, or is that not so? Actually, I think not - there would be performance issues.
Ideally I would like the dataframe with all records persisted, but maybe that is not how it works with the lazy protocol.

How can I get a consistent dataframe then a la ORACLE's read consistency model, other than copying to a separate non-mutable Hive table?
There is simply no such option.
Naively one could suggest cache and forced evaluation:
val df: DataFrame = ???
df.cache // Default StorageLevel - MEMORY_AND_DISK
df.foreach(_ => ())
but it just doesn't provide the required guarantees, especially in case of node failures. You could increase reliability by setting the StorageLevel to MEMORY_AND_DISK_2, but it can still result in silent correctness errors.
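For illustration, a minimal sketch of that replicated variant, assuming the same df as above:
import org.apache.spark.storage.StorageLevel

// Replicate each cached block to a second node, then force full materialization
// with a cheap action.
df.persist(StorageLevel.MEMORY_AND_DISK_2)
df.foreach(_ => ())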
So to be blunt - Spark is not a database; don't try to treat it like one. If you already use Hive and mutable state, then skip Spark and use Hive's ACID and transaction options.

Related

How to spark partitionBy/bucketBy correctly?

Q1. Will an ad hoc (dynamic) repartition of the data a line before a join help to avoid shuffling, or will the shuffling happen anyway at the repartition so there is no way to escape it?
Q2. Should I repartition/partitionBy/bucketBy? What is the right approach if I will join on the columns day and user_id in the future? (I am saving the results as Hive tables with .write.saveAsTable.) I guess partition by day and bucket by user_id, but that seems to create thousands of files (see Why is Spark saveAsTable with bucketBy creating thousands of files?).
Some 'guidance' off the top of my head, noting that title and body of text differ to a degree:
Question 1:
A JOIN will do any (hash) partitioning / repartitioning required automatically - if needed and if not using a Broadcast JOIN. You may set the number of partitions for shuffling or use the default - 200. Bear in mind there is more than one party (DF) to consider in a join.
repartition is a transformation, so any up-front repartition may not be executed at all due to Catalyst optimization - see the physical plan generated from .explain. That's the deal with lazy evaluation - whether something is necessary is only determined upon Action invocation.
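For illustration, a minimal sketch (spark is the SparkSession; the table and column names are made up) showing how to check whether an up-front repartition actually survives into the physical plan:
import org.apache.spark.sql.functions.col

val a = spark.table("events")
val b = spark.table("users")

// Up-front repartition on the join key; Catalyst may still insert its own Exchange.
val joined = a.repartition(col("user_id")).join(b, Seq("user_id"))

// Inspect the physical plan to see where the shuffle actually happens.
joined.explain(true)

// The shuffle partition count used for joins / aggregations (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "400")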
Question 2:
If you have a use case to JOIN certain input / output regularly, then using Spark's bucketBy is a good approach. It obviates shuffling. The Databricks docs show this clearly.
A Spark schema using bucketBy is NOT compatible with Hive, so these remain Spark-only tables, unless this has changed recently.
Using Hive partitioning as you state depends on push-down logic, partition pruning, etc. It should work as well, but you may end up with a different number of partitions inside the Spark framework after the read. It's a bit more complicated than saying: I have N partitions, so I will get N partitions on the initial read.
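If it helps, a minimal sketch of the layout the question asks about - partitioned by day, bucketed by user_id (resultDf, the bucket count and the table name are made up):
// Partition directories by day, a fixed number of buckets by user_id inside each.
// Keeping the bucket count modest limits the file explosion mentioned in the question.
resultDf.write
  .partitionBy("day")
  .bucketBy(64, "user_id")
  .sortBy("user_id")
  .saveAsTable("analytics.results")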

Spark SQL joining multiple tables design

I am developing a Spark SQL analytics solution using a set of tables. Suppose there are 5 tables which I need to build my solution, and finally I am creating one output table.
Here is my flow
dataframe1 = table1 join table2
dataframe2 = dataframe1 join table3
dataframe3 = dataframe2 + filter + agg
dataframe4 = dataframe3 join table4 join table5
// finally
dataframe4.saveAsTable
When I save the final dataframe, that's when all the above dataframes are evaluated.
Is my approach good, or
do I need to cache/persist intermediate dataframes?
This is a very generic question and it is hard to provide a definitive answer.
Depending on the size of the tables, you would want to use a broadcast hint for any of the tables that are relatively small.
You can do this via
table_i.join(broadcast(table_j), ....)
This behaviour depends on the value of spark.sql.autoBroadcastJoinThreshold.
Now the broadcast hint will be honoured only if Spark is able to evaluate the size of the table, so you might need to cache() it first.
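For example, a sketch with placeholder table names:
import org.apache.spark.sql.functions.broadcast

// Cache the small side so its size is known, then force a broadcast join.
val dim = spark.table("small_dim").cache()
val joined = spark.table("big_fact").join(broadcast(dim), Seq("key"))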
Another option is Spark checkpointing, which can help to truncate the logical plan for optimisation (it also allows you to resume jobs from the checkpoint location; it is similar to writing to HDFS but with some overhead).
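A sketch of that option, with a placeholder checkpoint directory and the dataframe2 from the flow above:
// Checkpointing materializes the data and truncates the lineage / plan.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
val dataframe2Checkpointed = dataframe2.checkpoint()   // eager by default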
In case of broadcasting tables of a few hundred MB, you might need to increase your Kryo buffer:
--conf spark.kryoserializer.buffer.max=1g
It also depends which join types you will use.
You would probably want to do the filter and aggregation as early as possible, since it will reduce the join surface.
There are many other considerations to take into account in order to properly optimise this. In case of a power-law distribution of join keys in any of the joins, you would need to salt the keys and explode the smaller table.
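A minimal illustration of such salting - largeDf, smallDf, the join key and the salt factor are all made up for the sketch:
import org.apache.spark.sql.functions._

val saltFactor = 16

// Random salt on the large (skewed) side...
val saltedLarge = largeDf.withColumn("salt", (rand() * saltFactor).cast("int"))

// ...and every possible salt value on the small side, so each salted key matches.
val saltedSmall = smallDf.withColumn("salt",
  explode(array((0 until saltFactor).map(lit): _*)))

val joined = saltedLarge.join(saltedSmall, Seq("join_key", "salt"))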
In your case, in principle, there is not really a cache or persist required. Why?
As there are no reuse paths evident (for other Actions or other Transformations within the same Action), it is all sequential.
Also, lazy evaluation and Catalyst.
Try the .explain and see how Spark will process.
However, due to memory eviction possibilities on the Cluster, there may be the need to re-compute on a Worker. There are various settings that you could apply via .cache and .persist, but Spark handles memory and disk spills without explicit .cache or .persist. See https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Also, using .cache can affect performance. So use .explain. See here an excellent posting: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
So, each case is different, but yours seems OK to answer as I have. In summary: an RDD or DF that is not cached nor check-pointed is re-evaluated each time an Action is invoked on that RDD or DF, or if it is re-accessed within the current Action and no skipped-stage situation applies. In your case there is no issue; doing otherwise would in fact slow your App down.

When should we go for Spark-sql and when should we go for Spark RDD

In which scenarios should we prefer Spark RDDs to write a solution, and in which scenarios should we choose to go for Spark SQL? I know Spark SQL gives better performance and works best with structured and semi-structured data. But what other factors do we need to take into consideration when choosing between Spark RDDs and Spark SQL?
I don't see much reasons to still use RDDs.
Assuming you are using a JVM-based language, you can use Dataset, which is the mix of Spark SQL + RDD (DataFrame == Dataset[Row]); according to the Spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is that Python does not support Dataset, so you will use RDDs and lose the Spark SQL optimizations when you work with non-structured data.
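For reference, a minimal typed Dataset sketch in Scala (the case class and data are made up):
case class Click(userId: Long, url: String)

import org.apache.spark.sql.Dataset
import spark.implicits._

// A typed Dataset: compile-time types plus the Spark SQL execution engine.
val clicks: Dataset[Click] = Seq(Click(1L, "/home"), Click(2L, "/cart")).toDS()
val cartClicks = clicks.filter(_.url.startsWith("/cart"))   // typed lambda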
I found DFs easier to use than DSs - the latter are still subject to development imho. The comment on pyspark is indeed still relevant.
RDDs are still handy for zipWithIndex, to put ascending, contiguous sequence numbers on items - see the sketch below.
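A sketch of that pattern, assuming an existing DataFrame df and a made-up column name seq_no:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach ascending, contiguous sequence numbers via the RDD API,
// then rebuild a DataFrame with the extra column.
val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val indexedDf = spark.createDataFrame(
  indexedRows,
  StructType(df.schema.fields :+ StructField("seq_no", LongType, nullable = false))
)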
DFs / DSs use a columnar representation and have better Catalyst (Optimizer) support.
Also, many things with RDDs are painful, like a JOIN requiring key-value pairs and a multi-step join if you need to JOIN more than 2 tables. They are legacy. The problem is that the internet is full of legacy, and thus RDD jazz.
RDD
RDD is a collection of data distributed across the cluster, and it handles both unstructured and structured data. It is typically the functional way of handling data.
DF
Data frames are basically a two-dimensional array of objects defining the data in rows and columns. It's similar to relational tables in a database. A data frame handles only structured data.
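A small contrast, with made-up paths and data:
// Unstructured text handled as an RDD...
val lines = spark.sparkContext.textFile("hdfs:///logs/app.log")
val errors = lines.filter(_.contains("ERROR"))

// ...versus structured, named columns handled as a DataFrame.
import spark.implicits._
val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
people.filter($"age" > 30).show()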

Lazy loading of partitioned parquet in Apache Spark

As I understand it, Apache Spark uses lazy evaluation. So for example code like the following that consists only of transformations will do no actual processing:
val transformed_df = df.filter("some_field = 10").select("some_other_field", "yet_another_field")
Only when we do an "action" will any processing actually occur:
transformed_df.show()
I had been under the impression that load operations are also lazy in spark. (See How spark loads the data into memory.)
However, my experiences with spark have not borne this out. When I do something like the following,
val df = spark.read.parquet("/path/to/parquet/")
execution seems to depend greatly on the size of the data in the path. In other words, it's not strictly lazy. This is inconvenient if the data is partitioned and I only need to look at a fraction of the partitions.
For example:
df.filter("partitioned_field = 10").show()
If the data is partitioned in storage on "partitioned_field", I would have expected spark to wait until show() is called, and then read only data under "/path/to/parquet/partitioned_field=10/". But again, this doesn't seem to be the case. Spark appears to perform at least some operations on all of the data as soon as read or load is called.
I could get around this by only loading /path/to/parquet/partitioned_field=10/ in the first place, but this is much less elegant than just calling "read" and filtering on the partitioned field, and it's harder to generalize.
Is there a more elegant preferred way to lazily load partitions of parquet data?
(To clarify, I am using Spark 2.4.3)
I think I've stumbled on an answer to my question while learning about a key distinction that is often overlooked when talking about lazy evaluation in spark.
Data is lazily evaluated, but schemas are not. So if we are reading parquet, which is a structured data type, spark does have to at least determine the schema of any files it's reading as soon as read() or load() is called. So calling read() on a large number of files will take longer than on a small number of files.
Given that partitions are part of the schema, it's less surprising to me now that spark has to look at all of the files in the path to determine the schema before filtering on a partition field.
It would be convenient for my purposes if spark were to wait until schema evaluation was strictly necessary and was able to filter on partition fields prior to determining the rest of the schema, but it sounds like this is not the case. I believe Dataset objects always must have a schema, so I'm not sure there's a way around this problem without significant changes to Spark.
In conclusion, it seems like my only option currently is to pass in a list of paths for the partitions that I need rather than the base path if I want to avoid evaluating the schema over the entire data repository.
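One way to do that while still keeping the partition column is the basePath option (paths here are placeholders):
// Read only the partitions that are needed, but keep "partitioned_field"
// as a column by pointing Spark at the table root via basePath.
val df = spark.read
  .option("basePath", "/path/to/parquet/")
  .parquet("/path/to/parquet/partitioned_field=10/")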

use spark to scan multiple cassandra tables using spark-cassandra-connector

I have a problem with how to use Spark to manipulate/iterate/scan multiple Cassandra tables. Our project uses Spark and the spark-cassandra-connector to connect to Cassandra and scan multiple tables, trying to match related values in different tables and, if matched, take an extra action such as inserting into a table. The use case is like below:
sc.cassandraTable(KEYSPACE, "table1").foreach(row => {
  val company_url = row.getString("company_url")
  sc.cassandraTable(KEYSPACE, "table2").foreach(row2 => {
    val url = row2.getString("url")
    val value = row2.getString("value")
    if (company_url == url) {
      sc.saveToCassandra(KEYSPACE, "target", SomeColumns(url, value))
    }
  })
})
The problems are
As a Spark RDD is not serializable, the nested search will fail because sc.cassandraTable returns an RDD. The only way I know to work around this is to use sc.broadcast(sometable.collect()). But if sometable is huge, the collect will consume all the memory. Also, if several tables in the use case use broadcast, it will drain the memory.
Instead of broadcast, can RDD.persist handle the case? In my case, I would use sc.cassandraTable to read all tables into RDDs and persist them back to disk, then retrieve the data back for processing. If that works, how can I guarantee that the RDD read is done in chunks?
Other than Spark, is there any other tool (like Hadoop, etc.?) which can handle this case gracefully?
It looks like you are actually trying to do a series of Inner Joins. See the
joinWithCassandraTable Method
This allows you to use the elements of one RDD to do a direct query on a Cassandra table. Depending on the fraction of data you are reading from Cassandra, this may be your best bet. If the fraction is too large, though, you are better off reading the two tables separately and then using the RDD.join method to line up the rows.
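A rough sketch of the first approach, assuming table2's partition key is url and using the column names from the question:
import com.datastax.spark.connector._

// For each company_url from table1, query table2 directly instead of
// nesting full-table scans inside a foreach.
val joined = sc.cassandraTable(KEYSPACE, "table1")
  .map(row => Tuple1(row.getString("company_url")))
  .joinWithCassandraTable(KEYSPACE, "table2")

joined.map { case (_, t2Row) =>
  (t2Row.getString("url"), t2Row.getString("value"))
}.saveToCassandra(KEYSPACE, "target", SomeColumns("url", "value"))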
If all else fails you can always manually use the CassandraConnector Object to directly access the Java Driver and do raw requests with that from a distributed context.
