Spark SQL - READ and WRITE in sequence or pipeline? - apache-spark

I am working on a cost function for Spark SQL.
While modelling the TABLE SCAN behaviour I cannot understand if READ and WRITE are carried out in pipeline or in sequence.
Let us consider the following SQL query:
SELECT * FROM table1 WHERE columnA = 'xyz';
Each task:
(1) Reads a data block (either locally or from a remote node)
(2) Filters out the tuples that do not satisfy the predicate
(3) Writes the remaining tuples to disk
Are (1), (2) and (3) carried out in sequence or in a pipeline? In other words, is the data block completely read first (all the disk pages composing it), then filtered, and then rewritten to disk, or are these activities pipelined (i.e. while the (n+1)-th tuple is being read, the n-th tuple can already be processed and written)?
Thanks in advance.

Whenever you submit a job, the first thing Spark does is create a DAG (directed acyclic graph) for it.
After creating the DAG, Spark knows which tasks it can run in parallel, which tasks depend on the output of a previous step, and so on.
So, in your case:
Spark will read your data in parallel (which you can see in the partitions) and filter it within each partition.
Now, since saving requires filtering, it will wait for filtering to finish for at least one partition before it starts to save it.

After some more digging I found out that Spark SQL uses a so-called "volcano-style pull model".
According to this model, a simple scan-filter-write query would be executed in a pipeline and is fully distributed.
In other words, while a partition (HDFS block) is being read, filtering can already run on the rows read so far. There is no need to read the whole block before filtering kicks off, and writing behaves accordingly.
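A quick way to check this yourself is to inspect the physical plan: in recent Spark versions the scan and the filter are fused into a single whole-stage-codegen stage, meaning each row flows through the operator chain as it is read instead of the whole block being materialized first. A minimal sketch, assuming a SparkSession is available and reusing the table and column names from the question (the output path is made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scan-filter-write").getOrCreate()

// The query from the question.
val filtered = spark.sql("SELECT * FROM table1 WHERE columnA = 'xyz'")

// explain() prints the physical plan; operators sharing the same
// WholeStageCodegen id (the "*(1)" prefix) are compiled into one loop,
// i.e. scan and filter are pipelined row by row.
filtered.explain()

// The write is the action that triggers execution; each task streams
// its filtered rows out as it produces them (hypothetical output path).
filtered.write.mode("overwrite").parquet("/tmp/filtered_table1")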

Related

How to run apache-beam in batches on bounded data?

I am trying to understand how Apache Beam works and I'm not quite sure I do. So, I want someone to tell me if my understanding is right:
Beam is a layer of abstraction over big data frameworks like Spark, Hadoop, Google Dataflow etc. Not quite every functionality, but almost.
Beam treats data in two forms, bounded and unbounded. Bounded is something like a .csv file and unbounded is something like a Kafka subscription. There are different I/O read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and a trigger (a timestamp). A batch here would be all the data points in a window until a trigger is hit. For bounded datasets, however, the whole dataset is loaded into RAM (? If yes, how do I make Beam work in batches?). The output of an I/O method is a PCollection.
There are PTransforms (these are the operations I want to run on the data) that apply to each element of the PCollection. I can make these PTransforms run on a Spark or Flink cluster (this choice goes in the initial options set for the pipeline). Each PTransform emits a PCollection, and that is how we chain various PTransforms together. The end result is a PCollection that can be saved to disk.
The end of the pipeline could be a save to some file system (how does this happen when I am reading a .csv in batches?).
Please point out any lapses in my understanding.
Beam is not the same thing as Google Cloud Dataflow; Cloud Dataflow is a runner on top of Apache Beam. It executes Apache Beam pipelines. But you can also run an Apache Beam job with a local runner, not on the cloud. There are plenty of different runners that you can find in the documentation: https://beam.apache.org/documentation/#available-runners
One specific aspect of Beam is that it's the same pipeline for batch and streaming, and that's the purpose. You can specify --streaming as an argument to execute your pipeline; without it, it should execute in batch. But it mostly depends on your inputs; the data will just flow into the pipeline. And that's one important point: PCollections do not contain persistent data, just like RDDs in Spark.
You can apply a PTransform on part of your data; it is not necessarily applied to all the data. All the PTransforms together form the pipeline.
It really depends where and in what format you want your output...

Losing data in Spark?

Overall: Our code gets data from Snowflake (=data warehouse), performs certain transformations on it in Python with Spark, and pushes it back to Snowflake.
When I perform this task on my computer, things go well: #inputrows = #outputrows. When someone else performs it on their computer, data gets lost. The data that gets lost is random: every time this person performs the task, both different rows and a different number of rows are returned. Performing the task on an EMR cluster also doesn't return the right number of rows. We run the exact same spark-submit command.
We believe it is not in the 'writing to snowflake' part, as performing counts in the code already gives the wrong number of rows before writing to Snowflake.
Is it somehow possible to lose data when using Spark?

How to update or even reset rows in persistent table given multiple simultaneous readers?

I have an exchangeRates table that gets updated in batch once per week. It is to be used by other batch and streaming jobs, across different clusters, and thus I want to save it as a persistent, shared table for all jobs to use.
allExchangeRatesDF.write.saveAsTable("exchangeRates")
How best, then (for the batch job that manages this data), to gracefully update the table contents (actually overwrite them completely), considering the various Spark jobs that consume it, and particularly given its use in some 24/7 structured streaming queries?
I've checked the APIs; maybe I am missing something obvious! Very likely.
Thanks!
I think you expect some kind of transaction support from Spark, so that while a saveAsTable is in progress Spark would hold all writes until the update/reset has finished.
I think the best way to deal with this requirement is to append new records (using insertInto) with a batch id that denotes the rows belonging to a "new table".
insertInto(tableName: String): Unit Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
You'd then use the batch id to deal with the rows as if they were the only rows in the dataset.
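A minimal sketch of that append-with-batch-id idea, assuming the exchangeRates table was created with an extra batch_id column (the column name and the batch id scheme below are illustrative, not part of any existing schema or API):

import org.apache.spark.sql.functions.{col, lit, max}

// Writer side (the weekly batch job): tag the fresh snapshot with a batch id
// and append it to the shared table.
val batchId = System.currentTimeMillis() // could equally be a date or a run id
allExchangeRatesDF
  .withColumn("batch_id", lit(batchId))
  .write
  .insertInto("exchangeRates") // appends; note that insertInto resolves columns by position

// Reader side (batch or streaming consumers): only look at the latest batch.
val latestBatch = spark.table("exchangeRates").agg(max("batch_id")).head().getLong(0)
val currentRates = spark.table("exchangeRates").filter(col("batch_id") === latestBatch)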

Concurrent operations in spark streaming

I wanted to understand something about the internals of spark streaming executions.
If I have a stream X, and in my program I send stream X to function A and function B:
In function A, I do a few transform/filter operations etc. (X -> Y -> Z) to create stream Z. Then I do a foreach operation on Z and print the output to a file.
Then in function B, I reduce stream X -> X2 (say, the min value of each RDD) and print the output to a file.
Are both functions being executed for each RDD in parallel? How does it work?
Thanks
--- Comments from the Spark community ---
I am adding comments from the Spark community:
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would assume that the ordering wouldn't matter.
@Eswara's answer seems right, but it does not apply to your use case, since your separate transformation DAGs (X->Y->Z and X->X2) have a common DStream ancestor in X. This means that when the actions are run to trigger each of these flows, the transformation X->Y and the transformation X->X2 cannot happen at the same time. What will happen is that the partitions of RDD X will be either computed or loaded from memory (if cached) for each of these transformations separately, in a non-parallel manner.
Ideally what would happen is that the transformation X->Y would resolve and then the transformations Y->Z and X->X2 would finish in parallel as there is no shared state between them. I believe Spark's pipelining architecture would optimize for this. You can ensure faster computation on X->X2 by persisting DStream X so that it can be loaded from memory rather than being recomputed or being loaded from disk. See here for more information on persistence.
What would be interesting is if you could use the replicated storage levels *_2 (e.g. MEMORY_ONLY_2 or MEMORY_AND_DISK_2) to be able to run transformations concurrently on the same source. I think those storage levels are currently only useful against lost partitions, as the duplicate partition will be processed in place of the lost one.
Yes.
It's similar to Spark's execution model, which uses DAGs and lazy evaluation, except that streaming runs the DAG repeatedly on each fresh batch of data.
In your case, since the DAGs (or sub-DAGs of a larger DAG, if one prefers to call them that) required to finish each action (each of the 2 foreachs you have) do not have common links all the way back to the source, they run completely in parallel. The streaming application as a whole gets X executors (JVMs) and Y cores (threads) per executor, allotted at the time of application submission to the resource manager. At any time, a given task (i.e., thread) among the X*Y tasks will be executing a part or the whole of one of these DAGs. Note that any 2 given threads of an application, whether in the same executor or otherwise, can execute different actions of the same application at the same time.
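To make the two-output pattern concrete, here is a hedged Spark Streaming sketch (the socket source, batch interval and output path are made up); X is persisted so both downstream flows can reuse its partitions instead of recomputing them from the source:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("two-outputs")
val ssc = new StreamingContext(conf, Seconds(10))

// Stream X (hypothetical socket source producing integers).
val x = ssc.socketTextStream("localhost", 9999).map(_.toInt)
x.persist(StorageLevel.MEMORY_ONLY) // avoid recomputing X for both flows

// Function A: X -> Y -> Z, then an output operation on Z.
val y = x.filter(_ % 2 == 0)
val z = y.map(_ * 10)
z.foreachRDD(rdd => rdd.saveAsTextFile(s"/tmp/z-${System.currentTimeMillis}"))

// Function B: X -> X2 (min of each RDD), a second output operation.
x.foreachRDD { rdd =>
  if (!rdd.isEmpty()) println(s"batch min: ${rdd.min()}")
}

ssc.start()
ssc.awaitTermination()

Note that whether the two output operations of a single batch actually run concurrently also depends on how the streaming scheduler is configured.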

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to mappers and the driver is (somehow) similar to a reducer. As there is only one driver program, there would be only one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? Because the driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle by the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_+_)
In the above, this will be done in one stage, because the data loading (textFile), splitting (flatMap), and mapping (map) can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, since it needs to combine all of the data to perform the operation... HOWEVER, this operation has to be associative for a reason. Each node will perform the operation defined in reduceByKey locally, only merging the final data sets afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data does NOT pull back to the driver; it merely moves to the nodes that hold the same keys so that the final values can be merged there.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data pulls back to the driver and you will need room for it there.
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey
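To make that mapper/reducer analogy concrete, here is a small hedged sketch (it assumes sc and path are defined as in the snippet above): reduceByKey keeps the result distributed on the executors, while an action such as collect pulls it back to the driver.

val counts = sc.textFile(path)   // read, split and map happen in one stage
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)            // shuffle boundary: the "reducer"-like step

// Still an RDD: the counts stay partitioned across the executors.
// toDebugString shows the stage split as a change in indentation.
println(counts.toDebugString)

// Only now does data come back to the driver, which must have room for it.
val local: Array[(String, Int)] = counts.collect()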
