Spark: Importing Data - apache-spark

I currently have a Spark app that reads a couple of files, forms a data frame out of them, and applies some logic to the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double-check that, since all machines on the cluster can access the files (which is a requirement of Spark), the task of reading in data from these files is distributed and no one machine is burdened by it.
I was looking at the Spark UI for this app, but since it only shows which actions were performed by which machines, and since "sc.textFile(filePath)" is not an action, I couldn't be sure which machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array, and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute it at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic as I would in any ordinary program?
I am only about two weeks into Spark so I apologize if these are stupid questions!

Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
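As a minimal sketch (the file path and partition count below are just placeholders), you can ask for a minimum number of partitions up front and check how the read was split across tasks:

// Minimal sketch, assuming a SparkContext sc is already available (e.g. in spark-shell).
// The path and the partition count are placeholders.
val lines = sc.textFile("/data/input/*.txt", minPartitions = 8)

// Each partition is read by whichever executor runs the corresponding task,
// so no single machine has to ingest the whole dataset.
println(s"Number of partitions: ${lines.getNumPartitions}")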
As for reading from Cassandra instead of flat files: that question is too broad, but the short answer is that you should benchmark it for yourself.
collect fetches all the data to the driver. After that it's just a plain array. Indeed, the idea is that you should not use collect if you want to perform distributed computations.
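To illustrate that last point, here is a hedged sketch (treeDF and the "height" column are stand-ins for your own data): keep the work expressed as transformations and actions on the distributed data, and only collect small final results.

import org.apache.spark.sql.functions.col

// Sketch only: treeDF stands in for your DataFrame and "height" is a hypothetical column.
// Anti-pattern: everything after collect() runs single-threaded on the driver.
val treeArr = treeDF.collect()
val tallTreesLocal = treeArr.count(row => row.getAs[Double]("height") > 10.0)

// Preferred: keep the work distributed and only bring back the small final result.
val tallTreesDistributed = treeDF.filter(col("height") > 10.0).count()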

Related

How to speed up spark sql filter queries if the where clause is already fixed?

In my case, the data resides in Spark tables which are created by calling the createOrReplaceTempView API on a dataframe. Once the table is created, several queries are going to run on top of it. Most of the time, the WHERE clause is going to be based on a particular column. The names of the columns concerned are already known. I would like to know if some sort of optimization can be done to improve the performance of the filter query.
I tried exploring the approach of indexing, but it turns out Spark does not support indexing a particular column.
Have you looked at the Spark UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time goes. Learn to read the Spark UI and find where the real bottleneck is. The SQL tab is a really great place to start figuring things out.
Here are some tricks to run faster in Spark that apply to most jobs:
Can you reframe the problem? Is the data you are using in a format that helps you solve the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it, so that it's stored in the best format for the issue you want to solve?) Most performance gains come from changing the parameters of the problem to make them easier/faster to solve.
What format are you storing the incoming data in? Are you using Parquet/ORC? They have a great payoff in disk space/compression that is worth using, and they can also enable file-level filtering to speed up reads. Is there transformation work that you can push upstream to help the query do less work? Could you write the data with a partitioning scheme that would aid lookups? (A sketch of rewriting the data in a partitioned, columnar layout follows these tips.)
How many files is your input? Can you consolidate files to maximize read throughput? Reading/listing a lot of small files as input slows down the processing of the data.
If the tempView query is of similar size every time, you could look at tweaking the partition count so that files are smaller but approximately the size of your HDFS block (assuming you are using HDFS). With HDFS you have to read an entire block whether you use all the data or not. Try to fit this to some multiple of your executors so that they finish together rather than straggling. This is hard to get perfect, but you can make decent strides toward a good ratio.
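For what it's worth, here is a rough sketch of the "pre-chew and partition" idea; rawDF, the customer_id column, the partition count and the output path are all hypothetical stand-ins for your own data:

import org.apache.spark.sql.functions.col

// Sketch only: rawDF, "customer_id" and the paths are placeholders.
// Rewrite the data once in a columnar format, partitioned by the column your
// WHERE clauses usually filter on, so later queries can prune files instead of
// scanning everything.
rawDF
  .repartition(64)                // aim for fewer, larger files
  .write
  .mode("overwrite")
  .partitionBy("customer_id")     // the column most filters use
  .parquet("/warehouse/prechewed_table")

// Later queries read only the matching partitions / row groups.
val fast = spark.read.parquet("/warehouse/prechewed_table")
  .where(col("customer_id") === "C42")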
There is no need to optimize filter conditions with Spark. Spark is already smart enough to optimize the conditions in the WHERE clause so that the minimum number of rows is fetched first. The best you can do, I guess, is to persist your temp view if you are querying the same view again and again.
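A minimal sketch of that persistence idea, assuming a SparkSession named spark and a placeholder view name:

// Sketch only: df, "my_view" and "some_col" are placeholders.
df.createOrReplaceTempView("my_view")

// Cache the underlying data so repeated queries don't re-read it from disk.
spark.catalog.cacheTable("my_view")

val repeated = spark.sql("SELECT * FROM my_view WHERE some_col = 'x'")

// Release the memory when you're done.
spark.catalog.uncacheTable("my_view")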

SnowFlake Stored Procedure Multi Threading

Being new to Snowflake, I am trying to understand how to write JavaScript-based stored procedures (SPs) to take advantage of multi-threaded/parallel processing.
My background is SQL Server and writing SPs, taking advantage of performance features such as degrees of parallelism, worker threads, indexing, and column store segment elimination.
I am starting to get accustomed to setting up the storage and using clustering keys, micro-partitioning, and any other performance feature available, but I don't get how Snowflake SPs break down a given SQL statement into parallel streams. I am struggling to find any documentation that explains the internal workings.
My concern is producing SPs that serialise everything on one thread and become bottlenecks.
I am wondering if I am applying the correct technique/ need a different mindset to developing SPs.
I hope I have explained my concern sufficiently. In essence I am building a PoC to migrate an on-premise SQL Server DWH ETL solution to Snowflake/Matillion ELT solution, one aspect being evaluating the compute virtual warehouse size I need.
Stateless UDFs will run in parallel by default; this is what I observed when importing a large amount of binary data via base64 encoding.
Stateful UDFs run in parallel on the data as controlled by the PARTITION BY and ORDER BY clauses used on the data. The only trick to remember is to always force-initialize your state, as the JavaScript instance can be reused for subsequent PARTITION BY batches, so don't rely on checking for undefined to know if it's the first row.

Best procedure to modify immutable Spark RDDs

In the past, I worked with low-level parallelization (OpenMPI, OpenMP, ...).
I am currently working on a Spark project and I don't know the best procedure for working with RDDs, because they are immutable.
I will explain my problem with a simple example, imagine that in my RDD I have an object and I need to update one attribute.
The most practical and memory-efficient way to solve this is implementing a method called setAttribute(new_value).
Spark RDDs are immutable, so I need to create a function (for example myModifiedCopy(new_value)) that returns a copy of the object but with the new_value in its attribute, and then update the RDD like this:
myRDD = myRDD.map(x->x.myModifiedCopy(new_value)).cache()
My objects are very complex and they use a lot of RAM (they are really huge). This procedure is slow: you have to create a complete copy of every element of the RDD just to modify a small value.
Is there a better procedure to deal with this kind of problems?
Do you recommend a different technology?
I would kill for a mutable RDD.
Thank you very much in advance.
I believe you have some misconceptions about Apache Spark. When you do a transformation, you aren't actually creating a whole copy of that RDD in memory; you are just "designing" the series of tiny conversions to execute on each record when you run an action.
For instance, map, filter and flatMap are transformations, and thus lazy, so when you call them you just build the plan but don't execute it. On the other hand, collect and count behave differently: they trigger all previous transformations (doing everything that was defined in the intermediate stages) until they produce the result.
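A tiny sketch of that laziness (the numbers and the computation below are made up for illustration):

// Sketch only, assuming a SparkContext sc is available; the data is made up.
val numbers = sc.parallelize(1 to 1000000)

// These lines only build the lineage/plan; nothing is computed yet and no copies are made.
val doubled  = numbers.map(_ * 2)
val filtered = doubled.filter(_ % 3 == 0)

// Only when an action runs does Spark execute the whole pipeline, record by record.
val howMany = filtered.count()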

Apache Spark Streaming maintain a binary search tree

I want to understand how to do the following thing:
I want to maintain a binary search tree (BST) with Spark.
I have 2 simple operations, and I get them in streaming.
That's why I thought about using Spark Streaming.
The operations are the following:
a) add a number to the BST
b) delete a number
*Let's assume that I don't have duplicate numbers.
How can I do it the right way? My main problem is that I'm not sure where I should keep the tree (supposing its size always fits in my RAM).
For me, the "Big Data" here is the number of operations, so I want to use Spark Streaming in order to handle the large number of operations that arrive in streams.
Again, the tree is being kept small and will always fit in RAM. (What if it doesn't?)
What would be the best approach?
In addition to that, I would like to do the same things using Stack data structure instead of BST.
The operations are only push and pop numbers.
Maybe Apache Storm will be better for those tasks?
For the stack you can use Redis, with a counter or timestamp as the key when pushing; when popping, pop the latest (see the sketch after this list).
For the BST you could use GraphX and use it as a distributed data structure.
Another approach could be with Akka, http://alexminnaar.com/building-a-distributed-binary-search-tree-with-akka.htm.
For the stack you may also use a pair RDD with a DStream, the key being the timestamp while pushing, but I'm not sure how to pop.
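For what it's worth, here is a rough sketch of the Redis-backed stack mentioned above, using the Jedis client. Rather than timestamp-scored keys it simply uses a Redis list, which is already LIFO; the host, port, key name and values are placeholders, and you would run this from driver-side code (e.g. inside foreachRDD) rather than inside a transformation:

import redis.clients.jedis.Jedis

// Sketch only: host, port and key name are placeholders.
val jedis = new Jedis("localhost", 6379)
val stackKey = "numbers-stack"

// Push: LPUSH prepends, so the head of the list is always the most recent element.
jedis.lpush(stackKey, "42")
jedis.lpush(stackKey, "7")

// Pop: LPOP removes and returns the most recently pushed element (LIFO).
val top = jedis.lpop(stackKey)   // "7"

jedis.close()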

Is it possible to get and use a JavaSparkContext from within a task?

I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The Spark context is only valid on the driver, and Spark will prevent serialization of it. Therefore it's not possible to use the Spark context from within a task.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in the driver's memory, see the previous point.
If we need to process every record, consider performing some form of join with the 'decorating' dataset - that will be only one large job instead of tons of small ones.
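As a rough sketch of that join approach in Scala (incomingDF, lookupDF and the column names are hypothetical):

import org.apache.spark.sql.functions.broadcast

// Sketch only: incomingDF, lookupDF and the column names are placeholders.
// Instead of launching a job per record, decorate all records in one join.
val decorated = incomingDF.join(
  broadcast(lookupDF),        // broadcasting is worthwhile if the lookup side is small
  Seq("lookup_key"),          // the shared key column
  "left_outer"                // keep records even when no lookup row matches
)

decorated.show()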

Resources