Efficiently creating two RDDs in Spark - apache-spark

I'm investigating using Spark for a project with hundreds of GB of data being generated per hour. I'm struggling with getting the first step optimal, though it feels like it should be simple!
Suppose there is a daily process that splits the data into two parts, neither of which are small enough to cache in memory. Is there a way to do this in a single spark job without the source data having to be loaded from HDFS and parsed twice?
Say I want to write all the "Dog" events into a new HDFS location, and all the "Cat" events into another. As soon as I specify the first "write to file" action (for the dogs), Spark will set off loading each file into RAM in turn and filtering out all the dogs. Then it'll have to re-parse every file to find the cats, even though it could have done both at once. It could have loaded each partition of the raw data and written out the cats and dogs for that partition in one go.
If the data would fit in memory I could filter down to both types of events, then cache, then do the two writes. But it doesn't, and I can't see how to make spark do this on a per-partition basis as it goes along. I've studied the API docs thinking I must be missing something...
Any advice would be greatly appreciated, thanks!

Related

Best option for storage in spark

A third party is producing a complete daily snapshot of their database table (Authors) and is storing it as a Parquet file in S3. Currently the number of records are around 55 million+. This will increase daily. There are 12 columns.
Initially I want to take this whole dataset and do some processing on the records, normalise them and then block them into groups of authors based on some specific criterias. I will then need to repeat this process daily, and filter it to only include authors that have been added or updated since the previous day.
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thoughts are that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job that will then be deployed unto the same EMR cluster, and will read events from a Kafka topic and do a quick search to see which blocks of data are related to that event, and then it will do some matching (pairwise) against each item of that block.
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third party database table dump is going to be an initial goal. Later on, there will be quite possibly 10s or even 100s of other sources that I would need to do matching against. Which means trillions of data that are blocked and those blocks need to be stored somewhere. Would this option still be viable at that stage?

How to speed up spark sql filter queries if the where clause is already fixed?

In my case, the data resides in spark tables which are created by calling createOrReplaceTempView API on a dataframe. Once the table is created, several queries are going to run on top of the table. Most of the time, the where query is going to be based on a particular column. The concerned columns' name is already known. I would like to know if some sort of optimizations can be done to improve the performance of the filter query.
I tried exploring the approach of indexing but it turns out spark does not support indexing a particular column.
Have you looked at the SPARK UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time is spent. Learn to read the SPARK UI and find where the real bottleneck is. The SQL tab is a really great way to start figuring things out.
Here's some tricks to run faster in spark that apply to most jobs:
Can you reframe the problem? Was the data you are using in a format that helps you solve the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it to have it stored in the best format to help you solve the issue you want to solve?) Most performance gains come from changing the parameters of the problem to make them easier/faster to solve.
What format (is the incoming data) you are
storing the data in? Are you using Parquet/Orc? They have a great payoff disk space/compression that are worth using. They also can enable file level filter to speed read. Is their transformation work that you can push upstream to help make the query do less work? Can you be writing the data via a partition schema that would aid lookups?
How many files is your input? Can you consolidate files to maximize read throughput. Reading/listing a lot of small files as input slows down the processing of data.
If the tempView query is of similar size every time you could look at tweaking the partition count so that files are smaller but approximately the size of your HDFS block size. (Assuming you are using hdfs). HDFS you have to read an entire block weather you use all the data or not. Try and fit this to some multiple of your executors so that you are finishing together and not straggling. This is hard to get perfect but you can make decent strides to find a good ratio.
There is no need to optimize filter conditions with spark. spark already is smart enough to optimize its conditions post where query to fetch minimum rows first. The best I guess you can do is by persisting your TempView if querying the same view again and again.

Apache PySpark - Reading a directory without scanning the files

We have a growing data lake of logs we keep on google storage. The data is partitioned by dates (and other stuff such as env=production/staging). Imagine the path gs://bucket/data/env=*/date=*
We begin an app or an analysis by creating dataframes that can be queried later on for processing. The problem is that creating the DFs takes a long time even before we do actions on them. In other words the following command takes a long time because Spark seems to be scanning all the files inside (and as I mentioned, the amount of data keeps growing).
df = spark.read.load("gs://bucket/data/", schema=data_schema, format="json")
Note that we provide the schema here. Also, after the data is loaded the partitioning works well, that is, if we filter by day we do get the speed-up that we expect. We don't want to read a specific partition from the get go, we would like to have everything in one DF and read only what we need later on.

Most efficient way to load many files in spark in parallel?

[Disclaimer: While this question is somewhat specific, I think it circles a very generic issue with Hadoop/Spark.]
I need to process a large dataset (~14TB) in Spark. Not doing aggregations, mostly filtering. Given ~30k files (250 part files, per month for 10 years, each part being ~ 200MB), I would like to load them into a RDD/DataFrame and filter out items based on some arbitrary filters.
To make the listing of the files efficient (I'm on google dataproc/cloud storage, so the driver doing a wildcard glob was very serial and very slow), I precalculate an RDD of the file names, then load them into an RDD (I'm using avro, but file type shouldn't be relevant), e.g.
#returns an array of files to load
files = sc.textFile('/list/of/files/').collect()
#load the files into a dataframe
documents = sqlContext.read.format('com.databricks.spark.avro').load(files)
When I do this, even on a 50-worker cluster, it seems that only one executor is doing the work of reading the files. I've experimented with broadcasting the files list and read a dozen different approaches but I can't seem to crack the issue.
So, is there an efficient way to create a very large dataframe from multiple files? How do I best take advantage of all the potential computing power when creating this RDD?
This approach works very well on smaller sets but, at this size, I see a large number of symptoms like long-running processes with no feedback. Is there some treasure trove of knowledge -- besides #zero323 :-) -- on optimizing spark at this scale?
Listing 30k files shouldn't be an issue for GCS - even if single GCS list request that lists up to 500 files at a time will take 1 second each, all 30k files will be listed in a minute or so. There could be some corner cases with some glob patterns that make it slow, but there were recent optimizations in GCS connector globbing implementation that could help.
That's why it should be good enough for you to just rely on default Spark API with globbing:
val df = sqlContext.read.avro("gs://<BUCKET>/path/to/files/")

Spark: Importing Data

I currently have a spark app that reads a couple of files and forms a data frame out of them and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double check that since all machines on the cluster can access the files (which is a requirement by spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app but since it only shows what actions were performed by which machines and since "sc.textFile(filePath)" is not an action I couldn't be sure what machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute them at all?
In other words, should I be doing maximum amount of my work transforming and performing actions on RDDs than converting them into arrays or some other data structure and then implementing the logic like I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the master. After that it's just a plain array. Indeed the idea is that you should not use collect if you want to perform distributed computations.

Resources