Improving the reading and join of multiple spark datasets in parquet - apache-spark

I have to read multiple datasets of around 5 GB each. Each directoryPath contains more than 30 files. The way I read them is with the command below, transforming each into some mapping.
Dataset<T> DatasetA = spark.read().parquet(directoryPath).as(Encoders.bean(someMapping));
This happens for more than 15 such datasets, and later I join them together based on a common key.
To improve performance there is another, smaller dataset (say LesserDataFileDataset) that can be joined in to reduce the amount of data I end up processing.
I persist LesserDataFileDataset first so I don't read it again and again.
I tried two approaches to improve how this works.
In the first, I join each of the above datasets with LesserDataFileDataset. This yields DatasetA, DatasetB, etc. with less data. After this, I perform the join operations on all of these datasets:
DatasetA
.join(DatasetB.drop(<common key, so it is selected only from DatasetA>), ColumnToJoin)
.join(DatasetC.drop(<common key, so it is selected only from DatasetA>), ColumnToJoin)
In the second approach, I tried to join the datasets together with LesserDataFileDataset at the end, like below:
LesserDataFileDataset
.join(DatasetA, ColumnToJoinSequenceWithOneColumn, "left_outer")
.join(DatasetB, ColumnToJoinSequenceWithOneColumn, "left_outer")
.join(DatasetC, ColumnToJoinSequenceWithOneColumn, "left_outer")
This yields similar performance for reading and joining.
Another way I tried to improve performance was reading file1 of DatasetA, file1 of DatasetB ... and then joining them with LesserDataFileDataset in a manner similar to the two methods above. This had a detrimental effect: performance actually went down. Can someone help?
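For reference, the first approach boils down to something like the minimal PySpark sketch below; the paths, common_key, column_to_join and the persist level are placeholders, not taken from the original Java code.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Persist the smaller dataset once so it is not re-read for every join.
lesser = spark.read.parquet("/path/to/lesser").persist(StorageLevel.MEMORY_AND_DISK)
lesser_keys = lesser.select("common_key").distinct()  # keep only the join keys

dataset_a = spark.read.parquet("/path/to/A")
dataset_b = spark.read.parquet("/path/to/B")

# Shrink each large dataset against the smaller one first,
# then join the shrunken datasets on the common column.
a_small = dataset_a.join(lesser_keys, "common_key")
b_small = dataset_b.join(lesser_keys, "common_key")
result = a_small.join(b_small.drop("common_key"), "column_to_join")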

Related

HashPartitioning dataframes to achieve co-partitioning during join in PySpark

I am trying to figure out the best way to achieve co-partitioning on my two datasets to eliminate join-related shuffles. I'm working with 2 dataframes, A and B, where A contains minimal user data including a field for the event IDs they interacted with, and B contains detailed information about the events. I am trying to join on 3 fields: day, event_type, and event_id. A and B need to be read from disk as they will be written to and read from by external clients on an ongoing basis.
The main goal of the project I'm working on is to enable the ability to quickly:
Filter by event_type
Join raw event details to user IDs
I understand that in order to achieve #1 I probably need to partition my parquet files on event_type so that the directory structure achieves easier filtering. In order to achieve #2 I should try to minimize shuffles as much as possible by means of co-partitioning keys from the two dataframes.
The data I'm working with consists of 3 days of event data (~12M rows per event type) and the goal is to get this working efficiently for 1-3 years of data.
In order to improve my join I first begin by filtering on the event_type I am interested in to narrow down the data on both dataframes. I then do the actual join on day and event_id. This naturally will result in shuffles since there is no co-partitioning so I've tried to address that using hash partitioning.
I read that repartition implements hash partitioning on the specified columns. I save my dataframes to disk and also include a partitionBy('day', 'event_type') in order to achieve better performance on filtering/grouping operations.
A\
.repartition('day', 'event_id')\
.write\
.partitionBy('day', 'event_type')\
.mode('overwrite')\
.parquet('/path/to/A')
B\
.repartition('day', 'event_id')\
.write\
.partitionBy('day', 'event_type')\
.mode('overwrite')\
.parquet('/path/to/B')
...
...
A = spark.read.parquet('/path/to/A')
B = spark.read.parquet('/path/to/B')
A.filter(col('event_type') == 'X')\
.join(B.filter(col('event_type') == 'X'), on=['day', 'event_id'], how='inner')\
.show()
When I execute this I still see a shuffle exchange in the plan, as well as shuffle writes which take up around 5-10 GB each. I also see longer executor compute times of around 21-41s, which might not seem like much on 3 days of data but could blow up for yearly data.
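A quick way to see that shuffle is to print the physical plan and look for Exchange hashpartitioning nodes on both sides of the join, e.g. by calling explain on the same query (illustrative only):

from pyspark.sql.functions import col

# Same join as above, but inspecting the plan instead of showing rows.
# An "Exchange hashpartitioning(day, event_id, ...)" on both inputs means
# the on-disk layout is not being reused for the join.
A.filter(col('event_type') == 'X')\
    .join(B.filter(col('event_type') == 'X'), on=['day', 'event_id'], how='inner')\
    .explain()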
I am wondering what's a better way I can go about doing this - or if it is even possible to eliminate shuffles when working with dataframes? Answers to this question seem to suggest that it might be possible but not a great idea?
I am not even sure that doing both a repartition and a partitionBy is the correct approach. Is the initial partitioning using repartition() preserved at all when I re-read the parquet files from disk? I have read that this might not be the case - overall the information available seems either conflicting or without explicit sources attached.
Thank you for taking the time to help.

Get PySpark to output one file per column value (repartition / partitionBy not working)

I've seen many answers and blog posts suggesting that:
df.repartition('category').write.partitionBy('category')
Will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way to get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write.partitionBy('category')
df.repartition(2, 'category').write.partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write.partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, a separate count is expensive if df is not cached (and/or so big that it would be wasteful to cache it just for this purpose), and any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputs along the way.
The "fastest" way probably depends on the actual hardware set-up and actual data (in case it is skewed). To my knowledge, I also agree that df.repartition('category').write().partitionBy('category') will not help solving your problem.
We faced a similar problem in our application, but instead of first doing a count and then the repartition, we separated the writing of the data and the requirement of having only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure, reads the data per folder/partition, coalesces it to one partition and overwrites it back. Again, I cannot tell whether that is also the fastest way in your environment, but for us it did the trick.
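A rough PySpark sketch of that second, compacting job, assuming the data was partitioned by 'category' and the partition folders can be listed from the driver (paths are placeholders); it writes to a separate target rather than overwriting in place:

import os

base_path = "/path/to/partitioned/output"      # hypothetical source location
compacted = "/path/to/partitioned/compacted"   # hypothetical target location

# Each sub-folder corresponds to one partition value, e.g. "category=books".
for folder in os.listdir(base_path):
    if not folder.startswith("category="):
        continue
    (spark.read.parquet(os.path.join(base_path, folder))
          .coalesce(1)                           # one output file per partition value
          .write.mode("overwrite")
          .parquet(os.path.join(compacted, folder)))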
Doing some research on this topic led me to the Auto Optimize Writes feature on Databricks for writing to a Delta Table. There, they use a similar approach: first writing the data and then running a separate OPTIMIZE job to aggregate the files into a single file. In the mentioned link you will find this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: make sure to keep the configuration spark.sql.files.maxRecordsPerFile at 0 (the default value) or at a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the "category" column.
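For example, the setting can be inspected or reset on the session (a small illustrative snippet, not part of the original answer):

# 0 (the default) or a negative value disables the per-file record limit;
# a positive value would split a partition's output into multiple files.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)
print(spark.conf.get("spark.sql.files.maxRecordsPerFile"))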
You can try coalesce(n); coalesce decreases the number of partitions and is an optimized version of repartition.
n = the number of partitions you want in the output.
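As a hedged illustration of the trade-off: coalescing all the way down to one partition should give a single file per category value (each task writes one file per partition value it holds, absent a maxRecordsPerFile limit), but it removes all write parallelism:

(df.coalesce(1)
   .write
   .partitionBy('category')
   .mode('overwrite')
   .parquet('/path/to/output'))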

Spark Dataset join performance

I receive a Dataset and I am required to join it with another table. The simplest solution that came to my mind was to create a second Dataset for the other table and perform a joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
  val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
  dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer directly perform the join without referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take, etc. operations, try to apply them before joining the two datasets. Spark can't push these kinds of filters down for you in every case, so you have to reduce the number of target records as much as possible yourself. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartition should be based on the keys that participate in the join, i.e.:
dogs
  .repartition(1024, "key_col1", "key_col2")
  .join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure it can fit in memory (or increase the value of spark.broadcast.blockSize). This gives a certain boost to the performance of your Spark program since it ensures the co-existence of the two datasets within the same node.
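For illustration, the broadcast hint for point c looks roughly like this in PySpark (the Scala API exposes an equivalent broadcast function in org.apache.spark.sql.functions):

from pyspark.sql.functions import broadcast

# Hint that `cats` is small enough to ship to every executor,
# so the larger `dogs` dataset does not need to be shuffled for the join.
dogs.join(broadcast(cats), ["key_col1", "key_col2"], "inner")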
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
You need to do an explain and see whether predicate pushdown is used. Then you can judge whether your concern is correct or not.
However, in general, if no complex datatypes are used and there are no obvious datatype mismatches, pushdown does take place. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html

Does joining additional columns in Spark scale horizontally?

I have a dataset with about 2.4M rows, with a unique key for each row. I have performed some complex SQL queries on some other tables, producing a dataset with two columns, a key and the value true. This dataset is about 500 rows. Now I would like to (outer) join this dataset with my original table.
This produces a new table with a very sparse set of values (true in about 500 rows, null elsewhere).
Finally, I would like to do this about 200 times, giving me a final table of about 201 columns (the key, plus the 200 sparse columns).
When I run this, I notice that as it runs it gets considerably slower. The first join takes 2 seconds, then 4s, then 6s, then 10s, then 20s and after about 30 joins the system never recovers. Of course, the actual numbers are irrelevant as that depends on the cluster I'm running, but I'm wondering:
Is this slowdown expected?
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
Are there other things I can do when combining lots of columns in spark?
Calling explain on each join in the loop shows that each join is getting more complex (it appears to include all previous joins, and it also includes the complex SQL queries, even though those have been checkpointed). Is there a way to really checkpoint so that each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
Is this slowdown expected?
Yes, to some extent it is. Joins belong to the most expensive operations in data-intensive systems (it is not a coincidence that products which claim linear scalability usually take joins off the table). A join-like operation in a distributed system typically requires data exchange between nodes, hitting a bunch of high-latency numbers.
In Spark SQL there is also the additional cost of computing the execution plan, which has larger-than-linear complexity.
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
No. Input format doesn't affect join logic at all.
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
If they are truly excluded from the final output they will be pruned from the execution plan. But since you join them in for a reason, I assume that is not the case and they are required for the final output.
Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
show computes only a small subset of data required for the output. It doesn't cache, although shuffle files might be reused.
(it appears to include all previous joins, and it also includes the complex SQL queries, even though those have been checkpointed).
Checkpoints are created only if data is fully computed and don't remove stages from the execution plan. If you want to do it explicitly, write partial result to persistent storage and read it back at the beginning of each iteration (it is probably an overkill).
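A rough PySpark sketch of that explicit write-and-re-read approach (base_df, sparse_dfs, the "key" column and the paths below are placeholders, not from the original question):

checkpoint_base = "/tmp/wide_table_checkpoint"   # hypothetical location

wide = base_df  # the 2.4M-row table with the unique key
for i, sparse_df in enumerate(sparse_dfs):       # the ~200 small (key, true) tables
    wide = wide.join(sparse_df, "key", "left_outer")
    # Materialize and re-read so the next iteration starts from a fresh, flat
    # plan instead of an ever-growing chain of joins. A new path is used each
    # time to avoid overwriting data that is still being read.
    path = f"{checkpoint_base}_{i}"
    wide.write.mode("overwrite").parquet(path)
    wide = spark.read.parquet(path)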
Are there other things I can do when combining lots of columns in spark?
The best thing you can do is to find a way to avoid joins completely. If the key is always the same, then a single shuffle and operations on groups / partitions (with *byKey methods or window functions) might be a better choice.
However if you
have a dataset with about 2.4M rows
then using a non-distributed system that supports in-place modification might be a much better choice.
In the most naive implementation you can compute each aggregate separately, sort by key and write to disk. Then the data can be merged together line by line with a negligible memory footprint.
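As a very rough sketch of that merge step outside Spark (assuming each per-aggregate file holds one row per key in the same sorted order; the file names and CSV layout are assumptions):

import csv
from contextlib import ExitStack

# Hypothetical per-aggregate outputs, each sorted by key with one row per key.
paths = ["agg_001.csv", "agg_002.csv", "agg_003.csv"]

with ExitStack() as stack, open("merged.csv", "w", newline="") as out:
    readers = [csv.reader(stack.enter_context(open(p, newline=""))) for p in paths]
    writer = csv.writer(out)
    # Walk all files in lock-step; each step touches one line per file,
    # so memory use stays negligible regardless of the number of rows.
    for rows in zip(*readers):
        key = rows[0][0]
        writer.writerow([key] + [cell for row in rows for cell in row[1:]])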

Perl : Reading and processing multiple files in parallel

I have one file (let's call it enrolled_students.txt) that I need to read in Perl. Each line of this file contains data that requires referring to other files for some more information.
For example, the main database will have names and addresses. But depending on the nationality of each person, I have to refer to other files (sorted by country) to find the matching name, the nationality and home address.
Let's say I have 100 name_of_country.txt files and there are 10,000 lines in my enrolled_students.txt. My questions are:
Do I read each line in enrolled_students.txt and parse the other 100 files one by one to find a match? That seems like an awful way to process this data. Is there a faster way to do this?
Can I execute this process in parallel mode (multithread)?
Thanks,
Hans
What you are trying to do here is similar to what a database engine has to do when joining data from two tables together. A database engine will typically have a number of different join plans to choose from, and it will attempt to choose the best one based on what it knows about the data in each table.
The same applies to you. There are several ways to join the data and the best way will depend on factors such as the size of each of the input files, whether they are pre-sorted, etc.
Some possible approaches:
A 'Nested Loop', where you read each line of the enrolled_students.txt file and, for each of those, iterate through the other file(s) to find a match. This is not likely to be very fast; you would probably only choose it if the files were too large to make any other solution practical.
A 'Hash Join', where you read one half of the data to be joined (in your example, probably the name_of_country.txt files) into a data structure indexed by a hash. Then for each row of the other file you look up the corresponding row in the hash. This can be quite high performance, as long as there is enough memory to store at least one of the two sets of data at once; a rough sketch of this idea follows the list.
If both files are in some sorted order, sorted according to the same key, you might be able to use a 'Merge Join'. This is where you read rows from both files at once, matching the records together like teeth in a zipper.
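A language-agnostic sketch of the hash-join idea (shown in Python for brevity; the file layout, tab separation and key position are assumptions):

# Build phase: index the smaller side (one country file) by the join key.
lookup = {}
with open("name_of_country.txt") as country_file:
    for line in country_file:
        fields = line.rstrip("\n").split("\t")   # assumed tab-separated
        lookup[fields[0]] = fields                # assumed key in the first column

# Probe phase: stream the larger file and look each row up in the hash.
with open("enrolled_students.txt") as students:
    for line in students:
        fields = line.rstrip("\n").split("\t")
        match = lookup.get(fields[0])
        if match is not None:
            print("\t".join(fields + match[1:]))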
The above assumes a simple case with two data files that have to be joined. Your question talks about 100 different name_of_country.txt files, which might complicate matters.
In regard to your second question - whether you can use parallel processing - that would probably only be useful if the processing is CPU-bound. The complexity of producing a forked or threaded solution is probably not warranted unless you find that it actually is CPU-bound.
Finally - if you are doing multiple analysis runs over the same data, it might be advisable to import the data into a real database and use that to run queries. That would save you a lot of coding work.
I will treat your question as: how to efficiently perform a "join" operation on two files - and here is the answer.
Actually there is a join command in Unix.
http://linux.die.net/man/1/join
Suppose you have two files, student and student_with_country:
student: [name] [age] [...]
student_with_country: [name] [country] [...]
you can do:
join student student_with_country (by default, it will join based on the first field)
Then the question is how to make it faster by using multiple cores?
The answer is the parallel command. Basically, you can run a simple map-reduce style program with it. For example, in this case:
cat student_with_country | parallel --block 10M --pipe join student -
It will divide the student_with_country file into 10 MB blocks and run the join command on each block in parallel. In this way, you can utilize the power of multiple cores.
