I have two DataFrames, each reading 1 TB of data. The code below runs very slowly. Is there a way to improve its performance?
diffDF = df1.subtract(df2)
In general, if you have two large datasets that you must shuffle, there isn't much you can do to improve the performance (apart from configuration tuning).
However, depending on the data and specific use case, you can try the following mitigations:
Assuming you have some id column(s) that uniquely identify each record in your datasets, instead of except/subtract you can use a left anti-join, which might be faster (see Any difference between left anti join and except in Spark?).
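For example, a minimal sketch of the anti-join, assuming both DataFrames share a unique key column (hypothetically named "id" here); it keeps the rows of df1 that have no match in df2:

# assumes df1 and df2 from the question and a unique key column "id" (made-up name)
diffDF = df1.join(df2, on="id", how="left_anti")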
In some cases, if you can eliminate irrelevant records from df2 before the join and keep a relatively small number of ids to join on, you may be able to perform a broadcast join, and that will significantly improve the performance.
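A hedged sketch of that idea - the filter condition and column names below are made up; the point is only that shrinking and broadcasting df2 avoids shuffling df1:

from pyspark.sql.functions import broadcast, col

# Hypothetical: keep only the ids from df2 that can still match (the filter column is invented),
# then broadcast the small side so the 1 TB df1 is scanned once without a shuffle.
relevant_ids = df2.filter(col("status") == "active").select("id").distinct()
diffDF = df1.join(broadcast(relevant_ids), on="id", how="left_anti")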
I am learning Databricks and I have some questions about z-order and partitionBy. When I read about both functions, they sound pretty similar: both group data in a way that accelerates read operations. It also looks like partitionBy works well with join operations, but I don't really understand which function I should use when I only want to read data. Can you tell me how I should think about both functions so I use them correctly?
Partitioning physically splits the data into different files/directories, each holding only one specific value, while ZOrder clusters related data inside files that may contain multiple possible values for a given column.
Partitioning is useful when you have a low-cardinality column - when there are not many different possible values. For example, you can easily partition by year & month (maybe by day), but if you additionally partition by hour, you'll end up with too many partitions containing too many files, and that will lead to big performance problems.
ZOrder allows you to create bigger files that are more efficient to read than many small files.
But you can combine partitioning with ZOrder - for example, partition by year/month and ZOrder by day - that will collocate data for the same day close together, and you can access it faster (because you read fewer files).
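As a rough sketch (assuming a Delta table on Databricks; the path and DataFrame name are made up), the combination looks like this:

# write partitioned by the low-cardinality columns
(events_df.write
    .format("delta")
    .partitionBy("year", "month")
    .mode("overwrite")
    .save("/mnt/delta/events"))

# then cluster the data inside each partition's files by a finer-grained column
spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (day)")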
Besides ZOrder, you can also rely on data skipping to efficiently filter out files that don't contain the data you need for your query.
You can read about data skipping & ZOrder in the following blog post.
I've been trying to think about what the ideal table structure would be for the fastest Spark queries.
I'll try to provide a use case: let's say you're gathering stats for every car in the world and you want to calculate various metrics with basic math (i.e. add, subtract, multiply, divide).
Would it be better to structure the data in a tall table with minimal fields like: day, metric, type, value?
Or would it be better to build a wide table that stores the metrics independently, with more fields like: day, emission_value, tire_pressure_value, speed_value, weight_value, heat_value, radio_value, etc.?
Is it right to say that tall tables are better for Spark? I assume it would be less memory intensive with a taller table.
As mentioned in the comments, this is a subjective question not exactly related to Spark, but I'll try to answer nonetheless.
I assume it would be less memory intensive with a taller table.
Not really - the amount of storage required should be about the same in either case for the use case you have mentioned, so let's get that out of the way. A taller table has more rows and fewer columns, and a wide table the opposite, so at the cell level it should be roughly the same. I'm considering uncompressed data, independent of storage format.
Now let's talk about the mentioned use case. Simply put, it's aggregations. The result may be fed downstream or used for reporting. Keeping this in mind, wider tables/views are generally better simply because fewer rows per day = less I/O and less shuffle.
Having said that, look through the cons below as well:
Schema evolution problems due to fixed schema
More suited for batch processing
Taller tables will be more streaming friendly, easier to extend with additional metrics, and, if used with a source that supports pushdown, can result in quick partial scans.
In short, it very much depends on your operations.
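To make the tall vs. wide trade-off concrete, here is a small hypothetical sketch (all column names invented) computing the same average against both layouts - note how the wide table produces far fewer rows per day:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tall layout: one row per (day, metric); adding a new metric needs no schema change.
tall = spark.createDataFrame(
    [("2024-01-01", "tire_pressure", 32.0), ("2024-01-01", "speed", 88.0)],
    ["day", "metric", "value"])
tall_avg = tall.groupBy("day", "metric").agg(F.avg("value").alias("avg_value"))

# Wide layout: one row per day, one column per metric; fewer rows to shuffle when aggregating.
wide = spark.createDataFrame(
    [("2024-01-01", 32.0, 88.0)],
    ["day", "tire_pressure_value", "speed_value"])
wide_avg = wide.groupBy("day").agg(F.avg("tire_pressure_value"), F.avg("speed_value"))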
I was wondering if there are performance differences between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset)) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the two datasets can have different columns.
Your title and your explanation differ.
But if you have the same structure, you can use both methods to find missing data. EXCEPT is a specific implementation that enforces the same structure and is a subtract operation, whereas LEFT ANTI JOIN allows different structures, as you say, but can give the same result.
Use cases differ: 1) a left anti-join can apply to many situations pertaining to missing data - customers with no orders (yet), orphans in a database. 2) except is for subtracting things, e.g. machine learning splitting data into test and training sets.
Performance should not be a real deal breaker, as these are different use cases in general and therefore difficult to compare. Except will involve the same data source, whereas a left anti-join (LAJ) will involve different data sources.
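For reference, a minimal PySpark sketch of the two approaches (dsA, dsB and the "id" key are made up; subtract is the PySpark counterpart of Scala's except):

# Both return the rows of dsA that are absent from dsB.
via_except = dsA.subtract(dsB)                           # requires identical schemas, compares whole rows
via_anti_join = dsA.join(dsB, on="id", how="left_anti")  # compares only the join key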
Background - in about a month I am going to kick-start a project where the dataset has around 300 columns.
Question - what is the maximum number of columns supported in a Spark DataFrame and Dataset?
Note - I am new to DataFrames/Datasets.
I had DFs that contained about 280 columns, and it worked without a problem.
For DSs, it's a bit more complicated: there is a 254-parameter limit in Java, so you won't be able to construct a wider DS (because it is backed by a Java class).
If you have control over the structure of the data, I would recommend grouping columns into structs; that will let you overcome the 254 limitation and make the data much easier to work with (if you group the columns in a logical way).
Also, make sure to store your data in a columnar format (like Parquet) to take advantage of Spark's predicate pushdown ability - it will improve your performance significantly when you are using such wide tables.
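A rough sketch of both suggestions, assuming a hypothetical flat DataFrame wide_df (all column names are made up):

from pyspark.sql import functions as F

# Fold related flat columns into structs so the backing class stays well under the 254-parameter limit.
nested = wide_df.select(
    "record_id",
    F.struct("engine_temp", "tire_pressure", "speed").alias("sensors"),
    F.struct("country", "language").alias("user"))

# Store in a columnar format so Spark can push filters down to the files.
nested.write.mode("overwrite").parquet("/tmp/wide_table_parquet")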
I'm new to Cassandra, so I read a dozen articles about it and thus I know the basics. All the tutorials show efficient data retrieval by 1 or 2 columns and a time range. What I could not find was how to correctly model your data if you have more conditions.
I have a big, normalised events database with quite a few columns, say:
event_type
time
email
user_age
user_country
user_language
and so on.
I would need to be able to query by all columns. So in an RDBMS I would query:
SELECT email FROM table WHERE time > X AND user_age BETWEEN X AND Y AND user_language = 'nl' etc.
I know I can make a separate table for each column, but then I would still need to combine the results. Maybe this is not a bad approach, but I doubt it since there are no subqueries.
My question is obviously, how can I model this kind of data correctly in Cassandra?
Thanks a lot!
I would need to be able to query by all columns.
Let me stop you right there. In Cassandra, you create your tables based on your anticipated query patterns, and usually a table supports a single query. In your case, you have "quite a few" columns and you will need to duplicate that data into a table designed to support each possible query. That is going to get big and ungainly very quickly.
Could we just add the rest as secondary indexes? There could potentially still be millions of rows in the eventtype table + merchant_id + time selection.
Secondary indexes are intended to be used on middle-of-the-road cardinality columns, so both extremely low and extremely high cardinality columns are bad candidates for secondary indexes. The problem is that Cassandra will have to pick one of your nodes as a coordinator, scan the index on each node (incurring lots of network time), and then build and return the result set. It's a prescription for poor performance that flies in the face of the best practices for working with a distributed database.
In short, Cassandra is not a good solution for use cases like this. It sounds like you want to be able to do OLAP-type queries, and for that you should use a tool that is better-suited for that purpose.