Spark Dataset when to use Except vs Left Anti Join

Spark Dataset when to use Except vs Left Anti Join - apache-spark

I was wondering if there are performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the 2 datasets can have different columns.

Your title vs. explanation differ.
But, if you have the same structure you can use both methods to find missing data.
EXCEPT
is a specific implementation that enforces same structure and is a subtract operation, whereas
LEFT ANTI JOIN
allows different structures as you would say, but can give the same result.
Use cases differ: 1) Left Anti Join can apply to many situations pertaining to missing data - customers with no orders (yet), orphans in a database. 2) Except is for subtracting things, e.g. Machine Learning splitting data into test- and training sets.
Performance should not be a real deal breaker as they are different use cases in general and therefore difficult to compare. Except will involve the same data source whereas LAJ will involve different data sources.

Related

Databricks: Z-order vs partitionBy

I am learning Databricks and I have some questions about z-order and partitionBy. When I am reading about both functions it sounds pretty similar. Both functions are grouping data in some way that accelerate reading operations. Also it s looks like partitionBy is good with join operations but I don t really understand what function should I use when I want only to read data. Can you tell how should I think about both functions to use it correctly?

Partitioning physically splits the data into different files/directories having only one specific value, while ZOrder provides clustering of related data inside the files that may contain multiple possible values for given column.
Partitioning is useful when you have a low cardinality column - when there are not so many different possible values - for example, you can easily partition by year & month (maybe by day), but if you partition in addition by hour, then you'll have too many partitions with too many files, and it will lead to big performance problems.
ZOrder allows to create bigger files that are more efficient to read compared to many small files.
But you can combine both partitioning with ZOrder - for example partition by year/month, and ZOrder by day - that will allow to collocate data of the same day close to each other, and you can access them faster (because you read fewer files).
Besides ZOrder, you can also use data skipping to efficiently filter out files that doesn't contain data you need for your query.
You can read about data skipping & ZOrder in the following blog post.

What is the ideal structure of tables for Spark to work with (Tall vs Wide)?

I've been trying to think about what the ideal table structure would be for the fastest Spark queries.
I'll try and provide a use case: Let's say your gathering stats for every car in the world and you want to use calculate various metrics with basic math (i.e. add, sub, mult, div).
Would be better to structure the data in a tall table with minimal fields like: day, metric, type, value?
Or would it be better to build a wide tables, that may store metrics independently. With more fields like: day, emmision_value, tire_pressure_value, speed_value, weight_value, heat_value, radio_value, etc .
Is it right to say that tall tables are better for spark? I assume it would be less memory intensive with a taller table.

As mentioned in the comments, this is a subjective question not exactly related to spark, but I'll try and answer none the less.
I assume it would be less memory intensive with a taller table.
Not really, the amount of storage required should be the same in either case based on the use case you have mentioned so let's get this out of the way. In case of taller tables there more rows and lesser columns and in case of wide tables the opposite. So on a cell level it should roughly be the same. I'm considering un compressed data independent of storage format.
Now lets talk about the mentioned use case. Simply put, it's aggregations. This may be fed downstream or may be used for reporting. Generally keeping this is mind, wider tables/views are better simply because - Lesser rows per day = less I/O as less shuffle.
Having said that, look through the cons below as well,
Schema evolution problems due to fixed schema
more suited for batch processing
Taller tables will be more streaming friendly, easier to extend for additional metrics and if its used with a source that supports push down, can result in quick partial scans.
in short, it very much depends on your operations.

Considerations for time-series

We are looking into using Azure Table Storage (ATS) together with Deedle (or other libraries with similar functionality) for our time-series storage, manipulations and calculations. From what I can read, F# also seems like a good choice for operations on arrays.
Our starting point is a set of time-series for energy consumption. The series will either be the consumption within an interval (fixed or irregular intervals) or a counter (from which we can calculate the consumption from one reading to the next). As a data point is just a tag (used as a partition key), timestamp (rowkey) and value, this should be well suited for ATS.
From a user's perspective, they want to do calculations on the series for a given period and resolution, e.g. calculate a third series as a difference between two others, for one given year with monthly resolution.
This raises a number of questions:
Will ATS together with F# be fast enough? If we have 10.000 data points? 100.000? Compared to C#?
Resampling will require calculations of points between the series' timestamps. I haven't seen any Deedle examples for (linear) interpolation, but I assume that this is just passing a function which can look at the necessary data points? Will this be fast enough for our number of points?
The calculations will be determined by the users and we must have this as configurations. My best guess so far is to have the formula in some format we can parse easily into reverse polish notation, and take special care of tags that will represent series (ie. read from ATS, resample, then do the operations).
Any comments will be highly appreciated!

I think Isaac already mentioned the most important points, but as this question involves some of the things I'm involved with, I thought I'd share a few additional remarks!
BigDeedle. As Isaac mentioned, I used Azure Table storage in BigDeedle. This is mainly useful if you want to explore data interactively using Deedle APIs and do some filtering and range restriction before getting the data in memory and running your calculations. BigDeedle loads data lazily from potentially very big external data source. That said, if you eventually need to load all data into memory, this might not be all that useful for you.
The storage model used in BigDeedle might be useful though - it partitions data based on date, so when you want to get values in a given date range, it knows in which partitions to look. In my experience, loading data from ATS works pretty well, especially if you can do it on an MBrace cluster running in Azure (which is what my NDC demo does in the end).
Efficiency. I think the combination should work well for 10k or 100k data points - there will be no difference whether you do this from F# or C#. As for Deedle, I've definitely used it with data sets of this size - we optimize the library "as needed". Most of the functions are quite efficient already, but there may be some operations that are not efficient. This is something that can be fixed if you open issue on GitHub.
Resampling. There is built-in function for linear interpolation (see here), but I suspect you may need to write your own custom interpolation. Deedle does not "hide the underlying data" from you, so this is not too hard - the last example on this page shows a custom function for filling missing data that uses linear interpolation. If you are doing something like this, you'll need to have the data in memory (so BigDeedle would not be very useful here).
Specifying calculations. I suspect this is a separate question, but F# is great for domain-specific languages. I did a talk on that at earlier NDC. Generally, you can either specify your own DSL (and parse it) or have an embedded DSL where people write subset of F#. F# has good support for both.
PS: If you wanted to get some more help with F#, Deedle and Azure tables, feel free to get in touch. I'm happy to share my experience - you should be able to find a contact via my profile.

F# versus C# will probably be basically the same perf wise unless you do something completely different between the two (for example, immutable vs mutable data sets). Both compile down to IL at the end of the day.
Azure Table Storage - make sure you pick your partition + row keys correctly. There is a lot of documentation on picking Azure Table Storage partition keys, especially over time series - make sure you group rows up at the correct level to ensure data is distributed, with partitions not too large or small. You might also want to look at the Azure Storage Type Provider and / or Azure Storage F# libraries which makes working with ATS easier than the standard .NET SDK.
Deedle AFAIK does indeed have ability to replace missing values across time series, and there's at least a project called BigDeedle which works directly over ATS (although I'm not sure how ready this project is).

Implementing a Spark SQL UserDefinedAggregateFunction that performs multiple passes over a column

I've been experimenting with the UserDefinedAggregateFunction class to write aggregate functions for use in Spark SQL.
It works well for implementing single pass operations like sum(), avg() etc., but is there a trick you can use to perform multiple passes over a column?
For example, Calculating variance using the naive approach. i.e. With a first pass calculating the column mean and then a second pass that uses this value to calculate the variance. I know that there are single pass algorithms for doing this that give good approximations (as in fact implemented by Spark). I was just using this as an example of a two-pass operation.
It would be nice to be able to do the following,
spark.sql("SELECT product, MultiPassAgg(price) FROM products GROUP BY product")
I appreciate that I can do this kind of thing using Dataset / DataFrame operations in stages etc., but I was just looking clean approach as illustrated in the SQL above.
Any ideas or suggestions?

This should be possible, though the following suggestion could potentially use a large amount of memory if a large number of rows are involved in any given partition.
In the implementation of your UserDefinedAggregateFunction, set up the bufferSchema having a StructField that includes a DataType that is a collection (such as ArrayType) to act as an internal collection of inputs provided via update.
Then, in update you append each input to your collection, and in merge you combine all of the collections into a single collection. This allows you to have the full partition available for use in evaluate.
Finally, during evaluate you can operate across the entire collection of rows in any way you see fit.

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per seconds (but the rate will increase). Each event is small, about 1 Kb.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern and since Cassandra is a schema db I don't suppose it's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so is there something one should be aware of?

I've had the exact same requirements for a "project" (rather a tool) a year ago, and I used Cassandra and I didn't regret. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster and the performance is impressive (although you might need tweaking) and the natural ordering is a nice thing to have.
Rather than expressing the benefits of using it, I'll rather concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key, in your case it will be the timestamp. However, you cannot order data between different rows. They might be ordered after the query, but it is not guaranteed in any way so don't think about it. There was some kind of way to write a query before 2.1 I believe (using order by and disabling paging and allowing filtering) but that introduced bad performance and I don't think it is even possible now. So you should order data between rows on your querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time, and you put them in different rows. You have to get those rows with different variable types, then do your resorting on the querying side. Another way to do it is to put all variable types in one row, but than filtering for only a subset is an issue to solve.
Rowlength is limited to 2 billion elements, and although that seems a lot, it really is not unreachable with time series data. Especially because you don't want to get near those two billions, keep it lower in hundreds of millions maximum. If you put some parameter on which you will split the rows (some increasing index or rounding by day/month/year) you will have to implement that in your query logic as well.
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries. There are specific rules in SQL with filtering, or using the WHERE clause..
All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.
Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )

Based on the requirements you expressed, Cassandra could be a good fit as it's a write-optimized data store. Timeseries are quite a common pattern and you can define a clustering order, for example, on the timestamp of the events in order to retrieve all the events in time order. I've found this article on Datastax Academy very useful when wanted to learn about time series.
Variable data structure it's not a problem: you can store the data in a BLOB, then parse it internally from your application (i.e. store it as JSON and read it in your model), or you could even store the data in a map, although collections in Cassandra have some caveats that it's good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string