Is there some tooling in Spark to handle bad records, meaning rows that come out with nulls after a left join or that were not joined properly?
It would be great if there was something like this but specifically for checking data quality after joins.
No, there is not. The Databricks feature you reference is about a different situation than the left join you describe (whatever "not joined properly" may mean). I think you also mean at least an outer join, and the unmatched nulls it produces are there by design.
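If you just need to flag those rows yourself, a minimal PySpark sketch (hypothetical DataFrames and key column) is to left-join and split on whether the right-hand key came back null:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: key 3 on the left has no match on the right.
df_left  = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["key", "left_val"])
df_right = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "right_val"])

joined = df_left.join(df_right, df_left["key"] == df_right["key"], "left")

bad  = joined.filter(df_right["key"].isNull())     # rows that found no match
good = joined.filter(df_right["key"].isNotNull())  # rows that joined cleanly

print(bad.count(), "unmatched rows")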
I'm rewriting some SQL code and there is a section of it that makes use of sub-queries. I could write this as a join, but I wanted to know if it can be done in a similar sub-query manner in PySpark. There is a significant performance benefit from using sub-queries in the SQL code, but I want to know whether this would be irrelevant in PySpark due to optimisation of the DAG. So it would be helpful if someone could explain the relative performance tradeoff, if there is one.
The logic is pretty simple: I have df_a and I want to pull a column from df_b where df_a and df_b have a matching index on a certain key. The below isn't working but is intended to show the intent.
df_a.select("df_a.key_1", "df_a.key_2", df_b.select("df_b.key_2").where(col("df_b.key_1")=="df_a.key_3"))
If you already have SQL code that works, then you can just use spark.sql(s) where s is your query as a String. It can contain subqueries too. Just make sure to create a view of your dataframe so you can use it inside the spark.sql query. Here is a toy example:
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")  # put your sql query here, it can contain subqueries
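To make that concrete for your case (using the df_a/df_b and column names from your snippet; as far as I can tell Spark accepts a correlated scalar subquery as long as it is aggregated, so that it returns at most one value per outer row):

df_a.createOrReplaceTempView("a")
df_b.createOrReplaceTempView("b")

# Correlated scalar subquery: the max() guarantees a single value per row of a.
result = spark.sql("""
    SELECT a.key_1,
           a.key_2,
           (SELECT max(b.key_2) FROM b WHERE b.key_1 = a.key_3) AS b_key_2
    FROM a
""")

As far as I understand, Catalyst rewrites such a correlated scalar subquery into a left outer join under the hood anyway, which is why the gap against an explicit join is usually negligible.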
As for your question regarding the optimisation tradeoff: in theory the Catalyst optimizer used by Spark should take care of any optimisation in your query, but as always, if you know exactly what kind of optimisation you need, it is generally better to do it by hand rather than relying on Catalyst.
I was wondering if there is a performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset)) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the two datasets can have different columns.
Your title and your explanation differ. But if you have the same structure, you can use both methods to find missing data. EXCEPT is a specific implementation that enforces the same structure and is a subtract operation, whereas LEFT ANTI JOIN allows different structures, as you note, but can give the same result.
Use cases differ: 1) Left Anti Join can apply to many situations pertaining to missing data - customers with no orders (yet), orphans in a database. 2) Except is for subtracting things, e.g. in Machine Learning, splitting data into test and training sets.
Performance should not be a real deal-breaker, as they serve different use cases in general and are therefore difficult to compare. Except will typically involve the same data source, whereas the LEFT ANTI JOIN will involve different data sources.
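A small PySpark sketch of both, on hypothetical data (in PySpark the EXCEPT-style operation is subtract, or exceptAll if you need duplicates kept):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(1,), (2,), (3,)], ["customer_id"])
orders    = spark.createDataFrame([(1, "o1"), (2, "o2")], ["customer_id", "order_id"])

# EXCEPT-style: both sides must have the same structure; pure set subtraction.
no_orders_except = customers.subtract(orders.select("customer_id"))

# LEFT ANTI JOIN: structures may differ; keeps left rows that have no match.
no_orders_anti = customers.join(orders, on="customer_id", how="left_anti")

Both return customer 3 here; the anti-join simply never required the two inputs to look alike.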
In Spark, is there a way of adding a column to a DataFrame by means of a join, but in a way that guarantees that the left hand side remains completely unchanged?
This is what I have looked at so far:
leftOuterJoin... but that risks duplicating rows, so one would have to be super careful to make sure that there are no duplicate keys on the right. Not exactly robust or performant, if the only way to guarantee safety is to dedupe before the join (see the sketch below).
There is a data structure that seems to guarantee no duplicate keys: PairRDD. That has a nice method for looking up a key in the key-value table: YYY.lookup("key"). Thus one might expect to be able to do .withColumn("newcolumn", udf((key:String) => YYY.lookup(key)).apply(keyColumn)), but it seems that UDFs cannot do this, apparently because they cannot access the sqlContext that the lookup needs. If there were a way of using withColumn I would be extremely happy because it has the right semantics.
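For reference, this is what I mean by the dedupe-then-join route (PySpark for brevity, hypothetical names), which works but feels like extra ceremony:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_left  = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "val"])
df_right = spark.createDataFrame([(1, "x"), (1, "x_dup"), (2, "y")], ["key", "newcolumn"])

# Force the right side to one row per key so the left join cannot multiply rows.
right_one_per_key = df_right.dropDuplicates(["key"])

result = df_left.join(right_one_per_key, on="key", how="left")
assert result.count() == df_left.count()   # left-hand side row count preserved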
Many thanks in advance!
I am wondering if anyone is aware of any discussion of Joins vs Lookups in Spark? I have seen this page: Lookup in spark dataframes, where everyone basically says that joins are far superior to lookups, and I was unsuccessful in my google-fu attempt to find anything backing that up or even discussing the two topics.
There is no such thing as a lookup on a Spark DataFrame, so by that measure it is "inferior" to any other solution; a join (hash or broadcast) or using local data structures is the only option.
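A sketch of what that looks like in practice, on hypothetical data; broadcasting ships the small "lookup" table to every executor so the join behaves like a local map lookup:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
facts  = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["key", "value"])
lookup = spark.createDataFrame([(1, "Hello"), (2, "world")], ["key", "label"])

# Broadcast join: the usual Spark substitute for a keyed lookup.
enriched = facts.join(broadcast(lookup), on="key", how="left")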
Lookups and Joins are two different concepts in relational data systems. Therefore, it doesn't really make sense in a general context to say that one is superior to the other because they have different functions. A lookup is simply finding data, sometimes using a key or hash value to optimize query speed. A join is using common elements in two data sets to create a new data set.
E.g. (completely hypothetical and abstract):
Lookup(query 1) = 'Hello'
Join(query 1, query 2) = 'Hello world', if query 2 equals 'world'
We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 kB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern, and since Cassandra is a schema-based DB I don't suppose that would even be possible when the events come in many different forms? Would Cassandra be a good fit for this? If so, is there anything one should be aware of?
I had the exact same requirements for a "project" (rather, a tool) a year ago; I used Cassandra and didn't regret it. In general it fits very well: you can fit quite a lot of data into a Cassandra cluster, the performance is impressive (although you might need some tweaking), and the natural ordering is a nice thing to have.
Rather than listing the benefits of using it, I'll concentrate on the possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key; in your case that will be the timestamp. However, you cannot order data between different rows. They might come back ordered after the query, but that is not guaranteed in any way, so don't rely on it. There was some way to write such a query before 2.1, I believe (using ORDER BY, disabling paging and allowing filtering), but it performed badly and I don't think it is even possible now. So you should order data across rows on the querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time and you put them in different rows. You have to fetch those rows with the different variable types and then do your re-sorting on the querying side. Another way to do it is to put all variable types in one row, but then filtering for only a subset becomes an issue to solve.
Row length is limited to 2 billion elements, and although that seems like a lot, it really is not unreachable with time-series data. In particular, you don't want to get anywhere near those two billion; keep it in the hundreds of millions at most. If you introduce some parameter on which to split the rows (an increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well (a rough sketch follows below).
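A rough sketch of what I mean (hypothetical keyspace/table names, using the DataStax Python driver):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("events_ks")   # assumed local cluster and existing keyspace

# Bucket the partition key by day so no single partition creeps towards the limit,
# and cluster by a timeuuid so rows within a bucket come back in time order.
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        day        text,       -- bucket, e.g. '2015-06-01'
        event_time timeuuid,   -- clustering column: orders rows within the bucket
        payload    blob,
        PRIMARY KEY ((day), event_time)
    ) WITH CLUSTERING ORDER BY (event_time ASC)
""")

Replaying everything then means iterating the day buckets in order from your application and querying them one by one.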
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries; there are specific rules in CQL about filtering and the WHERE clause.
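For example, against the table sketched above, this is allowed because the partition key is pinned with equality and the range is on the clustering column:

rows = session.execute(
    "SELECT event_time, payload FROM events "
    "WHERE day = '2015-06-01' AND event_time > minTimeuuid('2015-06-01 12:00+0000')"
)

Something like WHERE day > '2015-06-01', a range directly on the partition key, is rejected unless you go through token() or restructure the query.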
All in all, these things might seem important, but they are really not too much of a hassle once you get to know Cassandra a bit. I'm underlining them just to give you a heads-up. If something seems illogical at first, fall back on understanding why it is that way and on the theory about data distribution and the ring topology.
Don't expect too much from collections within columns; their length is limited to ~65,000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
Based on the requirements you expressed, Cassandra could be a good fit as it's a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example on the timestamp of the events, in order to retrieve all the events in time order. I found this article on Datastax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a BLOB and then parse it in your application (i.e. store it as JSON and read it into your model, as in the sketch below), or you could even store the data in a map, although collections in Cassandra have some caveats that are good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
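A quick sketch of the JSON-in-a-BLOB approach with the DataStax Python driver, assuming a hypothetical events table with a text day bucket, a timeuuid event_time and a blob payload:

import json, time
from cassandra.cluster import Cluster
from cassandra.util import uuid_from_time

session = Cluster(["127.0.0.1"]).connect("events_ks")   # assumed cluster and keyspace
insert = session.prepare(
    "INSERT INTO events (day, event_time, payload) VALUES (?, ?, ?)"
)

# Events can have any shape: serialize to JSON and store the bytes opaquely.
event = {"type": "temperature", "value": 21.5}
session.execute(insert, (
    time.strftime("%Y-%m-%d"),
    uuid_from_time(time.time()),
    json.dumps(event).encode("utf-8"),
))

On the read side you simply json.loads(payload) back into whatever model your application uses.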
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.