Add column with join in Spark

In Spark, is there a way of adding a column to a DataFrame by means of a join, but in a way that guarantees that the left hand side remains completely unchanged?
This is what I have looked at so far:
leftOuterJoin... but that risks duplicating rows, so one would have to be very careful to make sure there are no duplicate keys on the right. Not exactly robust or performant if the only way to guarantee safety is to dedupe before the join.
There is a data structure that seems to guarantee no duplicate keys: PairRDD. It has a nice method for looking up a key in the key-value table: YYY.lookup("key"). Thus one might expect to be able to do .withColumn("newcolumn", udf((key:String) => YYY.lookup(key)).apply(keyColumn)), but it seems UDFs cannot do this because they cannot access the sqlContext, which is apparently needed for the lookup. If there were a way of using withColumn I would be extremely happy, because it has the right semantics.
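For concreteness, the dedupe-then-join route would look something like this; a rough sketch only, where left and lookup are hypothetical DataFrames sharing a "key" column and the three-argument join needs Spark 1.6+:

    import org.apache.spark.sql.DataFrame

    def addColumnSafely(left: DataFrame, lookup: DataFrame): DataFrame = {
      // Keep at most one row per key on the right-hand side, so the join
      // can never duplicate rows on the left.
      val dedupedRight = lookup.dropDuplicates(Seq("key")).select("key", "newcolumn")

      // A left outer join now preserves every left row exactly once;
      // unmatched rows simply get null in "newcolumn".
      left.join(dedupedRight, Seq("key"), "left_outer")
    }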
Many thanks in advance!

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into an org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
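To make the two stages concrete, here is a small sketch; the JDBC options and column names are made up, and sqlContext is the one your application (or the shell) already has:

    // Stage one: describe the input and a transformation. Nothing is read yet;
    // df and keyCol only carry schema and lineage information.
    val df = sqlContext.read
      .format("jdbc")
      .options(Map("url" -> "jdbc:mysql://host/db", "dbtable" -> "records"))
      .load()
    val keyCol = df("id")                 // a Column: metadata only, no data
    val filtered = df.filter(keyCol > 0)  // still just a plan

    // Stage two: an action triggers the read and all the transformations.
    val n = filtered.count()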
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what you are trying to accomplish, but it is safe to say that this much converting between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
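To illustrate those suggestions loosely, here is a hedged sketch of pre-processing done without ever leaving the DataFrame world; the column names are invented and it assumes Spark 1.5+ for the DataFrame-based ml API:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.ml.feature.{PCA, VectorAssembler}

    def preprocess(df: DataFrame): DataFrame = {
      // Replace NULLs with built-in DataFrame operations.
      val noNulls = df.na.fill(0.0, Seq("price")).na.fill("unknown", Seq("category"))

      // A custom per-record rule expressed as a UDF instead of iterating in Java.
      val clamp = udf((x: Double) => math.min(math.max(x, 0.0), 1000.0))
      noNulls.withColumn("price", clamp(noNulls("price")))
    }

    def reduceDimensions(df: DataFrame): DataFrame = {
      // Assemble numeric columns into the Vector column PCA expects, still as a DataFrame.
      val assembled = new VectorAssembler()
        .setInputCols(Array("price", "f1", "f2"))
        .setOutputCol("features")
        .transform(df)
      new PCA().setInputCol("features").setOutputCol("pca").setK(2).fit(assembled).transform(assembled)
    }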
Hope this helps.

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern, and since Cassandra is a schema-based DB I don't suppose that is even possible when the events come in many different forms? Would Cassandra be a good fit for this? If so, is there something one should be aware of?
I had the exact same requirements for a "project" (rather, a tool) a year ago; I used Cassandra and I didn't regret it. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster, the performance is impressive (although you might need some tweaking), and the natural ordering is a nice thing to have.
Rather than listing the benefits of using it, I'll concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key, which in your case will be the timestamp. However, you cannot order data between different rows. They might come back ordered after a query, but that is not guaranteed in any way, so don't rely on it. There used to be a way to write such a query before 2.1, I believe (using ORDER BY, disabling paging and allowing filtering), but it performed badly and I don't think it is even possible now. So you should order data between rows on the querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time and you put them in different rows. You have to fetch the rows for the different variable types and then re-sort on the querying side. Another way to do it is to put all variable types in one row, but then filtering for only a subset becomes an issue to solve.
Row length is limited to 2 billion elements, and although that seems like a lot, it is not unreachable with time series data. You don't want to get anywhere near those two billion; keep it to hundreds of millions at most. If you split rows on some parameter (an increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well; see the sketch below.
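As a rough illustration of that bucketing (keyspace, table and column names are invented, and the DataStax Java driver is just one way to run the CQL):

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("events_ks")

    // One partition ("row") per day keeps each row far below the limit;
    // within a day, events are kept in timestamp order by the clustering column.
    session.execute(
      """CREATE TABLE IF NOT EXISTS events (
        |  day     text,       -- bucket, e.g. '2015-06-01'
        |  ts      timeuuid,   -- clustering column: insertion/replay order
        |  payload blob,
        |  PRIMARY KEY (day, ts)
        |) WITH CLUSTERING ORDER BY (ts ASC)""".stripMargin)

    // Replaying one day is a single ordered scan of one row; replaying a range of days
    // means querying each bucket and concatenating the results on the client side.
    val rows = session.execute("SELECT ts, payload FROM events WHERE day = '2015-06-01'")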
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries; CQL has specific rules about filtering and what may appear in the WHERE clause.
All in all, these things might seem like a lot, but they are really not too much of a hassle once you get to know Cassandra a bit. I'm underlining them just to give you a heads-up. If something seems illogical at first, go back to the theory of data distribution and the ring topology to understand why it is that way.
Don't expect too much from the collections within columns; their size is limited to about 65,000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
Based on the requirements you describe, Cassandra could be a good fit, as it's a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example on the timestamp of the events, in order to retrieve all the events in time order. I found this article on DataStax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a blob and parse it in your application (i.e. store it as JSON and read it into your model), or you could even store the data in a map, although collections in Cassandra have some caveats it's good to be aware of; the docs about collections in Cassandra 2.0/2.1 cover them.
Cassandra is quite different from a SQL database, and although CQL has some similarities, there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to get good efficiency; a great article from DataStax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

groupByKey with millions of rows by key

Context:
Aggregation by key, with potentially millions of rows per key.
Adding features to each row. To do that we need to know the previous row (by key and by timestamp). For the moment we use groupByKey and do the work on the resulting Iterable.
We tried:
Add more memory to executor/driver
Change the number of partitions
Changing the memory allocated to the executor/driver worked, but only for 10k or 100k rows per key. What about the millions of rows per key that could happen in the future?
It seems that there is some work on this kind of issue: https://github.com/apache/spark/pull/1977
But it is specific to PySpark and not to the Scala API that we currently use.
My questions are:
Is it better to wait for new features that handle this kind of issue, given that I would have to work specifically in PySpark?
Another solution would be to implement the workflow differently, using specific keys and values to handle my needs. Is there a design pattern for that, for example for the need to have the previous row (by key and by timestamp) in order to add features?
I think the change in question just makes PySpark work more like the main API. You probably don't want to design a workflow that requires a huge number of values per key, no matter what. There isn't a fix other than designing it differently.
I haven't tried this, and I'm only fairly sure the behavior is guaranteed, but maybe you can sortBy timestamp on the whole data set and then foldByKey. You provide a function that merges a previous value into a next value. This should encounter the data in timestamp order, so you see rows t and t+1 each time, and each time you can return row t+1 after augmenting it however you like.
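A rough sketch of that idea, with invented row and state types; as said, I wouldn't rely on the ordering surviving the fold without testing it, and note that foldByKey yields one merged value per key, so the augmentation here is a running aggregate rather than a new column on every row:

    import org.apache.spark.rdd.RDD

    case class Event(key: String, ts: Long, value: Double)
    case class State(lastTs: Long, lastValue: Double, count: Long, deltaSum: Double)

    def runningFeatures(events: RDD[Event]): RDD[(String, State)] =
      events
        .sortBy(_.ts)                                      // sort the whole data set by timestamp
        .map(e => (e.key, State(e.ts, e.value, 1L, 0.0)))  // lift each row into the fold type
        .foldByKey(State(Long.MinValue, 0.0, 0L, 0.0)) { (prev, next) =>
          // "prev" is the state built so far (row t), "next" is the lifted row t+1
          val delta = if (prev.count == 0L) 0.0 else next.lastValue - prev.lastValue
          State(next.lastTs, next.lastValue, prev.count + 1L, prev.deltaSum + delta)
        }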

Write conflict resolution in Cassandra

I understand that Cassandra resolves write conflicts based on each column's key-value pair's timestamp (last write wins). But is there a way to override this behavior by manual intervention?
Thanks,
Chethan
No.
Cassandra only does LWW. This may seem simplistic, but Cassandra's Bigtable-esque data model makes it less of an issue than in a pure key/value store like Riak, for example. When all you have is an opaque value for a key, you want to be able to keep conflicting writes so that you can resolve them later. Since Cassandra's rows aren't opaque, but more like a sorted map, LWW is almost always enough. With Cassandra you can add new cells to a row from multiple clients without having to worry about conflicts. It's only when multiple clients write to the same cell that there is an issue, but in that case you usually can (and probably should) model your way around it.
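A small illustration of that cell-level behavior (table and column names are made up, and nothing here overrides last-write-wins itself):

    import com.datastax.driver.core.Cluster

    val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("demo")

    // Client A writes one cell of the row...
    session.execute("UPDATE user_profile SET email = 'a@example.com' WHERE user_id = 42")
    // ...while client B writes a different cell of the same row.
    session.execute("UPDATE user_profile SET city = 'Oslo' WHERE user_id = 42")

    // Both cells survive, because resolution happens per cell, not per row. Only if both
    // clients had written email would the timestamp (last write wins) decide the value.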

Are Cassandra's Super Columns Useful?

I'm trying to figure out if Cassandra's super columns are useful.
If I understand how Cassandra works (which could be wrong), if I want to read or update a super column, I have to read or write everything in the super column. This means I need to write some mapping code between my object(s) and my super column(s).
Wouldn't it be more efficient for me to simply serialize my object into a regular Cassandra column? It sounds to me like that's exactly what Cassandra does with super columns, but they require extra steps.
If I want to read or update a super column, I have to read or write everything in the super column
Not quite: supercolumns are read as a unit, but subcolumns may be updated independently.
So yes, they do have a use, but no, it is not a very common one and in many cases you can indeed just serialize an object into a standard column.
