How to assign unique contiguous numbers to elements in a Spark RDD - apache-spark

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm.
The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs.
Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.
I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.

Starting with Spark 1.0 there are two methods you can use to solve this easily:
RDD.zipWithIndex is just like Seq.zipWithIndex, it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.
RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double-evaluation.

For a similar example use case, I just hashed the string values. See http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag =>(nnHash(tag),tag))
It sounds like you're already doing something like this, although hashing can be easier to manage.
Matei suggested here an approach to emulating zipWithIndex on an RDD, which amounts to assigning IDs within each partiition that are going to be globally unique: https://groups.google.com/forum/#!topic/spark-users/WxXvcn2gl1E

Another easy option, if using DataFrames and just concerned about the uniqueness is to use function MonotonicallyIncreasingID
import org.apache.spark.sql.functions.monotonicallyIncreasingId
val newDf = df.withColumn("uniqueIdColumn", monotonicallyIncreasingId)
Edit: MonotonicallyIncreasingID was deprecated and removed since Spark 2.0; it is now known as monotonically_increasing_id .

monotonically_increasing_id() appears to be the answer, but unfortunately won't work for ALS since it produces 64-bit numbers and ALS expects 32-bit ones (see my comment below radek1st's answer for deets).
The solution I found is to use zipWithIndex(), as mentioned in Darabos' answer. Here's how to implement it:
If you already have a single-column DataFrame with your distinct users called userids, you can create a lookup table (LUT) as follows:
# PySpark code
user_als_id_LUT = sqlContext.createDataFrame(userids.rdd.map(lambda x: x[0]).zipWithIndex(), StructType([StructField("userid", StringType(), True),StructField("user_als_id", IntegerType(), True)]))
Now you can:
Use this LUT to get ALS-friendly integer IDs to provide to ALS
Use this LUT to do a reverse-lookup when you need to go back from ALS ID to the original ID
Do the same for items, obviously.

People have already recommended monotonically_increasing_id(), and mentioned the problem that it creates Longs, not Ints.
However, in my experience (caveat - Spark 1.6) - if you use it on a single executor (repartition to 1 before), there is no executor prefix used, and the number can be safely cast to Int. Obviously, you need to have less than Integer.MAX_VALUE rows.

Related

How to enforce a specific partitioning and sort order automatically with a UDAF in Spark?

I think the title of the question is quite clear. Based on the documentation regarding custom UDAFs, I want to develop a UDAF which uses a certain algorithm that relies on the fact that the reduce(b: BUF, a: IN): BUF function is called for inputs that are partitioned and sorted on the field that is being aggregated. And if this is not the case, I would like this to be enforced based on the fact that I'm using this UDAF, rather than repartitioning and sorting manually.
For example:
If I were to develop my own my_count_distinct; say I'm calling it like this:
my_count_distinct("user_id").
The reduce would increment a counter in the buffer when the current user_id is different than the previous one, as the algorithm assumes the the input is ordered on user_id.
The merge would just simply add up the counters in the buffer since the assumption is that the DataFrame is partitioned on the user_id, therefore the input per partition should be mutually exclusive.

What is recommended - keeping empty lists/arrays versus Null in spark tables?

I have a large spark table containing mixed data types String,arrays,maps
The array and map columns are sparse in nature. Should i keep empty arrays in values for these columns or make them null?
Similarly is it recommended to use empty strings "" for storing or null?
What is a good practice and advantages and disadvantages of both?
Generally speaking I would always try to use NULL values instead of empty strings or arrays. The main reason for me for me his how they are handled in spark, e.g. when joining two data frames. NULL values are ignored in joins, but empty strings or lists are not. This can often result in very skew data, which can heavily slow down your transformations. Some information about skew data can be found here [external link].
In addition, NULL values are also often ignored in functions like coalesce of columns [docs], count in aggregations [related question] or first(col, ignorenulls=True) [docs]. If you want to use the functions as they are intended, I would also recommend using NULL over empty string/list.
To sum this up: using NULL over other values like empty strings or lists, allows you to profit for more native Spark functionality and I would recommend to use NULL when ever possible.

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.
My understanding is that in Spark, RDDs, Dataframes, and Datasets are all immutable - which, again I understand, means you cannot change the data. If so, why are we able to edit a Dataframe's existing column using withColumn()?
As per Spark Architecture DataFrame is built on top of RDDs which are immutable in nature, Hence Data frames are immutable in nature as well.
Regarding the withColumn or any other operation for that matter, when you apply such operations on DataFrames it will generate a new data frame instead of updating the existing data frame.
However, When you are working with python which is dynamically typed language you overwrite the value of the previous reference. Hence when you are executing below statement
df = df.withColumn()
It will generate another dataframe and assign it to reference "df".
In order to verify the same, you can use id() method of rdd to get the unique identifier of your dataframe.
df.rdd.id()
will give you unique identifier for your dataframe.
I hope the above explanation helps.
Regards,
Neeraj
You aren't; the documentation explicitly says
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
The Core Data structure of Spark, i.e., the RDD itself is immutable. This nature is pretty much similar to a string in Java which is immutable as well.
When you concat a string with another literal you are not modifying the original string, you are actually creating a new one altogether.
Similarly, either the Dataframe or the Dataset, whenever you alter that RDD by either adding a column or dropping one you are not changing anything in it, instead you are creating a new Dataset/Dataframe.

Why does collect_list in Spark not use partial aggregation

I recently played around with UDAFs and looked into the sourcecode of the built-in aggregation function collect_list, I was suprised to see that collect_list does not have a merge method implemented, although I think this is really straight-farward (just concatenate two Arrays). Code taken from org.apache.spark.sql.catalyst.expressions.aggregate.collect.Collect
override def merge(buffer: InternalRow, input: InternalRow): Unit = {
sys.error("Collect cannot be used in partial aggregations.")
}
It is no longer the case, as SPARK-1893 but I'd assume that the initial design had mostly collect_list in mind.
Because collect_list is logically equivalent to groupByKey the motivation would be exactly the same to avoid long GC pauses. In particular map side combine in groupByKey has been disabled with Spark SPARK-772:
Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
So to address you comment
I think this is really straight-farward (just concatenate two Arrays).
It might be simple but it doesn't add much value (unless there is another reducing operation on top of it) and sequence concatenation is expensive.

Mind blown: RDD.zip() method

I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.
I understand what it does, of course. However, it has always been my understanding that
the order of elements in an RDD is a meaningless concept
the number of partitions and their sizes is an implementation detail only available to the user for performance tuning
In other words, an RDD is a (multi)set, not a sequence (and, of course, in, e.g., Python one gets AttributeError: 'set' object has no attribute 'zip')
What is wrong with my understanding above?
What was the rationale behind this method?
Is it legal outside the trivial context like a.map(f).zip(a)?
EDIT 1:
Another crazy method is zipWithIndex(), as well as well as the various zipPartitions() variants.
Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
collect() is also okay - it just converts a set to a sequence which is perfectly legit.
EDIT 2: The reply says:
when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). What is the situation when zip() results are reproducible?
It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations do preserve both partitioning and order, like map. That said I find it a little easy to accidentally violate the assumptions that zip depends on, since they're a little subtle, but it certainly has a purpose.
The mental model I use (and recommend) is that the elements of an RDD are ordered, but when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
For those who want to be aware of partitions, I'd say that:
The partitions of an RDD have an order.
The elements within a partition have an order.
If you think of "concatenating" the partitions (say laying them "end to end" in order) using the order of elements within them, the overall ordering you end up with corresponds to the order of elements if you ignore partitions.
But again, if you compute one RDD from another, all bets about the order relationships of the two RDDs are off.
Several members of the RDD class (I'm referring to the Scala API) strongly suggest an order concept (as does their documentation):
collect()
first()
partitions
take()
zipWithIndex()
as does Partition.index as well as SparkContext.parallelize() and SparkContext.makeRDD() (which both take a Seq[T]).
In my experience these ways of "observing" order give results that are consistent with each other, and the ones that translate back and forth between RDDs and ordered Scala collections behave as you would expect -- they preserve the overall order of elements. This is why I say that, in practice, RDDs have a meaningful order concept.
Furthermore, while there are obviously many situations where computing an RDD from another must change the order, in my experience order tends to be preserved where it is possible/reasonable to do so. Operations that don't re-partition and don't fundamentally change the set of elements especially tend to preserve order.
But this brings me to your question about "contract", and indeed the documentation has a problem in this regard. I have not seen a single place where an operation's effect on element order is made clear. (The OrderedRDDFunctions class doesn't count, because it refers to an ordering based on the data, which may differ from the raw order of elements within the RDD. Likewise the RangePartitioner class.) I can see how this might lead you to conclude that there is no concept of element order, but the examples I've given above make that model unsatisfying to me.

Resources