Equivalent function to pyspark.sql.functions.posexplode in presto? - apache-spark

Would anyone know if there is an equivalent to the PySpark function pyspark.sql.functions.posexplode() in Presto?
I am trying to explode an array together with its index. I am aware of the existence of UNNEST(); however, on its own it does not give the position of each value in the array.

Update: I have found the answer.
CROSS JOIN UNNEST(col_wth_array)
WITH ORDINALITY AS t (value, position)
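For reference, a minimal Spark-side sketch of what this emulates, using the Scala API (the SparkSession, df, and id are illustrative, not from the question). Note that posexplode numbers positions from 0, while Presto's WITH ORDINALITY starts at 1.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.posexplode

// Toy data; posexplode adds a "pos" column (zero-based index) and a "col" column (element).
val spark = SparkSession.builder().master("local[*]").appName("posexplode-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, Seq("a", "b", "c"))).toDF("id", "col_wth_array")
df.select($"id", posexplode($"col_wth_array")).show()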

Related

Choose between map+inner loop and flatMapValues+reduceByKey

I have data like the following in a pair RDD, and I would like to collect a map with the username as key and the sum of each list as value. The number of users is very large, say 100m+, and the lists are <1k in size. There are two choices I can think of: mapToPair and sum the list with a simple for loop inside mapToPair, or flatMapValues the list to create <user, value> pairs and then reduceByKey. Which way is better?
Seq(
("user1",List(8,2,....)),
("user2",List(1,12,.....)),
...
("userN",List(99,5,...))
)
I would guess rdd.mapValues(_.sum) would be faster because you iterate over the elements once instead of twice (once to flatten, once to reduce).
But the best answer would be to just test it and see.
The best tip I can think of, though, is to try to work with DataFrames or Datasets (Spark SQL) to begin with. If you end up with a flattened DataFrame you can call df.groupBy($"user").agg(F.sum($"value")), or if you have a DataFrame shaped like the RDD you described you can just use the aggregate SQL function.
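For completeness, here is a minimal sketch of the two RDD-level options on toy data (the local SparkSession and variable names are illustrative, not part of the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sum-per-user").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(
  ("user1", List(8, 2)),
  ("user2", List(1, 12))
))

// Option 1: sum each user's list in place (one pass over the elements per list).
val option1 = pairs.mapValues(_.sum).collectAsMap()

// Option 2: flatten to (user, value) pairs, then reduce by key.
val option2 = pairs.flatMapValues(xs => xs).reduceByKey(_ + _).collectAsMap()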

Spark, What does groupby return?

What is the return value of groupBy and agg in Spark?
(This was one of the confusing parts of pandas that I never quite got, and I guess it is similar here with Spark.)
df.groupBy("col1").agg(max("col2").alias("col2_max"))
Even though the result looks like a regular DataFrame when you call .show() on it, I believe it is not a DataFrame (because if you do another .agg after the initial .agg, things get weird).
So what do groupBy and agg actually return?
According to the Spark documentation, the DataFrame.groupBy method returns a GroupedData object, which exposes aggregation methods such as agg, count, sum, avg, etc. The agg method (and the other aggregation methods) return a DataFrame.
For further details, review the following documentation links: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy and http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
Hope this helps
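As a quick illustration in the Scala API, where the intermediate type is called RelationalGroupedDataset rather than GroupedData (the toy data is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().master("local[*]").appName("groupby-types").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("col1", "col2")

val grouped = df.groupBy("col1")                          // RelationalGroupedDataset, not a DataFrame
val result  = grouped.agg(max("col2").alias("col2_max"))  // agg returns a DataFrame again
result.show()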

What's the difference between explode function and operator?

What's the difference between explode function and explode operator?
org.apache.spark.sql.functions.explode
The explode function creates a new row for each element in the given array or map column (in a DataFrame).
val signals: DataFrame = spark.read.json(signalsJson)
signals.withColumn("element", explode($"data.datapayload"))
explode creates a Column.
See functions object and the example in How to unwind array in DataFrame (from JSON)?
Dataset<Row> explode / flatMap operator (method)
The explode operator is almost the same as the explode function.
From the scaladoc:
explode returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
ds.flatMap(_.words.split(" "))
Please note that (again quoting the scaladoc):
Deprecated (Since version 2.0.0) use flatMap() or select() with functions.explode() instead
See Dataset API and the example in How to split multi-value column into separate rows using typed Dataset?
Despite explode being deprecated (so we could translate the main question into the difference between the explode function and the flatMap operator), the difference is that the former is a function while the latter is an operator. They have different signatures but can give the same results. That often leads to discussions about which is better, and it usually boils down to personal preference or coding style.
One could also say that flatMap (i.e. the explode operator) is more Scala-ish, given how ubiquitous flatMap is in Scala programming (mainly hidden behind for-comprehensions).
flatMap performs much better than explode, as flatMap requires much less data shuffling.
If you are processing big data (>5 GB), the performance difference becomes clearly visible.
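A minimal side-by-side sketch of the two styles (the toy data, column names, and local SparkSession are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder().master("local[*]").appName("explode-vs-flatmap").getOrCreate()
import spark.implicits._

val ds = Seq("a b c", "d e").toDS()   // Dataset[String]; the default column is named "value"

// explode function: works on Columns and returns an untyped DataFrame.
val viaExplode = ds.select(explode(split($"value", " ")).as("word"))

// flatMap operator: works on the typed values and returns a Dataset[String].
val viaFlatMap = ds.flatMap(_.split(" "))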

Find first element in RDD satisfying the given predicate

How do I find the first element in a normal RDD (in a PairRDD we can use the lookup(key) API) that satisfies a predicate? After the first element is found, the RDD traversal should stop.
Looking for a solution without using legacy for loops.
How about
rdd.filter(p).top(1)
or if you don't have an order on the RDD
rdd.filter(p).take(1)
The solutions stated above are perfectly correct. Here is another method to achieve the same goal:
rdd.filter(p).first
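A small sketch of these options on toy data (the predicate and SparkSession are made up); note that first() throws on an empty RDD, while take(1) simply returns an empty array:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("first-match").getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(3, 7, 12, 5, 20))
val p = (x: Int) => x > 10

val viaTake  = rdd.filter(p).take(1)   // Array(12), or an empty Array if nothing matches
val viaFirst = rdd.filter(p).first()   // 12, but throws if nothing matches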

How to assign unique contiguous numbers to elements in a Spark RDD

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm.
The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs.
Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.
I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.
Starting with Spark 1.0 there are two methods you can use to solve this easily:
RDD.zipWithIndex is just like Seq.zipWithIndex, it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.
RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double-evaluation.
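A brief sketch of both methods on illustrative data (the SparkSession and values are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("assign-ids").getOrCreate()
val users = spark.sparkContext.parallelize(Seq("alice", "bob", "carol")).distinct().cache()

// Contiguous ids 0, 1, 2, ... (triggers an extra pass to count elements per partition).
val contiguousIds = users.zipWithIndex()

// Unique but not necessarily contiguous ids; no extra pass over the input.
val uniqueIds = users.zipWithUniqueId()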
For a similar example use case, I just hashed the string values. See http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag =>(nnHash(tag),tag))
It sounds like you're already doing something like this, although hashing can be easier to manage.
Matei suggested here an approach to emulating zipWithIndex on an RDD, which amounts to assigning IDs within each partition that are going to be globally unique: https://groups.google.com/forum/#!topic/spark-users/WxXvcn2gl1E
Another easy option, if you are using DataFrames and are just concerned about uniqueness, is to use the MonotonicallyIncreasingID function
import org.apache.spark.sql.functions.monotonicallyIncreasingId
val newDf = df.withColumn("uniqueIdColumn", monotonicallyIncreasingId)
Edit: monotonicallyIncreasingId was deprecated in Spark 2.0 and has since been removed; it is now known as monotonically_increasing_id.
monotonically_increasing_id() appears to be the answer, but unfortunately it won't work for ALS, since it produces 64-bit numbers while ALS expects 32-bit ones (see my comment below radek1st's answer for details).
The solution I found is to use zipWithIndex(), as mentioned in Darabos' answer. Here's how to implement it:
If you already have a single-column DataFrame with your distinct users called userids, you can create a lookup table (LUT) as follows:
# PySpark code
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([StructField("userid", StringType(), True), StructField("user_als_id", IntegerType(), True)])
user_als_id_LUT = sqlContext.createDataFrame(userids.rdd.map(lambda x: x[0]).zipWithIndex(), schema)
Now you can:
Use this LUT to get ALS-friendly integer IDs to provide to ALS
Use this LUT to do a reverse-lookup when you need to go back from ALS ID to the original ID
Do the same for items, obviously.
People have already recommended monotonically_increasing_id(), and mentioned the problem that it creates Longs, not Ints.
However, in my experience (caveat: this was Spark 1.6), if you use it on a single executor (repartition to 1 before), there is no executor prefix used and the number can be safely cast to Int. Obviously, you need to have fewer than Integer.MAX_VALUE rows.
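A sketch of that single-partition trick in the Scala API (the DataFrame, column names, and SparkSession are illustrative): with a single partition there is no partition prefix in the generated ids, so they start at 0 and fit in an Int as long as the row count stays below Integer.MAX_VALUE.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().master("local[*]").appName("als-int-ids").getOrCreate()
import spark.implicits._

val userids = Seq("alice", "bob", "carol").toDF("userid")

// Repartition to 1 so the ids have no partition prefix, then cast the Long ids down to Int.
val withAlsIds = userids.repartition(1)
  .withColumn("user_als_id", monotonically_increasing_id().cast(IntegerType))
withAlsIds.show()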