Dataframe withColumn function - apache-spark

I know the general structure of how to use the withColumn function with a DataFrame, like:
df = df.withColumn("new_column_name", ((df.someColumn > someValue) & (df.someColumn < someOtherValue)))
Let's say, now, that the operator information (> and < in the above example) is stored as a string (input by the user). How can I perform the above kind of operations? One naive way I can think of is to write many if-else blocks, one for each kind of operation, and whenever we want to add a new operation we would have to add more if-else blocks.
What obvious tweaks am I missing here?
Thanks in advance.
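A minimal sketch of one way to turn user-supplied operator strings into column expressions without a long if/else chain, using Python's operator module; the DataFrame, column name, and bounds below are made up for illustration:
import operator
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (5,), (9,)], ["someColumn"])
# Lookup table from operator strings to functions; Column overloads these operators,
# so the same functions work on DataFrame columns.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}
op1, op2 = ">", "<"                      # e.g. strings supplied by the user
some_value, some_other_value = 2, 8
df = df.withColumn(
    "new_column_name",
    OPS[op1](df.someColumn, some_value) & OPS[op2](df.someColumn, some_other_value),
)
Another option, if the user input is validated first, is to assemble the predicate as a SQL string and pass it to pyspark.sql.functions.expr.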

Related

Choose between map+inner loop and flatMapValues+reduceByKey

I have data like the following in a pair RDD, and I would like to collect a map with the username as key and the sum of each list as value. The number of users is very large, say 100M+, and the lists are <1k in size. There are two choices I can think of: mapToPair and sum the list with a simple for loop inside mapToPair, or flatMapValues the list to create <user, value> pairs and then reduceByKey. Which way is better?
Seq(
("user1",List(8,2,....)),
("user2",List(1,12,.....)),
...
("userN",List(99,5,...))
)
I would guess rdd.mapValues(_.sum) would be faster because you iterate over the elements once instead of twice (once to flatten, once to reduce).
But the best answer would be to just test it and see.
The best tip I can think of, though, is to try to work with DataFrames or Datasets (Spark SQL) to begin with. If you end up with a flattened DataFrame you can call df.groupBy($"user").agg(F.sum($"value")), or if you have a DataFrame shaped like the RDD you described you can just use the aggregate SQL function.
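For what it's worth, here is a rough PySpark sketch of the options discussed above (the tiny dataset and column names are illustrative; the original data is a Scala Seq):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
pair_rdd = sc.parallelize([("user1", [8, 2]), ("user2", [1, 12]), ("userN", [99, 5])])
# Option 1: one pass, summing each list inside mapValues.
sums_a = pair_rdd.mapValues(sum)
# Option 2: flatten to (user, value) pairs, then reduceByKey.
sums_b = pair_rdd.flatMapValues(lambda xs: xs).reduceByKey(lambda a, b: a + b)
# DataFrame route: sum the array column with the aggregate SQL function (Spark 2.4+).
# 0L is a bigint zero so the accumulator type matches the array<bigint> elements.
df = pair_rdd.toDF(["user", "values"])
sums_df = df.select("user", F.expr("aggregate(values, 0L, (acc, x) -> acc + x)").alias("total"))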

Update a pyspark Delta Table using a python boolean function

So I have a Delta table that I want to update based on a condition that combines two column values; i.e.
delta_table.update(
    condition=is_eligible(col("name"), col("age")),
    set={"pension_eligible": lit("yes")}
)
I'm aware that I can do something similar to:
delta_table.update(
    condition=(col("name") == "Einar") & (col("age") > 65),
    set={"pension_eligible": lit("yes")}
)
But since my logic for computing this is quite complex (I need to look up the name in a database), I would like to define my own Python function (is_eligible(...)) for computing it. Another reason is that this function is used elsewhere, and I would like to minimize code duplication.
Is this possible at all? As I understand it, you could define it as a UDF, but they only take one parameter and I need at least two. I cannot find anything about more complex conditions in the Delta Lake documentation, so I'd really appreciate some guidance here.
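Not an authoritative answer, but a minimal sketch of how a multi-column UDF could serve as the condition, assuming a Delta-enabled SparkSession named spark; the table path and the body of is_eligible are placeholders (PySpark UDFs can in fact take more than one column):
from delta.tables import DeltaTable
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import BooleanType
def is_eligible(name, age):
    # Placeholder for the real logic (database lookup etc.); reusable outside Spark too.
    return age is not None and age > 65
is_eligible_udf = udf(is_eligible, BooleanType())  # wrap it so it can be evaluated row by row
delta_table = DeltaTable.forPath(spark, "/tmp/people")  # hypothetical table path
delta_table.update(
    condition=is_eligible_udf(col("name"), col("age")),
    set={"pension_eligible": lit("yes")},
)
One caveat: a UDF that queries a database runs that lookup once per row, so caching or broadcasting the lookup data may be worth considering.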

PySpark: combine aggregate and window functions

I am working with legacy Spark SQL code like this:
SELECT
    column1,
    max(column2),
    first_value(column3),
    last_value(column4)
FROM
    tableA
GROUP BY
    column1
ORDER BY
    columnN
I am rewriting it in PySpark as below
df.groupBy(column1).agg(max(column2), first(column3), last(column4)).orderBy(columnN)
When I'm comparing the two outcomes I can see differences in the fields generated by the first_value/first and last_value/last functions.
Are they behaving in a non-deterministic way when used outside of Window functions?
Can groupBy aggregates be combined with Window functions?
This behaviour is possible when you have a wide table and you don't specify an ordering for the remaining columns. What happens under the hood is that Spark takes the first() or last() row, whichever is available to it as the first condition-matching row on the heap. Spark SQL and PySpark might access different elements because the ordering is not specified for the remaining columns.
In terms of Window functions, you can use a partitionBy(f.col('column_name')) in your Window, which kind of works like a groupBy - it groups the data according to a partitioning column. However, without specifying the ordering for all columns, you might arrive at the same problem of non-determinism. Hope this helps!
For completeness' sake, I recommend you have a look at the PySpark docs for the first() and last() functions here: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.first
In particular, the following note sheds light on why your behaviour was non-deterministic:
Note The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
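If the intent of the legacy query was "first/last with respect to columnN", one way to pin that ordering down inside a plain groupBy is sketched below; this assumes Spark 3.3+ (where min_by/max_by exist in pyspark.sql.functions) and that df is the DataFrame from the question:
from pyspark.sql import functions as F
result = (
    df.groupBy("column1")
      .agg(
          F.max("column2").alias("max_col2"),
          F.min_by("column3", "columnN").alias("first_col3"),  # column3 on the row with the smallest columnN
          F.max_by("column4", "columnN").alias("last_col4"),   # column4 on the row with the largest columnN
      )
)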
Definitely!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
partition = Window.partitionBy("column1").orderBy("columnN")
data = data.withColumn("max_col2", F.max(F.col("column2")).over(partition))\
    .withColumn("first_col3", F.first(F.col("column3")).over(partition))\
    .withColumn("last_col4", F.last(F.col("column4")).over(partition))
data.show(10, False)

How can I access values outside of Spark GraphX .map loop?

Brand new to Apache Spark and I'm a little confused about how to make updates to a value that sits outside of a .mapTriplets iteration in GraphX. See below:
def mapTripletsMethod(edgeWeights: Graph[Int, Double], stationaryDistribution: Graph[Double, Double]) = {
  val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)
  stationaryDistribution.mapTriplets { e =>
    val row = e.srcId.toInt
    val column = e.dstId.toInt
    var cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
    tempMatrix.set(row, column, cellValue) // this doesn't do anything to tempMatrix
    e
  }
}
I'm guessing this is due to the design of RDDs, and that there's no simple way to update the tempMatrix value. When I run the above code the tempMatrix.set method does nothing. It was rather difficult to follow the problem in the debugger.
Does anyone have an easy solution? Thank you!
Edit
I've made an update above to show that stationaryDistribution is a graph RDD.
You could make tempMatrix be of type RDD[((Int,Int), Double)] -- that is, each entry is a pair where the first element is in turn a (row,col) pair. Then use the PairRDDFunctions class to combine that with ((row,col),weight) triplets generated by your mapTriplets call. (So, don't think of it as updating the tempMatrix, but rather combining two RDDs to get a third.)
If you need to support stationary distribution graphs where there is more than one edge per vertex pair it gets a little tricky: you'll probably need to combine those edges in a reduction pass to create an RDD with one entry per pair, with a list of weights, and then apply all the weights to a given (row,col) pair at the same time. Otherwise it's very simple.
Notice that PairRDDFunctions on the one hand gives you ways to combine multiple RDDs into one, and on the other hand ways to pull the values out into a Map on the master. Assuming that the distribution matrix is large enough to merit an RDD in the first place, I think you should do the whole thing on RDDs.
Another approach is to make the tempMatrix be a GraphRDD too, which may or may not make sense depending on what you're going to do with it next.
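GraphX itself is Scala-only, but the "build a third RDD instead of mutating one" idea can be sketched with plain pair RDDs; everything below (names, values, the combine formula) is illustrative PySpark, not the original GraphX code:
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate().sparkContext
# Matrix entries and per-edge updates, both keyed by (row, col).
temp_matrix = sc.parallelize([((0, 1), 0.5), ((1, 2), 0.25)])
edge_updates = sc.parallelize([((0, 1), 0.1), ((1, 2), 0.2)])
# Join on the (row, col) key and derive new cell values (mirroring -1 * old + update),
# producing a third RDD rather than updating temp_matrix in place.
updated_matrix = temp_matrix.join(edge_updates).mapValues(lambda pair: -1 * pair[0] + pair[1])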

How to assign unique contiguous numbers to elements in a Spark RDD

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm.
The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs.
Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.
I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.
Starting with Spark 1.0 there are two methods you can use to solve this easily:
RDD.zipWithIndex is just like Seq.zipWithIndex: it adds contiguous (Long) numbers. It needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.
RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double-evaluation.
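A small PySpark sketch of the zipWithIndex route, assuming the distinct usernames already sit in an RDD (the names here are made up):
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate().sparkContext
users = sc.parallelize(["user1", "user2", "userN"]).distinct()
users.cache()  # zipWithIndex makes an extra pass to count elements per partition
user_ids = users.zipWithIndex()        # (username, contiguous Long id) pairs
id_by_user = dict(user_ids.collect())  # small enough here to pull onto the driver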
For a similar example use case, I just hashed the string values. See http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag =>(nnHash(tag),tag))
It sounds like you're already doing something like this, although hashing can be easier to manage.
Matei suggested here an approach to emulating zipWithIndex on an RDD, which amounts to assigning IDs within each partition that are going to be globally unique: https://groups.google.com/forum/#!topic/spark-users/WxXvcn2gl1E
Another easy option, if you are using DataFrames and are just concerned about uniqueness, is to use the function MonotonicallyIncreasingID:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
val newDf = df.withColumn("uniqueIdColumn", monotonicallyIncreasingId)
Edit: MonotonicallyIncreasingID was deprecated and removed since Spark 2.0; it is now known as monotonically_increasing_id.
monotonically_increasing_id() appears to be the answer, but unfortunately it won't work for ALS since it produces 64-bit numbers and ALS expects 32-bit ones (see my comment below radek1st's answer for details).
The solution I found is to use zipWithIndex(), as mentioned in Darabos' answer. Here's how to implement it:
If you already have a single-column DataFrame with your distinct users called userids, you can create a lookup table (LUT) as follows:
# PySpark code
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
user_als_id_LUT = sqlContext.createDataFrame(
    userids.rdd.map(lambda x: x[0]).zipWithIndex(),
    StructType([
        StructField("userid", StringType(), True),
        StructField("user_als_id", IntegerType(), True),
    ]),
)
Now you can:
Use this LUT to get ALS-friendly integer IDs to provide to ALS
Use this LUT to do a reverse-lookup when you need to go back from ALS ID to the original ID
Do the same for items, obviously.
People have already recommended monotonically_increasing_id(), and mentioned the problem that it creates Longs, not Ints.
However, in my experience (caveat: Spark 1.6), if you use it on a single partition (repartition to 1 beforehand), there is no partition prefix used, and the numbers can be safely cast to Int. Obviously, you need to have fewer than Integer.MAX_VALUE rows.
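A hedged sketch of that single-partition trick, assuming df is the DataFrame whose rows need IDs and that it comfortably fits in one partition:
from pyspark.sql import functions as F
df_with_ids = df.repartition(1).withColumn(
    "user_als_id",
    F.monotonically_increasing_id().cast("int"),  # no partition prefix, so values stay within Int range
)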

Resources