How to efficiently group every k rows in spark dataset? - apache-spark

I created a Spark Dataset[Row], where each Row is Row(x: Vector) and x is a 1 x p vector.
Is it possible to 1) group every k rows and 2) concatenate those rows into a k x p matrix mX, i.e. change Dataset[Row(Vector)] into Dataset[Row(Matrix)]?
Here is my current solution: convert this Dataset[Row] to an RDD, and concatenate every k rows using zipWithIndex and aggregateByKey.
val dataRDD = data_df.rdd.zipWithIndex
.map { case (line, index) => (index/k, line) }
.aggregateByKey(...) (..., ...)
But this doesn't seem very efficient. Is there a more efficient way to do it?
Thanks in advance.

There are two performance issues with your approach:
Using a global ordering
Doing a shuffle to build the groups of k
If you absolutely need a global ordering, starting from line 1, and you cannot break your data up into multiple partitions, then Spark has to move all the data through a single core. You can speed that part up by finding a way to have more than one partition.
You can avoid a shuffle by processing the data one partition at a time using mapPartitions:
spark.range(1, 20).coalesce(1).mapPartitions(_.grouped(5)).show
+--------------------+
| value|
+--------------------+
| [1, 2, 3, 4, 5]|
| [6, 7, 8, 9, 10]|
|[11, 12, 13, 14, 15]|
| [16, 17, 18, 19]|
+--------------------+
Note that coalesce(1) above forces all 19 rows into a single partition.
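For the original Dataset[Row(Vector)] use case, here is a hedged PySpark sketch of the same mapPartitions idea (the question itself is in Scala). It assumes a DataFrame data_df with a vector column x and a group size k (5 is just a placeholder), and uses numpy to stack each group of k row-vectors into a k x p matrix:
import numpy as np

def to_matrices(rows, k=5):
    chunk = []
    for row in rows:
        chunk.append(row.x.toArray())   # row.x is the 1 x p vector
        if len(chunk) == k:
            yield np.vstack(chunk)      # one k x p matrix per group
            chunk = []
    if chunk:                           # trailing group smaller than k
        yield np.vstack(chunk)

matrices_rdd = data_df.rdd.mapPartitions(to_matrices)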

Here is a solution that groups N records into columns:
Read the input into an RDD, convert it to a DataFrame, and process it as shown below.
Here g is the group, k is the record number within the group (it repeats from group to group), and v is your record content.
The input is a file of 6 lines, and I used groups of 3 here.
The only drawback is when the number of lines leaves a remainder smaller than the grouping N.
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.rdd.RDDFunctions._
val dfsFilename = "/FileStore/tables/7dxa9btd1477497663691/Text_File_01-880f5.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)
val rdd2 = readFileRDD.sliding(3,3).zipWithIndex
val rdd3 = rdd2.map(r => (r._1.zipWithIndex, r._2))
val df = rdd3.toDF("vk","g")
val df2 = df.withColumn("vke", explode($"vk")).drop("vk")
val df3 = df2.withColumn("k", $"vke._2").withColumn("v", $"vke._1").drop("vke")
val result = df3
  .groupBy("g")
  .pivot("k")
  .agg(expr("first(v)"))
result.show()
returns:
+---+--------------------+--------------------+--------------------+
| g| 0| 1| 2|
+---+--------------------+--------------------+--------------------+
| 0|The quick brown f...|Here he lays I te...|Gone are the days...|
| 1| Gosh, what to say.|Hallo, hallo, how...| I am fine.|
+---+--------------------+--------------------+--------------------+

Related

PySpark: how to aggregate over column arrays with variable width?

I am attempting to aggregate and create an array of means as follows (this is a Minimal Working Example):
n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
    array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from
Aggregate over column arrays in DataFrame in PySpark?
but the problem is that n is variable. How do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes variable length into consideration?
With arrays of unequal size in the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply rename to group), you could consider exploding the array column (the alleleFrequencies), introducing the position each value had within its array. That gives you an additional column you can use in the grouping to compute the average you had in mind. At this point you might actually have enough for further computations (see df3.show() below).
If you really must have it back as an array, that's harder and I don't have a tidy way to do it. What one must keep track of is the order, and I believe that's easy with a map (a dictionary, if you like). To do so, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic (you don't know the order in which values will be returned in the list, because rows are shuffled), the aggregation over both columns preserves their pairing, as the rows get shuffled in their entirety (see df4.show() below). From there, you can create a mapping from position to average with map_from_arrays.
Example:
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>>
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select(df.group, posexplode(df.values)) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages |
+-----+----------------------------------------+
|B |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0] |
+-----+----------------------------------------+
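If you do need the averages back as an array ordered by position rather than a map, here is a hedged sketch of one way (not from the original answer): collect (pos, avg) structs, sort them, and extract the field, reusing df3 from above. df6 is just a new name for illustration.
from pyspark.sql.functions import col, collect_list, sort_array, struct

df6 = (df3
    .groupBy("group")
    .agg(sort_array(collect_list(struct("pos", "avg_of_positions"))).alias("pairs"))
    .select("group", col("pairs.avg_of_positions").alias("positional_averages")))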

I want to do the same transformation in Python as I did in Scala

I'm new to Python.
Scala Code:
rdd1 contains strings in the following format:
val rdd1 = sc.parallelize(Seq("[Canada,47;97;33;94;6]", "[Canada,59;98;24;83;3]", "[Canada,77;63;93;86;62]"))
val resultRDD = rdd1.map { r =>
  val Array(country, values) = r.replaceAll("\\[|\\]", "").split(",")
  country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map {
  case (i1, i2) => i1.toInt + i2.toInt
}.mkString(";"))
Output:
Country,Values // I put the column names here to show that the output should have two columns
Canada,183;258;150;263;71
Edit: the OP wants to use map instead of flatMap, so I adjusted flatMap to map; you just take the first item out of the list comprehension, i.e. map(lambda x: [...][0]).
Side note: the above change is valid only in this particular case, where the list comprehension returns a list with a single item. For more general cases you might need two map()s to replace what flatMap() does.
One way with an RDD is to use a list comprehension to strip, split and convert the string into a key-value pair, with Country as key and a tuple of numbers as value. Since we use a list comprehension, we would take flatMap on the RDD elements, then use reduceByKey to do the calculation and mapValues to convert the resulting tuple back into a string:
rdd1.map(lambda x: [ (e[0], tuple(map(int, e[1].split(';')))) for e in [x.strip('][').split(',')] ][0]) \
.reduceByKey(lambda x,y: tuple([ x[i]+y[i] for i in range(len(x))]) ) \
.mapValues(lambda x: ';'.join(map(str,x))) \
.collect()
output after flatMap:
[('Canada', (47, 97, 33, 94, 6)),
('Canada', (59, 98, 24, 83, 3)),
('Canada', (77, 63, 93, 86, 62))]
output after reduceByKey:
[('Canada', (183, 258, 150, 263, 71))]
output after mapValues:
[('Canada', '183;258;150;263;71')]
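For reference, a sketch of the flatMap variant that the edit note above refers to; because the list comprehension yields exactly one pair per input line, it produces the same result as the map version:
rdd1.flatMap(lambda x: [ (e[0], tuple(map(int, e[1].split(';')))) for e in [x.strip('][').split(',')] ]) \
    .reduceByKey(lambda x,y: tuple([ x[i]+y[i] for i in range(len(x))]) ) \
    .mapValues(lambda x: ';'.join(map(str,x))) \
    .collect()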
You can do something like this
import pyspark.sql.functions as f
from pyspark.sql.functions import col
myRDD = sc.parallelize([('Canada', '47;97;33;94;6'), ('Canada', '59;98;24;83;3'),('Canada', '77;63;93;86;62')])
df = myRDD.toDF()
>>> df.show(10)
+------+--------------+
| _1| _2|
+------+--------------+
|Canada| 47;97;33;94;6|
|Canada| 59;98;24;83;3|
|Canada|77;63;93;86;62|
+------+--------------+
df.select(
col("_1").alias("country"),
f.split("_2", ";").alias("values"),
f.posexplode(f.split("_2", ";")).alias("pos", "val")
)\
.drop("val")\
.select(
"country",
f.concat(f.lit("position"),f.col("pos").cast("string")).alias("name"),
f.expr("values[pos]").alias("val")
)\
.groupBy("country").pivot("name").agg(f.sum("val"))\
.show()
+-------+---------+---------+---------+---------+---------+
|country|position0|position1|position2|position3|position4|
+-------+---------+---------+---------+---------+---------+
| Canada| 183.0| 258.0| 150.0| 263.0| 71.0|
+-------+---------+---------+---------+---------+---------+

Python Spark combineByKey Average

I'm trying to learn Spark in Python, and am stuck with combineByKey for averaging the values in key-value pairs. In fact, my confusion is not with the combineByKey syntax, but what comes afterward. The typical example (from the O'Reilly 2015 Learning Spark book) can be seen on the web in many places; here's one.
The problem is with the sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count)).collectAsMap() statement. Using Spark 2.0.1 and IPython 3.5.2, this throws a syntax error exception. Simplifying it to something that should work (and is what's in the O'Reilly book): sumCount.map(lambda key,vals: (key, vals[0]/vals[1])).collectAsMap() causes Spark to go bats**t crazy with Java exceptions, but I do note a TypeError: <lambda>() missing 1 required positional argument: 'v' error.
Can anyone point me to an example of this functionality that actually works with a recent version of Spark & Python? For completeness, I've included my own minimum working (or rather, non-working) example:
In: pRDD = sc.parallelize([("s",5),("g",3),("g",10),("c",2),("s",10),("s",3),("g",-1),("c",20),("c",2)])
In: cbk = pRDD.combineByKey(lambda x:(x,1), lambda x,y:(x[0]+y,x[1]+1),lambda x,y:(x[0]+y[0],x[1]+y[1]))
In: cbk.collect()
Out: [('s', (18, 3)), ('g', (12, 3)), ('c', (24, 3))]
In: cbk.map(lambda key,val:(k,val[0]/val[1])).collectAsMap() <-- errors
It's easy enough to compute [(e[0],e[1][0]/e[1][1]) for e in cbk.collect()], but I'd rather get the "Sparkic" way working.
Step by step:
lambda (key, (totalSum, count)): ... is so-called Tuple Parameter Unpacking, which has been removed in Python 3 (PEP 3113).
RDD.map takes a function which expects a single argument. The function you try to use:
lambda key, vals: ...
is a function which expects two arguments, not one. A valid translation from the 2.x syntax would be:
lambda key_vals: (key_vals[0], key_vals[1][0] / key_vals[1][1])
or:
def get_mean(key_vals):
    key, (total, cnt) = key_vals
    return key, total / cnt

cbk.map(get_mean)
You can also make this much simpler with mapValues:
cbk.mapValues(lambda x: x[0] / x[1])
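With the example pRDD above, either form should give something like:
cbk.mapValues(lambda x: x[0] / x[1]).collectAsMap()
# {'s': 6.0, 'g': 4.0, 'c': 8.0}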
Finally a numerically stable solution would be:
from pyspark.statcounter import StatCounter
(pRDD
    .combineByKey(
        lambda x: StatCounter([x]),
        StatCounter.merge,
        StatCounter.mergeStats)
    .mapValues(StatCounter.mean))
Averaging a column per group can also be done using the Window concept. Consider the following code:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([('a', 2), ('b', 3), ('a', 6), ('b', 5)],
['a', 'i'])
win = Window.partitionBy('a')
df.withColumn('avg', F.avg('i').over(win)).show()
Would yield:
+---+---+---+
| a| i|avg|
+---+---+---+
| b| 3|4.0|
| b| 5|4.0|
| a| 2|4.0|
| a| 6|4.0|
+---+---+---+
The average aggregation is done on each worker separately, requires no round trip to the driver, and is therefore efficient.
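If you only need the per-key averages themselves, rather than every row annotated with its average, a plain groupBy aggregation is a simpler sketch (reusing df and F from above):
df.groupBy('a').agg(F.avg('i').alias('avg')).show()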

ALS in mllib vs ALS in ml ---- spark [duplicate]

I have the following Python test code (the arguments to ALS.train are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
This works: the predictions variable has a count of 1, and the code outputs:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
However, when I try and use an RDD I created myself using the following code, it doesn't appear to work anymore:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
Which outputs:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
As you can see, predictAll comes back empty when passed the mapped RDD. The values going in are both of the same format. The only noticeable difference that I can see is that the first example uses parallelize and produces a ParallelCollectionRDD, whereas the second example just uses a map, which produces a PythonRDD. Does predictAll only work if passed a certain type of RDD? If so, is it possible to convert between RDD types? I'm not sure how to get this working.
There are two basic conditions under which MatrixFactorizationModel.predictAll may return an RDD with fewer items than the input:
user is missing in the training set.
product is missing in the training set.
You can easily reproduce this behavior and check that it does not depend on the way the RDD has been created. First, let's use example data to build a model:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
def parse(s):
    x, y, z = s.split(",")
    return Rating(int(x), int(y), float(z))

ratings = (sc.textFile("data/mllib/als/test.data")
           .map(parse)
           .union(sc.parallelize([Rating(1, 5, 4.0)])))
model = ALS.train(ratings, 10, 10)
Next, let's see which products and users are present in the training data:
set(ratings.map(lambda r: r.product).collect())
## {1, 2, 3, 4, 5}
set(ratings.map(lambda r: r.user).collect())
## {1, 2, 3, 4}
Now let's create test data and check predictions:
valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
valid_test
## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423
model.predictAll(valid_test).count()
## 3
So far so good. Next, let's map it using the same logic as in your code:
valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
valid_test_
## PythonRDD[497] at RDD at PythonRDD.scala:43
model.predictAll(valid_test_).count()
## 3
Still fine. Next, let's create invalid data and repeat the experiment:
invalid_test = sc.parallelize([
    (2, 6),  # No product in the training data
    (6, 1)   # No user in the training data
])
invalid_test
## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423
model.predictAll(invalid_test).count()
## 0
invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
model.predictAll(invalid_test_).count()
## 0
As expected there are no predictions for invalid input.
Finally, you can confirm this is really the case by using the ML model, whose training and prediction are completely independent of the Python code:
from pyspark.ml.recommendation import ALS as MLALS
model_ml = MLALS(rank=10, maxIter=10).fit(
    ratings.toDF(["user", "item", "rating"])
)
model_ml.transform((valid_test + invalid_test).toDF(["user", "item"])).show()
## +----+----+----------+
## |user|item|prediction|
## +----+----+----------+
## | 6| 1| NaN|
## | 1| 4| 1.0184212|
## | 2| 5| 4.0041084|
## | 3| 5|0.40498763|
## | 2| 6| NaN|
## +----+----+----------+
As you can see, a missing user / item in the training data means no prediction.
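If you want predictAll to return a prediction for every remaining pair, one hedged workaround (not from the original answer) is to filter the test RDD down to users and products seen in the training data before predicting; for small key sets, driver-side Python sets are enough. This reuses the variables defined above and should keep the three valid pairs:
seen_users = set(ratings.map(lambda r: r.user).collect())
seen_products = set(ratings.map(lambda r: r.product).collect())

test = valid_test + invalid_test   # RDD union
filtered_test = test.filter(lambda up: up[0] in seen_users and up[1] in seen_products)
model.predictAll(filtered_test).count()
## 3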

Apply Model Scores to Spark DataFrame - Python

I'm trying to apply a score to a Spark DataFrame using PySpark. Let's assume that I built a simple regression model outside of Spark and want to map the coefficient values created in the model to the individual columns in the DataFrame, to create a new column that is the sum of each of the different source columns multiplied by the individual coefficients. I understand that there are many utilities in Spark mllib for modeling, but I want to understand how this 'brute force' method could be accomplished. I also know that DataFrames/RDDs are immutable, so a new DataFrame would have to be created.
Here's some pseudo-code for reference:
#load example data
df = sqlContext.createDataFrame(data)
df.show(5)
dfmappd.select("age", "parch", "pclass").show(5)
+----+-----+------+
| age|parch|pclass|
+----+-----+------+
|22.0| 0| 3|
|38.0| 0| 1|
|26.0| 0| 3|
|35.0| 0| 1|
|35.0| 0| 3|
+----+-----+------+
only showing top 5 rows
The model created outside of Spark is a logistic regression model based on a binary response. So essentially I want to map the logit function to these three columns to produce a fourth scored column. Here are the coefficients from the model:
intercept: 3.435222
age: -0.039841
parch: 0.176439
pclass: -1.239452
Here is a description of the logit function for reference:
https://en.wikipedia.org/wiki/Logistic_regression
For comparison, here is how I would do the same thing in R using tidyr and dplyr
library(dplyr)
library(tidyr)
#Example data
Age <- c(22, 38, 26, 35, 35)
Parch <- c(0,0,0,0,0)
Pclass <- c(3, 1, 3, 1, 3)
#Wrapped in a dataframe
mydf <- data.frame(Age, Parch, Pclass)
#Using dplyr to create a new dataframe with mutated column
scoredf = mydf %>%
mutate(score = round(1/(1 + exp(-(3.435 + -0.040 * Age + 0.176 * Parch + -1.239 * Pclass))),2))
scoredf
If I interpret your question correctly, you want to compute the class conditional probability of each sample given the coefficients you computed offline and do it "manually".
Does something like this work:
import math

from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

def myLogisticFunc(age, parch, pclass):
    intercept = 3.435222
    betaAge = -0.039841
    betaParch = 0.176439
    betaPclass = -1.239452
    z = intercept + betaAge * age + betaParch * parch + betaPclass * pclass
    return 1.0 / (1.0 + math.exp(-z))

# declare the return type so the score column is numeric rather than a string
myLogisticFuncUDF = udf(myLogisticFunc, DoubleType())
df.withColumn("score", myLogisticFuncUDF(col("age"), col("parch"), col("pclass"))).show()
