I created a Spark Dataset[Row], where each Row is Row(x: Vector) and x is a 1 x p vector.
Is it possible to 1) group every k rows and 2) concatenate those rows into a k x p matrix mX, i.e., change Dataset[Row(Vector)] to Dataset[Row(Matrix)]?
Here is my current solution: convert the Dataset[Row] to an RDD, and concatenate every k rows using zipWithIndex and aggregateByKey.
val dataRDD = data_df.rdd.zipWithIndex
.map { case (line, index) => (index/k, line) }
.aggregateByKey(...) (..., ...)
But this doesn't seem very efficient. Is there a better way to do it?
Thanks in advance.
There are two performance issues with your approach:
Using a global ordering
Doing a shuffle to build the groups of k
If you absolutely need a global ordering, starting from line 1, and you cannot break your data up into multiple partitions, then Spark has to move all the data through a single core. You can speed that part up by finding a way to create more than one partition.
You can avoid a shuffle by processing the data one partition at a time using mapPartitions:
spark.range(1, 20).coalesce(1).mapPartitions(_.grouped(5)).show
+--------------------+
| value|
+--------------------+
| [1, 2, 3, 4, 5]|
| [6, 7, 8, 9, 10]|
|[11, 12, 13, 14, 15]|
| [16, 17, 18, 19]|
+--------------------+
Note that coalesce(1) above is forcing all 19 rows into a single partition.
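If the end goal from the question is a k x p matrix per group, here is a minimal PySpark sketch of the same partition-local idea (the data, the value of k, and the column name x are illustrative assumptions), packing each group of k vector rows into a DenseMatrix:
import numpy as np
from pyspark.ml.linalg import DenseMatrix, Vectors

k = 2  # rows per matrix, illustrative
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),),
     (Vectors.dense([5.0, 6.0]),), (Vectors.dense([7.0, 8.0]),)], ["x"])

def to_matrices(rows):
    buf = []
    for row in rows:
        buf.append(row.x.toArray())
        if len(buf) == k:
            # values are passed row-major, hence isTransposed=True
            yield (DenseMatrix(k, len(buf[0]), np.vstack(buf).ravel().tolist(), True),)
            buf = []
    if buf:  # remainder group smaller than k
        yield (DenseMatrix(len(buf), len(buf[0]), np.vstack(buf).ravel().tolist(), True),)

mat_df = df.coalesce(1).rdd.mapPartitions(to_matrices).toDF(["mX"])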
Here is a solution that groups N records into columns:
Convert the RDD to a DataFrame and process it as shown below: g is the group, k is the record number within a group (it repeats across groups), and v is the record content.
The input is a file of 6 lines, and groups of 3 are used here.
The only drawback is when the last group has fewer than N lines.
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.rdd.RDDFunctions._  // provides sliding on RDDs

val dfsFilename = "/FileStore/tables/7dxa9btd1477497663691/Text_File_01-880f5.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)

// non-overlapping windows of 3 lines, each tagged with its group index g
val rdd2 = readFileRDD.sliding(3, 3).zipWithIndex
// number the records within each group: ((v, k) pairs, g)
val rdd3 = rdd2.map(r => (r._1.zipWithIndex, r._2))
val df = rdd3.toDF("vk", "g")
val df2 = df.withColumn("vke", explode($"vk")).drop("vk")
val df3 = df2.withColumn("k", $"vke._2").withColumn("v", $"vke._1").drop("vke")
// pivot the record numbers into columns, one row per group g
val result = df3
  .groupBy("g")
  .pivot("k")
  .agg(expr("first(v)"))
result.show()
returns:
+---+--------------------+--------------------+--------------------+
| g| 0| 1| 2|
+---+--------------------+--------------------+--------------------+
| 0|The quick brown f...|Here he lays I te...|Gone are the days...|
| 1| Gosh, what to say.|Hallo, hallo, how...| I am fine.|
+---+--------------------+--------------------+--------------------+
I have a Spark DataFrame with one column that has lots of zeros and very few ones (only 0.01% ones).
I'd like to take a random subsample, but a stratified one, so that it keeps the ratio of 1s to 0s in that column.
Is it possible to do this in PySpark?
I am looking for a non-Scala solution based on DataFrames rather than RDDs.
The solution I suggested in Stratified sampling in Spark is pretty straightforward to convert from Scala to Python (or even to Java - What's the easiest way to stratify a Spark Dataset?).
Nevertheless, I'll rewrite it in Python. Let's start by creating a toy DataFrame:
from pyspark.sql.functions import lit
data = [(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)]
df = spark.createDataFrame(data, ["x1","x2","x3"])
df.show()
# +----------+----------+---+
# | x1| x2| x3|
# +----------+----------+---+
# |2147481832| 23355149| 1|
# |2147481832| 973010692| 1|
# |2147481832|2134870842| 1|
# |2147481832| 541023347| 1|
# |2147481832|1682206630| 1|
# |2147481832|1138211459| 1|
# |2147481832| 852202566| 1|
# |2147481832| 201375938| 1|
# |2147481832| 486538879| 1|
# |2147481832| 919187908| 1|
# | 214748183| 919187908| 1|
# | 214748183| 91187908| 1|
# +----------+----------+---+
As you can see, this DataFrame has 12 rows:
df.count()
# 12
They are distributed as follows:
df.groupBy("x1").count().show()
# +----------+-----+
# | x1|count|
# +----------+-----+
# |2147481832| 10|
# | 214748183| 2|
# +----------+-----+
Now let's sample. First we'll set the seed:
seed = 12
Then find the keys to fraction on and sample:
fractions = df.select("x1").distinct().withColumn("fraction", lit(0.8)).rdd.collectAsMap()
print(fractions)
# {2147481832: 0.8, 214748183: 0.8}
sampled_df = df.stat.sampleBy("x1", fractions, seed)
sampled_df.show()
# +----------+---------+---+
# | x1| x2| x3|
# +----------+---------+---+
# |2147481832| 23355149| 1|
# |2147481832|973010692| 1|
# |2147481832|541023347| 1|
# |2147481832|852202566| 1|
# |2147481832|201375938| 1|
# |2147481832|486538879| 1|
# |2147481832|919187908| 1|
# | 214748183|919187908| 1|
# | 214748183| 91187908| 1|
# +----------+---------+---+
We can now check the content of our sample (note that sampleBy uses per-key Bernoulli sampling, so the counts only approximate the requested fraction):
sampled_df.count()
# 9
sampled_df.groupBy("x1").count().show()
# +----------+-----+
# | x1|count|
# +----------+-----+
# |2147481832| 7|
# | 214748183| 2|
# +----------+-----+
Assume you have the Titanic dataset in a 'data' DataFrame, which you want to split into train and test sets using stratified sampling based on the 'Survived' target variable.
# Check initial distributions of 0's and 1's
data.groupBy("Survived").count().show()
+--------+-----+
|Survived|count|
+--------+-----+
|       1|  342|
|       0|  549|
+--------+-----+
# Taking 70% of both 0's and 1's into training set
train = data.sampleBy("Survived", fractions={0: 0.7, 1: 0.7}, seed=10)
# Subtracting 'train' from original 'data' to get test set
test = data.subtract(train)
# Checking distributions of 0's and 1's in train and test sets after the sampling
train.groupBy("Survived").count().show()
+--------+-----+
|Survived|count|
+--------+-----+
| 1| 239|
| 0| 399|
+--------+-----+
test.groupBy("Survived").count().show()
+--------+-----+
|Survived|count|
+--------+-----+
| 1| 103|
| 0| 150|
+--------+-----+
This can be accomplished pretty easily with 'randomSplit' and 'union' in PySpark.
# read in data
df = spark.read.csv(file, header=True)
# split dataframes between 0s and 1s
zeros = df.filter(df["Target"]==0)
ones = df.filter(df["Target"]==1)
# split datasets into training and testing
train0, test0 = zeros.randomSplit([0.8,0.2], seed=1234)
train1, test1 = ones.randomSplit([0.8,0.2], seed=1234)
# stack datasets back together
train = train0.union(train1)
test = test0.union(test1)
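As a quick sanity check (assuming the same Target column as above), you can confirm the 0/1 ratio survived the split:
# per-class counts should sit roughly at 80/20 between train and test
train.groupBy("Target").count().show()
test.groupBy("Target").count().show()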
This is based on the accepted answer from eliasah and this SO thread.
If you want to get back a train and a test set, you can use the following function:
from pyspark.sql import functions as F
def stratified_split_train_test(df, frac, label, join_on, seed=42):
    """Stratified split of a DataFrame into train and test sets.
    Inspiration taken from:
    https://stackoverflow.com/a/47672336/1771155
    https://stackoverflow.com/a/39889263/1771155"""
    fractions = df.select(label).distinct().withColumn("fraction", F.lit(frac)).rdd.collectAsMap()
    # sample the training fraction per label value
    df_frac = df.stat.sampleBy(label, fractions, seed)
    # everything not sampled becomes the test set
    df_remaining = df.join(df_frac, on=join_on, how="left_anti")
    return df_frac, df_remaining
To create a stratified train and test set where 80% of the total is used for the training set:
df_train, df_test = stratified_split_train_test(df=df, frac=0.8, label="y", join_on="unique_id")
You can use the function below. I combined the other answers to build it.
import pyspark.sql.functions as f
from pyspark.sql import DataFrame as SparkDataFrame

def train_test_split_pyspark(
    df: SparkDataFrame,
    stratify_column: str,
    unique_col: str = None,
    train_fraction: float = 0.05,
    validation_fraction: float = 0.005,
    test_fraction: float = 0.005,
    seed: int = 1234,
    to_pandas: bool = True,
):
    if not unique_col:
        unique_col = "any_unique_name_here"
        df = df.withColumn(unique_col, f.monotonically_increasing_id())

    # Train data
    train_fraction_dict = (
        df.select(stratify_column)
        .distinct()
        .withColumn("fraction", f.lit(train_fraction))
        .rdd.collectAsMap()
    )
    df_train = df.stat.sampleBy(stratify_column, train_fraction_dict, seed)
    df_remaining = df.join(df_train, on=unique_col, how="left_anti")

    # Validation data
    validation_fraction_dict = {key: validation_fraction for key in train_fraction_dict}
    df_val = df_remaining.stat.sampleBy(stratify_column, validation_fraction_dict, seed)
    df_remaining = df_remaining.join(df_val, on=unique_col, how="left_anti")

    # Test data
    test_fraction_dict = {key: test_fraction for key in train_fraction_dict}
    df_test = df_remaining.stat.sampleBy(stratify_column, test_fraction_dict, seed)

    # Drop the helper column if it was generated here
    if unique_col == "any_unique_name_here":
        df_train = df_train.drop(unique_col)
        df_val = df_val.drop(unique_col)
        df_test = df_test.drop(unique_col)

    if to_pandas:
        return (df_train.toPandas(), df_val.toPandas(), df_test.toPandas())
    return df_train, df_val, df_test
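A hedged usage sketch (the DataFrame df and the label column name "label" are assumptions for illustration):
train_pdf, val_pdf, test_pdf = train_test_split_pyspark(
    df,
    stratify_column="label",
    train_fraction=0.7,
    validation_fraction=0.15,
    test_fraction=0.15,
)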
To avoid rows appearing in both the train and test splits, or disappearing entirely, I would further add to Vincent Claes's solution:
from pyspark.sql import DataFrame
from pyspark.sql import functions as f

def stratifiedSampler(sparkDf: DataFrame, ratio: float,
                      label: str, joinOn: str, seed=42):
    fractions = (sparkDf.select(label).distinct()
                 .withColumn("fraction", f.lit(ratio))
                 .rdd.collectAsMap())
    fracDf = sparkDf.stat.sampleBy(label, fractions, seed)
    # materialize the sample so the anti-join below sees a fixed set of rows
    fracDf = fracDf.localCheckpoint()
    remainingDf = sparkDf.join(fracDf, on=joinOn, how="left_anti")
    return (fracDf, remainingDf)
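The localCheckpoint call is what keeps rows from moving between the two splits: it materializes the sampled rows, so the anti-join runs against a fixed sample rather than a re-evaluated one. Usage might look like this (a sketch; "y" and "unique_id" are assumed column names):
train_df, test_df = stratifiedSampler(df, ratio=0.8, label="y", joinOn="unique_id")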
When training a model, say linear regression, we may apply a normalization such as MinMaxScaler to the train and test datasets.
After we have a trained model and use it to make predictions, how do we scale the predictions back to the original representation?
In Python, there is an "inverse_transform" method. For example:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)
print(dataScaled)
scaler.inverse_transform(dataScaled)
Is there a similar method in Spark?
I have googled a lot, but found no answer. Can anyone give me some suggestions?
Thank you very much!
In our company, to solve the same problem for the StandardScaler, we extended spark.ml with this (among other things):
package org.apache.spark.ml

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.util.Identifiable

package object feature {

  implicit class RichStandardScalerModel(model: StandardScalerModel) {

    private def invertedStdDev(sigma: Double): Double = 1 / sigma

    private def invertedMean(mu: Double, sigma: Double): Double = -mu / sigma

    def inverse(newOutputCol: String): StandardScalerModel = {
      val sigma: linalg.Vector = model.std
      val mu: linalg.Vector = model.mean
      val newSigma: linalg.Vector = new DenseVector(sigma.toArray.map(invertedStdDev))
      val newMu: linalg.Vector = new DenseVector(mu.toArray.zip(sigma.toArray).map { case (m, s) => invertedMean(m, s) })
      val inverted: StandardScalerModel = new StandardScalerModel(Identifiable.randomUID("stdScal"), newSigma, newMu)
        .setInputCol(model.getOutputCol)
        .setOutputCol(newOutputCol)
      inverted
        .set(inverted.withMean, model.getWithMean)
        .set(inverted.withStd, model.getWithStd)
    }
  }
}
It should be fairly easy to modify it or do something similar for your specific case.
Keep in mind that due to JVM's double implementation, you normally lose precision in these operations, so you will not recover the exact original values you had before the transformation (e.g.: you will probably get something like 1.9999999999999998 instead of 2.0).
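For instance, a tiny standalone illustration of that round-trip error (plain Python, purely illustrative):
x, mu, sigma = 2.0, 3.7, 1.3
scaled = (x - mu) / sigma       # forward standardization
restored = scaled * sigma + mu  # inverse transform
print(restored)                 # often prints something like 1.9999999999999998 rather than 2.0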
There is no direct solution here.
Since an array can be passed to a UDF only when the array is a column (lit(array) won't do the trick), I am using the following workaround.
In a nutshell, it turns an inverted-scales array into a string, passes it to the UDF, and does the math.
You can use that scales array (string) in an inverse function (also attached here) to get the inverted values.
Code:
import numpy as np
from pyspark.sql.functions import col, lit, udf
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import SparseVector, DenseVector, Vectors, VectorUDT
df = spark.createDataFrame([
(0, 1, 0.5, -1),
(1, 2, 1.0, 1),
(2, 4, 10.0, 2)
], ["id", 'x1', 'x2', 'x3'])
df.show()
def Normalize(df):
    # take the mean and stddev rows from describe()
    scales = df.describe()
    scales = scales.filter("summary = 'mean' or summary = 'stddev'")
    scales = scales.select(["summary"] + [col(c).cast("double") for c in scales.columns[1:]])
    assembler = VectorAssembler(
        inputCols=scales.columns[1:],
        outputCol="X_scales")
    df_scales = assembler.transform(scales)
    x_mean = df_scales.filter("summary = 'mean'").select('X_scales')
    x_std = df_scales.filter("summary = 'stddev'").select('X_scales')
    # serialize the scale vectors as '|' separated strings so they can be
    # passed to the UDF as literal columns
    ks_std_lit = lit('|'.join([str(s) for s in list(x_std.collect()[0].X_scales)]))
    ks_mean_lit = lit('|'.join([str(s) for s in list(x_mean.collect()[0].X_scales)]))
    assembler = VectorAssembler(
        inputCols=df.columns[0:4],
        outputCol="features")
    df_features = assembler.transform(df)
    df_features = df_features.withColumn('Scaled', exec_norm_udf(df_features.features, ks_mean_lit, ks_std_lit))
    return df_features, ks_mean_lit, ks_std_lit
def exec_norm(vector, x_mean, x_std):
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) - np.array(x_mean)) / np.array(x_std)
    res = list(res)
    return Vectors.dense(res)

exec_norm_udf = udf(exec_norm, VectorUDT())

def scaler_invert(vector, x_mean, x_std):
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) * np.array(x_std)) + np.array(x_mean)
    res = list(res)
    return Vectors.dense(res)

scaler_invert_udf = udf(scaler_invert, VectorUDT())
df, scaler_mean, scaler_std = Normalize(df)
df.withColumn('inverted', scaler_invert_udf(df.Scaled, scaler_mean, scaler_std)).show(truncate=False)
Maybe I'm too late to the party; however, I recently faced exactly the same problem and couldn't find any viable solution.
Presume that the author of this question doesn't need to invert the MinMax values of vectors; instead, only one column needs to be inverted.
The min and max values of that column, as well as the min-max parameters of the scaler, are known.
Maths behind MinMaxScaler as per scikit learn website:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
"Reverse-engineered" MinMaxScaler formula
X_scaled = (X - Xmin) / (Xmax) - Xmin) * (max - min) + min
X = (max * Xmin - min * Xmax - Xmin * X_scaled + Xmax * X_scaled)/(max - min)
Implementation
from sklearn.preprocessing import MinMaxScaler
import pandas
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)
data_sp = spark.createDataFrame(pandas.DataFrame(data, columns=["x", "y"]).join(pandas.DataFrame(dataScaled, columns=["x_scaled", "y_scaled"])))
data_sp.show()
print("Inversing column: y_scaled")
Xmax = data_sp.select("y").rdd.max()[0]
Xmin = data_sp.select("y").rdd.min()[0]
_max = scaler.feature_range[1]
_min = scaler.feature_range[0]
print("Xmax =", Xmax, "Xmin =", Xmin, "max =", _max, "min =", _min)
data_sp.withColumn(colName="y_scaled_inversed", col=(_max * Xmin - _min * Xmax - Xmin * data_sp.y_scaled + Xmax * data_sp.y_scaled)/(_max - _min)).show()
Outputs
[[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
+----+---+--------+--------+
| x| y|x_scaled|y_scaled|
+----+---+--------+--------+
|-1.0| 2| 0.0| 0.0|
|-0.5| 6| 0.25| 0.25|
| 0.0| 10| 0.5| 0.5|
| 1.0| 18| 1.0| 1.0|
+----+---+--------+--------+
Inversing column: y_scaled
Xmax = 18 Xmin = 2 max = 1 min = 0
+----+---+--------+--------+-----------------+
| x| y|x_scaled|y_scaled|y_scaled_inversed|
+----+---+--------+--------+-----------------+
|-1.0| 2| 0.0| 0.0| 2.0|
|-0.5| 6| 0.25| 0.25| 6.0|
| 0.0| 10| 0.5| 0.5| 10.0|
| 1.0| 18| 1.0| 1.0| 18.0|
+----+---+--------+--------+-----------------+
I have a Spark DataFrame containing geo-information.
my_df.show(2)
## +----+----+-----------+----------+
## | x0 | x1 | longitude | latitude |
## +----+----+-----------+----------+
## | ...| ...| 51.043 | 13.6847 |
## | ...| ...| 42.6753 | 23.3218 |
I took the longitude and the latitude out of my dataframe and calculated some center points with the KMeans library from PySpark.
#Trains a k-means model
k = 120
model = KMeans.train(dataset, k)
print ("Final centers: " + str(model.clusterCenters))
The output:
Final centers: [array([ 51.04307692, 13.68474126]), array([-33.434 , -70.58366667]), array([ 42.67533333, 23.32185981]), array([ 45.876, -61.492]), array([ 53.07465714, 8.4655 ]), array([ 4.594, 114.262]), array([ 48.15665306, 11.54269728]), array([ 51.51729851, 7.49838806]), array([ 48.76316125, 9.15357859]), ....
Does anyone have an idea how to add the matching centers to my dataframe?
## +----+----+-----------+----------+-----------+----------+
## | x0 | x1 | longitude | latitude | mean_long | mean_lat |
## +----+----+-----------+----------+-----------+----------+
## | ...| ...| 51.043 | 13.6847 | 50.000 | 15.000 |
## | ...| ...| 42.6753 | 23.3218 | 50.000 | 15.000 |
If you decided to use DataFrames, you should use the new pyspark.ml API, not the legacy pyspark.mllib. It provides a number of clustering methods, including K-Means, and its transform method will attach a prediction column to the DataFrame.
Please check ML documentation for details (API and required input types):
https://spark.apache.org/docs/latest/ml-clustering.html#k-means
Hope this helps!
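As a minimal sketch against the question's my_df (the variable names here are assumptions; k = 120 as in the question):
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# assemble the coordinate columns into the vector column expected by ml KMeans
assembler = VectorAssembler(inputCols=["longitude", "latitude"], outputCol="features")
features_df = assembler.transform(my_df)

kmeans = KMeans(k=120, seed=1)
model = kmeans.fit(features_df)

# transform() attaches a "prediction" column holding each row's cluster id
clustered = model.transform(features_df)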
(Note - I have taken the sample data from the Spark documentation page.)
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
import pandas as pd
#generate data
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = sqlContext.createDataFrame(data, ["features"])
df.show()
#run kmeans clustering model
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
predictions=model.transform(df).withColumnRenamed("prediction","cluster_id")
centers = model.clusterCenters()
#preprocessing centers so that it can be joined with predictions dataframe
centers_p_df = pd.DataFrame(centers)
centers_p_df.insert(0, 'new_col', range(0, len(centers_p_df)))
centers_df = sqlContext.createDataFrame(centers_p_df, schema=['cluster_id','centers_col1','centers_col2'])
final_df = predictions.join(centers_df, on="cluster_id").drop("cluster_id")
final_df.show()
I have the following Python test code (the arguments to ALS.train are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
This works - it yields a count of 1 for the predictions variable and outputs:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
However, when I try to use an RDD I created myself with the following code, it no longer appears to work:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
Which outputs:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
As you can see, predictAll comes back empty when passed the mapped RDD. The values going in are both of the same format. The only noticeable difference I can see is that the first example uses parallelize and produces a ParallelCollectionRDD, whereas the second example just uses map, which produces a PythonRDD. Does predictAll only work if passed a certain type of RDD? If so, is it possible to convert between RDD types? I'm not sure how to get this working.
There are two basic conditions under which MatrixFactorizationModel.predictAll may return an RDD with fewer items than the input:
user is missing in the training set.
product is missing in the training set.
You can easily reproduce this behavior and check that it does not depend on how the RDD was created. First, let's use example data to build a model:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
def parse(s):
x, y, z = s.split(",")
return Rating(int(x), int(y), float(z))
ratings = (sc.textFile("data/mllib/als/test.data")
.map(parse)
.union(sc.parallelize([Rating(1, 5, 4.0)])))
model = ALS.train(ratings, 10, 10)
Next, let's see which products and users are present in the training data:
set(ratings.map(lambda r: r.product).collect())
## {1, 2, 3, 4, 5}
set(ratings.map(lambda r: r.user).collect())
## {1, 2, 3, 4}
Now let's create test data and check the predictions:
valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
valid_test
## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423
model.predictAll(valid_test).count()
## 3
So far so good. Next, let's map it using the same logic as in your code:
valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
valid_test_
## PythonRDD[497] at RDD at PythonRDD.scala:43
model.predictAll(valid_test_).count()
## 3
Still fine. Next, let's create invalid data and repeat the experiment:
invalid_test = sc.parallelize([
(2, 6), # No product in the training data
(6, 1) # No user in the training data
])
invalid_test
## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423
model.predictAll(invalid_test).count()
## 0
invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
model.predictAll(invalid_test_).count()
## 0
As expected, there are no predictions for the invalid input.
Finally, you can confirm this is really the case by using an ML model, whose training / prediction is completely independent from the Python code:
from pyspark.ml.recommendation import ALS as MLALS
model_ml = MLALS(rank=10, maxIter=10).fit(
ratings.toDF(["user", "item", "rating"])
)
model_ml.transform((valid_test + invalid_test).toDF(["user", "item"])).show()
## +----+----+----------+
## |user|item|prediction|
## +----+----+----------+
## | 6| 1| NaN|
## | 1| 4| 1.0184212|
## | 2| 5| 4.0041084|
## | 3| 5|0.40498763|
## | 2| 6| NaN|
## +----+----+----------+
As you can see, a user / item missing from the training data means no prediction.
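If you need predictions only for pairs the model can actually score, one workaround (a sketch, assuming the distinct user and product ids are small enough to collect on the driver) is to filter the input first:
# keep only (user, product) pairs seen during training
known_users = set(ratings.map(lambda r: r.user).collect())
known_products = set(ratings.map(lambda r: r.product).collect())

all_test = valid_test + invalid_test  # RDD union
filtered = all_test.filter(lambda up: up[0] in known_users and up[1] in known_products)
model.predictAll(filtered).count()
## 3 - only the pairs from valid_test survive the filter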