I have a Spark DataFrame containing geo-information.
my_df.show(2)
## +----+----+-----------+----------+
## | x0 | x1 | longitude | latitude |
## +----+----+-----------+----------+
## | ...| ...| 51.043 | 13.6847 |
## | ...| ...| 42.6753 | 23.3218 |
## +----+----+-----------+----------+
I took the longitude and the latitude out of my dataframe and calculated some center points with KMeans from pyspark.mllib.
#Trains a k-means model
k = 120
model = KMeans.train(dataset, k)
print ("Final centers: " + str(model.clusterCenters))
The output:
Final centers: [array([ 51.04307692, 13.68474126]), array([-33.434 , -70.58366667]), array([ 42.67533333, 23.32185981]), array([ 45.876, -61.492]), array([ 53.07465714, 8.4655 ]), array([ 4.594, 114.262]), array([ 48.15665306, 11.54269728]), array([ 51.51729851, 7.49838806]), array([ 48.76316125, 9.15357859]), ....
Does anyone have an idea how to add the matching centers to my dataframe?
## +----+----+-----------+----------+-----------+----------+
## | x0 | x1 | longitude | latitude | mean_long | mean_lat |
## +----+----+-----------+----------+-----------+----------+
## | ...| ...| 51.043 | 13.6847 | 50.000 | 15.000 |
## | ...| ...| 42.6753 | 23.3218 | 50.000 | 15.000 |
## +----+----+-----------+----------+-----------+----------+
If you have decided to use DataFrames, you should use the new pyspark.ml API, not the legacy pyspark.mllib. It provides a number of clustering methods, including K-Means, and its transform method will attach a prediction column to the DataFrame.
Please check the ML documentation for details (API and required input types):
https://spark.apache.org/docs/latest/ml-clustering.html#k-means
Hope this helps!
(Note: I have taken the sample data from the Spark documentation page.)
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
import pandas as pd
#generate data
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = sqlContext.createDataFrame(data, ["features"])
df.show()
#run kmeans clustering model
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
predictions = model.transform(df).withColumnRenamed("prediction", "cluster_id")
centers = model.clusterCenters()
# preprocess the centers so they can be joined with the predictions DataFrame
centers_p_df = pd.DataFrame(centers)
centers_p_df.insert(0, 'new_col', range(0, len(centers_p_df)))
centers_df = sqlContext.createDataFrame(centers_p_df, schema=['cluster_id','centers_col1','centers_col2'])
final_df = predictions.join(centers_df, on="cluster_id").drop("cluster_id")
final_df.show()
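Incidentally, the pandas detour is optional. Here is a minimal pandas-free sketch (same variable names as above, assuming 2-dimensional centers as in this example) that builds the centers DataFrame directly in Spark:
# build the (cluster_id, center coordinates) rows directly from the centers list
centers_df = sqlContext.createDataFrame(
    [(i, float(c[0]), float(c[1])) for i, c in enumerate(centers)],
    ["cluster_id", "centers_col1", "centers_col2"])
final_df = predictions.join(centers_df, on="cluster_id").drop("cluster_id")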
How will a method in Spark treat a VectorAssembler column? For example, if I have longitude and latitude columns, is it better to assemble them using a VectorAssembler and then feed the result into my model, or does it make no difference if I pass them in directly (separately)?
Example 1:
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[loc_assembler, vector_assembler, lr])
Example 2:
vector_assembler = VectorAssembler(inputCols=['long', 'lat', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[vector_assembler, lr])
What is the difference? Which one is better?
There will not be any difference, simply because in both your examples the final form of the features column will be the same: in your 1st example, the loc vector will be broken back into its individual components.
Here is a short demonstration with dummy data (leaving the linear regression part aside, as it is unnecessary for this discussion):
spark.version
# u'2.3.1'
# dummy data:
df = spark.createDataFrame([[0, 33.3, -17.5, 10., 0.2],
[1, 40.4, -20.5, 12., 2.2],
[2, 28., -23.9, -2., -1.7],
[3, 29.5, -19.0, -0.5, -0.2],
[4, 32.8, -18.84, 1.5, 1.8]
],
["id","lat", "long", "other", "label"])
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'other'], outputCol='features')
pipeline = Pipeline(stages=[loc_assembler, vector_assembler])
model = pipeline.fit(df)
model.transform(df).show()
The result is:
+---+----+------+-----+-----+-------------+-----------------+
| id| lat| long|other|label| loc| features|
+---+----+------+-----+-----+-------------+-----------------+
| 0|33.3| -17.5| 10.0| 0.2| [-17.5,33.3]|[-17.5,33.3,10.0]|
| 1|40.4| -20.5| 12.0| 2.2| [-20.5,40.4]|[-20.5,40.4,12.0]|
| 2|28.0| -23.9| -2.0| -1.7| [-23.9,28.0]|[-23.9,28.0,-2.0]|
| 3|29.5| -19.0| -0.5| -0.2| [-19.0,29.5]|[-19.0,29.5,-0.5]|
| 4|32.8|-18.84| 1.5| 1.8|[-18.84,32.8]|[-18.84,32.8,1.5]|
+---+----+------+-----+-----+-------------+-----------------+
i.e. the features column is identical to the one you would get with your 2nd example (not shown here), where you do not use the intermediate assembled feature loc...
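To double-check, here is a quick sketch of the 2nd variant on the same dummy data; the resulting features column comes out the same:
vector_assembler_2 = VectorAssembler(inputCols=['long', 'lat', 'other'], outputCol='features')
vector_assembler_2.transform(df).select('id', 'features').show()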
I have a Spark DataFrame with one column that contains lots of zeros and very few ones (only 0.01% ones).
I'd like to take a random subsample, but a stratified one, so that it keeps the ratio of 1s to 0s in that column.
Is this possible in pyspark?
I am looking for a non-Scala solution, one based on DataFrames rather than RDDs.
The solution I suggested in Stratified sampling in Spark is pretty straightforward to convert from Scala to Python (or even to Java - What's the easiest way to stratify a Spark Dataset?).
Nevertheless, I'll rewrite it in Python. Let's start by creating a toy DataFrame:
from pyspark.sql.functions import lit
data = [(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),
        (2147481832,541023347,1),(2147481832,1682206630,1),(2147481832,1138211459,1),
        (2147481832,852202566,1),(2147481832,201375938,1),(2147481832,486538879,1),
        (2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)]
df = spark.createDataFrame(data, ["x1","x2","x3"])
df.show()
# +----------+----------+---+
# | x1| x2| x3|
# +----------+----------+---+
# |2147481832| 23355149| 1|
# |2147481832| 973010692| 1|
# |2147481832|2134870842| 1|
# |2147481832| 541023347| 1|
# |2147481832|1682206630| 1|
# |2147481832|1138211459| 1|
# |2147481832| 852202566| 1|
# |2147481832| 201375938| 1|
# |2147481832| 486538879| 1|
# |2147481832| 919187908| 1|
# | 214748183| 919187908| 1|
# | 214748183| 91187908| 1|
# +----------+----------+---+
This DataFrame has 12 elements, as you can see:
df.count()
# 12
They are distributed as follows:
df.groupBy("x1").count().show()
# +----------+-----+
# | x1|count|
# +----------+-----+
# |2147481832| 10|
# | 214748183| 2|
# +----------+-----+
Now let's sample. First, we'll set the seed:
seed = 12
Then find the keys to fraction on and sample:
fractions = df.select("x1").distinct().withColumn("fraction", lit(0.8)).rdd.collectAsMap()
print(fractions)
# {2147481832: 0.8, 214748183: 0.8}
sampled_df = df.stat.sampleBy("x1", fractions, seed)
sampled_df.show()
# +----------+---------+---+
# | x1| x2| x3|
# +----------+---------+---+
# |2147481832| 23355149| 1|
# |2147481832|973010692| 1|
# |2147481832|541023347| 1|
# |2147481832|852202566| 1|
# |2147481832|201375938| 1|
# |2147481832|486538879| 1|
# |2147481832|919187908| 1|
# | 214748183|919187908| 1|
# | 214748183| 91187908| 1|
# +----------+---------+---+
We can now check the content of our sample (note that sampleBy performs per-key Bernoulli sampling, so the resulting counts are only approximate - here 7 out of 10 rather than exactly 8):
sampled_df.count()
# 9
sampled_df.groupBy("x1").count().show()
# +----------+-----+
# | x1|count|
# +----------+-----+
# |2147481832| 7|
# | 214748183| 2|
# +----------+-----+
Assume you have the Titanic dataset in a DataFrame named data, which you want to split into train and test sets using stratified sampling based on the 'Survived' target variable.
# Check initial distributions of 0's and 1's
data.groupBy("Survived").count().show()
+--------+-----+
|Survived|count|
+--------+-----+
| 1| 342|
| 0| 549|
+--------+-----+
# Taking 70% of both 0's and 1's into training set
train = data.sampleBy("Survived", fractions={0: 0.7, 1: 0.7}, seed=10)
# Subtracting 'train' from original 'data' to get test set
test = data.subtract(train)
# Checking distributions of 0's and 1's in train and test sets after the sampling
train.groupBy("Survived").count().show()
+--------+-----+
|Survived|count|
+--------+-----+
| 1| 239|
| 0| 399|
+--------+-----+
test.groupBy("Survived").count().show()
+--------+-----+
|Survived|count|
+--------+-----+
| 1| 103|
| 0| 150|
+--------+-----+
This can be accomplished pretty easily with 'randomSplit' and 'union' in PySpark.
# read in data
df = spark.read.csv(file, header=True)
# split dataframes between 0s and 1s
zeros = df.filter(df["Target"]==0)
ones = df.filter(df["Target"]==1)
# split datasets into training and testing
train0, test0 = zeros.randomSplit([0.8,0.2], seed=1234)
train1, test1 = ones.randomSplit([0.8,0.2], seed=1234)
# stack datasets back together
train = train0.union(train1)
test = test0.union(test1)
This is based on the accepted answer of @eliasah and this SO thread.
If you want to get back a train and a test set, you can use the following function:
from pyspark.sql import functions as F
def stratified_split_train_test(df, frac, label, join_on, seed=42):
    """Stratified split of a dataframe into a train and a test set.

    Inspiration taken from:
    https://stackoverflow.com/a/47672336/1771155
    https://stackoverflow.com/a/39889263/1771155
    """
    fractions = df.select(label).distinct().withColumn("fraction", F.lit(frac)).rdd.collectAsMap()
    df_frac = df.stat.sampleBy(label, fractions, seed)
    df_remaining = df.join(df_frac, on=join_on, how="left_anti")
    return df_frac, df_remaining
To create a stratified train and test set where 80% of the total is used for the training set:
df_train, df_test = stratified_split_train_test(df=df, frac=0.8, label="y", join_on="unique_id")
You can use the function below; it combines the other answers.
import pyspark.sql.functions as f
from pyspark.sql import DataFrame as SparkDataFrame
def train_test_split_pyspark(
    df: SparkDataFrame,
    stratify_column: str,
    unique_col: str = None,
    train_fraction: float = 0.05,
    validation_fraction: float = 0.005,
    test_fraction: float = 0.005,
    seed: int = 1234,
    to_pandas: bool = True,
):
    if not unique_col:
        unique_col = "any_unique_name_here"
        df = df.withColumn(unique_col, f.monotonically_increasing_id())
    # Train data
    train_fraction_dict = (
        df.select(stratify_column)
        .distinct()
        .withColumn("fraction", f.lit(train_fraction))
        .rdd.collectAsMap()
    )
    df_train = df.stat.sampleBy(stratify_column, train_fraction_dict, seed)
    df_remaining = df.join(df_train, on=unique_col, how="left_anti")
    # Validation data
    validation_fraction_dict = {
        key: validation_fraction for key in train_fraction_dict
    }
    df_val = df_remaining.stat.sampleBy(stratify_column, validation_fraction_dict, seed)
    df_remaining = df_remaining.join(df_val, on=unique_col, how="left_anti")
    # Test data
    test_fraction_dict = {
        key: test_fraction for key in train_fraction_dict
    }
    df_test = df_remaining.stat.sampleBy(stratify_column, test_fraction_dict, seed)
    # Drop the helper column if we created it ourselves
    if unique_col == "any_unique_name_here":
        df_train = df_train.drop(unique_col)
        df_val = df_val.drop(unique_col)
        df_test = df_test.drop(unique_col)
    if to_pandas:
        return (df_train.toPandas(), df_val.toPandas(), df_test.toPandas())
    return df_train, df_val, df_test
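A hypothetical call, stratifying on a binary column named "label" and keeping the results as Spark DataFrames:
df_train, df_val, df_test = train_test_split_pyspark(
    df, stratify_column="label", to_pandas=False)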
To avoid rows appearing in both the train and test splits (or disappearing altogether), I would add a localCheckpoint to Vincent Claes's solution:
import pyspark.sql.functions as f
from pyspark.sql import DataFrame

def stratifiedSampler(sparkDf: DataFrame, ratio: float,
                      label: str, joinOn: str, seed=42):
    fractions = (sparkDf.select(label).distinct()
                 .withColumn("fraction", f.lit(ratio))
                 .rdd.collectAsMap())
    fracDf = sparkDf.stat.sampleBy(label, fractions, seed)
    # checkpoint the sample so its rows cannot change between evaluations
    fracDf = fracDf.localCheckpoint()
    remainingDf = sparkDf.join(fracDf, on=joinOn, how="left_anti")
    return (fracDf, remainingDf)
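A usage sketch (assuming the frame has a unique "unique_id" column and a label column "y"):
trainDf, testDf = stratifiedSampler(df, ratio=0.8, label="y", joinOn="unique_id")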
When training a model, say a linear regression, we may apply a normalization such as MinMaxScaler to the train and test datasets.
After we have a trained model and use it to make predictions, we need to scale the predictions back to the original representation.
In Python (scikit-learn), there is an inverse_transform method for this. For example:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)
print(dataScaled)
scaler.inverse_transform(dataScaled)
Is there a similar method in Spark?
I have googled a lot but found no answer. Can anyone give me some suggestions?
Thank you very much!
In our company, in order to solve the same problem on the StandardScaler, we extended spark.ml with this (among other things):
package org.apache.spark.ml

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.util.Identifiable

package object feature {

  implicit class RichStandardScalerModel(model: StandardScalerModel) {

    private def invertedStdDev(sigma: Double): Double = 1 / sigma

    private def invertedMean(mu: Double, sigma: Double): Double = -mu / sigma

    def inverse(newOutputCol: String): StandardScalerModel = {
      val sigma: linalg.Vector = model.std
      val mu: linalg.Vector = model.mean
      val newSigma: linalg.Vector = new DenseVector(sigma.toArray.map(invertedStdDev))
      val newMu: linalg.Vector = new DenseVector(mu.toArray.zip(sigma.toArray).map { case (m, s) => invertedMean(m, s) })
      val inverted: StandardScalerModel = new StandardScalerModel(Identifiable.randomUID("stdScal"), newSigma, newMu)
        .setInputCol(model.getOutputCol)
        .setOutputCol(newOutputCol)
      inverted
        .set(inverted.withMean, model.getWithMean)
        .set(inverted.withStd, model.getWithStd)
    }
  }
}
It should be fairly easy to modify it or do something similar for your specific case.
Keep in mind that due to JVM's double implementation, you normally lose precision in these operations, so you will not recover the exact original values you had before the transformation (e.g.: you will probably get something like 1.9999999999999998 instead of 2.0).
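If you need the same thing from PySpark without extending spark.ml, the underlying algebra is simple enough to apply directly. A minimal sketch, assuming a fitted pyspark.ml StandardScalerModel named scaler_model (fit with both withMean and withStd) and a DataFrame df with the scaled vectors in a "scaled" column:
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# x_original = x_scaled * sigma + mu, applied element-wise per feature
mu = scaler_model.mean.toArray()
sigma = scaler_model.std.toArray()
invert_udf = udf(lambda v: Vectors.dense(v.toArray() * sigma + mu), VectorUDT())
df_unscaled = df.withColumn("unscaled", invert_udf("scaled"))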
No direct solution here.
Since passing an array to a UDF can only be done when the array is a column (lit(array) won't do the trick), I am using the following workaround.
In a nutshell, it turns the scales array into a string, passes it to the UDF, and does the math.
You can then use that scales string in an inverse function (also attached here) to get the inverted values.
Code:
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import col, lit, udf

df = spark.createDataFrame([
    (0, 1, 0.5, -1),
    (1, 2, 1.0, 1),
    (2, 4, 10.0, 2)
], ["id", "x1", "x2", "x3"])
df.show()

def Normalize(df):
    # extract the per-column mean and stddev from describe()
    scales = df.describe()
    scales = scales.filter("summary = 'mean' or summary = 'stddev'")
    scales = scales.select(["summary"] + [col(c).cast("double") for c in scales.columns[1:]])
    assembler = VectorAssembler(inputCols=scales.columns[1:], outputCol="X_scales")
    df_scales = assembler.transform(scales)
    x_mean = df_scales.filter("summary = 'mean'").select("X_scales")
    x_std = df_scales.filter("summary = 'stddev'").select("X_scales")
    # serialize the scales as pipe-delimited strings so they can be passed to the UDF
    ks_std_lit = lit('|'.join([str(s) for s in list(x_std.collect()[0].X_scales)]))
    ks_mean_lit = lit('|'.join([str(s) for s in list(x_mean.collect()[0].X_scales)]))
    assembler = VectorAssembler(inputCols=df.columns[0:4], outputCol="features")
    df_features = assembler.transform(df)
    df_features = df_features.withColumn("Scaled", exec_norm_udf(df_features.features, ks_mean_lit, ks_std_lit))
    return df_features, ks_mean_lit, ks_std_lit

def exec_norm(vector, x_mean, x_std):
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) - np.array(x_mean)) / np.array(x_std)
    return Vectors.dense(list(res))

exec_norm_udf = udf(exec_norm, VectorUDT())

def scaler_invert(vector, x_mean, x_std):
    x_mean = [float(s) for s in x_mean.split('|')]
    x_std = [float(s) for s in x_std.split('|')]
    res = (np.array(vector) * np.array(x_std)) + np.array(x_mean)
    return Vectors.dense(list(res))

scaler_invert_udf = udf(scaler_invert, VectorUDT())

df, scaler_mean, scaler_std = Normalize(df)
df.withColumn("inverted", scaler_invert_udf(df.Scaled, scaler_mean, scaler_std)).show(truncate=False)
Maybe I'm too late to the party; however, I recently faced exactly the same problem and couldn't find any viable solution.
Presuming that the author of this question doesn't have to inverse MinMax values of vectors, and instead needs to inverse only one column:
the min and max values of that column, as well as the min and max parameters of the scaler, are all known.
Maths behind MinMaxScaler as per the scikit-learn website:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
"Reverse-engineered" MinMaxScaler formula
X_scaled = (X - Xmin) / (Xmax - Xmin) * (max - min) + min
X = (max * Xmin - min * Xmax - Xmin * X_scaled + Xmax * X_scaled)/(max - min)
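As a quick sanity check with the toy data below: for X_scaled = 0.25, Xmin = 2, Xmax = 18, max = 1 and min = 0, this gives X = (1*2 - 0*18 - 2*0.25 + 18*0.25) / (1 - 0) = 2 - 0.5 + 4.5 = 6, which matches y = 6 in the second row.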
Implementation
from sklearn.preprocessing import MinMaxScaler
import pandas
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
print(data)
dataScaled = scaler.fit(data).transform(data)
data_sp = spark.createDataFrame(
    pandas.DataFrame(data, columns=["x", "y"])
          .join(pandas.DataFrame(dataScaled, columns=["x_scaled", "y_scaled"])))
data_sp.show()
print("Inversing column: y_scaled")
Xmax = data_sp.select("y").rdd.max()[0]
Xmin = data_sp.select("y").rdd.min()[0]
_max = scaler.feature_range[1]
_min = scaler.feature_range[0]
print("Xmax =", Xmax, "Xmin =", Xmin, "max =", _max, "min =", _min)
data_sp.withColumn(
    "y_scaled_inversed",
    (_max * Xmin - _min * Xmax - Xmin * data_sp.y_scaled + Xmax * data_sp.y_scaled) / (_max - _min)
).show()
Outputs
[[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
+----+---+--------+--------+
| x| y|x_scaled|y_scaled|
+----+---+--------+--------+
|-1.0| 2| 0.0| 0.0|
|-0.5| 6| 0.25| 0.25|
| 0.0| 10| 0.5| 0.5|
| 1.0| 18| 1.0| 1.0|
+----+---+--------+--------+
Inversing column: y_scaled
Xmax = 18 Xmin = 2 max = 1 min = 0
+----+---+--------+--------+-----------------+
| x| y|x_scaled|y_scaled|y_scaled_inversed|
+----+---+--------+--------+-----------------+
|-1.0| 2| 0.0| 0.0| 2.0|
|-0.5| 6| 0.25| 0.25| 6.0|
| 0.0| 10| 0.5| 0.5| 10.0|
| 1.0| 18| 1.0| 1.0| 18.0|
+----+---+--------+--------+-----------------+
I'm getting predictions through spark.ml.classification.LogisticRegressionModel.predict. A number of the rows have the prediction column as 1.0 and the probability column as 0.04. model.getThreshold is 0.5, so I'd assume the model classifies everything over a 0.5 probability threshold as 1.0.
How am I supposed to interpret a result with a 1.0 prediction and a probability of 0.04?
The probability column produced by a LogisticRegression contains a vector with the same length as the number of classes, where each index gives the corresponding probability for that class. I made a small example with two classes for illustration:
case class Person(label: Double, age: Double, height: Double, weight: Double)
val df = List(Person(0.0, 15, 175, 67),
Person(0.0, 30, 190, 100),
Person(1.0, 40, 155, 57),
Person(1.0, 50, 160, 56),
Person(0.0, 15, 170, 56),
Person(1.0, 80, 180, 88)).toDF()
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "height", "weight"))
  .setOutputCol("features")
val df2 = assembler.transform(df).select("label", "features")
df2.show
+-----+------------------+
|label| features|
+-----+------------------+
| 0.0| [15.0,175.0,67.0]|
| 0.0|[30.0,190.0,100.0]|
| 1.0| [40.0,155.0,57.0]|
| 1.0| [50.0,160.0,56.0]|
| 0.0| [15.0,170.0,56.0]|
| 1.0| [80.0,180.0,88.0]|
+-----+------------------+
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val Array(training, testing) = df2.randomSplit(Array(0.7, 0.3))
val model = lr.fit(training)
val predictions = model.transform(testing)
predictions.select("probability", "prediction").show(false)
+----------------------------------------+----------+
|probability |prediction|
+----------------------------------------+----------+
|[0.7487950501224138,0.2512049498775863] |0.0 |
|[0.6458452667523259,0.35415473324767416]|0.0 |
|[0.3888393314864866,0.6111606685135134] |1.0 |
+----------------------------------------+----------+
Here are the probabilities as well as the final prediction made by the algorithm: the class with the highest probability is the one predicted. In your case, the 0.04 is most likely the first element of the probability vector, i.e. the probability of class 0; the probability of class 1 would then be 0.96, which is entirely consistent with a 1.0 prediction.
I'm trying to apply a score to a Spark DataFrame using PySpark. Let's assume that I built a simple regression model outside of Spark and want to map the coefficient values created in the model to the individual columns in the DataFrame, creating a new column that is the sum of the different source columns, each multiplied by its individual coefficient. I understand that there are many utilities in Spark mllib for modeling, but I want to understand how this 'brute force' method could be accomplished. I also know that DataFrames/RDDs are immutable, so a new DataFrame would have to be created.
Here's some pseudo-code for reference:
#load example data
df = sqlContext.createDataFrame(data)
df.show(5)
df.select("age", "parch", "pclass").show(5)
+----+-----+------+
| age|parch|pclass|
+----+-----+------+
|22.0| 0| 3|
|38.0| 0| 1|
|26.0| 0| 3|
|35.0| 0| 1|
|35.0| 0| 3|
+----+-----+------+
only showing top 5 rows
The model created outside of Spark is a logistic regression model based on a binary response. So essentially I want to map the logit function to these three columns to produce a fourth scored column. Here are the coefficients from the model:
intercept: 3.435222
age: -0.039841
parch: 0.176439
pclass: -1.239452
Here is a description of the logit function for reference:
https://en.wikipedia.org/wiki/Logistic_regression
For comparison, here is how I would do the same thing in R using tidyr and dplyr:
library(dplyr)
library(tidyr)
#Example data
Age <- c(22, 38, 26, 35, 35)
Parch <- c(0,0,0,0,0)
Pclass <- c(3, 1, 3, 1, 3)
#Wrapped in a dataframe
mydf <- data.frame(Age, Parch, Pclass)
#Using dplyr to create a new dataframe with mutated column
scoredf = mydf %>%
  mutate(score = round(1 / (1 + exp(-(3.435 + -0.040 * Age + 0.176 * Parch + -1.239 * Pclass))), 2))
scoredf
If I interpret your question correctly, you want to compute the class conditional probability of each sample given the coefficients you computed offline and do it "manually".
Does something like this work?
import math
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

def myLogisticFunc(age, parch, pclass):
    # coefficients from the offline model
    intercept = 3.435222
    betaAge = -0.039841
    betaParch = 0.176439
    betaPclass = -1.239452
    z = intercept + betaAge * age + betaParch * parch + betaPclass * pclass
    # logistic (inverse logit) function
    return 1.0 / (1.0 + math.exp(-z))

# declare the return type so the new column is numeric rather than string
myLogisticFuncUDF = udf(myLogisticFunc, DoubleType())

df.withColumn("score", myLogisticFuncUDF(col("age"), col("parch"), col("pclass"))).show()