Appending column name to column value using Spark - apache-spark

I have data in comma separated file, I have loaded it in the spark data frame:
The data looks like:
A B C
1 2 3
4 5 6
7 8 9
I want to transform the above data frame in spark using pyspark as:
A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------
Then convert it to list of list using pyspark as:
[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]
And then run FP Growth algorithm using pyspark on the above data set.
The code that I have tried is below:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")
names=df.schema.names
Then I thought of doing something inside for loop:
for name in names:
-----
------
After this I will be using fpgrowth:
df = spark.createDataFrame([
(0, [ A_1 , B_2 , C_3]),
(1, [A_4 , B_5 , C_6]),)], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

A number of concepts here for those who use Scala normally showing how to do with pyspark. Somewhat different but learnsome for sure, although to how many is the big question. I certainly learnt a point on pyspark with zipWithIndex myself. Anyway.
First part is to get stuff into desired format, probably too may imports but leaving as is:
from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as f
source_df = spark.createDataFrame(
[
(1, 11, 111),
(2, 22, 222)
],
["colA", "colB", "colC"]
)
intermediate_df = (reduce(
lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
source_df.columns,
source_df
) )
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))
result_df = result_df.select(split(col("CONCAT_COLS"), ",\s*").alias("ARRAY_COLS"))
# Add 0,1,2,3, ... with zipWithIndex, we add it at back, but that does not matter, you can move it around.
# Get new Structure, the fields (one in this case but done flexibly, plus zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])
# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
)
final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)
returns:
+---------------------------+-----+
|ARRAY_COLS |index|
+---------------------------+-----+
|[colA_1, colB_11, colC_111]|0 |
|[colA_2, colB_22, colC_222]|1 |
+---------------------------+-----+
Second part is the old zipWithIndex with pyspark if you need 0,1,.. Painful compared to Scala.
In general easier to solve in Scala.
Not sure on performance, not a foldLeft, interesting. I think it is OK actually.

Related

Apply StopWordsRemover and RegexTokenizer to multiple columns in spark 2.4.3

I've the following dataframe, df4
|Itemno |fits_assembly_id |fits_assembly_name |assembly_name
|0450056 |13039 135502 141114 4147 138865 2021 9164 |OIL PUMP ASSEMBLY A01EA09CA 4999202399920239A06 A02EA09CA A02EA09CB A02EA09CC |OIL PUMP ASSEMBLY 999202399920239A06
and I am using following code to process/clean the above-mentioned data frame
from pyspark.ml.feature import StopWordsRemover, RegexTokenizer
from pyspark.sql.functions import expr
# Task-1: Regex Tokenizer
tk = RegexTokenizer(pattern=r'(?:\p{Punct}|\s)+', inputCol='fits_assembly_name', outputCol='temp1')
df5 = tk.transform(df4)
#Task-2: StopWordsRemover
sw = StopWordsRemover(inputCol='temp1', outputCol='temp2')
df6 = sw.transform(df5)
# #Task-3: Remove duplicates
df7 = df6.withColumn('fits_assembly_name', expr('concat_ws(" ", array_distinct(temp2))')) \
.drop('temp1', 'temp2')
I want to process both columns fits_assembly_name and assembly_name in RegexTokenizer & StopWordsRemover in one go. Could you please share how it can be achieved?
You can use a list comprehension to handle multiple columns, use pyspark.ml.Pipeline to skip the intermediate dataframes, see below:
from pyspark.ml.feature import StopWordsRemover, RegexTokenizer
from pyspark.ml import Pipeline
from pyspark.sql.functions import expr
# df4 is the initial dataframe and new result will overwrite it.
for col in ['fits_assembly_name', 'assembly_name']:
tk = RegexTokenizer(pattern=r'(?:\p{Punct}|\s)+', inputCol=col, outputCol='temp1')
sw = StopWordsRemover(inputCol='temp1', outputCol='temp2')
pipeline = Pipeline(stages=[tk, sw])
df4 = pipeline.fit(df4).transform(df4) \
.withColumn(col, expr('concat_ws(" ", array_distinct(temp2))')) \
.drop('temp1', 'temp2')

how to make an operation in parallel using spark

I have the following data.frame in spark
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
from pyspark.sql import functions as sf
from pyspark.sql.functions import col, when, lit
ddf = spark.createDataFrame([[None, 'Michael',2],
[30, 'Andy',3],
[19, 'Justin',4],
[30, 'James Dr No From Russia with Love Bond',6]],
schema=['age', 'name','weights'])
ddf.show()
In this trivial example I would like to create two columns: One with the weighted.mean of the age if age>29 (with name weighted_age) and the other the age^2 if age<=29 (with the name age_squared)
In order to do so I can do this
from pyspark.sql import functions as f
weightedMean = ddf.filter(f.col('age')>29).select(f.sum(f.col('age')*f.col('weights'))/f.sum(f.col('weights'))).first()[0]
ddf.withColumn('weighted_age', f.when(f.col('age') > 29, weightedMean))\
.withColumn('age_squared', f.when(f.col('age') <= 29, f.col('age')*f.col('age')))\
.show(truncate=False)
My question is, is there any way to do this operation in parallel for the two if conditions ( so, two columns are created. One is created under the condition age >29 (first if condition), and the other is created under the condition age <= 29 (second if condition) )

Adding a Vectors Column to a pyspark DataFrame

How do I add a Vectors.dense column to a pyspark dataframe?
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector
py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})
sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))
Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py, line 1848:
AssertionError: col should be Column
It doesn't like the DenseVector type as a column. Essentially, I have a pandas dataframe that I'd like to transform to a pyspark dataframe and add a column of the type Vectors.dense. Is there another way of doing this?
Constant Vectors cannot be added as literal. You have to use udf:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT
one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()
But I am not sure why you need that at all. If you want to transform existing columns into Vectors use appropriate pyspark.ml tools, like VectorAssembler - Encode and assemble multiple features in PySpark
from pyspark.ml.feature import VectorAssembler
VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)

Pyspark: applying kmeans on different groups of a dataframe

Using Pyspark I would like to apply kmeans separately on groups of a dataframe and not to the whole dataframe at once. For the moment I use a for loop which iterates on each group, applies kmeans and appends the result to another table. But having a lot of groups makes it time consuming. Anyone could help me please??
Thanks a lot!
for customer in customer_list:
temp_df = togroup.filter(col("customer_id")==customer)
df = assembler.transform(temp_df)
k = 1
while (k < 5 & mtrc < width):
k += 1
kmeans = KMeans(k=k,seed=5,maxIter=20,initSteps=5)
model = kmeans.fit(df)
mtric = 1 - model.computeCost(df)/ttvar
a = model.transform(df)select(cols)
allcustomers = allcustomers .union(a)
I came up with a solution using pandas_udf. A pure spark or scala solution is preferred and yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_sprk = spark.createDataFrame(df_pd)
First solve the problem in pandas:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2,random_state=0)
def skmean(kmeans,x):
X = np.array(x)
kmeans.fit(X)
return(kmeans.predict(X))
You can apply skmean() to a panda data frame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to pyspark data frame, we use pandas_udf. But first define a schema for the output data frame:
from pyspark.sql.types import *
schema = StructType(
[StructField('cat',StringType(),True),
StructField('clusters',ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
#pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
result = pd.DataFrame(
df.groupby('cat').apply(lambda x: skmean(kmeans,x))
result.reset_index(inplace=True, drop=False)
return(result)
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution which is I think is slightly better than the last one. The idea is to use groupby() together withcollect_list() and write a udf that takes a list as input and generates the clusters. Continuing with df_spark in the other solution we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
kmeans = KMeans(n_clusters=2, random_state=0)
X = np.array(x).reshape(-1,1)
kmeans.fit(X)
clusters = kmeans.predict(X).tolist()
return(clusters)
clustering_udf = F.udf(lambda arr : skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val')))
Then you can use F.explode() to convert the list to a column.

How to access element of a VectorUDT column in a Spark DataFrame?

I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict(for numpy.dtype) error. Same error if I do first_elem_udf = first_elem_udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
try:
return float(v[i])
except ValueError:
return None
ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors) you should use item method:
v.values.item(0)
which return standard Python scalars. Similarly if you want to access all values as a dense structure:
v.toArray().tolist()
If you prefer using spark.sql, you can use the follow custom function 'to_array' to convert the vector to array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType
def to_array_(v):
return v.toArray().tolist()
from pyspark.sql import SQLContext
sqlContext=SQLContext(spark.sparkContext, sparkSession=spark, jsqlContext=None)
sqlContext.udf.register("to_array",to_array_, ArrayType(DoubleType()))
example
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
output
id features Second
0 1 [1.0, 2.0, 3.0] 2.0
1 2 (0.0, 9.0, 0.0) 9.0
I ran into the same problem with not being able to use explode(). One thing you can do is use VectorSlice from the pyspark.ml.feature library. Like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
For anyone trying to split the probability columns generated after training a PySpark ML model into usable columns. This does not use UDF or numpy. And this will only work for binary classification. Here lr_pred is the dataframe which has the predictions from the Logistic Regression Model.
prob_df1=lr_pred.withColumn("probability",lr_pred["probability"].cast("String"))
prob_df =prob_df1.withColumn('probabilityre',split(regexp_replace("probability", "^\[|\]", ""), ",")[1].cast(DoubleType()))
Since Spark 3.0.0 this can be done without using UDF.
from pyspark.ml.functions import vector_to_array
https://discuss.dizzycoding.com/how-to-split-vector-into-columns-using-pyspark/
Why is Vector[Double] is used in the results? That's not a very nice data type.

Resources