How to make an operation run in parallel using Spark - apache-spark

I have the following data.frame in spark
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
from pyspark.sql import functions as sf
from pyspark.sql.functions import col, when, lit
ddf = spark.createDataFrame([[None, 'Michael', 2],
                             [30, 'Andy', 3],
                             [19, 'Justin', 4],
                             [30, 'James Dr No From Russia with Love Bond', 6]],
                            schema=['age', 'name', 'weights'])
ddf.show()
In this trivial example I would like to create two columns: one with the weighted mean of the age if age > 29 (named weighted_age), and the other with age^2 if age <= 29 (named age_squared).
In order to do so I can do this:
from pyspark.sql import functions as f
weightedMean = ddf.filter(f.col('age')>29).select(f.sum(f.col('age')*f.col('weights'))/f.sum(f.col('weights'))).first()[0]
ddf.withColumn('weighted_age', f.when(f.col('age') > 29, weightedMean)) \
   .withColumn('age_squared', f.when(f.col('age') <= 29, f.col('age')*f.col('age'))) \
   .show(truncate=False)
My question is: is there any way to perform this operation in parallel for the two conditions, so that both columns are created at once, one under the condition age > 29 (first condition) and the other under the condition age <= 29 (second condition)?
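One possible direction, sketched here only as an illustration and not part of the original thread: keep the weighted mean as a one-row DataFrame and join it back, so both conditional columns are produced in a single lazy plan instead of needing a separate .first() action; whether the two branches actually execute in parallel is left to Spark's optimizer.
from pyspark.sql import functions as f
# Compute the weighted mean lazily as a one-row DataFrame instead of collecting it.
wm = (ddf.filter(f.col('age') > 29)
         .agg((f.sum(f.col('age') * f.col('weights')) / f.sum('weights')).alias('wm')))
# Join the single aggregate row back and derive both conditional columns in one pass.
(ddf.crossJoin(wm)
    .withColumn('weighted_age', f.when(f.col('age') > 29, f.col('wm')))
    .withColumn('age_squared', f.when(f.col('age') <= 29, f.col('age') * f.col('age')))
    .drop('wm')
    .show(truncate=False))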

Related

Add column of dense vectors over a groupby in pyspark 2.2 or 2.3

I am using PySpark 2.2.
I have an input table like this:
tag | features
1 | [1,0,0,2]
2 | [1.5,0,1,0]
2 | [0,0,1,0]
Need output like this
tag | sum(features)
1 | [1,0,0,2]
2 | [1.5,0,2,0]
Element-wise addition needs to happen.
So far I have:
df.groupBy('tag').agg(F.sum('features')).show(5,0)
But this gives me an error:
cannot resolve 'sum(`features`)' due to data type mismatch: function sum requires numeric types, not ArrayType(FloatType,true)
Any help would be appreciated.
Step 1: I have used F.collect_list to group all the lists together
Step 2: I created a UDF named sum1() that takes the collected lists as input and returns their element-wise sum
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType
df = spark.createDataFrame(
    [
        (1, [1, 0, 0, 2]),
        (2, [2, 0, 1, 0]),
        (2, [0, 0, 1, 0])
    ], ("ID", "List")
)
df2 = df.groupby("ID").agg(F.collect_list("List")).withColumnRenamed('collect_list(List)', 'List')
df2.registerTempTable("dfTbl")
def sum1(lists):
    # Element-wise sum across all the lists collected for a group
    return [float(sum(vals)) for vals in zip(*lists)]

spark.udf.register("sum12", sum1, returnType=ArrayType(DoubleType()))
spark.sql("select id,sum12(List) from dfTbl ").show()
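For reference, the same step can also be expressed through the DataFrame API, reusing the sum1 function and df2 from above (just a sketch; the temporary table and SQL registration are then not strictly needed):
sum_udf = F.udf(sum1, ArrayType(DoubleType()))
# Apply the UDF directly to the collected lists instead of going through SQL.
df2.select("ID", sum_udf("List").alias("sum(features)")).show()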

Apply StopWordsRemover and RegexTokenizer to multiple columns in spark 2.4.3

I've the following dataframe, df4
|Itemno |fits_assembly_id |fits_assembly_name |assembly_name
|0450056 |13039 135502 141114 4147 138865 2021 9164 |OIL PUMP ASSEMBLY A01EA09CA 4999202399920239A06 A02EA09CA A02EA09CB A02EA09CC |OIL PUMP ASSEMBLY 999202399920239A06
and I am using the following code to process/clean the above-mentioned data frame:
from pyspark.ml.feature import StopWordsRemover, RegexTokenizer
from pyspark.sql.functions import expr
# Task-1: Regex Tokenizer
tk = RegexTokenizer(pattern=r'(?:\p{Punct}|\s)+', inputCol='fits_assembly_name', outputCol='temp1')
df5 = tk.transform(df4)
#Task-2: StopWordsRemover
sw = StopWordsRemover(inputCol='temp1', outputCol='temp2')
df6 = sw.transform(df5)
# #Task-3: Remove duplicates
df7 = df6.withColumn('fits_assembly_name', expr('concat_ws(" ", array_distinct(temp2))')) \
.drop('temp1', 'temp2')
I want to process both columns fits_assembly_name and assembly_name in RegexTokenizer & StopWordsRemover in one go. Could you please share how it can be achieved?
You can loop over the columns (or use a list comprehension) and use pyspark.ml.Pipeline to skip the intermediate dataframes, see below:
from pyspark.ml.feature import StopWordsRemover, RegexTokenizer
from pyspark.ml import Pipeline
from pyspark.sql.functions import expr
# df4 is the initial dataframe and new result will overwrite it.
for col in ['fits_assembly_name', 'assembly_name']:
    tk = RegexTokenizer(pattern=r'(?:\p{Punct}|\s)+', inputCol=col, outputCol='temp1')
    sw = StopWordsRemover(inputCol='temp1', outputCol='temp2')
    pipeline = Pipeline(stages=[tk, sw])
    df4 = pipeline.fit(df4).transform(df4) \
        .withColumn(col, expr('concat_ws(" ", array_distinct(temp2))')) \
        .drop('temp1', 'temp2')
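A possible variant, sketched under the assumption that per-column helper names like 'tok_...' and 'clean_...' are acceptable (they are illustrative, not from the original answer): build all stages up front and fit a single Pipeline, so the data passes through one fitted pipeline:
cols = ['fits_assembly_name', 'assembly_name']
stages = []
for c in cols:
    # One tokenizer and one stop-word remover per column, each writing to its own temp column.
    stages += [RegexTokenizer(pattern=r'(?:\p{Punct}|\s)+', inputCol=c, outputCol='tok_' + c),
               StopWordsRemover(inputCol='tok_' + c, outputCol='clean_' + c)]
df4 = Pipeline(stages=stages).fit(df4).transform(df4)
for c in cols:
    df4 = df4.withColumn(c, expr('concat_ws(" ", array_distinct({}))'.format('clean_' + c))) \
             .drop('tok_' + c, 'clean_' + c)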

Appending column name to column value using Spark

I have data in a comma-separated file, which I have loaded into a Spark data frame.
The data looks like:
A B C
1 2 3
4 5 6
7 8 9
I want to transform the above data frame in spark using pyspark as:
A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------
Then convert it to list of list using pyspark as:
[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]
And then run FP Growth algorithm using pyspark on the above data set.
The code that I have tried is below:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")
names=df.schema.names
Then I thought of doing something inside a for loop:
for name in names:
-----
------
After this I will be using fpgrowth:
df = spark.createDataFrame([
    (0, ["A_1", "B_2", "C_3"]),
    (1, ["A_4", "B_5", "C_6"])], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
There are a number of concepts here for those who normally use Scala, showing how to do the same things with pyspark. It is somewhat different, and I certainly learnt a point about zipWithIndex in pyspark myself. Anyway.
The first part gets the data into the desired format; there are probably too many imports, but I am leaving them as is:
from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as f
source_df = spark.createDataFrame(
    [
        (1, 11, 111),
        (2, 22, 222)
    ],
    ["colA", "colB", "colC"]
)
intermediate_df = reduce(
    lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
    source_df.columns,
    source_df
)
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))
result_df = result_df.select(split(col("CONCAT_COLS"), ",\s*").alias("ARRAY_COLS"))
# Add 0,1,2,3, ... with zipWithIndex; we add it at the back, but that does not matter, you can move it around.
# Build the new structure: the existing fields (one in this case, but done flexibly) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])
# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
)
final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)
returns:
+---------------------------+-----+
|ARRAY_COLS |index|
+---------------------------+-----+
|[colA_1, colB_11, colC_111]|0 |
|[colA_2, colB_22, colC_222]|1 |
+---------------------------+-----+
The second part is the old zipWithIndex approach with pyspark, needed if you want 0, 1, ... indices; it is painful compared to Scala.
In general this is easier to solve in Scala.
I am not sure about performance (it is not a foldLeft), but I think it is actually OK.
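To close the loop on the original question, a minimal follow-on sketch (not part of the original answer) that feeds the array column into FPGrowth, treating the zipWithIndex value as the id:
from pyspark.ml.fpm import FPGrowth
# Rename the columns to the id/items names used in the question's FPGrowth snippet.
fp_input = final_result_df.select(f.col('index').alias('id'), f.col('ARRAY_COLS').alias('items'))
fpGrowth = FPGrowth(itemsCol='items', minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(fp_input)
model.freqItemsets.show(truncate=False)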

Adding a Vectors Column to a pyspark DataFrame

How do I add a Vectors.dense column to a pyspark dataframe?
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector
py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})
sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))
Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py, line 1848:
AssertionError: col should be Column
It doesn't like the DenseVector type as a column. Essentially, I have a pandas dataframe that I'd like to transform to a pyspark dataframe and add a column of the type Vectors.dense. Is there another way of doing this?
Constant Vectors cannot be added as a literal. You have to use a udf:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT

one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()
But I am not sure why you need that at all. If you want to transform existing columns into Vectors, use the appropriate pyspark.ml tools, like VectorAssembler (see Encode and assemble multiple features in PySpark):
from pyspark.ml.feature import VectorAssembler
VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)
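For completeness, a small usage sketch (assuming the sdf defined in the question) showing the assembled column:
assembled = VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)
# Each row now carries a DenseVector built from the numeric "time" column.
assembled.select("event", "time", "features").show()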

Pyspark freezing after an operation on aggregated Data Frame

I am using Spark 1.5.2 with Python 2.7.5.
I have this code that I run in the pyspark repl:
from pyspark.sql import SQLContext
ctx = SQLContext(sc)
df = ctx.createDataFrame([("a",1),("a",1),("a",0),("a",0),("b",1),("b",0),("b",1)],["group","conversion"])
from pyspark.sql.functions import col, count, avg
funs = [(count,"total"),(avg,"cr")]
aggregate = ["conversion"]
exprs = [f(col(c)).alias(name) for f,name in funs for c in aggregate]
df3 = df.groupBy("group").agg(*exprs).cache()
So far the code works fine and I can check df3:
>>> df3.collect()
[Row(group=u'a', total=4, cr=0.5), Row(group=u'b', total=3, cr=0.6666666666666666)]
However, when I try:
df3.agg(sum(col('cr'))).first()[0]
PySpark can't calculate that sum. However df3.rdd.reduce(lambda x,y: x[2]+y[2]) works just fine.
So, what is the issue with the first command to calculate the sum?
You should import pyspark's sum function first: from pyspark.sql.functions import sum. Otherwise Python's built-in sum is called, which just sums a sequence of numbers rather than aggregating a Column.
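A minimal sketch of the fix, importing the functions module under an alias so Python's built-in sum is not shadowed:
from pyspark.sql import functions as F

# F.sum is the Spark aggregate function, distinct from Python's built-in sum.
total_cr = df3.agg(F.sum(F.col('cr'))).first()[0]
print(total_cr)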
