PySpark freezing after an operation on an aggregated DataFrame - apache-spark

I am using Spark 1.5.2 with Python 2.7.5.
I have this code that I run in the pyspark repl:
from pyspark.sql import SQLContext
ctx = SQLContext(sc)
df = ctx.createDataFrame([("a",1),("a",1),("a",0),("a",0),("b",1),("b",0),("b",1)],["group","conversion"])
from pyspark.sql.functions import col, count, avg
funs = [(count,"total"),(avg,"cr")]
aggregate = ["conversion"]
exprs = [f(col(c)).alias(name) for f,name in funs for c in aggregate]
df3 = df.groupBy("group").agg(*exprs).cache()
So far the code works fine and I can check df3:
>>> df3.collect()
[Row(group=u'a', total=4, cr=0.5), Row(group=u'b', total=3, cr=0.6666666666666666)]
However, when I try:
df3.agg(sum(col('cr'))).first()[0]
PySpark freezes and never returns the sum. However, df3.rdd.reduce(lambda x,y: x[2]+y[2]) works just fine.
So, what is the issue with the first command to calculate the sum?

You should import pyspark's sum function first: from pyspark.sql.functions import sum. Otherwise Python's built-in sum is called, which just tries to sum a sequence of numbers instead of building a Spark aggregation expression.
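A minimal sketch of the fix, importing Spark's sum under an alias (the spark_sum name is just a local choice) so it cannot be shadowed by the built-in:
from pyspark.sql.functions import col, sum as spark_sum
# spark_sum builds a Spark aggregation expression instead of iterating in Python
total_cr = df3.agg(spark_sum(col('cr'))).first()[0]
print(total_cr)  # roughly 1.17 for the example data above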

Related

Appending column name to column value using Spark

I have data in a comma-separated file, which I have loaded into a Spark data frame.
The data looks like:
A B C
1 2 3
4 5 6
7 8 9
I want to transform the above data frame in spark using pyspark as:
A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------
Then convert it to a list of lists using pyspark as:
[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]
And then run the FP-Growth algorithm using pyspark on the above data set.
The code that I have tried is below:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")
names=df.schema.names
Then I thought of doing something inside a for loop:
for name in names:
-----
------
After this I will be using fpgrowth:
df = spark.createDataFrame([
    (0, ["A_1", "B_2", "C_3"]),
    (1, ["A_4", "B_5", "C_6"])], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
A number of concepts here for people who normally use Scala, showing how to do the same with pyspark. Somewhat different but instructive for sure, although for how many people is the big question. I certainly learnt a point about zipWithIndex in pyspark myself. Anyway.
The first part is to get the data into the desired format; probably too many imports, but leaving them as is:
from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as f
source_df = spark.createDataFrame(
[
(1, 11, 111),
(2, 22, 222)
],
["colA", "colB", "colC"]
)
intermediate_df = (reduce(
lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
source_df.columns,
source_df
) )
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))
result_df = result_df.select(split(col("CONCAT_COLS"), ",\s*").alias("ARRAY_COLS"))
# Add 0,1,2,3, ... with zipWithIndex; we add it at the back, but that does not matter, you can move it around.
# Get the new structure: the existing fields (one in this case, but done flexibly) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])
# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
)
final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)
returns:
+---------------------------+-----+
|ARRAY_COLS |index|
+---------------------------+-----+
|[colA_1, colB_11, colC_111]|0 |
|[colA_2, colB_22, colC_222]|1 |
+---------------------------+-----+
The second part is the old zipWithIndex trick with pyspark, needed if you want 0,1,... indexes. Painful compared to Scala.
In general this is easier to solve in Scala.
Not sure on performance; it is not a foldLeft, which is interesting. I think it is OK actually.
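To close the loop with the original question, here is a hedged sketch of feeding the ARRAY_COLS column into FPGrowth; the column renames and the show() calls are illustrative, and the minSupport/minConfidence values are simply the ones from the question:
from pyspark.ml.fpm import FPGrowth
# FPGrowth expects a column of item arrays; ARRAY_COLS already has that shape
fp_input = final_result_df.selectExpr("index as id", "ARRAY_COLS as items")
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(fp_input)
model.freqItemsets.show(truncate=False)
model.associationRules.show(truncate=False)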

How to add a column to a PySpark dataframe which contains the nth quantile of another column in the dataframe

I have a very large CSV file which has been imported as a PySpark dataframe: df. The dataframe contains many columns, including a column ireturn. I want to compute the 0.99 and 0.01 percentiles of this column and then add two columns to the dataframe df, new_col_99 and new_col_01, which contain the 0.99 and 0.01 percentiles respectively. I wrote the following code, which works for small dataframes, but I get errors when I apply it to my large dataframe.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)
percentile_99 = df.selectExpr('percentile(val1, 0.99)').head(1)[0][0]
percentile_01 = df.selectExpr('percentile(val1, 0.01)').head(1)[0][0]
from pyspark.sql.functions import lit
df = df.withColumn("new_col_99", lit(percentile_99))
df = df.withColumn("new_col_01", lit(percentile_01))
I tried to replace head with collect, but it did not work either. I get this error:
Logging error ---
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:49850)
Traceback (most recent call last):...
I have also tried the following:
percentile = df.approxQuantile('ireturn',[0.01,0.99],0.25)
df = df.withColumn("new_col_01", lit(percentile[0]))
df = df.withColumn("new_col_99", lit(percentile[1]))
The above code takes about 15-20 minutes to run, but the result is wrong (the values in column ireturn are all less than 1, yet it returns the 0.99 percentile as 6789....)
Late, but hopefully this answers your concerns. You can get the result this way:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)
df = df.withColumn("new_col_99", F.expr('percentile(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile(val1, 0.01) over()'))
For large datasets percentile_approx may be better:
df = df.withColumn("new_col_99", F.expr('percentile_approx(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile_approx(val1, 0.01) over()'))

Pyspark: applying kmeans on different groups of a dataframe

Using Pyspark I would like to apply kmeans separately on groups of a dataframe and not to the whole dataframe at once. For the moment I use a for loop which iterates over each group, applies kmeans and appends the result to another table. But having a lot of groups makes it time consuming. Could anyone help me please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id") == customer)
    df = assembler.transform(temp_df)
    k = 1
    while (k < 5 and mtrc < width):
        k += 1
        kmeans = KMeans(k=k, seed=5, maxIter=20, initSteps=5)
        model = kmeans.fit(df)
        mtrc = 1 - model.computeCost(df)/ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure Spark or Scala solution would be preferable, but has yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
from sklearn.cluster import KMeans
import numpy as np  # needed for np.array below
kmeans = KMeans(n_clusters=2, random_state=0)
def skmean(kmeans, x):
    X = np.array(x)
    kmeans.fit(X)
    return kmeans.predict(X)
You can apply skmean() to a pandas data frame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to a pyspark data frame, we use pandas_udf. But first, define a schema for the output data frame:
from pyspark.sql.types import *
schema = StructType(
[StructField('cat',StringType(),True),
StructField('clusters',ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans, x)))
    result.reset_index(inplace=True, drop=False)
    return result
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution which I think is slightly better than the last one. The idea is to use groupby() together with collect_list() and write a udf that takes a list as input and generates the clusters. Continuing with df_spark from the other solution, we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1,1)
    kmeans.fit(X)
    clusters = kmeans.predict(X).tolist()
    return clusters
clustering_udf = F.udf(lambda arr: skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to expand the cluster list back into one row per element.
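A hedged sketch of that last step, assuming Spark 2.4+ so arrays_zip is available and each value stays paired with its cluster label:
# zip the values with their cluster labels, then explode back to one row per observation
df_exploded = (df
    .withColumn('pair', F.explode(F.arrays_zip('val_list', 'clusters')))
    .select('cat',
            F.col('pair.val_list').alias('val'),
            F.col('pair.clusters').alias('cluster')))
df_exploded.show()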

Spark ML: Taking square root of feature columns

Hi, I am using a custom UDF to take the square root of each value in each column.
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
square_root_UDF = udf(lambda x: math.sqrt(x), DoubleType())
for x in features:
    dataTraining = dataTraining.withColumn(x, square_root_UDF(x))
Is there any faster way to get it done? The polynomial expansion function is not suitable in this case.
Don't use a UDF. Instead use the built-in function:
from pyspark.sql.functions import sqrt
for x in features:
    dataTraining = dataTraining.withColumn(x, sqrt(x))
To add the sqrt results as a column in Scala you need to do the following:
import hc.implicits._
import org.apache.spark.sql.functions.sqrt
val dataTraining = dataTraining.withColumn("x_std", sqrt('x_variance))
In order to speed up your calculation in this case:
put your data into a DataFrame (not an RDD)
use vectorized operations (not lambda operations with a UDF), as suggested by @user7757642
This is an example if your dataTraining is an RDD:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sqrt
spark = SparkSession.builder.appName("SessionName") \
    .config("spark.some.config.option", "some_value") \
    .getOrCreate()
df = spark.createDataFrame(dataTraining)
for x in features:
    df = df.withColumn(x, sqrt(x))
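A hedged aside: chaining many withColumn calls grows the query plan when features is long, and a single select usually does the same work in one pass. A minimal sketch, assuming every column listed in features is numeric (note that it also reorders the columns):
from pyspark.sql.functions import sqrt, col
other_cols = [c for c in df.columns if c not in features]
df = df.select(*other_cols, *[sqrt(col(x)).alias(x) for x in features])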

Load twice to run spark window ok

I'm using spark 1.5.1 and I don't know if this is a bug or a feature...
I'm trying to sum a dataframe column using a window. The normal code should be like:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
df = sqlContext.read.parquet('/tmp/my_file.parquet')
w = Window().partitionBy('id').orderBy('timestamp').rowsBetween(0, 5)
df.select(F.sum(df.v1).over(w)).show()
Result:
That result is wrong (I'd say that is trash). But, if I 'load' twice...
from pyspark.sql.window import Window
from pyspark.sql import functions as F
df = sqlContext.read.parquet('/tmp/my_file.parquet')
df = sqlContext.read.parquet('/tmp/my_file.parquet') # NEW LOAD
w = Window().partitionBy('id').orderBy('timestamp').rowsBetween(0, 5)
df.select(F.sum(df.v1).over(w)).show()
The new result is correct:
What is happening?
It seems this is a bug in Spark 1.5.1 that has since been resolved. This is the JIRA ticket for the bug.

Resources