I'm using Spark 1.5.1 and I don't know if this is a bug or a feature...
I'm trying to sum a DataFrame column over a window. The code should look like this:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
df = sqlContext.read.parquet('/tmp/my_file.parquet')
w = Window().partitionBy('id').orderBy('timestamp').rowsBetween(0, 5)
df.select(F.sum(df.v1).over(w)).show()
Result:
That result is wrong (I'd say it's garbage). But if I 'load' the file twice...
from pyspark.sql.window import Window
from pyspark.sql import functions as F
df = sqlContext.read.parquet('/tmp/my_file.parquet')
df = sqlContext.read.parquet('/tmp/my_file.parquet') # NEW LOAD
w = Window().partitionBy('id').orderBy('timestamp').rowsBetween(0, 5)
df.select(F.sum(df.v1).over(w)).show()
The new result is correct:
What is happening?
It seems this is a known bug in Spark 1.5.1 that has since been resolved; this is the JIRA ticket tracking the bug.
My PySpark code is below. The first part, i.e. the cell with the import statements, takes a very long time to run in Jupyter; in fact it hadn't finished even after 5-6 hours, and later it showed a "Time limit exceeded" error.
I have tried everything: restarting Jupyter, uninstalling and reinstalling Anaconda, uninstalling and reinstalling both Spark and PySpark. I even removed Python completely and installed it again, but the problem was never solved.
Edit 1: I realized that the problem is with the line spark = init_spark(). This line takes a very long time to run (in fact it has not finished even after 4-5 hours).
Please help me with this.
import os
import sys
import pyspark
from pyspark.rdd import RDD
from pyspark.sql import Row
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import functions
from pyspark.sql.functions import lit, desc, col, size
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from IPython.core.interactiveshell import InteractiveShell
import matplotlib
from pylab import *
import scipy.stats as stats
# This helps auto print out the items without explicitly using 'print'
InteractiveShell.ast_node_interactivity = "all"
# Initialize a spark session.
conf = pyspark.SparkConf().setMaster("local[*]")
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("Statistical Inferences with Pyspark") \
        .config(conf=conf) \
        .getOrCreate()
    return spark
spark = init_spark()
filename_data = r'D:\Subjects\ARTIFICIAL INTELLIGENCE\SEMESTER - 5\Big Data and DataBase Management\End Sem Project\endomondoHR_proper.json'
df = spark.read.json(filename_data, mode="DROPMALFORMED")
# Load meta data file into pyspark data frame as well
print('Data frame type: {}'.format(type(df)))
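One way to narrow this down (just a suggestion, not something the post tried) is to start a bare local session outside Jupyter, in a plain Python shell; if the minimal sketch below returns quickly, the hang is more likely in the notebook setup than in Spark itself:
from pyspark.sql import SparkSession
# Throwaway local session with no extra config, just to time the startup.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("startup-check") \
    .getOrCreate()
print(spark.version)             # should appear within seconds if Spark starts fine
print(spark.range(5).count())    # trivial job to confirm the local executor works
spark.stop()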
I am trying to develop a custom describe. To do that, I want to combine functions from pyspark.sql.functions with user-defined aggregate functions (UDAFs).
The code looks like:
from pyspark.sql.functions import count
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import entropy
# Define a UDAF
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_entropy(data):
    p_data = data.value_counts()  # counts occurrence of each value
    s = entropy(p_data)  # get entropy from counts
    return s
# Perform a groupby-agg
groupby_col = "a_column"
agg_col = "another_column"
df2return = df\
    .groupBy(groupby_col)\
    .agg(count(agg_col).alias("count"),
         my_entropy(agg_col).alias("s"))
df2return.show()
The error thrown is very long, so I only copied the last exception raised.
Does anyone know how to fix this?
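If the exception is Spark refusing to mix a built-in aggregate with a group-aggregate pandas UDF in a single .agg() (an assumption here, since the traceback is not shown), a possible workaround sketch is to run the two aggregations separately and join them on the group key:
# Hypothetical workaround: compute the built-in aggregate and the pandas UDAF
# in separate groupBy/agg calls, then join the results on the grouping column.
counts = df.groupBy(groupby_col).agg(count(agg_col).alias("count"))
entropies = df.groupBy(groupby_col).agg(my_entropy(agg_col).alias("s"))
df2return = counts.join(entropies, on=groupby_col)
df2return.show()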
Using PySpark, I would like to apply k-means separately to groups of a DataFrame rather than to the whole DataFrame at once. For the moment I use a for loop which iterates over each group, applies k-means, and appends the result to another table. But having a lot of groups makes it time-consuming. Could anyone help me, please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id") == customer)
    df = assembler.transform(temp_df)
    k = 1
    while (k < 5) and (mtric < width):
        k += 1
        kmeans = KMeans(k=k, seed=5, maxIter=20, initSteps=5)
        model = kmeans.fit(df)
        mtric = 1 - model.computeCost(df)/ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure Spark or Scala solution would be preferable but has yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
def skmean(kmeans, x):
    X = np.array(x)
    kmeans.fit(X)
    return kmeans.predict(X)
You can apply skmean() to a pandas data frame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to pyspark data frame, we use pandas_udf. But first define a schema for the output data frame:
from pyspark.sql.types import *
schema = StructType(
    [StructField('cat', StringType(), True),
     StructField('clusters', ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans, x)))
    result.reset_index(inplace=True, drop=False)
    return result
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution, which I think is slightly better than the last one. The idea is to use groupby() together with collect_list() and write a udf that takes a list as input and generates the clusters. Continuing with df_spark from the other solution, we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1, 1)
    kmeans.fit(X)
    clusters = kmeans.predict(X).tolist()
    return clusters
clustering_udf = F.udf(lambda arr : skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to expand the list into one row per cluster label.
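For example, a short sketch of that last step, continuing from the df built above:
# One row per (cat, cluster label); use F.posexplode if you also need the element index.
df_clusters = df.select('cat', F.explode('clusters').alias('cluster'))
df_clusters.show()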
Hi, I am using a custom UDF to take the square root of each value in each column.
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
square_root_UDF = udf(lambda x: math.sqrt(x), DoubleType())
for x in features:
    dataTraining = dataTraining.withColumn(x, square_root_UDF(x))
Is there any faster way to get this done? The polynomial expansion function is not suitable in this case.
Don't use a UDF. Use the built-in function instead:
from pyspark.sql.functions import sqrt
for x in features:
    dataTraining = dataTraining.withColumn(x, sqrt(x))
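If features contains many columns, a single select can replace the withColumn loop (a sketch, not part of the original answer; note that it changes the column order):
from pyspark.sql.functions import sqrt
# Keep the non-feature columns as-is and replace each feature with its square root.
others = [c for c in dataTraining.columns if c not in features]
dataTraining = dataTraining.select(others + [sqrt(x).alias(x) for x in features])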
To add the sqrt results as a new column in Scala you need to do the following:
import hc.implicits._
import org.apache.spark.sql.functions.sqrt
val dataTraining = dataTraining.withColumn("x_std", sqrt('x_variance))
In order to speed up your calculation in this case:
put your data into a DataFrame (not an RDD)
use vectorized operations (not lambda operations with a UDF), as suggested by @user7757642
This is an example if your dataTraining is an RDD:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sqrt
spark = SparkSession.builder.appName("SessionName") \
    .config("spark.some.config.option", "some_value") \
    .getOrCreate()
df = spark.createDataFrame(dataTraining)
for x in features:
    df = df.withColumn(x, sqrt(x))
I am using Spark 1.5.2 with Python 2.7.5.
I have this code that I run in the PySpark REPL:
from pyspark.sql import SQLContext
ctx = SQLContext(sc)
df = ctx.createDataFrame([("a",1),("a",1),("a",0),("a",0),("b",1),("b",0),("b",1)],["group","conversion"])
from pyspark.sql.functions import col, count, avg
funs = [(count,"total"),(avg,"cr")]
aggregate = ["conversion"]
exprs = [f(col(c)).alias(name) for f,name in funs for c in aggregate]
df3 = df.groupBy("group").agg(*exprs).cache()
So far the code works fine and I can check df3:
>>> df3.collect()
[Row(group=u'a', total=4, cr=0.5), Row(group=u'b', total=3, cr=0.6666666666666666)]
However, when I try:
df3.agg(sum(col('cr'))).first()[0]
PySpark can't calculate that sum. However df3.rdd.reduce(lambda x,y: x[2]+y[2]) works just fine.
So, what is the issue with the first command to calculate the sum?
You should import PySpark's sum function first: from pyspark.sql.functions import sum. Otherwise Python's built-in sum is called, which just tries to sum a sequence of numbers.
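For example, importing it under an alias keeps Python's built-in sum available too (the name spark_sum is just a suggestion):
from pyspark.sql.functions import col, sum as spark_sum
# Aggregate the cached df3 from the question and pull out the scalar result.
total_cr = df3.agg(spark_sum(col('cr'))).first()[0]
print(total_cr)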