dropDuplicates in PySpark gives StackOverflowError - apache-spark

I have a PySpark program which reads a JSON file of around 350-400 MB and creates a DataFrame out of it.
In the next step, I register the DataFrame with createOrReplaceTempView and build a Spark SQL query to select the few columns I need.
Once this is done, I filter my DataFrame with some conditions. Everything worked fine up to this point.
Then I needed to remove some duplicate values based on a column, so I introduced
dropDuplicates as the next step, and it suddenly started giving me a StackOverflowError.
Below is the sample code:
def create_some_df(initial_df):
    initial_df.createOrReplaceTempView('data')
    original_df = spark.sql('select c1, c2, c3, c4 from data')
    ## Filter out some events
    original_df = original_df.filter(filter1condition)
    original_df = original_df.filter(filter2condition)
    original_df = original_df.dropDuplicates(['c1'])
    return original_df
It worked fine until I added the dropDuplicates call.
I am using a 3-node AWS EMR cluster of c5.2xlarge instances.
I am running PySpark via spark-submit in YARN client mode.
What I have tried
I tried adding persist and cache before calling filter, but it didn't help.
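Roughly what that attempt looked like, as a sketch built on the function above (filter1condition and filter2condition are the same placeholders as before, and the storage level is just an example):
from pyspark import StorageLevel

def create_some_df(initial_df):
    initial_df.createOrReplaceTempView('data')
    original_df = spark.sql('select c1, c2, c3, c4 from data')
    # Materialise the selected columns before filtering, in the hope of
    # shortening the lineage that dropDuplicates later extends
    original_df = original_df.persist(StorageLevel.MEMORY_AND_DISK)
    original_df = original_df.filter(filter1condition)
    original_df = original_df.filter(filter2condition)
    original_df = original_df.dropDuplicates(['c1'])
    return original_df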
EDIT - Some more details
I realise that the error appears when I invoke my write function after multiple transformations, i.e. on the first action.
If I have dropDuplicates among my transformations before the write, it fails with the error.
If I do not have dropDuplicates in my transformations, the write works fine.

Related

Apache Spark Dataframes Not Being Created with Databricks

When I read data from SQL in a notebook, a Spark DataFrame is created, but when I read the same data from a different notebook I don't see a DataFrame.
So when I run the following in one notebook I get a DataFrame:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword}"
my_sales = spark.read.jdbc(jdbcUrl, 'AZ_FH_ELLIPSE.AZ_FND_MSF620')
I get the following DataFrame output.
However, when I run the same code in a different notebook, I only see how long the code took to run, but no DataFrame.
Any thoughts?
I should mention that the DataFrame isn't appearing on the Community Edition of Databricks. However, I don't think that should be the reason why I'm not seeing a DataFrame or schema appear.
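One hedged way to check whether the DataFrame itself is fine and only the notebook's automatic rendering differs is to force output with an explicit action (a minimal sketch; show() and printSchema() are plain PySpark, display() is the Databricks notebook helper):
# Force evaluation and print results explicitly, independent of whether the
# notebook auto-renders the last expression in a cell
my_sales.printSchema()   # prints the schema without running a job
my_sales.show(5)         # runs a job and prints the first rows as text
# In a Databricks notebook, the built-in helper renders a table widget:
# display(my_sales)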

Dropping temporary columns in Spark

I'm creating a new column in a data frame and using it in subsequent transformations. Later, when I try to drop the new column, it breaks the execution. When I look into the execution plan, Spark optimizes the plan by removing the whole flow because I'm dropping the column at a later stage. How can I drop a temporary column without affecting the execution plan? I'm using PySpark.
# some_transformation_expr, some_other_transformation_expr and some_condition are placeholders
df = df.withColumn('COLUMN_1', some_transformation_expr).withColumn('COLUMN_2', some_other_transformation_expr)
df = df.withColumn('RESULT', when(some_condition, col('COLUMN_1')).otherwise(col('COLUMN_2'))).drop('COLUMN_1', 'COLUMN_2')
I have tried this in spark-shell (using Scala) and it works as expected.
I'm using Spark 2.4.4 and Scala 2.11.12.
I have tried the same in PySpark; refer to the attachment. Let me know if this answer helps.
With PySpark:
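Since the attachment isn't included here, a minimal self-contained sketch of what that PySpark check might look like (the sample data and literal values are made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ['id'])
df = df.withColumn('COLUMN_1', lit('a')).withColumn('COLUMN_2', lit('b'))
df = df.withColumn('RESULT', when(col('id') == 1, col('COLUMN_1')).otherwise(col('COLUMN_2'))) \
       .drop('COLUMN_1', 'COLUMN_2')
df.explain(True)   # inspect the parsed/analyzed/optimized plans
df.show()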

PySpark structured streaming apply udf to window

I am trying to apply a pandas UDF to a window of a PySpark structured stream. The problem is that as soon as the stream has caught up with the current state, all new windows somehow only contain a single value.
As you can see in the screenshot, all windows after 2019-10-22T15:34:08.730+0000 only contain a single value. The code used to generate this is:
import pandas as pd
from pyspark.sql.functions import col, window, pandas_udf, PandasUDFType

@pandas_udf("Count long, Resampled long, Start timestamp, End timestamp", PandasUDFType.GROUPED_MAP)
def myudf(df):
    df = df.dropna()
    df = df.set_index("Timestamp")
    df.sort_index(inplace=True)
    # resample the dataframe
    resampled = pd.DataFrame()
    oidx = df.index
    nidx = pd.date_range(oidx.min(), oidx.max(), freq="30S")
    resampled["Value"] = df.Value.reindex(oidx.union(nidx)).interpolate('index').reindex(nidx)
    return pd.DataFrame([[len(df.index), len(resampled.index), df.index.min(), df.index.max()]],
                        columns=["Count", "Resampled", "Start", "End"])

predictionStream = sensorStream.withWatermark("Timestamp", "90 minutes") \
    .groupBy(col("Name"), window(col("Timestamp"), "70 minutes", "5 minutes"))

predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .start()
The stream does get new values every 5 minutes. It's just that the window somehow only takes values from the last batch, even though the watermark should not have expired.
Am I doing anything wrong? I already tried playing with the watermark; it had no effect on the result. I need all values of the window inside the UDF.
I am running this in Databricks on a cluster set to 5.5 LTS ML (includes Apache Spark 2.4.3, Scala 2.11).
It looks like you could specify the Output Mode you want for your writeStream.
See documentation at Output Modes
By default it's using Append Mode:
This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink.
Try using:
predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .outputMode("complete") \
    .start()
I found a Spark JIRA issue concerning this problem but it was closed without resolution. The bug appears to be, and I confirmed this independently on Spark version 3.1.1, that the Pandas UDF is executed on every trigger only with the data since the last trigger. So you are likely only processing a subset of the data you want to take into account on each trigger. Grouped Map Pandas UDFs do not appear to be functional for structured streaming with a delta table source. Please do follow up if you previously found a solution, otherwise I’ll just leave this here for folks that also find this thread.
Edit: There's some discussion in the Databricks forums about first doing a streaming aggregation and following that up with a Pandas UDF (that will likely expect a single record with columns containing arrays) as shown below. I tried it. It works. However, my batch duration is high and I'm uncertain how much this additional work is contributing to it.
from pyspark.sql import functions as f

agg_exprs = [f.collect_list('col_of_interest_1'),
             f.collect_list('col_of_interest_2'),
             f.collect_list('col_of_interest_3')]
intermediate_sdf = source_sdf.groupBy('time_window', ...).agg(*agg_exprs)
final_sdf = intermediate_sdf.groupBy('time_window', ...).applyInPandas(func, schema)
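For illustration only, a hypothetical shape for func and schema under this collect_list pattern (the column name and the returned metrics are assumptions, not from the post):
import pandas as pd

# Hypothetical grouped-map function for applyInPandas: each pdf holds one group,
# with the un-aliased collect_list columns arriving as array-valued cells
def func(pdf: pd.DataFrame) -> pd.DataFrame:
    values = pd.Series(pdf['collect_list(col_of_interest_1)'].iloc[0])
    return pd.DataFrame({'n_values': [len(values)],
                         'mean_value': [float(values.mean())]})

# Matching (assumed) result schema string
schema = "n_values long, mean_value double"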

passing value of RDD to another RDD as variable - Spark #Pyspark [duplicate]

This question already has answers here:
How to get a value from the Row object in Spark Dataframe?
(3 answers)
Closed 4 years ago.
I am currently exploring how to call big HQL files (containing an insert-into-select statement of around 100 lines) via sqlContext.
Another thing is that the HQL files are parameterized, so while calling them from sqlContext I want to pass the parameters as well.
I have gone through loads of blogs and posts, but haven't found any answer to this.
Another thing I was trying was to store the output of an RDD in a variable.
pyspark
max_date=sqlContext.sql("select max(rec_insert_date) from table")
# now I want to pass max_date as a variable to the next query
incremetal_data=sqlConext.sql(s"select count(1) from table2 where rec_insert_date > $max_dat")
This is not working; moreover, the value of max_date comes back as
u[row-('20018-05-19 00:00:00')]
so it's not clear how to trim those extra characters.
The sqlContext returns a Dataset[Row]. You can get your value from there with
max_date=sqlContext.sql("select count(rec_insert_date) from table").first()[0]
In Spark 2.0+, using a SparkSession, you can run
max_date=spark.sql("select count(rec_insert_date) from table").rdd.first()[0]
to get the underlying RDD from the returned DataFrame.
Shouldn't you use max(rec_insert_date) instead of count(rec_insert_date)?
You have two options for passing a value returned from one query to another:
Use collect, which will trigger the computation and assign the returned value to a variable:
max_date = sqlContext.sql("select max(rec_insert_date) from table").collect()[0][0]  # max_date now holds the actual date
incremental_data = sqlContext.sql("select count(1) from table2 where rec_insert_date > '{}'".format(max_date))
Another (and better) option is to use Dataframe API
from pyspark.sql.functions import col, lit
incremental_data = sqlContext.table("table2").filter(col("rec_insert_date") > lit(max_date))
Use a cross join; it should be avoided if the first query returns more than one row. The advantage is that you don't break the processing graph, so everything can be optimized by Spark.
max_date_df = sqlContext.sql("select max(rec_insert_date) as max_date from table") # max_date_df is a dataframe with just one row
incremental_data = sqlContext.table("table2").join(max_date_df).filter(col("rec_insert_date") > col("max_date"))
As for your first question, how to call large HQL files from Spark:
If you're using Spark 1.6 then you need to create a HiveContext https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#hive-tables
If you're using Spark 2.x then while creating SparkSession you need to enable Hive Support https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
You can start by passing them to the sqlContext.sql(...) method; from my experience this usually works and is a nice starting point for rewriting the logic with the DataFrame/Dataset API. There may be some issues when running it on your cluster, because your queries will be executed by Spark's SQL engine (Catalyst) and won't be passed to Hive.
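A minimal sketch of the Spark 2.x setup mentioned above (the app name is arbitrary, and the statements are the ones from this question rewritten against a SparkSession):
from pyspark.sql import SparkSession

# Spark 2.x: enable Hive support so HiveQL statements and Hive tables
# are available through spark.sql(...)
spark = SparkSession.builder \
    .appName("run-hql") \
    .enableHiveSupport() \
    .getOrCreate()

# Parameterised HQL: substitute the value into the SQL text before submitting it
max_date = spark.sql("select max(rec_insert_date) from table").collect()[0][0]
incremental_data = spark.sql(
    "select count(1) from table2 where rec_insert_date > '{}'".format(max_date))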

Not able to set number of shuffle partitions in PySpark

I know that by default the number of partitions for shuffle tasks is set to 200 in Spark. I can't seem to change this. I'm running Jupyter with Spark 1.6.
I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:
from pyspark.sql.functions import *
sqlContext.sql("set spark.sql.shuffle.partitions=10")
test= sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()
The output confirms 200 tasks. From the activity log, it's spinning up 200 tasks, which is overkill. It seems like line number 2 above is ignored. So, I tried the following:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5)
and create a new cell:
print test.rdd.getNumPartitions()
print test.count()
The output shows 5 partitions, but the log shows 200 tasks being spun up for the count, with the repartition to 5 happening afterwards. However, if I convert it to an RDD first and back to a DataFrame as follows:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5).rdd
and create a new cell:
print test.getNumPartitions()
print test.toDF().count()
The very first time I ran the new cell, it still ran with 200 tasks. However, the second time I ran the new cell, it ran with 5 tasks.
How can I make the code run with 5 tasks the very first time it runs?
Would you mind explaining why it behaves this way (the number of partitions is specified, but it still runs with the default setting)? Is it because the default Hive table was created with 200 partitions?
At the beginning of your notebook, do something like this:
from pyspark import SparkConf, SparkContext

sc.stop()
conf = SparkConf().setAppName("test")
conf.set("spark.default.parallelism", 10)
sc = SparkContext(conf=conf)
When the notebook starts, a SparkContext has already been created for you, but you can still change the configuration and recreate it.
As for spark.default.parallelism, I understand it is what you need; take a look here:
Default number of partitions in RDDs returned by transformations like
join, reduceByKey, and parallelize when not set by user.
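As a possible follow-up to the snippet above (assuming the same Jupyter setup and the table name from the question): recreate the SQLContext on the new SparkContext, and optionally set the SQL-side shuffle property that the question mentions as well.
from pyspark.sql import SQLContext

# Rebuild the SQLContext on top of the recreated SparkContext, then set the
# SQL shuffle-partition property from the question before running any query
sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

test = sqlContext.table('some_table')
print(test.count())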
