Dropping temporary columns in Spark - apache-spark

I'm creating a new column in a DataFrame and using it in subsequent transformations. Later, when I try to drop the new column, it breaks the execution. When I look into the execution plan, Spark optimizes the plan by removing the whole flow, because I'm dropping the column in a later stage. How can I drop a temporary column without affecting the execution plan? I'm using PySpark.
df = df.withColumn('COLUMN_1', "some transformation returns value") \
       .withColumn('COLUMN_2', "some transformation returns value")
df = df.withColumn('RESULT', when("some condition", col('COLUMN_1')).otherwise(col('COLUMN_2'))) \
       .drop('COLUMN_1', 'COLUMN_2')

I have tried this in spark-shell (using Scala) and it works as expected.
I'm using Spark 2.4.4 and Scala 2.11.12.
I have tried the same in PySpark; refer to the attachment. Let me know if this answer helps you.
With PySpark:
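A minimal, self-contained sketch of the same pattern in PySpark (the data, column names, and condition below are placeholders, not the asker's actual logic):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ["id", "value"])

# Derive two temporary columns.
df = df.withColumn("COLUMN_1", col("value") * 2) \
       .withColumn("COLUMN_2", col("value") * 3)

# Use them to build the final column, then drop them.
df = df.withColumn("RESULT",
                   when(col("id") == 1, col("COLUMN_1")).otherwise(col("COLUMN_2"))) \
       .drop("COLUMN_1", "COLUMN_2")

# The dropped columns disappear from the output schema, but the projections
# that computed RESULT remain part of the plan.
df.show()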

Related

Options for inserting large dataset from databricks to SQL table using sparkR

I am trying to write a large dataset (millions of rows) to a SQL table (Impala) using SparkR in Databricks. I have found two options, neither of which is working.
Writing using a simple insertInto fails after five minutes with 'The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.' It does not restart:
sparkR.session()
insertInto(spark_dt_frame, sql_table, overwrite = FALSE)
The second using COPY INTO seems to hang (runs forever and never completes) even when just inserting 3 rows:
sparkR.session()
sql(paste("COPY INTO ",db_name,'.sql_table',
" FROM ''", spark_data_frame, "'",
" FILEFORMAT = PARQUET",
sep=""
))
It seems these are common issues for which Databricks' only answer is 'detach and reattach the notebook', which makes no difference. What are my options?
For anyone else who struggles with this issue - it relates to how memory is handled for R dataframes in Databricks clusters. To work around it, I have found two options so far:
Convert your df to a partitioned Spark DataFrame prior to the insert (note, you may still need to increase the size of your cluster's driver):
spark_df_for_insert <- createDataFrame(r_df, numPartitions=150)
Stop using R dataframes and switch to Spark DataFrames. This means you will need to change your code, and a package like sparklyr will certainly come in handy.
I hope that helps somebody.
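For reference, a rough PySpark sketch of the first workaround (repartitioning before the insert); the source data, partition count, and table name are placeholders, and the target table is assumed to already exist:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source data; in practice this is the DataFrame you want to insert.
df = spark.range(0, 1000000).withColumnRenamed("id", "key")

# Spread the data over many partitions so no single task (or the driver) buffers too much.
df.repartition(150).write.insertInto("my_db.my_table", overwrite=False)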

dropDuplicates in PySpark gives StackOverflowError

I have a PySpark program which reads JSON files of around 350-400 MB and creates a DataFrame out of them.
In my next step, I create a Spark SQL query using createOrReplaceTempView and select a few columns as required.
Once this is done, I filter my DataFrame with some conditions. It was working fine up to this point.
Now, I needed to remove some duplicate values using a column. So, I introduced
dropDuplicates in the next step, and it suddenly started giving me a StackOverflowError.
Below is the sample code:
def create_some_df(initial_df):
    initial_df.createOrReplaceTempView('data')
    original_df = spark.sql('select c1, c2, c3, c4 from data')
    ## Filter out some events
    original_df = original_df.filter(filter1condition)
    original_df = original_df.filter(filter2condition)
    original_df = original_df.dropDuplicates(['c1'])
    return original_df
It worked fine until I added the dropDuplicates call.
I am using a 3-node AWS EMR cluster of c5.2xlarge instances.
I am running PySpark using the spark-submit command in YARN client mode.
What I have tried
I tried adding persist and cache before calling filter, but it didn't help
EDIT - Some more details
I realise that the error appears when I invoke my write function after multiple transformations, i.e. the first action.
If I have dropDuplicates in my transformations before the write, it fails with the error.
If I do not have dropDuplicates in my transformations, the write works fine.
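The thread does not include a resolution, but if the StackOverflowError comes from an overly deep lineage/plan once dropDuplicates is added, one commonly suggested mitigation is to checkpoint before writing. A rough sketch only (paths, filters, and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Checkpointing needs a reliable directory (an HDFS/S3 path on EMR); this one is a placeholder.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.read.json("/path/to/input")              # placeholder input path
df.createOrReplaceTempView("data")
df = spark.sql("select c1, c2, c3, c4 from data")
df = df.filter("c2 is not null").filter("c3 > 0")   # placeholder filter conditions
df = df.dropDuplicates(["c1"])

# checkpoint() materializes the data and cuts the lineage, so the plan that
# reaches the write stays shallow.
df = df.checkpoint()
df.write.mode("overwrite").parquet("/path/to/output")   # placeholder output path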

Spark SQL Update/Delete

Currently, I am working on a project using PySpark that reads in a few Hive tables, stores them as DataFrames, and I have to perform a few updates/filters on them. I am avoiding Spark syntax at all costs, so that the framework only takes SQL from a parameter file, which is then run through my PySpark framework.
Now the problem is that I have to perform UPDATE/DELETE queries on my final DataFrame. Are there any possible workarounds for performing these operations on my DataFrame?
Thank you so much!
A DataFrame is immutable, you cannot change it, so you are not able to update/delete.
If you want to "delete", there is a .filter option (it will create a new DF excluding records based on the validation that you applied in the filter).
If you want to "update", the closest equivalent is .map, where you can "modify" your record and that value will be in a new DF; the thing is that the function will iterate over all the records of the DF.
Another thing that you need to keep in mind: if you load data into a DF from some source (i.e. a Hive table) and perform some operations, that updated data won't be reflected in your source data. DFs live in memory until you persist that data.
So you cannot work with a DF like a SQL table for those operations. Depending on your requirements, you need to analyze whether Spark is a solution for your specific problem.
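To make the filter/map idea above concrete, here is a sketch in PySpark; it uses withColumn with when as the DataFrame-level equivalent of the .map approach, and the table name, columns, and conditions are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_db.my_hive_table")   # placeholder Hive table

# "DELETE FROM t WHERE status = 'obsolete'" becomes a filter that keeps everything else.
df_after_delete = df.filter(col("status") != "obsolete")

# "UPDATE t SET amount = 0 WHERE amount < 0" becomes a conditional column rewrite.
df_after_update = df_after_delete.withColumn(
    "amount",
    when(col("amount") < 0, 0).otherwise(col("amount"))
)

# The source Hive table is untouched; persist the result explicitly if you need it back in Hive.
df_after_update.write.mode("overwrite").saveAsTable("my_db.my_hive_table_updated")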

Writing data from Spark SQL vs RDD api

I recently performed ETL on a dataset using Spark 2.3.0 on EMR 5.19, where I included a new sorting column. I used the following to do this and noticed that the output was much bigger than the original dataset (both compressed Parquet).
spark.sql("select * from schema.table where column = 'value'").write.bucketBy(1,"column1").sortBy("column2","column3").option("path"m"/mypath").saveAsTable("table")
I then reran this using the method below and got the expected data size (same as original).
spark.read.load("/originaldata").filter("column='value'").write.bucketBy(1,"column1").sortBy("column2","column3").option("path"m"/mypath").saveAsTable("table")
My write method is identical, but the way I'm bringing the data in is different. However, why is the first result about 4x bigger than the second? Am I not doing the exact same thing either way? I tried to look up the differences between Spark SQL and the RDD API but can't see anything specifically on writing the data. Note that the original dataset and both results are all partitioned the same way (200 parts in all 3).
After getting the same larger-than-expected result with both approaches, I switched to this instead:
spark.read.load("/originaldata").filter("column='value'").sort("column1","column2").write.save("/location")
This works as expected and does not fail. It also does not use any unnecessary Hive saveAsTable features, making it a better option than sortBy, which also requires bucketBy and saveAsTable.
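A PySpark rendering of that final approach, with the same placeholder paths and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain filter + sort + write: no bucketing, no Hive table registration, just sorted output files.
(spark.read.load("/originaldata")
      .filter("column = 'value'")
      .sort("column1", "column2")
      .write
      .save("/location"))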

Spark dataframe adding new column issue - Structured streaming

I am using Spark Structured Streaming. I have a DataFrame and am adding a new column "current_ts".
inpuDF.withColumn("current_ts", lit(System.currentTimeMillis()))
This does not update every row with the current epoch time. It stamps every row with the same epoch time from when the job was triggered, causing every row in the DF to have the same value. This works well with normal Spark jobs. Is this an issue with Spark Structured Streaming?
Well, Spark records your transformations as a lineage graph and only executes the graph when some action is called. So it will call
System.currentTimeMillis()
when some action is triggered. What I don't understand is what you find confusing, or what you are trying to achieve. Thanks.
Spark has a function to create a column with current timestamp. Your code should look like this:
import org.apache.spark.sql.functions
// ...
inpuDF.withColumn("current_ts", functions.current_timestamp())
The problem with your method is that it uses lit, which creates a literal, i.e. a constant.
Spark treats that as a constant passed from the driver.
So when you launch the job, the literal is evaluated once, at that time.
All records therefore get the same timestamp.
You need to use function instead.
current_timestamp() should work.
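The same fix sketched in PySpark for a streaming job; the socket source and console sink are placeholders for the real pipeline:

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.getOrCreate()

# Placeholder streaming source.
inpuDF = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load())

# current_timestamp() is re-evaluated for each micro-batch at execution time,
# unlike lit(System.currentTimeMillis()), which is fixed once on the driver.
withTs = inpuDF.withColumn("current_ts", current_timestamp())

query = (withTs.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()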
Try this
inpuDF.writeStream.partitionBy('current_ts')
