Spark dataframe adding new column issue - Structured streaming - apache-spark

I am using Spark Structured Streaming. I have a DataFrame and I am adding a new column "current_ts":
inpuDF.withColumn("current_ts", lit(System.currentTimeMillis()))
This does not stamp every row with the current epoch time. Every row gets the same epoch time, the one captured when the job was triggered, so every row in the DataFrame has the same value. This works well with normal Spark jobs. Is this an issue with Spark Structured Streaming?

Spark records your transformations as a lineage graph and only executes the graph when some action is called. So it will evaluate
System.currentTimeMillis()
when some action is triggered. What I don't understand is what you find confusing about that, or what you are trying to achieve. Thanks.

Spark has a function to create a column with the current timestamp. Your code should look like this:
import org.apache.spark.sql.functions
// ...
inpuDF.withColumn("current_ts", functions.current_timestamp())

The problem with your method is that lit() creates a literal, i.e. a constant. Spark treats it as a constant passed from the driver, so the literal is evaluated once, at the time the query is defined, and all records end up with the same timestamp. You need to use a function instead; current_timestamp() should work.
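For reference, a minimal PySpark sketch of the same fix, using the built-in rate source as a stand-in stream and the console sink (both are assumptions for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("current-ts-demo").getOrCreate()

# the built-in "rate" source emits rows continuously, so it works as a test stream
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# current_timestamp() is re-evaluated for each micro-batch (processing time),
# unlike lit(System.currentTimeMillis()), which is frozen when the query is defined
out = stream_df.withColumn("current_ts", current_timestamp())

query = out.writeStream.format("console").outputMode("append").start()
query.awaitTermination()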

Try this:
inpuDF.writeStream.partitionBy('current_ts')

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use PySpark to read in hundreds of CSV files and create a single dataframe that is (roughly) the concatenation of all the CSVs. Since each CSV can fit in memory, but not more than one or two at a time, this seems like a good fit for PySpark. My strategy is not working, and I think it is because I am trying to make a PySpark dataframe inside the kernel function of my map call, which results in an error:
# initiate spark session and other variables
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so I can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
This throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (DataFrames with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc.* methods, it is against the rules. Is that right?
I (think I) could use a plain-old Python (not PySpark) map to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark dataframe, but that doesn't seem to leverage PySpark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, along these lines:
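A minimal sketch (the local master and the example paths are assumptions for illustration); Spark's CSV reader accepts both a glob pattern and a list of paths:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# Option 1: glob pattern covering all the files
df = spark.read.options(delimiter=",", header=True).csv("/data/csvs/*.csv")

# Option 2: pass the whole list of paths at once (same schema assumed)
file_path_list = ["/data/csvs/a.csv", "/data/csvs/b.csv"]  # hypothetical paths
df = spark.read.options(delimiter=",", header=True).csv(file_path_list)

df.show(3)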
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work to the scheduler, presumably because that would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so the read is not happening immediately. I feel like if it worked, it would pass a DAG back, not data.
That doesn't sound like a good route, because you want the loading of files to stay lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas or plain Python, which evaluate immediately, so you would be even more likely to run out of memory.
Speaking of memory, Spark lets you compute on data that is larger than memory, but there are limits to how much larger it can be relative to the memory available. You will inevitably run into errors if you fall short of the needed memory by a considerable margin.
I think you should use the wildcard as shown above.

When does Spark do a "Scan ExistingRDD"?

I have a job that takes in a huge dataset and joins it with another dataset. The first time it ran, it took a really long time and Spark executed a FileScan parquet when reading the dataset, but in future jobs the query plan shows Scan ExistingRDD and the build takes minutes.
Why and how is Spark able to scan an existing RDD? Would it ever fall back to scanning the parquet files that back a dataset (and hence revert to worse performance)?
There are two common situations in Foundry in which you'll see this:
1. You're using a DataFrame you defined manually through createDataFrame.
2. You're running an incremental transform with an input that doesn't have any changes, so you're using an empty synthetic DataFrame that Transforms has created for you (a special case of 1.).
If we follow the Spark code, we find the definition of the noted call, Scan ExistingRDD; this in turn calls into RDDScanExec, which maps over InternalRows (a representation of literal values held by the driver and synthesized into a DataFrame).
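As a quick illustration (toy data; in PySpark, createDataFrame from local rows goes through a parallelized RDD), the physical plan of a manually defined DataFrame typically shows Scan ExistingRDD, while a file-backed dataset shows a FileScan:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("scan-demo").getOrCreate()

local_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
local_df.explain()   # physical plan should contain "Scan ExistingRDD"

# parquet_df = spark.read.parquet("/path/to/dataset")  # hypothetical path
# parquet_df.explain()  # would show "FileScan parquet" instead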

Dropping temporary columns in Spark

I'm creating a new column in a DataFrame and using it in subsequent transformations. Later, when I try to drop the new column, it breaks the execution. When I look into the execution plan, Spark optimizes the plan by removing the whole flow because I'm dropping the column in a later stage. How do I drop a temporary column without affecting the execution plan? I'm using PySpark.
df = (df.withColumn('COLUMN_1', "some transformation returns value")
        .withColumn('COLUMN_2', "some transformation returns value"))
df = (df.withColumn('RESULT', when("some condition", col('COLUMN_1')).otherwise(col('COLUMN_2')))
        .drop('COLUMN_1', 'COLUMN_2'))
I have tried this in spark-shell (using Scala) and it works as expected.
I'm using Spark 2.4.4 and Scala 2.11.12.
I have tried the same in PySpark; see the sketch below. Let me know if this answer helps you.
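With PySpark, a minimal sketch (toy data and a hypothetical condition stand in for the real transformations):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.master("local").appName("drop-temp-cols").getOrCreate()

df = spark.createDataFrame([(1,), (5,)], ["x"])
df = (df.withColumn("COLUMN_1", col("x") * 2)
        .withColumn("COLUMN_2", col("x") * 3))
df = (df.withColumn("RESULT", when(col("x") > 2, col("COLUMN_1")).otherwise(col("COLUMN_2")))
        .drop("COLUMN_1", "COLUMN_2"))

df.show()      # RESULT is still computed even though the helper columns are dropped afterwards
df.explain()   # the optimized plan keeps the expressions that fed RESULT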

Why is `spark.range(100).orderBy('id', ascending=False).rdd` not lazy and trigger an action?

Spark v2.4 pyspark
spark.range(100).orderBy('id', ascending=False).rdd
When I type the above, it immediately spawns a Spark job. I find it surprising, as I didn't even specify an action.
E.g. spark.range(100).repartition(10, 'id').sortWithinPartitions('id').rdd works as expected, in that no job is triggered.
A related question is Why does sortBy transformation trigger a Spark job?
It confirms RDD sortBy can trigger an action.
But here I am using a DataFrame. spark.range(100).orderBy('id', ascending=False) works alright. The job only gets triggered once I access .rdd.
Not all transformations are 100% lazy. orderBy needs to evaluate the data to determine the range boundaries for its range partitioning, so it involves both a transformation and an action.
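A quick PySpark sketch of the observation (local master assumed; the exact behavior can vary by Spark version):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("orderby-rdd").getOrCreate()

df = spark.range(100).orderBy("id", ascending=False)   # lazy: no job yet
rdd = df.rdd   # building the RDD plans the sort, and the range partitioning samples
               # the data, which is why a job shows up in the UI without any action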

Pyspark Checkpointing in Dataproc(StackOverFlowError)

I ran into a StackOverflowError when persisting a dataset with PySpark. I am casting the whole DataFrame to DoubleType and then persisting it to calculate statistics, and I read that checkpointing is a solution to the stack overflow. However, I am having trouble implementing it on Dataproc.
I am working with PySpark, and when I checkpoint the DataFrame and check it with df.isCheckpointed(), it returns False. However, when I debug it, df.rdd.is_checkpointed says True. Is there an issue with the package, or am I doing something wrong?
I thought localCheckpoint was more appropriate for my purpose (https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint()), since my problem was simply that the DAG was too deep, but I couldn't find any usage examples. Also, if I just checkpoint, the RDD says it is checkpointed (as in the first question), but if I try localCheckpoint, it says it is not. Has anyone tried this function?
After trying local standalone mode, I tried it on Dataproc. I tried both HDFS and Google Cloud Storage, but either way the checkpoint storage was empty, even though the RDD says it is checkpointed.
Can anyone help me with this? Thanks in advance!
If you're using localCheckpoint, it will write to the local disk of executors, not to HDFS/GCS: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/rdd/RDD.html#localCheckpoint--.
Also note that there's an eager (checkpoint immediately) and non-eager (checkpoint when the RDD is actually materialized) mode to checkpointing. That may affect what those methods return. The code is often the best documentation: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1615.
In general, please attach sample code (repros) to questions like this -- that way we can answer your question more directly.
Create a checkpoint directory using your SparkSession object:
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
dataframe_name = ...  # any Spark DataFrame of type <pyspark.sql.dataframe.DataFrame>
At this point, dataframe_name is just a DAG, which you can materialize as a checkpoint like this:
dataframe_checkpoint = dataframe_name.checkpoint()
dataframe_checkpoint is also a Spark DataFrame of type <pyspark.sql.dataframe.DataFrame>, but instead of the DAG it stores the result of the query; see the sketch after the list below.
Use checkpoints if:
the computation takes a long time
the computing chain is too long
it depends on too many RDDs
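A self-contained sketch of the flow above (the /tmp checkpoint directory and the toy DataFrame are assumptions for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("checkpoint-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

dataframe_name = spark.range(1000).withColumnRenamed("id", "value")  # stand-in for a deep DAG

# checkpoint() is eager by default: it materializes the result and truncates the lineage
dataframe_checkpoint = dataframe_name.checkpoint()

dataframe_checkpoint.explain()   # the plan should now scan the checkpointed RDD
print(dataframe_checkpoint.count())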
