Join between stream-static data frame - apache-spark

I have an issue with Spark Streaming.
I have a streaming DataFrame and I would like to update it
by adding some rows from a batch DataFrame, but
when I try an operation like a join between the
streaming DataFrame and the batch DataFrame, I get an empty
DataFrame. Can someone help me?

Related

How to avoid re-evaluation of each transformation on pyspark data frame again and again

I have a spark data frame. I'm doing multiple transformations on the data frame. My code looks like this:
df = df.withColumn ........
df2 = df.filter......
df = df.join(df1 ...
df = df.join(df2 ...
I have 30+ transformations like this. I'm also aware that a data frame can be persisted. So if I have some transformations like this:
df1 = df.filter.....some condition
df2 = df.filter.... some condition
df3 = df.filter... some other condition
I'm persisting the data frame "df" in the above case.
The problem is that Spark takes too long to run (8+ minutes), or sometimes fails with a Java heap space error.
But if, after some 10+ transformations, I save to a table (a persistent Hive table) and read from that table in the next line, it takes around 3+ minutes to complete. It does not help if I save it to an intermediate in-memory table instead.
Cluster size is not the issue either.
# some transformations
df.write.mode("overwrite").saveAsTable("test")
df = spark.sql("select * from test")
# some transformations ---------> 3 minutes
# some transformations
df.createOrReplaceTempView("test")
df.count()  # action, so that the view is actually computed
df = spark.sql("select * from test")
# some more transformations --------> 8 minutes
I looked at the Spark SQL plan (I still do not completely understand it). It looks like Spark is re-evaluating the same dataframe again and again.
What am I doing wrong? I should not have to write it to an intermediate table.
Edit: I'm working on Azure Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
Edit2: The issue is the long RDD lineage. It looks like my Spark application gets slower and slower as the RDD lineage grows.
You should use caching.
Try using
df.cache()
df.count()
The count is an action, so it forces the whole data frame to be computed and cached.
Also I recommend you take a look at this and this
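The re-evaluation described in the question can be pictured with a toy model in plain Python. This is not Spark code: LazyFrame is a hypothetical class whose transformations are only recorded and then re-run from the start on every action, unless a cached result is available. It is only meant to suggest why an action right after cache lets later steps skip the earlier lineage.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    evaluations = 0  # counts how many transformation steps were executed

    def __init__(self, source, steps=()):
        self._source = source        # a list of values or a parent LazyFrame
        self._steps = tuple(steps)   # recorded, not yet executed
        self._cached = None
        self._want_cache = False

    def map(self, fn):
        # Return a new frame whose lineage includes this one.
        return LazyFrame(self, [fn])

    def cache(self):
        self._want_cache = True      # lazy: nothing is computed yet
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        if isinstance(self._source, LazyFrame):
            data = self._source.collect()
        else:
            data = list(self._source)
        for fn in self._steps:
            LazyFrame.evaluations += 1
            data = [fn(x) for x in data]
        if self._want_cache:
            self._cached = list(data)
        return data

    def count(self):
        return len(self.collect())


base = LazyFrame([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)

# Without caching: each action re-runs every step of the lineage.
LazyFrame.evaluations = 0
base.map(lambda x: x - 1).collect()
base.map(lambda x: x + 5).collect()
uncached = LazyFrame.evaluations

# With caching: count() materializes base once, later actions reuse it.
LazyFrame.evaluations = 0
base.cache()
base.count()
base.map(lambda x: x - 1).collect()
base.map(lambda x: x + 5).collect()
cached = LazyFrame.evaluations

print("steps executed without cache:", uncached)
print("steps executed with cache:", cached)
```

In this toy model the saving grows with the length of the lineage before the cached frame, which matches the observation in Edit2 only in spirit; real Spark plans, memory limits and storage levels are not modelled here.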

How to make empty values available on data frame while writing to kafka on spark streaming

I am having an issue writing my Spark streaming dataframe to Kafka. I am writing the dataframe as a JSON structure, in the following way:
val df = df_agg.select($"Country", $"plan", $"value")
df.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("topic", "topicname")
  .option("kafka.bootstrap.servers", "ddd.dl.uk.ddd.com:8002")
  .option("sasl.kerberos.service.name", "kafka")
  .option("checkpointLocation", "/user/dddff/ddd/")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .option("Partitioner.class", "DefaultPartitioner")
  .start()
  .awaitTermination()
The issue is that whenever the value of the column "country" is empty, the field is not written at all. For example, I am getting the dataframe df as
US,postpaid,300
CAN,prepaid,30
,postpaid,400
my output on kafka is
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
{"plan":postpaid,"value":400}
But my expected output is
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
{"country":"","plan":postpaid,"value":400}
How can I achieve this? Please help.
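One possible explanation, offered as an assumption rather than something the question confirms, is that the empty country is a null, and that the JSON writer skips null fields while it keeps empty strings. The plain-Python sketch below is only an analogue of that behaviour and does not call Spark: to_json_skip_nulls and fill_nulls are hypothetical helpers. In Spark the corresponding step would be to replace nulls in the column with "" before calling to_json, for example with the DataFrame's na.fill.

```python
import json


def to_json_skip_nulls(record):
    """Hypothetical helper: serialize a dict, omitting fields whose value is None."""
    return json.dumps({k: v for k, v in record.items() if v is not None})


def fill_nulls(record, columns, replacement=""):
    """Return a copy of record with None replaced in the given columns."""
    return {
        k: (replacement if k in columns and v is None else v)
        for k, v in record.items()
    }


# Third sample row from the question, with the empty country modelled as None.
row = {"Country": None, "plan": "postpaid", "value": 400}

dropped = to_json_skip_nulls(row)                        # key is omitted
kept = to_json_skip_nulls(fill_nulls(row, ["Country"]))  # key kept as ""

print(dropped)
print(kept)
```

If the column actually holds empty strings rather than nulls, this explanation does not apply and the cause lies elsewhere.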

Does Spark know the partitioning key of a DataFrame?

I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
I am running Spark 2.0.1 with a local SparkSession. I have a CSV dataset that I am saving as a parquet file on my disk like so:
val df0 = spark
.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("inferSchema", false)
.load("SomeFile.csv")
val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)
df.write
.mode(SaveMode.Overwrite)
.format("parquet")
.option("inferSchema", false)
.save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte. This should group multiple numerocarte values into the same partition. I don't want to do partitionBy("numerocarte") at write time because I don't want one partition per card; there would be millions of them.
After that in another script I read this SomeFile.parquet parquet file and do some operations on it. In particular I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = spark.read
.format("parquet")
.option("header", true)
.option("inferSchema", false)
.load("SomeFile.parquet")
val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))
df2.withColumn("NewColumnName",
sum(col("dollars")).over(w))
After the read I can see that the repartition worked as expected: DataFrame df2 has 42 partitions, and each of them holds different cards.
Questions:
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
If it knows, then there will be no shuffle in the window function. True?
If it does not know, It will do a shuffle in the window function. True?
If it does not know, how do I tell Spark the data is already partitioned by the right column?
How can I check a partitioning key of DataFrame? Is there a command for this? I know how to check number of partitions but how to see partitioning key?
When I print number of partitions in a file after each step, I have 42 partitions after read and 200 partitions after withColumn which suggests that Spark repartitioned my DataFrame.
If I have two different tables repartitioned with the same column, would the join use that information?
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you save data which has been shuffled does not mean that it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you have loaded the data, but you can check queryExecution for the Partitioner.
In practice:
If you want to support efficient pushdowns on the key, use the partitionBy method of DataFrameWriter.
If you want limited support for join optimizations, use bucketBy with a metastore and persistent tables.
See How to define partitioning of DataFrame? for detailed examples.
I am answering my own question to record, for future reference, what worked.
Following the suggestion of #user8371915, bucketBy works!
I am saving my DataFrame df:
df.write
.bucketBy(250, "userid")
.saveAsTable("myNewTable")
Then when I need to load this table:
val df2 = spark.sql("SELECT * FROM myNewTable")
val w = Window.partitionBy("userid")
val df3 = df2.withColumn("newColumnName", sum(col("someColumn")).over(w))
df3.explain
I confirm that when I do window functions on df2 partitioned by userid there is no shuffle! Thanks #user8371915!
Some things I learned while investigating it
myNewTable looks like a normal parquet file but it is not. You could read it normally with spark.read.format("parquet").load("path/to/myNewTable") but the DataFrame created this way will not keep the original partitioning! You must use spark.sql select to get correctly partitioned DataFrame.
You can look inside the table with spark.sql("describe formatted myNewTable").collect.foreach(println). This will tell you what columns were used for bucketing and how many buckets there are.
Window functions and joins that take advantage of partitioning often also require a sort. You can sort the data in your buckets at write time using .sortBy(), and the sort will also be preserved in the Hive table: df.write.bucketBy(250, "userid").sortBy("somColumnName").saveAsTable("myNewTable")
When working in local mode, the table myNewTable is saved to a spark-warehouse folder in my local Scala SBT project. When saving in cluster mode with Mesos via spark-submit, it is saved to the Hive warehouse. For me it was located in /user/hive/warehouse.
When doing spark-submit you need to add two options to your SparkSession: .config("hive.metastore.uris", "thrift://addres-to-your-master:9083") and .enableHiveSupport(). Otherwise the Hive tables you created will not be visible.
If you want to save your table to a specific database, run spark.sql("USE your database") before bucketing.
Update 05-02-2018
I encountered some problems with Spark bucketing and the creation of Hive tables. Please refer to the question, replies and comments in Why is Spark saveAsTable with bucketBy creating thousands of files?

spark driver memory consumption

I have an application which submits multiple Spark applications in parallel in client mode. What guidelines should I follow when writing the Spark application so that driver memory does not overflow?
The operations I am doing in Spark are as follows:
val df1 : read data from file to dataframe
val df2 : do sum(col4) sum(col5) on df1
val df3 : sort df2 on col2
val df4 : df3.limit(threshold)
val result : fill in Null spaces with literals
save(write) result into file
I am creating multiple data frames here. Are all of those data frames brought back to the driver program? Would doing multiple operations in one step make it more memory efficient on the client (driver)?
I read https://spark.apache.org/docs/latest/programming-guide.html but did not get answer to my question.

How to create a spark dataframe after reading data directly from kafka queue

The data from the Kafka queue would be a line-delimited JSON string like below:
{"header":{"platform":"atm","msgtype":"1","version":"1.0"},"details":[{"bcc":"5814","dsrc":"A","aid":"5678"},{"bcc":"5814","dsrc":"A","mid":"0003"},{"bcc":"5812","dsrc":"A","mid":"0006"}]}
{"header":{"platform":"atm","msgtype":"1","version":"1.0"},"details":[{"bcc":"5814","dsrc":"A","aid":"1234"},{"bcc":"5814","dsrc":"A","mid":"0004"},{"bcc":"5812","dsrc":"A","mid":"0009"}]}
{"header":{"platform":"atm","msgtype":"1","version":"1.0"},"details":[{"bcc":"5814","dsrc":"A","aid":"1234"},{"bcc":"5814","dsrc":"A","mid":"0004"},{"bcc":"5812","dsrc":"A","mid":"0009"}]}
How can we create a dataframe in Python for the above input? I have many columns to access; the above is only a sample, and the data would have 23 columns in total. Any help on this would be greatly appreciated.
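You're looking for pyspark.sql.SQLContext.jsonRDD (deprecated since Spark 1.4 in favor of read.json, which also accepts an RDD of JSON strings). Since Spark Streaming is micro-batched, your stream object will yield a series of RDDs, each of which can be turned into a DataFrame this way.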
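As a rough illustration of the row shape such a DataFrame could have, the sketch below parses line-delimited JSON like the sample above in plain Python (no Spark involved) and emits one flat row per entry in details, with the header fields repeated on each row. The function name flatten_message and the flattening choice are assumptions made for illustration; the question does not say how the 23 columns should be laid out.

```python
import json

# Two sample lines shaped like the messages in the question (shortened).
SAMPLE_LINES = [
    '{"header":{"platform":"atm","msgtype":"1","version":"1.0"},'
    '"details":[{"bcc":"5814","dsrc":"A","aid":"5678"},'
    '{"bcc":"5814","dsrc":"A","mid":"0003"}]}',
    '{"header":{"platform":"atm","msgtype":"1","version":"1.0"},'
    '"details":[{"bcc":"5812","dsrc":"A","mid":"0009"}]}',
]


def flatten_message(line):
    """Hypothetical helper: return one flat dict per entry of "details"."""
    message = json.loads(line)
    header = message.get("header", {})
    rows = []
    for detail in message.get("details", []):
        row = dict(header)   # copy the header fields onto every row
        row.update(detail)   # then add the fields of this detail entry
        rows.append(row)
    return rows


rows = [row for line in SAMPLE_LINES for row in flatten_message(line)]
for row in rows:
    print(row)
```

Rows like these have differing keys (aid on some, mid on others), so a DataFrame built from them would need nullable columns for the fields that are missing on a given row.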
