I'm using Spark Structured Streaming in Databricks, where I use the foreach operation to perform some work on every record of the data. The function I pass to foreach uses the SparkSession, and it throws an error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.
So, is there any way to use the SparkSession inside foreach?
EDIT #1:
One example of function passed in foreach would be something like:
def process_data(row):
    df = spark.createDataFrame([row])
    df.write.mode("overwrite").saveAsTable("T2")
    spark.sql("""
        MERGE INTO T1
        USING T2 ON T2.type = "OrderReplace" AND T1.ReferenceNumber = T2.originalReferenceNumber
        WHEN MATCHED THEN
          UPDATE SET
            shares = T2.shares,
            price = T2.price,
            ReferenceNumber = T2.newReferenceNumber
    """)
So I need the SparkSession here, but it is not available inside foreach.
Given the tasks you want to perform row by row on your streaming dataset, you can try using foreachBatch.
A streaming dataset may have thousands to millions of records, and hitting external systems for each record is a massive overhead. From your example, it seems you are updating a base dataset T1 based on events staged in T2.
With foreachBatch, you can do something like the following:
def process_data(df, epoch_id):
    df.write.mode("overwrite").saveAsTable("T2")
    spark.sql("""
        MERGE INTO T1
        USING T2 ON T2.type = "OrderReplace" AND T1.ReferenceNumber = T2.originalReferenceNumber
        WHEN MATCHED THEN
          UPDATE SET
            shares = T2.shares,
            price = T2.price,
            ReferenceNumber = T2.newReferenceNumber
    """)

streamingDF.writeStream.foreachBatch(process_data).start()
Some additional points regarding what you are doing:
You are trying to do a foreach on a streaming dataset while overwriting the T2 Hive table. Since records are processed in parallel under cluster managers like YARN, T2 may end up corrupted, because multiple tasks may be updating it at once.
If you are persisting records in T2 (even temporarily), why not make the merge/update logic a separate batch process? That would be similar to the foreachBatch solution I wrote above.
Updating or creating Hive tables ultimately fires a MapReduce job, which requires resource negotiation and so on, and may be painfully slow if done for each record; you may want to consider a different type of destination, such as a JDBC-based one.
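To make the per-batch MERGE concrete, here is a plain-Python sketch (not Spark code) of the update that the statement above performs on each micro-batch. Table and column names follow the question; the dict-based representation is purely illustrative.

```python
def merge_batch(t1, batch):
    """Mimic MERGE INTO T1 for OrderReplace events.
    t1 is a dict keyed by ReferenceNumber; batch is a list of event dicts."""
    for event in batch:
        if event["type"] != "OrderReplace":
            continue
        row = t1.get(event["originalReferenceNumber"])
        if row is not None:  # WHEN MATCHED
            row["shares"] = event["shares"]
            row["price"] = event["price"]
            # updating the ReferenceNumber column re-keys the row here
            del t1[event["originalReferenceNumber"]]
            t1[event["newReferenceNumber"]] = row
    return t1

t1 = {100: {"shares": 10, "price": 5.0}}
batch = [{"type": "OrderReplace", "originalReferenceNumber": 100,
          "newReferenceNumber": 200, "shares": 7, "price": 6.5}]
merge_batch(t1, batch)
```

With foreachBatch this logic runs once per micro-batch on the driver, rather than once per record.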
Related
In my code we use createOrReplaceTempView a lot so that we can run SQL on the generated view. This is done at multiple stages of the transformation, and it also helps us keep the code in modules, each performing a particular operation. Sample code to put my question in context is shown below. My questions are:
What is the performance penalty, if any, of creating the temp view from the Dataset?
When I create more than one per transformation, does this increase memory usage?
What is the life cycle of those views, and is there a function call to remove them?
val dfOne = spark.read.option("header", true).csv("/apps/cortex/landing/auth/cof_auth.csv")
dfOne.createOrReplaceTempView("dfOne")
val dfTwo = spark.sql("select * from dfOne where column_one=1234567890")
dfTwo.createOrReplaceTempView("dfTwo")
val dfThree = spark.sql("select column_two, count(*) as count_two from dfTwo group by column_two")
dfThree.createOrReplaceTempView("dfThree")
No.
From the manuals on
Running SQL Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
In order to do this you register the DataFrame as a SQL temporary view. This is a "lazy" artefact, and there must already be a DataFrame/Dataset present; it just needs registering to enable the SQL interface. The views are scoped to the SparkSession that created them, and can be removed explicitly with spark.catalog.dropTempView("viewName").
createOrReplaceTempView takes some time to process.
A UDF also takes time, since it has to be registered with the Spark application, and it can cause performance issues.
From my experience, it is best to look for a built-in function first; after that, I would rank the other two at about the same speed (UDF == SQL temp table/view), though the table/view approach does not cover every possibility.
Actually I am working on a project that includes a workflow consisting of multiple tasks, where a single task consists of multiple components.
For example:
In a join, we need 4 components: 2 for input (consider a two-table join), 1 for the join logic, and 1 for output (writing back to HDFS).
This is one task; similarly, "sort" can be another task.
Suppose a workflow with these two tasks, linked so that after performing the join, we use the output of the join in our sorting task.
But "join" and "sort" invoke separate Spark sessions.
So the flow is: one Spark session is created for the join via spark-submit, the output is saved to HDFS, and the current Spark session is closed. For the sort, another session is created via spark-submit, and the output stored in HDFS by the join task is fetched for sorting.
But the problem is that there is an overhead in fetching the data from HDFS.
So is there any way I can share a session between the two spark-submits, so that I do not lose the result DataFrame from the join and it can be used directly in my next spark-submit for sorting?
So basically I have multiple spark-submits associated with different tasks, but I want to retain the result DataFrame in memory so that I need not persist it and it can be used in another linked task (spark-submit).
The spark session builder has a function to get or create the SparkSession.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("AzuleneTest")
  .getOrCreate()
But reading your description, you may be better off passing DataFrames between the functions (which hold a reference to a Spark session). This avoids creating multiple instances of Spark. For example:
//This is run within a single Spark application
val df1 = spark.read.parquet("s3://bucket/1/")
val df2 = spark.read.parquet("s3://bucket/2/")
val df3 = df1.join(df2, "id")
df3.write.parquet("s3://bucket/3/")
I have a problem with how to use Spark to manipulate/iterate/scan multiple Cassandra tables. Our project uses Spark and the spark-cassandra-connector to scan multiple tables, trying to match related values across the tables; if a match is found, it takes an extra action such as inserting into a table. The use case is like below:
sc.cassandraTable(KEYSPACE, "table1").foreach(row => {
  val company_url = row.getString("company_url")
  sc.cassandraTable(KEYSPACE, "table2").foreach(row2 => {
    val url = row2.getString("url")
    val value = row2.getString("value")
    if (company_url == url) {
      sc.saveToCassandra(KEYSPACE, "target", SomeColumns(url, value))
    }
  })
})
The problems are:
Since a Spark RDD is not serializable, the nested lookup will fail because sc.cassandraTable returns an RDD. The only workaround I know is to use sc.broadcast(sometable.collect()), but if the table is huge, the collect will consume all the memory; and if several tables are broadcast in this use case, it will drain the memory.
Instead of broadcast, can RDD.persist handle the case? In my case, I would use sc.cassandraTable to read all the tables as RDDs, persist them to disk, then retrieve the data for processing. If that works, how can I guarantee the RDD is read in chunks?
Other than Spark, is there any other tool (like Hadoop, etc.) that can handle this case gracefully?
It looks like you are actually trying to do a series of inner joins. See the
joinWithCassandraTable method
This allows you to use the elements of one RDD to do a direct query on a Cassandra table. Depending on the fraction of the data you are reading from Cassandra, this may be your best bet. If the fraction is too large, though, you are better off reading the two tables separately and then using the RDD.join method to line up rows.
If all else fails, you can always manually use the CassandraConnector object to access the Java driver directly and make raw requests with it from a distributed context.
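For intuition, here is a plain-Python stand-in for the difference being drawn above, with a dict playing the role of the Cassandra table and a list playing the role of the source RDD: joinWithCassandraTable issues one point query per source element instead of scanning the whole table. All names here are illustrative.

```python
# table2 stands in for a Cassandra table indexed by its partition key (url)
table2 = {"a.com": "v1", "b.com": "v2"}

def join_with_table(rdd_elems, lookup):
    """Per-element point lookups, like joinWithCassandraTable:
    one targeted query per key, never a full scan of the other table."""
    out = []
    for company_url in rdd_elems:
        value = lookup.get(company_url)  # a single-partition query in C*
        if value is not None:            # inner-join semantics: keep matches only
            out.append((company_url, value))
    return out

pairs = join_with_table(["a.com", "c.com"], table2)
```

When the source RDD covers most of the table's keys, the per-key queries stop paying off, which is why the answer suggests two full reads plus RDD.join in that case.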
Here is the sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. What I wrote is code like below:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
  val (timestampStart, timestampEnd) = timeRange
  val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
  // ...do the aggregation work...
}
The issue with this code is that the aggregation work for each time range does not run in parallel. My question is: how can I parallelize the aggregation work, given that an RDD can't be used inside another RDD or a Future? Is there any way to parallelize the work, or can we not use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the Java driver to execute a single query for every partition required by the source RDD, so no unneeded data will be requested or serialized. This means a join between any RDD and a Cassandra table can be performed without doing a full table scan. When performed between two Cassandra tables which share the same partition key, this will not require movement of data between machines. In all cases this method will use the source RDD's partitioning and placement for data locality.
Finally, we used union to combine the per-range RDDs, which makes the work run in parallel.
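The union approach in the last line can be pictured like this; a plain-Python sketch where lists stand in for the per-range RDDs. In Spark you would call sc.union(rdds) on the per-range RDDs and then run a single aggregation (e.g. reduceByKey), so all ranges are scheduled as one job instead of one job per range.

```python
from functools import reduce

# each inner list stands in for the RDD produced by one time-range query
per_range = [[("k1", 1), ("k2", 2)], [("k1", 3)], [("k2", 4)]]

# instead of aggregating each range inside foreach (serially),
# union them into one collection and aggregate once
unioned = reduce(lambda a, b: a + b, per_range)

totals = {}
for key, v in unioned:
    totals[key] = totals.get(key, 0) + v
```

The single pass over the unioned data is the stand-in for the one Spark job that runs all ranges in parallel.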
We are trying to implement a use case using Spark Streaming and Spark SQL that allows us to run user-defined rules against some data (see below for how the data is captured and used). The idea is to use SQL to specify the rules and return the results as alerts to the users. Executing the query for each incoming event batch seems to be very slow. I would appreciate it if anyone could suggest a better approach to implementing this use case. Also, I would like to know whether Spark executes the SQL on the driver or on the workers. Thanks in advance. Given below are the steps we perform in order to achieve this:
1) Load the initial dataset from an external database as a JDBCRDD
JDBCRDD<SomeState> initialRDD = JDBCRDD.create(...);
2) Create an incoming DStream (that captures updates to the initialized data)
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createStream(ssc, flumeAgentHost, flumeAgentPort);
JavaDStream<SomeState> incomingDStream = flumeStream.map(...);
3) Create a Pair DStream using the incoming DStream
JavaPairDStream<Object,SomeState> pairDStream =
    incomingDStream.mapToPair(...);
4) Create a Stateful DStream from the pair DStream using the initialized RDD as the base state
JavaPairDStream<Object,SomeState> statefulDStream = pairDStream.updateStateByKey(...);
JavaRDD<SomeState> updatedStateRDD = statefulDStream.map(...);
5) Run a user-driven query against the updated state based on the values in the incoming stream
incomingStream.foreachRDD(new Function<JavaRDD<SomeState>, Void>() {
    @Override
    public Void call(JavaRDD<SomeState> events) throws Exception {
        updatedStateRDD.count();
        SQLContext sqx = new SQLContext(events.context());
        DataFrame schemaDf = sqx.createDataFrame(updatedStateRDD, SomeState.class);
        schemaDf.registerTempTable("TEMP_TABLE");
        sqx.sql("SELECT col1 FROM TEMP_TABLE WHERE <condition1> AND <condition2> ...");
        //collect the results, process them and send alerts
        ...
        return null;
    }
});
The first step should be to identify which step is taking most of the time.
Please see the Spark master UI and identify which step/phase is taking most of the time.
There are a few best practices, plus my observations, which you can consider:
Use a singleton SQLContext - see the example at https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
updateStateByKey can be a memory-intensive operation in the case of a large number of keys. You need to check the size of the data processed by the updateStateByKey function and also whether it fits well in the given memory.
How is your GC behaving?
Are you really using "initialRDD"? If not, then do not load it. If it is a static dataset, then cache it.
Check the time taken by your SQL query too.
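The singleton-SQLContext advice in the first point boils down to a lazily initialized, process-wide instance. The linked example is Scala; here is a plain-Python sketch of the same pattern, where factory stands in for new SQLContext(sparkContext) and is only called on the first request.

```python
class SQLContextSingleton:
    """Lazily create one shared instance per process, mirroring the
    getOrCreate pattern in the linked SqlNetworkWordCount example."""
    _instance = None

    @classmethod
    def get_instance(cls, factory):
        # only build the context the first time; reuse it afterwards
        if cls._instance is None:
            cls._instance = factory()
        return cls._instance

# object stands in for the (expensive) SQLContext constructor
first = SQLContextSingleton.get_instance(object)
second = SQLContextSingleton.get_instance(object)
```

This avoids constructing a fresh context inside every foreachRDD call, which is one of the costs in the code above.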
Here are a few more questions/areas which can help you:
What is the StorageLevel of the DStreams?
Size and configuration of the cluster?
Version of Spark?
Lastly, foreachRDD is an output operation which executes the given function on the driver, but the RDD might have actions, and those actions are executed on the worker nodes.
You may want to read this for a better explanation of output operations: http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
I too am facing the same issue; could you please let me know if you have found a solution? I have described the detailed use case in the post below.
Spark SQL + Window + Streaming issue - Spark SQL query is taking long to execute when running with Spark Streaming