We are working on a scenario where we fetch some data through an API that will then be used in Spark SQL queries. I want to make that API call once and store the result as a broadcast variable. Will the API call take place only once, in the driver, with the output broadcast from there, or will it take place on each worker, which would make broadcasting pointless?
This is what I'm trying to do:
api_call() is a Python function that performs the API call and returns a Python list.
op = api_call()
values = spark.sparkContext.broadcast(op)
In Spark, we have the mapPartitions function, which can be used to do some initialization for a group of entries, like a DB operation.
Now I want to do the same thing in Flink. After some research I found that I can use a RichMapFunction for this, but it has the drawback that the initialization can only be done in the open method, which runs once at the start of the streaming job. I will explain my use case, which should clarify the situation.
Example: I am getting data for millions of users from Kafka, but I only want the data of some users to be persisted in the end. This list of users is dynamic and lives in a DB. I want to look up the current users every 10 minutes, so that I filter out and store the data for only those users. In Spark (mapPartitions) the user lookup would run for every group, and there I had configured it to refresh the users from the DB every 10 minutes. But with Flink's RichMapFunction I can do that only in the open method, when my job starts.
How can I do this in Flink?
It seems that what you want to do is a stream-table join. There are multiple ways of doing that, but it seems the easiest one would be to use the broadcast state pattern here.
The idea is to define a custom source that periodically queries the data from the SQL table (or, even better, uses CDC), broadcast that table stream as broadcast state, and connect it with the actual user stream.
Inside the process function for the connected streams you will have access to the broadcast table data, so you can perform a lookup for every user you receive and decide what to do with it.
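What follows is only a sketch of that pattern in Scala; the UserEvent case class, the AllowedUsersSource polling source (built on the legacy SourceFunction interface for brevity), the 10-minute interval, and the stand-in source/sink are assumptions for illustration.

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class UserEvent(userId: String, payload: String)

// Re-reads the list of users to keep from the DB every 10 minutes and emits each user id.
class AllowedUsersSource extends SourceFunction[String] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (running) {
      loadAllowedUsers().foreach(ctx.collect)
      Thread.sleep(10 * 60 * 1000)
    }
  }

  override def cancel(): Unit = running = false

  // Stand-in: replace with the JDBC query against the user table.
  private def loadAllowedUsers(): Seq[String] = Seq("42")
}

object FilterUsersJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val usersDescriptor = new MapStateDescriptor[String, java.lang.Boolean](
      "allowedUsers", classOf[String], classOf[java.lang.Boolean])

    // Stand-in for the Kafka source of user events.
    val events: DataStream[UserEvent] = env.fromElements(UserEvent("42", "hello"))

    // The user table becomes a broadcast stream backed by broadcast state.
    val allowedUsers = env.addSource(new AllowedUsersSource).broadcast(usersDescriptor)

    val filtered = events
      .connect(allowedUsers)
      .process(new BroadcastProcessFunction[UserEvent, String, UserEvent] {
        override def processElement(
            event: UserEvent,
            ctx: BroadcastProcessFunction[UserEvent, String, UserEvent]#ReadOnlyContext,
            out: Collector[UserEvent]): Unit =
          // Keep the event only if its user is currently present in the broadcast state.
          if (ctx.getBroadcastState(usersDescriptor).contains(event.userId)) out.collect(event)

        override def processBroadcastElement(
            userId: String,
            ctx: BroadcastProcessFunction[UserEvent, String, UserEvent]#Context,
            out: Collector[UserEvent]): Unit =
          ctx.getBroadcastState(usersDescriptor).put(userId, java.lang.Boolean.TRUE)
      })

    filtered.print() // replace with the sink that persists the filtered events
    env.execute("filter-users")
  }
}

Note that this sketch only ever adds users to the broadcast state; if users can also be removed from the table, you would broadcast full snapshots (or CDC deletes) and clear stale entries in processBroadcastElement.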
We need to call an external RESTful service to update a column value in a Dataset. We are using a UDF to make the RESTful service calls, which is very slow.
dataset.withColumn("upper", upperUDF('call restful service'))
It's a synchronous call, and it took ~1 hour and 10 minutes for 25,000 accounts (each account issues one call).
How to make it faster?
I'd recommend converting the Dataset to an RDD using Dataset.rdd and then RDD.foreachPartition.
val names = Seq("hello", "world").toDF("name")
scala> names.show
+-----+
| name|
+-----+
|hello|
|world|
+-----+
scala> names.rdd.foreachPartition(p => p.map(n => "call restful service for " + n).foreach(println))
call restful service for [hello]
call restful service for [world]
You could then think of using a local cache for repeated entries to avoid time-expensive restful service calls.
From the comments:
how does this improve the performance?
RDD.foreachPartition gives you access to all elements of a partition as an iterator (lazy and memory-friendly), so you can avoid repeated external calls by using a local cache (per partition, or per executor, so that all partitions/tasks executed on that executor can share the cache).
The number of partitions can be changed to avoid making too many parallel external calls (and effectively DDoSing the service). Use the RDD.repartition or RDD.coalesce operators. You can also control the number of partitions through the data source you read the dataset from.
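Here is a minimal sketch of that idea against the names DataFrame above; callRestService is a hypothetical stand-in for the real HTTP call, and the partition count of 8 is only illustrative.

import scala.collection.mutable

// Stand-in for the real HTTP client call.
def callRestService(name: String): String = name.toUpperCase

names.rdd
  .repartition(8) // caps how many partitions hit the service in parallel
  .foreachPartition { rows =>
    // One cache per partition: repeated names only hit the service once.
    val cache = mutable.Map.empty[String, String]
    rows.foreach { row =>
      val name = row.getString(0)
      val response = cache.getOrElseUpdate(name, callRestService(name))
      // use `response` here, e.g. write it to a store
    }
  }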
How do we update the corresponding column after we get the response back from the API?
Since you left the Dataset API and want to use the RDD API for the external calls, the question is how to go back from RDDs to Datasets. That is as simple as RDD.toDF(comma-separated column names). The columns have to match the RDD's representation, which is determined by the case class used in the RDD.
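A sketch of that round trip, again with the hypothetical callRestService stand-in; the case class UpperResult and its column names are assumptions.

import spark.implicits._ // `spark` is the SparkSession (already in scope in spark-shell)

case class UpperResult(name: String, upper: String)

// Stand-in for the real HTTP client call.
def callRestService(name: String): String = name.toUpperCase

val withUpper = names.rdd
  .mapPartitions { rows =>
    rows.map { row =>
      val name = row.getString(0)
      UpperResult(name, callRestService(name)) // add a per-partition cache here if names repeat
    }
  }
  .toDF("name", "upper")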
We are using Hazelcast in our project. I would like to know whether the calculations we do run on the Hazelcast nodes or on the client itself.
For example:
IMap map = client.getMap("data");
map.values().stream()....
In this example, does it fetch the whole map from the Hazelcast cluster, or is it just a reference, so that only when I call a terminal operation on the stream does it go to the Hazelcast nodes, perform the operation there, and return the data?
map.values() will fetch all the values from all the nodes to the client. Please review Fast Aggregations if you need to do the calculations on the member side.
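For illustration, a minimal sketch of a member-side aggregation using the Java client API from Scala; the map name "data" comes from the question, and the String value type is an assumption.

import com.hazelcast.aggregation.Aggregators
import com.hazelcast.client.HazelcastClient

val client = HazelcastClient.newHazelcastClient()
val data = client.getMap[String, String]("data")

// data.values() would pull every value over the network to the client before streaming.
// An aggregation instead runs on the members that own the entries and ships only the result back.
val entryCount: java.lang.Long = data.aggregate(Aggregators.count[String, String]())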
Let me first inform all of you that I am very new to Spark.
I need to process a huge number of records in a table, and when they are grouped by email there are around 1 million groups. For each individual email I need to perform multiple logical calculations on its data set and update the database based on the result.
Roughly, my code structure is like this:
//Initial Data Load ...
import sparkSession.implicits._
var tableData = sparkSession.read.jdbc(<JDBC_URL>, <TABLE NAME>, connectionProperties).select("email").where(<CUSTOM CONDITION>)
//Data Frame with Records with grouping on email count greater than one
var recordsGroupedBy = tableData.groupBy("email").count().withColumnRenamed("count", "recordcount").filter("recordcount > 1").toDF()
//Now comes the processing after grouping against email using processDataAgainstEmail() method
recordsGroupedBy.collect().foreach(x => processDataAgainstEmail(x.getAs("email"), sparkSession))
Here I see that foreach is not executed in parallel. I need to invoke the method processDataAgainstEmail(...) in parallel.
But if I try to parallelize it myself, I can get a list by invoking
val emailList = dataFrameWithGroupedByMultipleRecords.select("email").rdd.map(r => r(0).asInstanceOf[String]).collect().toList
var rdd = sc.parallelize(emailList)
rdd.foreach(x => processDataAgainstEmail(x, sparkSession))
This is not supported, as I cannot pass the sparkSession when using parallelize (it cannot be used inside tasks running on the executors).
Can anybody help me with this? Inside processDataAgainstEmail(...) multiple operations are performed: database inserts and updates, as well as Spark DataFrame and Spark SQL operations.
To summarize, I need to invoke processDataAgainstEmail(...) in parallel, with a sparkSession.
If it is not at all possible to pass Spark sessions, the method won't be able to perform anything on the database. I am not sure what the alternative would be, as parallelism over emails is a must for my scenario.
foreach on the collected list is a method that operates on each element of the list sequentially, so you are acting on one element at a time and passing it to the processDataAgainstEmail method.
Once you have the resulting list, you invoke sc.parallelize to parallelize the creation of a dataframe/RDD from the list of records you created or manipulated in the previous step. The parallelization, as far as I can see in PySpark, is a property of creating the dataframe, not of acting on the result of any operation.
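To make that distinction concrete, a small sketch using the names from the question (the println stands in for whatever per-email work can run without a SparkSession):

// 1) Plain Scala List.foreach: runs sequentially in the driver JVM, one email at a time,
//    which is why sparkSession can be passed in but nothing runs in parallel.
emailList.foreach(email => processDataAgainstEmail(email, sparkSession))

// 2) sc.parallelize(emailList) creates an RDD; RDD.foreach runs inside tasks on the executors,
//    which is where the parallelism comes from, but a SparkSession cannot be used there.
sc.parallelize(emailList).foreach(email => println(s"processing $email on an executor"))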
Is there a way to write every row of my Spark dataframe as a new item in a DynamoDB table? (in PySpark)
I used this code with the boto3 library, but I wonder if there's another way, avoiding the pandas and for-loop steps:
sparkDF_dict = sparkDF.toPandas().to_dict('records')
for item in sparkDF_dict:
    table.put_item(Item=item)
DynamoDB offers a BatchWriteItem API. It is available in boto3, so you could call it after creating slices of sparkDF_dict that are 25 elements long. Note that the BatchWriteItem API only supports writing 25 items at a time, and not all writes may succeed at first (they may get throttled on the service side and come back to you in the UnprocessedItems part of the response). Your application will need to look at UnprocessedItems in the response and retry as needed.
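For illustration, here is a sketch of that slice-and-retry pattern written against the AWS SDK for Java Document API from Scala; the table name "my-table", the use of foreachPartition, and the bare retry loop (no backoff) are assumptions, and the same shape applies to boto3's batch_write_item.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.document.{DynamoDB, Item, TableWriteItems}
import scala.collection.JavaConverters._

sparkDF.rdd.foreachPartition { rows =>
  val dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.standard().build())

  rows.grouped(25).foreach { batch => // BatchWriteItem accepts at most 25 items per call
    val items = batch.map(row => Item.fromMap(row.getValuesMap[AnyRef](row.schema.fieldNames).asJava))
    var outcome = dynamo.batchWriteItem(new TableWriteItems("my-table").withItemsToPut(items: _*))

    // Retry whatever the service did not process (e.g. because of throttling);
    // production code should add exponential backoff here.
    while (!outcome.getUnprocessedItems.isEmpty) {
      outcome = dynamo.batchWriteItemUnprocessed(outcome.getUnprocessedItems)
    }
  }
}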