When I execute a query using JDBC against a SQL database, I get a ResultSet object back. Say I want to iterate through each row returned in the ResultSet and then do operations on each one: would that initial iteration over the ResultSet be handled by the driver or by the executors?
For example, say I have a service where I want to handle many WordCount jobs in large batches. Perhaps I have a DB with the following schema:
JobId: int
Input: string(hdfs location)
Output: string (hdfs path)
Status: (not started, in progress, complete, etc.)
Every time my Spark application runs, I want to use JDBC to read from the DB and get every row where the status is "not started". This comes back as a ResultSet, and each row is essentially the parameters for one WordCount run. When I iterate over the ResultSet, does it get split up so that the executors each iterate over a small chunk? Or does the driver handle iterating over each row? If the former, what happens once I start loading DataFrames for the given input locations and running the transformations and actions needed to get the word counts? Would the executors further split those DataFrames up among other executors for processing?
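To make the setup concrete, here is a rough sketch of the pattern I have in mind (the JDBC URL, credentials, and table/column names are placeholders, not real ones):

import java.sql.DriverManager

// Rough sketch only: read the pending jobs over plain JDBC...
val conn = DriverManager.getConnection("jdbc:mysql://dbhost/jobs", "user", "pass")
val rs = conn.createStatement().executeQuery(
  "SELECT JobId, Input, Output FROM jobs WHERE Status = 'not started'")

// ...then, for each row, run a word count on the HDFS input and write the result.
while (rs.next()) {
  val input  = rs.getString("Input")
  val output = rs.getString("Output")
  sc.textFile(input)
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .saveAsTextFile(output)
}
conn.close()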
Sorry if this question is unclear; I'm still learning Spark and having trouble wrapping my mind around some of this. Also, would this generally be considered a good approach to handling multiple requests in one large batch, or is there a better way to go about it?
Thanks!
Hi, I am using Spark MLlib and doing an approxSimilarityJoin between a 1M-row dataset and a 1k-row dataset.
When I do it, I broadcast the 1k one.
What I see is that the job stops making progress at the second-to-last task.
All the executors are dead except one, which keeps running for a very long time until it hits an out-of-memory error.
I checked Ganglia and it shows memory rising until it reaches the limit,
and the disk space keeps going down until it runs out.
The action I called is a write, but it does the same with count.
Now I wonder: is it possible that all the partitions in the cluster converge onto a single node, creating this bottleneck?
Here is my code snippet:
var dfW = cookesWb.withColumn("n", monotonically_increasing_id())
var bunchDf = dfW.filter(col("n").geq(0) && col("n").lt(1000000))
bunchDf.repartition(3000)
model
  .approxSimilarityJoin(bunchDf, broadcast(cookesNextLimited), 80, "EuclideanDistance")
  .withColumn("min_distance",
    min(col("EuclideanDistance")).over(Window.partitionBy(col("datasetA.uid"))))
  .filter(col("EuclideanDistance") === col("min_distance"))
  .select(
    col("datasetA.uid").alias("weboId"),
    col("datasetB.nextploraId").alias("nextId"),
    col("EuclideanDistance"))
  .write.format("parquet").mode("overwrite").save("approxJoin.parquet")
I'll try to answer as best as I can.
In Spark there are operations called shuffle operations, and they do roughly what you suspected: after some computation they move all the rows that share a key onto the same node.
If you think about it, there is no other way for those operations to work, because rows that belong together have to end up on the same machine in the end.
Example for a join operation:
you have two partitions on two different nodes,
partition 1:
s, 1
partition 2:
s, k
and you want to join by s.
If you don't get both rows onto a single machine, it is impossible to tell that they need to be joined.
It is the same with count and reduce and many more operations.
You can read about shuffle operations or ask me if you want more clarification.
A possible solution for you is:
instead of only keeping the data in memory, you can use something like:
dfW.persist(StorageLevel.MEMORY_AND_DISK_SER)
There are other options for persist, but what this one does, basically, is store the partitions not only in memory but on disk as well, in serialized form to save space.
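For completeness, a small sketch of that call with the import it needs (dfW is the DataFrame from your snippet):

import org.apache.spark.storage.StorageLevel

// Keep partitions in memory where they fit and spill the rest to disk, serialized to save space.
dfW.persist(StorageLevel.MEMORY_AND_DISK_SER)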
I have a need to broadcast some related content from an RDD to all worker nodes, and I am trying to do it more efficiently.
More specifically, an RDD is created dynamically in the middle of the execution. To broadcast some of its content to all the worker nodes, an obvious solution would be to traverse its elements one by one, build a list/vector/hashmap holding the needed content while traversing, and then broadcast this data structure to the cluster.
This does not seem like a good solution at all, since the RDD can be huge and is already distributed; traversing it and building some array/list from the traversal result will be very slow.
So what would be a better solution, or best practice, for this case? Would it be a good idea to run a SQL query on the RDD (after converting it to a DataFrame) to get the needed content, and then broadcast the query result to all the worker nodes?
thank you for your help in advance!
The following is added after reading Varslavans' answer:
An RDD is created dynamically and it has the following content:
[(1,1), (2,5), (3,5), (4,7), (5,1), (6,3), (7,2), (8,2), (9,3), (10,3), (11,3), ...... ]
So this RDD contains key-value pairs. What we want is to collect all the pairs whose value is > 3, so pairs (2,5), (3,5), (4,7), ... will be collected. Once we have collected all these pairs, we would like to broadcast them so that all the worker nodes have them.
It sounds like we should use collect() on the RDD and then broadcast... at least that is the best solution at this point; a minimal sketch is below.
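A minimal sketch of that approach, assuming the pairs are simple Int tuples as in the example above:

// Filter on the executors first, then collect only the matches and broadcast them.
val pairs = sc.parallelize(Seq((1, 1), (2, 5), (3, 5), (4, 7), (5, 1), (6, 3)))
val matches = pairs.filter { case (_, v) => v > 3 }.collect()  // small result comes back to the driver
val matchesBc = sc.broadcast(matches)                          // every worker can now read matchesBc.value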
Thanks again!
First of all, you don't need to traverse the RDD to get all the data; there is an API for that: collect().
Second: broadcast is not the same as distributed.
With broadcast, you have all the data on each node.
With distributed data, you have a different part of the whole on each node.
An RDD is distributed by its nature.
Third: to get the needed content you can either use the RDD API or convert it to a DataFrame and use SQL queries. It depends on the data you have. Either way, the result will be an RDD or a DataFrame, and it will also be distributed. So if you need the data locally, you collect() it.
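For instance, a hedged sketch of the DataFrame route for the pairs in your question (the column names are made up, and spark is the SparkSession available in the shell):

import spark.implicits._

// 'pairs' is assumed to be the RDD[(Int, Int)] from the question.
val df = pairs.toDF("key", "value")
val selected = df.filter($"value" > 3)            // still distributed at this point
val local = selected.collect()                    // pull the (small) result to the driver
val bc = spark.sparkContext.broadcast(local)      // now every worker can read bc.value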
By the way, from your question it's not possible to tell exactly what you want to do, and it looks like you need to read up on the Spark basics. That will answer many of your questions :)
I have a rather simple (but essential) question about Spark window functions (such as lead, lag, count, sum, row_number, etc.):
If I specify my window as Window.partitionBy(lit(0)) (i.e. I need to run the window function across the entire DataFrame), does the window aggregate function run in parallel, or are all records moved into one single task?
EDIT:
Especially for "rolling" operations (e.g. a rolling average using something like avg(...).over(Window.partitionBy(lit(0)).orderBy(...).rowsBetween(-10, 10))), this operation could very well be split up into different tasks even though all the data is in the same window partition, because only about 20 rows are needed at a time to compute the average.
If you define the window as Window.partitionBy(lit(0)), or if you don't define partitionBy at all, then all of the partitions of the DataFrame will be collected into one on a single executor, and that executor will perform the aggregating function over the whole DataFrame. So parallelism will not be preserved.
This collection is different from the collect() function: collect() pulls all the partitions into the driver node, whereas a window defined this way gathers the data onto a single executor.
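A small sketch of the two variants (the DataFrame df and its columns ts, value, and group are placeholders, not from the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Global window: every row gets the same constant key, so all rows end up in one task.
val global = Window.partitionBy(lit(0)).orderBy(col("ts")).rowsBetween(-10, 10)

// Keyed window: rows are shuffled by "group", so different groups run in parallel tasks.
val keyed = Window.partitionBy(col("group")).orderBy(col("ts")).rowsBetween(-10, 10)

val rolled = df
  .withColumn("rolling_avg_global", avg(col("value")).over(global))
  .withColumn("rolling_avg_keyed", avg(col("value")).over(keyed))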
I have a problem with how to use Spark to manipulate/iterate/scan multiple Cassandra tables. Our project uses Spark and the spark-cassandra-connector to connect to Cassandra and scan multiple tables, trying to match related values across different tables and, if they match, take an extra action such as inserting into another table. The use case is like the one below:
sc.cassandraTable(KEYSPACE, "table1").foreach(row => {
  val company_url = row.getString("company_url")
  sc.cassandraTable(KEYSPACE, "table2").foreach(row2 => {
    val url = row2.getString("url")
    val value = row2.getString("value")
    if (company_url == url) {
      sc.saveToCassandra(KEYSPACE, "target", SomeColumns(url, value))
    }
  })
})
The problems are:
Since a Spark RDD is not serializable, the nested search will fail, because sc.cassandraTable returns an RDD. The only workaround I know of is to use sc.broadcast(sometable.collect()). But if the table is huge, the collect will consume all the memory. Also, if several tables in the use case are broadcast, it will drain the memory.
Instead of broadcast, can RDD.persist handle the case? In my case, I would use sc.cassandraTable to read all the tables into RDDs, persist them back to disk, then retrieve the data for processing. If that works, how can I guarantee the RDD is read in chunks?
Other than Spark, is there any other tool (like Hadoop, etc.) that can handle this case gracefully?
It looks like you are actually trying to do a series of Inner Joins. See the
joinWithCassandraTable Method
This allows you to use the elements of one RDD to do a direct query on a Cassandra table. Depending on the fraction of the data you are reading from Cassandra, this may be your best bet. If the fraction is too large, though, you are better off reading the two tables separately and then using the RDD.join method to line up the rows.
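A hedged sketch of what that could look like for the tables in the question, assuming table2 is partitioned by url (the key layout is an assumption):

import com.datastax.spark.connector._

// Use each company_url from table1 as the lookup key against table2's partition key.
val keys = sc.cassandraTable(KEYSPACE, "table1")
  .select("company_url")
  .map(row => Tuple1(row.getString("company_url")))

keys.joinWithCassandraTable(KEYSPACE, "table2")
  .map { case (_, row2) => (row2.getString("url"), row2.getString("value")) }
  .saveToCassandra(KEYSPACE, "target", SomeColumns("url", "value"))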
If all else fails you can always manually use the CassandraConnector Object to directly access the Java Driver and do raw requests with that from a distributed context.
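For reference, a rough sketch of that fallback, running raw CQL from inside a partition via CassandraConnector (the query and column names are assumptions):

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)
sc.cassandraTable(KEYSPACE, "table1").foreachPartition { rows =>
  // One session per partition, opened on the executor that holds the rows.
  connector.withSessionDo { session =>
    rows.foreach { row =>
      session.execute(
        "SELECT value FROM " + KEYSPACE + ".table2 WHERE url = ?",
        row.getString("company_url"))
    }
  }
}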
I wrote this program in the Spark shell:
val array = sc.parallelize(List(1, 2, 3, 4))
array.foreach(x => println(x))
This prints some debug statements, but not the actual numbers.
The code below works fine
for(num <- array.take(4)) {
println(num)
}
I do understand that take is an action and therefore will cause spark to trigger the lazy computation.
But foreach should have worked the same way... why did foreach not bring anything back from Spark and start doing the actual processing (i.e. get out of lazy mode)?
How can I make the foreach on the rdd work?
The RDD.foreach method in Spark runs on the cluster, so each worker that holds some of these records runs the operations in foreach. I.e. your code is running, but the numbers are being printed to the Spark workers' stdout, not to the driver/your shell session. If you look at the output (stdout) of your Spark workers, you will see them printed to the console.
You can view the stdout on the workers by going to the web gui running for each running executor. An example URL is http://workerIp:workerPort/logPage/?appId=app-20150303023103-0043&executorId=1&logType=stdout
In this example Spark chooses to put all the records of the RDD in the same partition.
This makes sense if you think about it - look at the function signature for foreach - it doesn't return anything.
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
This is really the purpose of foreach in Scala: it's used for side effects.
When you collect records, you bring them back into the driver, so logically collect/take operations are just running on a Scala collection within the Spark driver; you see the output because the Spark driver/Spark shell is what's printing to stdout in your session.
A use case for foreach may not seem immediately apparent. An example: if for each record in the RDD you wanted to perform some external behaviour, like calling a REST API, you could do this in the foreach, and each Spark worker would submit a call to the API server with its values. If foreach did bring back records, you could easily blow out the memory in the driver/shell process. This way you avoid those issues and can apply side effects to all the items in an RDD across the cluster.
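A small illustrative sketch of that REST use case (the endpoint URL is a placeholder):

import java.net.{HttpURLConnection, URL}

// Each executor fires a request for the records it holds; nothing is sent back to the driver.
array.foreach { x =>
  val conn = new URL(s"https://example.com/record?value=$x").openConnection()
    .asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("GET")
  conn.getResponseCode()   // trigger the call
  conn.disconnect()
}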
If you want to see what's in an RDD, I use:
array.collect.foreach(println)
//Instead of collect, use take(...) or takeSample(...) if the RDD is large
You can use RDD.toLocalIterator() to bring the data to the driver (one RDD partition at a time):
val array = sc.parallelize(List(1, 2, 3, 4))
for(rec <- array.toLocalIterator) { println(rec) }
See also
Spark: Best practice for retrieving big data from RDD to local machine
this blog post about toLocalIterator