For example, we are reading data from Oracle from multiple tables using a for loop. Where does the for loop execute line by line: on the driver node or on the executor nodes?
Likewise, when variables get created, where are they stored: on the driver node or on the executor nodes?
I am a bit new to Spark. Could you please explain?
The for loop is executed by the driver. The data is read by the executors. But as the for loop is on the driver, the tables will be read sequentially. If you want to read them in parallel, you need to submit the jobs from different threads. Look at multithreading in python.
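A minimal sketch of submitting the reads from several driver threads, using only the standard library. The names read_tables_in_parallel, read_table and max_workers are made up for illustration; read_table stands for whatever function performs the actual read for one table (for instance a call to spark.read.jdbc followed by an action, since reads are lazy and nothing runs in parallel until an action is triggered).

```python
from concurrent.futures import ThreadPoolExecutor


def read_tables_in_parallel(read_table, table_names, max_workers=4):
    """Call read_table(name) for every table from a pool of driver threads.

    read_table is a hypothetical callable supplied by the caller, for example
    a function that runs spark.read.jdbc(...) followed by an action.
    Returns a dict mapping each table name to whatever read_table returned.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one call per table; each thread can trigger its own Spark job.
        futures = {name: pool.submit(read_table, name) for name in table_names}
        # result() blocks until that call finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Stand-in reader so the sketch runs without a cluster or a database.
    def fake_read(name):
        return "rows of " + name

    print(read_tables_in_parallel(fake_read, ["EMP", "DEPT"]))
```

The threads only live on the driver; the work each job triggers is still carried out by the executors.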
The variables in your code are created by the driver. The driver sends the executors the tasks they need to perform, each packed with a copy of the variables it needs. If you want to avoid shipping a copy of each variable with every task, you can use broadcasting so that each executor (which runs many tasks) keeps a single copy. This is useful for large read-only variables.
I have a very basic understanding of Spark and I am trying to find something that can help me achieve the following:
Have a pool of objects shared over all the nodes, asynchronously.
What I am currently thinking of is this: let's say there are ten nodes, numbered from 1 to 10.
If I have a single object, I will have to make my object synchronous in order for it to be accessible by any node. I do not want that.
Second option is, I can have a pool of say 10 objects.
I want to write my code in such a way that the node number 1 always uses the object number 1, the node number 2 always uses the object number 2 and so on..
A sample approach would be, before performing a task, get the thread ID and use the object number (threadID % 10). This would result in a lot of collisions and would not work.
Is there a way that I can somehow get a node ID or process ID and make my code fetch an object according to that ID? Or is there some other way to have an asynchronous pool of objects on my cluster?
I apologize if it sounds trivial; I am just getting started and cannot find many resources online that address my question.
PS: I am using a Spark Streaming + Kafka + YARN setup, if it matters.
Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.
I would like to execute over a hundred user-defined type statements. These statements are contained in a .cql file.
Every time I execute the .cql file for new cases, I find that many of the statements in it get skipped.
Therefore, I would like to know whether there are any performance issues with executing hundreds of statements from a single .cql file.
Note: I am executing the .cql files from a Python script via the os.system method.
The time needed to execute hundreds of DDL statements from code (or from a .cql file via cqlsh) grows with the number of nodes in the cluster. In a distributed system like Cassandra, all nodes have to agree on a schema change, and the more nodes there are, the longer schema agreement takes.
There is a timeout value, maxSchemaAgreementWaitSeconds, which determines how long the driver waits for schema agreement before returning to the client code. A typical schema deployment involves one or two tables, and the default value for this parameter works just fine.
In the special case of many DDL statements executed at once from code or cqlsh, it is better to increase maxSchemaAgreementWaitSeconds, say to 20 seconds. The schema deployment will take a little longer, but it is much more likely to succeed.
I have a set of large variables that I broadcast. These variables are loaded from a clustered database. Is it possible to distribute the load from the database across worker nodes and then have each one broadcast their specific variables to all nodes for subsequent map operations?
Thanks!
Broadcast variables are generally passed to workers, but I can tell you what I did in a similar case in python.
If you know the total number of rows, you can try to create an RDD of that length and then run a map operation on it (which will be distributed to workers). In the map, the workers are running a function to get some piece of data (not sure how you are going to make them all get different data).
Each worker would retrieve required data through making the calls. You could then do a collectAsMap() to get a dictionary and broadcast that to all workers.
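The approach above can be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: load_and_broadcast, fetch and num_slices are invented names, sc is assumed to be an existing SparkContext, and fetch stands for whatever function loads the value for one key from the database.

```python
def load_and_broadcast(sc, keys, fetch, num_slices=8):
    """Fetch one value per key on the workers, then broadcast the full mapping.

    sc is an existing SparkContext; fetch is a hypothetical function that
    loads the value for a single key from the database and must be picklable.
    """
    lookup = (
        sc.parallelize(keys, num_slices)       # spread the keys over the workers
          .map(lambda key: (key, fetch(key)))  # each worker loads its own share
          .collectAsMap()                      # gather the pairs on the driver
    )
    # Only the driver can broadcast, so the dict makes a round trip through it.
    return sc.broadcast(lookup)
```

Inside later map operations the data is then available as the .value attribute of the returned broadcast object. Note that the full dictionary has to fit in driver memory.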
However, keep in mind that every worker needs all the software dependencies required to make the client requests. You also need to keep socket usage in mind. I just did something similar with querying an API and did not see a rise in sockets, although I was making regular HTTP requests. Not sure....
Ok, so the answer it seems is no.
Calling sc.broadcast(someRDD) results in an error. You have to collect() it back to the driver first.
I have around 100 threads running in parallel and loading data into a single table using a sqlldr control (.ctl) file. The control file generates values for ID using the expression ID SEQUENCE(MAX,1).
The process fails to load the files properly because of the parallel execution; it may be that two or more threads get the same ID. It works fine when I run it sequentially with a single thread.
Please suggest a workaround.
Each CSV file contains data associated with one test case, and the cases are supposed to run in parallel. I cannot concatenate all the files and load them in one go.
You could load the data first and then run a separate update that sets ID from a traditional Oracle sequence.
Is it possible to write the result of a Spark program to the driver node when the program is processed on a cluster?
df = sqlContext.read.load("hdfs://....")  # sqlContext itself is not callable
result = df.groupby('abc','cde').count()
result.write.save("hdfs:...resultfile.parquet", format="parquet") # this works fine
result = result.collect()  # list of Row objects, now on the driver
with open("<my drivernode local directory>//textfile", "w") as myfile:  # "w" opens the file for writing
    myfile.write(str(result)) # placeholder; I'll convert to a Python object before writing
Could someone give me an idea of how to refer to the local filesystem of the machine where I ran spark-submit?
tl;dr Use . (the dot) and the current working directory is resolved by API.
From what I understand from your question, you are asking about saving local files on the driver or the workers while running Spark.
This is possible and is quite straightforward.
The point is that in the end, the driver and the workers are running Python, so you can use Python's "open", "with", "write" and so on.
To do this on the workers, you'll need to run "foreach" or "map" on your RDD and then save locally (this can be tricky, as you may have more than one partition on each executor).
Saving from the driver is even easier: after you have collected the data you have a regular Python object, and you can save it in any standard Pythonic way.
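A small sketch of the driver-side case, assuming the collected result is a list of tuples or Row objects (Row is a tuple subclass, so the csv module can write it directly). The function name save_rows and the output path are made up for illustration.

```python
import csv


def save_rows(rows, path):
    """Write collected rows (tuples or Row objects) to a local CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path


# Hypothetical usage on the driver, after the Spark job has run:
#   save_rows(result.collect(), "./counts.csv")
```

A relative path such as "./counts.csv" is resolved against the working directory of the driver process.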
BUT
When you save a local file, whether on a worker or on the driver, the file is created inside the container in which that worker or driver is running. Once the execution is over, those containers are deleted and you will no longer be able to access any local data stored in them.
The way to solve this is to move those local files somewhere else while the container is still alive. You can do this with a shell command, by inserting into a database, and so on.
For example, I use this technique to insert the results of calculations into MySQL without needing to do a collect. I save the results locally on the workers as part of a "map" operation and then upload them using MySQL's "LOAD DATA LOCAL INFILE".
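A rough sketch of the worker-side pattern, written per partition rather than per row so that each partition gets its own file. It is not the code used above, only an illustration: dump_partition and upload are invented names, and upload stands for whatever moves the file out of the container (for instance a function that issues LOAD DATA LOCAL INFILE through a MySQL client library).

```python
import csv
import os
import tempfile


def dump_partition(rows, upload):
    """Write one partition to a temporary CSV, hand it to upload, then clean up.

    upload is a hypothetical callable that receives the file path, for example
    one that issues LOAD DATA LOCAL INFILE through a MySQL client library.
    Yields the number of rows written so the caller can total them.
    """
    fd, path = tempfile.mkstemp(suffix=".csv")
    count = 0
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            for row in rows:
                writer.writerow(row)
                count += 1
        upload(path)  # move the data out while the container still exists
    finally:
        os.remove(path)  # the temporary file is no longer needed
    yield count
```

With a real RDD this could be used as rdd.mapPartitions(lambda rows: dump_partition(rows, upload)).sum(), which runs the upload on the workers and returns only the total row count to the driver.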