PySpark: pull data to driver and then upload to dataframe - apache-spark

I am trying to create a pyspark dataframe from data stored in an external database. I use the pyodbc module to connect to the database and pull the required data, after which I use spark.createDataFrame to send my data to the cluster for analysis.
I run the script using --deploy-mode client, so the driver runs on the master node, but the executors can be distributed to other machines. The problem is pyodbc is not installed on any of the worker nodes (this is fine since I don't want them all querying the database anyway), so when I try to import this module in my scripts, I get an import error (unless all the executors happen to be on the master node).
My question is how can I specify that I want a certain portion of my code (in this case, importing pyodbc and querying the database) to run on the driver only? I am thinking something along the lines of
if __name__ == '__driver__':
    <do stuff>
else:
    <wait until stuff is done>

Your imports in your Python driver DO only run on the master. The only time you will see missing-import errors on your executors is if you reference some object/function from one of those imports inside a function that Spark ships to the executors (for example, a function passed to an RDD or DataFrame operation). I would look carefully at any Python code you are running in RDD/DataFrame calls for unintended references. If you post your code, we can give you more specific guidance.
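As a rough illustration of that pattern (the DSN, query, and column names below are placeholders, not taken from the question):

# Sketch only: this runs entirely in the driver process, so pyodbc is never
# imported on the executors. Connection string and query are placeholders.
import pyodbc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

conn = pyodbc.connect("DSN=my_dsn;UID=user;PWD=password")  # hypothetical DSN
rows = conn.execute("SELECT id, value FROM some_table").fetchall()
conn.close()

# Only plain Python data (a list of tuples) is shipped to the cluster here.
df = spark.createDataFrame([tuple(r) for r in rows], ["id", "value"])
df.show()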
Also, routing data through your driver is usually not a great idea because it will not scale well. If you have lots of data, you are going to try to force it all through a single point, which defeats the purpose of distributed processing!
Depending on what database you are using, there is probably a Spark connector implemented to load it directly into a DataFrame. If you are using ODBC, then maybe you are using SQL Server? In that case you should be able to use JDBC drivers, as in this post, for example:
https://stephanefrechette.com/connect-sql-server-using-apache-spark/#.Wy1S7WNKjmE
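A minimal sketch of that JDBC route (the URL, table, and credentials are placeholders, and the executors need the database's JDBC driver jar on their classpath, e.g. via --jars):

# Sketch only: let the executors read from the database directly over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")  # placeholder URL
      .option("dbtable", "some_table")                                  # placeholder table
      .option("user", "user")
      .option("password", "password")
      .load())

df.show()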

This is not how Spark is supposed to work. Spark collections (RDDs or DataFrames) are inherently distributed. What you're describing is creating the dataset locally, by reading the whole dataset into the driver's memory, and then sending it over to the executors for further processing by creating an RDD or DataFrame out of it. That does not make much sense.
If you want to make sure that there is only one connection from spark to your database, then set the parallelism to 1. You can then increase the parallelism in further transformation steps.
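For example (a hedged sketch assuming an existing SparkSession named spark; the JDBC options are placeholders), you can force a single-connection read and then spread the data out afterwards:

# Sketch only: one JDBC connection for the read, more parallelism afterwards.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")  # placeholder
      .option("dbtable", "some_table")                                  # placeholder
      .option("numPartitions", 1)   # a single connection/partition for the read
      .option("user", "user")
      .option("password", "password")
      .load())

df = df.repartition(32)  # increase parallelism for later transformation steps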

Related

Can Apache Spark be used in place of Sqoop

I have tried connecting Spark with JDBC connections to fetch data from MySQL / Teradata or similar RDBMSs and was able to analyse the data.
Can Spark be used to store the data to HDFS?
Is there any possibility of Spark outperforming Sqoop at these tasks?
Looking forward to your valuable answers and explanations.
There are two main things to know about Sqoop and Spark. The main difference is that Sqoop will read the data from your RDBMS no matter what you have, and you don't need to worry much about how your table is configured.
With Spark, using a JDBC connection is a little bit different in terms of how you need to load the data. If your database doesn't have a column like a numeric ID or a timestamp, Spark will load ALL the data into one single partition and then try to process and save it. If you do have a column to use as the partition column, then Spark can sometimes be even faster than Sqoop.
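A hedged sketch of what such a partitioned JDBC read looks like (assuming an existing SparkSession named spark; the URL, table, column, and bounds are placeholders):

# Sketch only: partitioned JDBC read so the load is split across executors.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")   # placeholder URL
      .option("dbtable", "some_table")                  # placeholder table
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")    # a numeric ID or timestamp column
      .option("lowerBound", 1)
      .option("upperBound", 1000000)
      .option("numPartitions", 10)        # 10 parallel connections/partitions
      .load())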
I would recommend you take a look at the Spark JDBC data source documentation.
The conclusion is: if you are going to do a simple export that needs to be done daily with no transformation, I would recommend Sqoop, as it is simple to use and will not impact your database that much. Using Spark will work well IF your table is ready for that; otherwise, go with Sqoop.

Write Spark dataframe to database (Exasol) using jdbc slow

I am reading from AWS (S3) and writing into a database (Exasol), and it is taking too much time; even setting the batch size is not affecting performance.
I am writing 6.18M rows (around 3.5 GB), which takes 17 minutes,
running in cluster mode on a 20-node cluster.
How can I make it faster?
Dataset<Row> ds = session.read().parquet(s3Path);

ds.write()
  .format("jdbc")
  .option("user", username)
  .option("password", password)
  .option("driver", Conf.DRIVER)
  .option("url", dbURL)
  .option("dbtable", exasolTableName)
  .option("batchsize", 50000)
  .mode(SaveMode.Append)
  .save();
Ok, it's an interesting question.
I did not check the implementation details of the recently released Spark connector, but you may go with some previously existing methods.
Save the Spark job results as CSV files in Hadoop, then run a standard parallel IMPORT from all the created files via WebHDFS HTTP calls (see the sketch at the end of this answer).
The official UDF script is capable of importing directly from Parquet, as far as I know.
You may implement your own Java UDF script to read Parquet the way you want. For example, this is how it works for ORC files.
Generally speaking, the best way to achieve some real performance is to bypass Spark altogether.
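A rough sketch of the CSV-and-IMPORT route mentioned above, shown in PySpark for brevity (the paths, host, DSN, and the exact Exasol IMPORT/WebHDFS details are assumptions to illustrate the idea, not a tested recipe; check them against the Exasol documentation):

# Sketch only: write CSVs from Spark, then have Exasol pull them in parallel.
df = spark.read.parquet("s3://my-bucket/input/")               # placeholder path
df.write.csv("hdfs:///tmp/exasol_staging/", mode="overwrite")

# Hypothetical: issue the IMPORT from the driver over an Exasol connection.
import pyodbc
conn = pyodbc.connect("DSN=exasol_dsn")                        # hypothetical DSN
conn.execute("""
    IMPORT INTO my_schema.my_table
    FROM CSV AT 'http://namenode:9870/webhdfs/v1/tmp/exasol_staging/'
    FILE 'part-00000.csv'
""")
conn.commit()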

Spark code is taking a long time to return query. Help on speeding this up

I am currently running some Spark code and I need to query a data frame that is taking a long time (over 1 hour) per query. I need to query multiple times to check if the data frame is in fact correct.
I am relatively new to Spark and I understand that Spark uses lazy evaluation which means that the commands are executed only once I do a call for some action (in my case .show()).
Is there a way to do this process once for the whole DF and then quickly call on the data?
Currently I am saving the DF as a temporary table and then running queries in beeline (HIVE). This seems a little bit overkill as I have to save the table in a database first, which seems like a waste of time.
I have looked into the functions .persist and .collect, but I am confused about how to use them and how to query from them.
I would really like to learn the correct way of doing this.
Many thanks for the help in advance!!
Yes, you can keep your RDD in memory using rddName.cache() (or persist()). More information can be found in the Spark programming guide under RDD persistence.
Using a temporary table (registerTempTable (Spark 1.6) or createOrReplaceTempView (Spark 2.x)) does not "save" any data. It only creates a view with the lifetime of your Spark session. If you wish to save the table, you should use .saveAsTable, but I assume that this is not what you are looking for.
Using .cache is equivalent to .persist(StorageLevel.MEMORY_ONLY). If your table is large and thus can't fit in memory, you should use .persist(StorageLevel.MEMORY_AND_DISK).
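A minimal sketch of that pattern (assuming an existing SparkSession named spark; the table and column names are placeholders, and an action such as .count() is what actually materializes the cache):

# Sketch only: persist once, then reuse the cached data for repeated queries.
from pyspark import StorageLevel

df = spark.sql("SELECT * FROM some_table")   # placeholder for the expensive query
df.persist(StorageLevel.MEMORY_AND_DISK)     # or df.cache() if it fits in memory
df.count()                                   # first action: computes and caches

# Subsequent queries read from the cache instead of recomputing the lineage.
df.filter(df["some_column"] > 0).show()      # hypothetical column
df.groupBy("some_column").count().show()

df.unpersist()                               # release the cache when done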
Also, it is possible that you simply need more nodes in your cluster. In case you are running locally, make sure you deploy with --master local[*] to use all available cores on your machine. If you are running on a standalone cluster or with a cluster manager like YARN or Mesos, you should make sure that all necessary/available resources are assigned to your job.

Spark Ingestion path: "Source to Driver to Worker" or "Source to Workers"

When Spark ingests data, are there specific situations where it has to go through the driver and then from the driver to the workers? The same question applies to a direct read by the workers.
I guess I am simply trying to map out the conditions or situations that lead to one way or the other, and how partitioning happens in each case.
If you limit yourself to built-in methods then, unless you create a distributed data structure from a local one with a method like:
SparkSession.createDataset
SparkContext.parallelize
data is always accessed directly by the workers, but the details of the data distribution will vary from source to source.
RDDs typically depend on Hadoop input formats, but Spark SQL and the data source API are at least partially independent, at least when it comes to configuration.
It doesn't mean data is always properly distributed. In some cases (JDBC, streaming receivers) data may still be piped through a single node.
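A short sketch contrasting the two paths (assuming an existing SparkSession named spark; the paths are placeholders):

# Sketch only: two ingestion paths.

# 1) Local collection -> cluster: the data starts in the driver's memory and
#    is shipped to the executors when the distributed structure is created.
local_rows = [(1, "a"), (2, "b")]
df_from_driver = spark.createDataFrame(local_rows, ["id", "value"])

# 2) Direct read by the workers: each executor reads its own partitions from
#    the source; the rows never pass through the driver.
df_from_source = spark.read.parquet("hdfs:///data/events/")    # placeholder path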

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
The master node in Spark is for allocating the resources to a particular job; once the resources are allocated, the driver ships the complete code with all its dependencies to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository, such as a database, filesystem, web service, etc.
Once the data is loaded, it is wrapped into an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions by leveraging various RDD APIs, but you should do so only when you have valid reasons to.
Now all operations are performed over RDDs using the various methods/operations exposed by the RDD API. The RDD keeps track of partitions and partitioned data, and depending on the need or request, it automatically queries the appropriate partition.
In a nutshell, you do not have to worry about the way data is partitioned by the RDD, which partition stores which data, or how partitions communicate with each other; but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data.
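For example (a rough sketch assuming an existing SparkContext named sc; the tile keys and partitioning scheme are hypothetical), a custom partitioner in PySpark is just a function from key to partition index applied to a pair RDD:

# Sketch only: route keys to partitions with a custom function.
def tile_partitioner(key):
    # Hypothetical: send each 4x4 block of image tiles to the same partition.
    tile_x, tile_y = key
    return (tile_x // 4) * 4 + (tile_y // 4)

pairs = sc.parallelize([((x, y), None) for x in range(16) for y in range(16)])
partitioned = pairs.partitionBy(16, tile_partitioner)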
Secondly, if your data cannot be partitioned, then I do not think Spark would be an ideal choice, because that will result in everything being processed on a single machine, which is contrary to the idea of distributed computing.
I am not sure what exactly your use case is, but there are people who have been leveraging Spark for image processing; see the comments from Databricks on this.
