Write Spark dataframe to database (Exasol) using jdbc slow - apache-spark

I am reading from AWS (S3) and writing into a database (Exasol), and it is taking too much time; even setting batchsize does not affect performance.
I am writing 6.18M rows (around 3.5 GB), which takes 17 minutes.
I am running in cluster mode on a 20-node cluster.
How can I make it faster?
Dataset<Row> ds = session.read().parquet(s3Path);
ds.write().format("jdbc")
    .option("user", username)
    .option("password", password)
    .option("driver", Conf.DRIVER)
    .option("url", dbURL)
    .option("dbtable", exasolTableName)
    .option("batchsize", 50000)
    .mode(SaveMode.Append)
    .save();

Ok, it's an interesting question.
I did not check the implementation details of the recently released Spark connector, but you may go with some previously existing methods.
Save the Spark job results as CSV files in Hadoop, then run a standard parallel IMPORT from all the created files via WebHDFS HTTP calls.
The official UDF script is capable of importing directly from Parquet, as far as I know.
You may implement your own Java UDF script to read Parquet in the way you want. For example, this is how it works for ORC files.
Generally speaking, the best way to achieve some real performance is to bypass Spark altogether.
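As a rough illustration of the CSV-plus-IMPORT route, the sketch below only builds the text of an IMPORT statement that lists several CSV part files served over WebHDFS, so that the database can fetch them in parallel. The class name, table name, host, port, path and HDFS user are all hypothetical, nothing is escaped or validated, and the exact IMPORT syntax should be checked against the Exasol documentation for your version.

```java
import java.util.List;

// Sketch only: assembles an IMPORT statement as a string.
// Table, URL, user and file names are illustrative assumptions.
class ExasolImportSketch {

    // webHdfsBase is the WebHDFS URL of the directory holding the CSV part files.
    static String buildImport(String table, String webHdfsBase,
                              String hdfsUser, List<String> files) {
        StringBuilder sql = new StringBuilder();
        sql.append("IMPORT INTO ").append(table)
           .append(" FROM CSV AT '").append(webHdfsBase).append("'");
        // One FILE clause per part file; each is opened with the WebHDFS OPEN operation.
        for (String file : files) {
            sql.append("\n  FILE '").append(file)
               .append("?op=OPEN&user.name=").append(hdfsUser).append("'");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildImport(
                "MY_SCHEMA.MY_TABLE",
                "http://namenode:50070/webhdfs/v1/tmp/export",
                "spark",
                List.of("part-00000.csv", "part-00001.csv")));
    }
}
```

The resulting statement would then be executed once over a normal JDBC connection, so that the data itself travels from Hadoop to the database rather than through the Spark driver or executors.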

Related

Can Apache Spark be used in place of Sqoop

I have tried connecting Spark over JDBC to fetch data from MySQL / Teradata or similar RDBMSs and was able to analyse the data.
Can Spark be used to store the data to HDFS?
Is there any possibility of Spark outperforming Sqoop at these activities?
Looking forward to your valuable answers and explanations.
There are two main things to know about Sqoop and Spark. The main difference is that Sqoop will read the data from your RDBMS regardless of what you have, and you don't need to worry much about how your table is configured.
With Spark, loading data over a JDBC connection works a little differently. If your table doesn't have a column such as a numeric ID or a timestamp to partition on, Spark will load ALL the data into one single partition and then try to process and save it. If you do have a column to use as the partition column, then Spark can sometimes be even faster than Sqoop.
I would recommend taking a look at this doc.
The conclusion is: if you are going to do a simple export that needs to be done daily with no transformation, I would recommend Sqoop, as it is simple to use and will not impact your database that much. Using Spark will work well IF your table is ready for it; otherwise, go with Sqoop.
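To make the partitioning point concrete: Spark's JDBC reader accepts the options partitionColumn, lowerBound, upperBound and numPartitions, and turns them into one range query per partition. The sketch below is a simplified, plain-Java approximation of how such range predicates can be derived; it is not Spark's actual code, and the class name, method name and the column "id" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of splitting a numeric range into per-partition WHERE clauses.
// Not Spark's implementation; names and rounding behaviour are illustrative.
class PartitionPredicateSketch {

    static List<String> predicates(String column, long lower, long upper, int numPartitions) {
        List<String> clauses = new ArrayList<>();
        // Never use more partitions than there are integer steps in the range.
        long span = Math.max(1L, upper - lower);
        int n = (int) Math.min((long) Math.max(1, numPartitions), span);
        if (n == 1) {
            clauses.add("1 = 1"); // single partition: read everything in one query
            return clauses;
        }
        long stride = (upper - lower) / n;
        long current = lower;
        for (int i = 0; i < n; i++) {
            String lowerClause = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upperClause = (i == n - 1) ? null : column + " < " + current;
            if (lowerClause == null) {
                // First partition is open-ended below and also picks up NULLs.
                clauses.add(upperClause + " OR " + column + " IS NULL");
            } else if (upperClause == null) {
                // Last partition is open-ended above.
                clauses.add(lowerClause);
            } else {
                clauses.add(lowerClause + " AND " + upperClause);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        predicates("id", 0, 100, 4).forEach(System.out::println);
    }
}
```

Because the first and last ranges are open-ended, the bounds only decide how the rows are divided, not which rows are read. Without a suitable column there is nothing to split on, which is why the whole table ends up in a single partition.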

Any benefit for my case when using Hive as datawarehouse?

Currently, I am trying to adopt big data technology to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and we load them as Java objects (i.e. in memory) for aggregation.
I am looking to use Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS / the filesystem, so Hive as a data warehouse does not seem to be a must. However, I can still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the pros / benefits of introducing a Hive layer rather than directly loading the CSV files into a Spark DataFrame?
Thanks.
You can always inspect the data through the tables.
Ad hoc queries/aggregations can be performed using HiveQL.
When accessing that data through Spark, you do not need to specify the schema of the data separately.

What is the best way to get a DataFrame abstraction over HBase data without Phoenix

I want to save data to and read data from HBase using Spark.
I want the DataFrame abstraction, as DataFrames are better for memory management than RDDs and are convenient for any processing.
I looked at possible candidates for getting a DataFrame abstraction. One of them is a Phoenix-based solution. I do not want to have a Phoenix layer on top of HBase due to approvals. I searched for other solutions, but would like to know the best option that someone has actually tried.
We have a performant one at Splice Machine (open source). We wrote a separate InputFormat for HBase so we can read directly from the store files in HBase instead of performing remote scans. The killer for Spark performance on top of HBase is the remote-scan-based InputFormat (i.e. how you read the data).
Sean Busbey at Cloudera has worked on a Spark HBase connector, and here is a blog post from Hortonworks on a similar idea:
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
The "connectors" functionally work but perform poorly for large data sets.
Hope this helps and good luck.

Baseline for measuring Apache Spark jobs execution times

I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.
I use Spark to compute dynamic reports from data, stored in a NoSQL database (Cassandra). So far I have created several reports and they are computed correctly. Inside them I use DataFrame .unionAll(), .join(), .count(), .map(), etc.
I am running a 1.4.1 Spark cluster on my local machine with the following setup:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=1g
I have also populated the database with test data which is around 10-12k records per table.
By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.
I have configured Spark to use the KryoSerializer, I have set the spark.io.compression.codec to lzf, I have optimized the jobs as much as I can and as much as my knowledge allows me to.
This led to the jobs taking 20s-30s to compute (which I think is a good improvement). The problem is that because this is my first Spark project, I have no baseline to compare the jobs times, so I have no idea if the execution is slow or fast and whether there is some problem in the code or with the Spark config.
What is the best way to proceed? Is there a graph or benchmark that shows how much time an action with N data should take?
You have to use Hive. On top of Hive you can put Spark. After doing this, create a temp table in Hive for the Cassandra table; you can then perform all types of aggregation and filtering. After that, you can use a Hive JDBC connection to get the result. It will give fast results.

library to process .rrd(round robin data) using spark

I have a huge amount of time series data in .rrd (round robin database) format stored in S3. I am planning to use Apache Spark to run analysis on it to get different performance metrics.
Currently I am downloading the .rrd files from S3 and processing them using the rrd4j library. I am going to process longer terms, like a year or more, which involves processing hundreds of thousands of .rrd files. I want the Spark nodes to get the files directly from S3 and run the analysis.
How can I make Spark use rrd4j to read the .rrd files? Is there any library which helps me do that?
Is there any support in Spark for processing this kind of data?
The spark part is rather easy, use either wholeTextFiles or binaryFiles on sparkContext (see docs). According to the documentation, rrd4j usually wants a path to construct an rrd, but with the RrdByteArrayBackend, you could load the data in there - but that might be a problem, because most of the API is protected. You'll have to figure out a way to load an Array[Byte] into rrd4j.
