Spark (pyspark) speed test

I am connected via JDBC to a database with 500,000,000 rows and 14 columns.
Here is the code used:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
properties = {'jdbcurl': 'jdbc:db:XXXXXXXXX','user': 'XXXXXXXXX', 'password': 'XXXXXXXXX'}
data = spark.read.jdbc(properties['jdbcurl'], table='XXXXXXXXX', properties=properties)
data.show()
The code above took 9 seconds to display the first 20 rows of the DB.
Later I created a SQL temporary view via
data[['XXX','YYY']].createOrReplaceTempView("ZZZ")
and I ran the following query:
sqlContext.sql('SELECT AVG(XXX) FROM ZZZ').show()
The code above took 1355.79 seconds (circa 23 minutes). Is this normal? It seems like a very long time.
Finally, I tried to count the number of rows in the DB:
sqlContext.sql('SELECT COUNT(*) FROM ZZZ').show()
It took 2848.95 seconds (circa 48 minutes).
Am I doing something wrong, or are these times normal?

When you read a JDBC source with this method you lose parallelism, the main advantage of Spark. Please read the official Spark JDBC guidelines, especially regarding partitionColumn, lowerBound, upperBound and numPartitions: these options let Spark run multiple JDBC queries in parallel, producing a partitioned DataFrame (see the sketch below).
Tuning the fetchsize parameter may also help for some databases.
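For example, a minimal PySpark sketch of a partitioned read (the partition column name id, the bounds and the partition count are assumptions; substitute a roughly uniformly distributed numeric column from your own table):
data = spark.read.jdbc(
    url=properties['jdbcurl'],
    table='XXXXXXXXX',
    column='id',                # assumed numeric, evenly distributed column
    lowerBound=1,               # minimum value of the partition column
    upperBound=500000000,       # maximum value of the partition column
    numPartitions=100,          # run 100 JDBC queries in parallel
    properties={'user': 'XXXXXXXXX',
                'password': 'XXXXXXXXX',
                'fetchsize': '10000'})  # rows per round trip; effect is driver-dependent
With a read like this, the AVG and COUNT(*) queries run across 100 partitions instead of a single one.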

Related

Why does calculating an RDD count take so much time?

(English is not my first language so please excuse any mistakes)
I use Spark SQL to read 4.7 TB of data from a Hive table and perform a count operation; it takes about 1.6 hours. Reading directly from the HDFS text files and performing the count takes only 10 minutes. The two jobs used the same resources and parallelism. Why does the RDD count take so much longer?
The Hive table has about 3,000 columns, so maybe serialization is costly. I checked the Spark UI: each task reads about 240 MB of data and takes about 3.6 minutes to execute. I can't believe that the serialization overhead is that expensive.
Reading from Hive (taking 1.6 hours):
val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
val count = hiveData.count()
Reading from HDFS (taking 10 minutes):
val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)
val count = hdfsData.count()
Using a SQL COUNT, it still takes 5 minutes:
val sql = s"SELECT COUNT(*) FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
hiveData.foreach(println(_))
Your first method is querying the data instead of fetching the data. Big difference.
val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
We can look at the above code as programmers and think "yes, this is how we grab all of the data." But the data is being grabbed via a query instead of being read from a file. Basically, the following steps occur:
Read from file into temporary storage
A Query engine processes query on temp storage and creates results
Results are read into an RDD
That's a lot of steps! More than what occurs with the following:
val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)
Here, we just have one step:
Read from file into RDD
See, that's one third of the steps. Even though it is a simple query, there is still a lot of overhead and processing involved in getting the data into that RDD. Once it's in the RDD, though, processing is easier, as shown by your code:
val count = hdfsData.count()
In your first approach, all of the data is loaded into Spark, and the network transfer, serialization and transformation operations take a lot of time.
The second approach is faster, I think, because it omits the Hive layer.
If you just need the count, the third approach is better: it executes the count inside the engine and loads only the result (see the sketch below).
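A minimal PySpark sketch of that third approach (the question's code is Scala, but the idea is the same; the table name is taken from the question): the COUNT(*) is computed inside the engine, and only a single row travels back to the driver.
count_df = sqlContext.sql("SELECT COUNT(*) FROM xxxtable")  # aggregation pushed to the engine
count = count_df.collect()[0][0]                            # only one row comes back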

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark DataFrames by adding Ignite on top of them. The following code is how we currently read the DataFrame:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark DataFrame to and from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
SQL queries (like select a, b, c from table where x) on the Ignite DataFrame work, but the performance is much slower than Spark alone (i.e. querying the Spark DataFrame directly, without Ignite): a SQL query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100 MB+) is exchanged between the Ignite container and the Spark container for every query, and that a query with the same WHERE clause but a smaller result is processed faster. Overall, Ignite's DataFrame support seems to be a simple wrapper on top of Spark, and hence slower than Spark alone in most cases. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in XML (I would need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, your test doesn't seem fair. In the first case you prefetch the Parquet data and cache it locally in Spark, and only then execute the query; in the Ignite DataFrame case you don't use caching, so the data is fetched during query execution. Typically you will not be able to cache all of your data, so Parquet performance will drop significantly once some of the data has to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index (a sketch follows below).
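As a hedged sketch (assuming the table created by the DataFrame save is named person in Ignite's PUBLIC schema, and using the pyignite thin client purely for illustration; any Ignite SQL client, such as the JDBC thin driver or sqlline, would do):
from pyignite import Client
# Connect to an Ignite node over the thin-client protocol (default port 10800).
client = Client()
client.connect('127.0.0.1', 10800)
# Index the column used in the WHERE clause so queries on x avoid a full scan.
client.sql('CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)')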

Spark 1.4.1 dataframe queries on Hive ORC tables take forever

I am using Apache Spark 1.4.1 (which is integrated with Hive 0.13.1) along with Hadoop 2.7.
I have created an ORC table with Snappy compression in Hive and inserted around 50 million records into it using the Spark DataFrame API (insertInto method), as below:
inputDF.write.format("orc").mode(SaveMode.Append).partitionBy("call_date","hour","batch_id").insertInto("MYTABLE")
This table has around 50-60 columns, with 3 columns being varchar and all other columns being either INT or FLOAT.
My problem is that when I query the table using the Spark commands below:
var df1 = hiveContext.sql("select * from MYTABLE")
val count1 = df1.count()
The query never completes and stays stuck for several hours. The Spark console logs are stuck at:
16/12/02 00:50:46 INFO DAGScheduler: Submitting 2700 missing tasks from ShuffleMapStage 70 (MapPartitionsRDD[553] at cache at MYTABLE_LOAD.scala:498)
16/12/02 00:50:46 INFO YarnScheduler: Adding task set 70.0 with 2700 tasks
The table has 2700 part files in the warehouse directory.
I have tried coalescing the inputDF to 10 partitions before inserting into the table, which created 270 part files instead of 2700, but querying the table gives the same issue: the query never completes.
The strange thing is that when I invoke the same select query via spark-shell (invoked with 5g driver memory), the query returns results in less than a minute.
Even for other ORC tables (not Snappy compressed), querying them using hiveContext.sql with very simple queries (select ... from table where ...) takes more than 10 minutes.
Can someone please advise what the issue could be here? I don't think there is something wrong with the table, as the spark-shell query would not have worked in that case.
Many thanks in advance.

Increasing number of partitions in Spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables using this configuration, with a certain date column as the partition.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatkye group by col1,col2,col3")
In Hive this query takes 90 seconds, but Spark takes 21 minutes for the same query. Looking at the job, I found that Spark creates 2 stages, and in the first stage it has only 7 tasks, one for each of the 7 blocks of data within the given partition of the ORC file. The blocks are of different sizes (one is 5 MB while another is 45 MB), so the stragglers take more time and the job as a whole takes far too long.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thereby the number of tasks in stage 1, even though there are only 7 physical blocks for the given query range?
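For reference, a hedged PySpark sketch of the kind of knobs this question is about (the split-size property is a standard Hadoop input-format setting, but whether the ORC reader honors it depends on the Spark/Hive version, so treat it as an assumption):
# Ask the input format for smaller splits, so the few large ORC blocks are
# divided into more stage-1 read tasks (value in bytes; 16 MB here).
sqlContext.setConf("mapreduce.input.fileinputformat.split.maxsize", "16777216")
# Alternatively, repartition between the scan and the aggregation: the shuffle
# spreads the 7 uneven input blocks across 64 tasks before the group-by runs.
df = sqlContext.table("mytable").where("date_key = somedatkye").repartition(64)
agg = df.groupBy("col1", "col2", "col3").sum("col4", "col5")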

Low JDBC write speed from Spark to MySQL

I need to write about 1 million rows from a Spark DataFrame to MySQL, but the insert is too slow. How can I improve it?
Code below:
df = sqlContext.createDataFrame(rdd, schema)
df.write.jdbc(url='xx', table='xx', mode='overwrite')
The answer in https://stackoverflow.com/a/10617768/3318517 has worked for me. Add rewriteBatchedStatements=true to the connection URL. (See Configuration Properties for Connector/J.)
My benchmark went from 3325 seconds to 42 seconds!
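For reference, a minimal PySpark sketch of the fix (the MySQL host, database and table are placeholders); the flag goes on the JDBC URL, where Connector/J picks it up and rewrites each batch into multi-row INSERT statements:
df.write.jdbc(
    url='jdbc:mysql://host:3306/db?rewriteBatchedStatements=true',
    table='xx',
    mode='overwrite')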
