I need to write about 1 million rows from a Spark DataFrame to MySQL, but the insert is too slow. How can I improve it?
Code below:
df = sqlContext.createDataFrame(rdd, schema)
df.write.jdbc(url='xx', table='xx', mode='overwrite')
The answer in https://stackoverflow.com/a/10617768/3318517 has worked for me. Add rewriteBatchedStatements=true to the connection URL. (See Configuration Properties for Connector/J.)
My benchmark went from 3325 seconds to 42 seconds!
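For reference, here is a minimal sketch of passing that flag through the connection URL with df.write.jdbc (host, database, table and credentials below are placeholders, and the driver class assumes Connector/J 5.x):
# Minimal sketch: enable batched INSERT rewriting for MySQL Connector/J.
# Host, database, table and credentials are placeholders, not real values.
url = "jdbc:mysql://myhost:3306/mydb?rewriteBatchedStatements=true"
df.write.jdbc(
    url=url,
    table="mytable",
    mode="overwrite",
    properties={"user": "myuser",
                "password": "mypassword",
                "driver": "com.mysql.jdbc.Driver"},  # com.mysql.cj.jdbc.Driver for Connector/J 8.x
)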
Related
I am trying to load a Spark dataframe into Hive like below:
df.repartition(col(col_nme)).write.mode("overwrite").format("ORC").option("compression","snappy").insertInto(hive_tbl)
The same df loads in 2 minutes with PySpark, but with Scala it takes 15 minutes.
Any suggestions or clues?
I have a spark data frame. I'm doing multiple transformations on the data frame. My code looks like this:
df = df.withColumn ........
df2 = df.filter......
df = df.join(df1 ...
df = df.join(df2 ...
Now I have around 30+ transformations like this. I'm also aware of persisting a data frame. So if I have some transformations like this:
df1 = df.filter.....some condition
df2 = df.filter.... some condition
df3 = df.filter... some other condition
I'm persisting the data frame "df" in the above case.
Now the problem is that Spark is taking too long to run (8+ minutes), or sometimes it fails with a Java heap space issue.
But if, after some 10+ transformations, I save to a table (a persistent Hive table) and read from that table in the next line, it takes around 3+ minutes to complete. It doesn't work even if I save it to an intermediate in-memory table.
Cluster size is not the issue either.
# some transformations
df.write.mode("overwrite").saveAsTable("test")
df = spark.sql("select * from test")
# some transformations ---------> 3 minutes
# some transformations
df.createOrReplaceTempView("test")
df.count() # action statement for the view to be created
df = spark.sql("select * from test")
# some more transformations --------> 8 minutes
I looked at the Spark SQL plan (I still do not completely understand it). It looks like Spark is re-evaluating the same dataframe again and again.
What am I doing wrong? I don't want to have to write it to an intermediate table.
Edit: I'm working on Azure Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11).
Edit2: The issue is long RDD lineage. It looks like my Spark application gets slower and slower as the RDD lineage grows.
You should use caching.
Try:
df.cache()
df.count()
The count() action forces all of the data to be cached.
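A minimal sketch of that pattern, with placeholder filter conditions:
# Minimal sketch: cache the shared DataFrame once, materialize the cache
# with an action, then branch off it. Filter conditions are placeholders.
df = df.cache()   # or df.persist() to choose a storage level explicitly
df.count()        # action that forces the cached data to be materialized

df1 = df.filter("colA > 0")    # placeholder condition
df2 = df.filter("colB < 10")   # placeholder condition
df3 = df.filter("colC = 1")    # placeholder condition
# later joins/transformations reuse the cached df instead of
# recomputing its full lineage every time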
Also I recommend you take a look at this and this
I am connected via JDBC to a DB that has 500,000,000 rows and 14 columns.
Here is the code used:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
properties = {'jdbcurl': 'jdbc:db:XXXXXXXXX','user': 'XXXXXXXXX', 'password': 'XXXXXXXXX'}
data = spark.read.jdbc(properties['jdbcurl'], table='XXXXXXXXX', properties=properties)
data.show()
The code above took 9 seconds to display the first 20 rows of the DB.
Later I created a SQL temporary view via
data[['XXX','YYY']].createOrReplaceTempView("ZZZ")
and I ran the following query:
sqlContext.sql('SELECT AVG(XXX) FROM ZZZ').show()
The code above took 1355.79 seconds (circa 23 minutes). Is this ok? It seems to be a large amount of time.
In the end, I tried to count the number of rows in the DB:
sqlContext.sql('SELECT COUNT(*) FROM ZZZ').show()
It took 2848.95 seconds (circa 48 minutes).
Am I doing something wrong or are these amounts standard?
When you read a JDBC source with this method you lose parallelism, the main advantage of Spark. Please read the official Spark JDBC guidelines, especially regarding partitionColumn, lowerBound, upperBound and numPartitions. This allows Spark to run multiple JDBC queries in parallel, resulting in a partitioned DataFrame.
Tuning the fetchsize parameter may also help for some databases.
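A rough sketch of a partitioned read, reusing the variables from the question (the split column, bounds, partition count and fetchsize are assumptions you would tune for your table):
# Rough sketch: partitioned JDBC read so Spark issues parallel queries.
# 'id_col', the bounds, numPartitions and fetchsize are assumptions to tune.
read_props = {'user': properties['user'],
              'password': properties['password'],
              'fetchsize': '10000'}
data = spark.read.jdbc(
    url=properties['jdbcurl'],
    table='XXXXXXXXX',
    column='id_col',        # numeric column Spark can split on
    lowerBound=1,
    upperBound=500000000,
    numPartitions=100,
    properties=read_props,
)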
We have a high-volume streaming job (Spark/Kafka), and the data (Avro) needs to be grouped by a timestamp field inside the payload. We are doing a groupBy on the RDD to achieve this: RDD[Timestamp, Iterator[Records]]. This works well for moderate record volumes, but for loads like 150k records every 10 seconds, the shuffle read time goes beyond 10 seconds and slows everything down.
So my question is: will switching to Structured Streaming and using groupBy on a DataFrame help here? I understand it has the Catalyst optimizer, which helps especially in SQL-like jobs. But will that help just for grouping the data? Does anyone have an idea, or a use case where they had a similar issue and it improved performance?
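For comparison, a minimal sketch of what the DataFrame-side grouping could look like in Structured Streaming (topic, servers and the timestamp column are hypothetical, and the Avro decoding is simplified to using the Kafka record timestamp):
from pyspark.sql import functions as F

# Minimal sketch: group a streaming DataFrame by a timestamp field.
# Topic, bootstrap servers and 'event_ts' are hypothetical; real code
# would decode the Avro payload to obtain the timestamp column.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "host:9092")
          .option("subscribe", "my_topic")
          .load())

decoded = stream.selectExpr("CAST(value AS STRING) AS payload",
                            "timestamp AS event_ts")

grouped = (decoded
           .withWatermark("event_ts", "10 seconds")
           .groupBy(F.window("event_ts", "10 seconds"))
           .count())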
I am using Apache Spark 1.4.1 (which is integrated with Hive 0.13.1) along with Hadoop 2.7.
I have created an ORC table with Snappy compression in Hive and inserted around 50 million records into it using the Spark DataFrame API (insertInto method), as below:
inputDF.write.format("orc").mode(SaveMode.Append).partitionBy("call_date","hour","batch_id").insertInto("MYTABLE")
This table has around 50-60 columns, with 3 columns being varchar and all other columns being either INT or FLOAT.
My problem is that when I query the table using the Spark commands below:
var df1 = hiveContext.sql("select * from MYTABLE")
val count1 = df1.count()
the query doesn't return and is stuck for several hours at the above step. The Spark console logs are stuck at the following:
16/12/02 00:50:46 INFO DAGScheduler: Submitting 2700 missing tasks from ShuffleMapStage 70 (MapPartitionsRDD[553] at cache at MYTABLE_LOAD.scala:498)
16/12/02 00:50:46 INFO YarnScheduler: Adding task set 70.0 with 2700 tasks
The table has 2700 part files in warehouse directory.
I have tried coalescing the inputDF to 10 partitions before inserting into the table, which created 270 part files for the table instead of 2700, but querying the table gives the same issue, i.e. the query doesn't return.
The strange thing is that when I invoke the same select query via spark-shell (invoked with 5g driver memory), the query returns results in less than a minute.
Even for other ORC tables (not Snappy compressed), querying them using hiveContext.sql with very simple queries (select ... from table where ...) is taking more than 10 minutes.
Can someone please advise what could be the issue here? I don't think there is anything wrong with the table, as the spark-shell query wouldn't have worked in that case.
Many thanks in advance.