Spark Exception when converting a MySQL table to parquet - apache-spark

I'm trying to convert a remote MySQL table to a Parquet file using Spark 1.6.2.
The process runs for 10 minutes, filling up memory, then starts emitting these messages:
WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#dac44da,BlockManagerId(driver, localhost, 46158))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
At the end it fails with this error:
ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-scheduler-1] shutting down ActorSystem [sparkDriverActorSystem]
java.lang.OutOfMemoryError: GC overhead limit exceeded
I'm running it in a spark-shell with these commands:
spark-shell --packages mysql:mysql-connector-java:5.1.26,org.slf4j:slf4j-simple:1.7.21 --driver-memory 12G
val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://.../table").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "...").option("user", "...").option("password", "...").load()
dataframe_mysql.saveAsParquetFile("name.parquet")
I have limited the max executor memory to 12G. Is there a way to force writing the parquet file in "small" chunks, freeing memory?

It seems the problem is that you had no partitioning defined when you read your data with the JDBC connector.
Reading from JDBC isn't distributed by default, so to enable distribution you have to set up manual partitioning. You need a column that is a good partitioning key, and you have to know its distribution up front.
This is apparently what your data looks like:
root
|-- id: long (nullable = false)
|-- order_year: string (nullable = false)
|-- order_number: string (nullable = false)
|-- row_number: integer (nullable = false)
|-- product_code: string (nullable = false)
|-- name: string (nullable = false)
|-- quantity: integer (nullable = false)
|-- price: double (nullable = false)
|-- price_vat: double (nullable = false)
|-- created_at: timestamp (nullable = true)
|-- updated_at: timestamp (nullable = true)
order_year seems like a good candidate to me (you seem to have ~20 years' worth of data according to your comments).
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = ???
val driver: String = ???
val connectionUrl: String = ???
val query: String = ???
val userName: String = ???
val password: String = ???
// Manual partitioning
val partitionColumn: String = "order_year"
val options: Map[String, String] = Map(
  "driver" -> driver,
  "url" -> connectionUrl,
  "dbtable" -> query,
  "user" -> userName,
  "password" -> password,
  "partitionColumn" -> partitionColumn,
  "lowerBound" -> "0",
  "upperBound" -> "3000",
  "numPartitions" -> "300"
)
val df = sqlContext.read.format("jdbc").options(options).load()
PS: partitionColumn, lowerBound, upperBound, numPartitions:
These options must all be specified if any of them is specified.
Now you can save your DataFrame to parquet.
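For example, a minimal sketch of the final write, assuming the df loaded above (df.write.parquet is the non-deprecated equivalent of saveAsParquetFile; the output path is taken from the question):
// write the partitioned DataFrame out as Parquet
df.write.parquet("name.parquet")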

Related

Spark Cassandra Write UDT With Case-Sensitive Names Fails

The Spark connector write fails with a java.lang.IllegalArgumentException: udtId is not a field defined in this definition error when using case-sensitive field names.
I need the fields in the Cassandra table to maintain case, so I have used quotes to create them.
My Cassandra schema
CREATE TYPE my_keyspace.my_udt (
"udtId" text,
"udtValue" text
);
CREATE TABLE my_keyspace.my_table (
"id" text PRIMARY KEY,
"someCol" text,
"udtCol" list<frozen<my_udt>>
);
My Spark DataFrame schema is
root
|-- id: string (nullable = true)
|-- someCol: string (nullable = true)
|-- udtCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- udtId: string (nullable = true)
| | |-- udtValue: string (nullable = true)
Are there any other options to get this write to work, other than defining my UDT with lowercase names? Making them lowercase would make me invoke case-management code everywhere this is used, and I'd like to avoid that.
Because I couldn't write successfully, I didn't try reading yet. Is this an issue with reads as well?
You need to upgrade to Spark Cassandra Connector 2.5.0 - I can't find the specific commit that fixes it, or a specific Jira that mentions it - I suspect that it was fixed in the DataStax version first, and then released as part of the merge announced here.
Here is how it works in SCC 2.5.0 + Spark 2.4.6, while it fails with SCC 2.4.2 + Spark 2.4.6:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("my_table", "test").load()
data: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> val data2 = data.withColumn("id", concat(col("id"), lit("222")))
data2: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> data2.write.cassandraFormat("my_table", "test").mode("append").save()
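For reference, one way to pick up SCC 2.5.0 in the shell (assuming the Scala 2.11 build, as typically used with Spark 2.4.6):
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0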

How to write result of streaming query to multiple database tables?

I am using Spark Structured Streaming and reading from a Kafka topic. The goal is to write the message to multiple tables in a PostgreSQL database.
The message schema is:
root
|-- id: string (nullable = true)
|-- name: timestamp (nullable = true)
|-- comment: string (nullable = true)
|-- map_key_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Writing to one table, after dropping map_key_value, works with the code below:
message.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.write.format("jdbc")
    .option("url", "url")
    .option("user", "username")
    .option("password", "password")
    .option(JDBCOptions.JDBC_TABLE_NAME, "table_1")
    .mode(SaveMode.Append).save()
}.outputMode(OutputMode.Append()).start().awaitTermination()
I want to write the message to two DB tables: table_1 (id, name, comment), and table_2, which needs to have the map_key_value.
You will need N streaming queries for N sinks; table_1 and table_2 each count as a separate sink.
writeStream does not currently write to JDBC, so you should use the foreachBatch operator, as in the sketch below.
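A minimal sketch under those constraints, assuming the streaming DataFrame message from the question; the JDBC URL, credentials, and the column split for table_2 are placeholders, and the map column is serialized to JSON since a JDBC write generally cannot store a map type directly:
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, to_json}
import org.apache.spark.sql.streaming.OutputMode

// Helper shared by both queries; connection details are placeholders.
def writeJdbc(df: DataFrame, table: String): Unit =
  df.write.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("user", "username")
    .option("password", "password")
    .option("dbtable", table)
    .mode(SaveMode.Append)
    .save()

// Query 1: the plain columns go to table_1.
val q1 = message.select("id", "name", "comment")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => writeJdbc(batchDF, "table_1") }
  .outputMode(OutputMode.Append())
  .start()

// Query 2: table_2 gets the map, serialized to JSON so it can be stored as text.
val q2 = message.select(col("id"), to_json(col("map_key_value")).as("map_key_value"))
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => writeJdbc(batchDF, "table_2") }
  .outputMode(OutputMode.Append())
  .start()

spark.streams.awaitAnyTermination()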

Reading orc file of Hive managed tables in pyspark

I am trying to read the ORC files of a managed Hive table using the PySpark code below:
spark.read.format('orc').load('hive managed table path')
When I do a printSchema on the fetched DataFrame, it is as follows:
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- col1: float (nullable = true)
| |-- col2: integer (nullable = true)
|-- partition_by_column: date (nullable = true)
Now I am not able to parse this data and do any manipulation on the DataFrame. When applying an action like show(), I am getting an error saying:
java.lang.IllegalArgumentException: Include vector the wrong length
Did someone face the same issue? If yes, can you please suggest how to resolve it?
It's a known issue.
You get that error because you're trying to read a Hive ACID table, but Spark still doesn't have support for this.
Maybe you can export your Hive table to normal ORC files and then read them with Spark, or try using alternatives like Hive JDBC as described here.
As I am not sure about the versions, you can try other ways to load the ORC file.
Using SqlContext
val df = sqlContext.read.format("orc").load(orcfile)
OR
val df= spark.read.option("inferSchema", true).orc("filepath")
OR Spark SQL (recommended)
import spark.sql
sql("SELECT * FROM table_name").show()

spark read orc with specific columns

I have an ORC file; when read with the option below, it reads all the columns.
val df= spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
but I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can do any data transformation in your code without an immediate real effect. Only after an action is called does Spark start doing the job, and Spark is smart enough not to do extra work.
So you can write like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()
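To check that the column pruning actually happens at read time, you can inspect the physical plan; the ReadSchema reported for the ORC scan should contain only id and name:
// prints the physical plan, including the pruned ReadSchema
filteredDF.explain()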

Spark DataFrame to MySql Time Stamp error

I have a DataFrame whose schema is:
id : long (nullable = false)
DDate: timestamp (nullable = true)
EDate: timestamp (nullable = true)
B1Date: timestamp (nullable = true)
B2Date: timestamp (nullable = true)
B3Date: timestamp (nullable = true)
When I use df.write.jdbc(url, "DF", prop) I am getting the error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Invalid default value for 'DDate'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
I am getting the same problem for every timestamp column. How do I solve the issue?
Change the timestamp type to java.sql.Timestamp, e.g.:
case class DfWithTs(sent_at: java.sql.Timestamp, value: Int)
val ds = df.as[DfWithTs]
and then write it; this will work with JDBC.
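A fuller sketch of the same idea, using a hypothetical case class that mirrors the schema from the question; the connection URL, credentials, and target table name are placeholders:
import java.sql.Timestamp
import java.util.Properties

// Field names mirror the schema in the question.
case class DfWithTs(id: Long, DDate: Timestamp, EDate: Timestamp,
                    B1Date: Timestamp, B2Date: Timestamp, B3Date: Timestamp)

val prop = new Properties()
prop.setProperty("user", "username")     // placeholder
prop.setProperty("password", "password") // placeholder

import spark.implicits._
val ds = df.as[DfWithTs]
ds.write.jdbc("jdbc:mysql://host:3306/db", "DF", prop)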
