I have a DataFrame whose schema is:
id : long (nullable = false)
DDate: timestamp (nullable = true)
EDate: timestamp (nullable = true)
B1Date: timestamp (nullable = true)
B2Date: timestamp (nullable = true)
B3Date: timestamp (nullable = true)
When I call df.write.jdbc(url, "DF", prop) I get this error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Invalid default value for 'DDate'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
I get the same problem for every timestamp column. How do I solve this issue?
Change the timestamp type to java.sql.Timestamp, e.g.:
case class DfWithTs(sent_at: java.sql.Timestamp, value: Int)
val ds = df.as[DfWithTs]
and then write it; this will work with JDBC.
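A minimal sketch of that suggestion for the schema in the question, assuming a SparkSession named spark and the df, url and prop from the question (the case class Record is just a mirror of the schema above):
import java.sql.Timestamp
import spark.implicits._

// Case class whose fields mirror the DataFrame columns; the timestamp
// columns map onto java.sql.Timestamp fields.
case class Record(id: Long, DDate: Timestamp, EDate: Timestamp,
                  B1Date: Timestamp, B2Date: Timestamp, B3Date: Timestamp)

val ds = df.as[Record]
ds.write.jdbc(url, "DF", prop)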
I am trying to read the ORC file of a managed Hive table using the PySpark code below:
spark.read.format('orc').load('hive managed table path')
When I do a printSchema on the fetched DataFrame, it is as follows:
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- col1: float (nullable = true)
| |-- col2: integer (nullable = true)
|-- partition_by_column: date (nullable = true)
Now I am not able to parse this data or do any manipulation on the DataFrame. When applying an action like show(), I get an error saying:
java.lang.IllegalArgumentException: Include vector the wrong length
Did someone face the same issue? If yes, can you please suggest how to resolve it?
It's a known issue.
You get that error because you're trying to read a Hive ACID table, and Spark does not yet support this.
Maybe you can export your Hive table to plain ORC files and then read them with Spark, or try alternatives like Hive JDBC as described here.
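A rough sketch of that export workaround, with hypothetical table names and path (the CTAS has to be run in Hive itself, e.g. via beeline, since Spark cannot read the ACID delta files):
// In Hive (not Spark), materialise a plain-ORC copy of the ACID table:
//   CREATE TABLE my_table_plain STORED AS ORC AS SELECT * FROM my_acid_table;
// Then read the plain copy with Spark:
val plainDf = spark.read.format("orc").load("/warehouse/my_table_plain")
plainDf.printSchema()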
As I am not sure about your versions, you can try other ways to load the ORC file.
Using SQLContext:
val df = sqlContext.read.format("orc").load(orcfile)
OR
val df = spark.read.option("inferSchema", true).orc("filepath")
OR Spark SQL (recommended):
import spark.sql
sql("SELECT * FROM table_name").show()
I'm trying to convert JSON files to Parquet with very few transformations (adding a date), but I then need to partition this data before saving it as Parquet.
I'm hitting a wall in this area.
Here is the creation process of the table:
df_temp = spark.read.json(data_location) \
.filter(
cond3
)
df_temp = df_temp.withColumn("date", fn.to_date(fn.lit(today.strftime("%Y-%m-%d"))))
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
Then, for saving the converted data:
df_final.write.mode("append").format("parquet").partitionBy("customer_id", "date").saveAsTable('duration')
but this generates the following error:
pyspark.sql.utils.AnalysisException: '\nSpecified partitioning does not match that of the existing table default.duration.\nSpecified partition columns: [customer_id, date]\nExisting partition columns: []\n ;'
the schema being:
root
|-- action_id: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- duration: long (nullable = true)
|-- initial_value: string (nullable = true)
|-- item_class: string (nullable = true)
|-- set_value: string (nullable = true)
|-- start_time: string (nullable = true)
|-- stop_time: string (nullable = true)
|-- undo_event: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- date: date (nullable = true)
Thus I tried to change the create table to:
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp PARTITIONED BY (customer_id, date) LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
But this creates an error like:
...mismatched input 'PARTITIONED' expecting ...
So I discovered that PARTITIONED BY doesn't work with LIKE, but I'm running out of ideas.
If I use USING instead of LIKE, I get the error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to specify partition columns when the table schema is not defined. When the table schema is not provided, schema and partition columns will be inferred.;'
How am I supposed to add a partition when creating the table?
PS: once the schema of the table is defined with the partitions, I want to simply use:
df_final.write.format("parquet").insertInto('duration')
I finally figured out how to do it with Spark.
df_temp = spark.read.json(...)
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("""
CREATE TABLE IF NOT EXISTS {1}
USING PARQUET
PARTITIONED BY (customer_id, date)
LOCATION '{2}/{1}' AS SELECT * FROM {0}_tmp
""".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
df_temp.write.mode("append").partitionBy("customer_id", "date").saveAsTable('duration')
I don't know why, but I can't use insertInto: it uses a weird customer_id out of nowhere and doesn't append the different dates.
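For what it's worth, insertInto resolves columns by position rather than by name, so a likely explanation for the "weird customer_id" is that the DataFrame's column order did not match the table's (non-partition columns first, partition columns last). A Scala-flavoured sketch of the idea, using the column names from the schema above (the exact list and order depend on how the table was actually created):
// Put the columns in the table's order, with the partition columns
// (customer_id, date) last, so insertInto's positional matching lines up.
val ordered = df_final.select(
  "action_id", "duration", "initial_value", "item_class", "set_value",
  "start_time", "stop_time", "undo_event", "year", "month", "day",
  "customer_id", "date")
ordered.write.insertInto("duration")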
Goal:
Read data from a JSON file where the timestamp is a long, and insert it into a table that has a Timestamp type. The problem is that I don't know how to convert the long to a Timestamp for the insert.
Input File Sample:
{"sensor_id":"sensor1","reading_time":1549533263587,"notes":"My Notes for
Sensor1","temperature":24.11,"humidity":42.90}
I want to read this, create a Bean from it, and insert into a table. Here is my Bean Definition:
public class DummyBean {
    private String sensor_id;
    private String notes;
    private Timestamp reading_time;
    private double temperature;
    private double humidity;
    // getters and setters omitted
}
Here is the table I want to insert into:
create table dummy (
id serial not null primary key,
sensor_id varchar(40),
notes varchar(40),
reading_time timestamp with time zone default (current_timestamp at time zone 'UTC'),
temperature decimal(15,2),
humidity decimal(15,2)
);
Here is my Spark app to read the JSON file and do the insert (append)
SparkSession spark = SparkSession
.builder()
.appName("SparkJDBC2")
.getOrCreate();
// Java Bean used to apply schema to JSON Data
Encoder<DummyBean> dummyEncoder = Encoders.bean(DummyBean.class);
// Read JSON file to DataSet
String jsonPath = "input/dummy.json";
Dataset<DummyBean> readings = spark.read().json(jsonPath).as(dummyEncoder);
// Diagnostics and Sink
readings.printSchema();
readings.show();
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readings.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
Output and Error Message:
root
|-- humidity: double (nullable = true)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = true)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = true)
+--------+--------------------+-------------+---------+-----------+
|humidity| notes| reading_time|sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
| 42.9|My Notes for Sensor1|1549533263587| sensor1| 24.11|
+--------+--------------------+-------------+---------+-----------+
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column "reading_time" not found in schema Some(StructType(StructField(id,IntegerType,false), StructField(sensor_id,StringType,true), StructField(notes,StringType,true), StructField(temperature,DecimalType(15,2),true), StructField(humidity,DecimalType(15,2),true)));
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4.apply(JdbcUtils.scala:146)
The exception in your post says the "reading_time" column was not found, so please cross-check that the table has the required column on the DB end. Also, the timestamp is coming in milliseconds, so you need to divide it by 1000 before applying the to_timestamp() function, otherwise you'll get a weird date.
I'm able to replicate this below and convert the reading_time:
scala> val readings = Seq((42.9,"My Notes for Sensor1",1549533263587L,"sensor1",24.11)).toDF("humidity","notes","reading_time","sensor_id","temperature")
readings: org.apache.spark.sql.DataFrame = [humidity: double, notes: string ... 3 more fields]
scala> readings.printSchema();
root
|-- humidity: double (nullable = false)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = false)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = false)
scala> readings.show(false)
+--------+--------------------+-------------+---------+-----------+
|humidity|notes |reading_time |sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |
+--------+--------------------+-------------+---------+-----------+
scala> readings.withColumn("ts", to_timestamp('reading_time/1000)).show(false)
+--------+--------------------+-------------+---------+-----------+-----------------------+
|humidity|notes |reading_time |sensor_id|temperature|ts |
+--------+--------------------+-------------+---------+-----------+-----------------------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |2019-02-07 04:54:23.587|
+--------+--------------------+-------------+---------+-----------+-----------------------+
Thanks for your help. Yes, the table was missing the column, so I fixed that.
This is what solved it (Java version)
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_timestamp;
...
Dataset<Row> readingsRow = readings.withColumn("reading_time", to_timestamp(col("reading_time").$div(1000L)));
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readingsRow.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
If your date is a String, you can use:
String readtime = obj.getString("reading_time");
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ"); //Z for time zone
Date reading_time = sdf.parse(readtime);
or use
new Date(json.getLong(milliseconds))
if it is long
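And if the string lives in a DataFrame column rather than in hand-parsed JSON, the same conversion can be done on the Spark side. A minimal Scala sketch, where the DataFrame df, the column name and the pattern (mirroring the SimpleDateFormat pattern above) are all assumptions:
import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._

// Parse a string column such as "2019-02-07T10:54:23+0000" into a timestamp column.
val withTs = df.withColumn("reading_time",
  to_timestamp($"reading_time", "yyyy-MM-dd'T'HH:mm:ssZ"))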
I'm reading data from a JDBC source and writing it directly into an Elasticsearch index. When I query the data in ES, I see that all timestamp fields in my DataFrame were transformed to long.
Below is the save:
spark_df1.write.format("org.elasticsearch.spark.sql") \
    .option('es.index.auto.create', 'true') \
    .option("es.write.operation", "index") \
    .option('es.host', 'localhost') \
    .option('es.mapping.date.rich', "True") \
    .option('es.mapping.id', 'Ticket') \
    .mode("append") \
    .save("index_esche/type")
When I run spark_df.printSchema():
|-- Createdon: timestamp (nullable = true)
|-- Updatedon: timestamp (nullable = true)
|-- Resolvedon: timestamp (nullable = true)
I'm trying to convert a remote MySQL table to a Parquet file using Spark 1.6.2.
The process runs for 10 minutes, filling up memory, then starts with these messages:
WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#dac44da,BlockManagerId(driver, localhost, 46158))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
At the end it fails with this error:
ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-scheduler-1] shutting down ActorSystem [sparkDriverActorSystem]
java.lang.OutOfMemoryError: GC overhead limit exceeded
I'm running it in a spark-shell with these commands:
spark-shell --packages mysql:mysql-connector-java:5.1.26,org.slf4j:slf4j-simple:1.7.21 --driver-memory 12G
val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://.../table").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "...").option("user", "...").option("password", "...").load()
dataframe_mysql.saveAsParquetFile("name.parquet")
I have limited the max executor memory to 12G. Is there a way to force writing the Parquet file in "small" chunks, freeing memory?
It seems like the problem is that you have no partitioning defined when you read your data with the JDBC connector.
Reading from JDBC isn't distributed by default, so to enable distribution you have to set up manual partitioning. You need a column that is a good partitioning key, and you have to know the distribution up front.
This is apparently what your data looks like:
root
|-- id: long (nullable = false)
|-- order_year: string (nullable = false)
|-- order_number: string (nullable = false)
|-- row_number: integer (nullable = false)
|-- product_code: string (nullable = false)
|-- name: string (nullable = false)
|-- quantity: integer (nullable = false)
|-- price: double (nullable = false)
|-- price_vat: double (nullable = false)
|-- created_at: timestamp (nullable = true)
|-- updated_at: timestamp (nullable = true)
order_year seems like a good candidate to me (you seem to have ~20 years, according to your comments).
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = ???
val driver: String = ???
val connectionUrl: String = ???
val query: String = ???
val userName: String = ???
val password: String = ???
// Manual partitioning
val partitionColumn: String = "order_year"
val options: Map[String, String] = Map("driver" -> driver,
"url" -> connectionUrl,
"dbtable" -> query,
"user" -> userName,
"password" -> password,
"partitionColumn" -> partitionColumn,
"lowerBound" -> "0",
"upperBound" -> "3000",
"numPartitions" -> "300"
)
val df = sqlContext.read.format("jdbc").options(options).load()
PS: partitionColumn, lowerBound, upperBound, numPartitions:
These options must all be specified if any of them is specified.
Now you can save your DataFrame to Parquet.
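For the write itself, a minimal sketch (reusing the question's output name; on Spark 1.6, df.write.parquet is the non-deprecated replacement for saveAsParquetFile):
// With ~300 JDBC partitions, each task writes its own Parquet part file,
// so no single task needs to hold the whole table in memory.
df.write.mode("overwrite").parquet("name.parquet")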