Writing Spark Dataframe to ORC gives the wrong timezone - apache-spark

Whenever I write a Dataframe into ORC, the timezone of Timestamp fields is not correct.
Here's my code:
// defining the schema
val schema = List(
  StructField("name", StringType),
  StructField("date", TimestampType)
)
val data = Seq(
  Row("test", java.sql.Timestamp.valueOf("2021-03-15 10:10:10.0"))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
// changing the timezone
spark.conf.set("spark.sql.session.timeZone", "MDT")
// value of the df has changed accordingly
df.show // prints 2021-03-15 08:10:10
// writing to orc
df.write.mode(SaveMode.Overwrite).format("orc").save("/tmp/dateTest.orc/")
The value in the ORC file will be 2021-03-15 10:10:10.0.
Is there any way to control the writer's timezone? Am I missing something here?
Thanks in advance!

So after much investigation: this is something that's not supported (at the moment) for ORC. It is supported for CSV, though.
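For reference, a hedged sketch of the CSV route, using the standard timeZone and timestampFormat options of the CSV data source (the zone ID, pattern, and output path are placeholders, not the asker's actual values):
// Sketch: write the same dataframe as CSV, rendering timestamps in an explicit zone
df.write
  .mode(SaveMode.Overwrite)
  .format("csv")
  .option("timeZone", "America/Denver")             // zone used to format timestamps (placeholder)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // output pattern (placeholder)
  .save("/tmp/dateTest.csv/")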

Related

how to insert dataframe having map column in hive table

I have a dataframe with multiple columns, one of which is of type map(string, string). I can print this dataframe, and the map column shows data such as Map("PUN" -> "Pune"). I want to write this dataframe to a Hive table (stored as Avro) which has the same column with type map.
df.withColumn("cname", lit("Pune"))
  .withColumn("city_code_name", map(lit("PUN"), col("cname")))
df.show(false)
// table - created an external Hive table, stored as Avro, with the Avro schema
After removing this map-type column I'm able to save the dataframe to the Hive Avro table.
The way I save to the Hive table (a sketch of these two steps follows below):
spark.save - saving the Avro file
spark.sql - creating a partition on the Hive table pointing at the Avro file location
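A rough sketch of those two steps, assuming the databricks Avro data source and placeholder table, partition, and path names (not the asker's actual objects):
// 1) spark.save - write the dataframe (including the map column) as Avro files
df.write.mode("overwrite")
  .format("com.databricks.spark.avro")
  .save("/data/avro/city_dim/dt=2020-01-01")   // placeholder location
// 2) spark.sql - point a partition of the external Hive table at that location
spark.sql("""
  ALTER TABLE mydb.city_dim ADD IF NOT EXISTS
  PARTITION (dt='2020-01-01') LOCATION '/data/avro/city_dim/dt=2020-01-01'
""")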
See this test case as an example from the Spark tests:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want the save option, you can try saveAsTable, as shown here:
Seq(9 -> "x").toDF("i", "j")
.write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourDataframeWithMapColumn.write.partitionBy(...) is the way to create partitions; see the sketch below.
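A hedged sketch of that (the partition column and path are placeholders; you partition by a scalar column, not by the map column itself):
// Sketch: write the dataframe as Avro, partitioned by a scalar column
yourDataframeWithMapColumn.write
  .mode("overwrite")
  .partitionBy("cname")                          // placeholder partition column
  .format("com.databricks.spark.avro")
  .save("/data/avro/city_dim")                   // placeholder path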
You can achieve that with saveAsTable
Example:
Df.write.saveAsTable(name='tableName',
                     format='com.databricks.spark.avro',
                     mode='append',
                     path='avroFileLocation')
Change the mode option to whatever suits you

Spark read avro

Trying to read an avro file.
val df = spark.read.avro(file)
Running into Avro schema cannot be converted to a Spark SQL StructType: [ "null", "string" ]
Tried to manually create a schema, but now running into the following:
val s = StructType(List(StructField("value", StringType, nullable = true)))
val df = spark.read
.option("inferSchema", "false")
.schema(s)
.avro(file)
com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException: Cannot convert Avro schema to catalyst type because schema at path is not compatible (avroType = StructType(StructField(value,StringType,true)), sqlType = STRING).
Source Avro schema: ["null","string"].
Target Catalyst type: StructType(StructField(value,StringType,true))
Trying to override the avro schema (without the null) also does not work:
val df = spark.read
.option("inferSchema", "false")
.option("avroSchema", """["string"]""")
.avro(file)
Avro schema cannot be converted to a Spark SQL StructType: [ "string" ]
Looks like spark-avro only creates a GenericDatumReader[GenericRecord] and I need a GenericDatumReader[Utf8] :(
Please make sure you are providing the correct AVSC with the data type.
["null", "String"] is placed to take care of null values in the Avro data.
You can create the schema of your Avro file with:
val schema = new Schema.Parser().parse(new File("user.avsc"))
Or, if you have a generated Avro Java class, you can get the schema from it:
val schema = YourAvroClass.getClassSchema // YourAvroClass: the class generated from your Avro schema
Now, once you have the schema, it is very simple to build a dataframe with it.
Code snippet:
val df = sparkSession.read.format("com.databricks.spark.avro")
  .option("avroSchema", schema.toString)
  .load("/home/garvit.vijay/000009_0.avro")
df.printSchema()
df.show()
Hope it works for you.

how can i add a timestamp as an extra column to my dataframe

Hi all,
I have an easy question for you all.
I have an RDD, created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a dataframe.
I have tried to add a value to the dataframe using withColumn(), but it returns this error:
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd => {
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  val dataframe = sqlContext.read.json(rdd.map(_._2))
  val d = dataframe.withColumn("timeStamp_column", dataframe.col("now"))
})
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
As I came to know, DataFrames cannot be altered as they are immutable, but RDDs are immutable as well.
Then what is the best way to do it?
How do I add a value to the RDD (adding a timestamp to an RDD dynamically)?
Try the current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
To add a new column with a constant value such as a timestamp, you can use the lit function (note that this produces the epoch time in milliseconds as a Long):
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
I see in the comments that some folks are having trouble getting the timestamp into a string. Here is a way to do that using the Spark 3 datetime format patterns:
import org.apache.spark.sql.functions._
val d = dataframe
  .withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))

how to convert current_timestamp() value in to a string in scala

I have been trying to add a timestamp to my data frames and persist them into Hive.
But here is the problem: since I cannot use timestamp as a data type in Hive version 0.13, I want to convert current_timestamp() from the timestamp data type to a string so that I can load it into my Hive table.
Here is my timestamp column:
[2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278]
I have tried this but with no luck:
val ts = current_timestamp()
val df:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val date:String = df.format(ts.toLong)
Is there any way to convert the timestamp to a string in Scala?
It's easier than I expected.
I have been appending a timestamp to my data frame like this:
val newDF = oldDF.withColumn("newColumn_name", current_timestamp())
And cast the timestamp into a string like this:
val newDF = oldDF.withColumn("newColumn_name", current_timestamp().cast("String"))
Hope this helps someone.
Thanks.
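If you need control over the output pattern rather than the default cast, a hedged alternative is the date_format function (the column name and pattern here are placeholders):
import org.apache.spark.sql.functions.{current_timestamp, date_format}
// Sketch: render the current timestamp directly as a formatted string column
val formattedDF = oldDF.withColumn("load_time", date_format(current_timestamp(), "yyyy-MM-dd HH:mm:ss"))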
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
format.format(new java.util.Date())

How to convert cassandraRow into Row (apache spark)?

I am trying to create a DataFrame from an RDD[CassandraRow], but I can't, because createDataFrame(RDD[Row], schema: StructType) needs an RDD[Row], not an RDD[CassandraRow].
How can I achieve this?
Also, as per the answer in this question:
How to convert rdd object to dataframe in spark
(one of the answers) the suggestion to use toDF() on an RDD[Row] to get a DataFrame from the RDD is not working for me. I tried using RDD[Row] in another example (tried to use toDF()).
It's also unclear to me how we can call a DataFrame method (toDF()) on an instance of an RDD (RDD[Row]).
I am using Scala.
If you really need this you can always map your data to Spark rows:
sqlContext.createDataFrame(
rdd.map(r => org.apache.spark.sql.Row.fromSeq(r.columnValues)),
schema
)
but if you want DataFrames it is better to import data directly:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
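On the toDF() part of the question: toDF comes from the implicits import and works on RDDs of case classes or tuples, not on RDD[Row] or RDD[CassandraRow]. A hedged sketch, assuming a hypothetical table with an id and a name column:
import sqlContext.implicits._   // brings toDF() into scope for RDDs of case classes / tuples
case class Person(id: Int, name: String)   // hypothetical case class matching your table
val df = rdd
  .map(r => Person(r.getInt("id"), r.getString("name")))   // CassandraRow getters by column name
  .toDF()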

Resources