DateTime datatype in BigQuery - apache-spark

I have a partitioned table where one of the columns is of type DateTime and the table is partitioned on that same column. According to the spark-bigquery connector documentation, the corresponding Spark SQL type is String.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
I tried doing exactly that, but I am getting a datatype mismatch error.
Code Snippet:
ZonedDateTime nowPST = ZonedDateTime.ofInstant(Instant.now(), TimeZone.getTimeZone("PST").toZoneId());
df = df.withColumn("createdDate", lit(nowPST.toLocalDateTime().toString()));
Error:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Failed to load to <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> in job JobId{project=<PROJECT_ID>, job=<JOB_ID>, location=US}. BigQuery error was Provided Schema does not match Table <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>. Field createdDate has changed type from DATETIME to STRING
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:156)
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:89)
... 36 more

As Spark has no DateTime type, the BigQuery connector does not support writing DateTime - there is no equivalent Spark data type that can be used. We are exploring ways to augment the DataFrame's metadata in order to support the types which BigQuery supports but Spark does not (DateTime, Time, Geography).
At the moment, please keep this field as a String and do the conversion on the BigQuery side.
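A minimal sketch of that approach (the staging table name and write options are assumptions, not from the question): produce the value as a string in a format DATETIME accepts, write it into a staging table whose createdDate column is STRING, and convert on the BigQuery side, e.g. SELECT * REPLACE (CAST(createdDate AS DATETIME) AS createdDate).

import java.time.{Instant, ZoneId, ZonedDateTime}
import org.apache.spark.sql.functions.lit

// LocalDateTime.toString yields ISO-8601 (e.g. 2023-01-15T08:30:00), which BigQuery can CAST to DATETIME.
val nowPST = ZonedDateTime.ofInstant(Instant.now(), ZoneId.of("America/Los_Angeles"))
val staged = df.withColumn("createdDate", lit(nowPST.toLocalDateTime.toString))

// Write to the STRING-typed staging table (indirect write via GCS assumed), then convert in BigQuery.
staged.write
  .format("bigquery")
  .option("temporaryGcsBucket", "<bucket-name>")
  .mode("append")
  .save("<PROJECT_ID>.<DATASET_NAME>.<STAGING_TABLE>")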

I am running into this issue now as well, with both Geography (https://community.databricks.com/s/question/0D58Y000099mPyDSAU/does-databricks-support-writing-geographygeometry-data-into-bigquery) and Datetime types. The only way I could get the table from Databricks into BigQuery (without creating a temporary table and inserting from it, which would still be costly due to the size of the table) was to write the table out as CSV to a GCS bucket:
results_df.write.format("csv").mode("overwrite").save("gs://<bucket-name>/ancillary_test")
and then load the data from the bucket into the BigQuery table, specifying the schema:
LOAD DATA INTO <dataset>.<tablename>(
PRICENODEID INTEGER,
ISONAME STRING,
PRICENODENAME STRING,
MARKETTYPE STRING,
GMTDATETIME TIMESTAMP,
TIMEZONE STRING,
LOCALDATETIME DATETIME,
ANCILLARY STRING,
PRICE FLOAT64,
CHANGE_DATE TIMESTAMP
)
FROM FILES (
format = 'CSV',
uris = ['gs://<bucket-name>/ancillary_test/*.csv']
);
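One caveat worth noting with this CSV round trip (a sketch under assumed column types, not part of the original workaround): Spark writes timestamp columns to CSV in its own default format, so it can help to pre-format any DATETIME-bound column into a form BigQuery's CSV loader parses, for example:

import org.apache.spark.sql.functions.{col, date_format}

// Format the DATETIME-bound column as "yyyy-MM-dd HH:mm:ss" (no zone offset),
// which LOAD DATA can parse into a DATETIME column.
val csvReady = results_df.withColumn(
  "LOCALDATETIME",
  date_format(col("LOCALDATETIME"), "yyyy-MM-dd HH:mm:ss")
)
csvReady.write.format("csv").mode("overwrite").save("gs://<bucket-name>/ancillary_test")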

Related

BigQuery exports NUMERIC data type as binary data type in AVRO

I am exporting data from a BigQuery table which has a column named prop12 defined as the NUMERIC data type. Please note that the destination format is AVRO and can't be changed.
bq extract --destination_format AVRO datasetName.myTableName /path/to/file-1-*.avro
When I read the Avro data using Spark, it is not able to convert this NUMERIC data type to Integer:
--prop12: binary (nullable = true)
cannot resolve 'CAST(`prop12` AS INT)' due to data type mismatch: cannot cast BinaryType to IntegerType
Is there any way I can specify that prop12 should be exported as Integer while doing bq extract?
OR
If that is not possible during bq extract, am I left with only the option of reading the binary data in Spark?
Is there any way I can specify that prop12 should be exported as Integer while doing bq extract?
You can't do it in the extract command itself, but you can create a temporary table and extract that instead:
bq query --nouse_legacy_sql '
CREATE TABLE `my_dataset.my_temp_table`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
) AS
SELECT * REPLACE (CAST(prop12 AS INT64) AS prop12)
FROM `my_dataset.my_table`;
' && bq extract --destination_format AVRO my_dataset.my_temp_table /path/to/file-1-*.avro
Consider that this will generate additional cost.
If that is not possible during bq extract, am I left with only the option of reading the binary data in Spark?
NUMERIC in BigQuery is a 16-byte decimal type, so it could be possible to work with these values as decimals. You can try treating them as decimal instead of casting to Integer.
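If you do end up reading the raw bytes in Spark, a plain cast will not work (BinaryType cannot be cast to a numeric type), but you can decode them with a small UDF. This is a sketch, assuming the bytes are the Avro decimal encoding (big-endian two's-complement unscaled value) with BigQuery NUMERIC's scale of 9:

import java.math.{BigDecimal => JBigDecimal, BigInteger}
import org.apache.spark.sql.functions.{col, udf}

// Decode the unscaled bytes into a java.math.BigDecimal with scale 9,
// which Spark maps to a DecimalType column.
val decodeNumeric = udf((bytes: Array[Byte]) =>
  if (bytes == null) null else new JBigDecimal(new BigInteger(bytes), 9)
)

val decoded = df.withColumn("prop12", decodeNumeric(col("prop12")).cast("decimal(38,9)"))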

spark Dataframe string to Hive varchar

I read data from Oracle via a Spark JDBC connection into a DataFrame. I have a column which is, naturally, StringType in the DataFrame.
Now I want to persist this in Hive, but as datatype varchar(5). I know the string would be truncated, but that is ok.
I tried using UDFs, which didn't work since DataFrames do not have varchar or char types. I also created a temporary view using:
df.createOrReplaceTempView("t_name")
val df2 = spark.sql("select cast(col_name as varchar(5)) from t_name")
But then when I printSchema, I still see a string type.
How can I save it as a varchar column in the Hive table?
Try creating the Hive table ("dbName.tableName") with the required schema (varchar(5) in this case) and insert into it directly from the DataFrame, like below.
df.write.mode("append").insertInto("dbName.tableName")
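A minimal sketch of the full round trip (names are the placeholders used above; the storage format and a Hive-enabled SparkSession are assumptions):

// Create the target table with the varchar(5) column up front.
spark.sql("""
  CREATE TABLE IF NOT EXISTS dbName.tableName (
    col_name VARCHAR(5)
  )
  STORED AS PARQUET
""")

// insertInto resolves columns by position, so select them in the table's column order first.
df.select("col_name").write.mode("append").insertInto("dbName.tableName")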

Spark timestamp type is not getting accepted with hive timestamp

I have a Spark DataFrame which contains a timestamp field. I am storing the DataFrame into an HDFS location over which a Hive external table is created. The Hive table declares the field with the timestamp type, but while reading the data from the external location, Hive populates the timestamp field with blank values.
My Spark DataFrame query:
df.select($"ipAddress", $"clientIdentd", $"userId", to_timestamp(unix_timestamp($"dateTime", "dd/MMM/yyyy:HH:mm:ss Z").cast("timestamp")).as("dateTime"), $"method", $"endpoint", $"protocol", $"responseCode", $"contentSize", $"referrerURL", $"browserInfo")
Hive create table statement:
CREATE EXTERNAL TABLE `finalweblogs3`(
`ipAddress` string,
`clientIdentd` string,
`userId` string,
`dateTime` timestamp,
`method` string,
`endpoint` string,
`protocol` string,
`responseCode` string,
`contentSize` string,
`referrerURL` string,
`browserInfo` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:9000/streaming/spark/finalweblogs3'
I am not able to figure out why this is happening.
I resolved it by changing the storage format to Parquet.
I still don't know why it does not work with the CSV format.
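For reference, a sketch of the Parquet variant (the path comes from the question; the new table name and DDL are assumptions about how the table was recreated):

// Write the DataFrame as Parquet to the external table's location.
df.write.mode("append").parquet("hdfs://localhost:9000/streaming/spark/finalweblogs3")

// Recreate the external table as Parquet instead of the text SerDe (only two columns shown),
// keeping `dateTime` declared as timestamp.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS finalweblogs3_parquet (
    `ipAddress` string,
    `dateTime` timestamp
  )
  STORED AS PARQUET
  LOCATION 'hdfs://localhost:9000/streaming/spark/finalweblogs3'
""")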

Spark SQL - Cast to UUID of the Dataset Column throws Parse Exception

Dataset<Row> finalResult = df.selectExpr("cast(col1 as uuid())", "col2");
When we tried to cast the column in the dataset to UUID and persist it in Postgres, I see the following exception. Please suggest an alternative solution for converting the column in a dataset to UUID.
java.lang.RuntimeException: org.apache.spark.sql.catalyst.parser.ParseException:
DataType uuid() is not supported.(line 1, pos 21)
== SQL ==
cast(col1 as UUID)
---------------------^^^
Spark has no uuid type, so casting to one is just not going to work.
You can try to use database.column.type metadata property as explained in Custom Data Types for DataFrame columns when using Spark JDBC and SPARK-10849.
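If adjusting column metadata is not an option, another workaround (not from the linked answers; connection details are placeholders) is to keep the column as a string and let PostgreSQL coerce it server-side by adding stringtype=unspecified to the JDBC URL:

import org.apache.spark.sql.functions.col

// With stringtype=unspecified, the Postgres driver sends strings as untyped parameters,
// so the server casts them to the target column's uuid type on insert.
df.withColumn("col1", col("col1").cast("string"))
  .write
  .format("jdbc")
  .option("url", "jdbc:postgresql://<host>:5432/<db>?stringtype=unspecified")
  .option("dbtable", "target_table")  // existing table with a uuid column
  .option("user", "<user>")
  .option("password", "<password>")
  .mode("append")
  .save()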

Auto Cast parquet to Hive

I have a scenario where Spark infers the schema from the input file and writes Parquet files with Integer data types.
But we have tables in Hive where the fields are defined as BigInt. Right now there is no automatic conversion from int to long, and Hive throws errors that it cannot cast Integer to Long. I cannot change the Hive DDL to Integer data types, as it is a business requirement to have those fields as Long.
I have looked at the option of casting the data types before saving. This can be done, except that I have hundreds of columns and explicit casts make the code very messy.
Is there a way to tell Spark to auto cast data types?
Since Spark version 1.4 you can apply the cast method with a DataType on the column.
Suppose the dataframe df has a column year of type Long:
import org.apache.spark.sql.types.IntegerType
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
.drop("year")
.withColumnRenamed("yearTmp", "year")
If you are using sql expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
"make",
"model",
"comment",
"blank")
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
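To avoid listing hundreds of columns by hand (the concern in the question), one approach not shown in the answer above is to derive the casts from the DataFrame's own schema, casting every IntegerType column to LongType and passing the rest through unchanged:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType}

// Build one expression per column: cast ints to longs, keep everything else as-is.
val castedCols = df.schema.fields.map { f =>
  if (f.dataType == IntegerType) col(f.name).cast(LongType).as(f.name)
  else col(f.name)
}
val df2 = df.select(castedCols: _*)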
