Spark SQL - Cast to UUID of the Dataset Column throws Parse Exception - apache-spark

Dataset<Row> finalResult = df.selectExpr("cast(col1 as uuid())", "col2");
When we tried to cast a column of the Dataset to UUID and persist it in Postgres, I see the following exception. Please suggest an alternative way to convert a Dataset column to UUID.
java.lang.RuntimeException: org.apache.spark.sql.catalyst.parser.ParseException:
DataType uuid() is not supported.(line 1, pos 21)
== SQL ==
cast(col1 as UUID)
---------------------^^^

Spark has no uuid type, so casting to one is just not going to work.
You can try the database.column.type metadata property, as explained in Custom Data Types for DataFrame columns when using Spark JDBC and SPARK-10849.
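A minimal sketch of that metadata approach, assuming a Postgres target table whose col1 is declared as uuid. The metadata keys mirror the ones used in the "Spark SQL - Custom Datatype UUID" question further down; whether the JDBC writer honors them depends on your Spark version (see SPARK-10849), and the connection details are placeholders:
import java.sql.Types
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Tag the column so the JDBC writer can emit "uuid" as the Postgres column type.
val uuidMeta = new MetadataBuilder()
  .putString("database.column.type", "uuid")
  .putLong("jdbc.type", Types.OTHER)
  .build()

val withMeta = df.withColumn("col1", col("col1").as("col1", uuidMeta))

withMeta.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")   // placeholder
  .option("dbtable", "my_table")                            // placeholder
  .option("user", "user")
  .option("password", "password")
  .mode("append")
  .save()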

Related

DateTime datatype in BigQuery

I have a partitioned table where one of the columns is of type DateTime, and the table is partitioned on that same column. According to the spark-bigquery connector documentation, the corresponding Spark SQL type is String.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
I tried doing the same, but I am getting a datatype mismatch issue.
Code Snippet:
ZonedDateTime nowPST = ZonedDateTime.ofInstant(Instant.now(), TimeZone.getTimeZone("PST").toZoneId());
df = df.withColumn("createdDate", lit(nowPST.toLocalDateTime().toString()));
Error:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Failed to load to <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> in job JobId{project=<PROJECT_ID>, job=<JOB_ID>, location=US}. BigQuery error was Provided Schema does not match Table <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>. Field createdDate has changed type from DATETIME to STRING
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:156)
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:89)
... 36 more
As Spark has no support for DateTime, the BigQuery connector does not support writing DateTime - there is no equivalent Spark data type that can be used. We are exploring ways to augment the DataFrame's metadata in order to support the types which are supported by BigQuery and not by Spark (DateTime, Time, Geography).
At the moment please have this field as String, and have the conversion on the BigQuery side.
I am running into this issue now as well, with both Geography (https://community.databricks.com/s/question/0D58Y000099mPyDSAU/does-databricks-support-writing-geographygeometry-data-into-bigquery) and DateTime types. The only way I could get the table from Databricks to BigQuery (without creating a temporary table and inserting the data, which would still be costly due to the size of the table) was to write the table out as CSV to a GCS bucket:
results_df.write.format("csv").mode("overwrite").save("gs://<bucket-name>/ancillary_test")
And then load the data from the bucket into the BigQuery table, specifying the schema:
LOAD DATA INTO <dataset>.<tablename>(
PRICENODEID INTEGER,
ISONAME STRING,
PRICENODENAME STRING,
MARKETTYPE STRING,
GMTDATETIME TIMESTAMP,
TIMEZONE STRING,
LOCALDATETIME DATETIME,
ANCILLARY STRING,
PRICE FLOAT64,
CHANGE_DATE TIMESTAMP
)
FROM FILES (
format = 'CSV',
uris = ['gs://<bucket-name>/ancillary_test/*.csv']
);

SPARK read.jdbc & custom schema

With spark.read.format ... one can add the custom schema non-programmatically, like so:
val df = sqlContext
  .read
  .format("jdbc")
  .option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
  .option("user", "root")
  .option("password", "password")
  .option("dbtable", sql)
  .schema(customSchema)
  .load()
However, using spark.read.jdbc, I cannot seem to do the same, or find the syntax to do the same as in the above. What am I missing, or has this changed in Spark 2.x? I read this in the manual: ... Spark automatically reads the schema from the database table and maps its types back to Spark SQL types. ... Presumably what I am trying to do is no longer possible, as in the above example.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
I ended up trying this:
val dataframe_mysql = spark.read.schema(openPositionsSchema).jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
and got this:
org.apache.spark.sql.AnalysisException: User specified schema not supported with `jdbc`;
Seems a retrograde step in a certain way.
I do not agree with the answer.
You can supply a custom schema using your method or by setting connection properties:
connectionProperties.put("customSchema", schemachanges);
where schemachanges is a comma-separated list of "fieldName newDataType" pairs, for example:
"key String, value DECIMAL(20, 0)"
If key was a number in the original table, it will generate a SQL query like "key::character varying, value::numeric(20, 0)".
This is better than a cast, because a cast is a mapping operation executed after the value has been selected in its original type; a custom schema is not.
I had a case where Spark could not read NaN from a Postgres numeric column, because it maps numerics to java.math.BigDecimal, which does not allow NaN, so the Spark job failed every time it read those values. A cast produced the same result. However, after changing the schema to either String or Double, it was able to read the values properly.
Spark documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
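For example, a minimal sketch of this approach applied to the question's query. The customSchema read option only exists in newer Spark versions (2.3+), the connection details are placeholders, and the column names listed in customSchema must match the columns the query returns:
import java.util.Properties

val jdbcUrl = "jdbc:postgresql://localhost:5432/test"   // placeholder connection details
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "password")
// Override how the listed columns are typed on read; unlisted columns keep the inferred mapping.
connectionProperties.put("customSchema", "k STRING, v DECIMAL(20, 0)")

val df_custom = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e", connectionProperties)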
You can use a custom schema and pass it in the properties parameter. You can read more at https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Create a variable:
c_schema = 'id_type INT'
Properties conf:
config = {"user": "xxx",
          "password": "yyy",
          "driver": "com.mysql.jdbc.Driver",
          "customSchema": c_schema}
Read the table and create the DF:
df = spark.read.jdbc(url=jdbc_url,table='table_name',properties=config)
You must use the same column names; only the columns you put inside the custom schema will have their types changed.
"What am I missing, or has this changed in Spark 2.x?"
You aren't missing anything. Modifying the schema on read with JDBC sources was never supported. The input is already typed, so there is no place for a schema.
If the types are not satisfactory, just cast the results to the desired types.
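For instance, a sketch of casting right after the read, reusing the question's jdbcUrl and connectionProperties (names taken from the question, not new API):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, StringType}

// Read with whatever types Spark infers, then cast the columns you care about.
val raw = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e", connectionProperties)
val typed = raw
  .withColumn("k", col("k").cast(StringType))
  .withColumn("v", col("v").cast(DecimalType(20, 0)))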

spark Dataframe string to Hive varchar

I read data from Oracle via a Spark JDBC connection into a DataFrame. I have a column which is obviously StringType in the DataFrame.
Now I want to persist this in Hive, but as datatype varchar(5). I know the string would be truncated, but that is OK.
I tried using UDFs, which didn't work since DataFrames do not have varchar or char types. I also created a temporary view using:
df.createOrReplaceTempView("t_name")
val df2 = spark.sql("select cast(col_name as varchar(5)) from t_name")
But then when I printSchema, I still see a string type.
How can I save it as a varchar column in a Hive table?
Try creating the Hive table ("dbName.tableName") with the required schema (varchar(5) in this case) and insert into the table directly from the DataFrame, like below.
df.write.mode("append").insertInto("dbName.tableName")
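A sketch of the full flow, assuming a SparkSession built with enableHiveSupport() and hypothetical database/table/column names; the varchar(5) lives only in the Hive DDL, while the DataFrame column stays a StringType:
// Hypothetical names; the storage format is an assumption.
spark.sql(
  """CREATE TABLE IF NOT EXISTS dbName.tableName (
    |  col_name VARCHAR(5)
    |) STORED AS PARQUET""".stripMargin)

// Values longer than 5 characters may be truncated by Hive, which the question says is acceptable.
df.select("col_name").write.mode("append").insertInto("dbName.tableName")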

Spark SQL - Custom Datatype UUID

I am trying to convert a column in the Dataset from varchar to UUID using a custom datatype in Spark SQL, but I see the conversion is not happening. Please let me know if I am missing anything here.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder
import spark.implicits._

val secdf = sc.parallelize(Array(
  ("85d8b889-c793-4f23-93e9-ea18db640039", "Revenue"),
  ("85d8b889-c793-4f23-93e9-ea18db640038", "Income:123213"))).toDF("id", "report")
val metadataBuilder = new MetadataBuilder()
metadataBuilder.putString("database.column.type", "uuid")
metadataBuilder.putLong("jdbc.type", java.sql.Types.OTHER)
val metadata = metadataBuilder.build()
val secReportDF = secdf.withColumn("id", col("id").as("id", metadata))
As a workaround, since we are not able to cast to UUID in Spark SQL, I added the property stringtype=unspecified to the Postgres JDBC connection, which solved my issue inserting UUIDs through Spark JDBC.
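A sketch of what that workaround looks like on the write, with placeholder connection details and an assumed target table whose id column is declared as uuid in Postgres:
// With stringtype=unspecified, the Postgres JDBC driver sends string parameters with an
// unspecified type, so the server coerces them to the column's declared type (uuid here).
secReportDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb?stringtype=unspecified")   // placeholder
  .option("dbtable", "sec_report")                                                 // assumed table
  .option("user", "user")
  .option("password", "password")
  .mode("append")
  .save()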

Auto Cast parquet to Hive

I have a scenario where Spark infers the schema from the input file and writes parquet files with Integer data types.
But we have tables in Hive where the fields are defined as BigInt. Right now there is no conversion from int to long, and Hive throws errors that it cannot cast Integer to Long. I cannot change the Hive DDL to Integer data types, as it is a business requirement to have those fields as Long.
I have looked up the option of casting the data types before saving. This can be done, except that I have hundreds of columns and explicit casts make the code very messy.
Is there a way to tell Spark to auto-cast data types?
Since Spark version 1.4 you can apply the cast method with a DataType on the column.
Suppose the DataFrame df has a column year: Long.
import org.apache.spark.sql.types.IntegerType
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")
If you are using sql expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
"make",
"model",
"comment",
"blank")
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
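Since the question mentions hundreds of columns, here is a sketch (not a built-in Spark switch) that widens every IntegerType column to LongType in one pass by folding over the schema; the helper name is hypothetical:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType}

// Cast every IntegerType column to LongType so hundreds of columns need not be listed by hand.
def widenIntsToLongs(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    if (field.dataType == IntegerType) acc.withColumn(field.name, col(field.name).cast(LongType))
    else acc
  }

val widened = widenIntsToLongs(df)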
