Auto Cast parquet to Hive - apache-spark

I have a scenario where Spark infers the schema from the input file and writes Parquet files with Integer data types.
But our Hive tables define those fields as BigInt. There is no automatic conversion from int to long, and Hive throws errors that it cannot cast Integer to Long. I cannot change the Hive DDL to Integer data types because the business requirement is to keep those fields as Long.
I have looked at casting the data types before saving. That works, except that I have hundreds of columns and explicit casts make the code very messy.
Is there a way to tell Spark to auto-cast the data types?

Since Spark version 1.4 you can apply the cast method with a DataType on the column.
Suppose the DataFrame df has a column year: Long.
import org.apache.spark.sql.types.IntegerType

val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")
If you are using SQL expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year",
  "make",
  "model",
  "comment",
  "blank")
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
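Since the question mentions hundreds of columns, one way to avoid writing each cast by hand is to drive the casts from the Hive table's own schema. This is only a sketch: it assumes the target Hive table already exists and its column names match the DataFrame's, and db.target_table is a placeholder name.
import org.apache.spark.sql.functions.col

// take the target types from the existing Hive table (placeholder name)
val targetSchema = spark.table("db.target_table").schema

// cast every DataFrame column to the type Hive expects, in the table's column order
val casted = df.select(targetSchema.map(f => col(f.name).cast(f.dataType)): _*)
casted.write.insertInto("db.target_table")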

Related

Concatenating hive table after adding columns breaks spark read

Given the following table manipulation: create a table with two columns and two rows, add a third column, then insert a third row with three values:
CREATE TABLE concat_test(
one string,
two string
)
STORED AS ORC;
INSERT INTO TABLE concat_test VALUES (1,1), (2,2);
ALTER TABLE concat_test ADD COLUMNS (three string);
INSERT INTO TABLE concat_test VALUES (3,3,3);
ALTER TABLE concat_test CONCATENATE;
I get an exception Caused by: java.lang.ArrayIndexOutOfBoundsException: 3 when I try reading it with Spark:
spark.sql("select * from concat_test").collect()
It is obviously connected with the number of columns. I am still investigating the problem in ORC. I didn't find a quick fix for such partitions, nor the bug described anywhere else. Is there one?
Could anyone try this on the latest Hadoop versions? Does the bug still exist?
Hive 1.2.1, Spark 2.3.2
UPD: I fixed my tables via Hive myself. Hive queries work after this manipulation, so I created copies of the tables and did a select-insert of the old data into them.
I have totally run into this issue before!
This is a known issue.
Hive only does schema-on-read, so there is no reason for it to detect this as a problem; it will happily accept any table definition you give it. And the data underlying the table does NOT get updated when you change the definition of the Hive table. Generally I have fixed this by fixing the underlying ORC files to match the Hive definition. As a workaround, you can also read the ORC files directly, since that issue has been fixed there now.
Here's a workaround if you know that the underlying ORC files aren't in the correct format and you want to correct it.
import org.apache.spark.sql.types._

// create sample data and write it out as ORC
val s = Seq(("apple", "apples"), ("car", "cars"))
val t = Seq(("apple", 12), ("apples", 50), ("car", 5), ("cars", 40))
val df1 = sc.parallelize(t).toDF("Sub_Cat", "Count")
val df2 = sc.parallelize(s).toDF("Main_Cat", "Sub_Cat")
df1.write.format("orc").save("category_count")
df2.write.format("orc").save("categories")

// read the ORC files back with the schema you expect them to have
val schema = StructType(Array(
  StructField("Main_Cat", StringType, nullable = true),
  StructField("Sub_Cat", StringType, nullable = true),
  StructField("Count", IntegerType, nullable = true)))
val correctedSchema = spark.read.schema(schema).orc("category_count")
correctedSchema.show()
This lets you correct the schema into the format you intend. If you trust the Hive schema, you can use this shortcut to get it (and reduce the typing):
val schema = spark.sql("select * from concat_test limit 0").schema
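Putting the pieces together, a rough sketch of the select-insert fix described above (the warehouse path and the concat_test_fixed copy table are assumptions; the copy must be created with the full three-column definition):
// take the intended schema from the Hive metastore
val schema = spark.sql("select * from concat_test limit 0").schema

// read the underlying ORC files directly with that schema (placeholder path)
val fixed = spark.read.schema(schema).orc("/warehouse/path/concat_test")

// write the corrected rows into a fresh copy of the table
fixed.write.insertInto("concat_test_fixed")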

BigQuery exports NUMERIC data type as binary data type in AVRO

I am exporting data from a BigQuery table that has a column named prop12 defined as the NUMERIC data type. Please note that the destination format is AVRO and can't be changed.
bq extract --destination_format AVRO datasetName.myTableName /path/to/file-1-*.avro
When I read the Avro data with Spark, it is not able to convert this NUMERIC data type to Integer:
|-- prop12: binary (nullable = true)
cannot resolve 'CAST(`prop12` AS INT)' due to data type mismatch: cannot cast BinaryType to IntegerType
Is there any way I can specify that prop12 should be exported as Integer while doing bq extract?
OR
If it is not possible during bq export, am I left with only the option of reading the binary data in Spark?
Is there any way I can specify that prop12 should be exported as Integer while doing bq extract?
You can't do it in the extract command itself, but you can create a new temporary table with the cast applied and then extract that:
bq query --nouse_legacy_sql '
CREATE TABLE `my_dataset.my_temp_table`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
) AS
SELECT * REPLACE (CAST(prop12 AS INT64) AS prop12)
FROM `my_dataset.my_table`;
' && bq extract --destination_format AVRO my_dataset.my_temp_table /path/to/file-1-*.avro
Consider that this will generate additional cost.
If it is not possible during bq export, am I left with only the option of reading the binary data in Spark?
NUMERIC in BigQuery is a 16-byte decimal type (precision 38, scale 9), so it cannot be treated as a plain integer; it is possible to work with it as a decimal. Try casting it to decimal instead of int.
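If you do end up reading the binary column in Spark, here is a minimal sketch of decoding it yourself. It assumes the export uses Avro's decimal logical-type encoding (a big-endian two's-complement unscaled value) with NUMERIC's scale of 9; verify that against your files before relying on it.
import org.apache.spark.sql.functions.{col, udf}

// decode the bytes as an unscaled big-endian integer with scale 9 (assumed encoding)
val decodeNumeric = udf { (bytes: Array[Byte]) =>
  if (bytes == null) null
  else new java.math.BigDecimal(new java.math.BigInteger(bytes), 9)
}

val df = spark.read.format("avro").load("/path/to/file-1-*.avro")
df.withColumn("prop12", decodeNumeric(col("prop12"))).printSchema()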

Type change support in spark parquet read-write

I came across a problem while reading Parquet through Spark.
A Parquet file was written with field a of type Integer. Reading this file back with a schema that declares a as Long gives an exception:
Caused by: java.lang.UnsupportedOperationException: Unimplemented type: LongType
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:397)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:199)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:263)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:161)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
I thought this compatible type change was supported, but it is not working. Code snippet:
import org.apache.spark.sql.types._

val oldSchema = StructType(StructField("a", IntegerType, true) :: Nil)
val df1 = spark.read.schema(oldSchema).json("/path/to/json/data")
df1.write.parquet("/path/to/parquet/data")

val newSchema = StructType(StructField("a", LongType, true) :: Nil)
spark.read.schema(newSchema).parquet("/path/to/parquet/data").show()
Any help around this is really appreciated.
Parquet is a columnar storage format for Hadoop that stores the data type along with the data, so reading the Parquet back with a different data type is not handled automatically, even when it is a widening (upcast) change.
You need to cast the data explicitly:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType
val colArrayWithCast = Array(col("eid"), col("did"), col("seal").cast(LongType))
df.select(colArrayWithCast: _*).printSchema
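Applied to the question's own example, a minimal sketch is to read the file with the type it was written with and widen the column afterwards:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// read with the type stored in the Parquet footer (int), then cast to long
val df = spark.read.parquet("/path/to/parquet/data")
val widened = df.withColumn("a", col("a").cast(LongType))
widened.printSchema() // a: long (nullable = true)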

spark Dataframe string to Hive varchar

I read data from Oracle into a DataFrame via a Spark JDBC connection. I have a column that is, naturally, StringType in the DataFrame.
Now I want to persist this in Hive, but as data type varchar(5). I know the string will be truncated, but that is ok.
I tried using UDFs, which didn't work since DataFrames do not have varchar or char types. I also created a temporary view using:
df.createOrReplaceTempView("t_name")
val df2 = spark.sql("select cast(col_name as varchar(5)) from t_name")
But when I printSchema, I still see a string type.
How can I save it as a varchar column in the Hive table?
Try creating the Hive table ("dbName.tableName") with the required schema (varchar(5) in this case) and insert into it directly from the DataFrame, like below:
df.write.mode("append").insertInto("dbName.tableName")
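For example, a minimal sketch assuming a Hive-enabled SparkSession; dbName.tableName, col_name, and the ORC storage format are placeholders/assumptions:
// create the target Hive table with the varchar width you need (placeholder DDL)
spark.sql("""
  CREATE TABLE IF NOT EXISTS dbName.tableName (col_name VARCHAR(5))
  STORED AS ORC
""")

// insert from the DataFrame; the Hive table, not the DataFrame, carries the varchar(5) type
df.select("col_name").write.mode("append").insertInto("dbName.tableName")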

Spark SQL - Cast to UUID of the Dataset Column throws Parse Exception

Dataset<Row> finalResult = df.selectExpr("cast(col1 as uuid())", "col2");
When we try to cast the column in the Dataset to UUID and persist it in Postgres, I see the following exception. Please suggest an alternative way to convert a Dataset column to UUID.
java.lang.RuntimeException: org.apache.spark.sql.catalyst.parser.ParseException:
DataType uuid() is not supported.(line 1, pos 21)
== SQL ==
cast(col1 as UUID)
---------------------^^^
Spark has no uuid type, so casting to one is just not going to work.
You can try to use database.column.type metadata property as explained in Custom Data Types for DataFrame columns when using Spark JDBC and SPARK-10849.
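Another workaround sometimes used with Postgres (not from the linked answer; it relies on the PostgreSQL JDBC driver's stringtype connection parameter, and the URL, table name, and credentials below are placeholders) is to keep the column as a string in Spark and let Postgres coerce it to uuid on insert, assuming the target table already declares the column as uuid:
// stringtype=unspecified makes the driver send strings as untyped literals,
// so Postgres can cast them to the uuid column type on insert
df.selectExpr("cast(col1 as string) col1", "col2")
  .write
  .format("jdbc")
  .mode("append")
  .option("url", "jdbc:postgresql://host:5432/db?stringtype=unspecified") // placeholder URL
  .option("dbtable", "target_table") // placeholder table with a uuid column
  .option("user", "user") // placeholder credentials
  .option("password", "password")
  .save()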
