How to stop Spark from changing varchar to string? - apache-spark

I have a Hive table with the following schema:
hive> desc <DB>.<TN>;
id int,
name varchar(10),
reg varchar(8);
When I describe the same table in Spark (PySpark shell), it converts varchar to string.
spark.sql("""describe <DB>.<TN>""").show()
id int
name string
reg string
I would like to retain the Hive data types while querying in Spark; that is, I expect varchar in place of string. Does anyone know how to stop Spark from inferring data types on its own?

There is no varchar in Apache Spark; it's all strings. Yes, the documentation says there is a VarcharType, but it is only for schemas.
Once the data is in the dataframe, things are transparent. When you save the data, everything should be back to varchar in Hive.
You can force a schema when reading a dataframe where that is supported (CSV, for example), but I do not think it is available for Hive, which is already typed.

I was going to tell you to just add a schema
from pyspark.sql.types import StructType, StructField, IntegerType, VarcharType
schema = StructType([
    StructField('ID', IntegerType(), True),
    StructField('name', VarcharType(10), True),
    StructField('reg', VarcharType(8), True),
])
df3 = sqlContext.createDataFrame(rdd, schema)
to a dataframe, but DataFrames do not have a varchar type in Spark <= 2.4, which is likely why your varchars are being converted to StringType. That isn't to say they aren't available in later versions (Spark > 2.4).

Related

Hive data type for large values

I am learning Hive and Spark.
Which data type should be used to store values like 1.00E+07 and -1.00E+08?
I have tried bigint, but the value is stored as null.
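Not an authoritative answer, but 1.00E+07 is a scientific-notation (floating-point) literal, which is likely why the BIGINT conversion yields null; DOUBLE (or DECIMAL, if exact arithmetic matters) is the usual fit. A hedged HiveQL sketch:

```sql
-- A string in scientific notation is not a valid integer, so BIGINT gives NULL
-- (matching the behavior described above):
SELECT CAST('1.00E+07' AS BIGINT);           -- NULL
-- DOUBLE parses scientific notation:
SELECT CAST('1.00E+07' AS DOUBLE);
-- DECIMAL(precision, scale) if exact values are required:
SELECT CAST('1.00E+07' AS DECIMAL(18, 2));
```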

Schema mismatch - Spark DataFrame written to Delta

When writing a dataframe to delta format, the resulting delta does not seem to follow the schema of the dataframe that was written. Specifically, the 'nullable' property of a field seems to be always 'true' in the resulting delta regardless of the source dataframe schema. Is this expected or am I making a mistake here? Is there a way to get the schema of the written delta to match exactly with the source df?
scala> df.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,false), StructField(val1,StringType,true), StructField(val2,StringType,false), StructField(dt,StringType,true))
scala> df.write.format("delta").save("D:/temp/d1")
scala> spark.read.format("delta").load("D:/temp/d1").schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,true), StructField(val1,StringType,true), StructField(val2,StringType,true), StructField(dt,StringType,true))
Parquet, the underlying format of Delta Lake, cannot guarantee the nullability of a column.
You may have written a Parquet file whose column happens to contain no nulls, but the schema is never validated on write in Parquet, and anyone could append data with the same schema that does contain nulls. So Spark always marks the columns as nullable, as a precaution.
This behavior can be prevented by using a catalog, which validates that the dataframe follows the expected schema.
The problem is that a lot of users thought that their schema was not nullable, and wrote null data. Then they couldn't read the data back as their parquet files were corrupted. In order to avoid this, we always assume the table schema is nullable in Delta. In Spark 3.0, when creating a table, you will be able to specify columns as NOT NULL. This way, Delta will actually prevent null values from being written, because Delta will check that the columns are in fact not null when writing it.
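To illustrate the Spark 3.0 behavior described above, the constraint is declared in the table DDL, after which Delta rejects writes containing nulls in that column. A sketch only; the table and column names are made up:

```sql
-- Hypothetical example: declare a column NOT NULL at table creation
-- (Spark 3.0+ with Delta). Writes that would put NULL into device_id
-- then fail Delta's constraint check instead of silently succeeding.
CREATE TABLE events (
  device_id STRING NOT NULL,
  val1      STRING
) USING delta;
```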

Accessing Hive Tables from Spark SQL when Data is Stored in Object Storage

I am using spark dataframe writer to write the data in internal hive tables in parquet format in IBM Cloud Object Storage.
So, my Hive metastore is in an HDP cluster and I am running the Spark job from the HDP cluster. This Spark job writes the data to IBM COS in Parquet format.
This is how I am starting the spark session
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
.config("hive.metastore.uris", "<thrift_url>")
.config("spark.sql.sources.bucketing.enabled", true)
.enableHiveSupport()
.master("yarn").getOrCreate();
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key",credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id",credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint",credentials.get(ConnectionConstants.COS_ENDPOINT));
The issue I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:
spark.sql("select * from partitioned_table").show
To fetch the data from the partitioned table, I have to load the dataframe and register it as a temp table and then query it.
The above issue does not occur when the table is not partitioned.
The code to write the data is this
dfWithSchema.orderBy(sortKey).write()
.partitionBy("somekey")
.mode("append")
.format("parquet")
.option("path",PARQUET_PATH+tableName )
.saveAsTable(tableName);
Any idea why the direct query approach is not working for partitioned tables in COS/Parquet?
To read a partitioned table created by Spark, you need to give the absolute path of the table, as below.
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
To filter it further, try the approach below.
selected_Data.where(col("column_name")=='col_value').show()
This issue occurs when the property hive.metastore.try.direct.sql is set to true on the HiveMetastore configurations and the SparkSQL query is run over a non STRING type partition column.
For Spark, it is recommended to create tables with partition columns of STRING type.
If you get the error message below while filtering a Hive partitioned table in Spark:
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
recreate your Hive partitioned table with the partition column's data type as string; then you will be able to access the data directly from Spark SQL.
Otherwise, you have to specify the absolute path of the HDFS location to get the data, in case your partition column has been defined as varchar:
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
However, I was not able to understand why it differentiates between the varchar and string data types for a partition column.
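As a concrete sketch of the recommendation above (table and column names are illustrative, not from the original post): recreate the table so the partition column is STRING rather than varchar, which keeps metastore-side partition filtering working when hive.metastore.try.direct.sql is enabled:

```sql
-- Illustrative only: the partition column is declared STRING so that
-- metastore direct-SQL partition filtering works from Spark SQL.
CREATE TABLE partitioned_table_str (
  id   INT,
  name STRING
)
PARTITIONED BY (somekey STRING)
STORED AS PARQUET;
```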

Upgrading to Spark 2.0.1 broke array<string> in parquet DataFrame

I have a table with a few columns, some of which are arrays. Since upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always null when reading in a DataFrame.
When writing the Parquet files, the schema of the column is specified as
StructField("packageIds",ArrayType(StringType)).
The schema of the column in the Hive Metastore is
packageIds array<string>
The schema used in the writer exactly matches the schema in the Metastore.
The query is a simple "select *"
spark.sql("select * from tablename limit 1").collect() // null columns in Row
How can I debug this issue? Notable things I've already investigated:
It works in Spark 1.6.
I've inspected the parquet files using parquet-tools and can see the data.
I also have another table written in exactly the same way and it doesn't have the issue.

How to read a Hive DOUBLE value stored in Avro logical format using Spark

I have existing Hive data stored in Avro format. For whatever reason, reading this data by executing SELECT is very slow; I haven't yet figured out why. The data is partitioned, and my WHERE clause always follows the partition columns. So I decided to read the data directly by navigating to the partition path and using Spark SQLContext. This works much faster. However, the problem I have is reading the DOUBLE values: Avro stores them in a binary format.
When I execute the following query in Hive:
select myDoubleValue from myTable;
I'm getting the correct expected values
841.79
4435.13
.....
but the following Spark code:
import com.databricks.spark.avro._  // spark-avro package, needed for the .avro() reader
val path = "PathToMyPartition"
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.avro(path)
df.select("myDoubleValue").rdd.map(x => x.getAs[Double](0))
gives me this exception
java.lang.ClassCastException : [B cannot be cast to java.lang.Double
What would be the right way either to provide a schema or convert the value that is stored in a binary format into a double format?
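One workaround (a sketch, not necessarily the right fit for your pipeline): Avro's decimal logical type stores the unscaled value as big-endian two's-complement bytes, so the byte array Spark hands back can be decoded by hand. The helper below is a plain-Python illustration of that decoding; the scale (18 in your schema, 2 in this example) must come from the Avro schema:

```python
from decimal import Decimal

def decode_avro_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode an Avro 'decimal' logical-type value: the bytes are the
    unscaled integer in big-endian two's complement; divide by 10**scale."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# 841.79 at scale 2 has unscaled value 84179 == 0x0148D3:
print(decode_avro_decimal(b"\x01\x48\xd3", 2))  # 841.79
```

In Spark this could be wrapped in a UDF applied to the column read back as BinaryType.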
I found a partial solution for converting the Avro schema to a Spark SQL StructType. Databricks' com.databricks.spark.avro.SchemaConverters has a bug in its toSqlType(avroSchema: Schema) method: it incorrectly converts Avro logical types, turning
{"name":"MyDecimalField","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":18}],"doc":"","default":null}
into
StructField("MyDecimalField",BinaryType,true)
I fixed this bug in my local version of the code and now it is converting into
StructField("MyDecimalField",DecimalType(38,18),true)
Now, the following code reads the Avro file and creates a Dataframe:
val avroSchema = new Schema.Parser().parse(QueryProvider.getQueryString(pathSchema))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.schema(MyAvroSchemaConverter.toSqlType(avroSchema).dataType.asInstanceOf[StructType]).avro(path)
However, when I select the field that I expect to be decimal with
df.select("MyDecimalField")
I'm getting the following exception:
scala.MatchError: [B#3e6e0d8f (of class [B)
This is where I am stuck at this time; I would appreciate it if anyone can suggest what to do next, or any other workaround.
