spark Dataframe string to Hive varchar - apache-spark

I read data from Oracle into a DataFrame over a Spark JDBC connection. One column is, as expected, StringType in the DataFrame.
Now I want to persist it in Hive with the datatype varchar(5). I know the string would be truncated, but that is fine.
I tried using UDFs, which didn't work because a DataFrame has no varchar or char types. I also created a temporary view and cast the column in Spark SQL:
df.createOrReplaceTempView("t_name")
val casted = spark.sql("select cast(col_name as varchar(5)) from t_name")
But when I call printSchema on the result, I still see a string type.
How can I save it as a varchar column in a Hive table?

Try creating the Hive table (dbName.tableName) with the required schema (varchar(5) in this case) and insert into it directly from the DataFrame, as below (PySpark syntax).
df.write.insertInto("dbName.tableName", overwrite=False)
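The same idea can be sketched in Scala. This is a minimal sketch, not a definitive implementation: it assumes a SparkSession with Hive support, reuses the placeholder names dbName.tableName and col_name from above, and uses a small hypothetical DataFrame in place of the Oracle read. The ORC storage format is also an assumption. The column is truncated explicitly with substring, so the write does not depend on how over-length values are handled for varchar columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

val spark = SparkSession.builder()
  .appName("varchar-insert-sketch")
  .enableHiveSupport() // needed so the table is created in the Hive metastore
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the DataFrame read from Oracle over JDBC.
val df = Seq("abcdefgh", "xyz").toDF("col_name")

// Declare the varchar(5) column on the Hive side; the DataFrame itself stays StringType.
spark.sql("CREATE DATABASE IF NOT EXISTS dbName")
spark.sql("CREATE TABLE IF NOT EXISTS dbName.tableName (col_name VARCHAR(5)) STORED AS ORC")

// Truncate to 5 characters (substring positions are 1-based).
val truncated = df.select(substring(col("col_name"), 1, 5).as("col_name"))

// insertInto matches columns by position, so the select order must follow the table definition.
truncated.write.mode("append").insertInto("dbName.tableName")
```

printSchema on the DataFrame will still report string; the varchar(5) type lives only in the table definition.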

Related

Concatenating hive table after adding columns breaks spark read

Given some table manipulation (create a table with 2 rows and 2 columns, add a 3rd column, insert a third row with 3 values, then concatenate):
CREATE TABLE concat_test(
one string,
two string
)
STORED AS ORC;
INSERT INTO TABLE concat_test VALUES (1,1), (2,2);
ALTER TABLE concat_test ADD COLUMNS (three string);
INSERT INTO TABLE concat_test VALUES (3,3,3);
alter table concat_test concatenate;
I get an exception, Caused by: java.lang.ArrayIndexOutOfBoundsException: 3, when I try to read the table with Spark:
spark.sql("select * from concat_test").collect()
It is obviously connected to the number of columns. I'm investigating the problem further in ORC. I found neither a quick fix for such partitions nor a description of the bug elsewhere. Is there one?
Could anyone try this on the latest Hadoop versions? Does the bug still exist?
Hive 1.2.1, Spark 2.3.2
UPD. I fixed my tables myself via Hive. Hive queries still work after this manipulation, so I created copy tables and did a select-insert of the old data into them.
I have totally run into this issue before!
This is a known issue.
Hive only does schema on read, so it has no reason to detect this as a problem and will happily accept any table definition you give it. The data underlying the table does NOT get updated when you change the definition of the Hive table. Generally I have fixed the issue by rewriting the underlying ORC files to match the Hive definition. As a workaround, you could also read the ORC files directly, since that issue has been fixed now.
Here's a workaround if you know that the underlying ORC files aren't in the correct format and you want to correct it.
val s = Seq(("apple","apples"),("car","cars")) // create data
val t = Seq(("apple",12),("apples", 50),("car",5),("cars",40))// create data
val df1 = sc.parallelize(t).toDF("Sub_Cat", "Count")
val df2 = sc.parallelize(s).toDF("Main_Cat","Sub_Cat")
df1.write.format("orc").save("category_count")
df2.write.format("orc").save("categories")
import org.apache.spark.sql.types._
// schema you want to impose on the files; columns missing from the files come back as null
val schema = StructType(Array(
StructField("Main_Cat", StringType, nullable = true),
StructField("Sub_Cat", StringType, nullable = true),
StructField("Count", IntegerType, nullable = true)))
val CorrectedSchema = spark.read.schema(schema).orc("category_count")
CorrectedSchema.show()
This forces the schema into the format you intend. If you trust the Hive schema, you can use this trick to get the schema from the table (and reduce the typing):
val schema = spark.sql("select * from concat_test limit 0").schema

InsertInto(tablename) always saving Dataframe in default database in Hive

Hi, I have 2 tables in Hive. I select data from the first table into a DataFrame and save that DataFrame into the second table in ORC format. Both tables are in the same database.
When I save the DataFrame into the 2nd table, I get a "table not found in database" error. If I don't specify a database name, the DataFrame is always created and saved in the Hive default database. Can someone please explain why it doesn't use my user-defined database and always falls back to default? Below is the code I'm using. I'm on HDP.
//creating hive session
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(sparksession).build()
hive.setDatabase("dbname")
var a= "SELECT 'all columns' from dbname.tablename"
val a1=hive.executeQuery(a)
a1.write
.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
.option("database", "dbname")
.option("table", "table_name")
.mode("Append")
.insertInto("dbname.table_name")
If I use insertInto(table_name) instead of insertInto(dbname.table_name), the DataFrame is saved in the default database. But if I give dbname.table_name, I get a table-not-found error.
I also tried the same with a dbSession:
val dbSession = HiveWarehouseSession.session(sparksession).build()
dbSession.setDatabase("dbname")
Note: My second table(target table where I'm writing data) is a partitioned and bucketed table.
Use partitionBy(...). Your second table (the target table you are writing to) is partitioned and bucketed, so add .partitionBy(<list cols>) to the write:
a1.write
.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
.option("database", "dbname")
.option("table", "table_name")
.mode("Append")
.insertInto("dbname.table_name")

how to insert dataframe having map column in hive table

I have a DataFrame with multiple columns, one of which is of type map(string,string). I can print this DataFrame, and the map column shows data such as Map("PUN" -> "Pune"). I want to write this DataFrame to a Hive table (stored as Avro) that has the same column with type map.
val df2 = Df.withColumn("cname", lit("Pune"))
.withColumn("city_code_name", map(lit("PUN"), col("cname")))
df2.show(false)
//table - created external hive table..stored as avro..with avro schema
After removing the map column I'm able to save the DataFrame to the Hive Avro table.
How I save to the Hive table:
spark.save - saving avro file
spark.sql - creating partition on hive table with avro file location
See this test case from the Spark tests as an example:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want the save option, you can try saveAsTable as shown here:
Seq(9 -> "x").toDF("i", "j")
.write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
You can achieve that with saveAsTable.
Example (PySpark):
Df\
.write\
.saveAsTable(name='tableName',
format='com.databricks.spark.avro',
mode='append',
path='avroFileLocation')
Change the mode option to whatever suits you.

hive external table on parquet not fetching data

I am trying to create a data pipeline where the incoming data is stored as Parquet. I create an external Hive table on top of it so that users can query the Hive table and retrieve data. I am able to save the Parquet data and read it back directly, but when I query the Hive table it returns no rows. I used the following test setup:
--CREATE EXTERNAL HIVE TABLE
create external table emp (
id double,
hire_dt timestamp,
user string
)
stored as parquet
location '/test/emp';
Then I created a DataFrame from some data and saved it to Parquet.
---Create dataframe and insert DATA
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val schema = List(("id", "double"), ("hire_dt", "date"), ("user", "string"))
val newCols= schema.map ( x => col(x._1).cast(x._2))
val newDf = employeeDf.select(newCols:_*)
newDf.write.mode("append").parquet("/test/emp")
newDf.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+-------+----+
|id |hire_dt|user|
+---+-------+----+
+---+-------+----+
As shown above, I can see the data if I read from Parquet directly, but not from Hive. What am I doing wrong that Hive isn't getting the data? I thought MSCK REPAIR might help, but when I run msck repair table I get an error saying the table is not partitioned.
Based on your create table statement, the table location is /test/emp, but you are writing the data to /tenants/gwm/idr/emp. So there will be no data at /test/emp.
--CREATE EXTERNAL HIVE TABLE
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/test/emp';
Please re-create the external table as
--CREATE EXTERNAL HIVE TABLE
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/tenants/gwm/idr/emp';
Apart from the answer given by Ramdev below, you also need to be careful to use the correct datatype for date/timestamp columns, as the 'date' type is not supported by Parquet when creating a Hive table.
For that, you can change the 'date' type of column 'hire_dt' to 'timestamp'.
Otherwise there will be a mismatch between the data you persist through Spark and the data you try to read in Hive (or Hive SQL). Keeping it as 'timestamp' in both places will resolve the issue. I hope it helps.
Do you have enableHiveSupport() in your SparkSession builder() statement? Are you able to connect to the Hive metastore? Try running show tables/databases in your code to see whether you can list the tables present at your Hive location.
I got this working with the change below.
import org.apache.spark.sql.types.{DoubleType, TimestampType}
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
So basically the issue was a datatype mismatch, and somehow the cast in the original code doesn't seem to work. So I did an explicit cast, and then the write goes fine and I'm able to query the data back as well. Logically both do the same thing; I'm not sure why the original code is not working.
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
dfTransformed.write.mode("append").parquet("/test/emp")
dfTransformed.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
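One visible difference between the two versions is that the original schema list cast hire_dt to "date", while the table declares hire_dt as timestamp and the working version casts to TimestampType. A minimal sketch, assuming that mismatch was the cause, keeps the original list-driven approach and only changes the type name to "timestamp". The local master and the app name are assumptions for running it standalone; nothing is written to /test/emp here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType, TimestampType}

val spark = SparkSession.builder()
  .appName("cast-sketch")
  .master("local[*]") // assumption: local run; drop this when submitting to a cluster
  .getOrCreate()
import spark.implicits._

val employeeDf = Seq(("1", "2018-01-01", "John"), ("2", "2018-12-01", "Adam"))
  .toDF("id", "hire_dt", "user")

// Same list-driven cast as in the question, with "timestamp" to match the Hive column type.
val schema = List(("id", "double"), ("hire_dt", "timestamp"), ("user", "string"))
val newCols = schema.map { case (name, tpe) => col(name).cast(tpe) }
val newDf = employeeDf.select(newCols: _*)

// The Parquet files written from newDf would now carry a timestamp column.
newDf.printSchema()
```

Checking printSchema before writing is a cheap way to confirm that the types in the files will match the table definition.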

Spark SQL returns null for a column in HIVE table while HIVE query returns non null values

I have a Hive table created on top of S3 data in Parquet format, partitioned by one column named eventdate.
1) A Hive query returns data for a column named "headertime", which is in the schema of BOTH the table and the file.
select headertime from dbName.test_bug where eventdate=20180510 limit 10
2) From a Scala notebook, loading a file directly from a particular partition also works:
val session = org.apache.spark.sql.SparkSession.builder
.appName("searchRequests")
.enableHiveSupport()
.getOrCreate;
val searchRequest = session.sqlContext.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510")
searchRequest.createOrReplaceTempView("SearchRequest")
val exploreDF = session.sql("select headertime from SearchRequest where SearchRequestHeaderDate='2018-05-10' limit 100")
exploreDF.show(20)
This also displays the values for the column "headertime".
3) But when I use Spark SQL to query the Hive table directly, as below,
val exploreDF = session.sql("select headertime from tier3_vsreenivasan.test_bug where eventdate=20180510 limit 100")
exploreDF.show(20)
it always returns null.
I opened the Parquet file and can see that the column headertime is present with values, but I'm not sure why Spark SQL is not able to read the values for that column.
It would be helpful if someone could point out where Spark SQL gets the schema from. I was expecting it to behave like the Hive query.
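One thing that may be worth checking, as a guess rather than a confirmed diagnosis: the Hive metastore stores column names in lower case, while Parquet files keep the case they were written with, so a column can look the same in both schemas and still differ by case. The sketch below is a small hypothetical helper (caseOnlyMismatches is not a Spark API) that lists names differing only by case. The column names in the example are made up; in practice the two lists would come from session.table("dbName.test_bug").schema.fieldNames and from the schema.fieldNames of the DataFrame read directly from the partition path.

```scala
// Hypothetical helper: pairs of (table column, file column) that match only when case is ignored.
def caseOnlyMismatches(tableCols: Seq[String], fileCols: Seq[String]): Seq[(String, String)] =
  for {
    t <- tableCols
    f <- fileCols
    if t != f && t.equalsIgnoreCase(f)
  } yield (t, f)

// Made-up column names for illustration only.
val tableCols = Seq("headertime", "eventdate")
val fileCols = Seq("headerTime", "eventdate")

val mismatches = caseOnlyMismatches(tableCols, fileCols)
mismatches.foreach { case (t, f) =>
  println(s"table column '$t' differs only by case from file column '$f'")
}
```

An empty result would rule this explanation out, and the two schemas would then need to be compared for type differences instead.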
