Create table stored as Parquet and compressed with Snappy does not work - apache-spark

I tried to save data to HDFS as Snappy-compressed Parquet:
spark.sql("drop table if exists onehands.parquet_snappy_not_work")
spark.sql(""" CREATE TABLE onehands.parquet_snappy_not_work (`trans_id` INT) PARTITIONED by ( `year` INT) STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY") """)
spark.sql("""insert into onehands.parquet_snappy_not_work values (20,2021)""")
spark.sql("drop table if exists onehands.parquet_snappy_works_well")
val df = spark.createDataFrame(Seq(
(20, 2021)
)) toDF("trans_id", "year")
df.show()
df.write.format("parquet").partitionBy("year").mode("append").option("compression","snappy").saveAsTable("onehands.parquet_snappy_works_well")
but it does not work with the pre-created table.
For onehands.parquet_snappy_not_work, the file name does not end with .snappy.parquet,
while onehands.parquet_snappy_works_well looks fine:
[***]$ hadoop fs -ls /data/spark/warehouse/onehands.db/parquet_snappy_works_well/year=2021
/data/spark/warehouse/onehands.db/parquet_snappy_works_well/year=2021/part-00000-f5ec0f2d-525f-41c9-afee-ce5589ddfe27.c000.snappy.parquet
[****]$ hadoop fs -ls /data/spark/warehouse/onehands.db/parquet_snappy_not_work/year=2021
/data/spark/warehouse/onehands.db/parquet_snappy_not_work/year=2021/part-00000-85e2a7a5-c281-4960-9786-4c0ea88faf15.c000
I also tried adding some properties:
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec;
but it still does not work.
By the way, the SQL I got from "show create table onehands.parquet_snappy_works_well", i.e.
CREATE TABLE `onehands`.`parquet_snappy_works_well` (`trans_id` INT, `year` INT) USING parquet OPTIONS ( `compression` 'snappy', `serialization.format` '1' ) PARTITIONED BY (year)
cannot be run with spark-sql.
Spark version: 2.3.1
Hadoop version: 2.9.2
What is the problem with my code? Thanks for your help.

Related

InsertInto(tablename) always saving Dataframe in default database in Hive

Hi, I have two tables in Hive. I select data from the first table into a DataFrame and save that DataFrame into the second table in ORC format. Both tables are in the same database.
When I save the DataFrame into the second table, I get a "table not found in database" error. If I do not give a database name, the DataFrame is always created and saved in the Hive default database. Can someone please explain why it does not use the user-defined database and always falls back to default? Below is the code I am using. I am on HDP.
//creating hive session
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(sparksession).build()
hive.setDatabase("dbname")
var a= "SELECT 'all columns' from dbname.tablename"
val a1=hive.executeQuery(a)
a1.write
.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
.option("database", "dbname")
.option("table", "table_name")
.mode("Append")
.insertInto("dbname.table_name")
If I use insertInto(table_name) instead of insertInto(dbname.table_name), the DataFrame is saved in the default database. But if I give dbname.tablename, it reports that the table is not found in the database.
I also tried the same thing using a dbSession:
val dbSession = HiveWarehouseSession.session(sparksession).build()
dbSession.setDatabase("dbname")
Note: my second table (the target table I am writing to) is a partitioned and bucketed table.
2. partitionBy(...)
The second table (the target table being written to) is partitioned and bucketed, so add .partitionBy(<list cols>) to this write:
a1.write
.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
.option("database", "dbname")
.option("table", "table_name")
.mode("Append")
.insertInto("dbname.table_name")

Error while exchanging partition in hive tables

I am trying to merge the incremental data with an existing hive table.
For testing I created a dummy table from the base table as below:
create table base.dummytable like base.fact_table
The table base.fact_table is partitioned on dbsource (string).
When I checked the dummy table's DDL, I could see that the partition column is correctly defined.
PARTITIONED BY (
`dbsource` string)
Then I tried to exchange one of the partitions of the dummy table, dropping it first:
spark.sql("alter table base.dummy drop partition(dbsource='NEO4J')")
The partition NEO4J was dropped successfully, and I ran the exchange statement as below:
spark.sql("ALTER TABLE base.dummy EXCHANGE PARTITION (dbsource = 'NEO4J') WITH TABLE stg.inc_labels_neo4jdata")
The exchange statement is giving an error:
Error: Error while compiling statement: FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {dbsource=NEO4J}
The table I am trying to push the incremental data into is partitioned by dbsource, and I dropped that partition successfully.
I am running this from spark code and the config is given below:
val conf = new SparkConf().setAppName("MERGER").set("spark.executor.heartbeatInterval", "120s")
.set("spark.network.timeout", "12000s")
.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
.set("spark.shuffle.compress", "true")
.set("spark.shuffle.spill.compress", "true")
.set("spark.sql.orc.filterPushdown", "true")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max", "512m")
.set("spark.serializer", classOf[org.apache.spark.serializer.KryoSerializer].getName)
.set("spark.streaming.stopGracefullyOnShutdown", "true")
.set("spark.dynamicAllocation.enabled", "false")
.set("spark.shuffle.service.enabled", "true")
.set("spark.executor.instances", "4")
.set("spark.executor.memory", "4g")
.set("spark.executor.cores", "5")
.set("hive.merge.sparkfiles","true")
.set("hive.merge.mapfiles","true")
.set("hive.merge.mapredfiles","true")
show create table base.dummy:
CREATE TABLE `base`.`dummy`(
`dff_id` bigint,
`dff_context_id` bigint,
`descriptive_flexfield_name` string,
`model_table_name` string)
PARTITIONED BY (`dbsource` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'/apps/hive/warehouse/base.db/dummy'
TBLPROPERTIES (
'orc.compress'='ZLIB')
show create table stg.inc_labels_neo4jdata:
CREATE TABLE `stg`.`inc_labels_neo4jdata`(
`dff_id` bigint,
`dff_context_id` bigint,
`descriptive_flexfield_name` string,
`model_table_name` string,
`dbsource` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'/apps/hive/warehouse/stg.db/inc_labels_neo4jdata'
TBLPROPERTIES (
'orc.compress'='ZLIB')
Could anyone let me know what mistake I am making here and what I should change in order to exchange the partition successfully?
My take on this error is that table stg.inc_labels_neo4jdata is not partitioned the way base.dummy is, and therefore there is no partition to move.
From Hive documentation:
This statement lets you move the data in a partition from a table to
another table that has the same schema and does not already have that
partition.
You can check the Hive DDL Manual for EXCHANGE PARTITION
And the JIRA where this feature was added to Hive. You can read:
This only works if [the source table] and [the destination table] have the
same field schemas and the same partition by parameters. If they do not
the command will throw an exception.
You basically need to have exactly the same schema on both source_table and destination_table.
Per your last edit, this is not the case.
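As a rough illustration of the rule quoted above, the following plain Scala sketch compares two table definitions and reports where they differ. `TableDef` and `ExchangeCheck` are hypothetical names made up for this example, not Hive or Spark APIs, and the staging table's shape (dbsource as an ordinary column, no partitioning) is an assumption based on the error message.

```scala
// Hypothetical, simplified model of a table definition; not a Hive or Spark API.
final case class TableDef(
    name: String,
    columns: Seq[(String, String)],         // (column name, type), in declared order
    partitionColumns: Seq[(String, String)] // empty when the table is not partitioned
)

object ExchangeCheck {
  // Lists the reasons the two definitions differ; an empty result means they match.
  def problems(dest: TableDef, source: TableDef): Seq[String] = {
    val columnProblem =
      if (dest.columns == source.columns) Seq.empty[String]
      else Seq(s"column lists differ between ${dest.name} and ${source.name}")
    val partitionProblem =
      if (dest.partitionColumns == source.partitionColumns) Seq.empty[String]
      else Seq(s"partition columns differ between ${dest.name} and ${source.name}")
    columnProblem ++ partitionProblem
  }

  private val dataColumns = Seq(
    "dff_id" -> "bigint",
    "dff_context_id" -> "bigint",
    "descriptive_flexfield_name" -> "string",
    "model_table_name" -> "string"
  )

  // base.dummy as shown in the question: partitioned by dbsource.
  val dummy: TableDef =
    TableDef("base.dummy", dataColumns, Seq("dbsource" -> "string"))

  // Assumed shape of the staging table: dbsource is an ordinary column, no partitioning.
  val staging: TableDef =
    TableDef("stg.inc_labels_neo4jdata", dataColumns :+ ("dbsource" -> "string"), Seq.empty)
}
```

Calling `ExchangeCheck.problems(ExchangeCheck.dummy, ExchangeCheck.staging)` reports both a column difference and a partition difference under these assumptions. Recreating the staging table with the same `PARTITIONED BY` clause as the destination would make both checks pass.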

hive external table on parquet not fetching data

I am trying to create a data pipeline where the incoming data is stored as Parquet, and I create an external Hive table so that users can query it and retrieve data. I am able to save the Parquet data and read it back directly, but when I query the Hive table it does not return any rows. I did the following test setup:
--CREATE EXTERNAL HIVE TABLE
create external table emp (
id double,
hire_dt timestamp,
user string
)
stored as parquet
location '/test/emp';
I then created a DataFrame from some data and saved it as Parquet.
---Create dataframe and insert DATA
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val schema = List(("id", "double"), ("hire_dt", "date"), ("user", "string"))
val newCols= schema.map ( x => col(x._1).cast(x._2))
val newDf = employeeDf.select(newCols:_*)
newDf.write.mode("append").parquet("/test/emp")
newDf.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+-------+----+
|id |hire_dt|user|
+---+-------+----+
+---+-------+----+
As shown above, I am able to see the data if I read from Parquet directly, but not from Hive. What am I doing wrong that Hive isn't getting the data? I thought MSCK REPAIR might be relevant, but when I try to run msck repair table I get an error saying the table is not partitioned.
Based on your create table statement, you have used /test/emp as the location, but you are writing the data to /tenants/gwm/idr/emp. So there will be no data at /test/emp.
--CREATE EXTERNAL HIVE TABLE
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/test/emp';
Please re-create the external table as:
--CREATE EXTERNAL HIVE TABLE
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/tenants/gwm/idr/emp';
Apart from the answer given by Ramdev below, you also need to be careful to use the correct datatype for date/timestamp columns, as the 'date' type is not supported by Parquet when creating a Hive table.
To handle that, you can change the type of column 'hire_dt' from 'date' to 'timestamp'.
Otherwise there will be a mismatch between the data you persist through Spark and what you try to read in Hive (or Hive SQL). Keeping it as 'timestamp' in both places will resolve the issue. I hope it helps.
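To make the mismatch described above concrete, here is a small plain Scala sketch that compares the column types declared in the Hive DDL with the types the question's code casts to before writing. `SchemaDiff` is a hypothetical helper written for this illustration, not part of Spark or Hive, and the schemas are typed in by hand from the question rather than read from a metastore.

```scala
object SchemaDiff {
  // Returns (column, declaredType, writtenType) for every column whose types disagree
  // or that exists on only one side ("<missing>" marks the absent side).
  def mismatches(
      declared: Seq[(String, String)],
      written: Seq[(String, String)]
  ): Seq[(String, String, String)] = {
    val writtenByName = written.toMap
    val declaredNames = declared.map(_._1).toSet
    val typeDiffs = declared.flatMap { case (name, declaredType) =>
      val writtenType = writtenByName.getOrElse(name, "<missing>")
      if (writtenType.equalsIgnoreCase(declaredType)) Seq.empty
      else Seq((name, declaredType, writtenType))
    }
    val extraColumns = written.collect {
      case (name, writtenType) if !declaredNames(name) => (name, "<missing>", writtenType)
    }
    typeDiffs ++ extraColumns
  }

  // Types declared in the external table DDL from the question.
  val hiveTable: Seq[(String, String)] =
    Seq("id" -> "double", "hire_dt" -> "timestamp", "user" -> "string")

  // Types the question's code casts to before writing the Parquet files.
  val writtenByQuestion: Seq[(String, String)] =
    Seq("id" -> "double", "hire_dt" -> "date", "user" -> "string")
}
```

With these inputs, `SchemaDiff.mismatches(SchemaDiff.hiveTable, SchemaDiff.writtenByQuestion)` flags only `hire_dt`, declared as timestamp but written as date. In a real job the two sides would more likely be taken from the table's schema and the DataFrame's schema; that wiring is left out here to keep the sketch free of Spark dependencies.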
Do you have enableHiveSupport() in your SparkSession builder() statement? Are you able to connect to the Hive metastore? Try running show tables/databases in your code to see whether you can list the tables present at your Hive location.
I got this working with the change below.
import org.apache.spark.sql.types.{DoubleType, TimestampType}
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
So basically the issue was a datatype mismatch, and somehow the cast in the original code doesn't seem to work. I did an explicit cast, and now the write goes through and I can query the data back as well. Logically both versions do the same thing, so I am not sure why the original code was not working.
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
dfTransformed.write.mode("append").parquet("/test/emp")
dfTransformed.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

Unable to directly load hive parquet table using spark dataframe

I have gone through the related posts on SO and couldn't find this specific issue anywhere on the internet.
I am trying to load a Hive table (a Hive external table pointing to Parquet files), but the Spark DataFrame cannot read the data; it only reads the schema. I can query the same Hive table from the Hive shell. When I load the Hive table into a DataFrame, it does not return any data. My script and the DDL are below. I am using Spark 2.1 (MapR distribution).
Unable to read data from Spark for a Hive table with underlying Parquet files:
val df4 = spark.sql("select * from default.Tablename")
scala> df4.show()
+----------------------+------------------------+----------+---+-------------+-------------+---------+
|col1 |col2 |col3 |key |col4| record_status|source_cd|
+----------------------+------------------------+----------+---+-------------+-------------+---------+
+----------------------+------------------------+----------+---+-------------+-------------+---------+
Hive DDL
CREATE EXTERNAL TABLE `Tablename`(
`col1` string,
`col2` string,
`col3` decimal(19,0),
`key` string,
`col6` string,
`record_status` string,
`source_cd` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='maprfs:abc/bds/dbname.db/Tablename')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/Datalocation/Tablename'
TBLPROPERTIES (
'numFiles'='2',
'spark.sql.sources.provider'='parquet',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"col1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col2\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col6\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"record_status\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"source_cd\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
'totalSize'='68216',
'transient_lastDdlTime'='1502904476')
Remove
'spark.sql.sources.provider'='parquet'
and it will succeed.

Reading hive orc table using spark

I have a partitioned table, with partitions from 2017-06-20 onward.
My query:
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val test_enc_orc = hiveContext.sql("select * from db.tbl where time_key = '2017-06-21' limit 1")
Every time I run it, Spark looks at the partition 2017-06-20:
INFO OrcFileOperator: ORC file hdfs://nameservice1/apps/hive/warehouse/db.db/tbl/time_key=2017-06-20/000016_0 has empty schema, it probably contains no rows. Trying to read another ORC file to figure out the schema.
and searches through all files for date 2017-06-20, which holds empty ORC files. But partition 2017-06-21 has files with data. Why doesn't Spark look at that date or any other?
EDIT
I created a test table:
drop table arstel.evkuzmin_test_it;
create table arstel.evkuzmin_test_it(name string)
partitioned by(ban int)
stored as orc;
insert into arstel.evkuzmin_test_it partition(ban) values
("bob", 1)
, ("marty", 1)
, ("monty", 2)
, ("naruto", 2)
, ("death", 4);
It seems the problem is exactly because of the empty files. In this test case there are none, so everything works. Is there a way to make Spark ignore them?
