DataStax Cassandra File System - Fixed Width Text File - Hive Integration Issue

I'm trying to read a fixed-width text file stored in the Cassandra File System (CFS) using Hive. I can query the file fine from the Hive client. However, when I try to run the same query through the Hadoop Hive JDBC driver, it says the table is not available or the connection is bad. Below are the steps I followed.
Input file (employees.dat):
21736Ambalavanar Thirugnanam BOY-EAG 2005-05-091992-11-18
21737Anand Jeyamani BOY-AST 2005-05-091985-02-12
31123Muthukumar Rajendran BOY-EES 2009-08-121983-02-23
Starting Hive Client
bash-3.2# dse hive;
Logging initialized using configuration in file:/etc/dse/hive/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201209250900_157600446.txt
hive> use HiveDB;
OK
Time taken: 1.149 seconds
Creating a Hive external table pointing to the fixed-width text file
hive> CREATE EXTERNAL TABLE employees (empid STRING, firstname STRING, lastname STRING, dept STRING, dateofjoining STRING, dateofbirth STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{25})(.{25})(.{15})(.{10})(.{10}).*" )
> LOCATION 'cfs://hostname:9160/folder/';
OK
Time taken: 0.524 seconds
Do a select * from the table.
hive> select * from employees;
OK
21736 Ambalavanar Thirugnanam BOY-EAG 2005-05-09 1992-11-18
21737 Anand Jeyamani BOY-AST 2005-05-09 1985-02-12
31123 Muthukumar Rajendran BOY-EES 2009-08-12 1983-02-23
Time taken: 0.698 seconds
Doing a select with specific fields from the Hive table throws a permission error (first issue):
hive> select empid, firstname from employees;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:108)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:452)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
The second issue is that when I try to run the select * query through the Hive JDBC driver (outside of the DSE/Cassandra nodes), it says the table employees is not available. The external table I created acts like a temporary table and does not get persisted. When I run 'hive> show tables;', the employees table is not listed. Can anyone please help me figure out the problem?

I don't have an immediate answer for the first issue, but the second looks like it's due to a known issue.
There is a bug in DSE 2.1 which drops external tables created from CFS files from the metastore when show tables is run. Only the table metadata is removed; the data remains in CFS, so if you recreate the table definition you shouldn't have to reload it. Tables backed by Cassandra column families are not affected by this bug. This has been fixed in the 2.2 release of DSE, which is due for release imminently.
I'm not familiar with the Hive JDBC driver, but if it issues a Show Tables command at any point, it could be triggering this bug.
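In other words, re-running the original DDL from the question against the same CFS location should bring the table definition back without reloading anything; a sketch, with the statements copied from the steps above:
use HiveDB;
-- Only the metadata is re-created; the data under cfs://hostname:9160/folder/ is untouched.
CREATE EXTERNAL TABLE employees (empid STRING, firstname STRING, lastname STRING, dept STRING, dateofjoining STRING, dateofbirth STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{25})(.{25})(.{15})(.{10})(.{10}).*")
LOCATION 'cfs://hostname:9160/folder/';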

Related

Cannot read Delta tables created by Spark in Hive or DBeaver/JDBC

I've used Spark 3.3.1, configured with delta-core_2.12:2.2.0 and delta-storage-2.2.0, to create several tables within an external database.
spark.sql("create database if not exists {database} location '{path_to_storage}'")
Within that database I've got several delta tables, created and populated through Spark, e.g.:
{table_df}.write.format("delta").mode("overwrite").saveAsTable("{database}.{table}")
I can then, right away, query it like so:
df = spark.sql("select * from {database}.{table} limit 10")
df.show()
And everything works fine.
When I try to run the same command (select * from {database}.{table} limit 10;) through Hive or the DBeaver SQL editor, I get the following error:
hive> select * from {database}.{table} limit 10;
2023-01-04T12:45:29,474 INFO [main] org.apache.hadoop.hive.conf.HiveConf - Using the default value passed in for log id: 9ecfc0dd-0606-4060-98ef-b1395fc62456
2023-01-04T12:45:29,484 INFO [main] org.apache.hadoop.hive.ql.session.SessionState - Updating thread name to 9ecfc0dd-0606-4060-98ef-b1395fc62456 main
2023-01-04T12:45:31,138 INFO [9ecfc0dd-0606-4060-98ef-b1395fc62456 main] org.apache.hadoop.hive.common.FileUtils - Creating directory if it doesn't exist: hdfs://localhost:9000/tmp/hive/user/9ecfc0dd-0606-4060-98ef-b1395fc62456/hive_2023-01-04_12-45-29_879_2613740326746479876-1/-mr-10001/.hive-staging_hive_2023-01-04_12-45-29_879_2613740326746479876-1
OK
Failed with exception java.io.IOException:java.io.IOException: file:/{path_to_file_storage}/part-00000-7708a52c-0939-4288-b56a-ecdeea197574-c000.snappy.parquet not a SequenceFile
Time taken: 1.649 seconds
2023-01-04T12:45:31,298 INFO [9ecfc0dd-0606-4060-98ef-b1395fc62456 main] org.apache.hadoop.hive.conf.HiveConf - Using the default value passed in for log id: 9ecfc0dd-0606-4060-98ef-b1395fc62456
2023-01-04T12:45:31,298 INFO [9ecfc0dd-0606-4060-98ef-b1395fc62456 main] org.apache.hadoop.hive.ql.session.SessionState - Resetting thread name to main
hive>
I have installed the delta connectors (delta-hive-assembly_2.12-0.6.0.jar) from here:
https://github.com/delta-io/connectors/blob/master/hive/README.md
Installed it in an auxjar folder in my main hive directory and added the following properties in my hive-site.xml file:
<property>
<name>hive.input.format</name>
<value>io.delta.hive.HiveInputFormat</value>
</property>
<property>
<name>hive.tez.input.format</name>
<value>io.delta.hive.HiveInputFormat</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>file:/{path_to_file}/auxjar/delta-hive-assembly_2.12-0.6.0.jar</value>
</property>
When I start Hive I'm not seeing any exceptions about the file not being found. Have I missed a critical step?
Thanks
Tried running a simple query in hive, got an IOException.
Hive Connector
See the FAQ at the bottom of that page:
If a table in the Hive Metastore is created by other systems such as
Apache Spark or Presto, can I use this connector to query it in Hive?
No. If a table in the Hive Metastore is created by other systems such
as Apache Spark or Presto, Hive cannot find the correct connector to
read it. You can follow our instruction to create a new table with a
different table name but point to the same path in Hive. Although it's
a different table name, the underlying data will be shared by all of
systems. We recommend to create different tables in different systems
but point to the same path.
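Following that FAQ, a workaround is to define a second Hive table, with a different name, over the same Delta location using the connector's storage handler. A minimal sketch, assuming the connector jar is already on hive.aux.jars.path; the column list and path are placeholders and must match the real Delta table:
-- Hypothetical column list; it must mirror the Delta table's actual schema.
CREATE EXTERNAL TABLE {table}_hive (col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '{path_to_storage}';
Hive then reads the same underlying Delta files that Spark wrote, just through its own table definition.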

spark.table vs sql() AccessControlException

Trying to run
spark.table("db.table")
.groupBy($"date")
.agg(sum($"total"))
returns
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.security.AccessControlException: Permission denied: user=user, access=WRITE, inode="/sources/db/table":tech_user:bgd_group:drwxr-x---
the same script but written as
sql("SELECT sum(total) FROM db.table group by date").show()
returns actual result.
I don't understand why this is happening. What is the first script trying to write exactly? Some staging result?
I have read permission for this table and I'm only trying to perform some aggregations.
Using Spark 2.2 for this.
In Spark 2.2, the default for spark.sql.hive.caseSensitiveInferenceMode was changed from NEVER_INFER to INFER_AND_SAVE. This mode causes Spark to infer a case-sensitive schema from the underlying files and try to save it back into the Hive metastore. That save will fail if the user executing the command wasn't granted permission to update the HMS.
The obvious workaround is to set the inference mode back to NEVER_INFER, or to INFER_ONLY if the application relies on column names exactly as they appear in the files (case-sensitive).
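For example, a minimal sketch of switching the mode for the current session (it can equally be passed via --conf at submit time or set in spark-defaults.conf):
-- Run through spark.sql(...) or spark-sql before querying the table.
-- NEVER_INFER skips file-based schema inference entirely;
-- INFER_ONLY infers the case-sensitive schema but never writes it back to the metastore.
SET spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER;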

insert overwrite from spark is changing permission and owner of the hive table directory

I have a simple requirement: read a Hive Parquet table and overwrite the same table after applying some logic.
beeline:
create external table test_table (id int, name string) stored as parquet location '/dev/test_db/test_table';
insert into test_table values (1,'John'),(2,'Myna'),(3,'Rakesh');
pyspark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
src_table=spark.sql("select * from test_db.test_table")
src_table.write.mode("overwrite").saveAsTable("test_db.temp_test_table")
tempDF=spark.sql("select * from test_db.temp_test_table")
spark.sql("insert overwrite table test_db.test_table select * from test_db.temp_test_table")
result:
It overwrites the table as expected. But it changes the permission of the table directory.
In this case it changes the permission of '/dev/test_db/test_table' from 771 to 755. It also changes the ownership from hive:hive to batch_user:hive.
Because the group-level permission is reduced, I can no longer write into this table through HiveServer2, whether from Beeline, Hue, or a third-party tool such as Oracle ODI.
Has anyone faced this issue? I am surprised that the Spark job is changing the directory permission. For most operations it leaves the ownership as hive:hive, but based on my analysis the overwrite is what changes it.
I have tried other APIs such as saveAsTable and insertInto; both behave the same way.

Spark returns Empty DataFrame but Populated in Hive

I have a table in hive
db.table_name
When I run the following in hive I get results back
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
It also shows nothing. Selecting arbitrary columns before the show displays nothing either, and performing a count reports that the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the file using the parquet files directly it works.
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently using a workaround by describing the table and getting the location and using spark.read.parquet.
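(A sketch of that workaround in metastore terms, reusing the table name from the question: the Location row of DESCRIBE FORMATTED gives the path to hand to spark.read.parquet.)
-- The 'Location' line in the output is the directory to read with spark.read.parquet.
DESCRIBE FORMATTED db.table_name;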
Have you refreshed the table metadata? You may need to refresh the table to pick up the new data:
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results as a text-format Hive table.
There is probably some incompatibility with the Hive Parquet format.
I also found a Cloudera note about it (CDH Release Notes): they recommend creating the Hive table manually and then loading the data from a temporary table or via a query.
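A hedged sketch of that recommendation (table and column names are placeholders): create the Hive table up front, have Spark write its output to a staging table, and copy the rows across with a query:
-- Placeholder schema; use whatever columns the dataframe actually has.
CREATE TABLE db.table_name (id INT, name STRING) STORED AS PARQUET;
-- After Spark writes db.table_name_staging (e.g. via saveAsTable), load it in Hive:
INSERT OVERWRITE TABLE db.table_name SELECT * FROM db.table_name_staging;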

How to specify the database when registering a data frame as a table in Spark 1.6

I am working on a simple Spark script and running into issues getting the data where I want it and getting the job to work. Specifically, I need to specify the database of the tables when registering a data frame as a temp table.
df_del_records,df_add_records,df_exclusion_records=get_new_records(dff)
df_del_records.registerTempTable("db.update_deletes_temp_table")
df_add_records.registerTempTable("db.update_adds_temp_table")
df_exclusion_records.registerTempTable("db.exclusions_temp_table")
sqlContext.sql("insert overwrite table db.automated_quantity_updates select * from db.update_deletes_temp_table")
sqlContext.sql("insert into table db.automated_quantity_updates select * from db.update_adds_temp_table")
sqlContext.sql("insert into table db.exclusions select * from db.exclusions_temp_table")
The code above runs without errors, but does not yield any results. Removing the database prefix yields results, but that won't work in production, where the database in which the temp tables have to be stored is not whatever default Spark is using. How do I specify which database a temp table belongs to when registering a dataframe as a temp table in Spark 1.6?
The temporary table/view created by registerTempTable or createOrReplaceTempView is not tied to any database. It simply creates a view of the dataframe, with a query plan based on how the dataframe was built.
From Apache Spark's Dataset.scala
Local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It's not tied to any databases, i.e. we can't use db1.view1 to reference a local temporary view.
emphasis added by me.
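So in practice the temp tables are registered without any database prefix (e.g. df_del_records.registerTempTable("update_deletes_temp_table")) and referenced unqualified; only the persistent target tables keep the db qualifier. A sketch of the rewritten inserts, reusing the table names from the question and run through sqlContext.sql:
-- The *_temp_table names refer to session-scoped temp views, not tables inside db.
INSERT OVERWRITE TABLE db.automated_quantity_updates SELECT * FROM update_deletes_temp_table;
INSERT INTO TABLE db.automated_quantity_updates SELECT * FROM update_adds_temp_table;
INSERT INTO TABLE db.exclusions SELECT * FROM exclusions_temp_table;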
