How can I drop a database in Hive without deleting the database directory? - apache-spark

When I run a drop database command, Spark deletes the database directory and all its subdirectories on HDFS. How can I avoid this?

Short answer:
Unless you set up your database so that it contains only external tables that exist outside of the database HDFS directory, there is no way to achieve this without copying all of your data to another location in HDFS.
Long answer:
From the following website:
https://www.oreilly.com/library/view/programming-hive/9781449326944/ch04.html
By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.
Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.
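For reference, the two variants look roughly like this when issued through Spark SQL (a minimal sketch, assuming an active SparkSession named spark and a made-up database name):

// Fails if the database still contains tables (RESTRICT is the default behaviour):
spark.sql("DROP DATABASE mydb RESTRICT")

// Drops the tables first and then the database, deleting the database
// directory on HDFS along with any managed table data inside it:
spark.sql("DROP DATABASE mydb CASCADE")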
You can copy the data to another location before dropping the database. I know it's a pain - but that's how Hive operates.
If you were trying to just drop a table without deleting the HDFS directory of the table, there's a solution for this described here: Can I change a table from internal to external in hive?
Dropping an external table preserves the HDFS location for the data.
Cascading the database drop to the tables after converting them to external will not fix this, because the database drop impacts the whole HDFS directory the database resides in. You would still need to copy the data to another location.
If you create a database from scratch in which every table is external and references a location outside of the database HDFS directory, dropping that database will preserve the data. But if your data is currently stored inside the database HDFS directory, you will not have this behaviour; it's something you would have to set up from scratch.
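As a minimal sketch of that kind of setup (assuming Hive support is enabled, an active SparkSession named spark, and made-up names and paths), every table is external and points at data outside the database's own directory:

// The database directory and the data directories are deliberately separate.
spark.sql("CREATE DATABASE archive LOCATION '/warehouse/archive.db'")

// Each table is EXTERNAL and stores its data outside /warehouse/archive.db:
spark.sql("""
  CREATE EXTERNAL TABLE archive.events (id BIGINT, payload STRING)
  STORED AS PARQUET
  LOCATION '/data/archive/events'
""")

// Dropping the database removes the metadata and /warehouse/archive.db,
// but the files under /data/archive/events are preserved:
spark.sql("DROP DATABASE archive CASCADE")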

Related

external table with partition elimination

I am trying to understand how to create an external table that supports partition elimination. I can create a view with a column derived using the filepath function, but that can't be used by Spark. I can create an external table using create external table as select, but that gives me a copy of the data. This article from Microsoft implies it can be done.
The native external tables in Synapse pools are able to ignore the files placed in the folders that are not relevant for the queries. If your files are stored in a folder hierarchy (for example - /year=2020/month=03/day=16) and the values for year, month, and day are exposed as the columns, the queries that contain filters like year=2020 will read the files only from the subfolders placed within the year=2020 folder. The files and folders placed in other folders (year=2021 or year=2022) will be ignored in this query. This elimination is known as partition elimination.
The folder partition elimination is available in the native external tables that are synchronized from the Synapse Spark pools. If you have partitioned data set and you would like to leverage the partition elimination with the external tables that you create, use the partitioned views instead of the external tables.
So, how do you expose those partition directories as columns in an external table?
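For what it's worth, the folder layout described in the quoted passage is what Spark produces with partitionBy, and a partitioned Spark table over that layout exposes year, month, and day as columns that filters can prune on; per the passage, native external tables synchronized from a Synapse Spark pool keep that behaviour. A sketch, with invented names and paths, assuming a Spark session named spark and a DataFrame df:

// Write the /year=/month=/day= hierarchy mentioned above:
df.write
  .partitionBy("year", "month", "day")
  .parquet("abfss://container@account.dfs.core.windows.net/events")

// Register a partitioned table over it; year/month/day become ordinary
// columns that filters such as year = 2020 can prune on:
spark.sql("""
  CREATE TABLE events (id BIGINT, payload STRING, year INT, month INT, day INT)
  USING PARQUET
  PARTITIONED BY (year, month, day)
  LOCATION 'abfss://container@account.dfs.core.windows.net/events'
""")
spark.sql("MSCK REPAIR TABLE events")   // pick up the existing partition folders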

Override underlying parquet data seamlessly for impala table

I have an Impala table backed by parquet files which is used by another team.
Every day I run a batch Spark job that overwrites the existing parquet files (creating a new data set; the existing files are deleted and new files are created).
Our Spark code looks like this:
dataset.write.format("parquet").mode("overwrite").save(path)
During this update (overwriting the parquet data files and then running REFRESH on the Impala table), if someone accesses the table they end up with an error saying the underlying data files are not there.
Is there any solution or workaround available for this issue? I do not want other teams to see the error at any point in time when they access the table.
Maybe I can write the new data files into a different location and then make the Impala table point to that location?
The behaviour you are seeing is because of the way Impala is designed to work. Impala fetches the table's metadata, such as the table structure, partition details, and HDFS file paths, from HMS, and the block details of the corresponding HDFS file paths from the NameNode. All these details are fetched by the catalog service and distributed across the Impala daemons for query execution.
When the table's underlying files are removed and new files are written outside Impala, it is necessary to perform a REFRESH so that the new file details (such as the files and their corresponding block details) are fetched and distributed across the daemons. This way Impala becomes aware of the newly written files.
Since you're overwriting the files, Impala queries fail to find the files they are aware of, because those files have already been removed while the new ones are still being written. This is expected behaviour.
As a solution, you can do one of the following:
Append the new files to the same HDFS path of the table instead of overwriting. This way, Impala queries against the table still return results; however, the results reflect only the older data (because Impala is not yet aware of the new files), and the error you mention is avoided while the overwrite is happening. Once the new files have been created in the table's directories, you can perform an HDFS operation to remove the old files, followed by an Impala REFRESH statement for this table.
OR
As you said, you can write the new parquet files to a different HDFS path and, once the write is complete, either [remove the old files and move the new files into the actual HDFS path of the table, followed by a REFRESH] or [issue an ALTER TABLE statement against the table to point its location to the new directory]. If it's a daily process, you might have to implement this through a script that runs after the Spark write completes successfully, passing the new and old directories as arguments.
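A rough sketch of the second approach, assuming dataset is the DataFrame from the question, a hypothetical table named mydb.mytable, and a made-up staging path per day; the Impala statements in the comments are issued outside Spark (for example via impala-shell):

import java.time.LocalDate

// Hypothetical daily staging directory, e.g. /data/mytable_staging/dt=2024-01-15
val stagingDir = s"/data/mytable_staging/dt=${LocalDate.now}"

// 1. Write the new snapshot to a fresh directory instead of overwriting in place.
dataset.write.format("parquet").mode("overwrite").save(stagingDir)

// 2. Point the table at the new directory and refresh Impala's metadata.
//    These statements are run against Impala rather than through Spark:
//      ALTER TABLE mydb.mytable SET LOCATION '/data/mytable_staging/dt=2024-01-15';
//      REFRESH mydb.mytable;

// 3. Once no running queries need it any more, remove the previous day's
//    directory, e.g. hdfs dfs -rm -r /data/mytable_staging/dt=2024-01-14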
Hope this helps!

Chaining spark sql queries over temporary views?

We're exploring the possibility of using temporary views in Spark and relating them to some actual file storage - or to other temporary views. We want to achieve something like:
1. Some user uploads data to some S3/HDFS file storage.
2. A (temporary) view is defined so that Spark SQL queries can be run against the data.
3. Some other temporary view (referring to some other data) is created.
4. A third temporary view is created that joins data from (2) and (3).
5. By selecting from the view from (4), the user gets a table that reflects the joined data from (2) and (3). This result can be further processed and stored in a new temporary view, and so on.
So we end up with a tree of temporary views - each querying its parent temporary views until they end up loading data from the filesystem. Basically we want to store transformation steps (selecting, joining, filtering, modifying etc.) on the data - without storing new versions. Spark SQL support and temporary views seem like a good fit.
We did some successful testing. The idea is to store the specification of these temporary views in our application and recreate them during startup (as temporary or global views).
Not sure if this is a viable solution? One problem is that we need to know how the temporary views are related (which one queries which). We create them like:
sparkSession.sql("select * from other_temp_view").createTempView(name)
So, when this is run we have to make sure that other_temp_view has already been created in the session. Not sure how this can be achieved. One idea is to store a timestamp and recreate them in the same order. This could be OK since our views most likely will have to be "immutable". We're not allowed to change a query that other queries rely on.
Any thoughts would be most appreciated.
I would definitely go with the SessionCatalog object:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SessionCatalog.html
You can access it with spark.sessionState.catalog
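As a sketch of how the startup recreation could look (view names, queries, and the ViewSpec structure are invented; assumes an active SparkSession named spark), you can store the dependency information alongside each view definition and check through the catalog that the parents exist before a view is recreated:

case class ViewSpec(name: String, sql: String, dependsOn: Seq[String])

// Stored in dependency order: parents before children.
val specs = Seq(
  ViewSpec("raw_events",    "SELECT * FROM parquet.`/data/events`", Nil),
  ViewSpec("big_events",    "SELECT * FROM raw_events WHERE size > 100", Seq("raw_events")),
  ViewSpec("daily_summary", "SELECT day, count(*) AS c FROM big_events GROUP BY day", Seq("big_events"))
)

specs.foreach { spec =>
  // spark.catalog is the public facade over the SessionCatalog linked above;
  // tableExists also reports temporary views.
  val missing = spec.dependsOn.filterNot(name => spark.catalog.tableExists(name))
  require(missing.isEmpty, s"Missing parent views for ${spec.name}: ${missing.mkString(", ")}")
  spark.sql(spec.sql).createOrReplaceTempView(spec.name)
}

Whether you go through spark.catalog or reach into spark.sessionState.catalog directly is mostly a question of how much internal API you are willing to depend on.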

How to copy complete tables between PostgreSQL Databases with Azure Data Factory

I want to copy some tables from a production system to a test system on a regular basis. Both systems run a PostgreSQL server. I want to copy only specific tables from production to test.
I've already set up a ForEach activity which iterates over the table names I want to copy. The problem is that the table structures may change during the development process and the copy job might fail.
So is there a way to use some kind of "automatic mapping"? Because the tables in both systems always have exactly the same structure. Or is there some kind of "copy table" procedure?
You could remove the mapping and structure in your pipeline. It will then use the default mapping behavior. Given that your tables always have the same schema, both mapping by name and mapping by order should work.

Cassandra Database Problem

I am using the Cassandra database for a large-scale application. I am new to Cassandra. I have a database schema for a particular keyspace for which I have created columns using the Cassandra Command Line Interface (CLI). Now when I copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column. I am getting the message "zero rows present", but the files are present. All these files have extensions such as XXXX-Data.db, XXXX-Filter.db, XXXX-Index.db. Can anyone tell me how to access the columns for existing datasets?
(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) If you didn't also copy the schema definition, it will ignore data files for unknown column families.
For what you are trying to achieve, it would probably be better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations
