Write data frame to hive table in spark - apache-spark

Could you please tell me if this command could cause problems by overwriting all tables in the DB:
df.write.option("path", "path_to_the_db/hive/").mode("overwrite").saveAsTable("result_data")
result_data is a new table in the DB; it did not exist before.
After these commands, all tables disappeared.
I was using Spark3 and tried to solve an error:
Can not create the managed table('result_data').
The associated location('dbfs:/user/hive/warehouse/result_data') already exists.
I expected that a new table would be created without any issues if it didn't already exist.

If path_to_the_db/hive contains other tables and you overwrite into that folder, it seems possible that the whole directory would be emptied first, yes. Perhaps you should instead use path_to_the_db/hive/result_data as the path.
According to the error, though, your table does already exist.
You can also register the DataFrame as a temporary view and then run an INSERT OVERWRITE query against existing tables.
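For illustration, a minimal PySpark sketch of both suggestions (the path, table, and view names are placeholders taken from the question):
# Give the table its own directory instead of pointing at the database root:
df.write.option("path", "path_to_the_db/hive/result_data").mode("overwrite").saveAsTable("result_data")
# Alternative: register a temporary view and overwrite an existing table with SQL:
df.createOrReplaceTempView("result_data_staging")
spark.sql("INSERT OVERWRITE TABLE result_data SELECT * FROM result_data_staging")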

Related

How to read hive managed table data using spark?

I am able to read a Hive external table using spark-shell, but when I try to read data from a Hive managed table it only shows the column names.
Please find queries here:
Could you please try using the database name along with the table name?
sql("select * from db_name.test_managed")
If the result is still the same, please share the output of describe formatted for both tables.
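For example, something along these lines (a PySpark sketch; db_name, test_managed and test_external are the assumed names from the question):
# Qualify the table with its database, then compare the metadata of both tables:
spark.sql("select * from db_name.test_managed").show()
spark.sql("describe formatted db_name.test_external").show(100, truncate=False)
spark.sql("describe formatted db_name.test_managed").show(100, truncate=False)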

Presto - can I do alter table if exists?

How can I alter a table name only if the table exists?
Something like: alter table mydb.myname if exists rename to mydb.my_new_name
You can do something like:
ALTER TABLE users RENAME TO people;
or
ALTER TABLE mydb.myname RENAME TO mydb.my_new_name;
Please note that the IF EXISTS syntax is not available here. You can find more information here: https://docs.starburstdata.com/latest/sql/alter-table.html The work for adding it is tracked under: https://github.com/prestosql/presto/issues/2260
Currently you need to handle this on a different layer, for example a Java program that runs SQL queries against Presto over JDBC.
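For example, the existence check can be done on the client side before issuing the rename (a rough sketch using the trino Python client instead of JDBC; the host, credentials, catalog, and schema are assumptions, not taken from the question):
import trino  # assumes the trino Python client is installed

conn = trino.dbapi.connect(host="presto-host", port=8080, user="etl", catalog="hive", schema="mydb")
cur = conn.cursor()
# Emulate IF EXISTS on the client: look the table up first, then rename it.
cur.execute("SHOW TABLES LIKE 'myname'")
if cur.fetchall():
    cur.execute("ALTER TABLE mydb.myname RENAME TO mydb.my_new_name")
    cur.fetchall()  # drain the result so the statement completes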

Cassandra create and load data atomicity

I have a web service which looks for the last created table:
[name_YYYYMMddHHmmss]
I have a persister job that creates and loads a table (insert or bulk)
Is there something that hides a table until it is fully loaded?
First, I created a technical table; it works, but I would need one per keyspace (using cassandraAuth), and I don't like this.
I was also thinking about tags, but they don't seem to exist:
- create a table with a tag and modify or remove it when the table is loaded.
There is also the table comment option.
Any ideas?
Table comment is a good option. We use it for some service information about the table, e.g. for tracking table versions.
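As a rough sketch of how the comment could act as a "ready" flag (Python with the DataStax driver; the keyspace, table name, and comment values are made up for illustration):
from cassandra.cluster import Cluster  # assumes the DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
# Create the table with a comment marking it as still loading.
session.execute("CREATE TABLE data_20240101120000 (id uuid PRIMARY KEY, payload text) WITH comment = 'loading'")
# ... the persister job inserts or bulk-loads the data here ...
# Flip the comment once the load finishes, so the web service only picks up 'ready' tables.
session.execute("ALTER TABLE data_20240101120000 WITH comment = 'ready'")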

Is there a method to drop a keyspace without removing its schema in Cassandra? If there is, how would I do this?

There is a script which exports the Cassandra schema and generates two CQL files. These files are called by the schema restore script.
I had previously dropped the keyspace. While restoring the schema I am getting the error "cannot add column family to nonexisting keyspace 'graphdb'".
It sounds to me like you are attempting to create the table in a keyspace that has not been created yet. Make sure that the creation happens in the right order in the two CQL files you have exported: the keyspace needs to be created before you attempt to create the tables.
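In other words, the restore has to run the statements in this order (a sketch; the replication settings and table definition are illustrative, only the keyspace name comes from the error message):
from cassandra.cluster import Cluster  # assumes the DataStax Python driver

session = Cluster(["127.0.0.1"]).connect()
# 1. Recreate the keyspace first.
session.execute("CREATE KEYSPACE IF NOT EXISTS graphdb "
                "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
# 2. Only then create the tables from the exported schema files.
session.execute("CREATE TABLE IF NOT EXISTS graphdb.my_table (id uuid PRIMARY KEY, value text)")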

How to overwrite data with PySpark's JDBC without losing schema?

I have a DataFrame that I want to write to a PostgreSQL database. If I simply use the "overwrite" mode, like:
df.write.jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)
The table is recreated and the data is saved. But the problem is that I'd like to keep the PRIMARY KEY and indexes on the table. So, I'd like to either overwrite only the data, keeping the table schema, or add the primary key constraint and indexes afterward. Can either one be done with PySpark? Or do I need to connect to PostgreSQL and execute the commands to add the indexes myself?
The default behavior of mode="overwrite" is to first drop the table and then recreate it with the new data. You can instead truncate the existing rows by including option("truncate", "true") and then push your own data:
df.write.option("truncate", "true").jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)
This way, you are not recreating the table, so it shouldn't make any modifications to your schema.
