drop column in a table/view using spark sql only - apache-spark

i have 30 columns in a table i.e table_old
i want to use 29 columns in that table except one . that column is dynamic.
i am using string interpolation.
the below sparksql query i am using
drop_column=now_current_column
var table_new=spark.sql(s"""alter table table_old drop $drop_column""")
but its throwing error
mismatched input expecting 'partition'
i dont want to drop the column using dataframe. i requirement is to drop the column in a table using sparksql only

As mentioned in previous answer, DROP COLUMN is not supported by spark yet.
But, there is a workaround to achieve the same, without much overhead. This trick works for both EXTERNAL and InMemory tables. The code snippet below works for EXTERNAL table, you can easily modify it and use it for InMemory tables as well.
val dropColumnName = "column_name"
val tableIdentifier = "table_name"
val tablePath = "table_path"
val newSchema=StructType(spark.read.table(tableIdentifier).schema.filter(col => col.name != dropColumnName))
spark.sql(s"drop table ${tableIdentifier}")
spark.catalog.createTable(tableIdentifier, "orc", newSchema, Map("path" -> tablePath))
orc is the file format, it should be replaced with the required format. For InMemory tables, remove the tablePath and you are good to go. Hope this helps.

DROP COLUMN (and in general majority of ALTER TABLE commands) are not supported in Spark SQL.
If you want to drop column you should create a new table:
CREATE tmp_table AS
SELECT ... -- all columns without drop TABLE
FROM table_old
and then drop the old table or view, and reclaim the name.

Now drop columns is supported by Spark if you´re using v2 tables. You can check this link
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html

Related

hive column name change to upper case?

I have a hive table with column name abc and I want to change it to aBc through:
ALTER TABLE mytable CHANGE abc aBc BIGINT
but it didn't work.
I know that hive is case insensitive so I tried to do it with spark session.
doesn't work either.
Is that possible to change it with DDL clause?
I know an alternative solution which requires me to drop the old table and recreate with
create table clause, but the table is too large and I don't want to recreate it.

SparkSQL cannot find existing rows in a Hive table that Trino(Presto) does

We are using Trino(Presto) and SparkSQL for querying Hive tables on s3 but they give different results with the same query and on the same tables. We found the main problem. There are existing rows in a problematic Hive table which can be found with a simple where filter on a specific column with Trino but cannot be found with SparkSQL. The sql statements are the same in both.
On the other hand, SparkSQL can find these rows in the source table of that problematic table, filtering on the same column.
Create sql statement:
CREATE TABLE problematic_hive_table AS SELECT c1,c2,c3 FROM source_table
The select sql that can be used to find missing rows in Trino but not in SparkSQL
SELECT * FROM problematic_hive_table WHERE c1='missing_rows_value_in_column'
And this is the select query which can find these missing rows in SparkSQL:
SELECT * FROM source_table WHERE c1='missing_rows_value_in_column'
We execute the CTAS in Trino(Presto). If we are using ...WHERE trim(c1) = 'missing_key'' then spark can also find the missing rows but the fields do not contain trailing spaces (the length of these fields are the same in the source table as in the problematic table). In the source table spark can find these missing rows without trim.

how a table data gets loaded into a dataframe in databricks? row by row or bulk?

I am new to databricks notebooks and dataframes. I have a requirement to load few columns(out of many) in a table of around 14million records into a dataframe. once the table is loaded, I need to create a new column based on values present in two columns.
I want to write the logic for the new column along with the select command while loading the table into dataframe.
Ex:
df = spark.read.table(tableName)
.select(columnsList)
.withColumn('newColumnName', 'logic')
will it have any performance impact? is it better to first load the table for the few columns into the df and then perform the column manipulation on the loaded df?
does the table data gets loaded all at once or row by row into the df? if row by row, then by including column manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - is it backed by Parquet or Delta, or it's an interface to the actual database, etc. In general, Spark is trying to read only necessary data, and if, for example, Parquet is used (or Delta), then it's easier because it's column-oriented file format, so data for each column is placed together.
Regarding the question on the reading - Spark is lazy by default, so even if you put df = spark.read.table(....) as separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that table exists, your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform operations.
P.S. I recommend to grab a free copy of the Learning Spark, 2ed that is provided by Databricks - it will provide you a foundation for development of the code for Spark/Databricks

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However my attempt failed since the actual files reside in S3 and even if I drop a hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or the only solution will be to drop the actual data and reload it with a newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out there is no need to drop the table. In fact the strongly recommended approach by databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
in spark SQL, This can be done easily by
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition
partitionBy(column, column_2, ...)
def change_partition_of(table_name, column):
df = spark.read.table(tn)
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)
change_partition_of("i.love_python", "column_a")

Hive UDF for selecting all except some columns

The common query building pattern in HiveQL (and SQL in general) is to either select all columns (SELECT *) or an explicitly-specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.
There are various mechanisms for excluding some columns as outlined in this SO question but none apply naturally to HiveQL. (For example, the idea to create a temporary table with SELECT * then ALTER TABLE DROP some of its columns would wreak havoc in a big data environment.)
Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.
Prior to Hive 0.13.0 SELECT could take regular-expression-based columns, e.g., property_.* inside a backtick-quoted string. #invoketheshell's answer below refers to this capability but it comes at a cost, which is that, when this capability is on, Hive cannot accept columns with non-standard characters in them, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.
A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?
This seems like a common problem with many posts around the Web showing how to solve it for various databases yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?
You can choose every column except those listed in a regex based specification. This is query columns by exclusion. See below:
A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.
That being said you could create a new table or view using the following, and all the columns except the columns specified will be returned:
hive.support.quoted.identifiers=none;
drop table if exists database.table_name;
create table if not exists database.table_name as
select `(column_to_remove_1|...|column_to_remove_N)?+.+`
from database.some_table
where
--...
;
This will create a table that has all the columns from some_table except the columns named column_to_remove_1, ... , to column_to_remove_N. You can also choose to create a view instead.

Resources