How to retrieve a column value from DESCRIBE DETAIL <table_name> - databricks

I would like to use the "Last modified" value from the description of my table in Databricks. I know how to get all columns by using "DESCRIBE DETAIL table_name", but I only want the last modified value, since I need to use it in a WHERE comparison.

You can retrieve the result of the SQL query as a list and read the value like this:
spark.sql("DESCRIBE DETAIL database_name.table_name").collect()[0]['lastModified']
>>> Out[7]: datetime.datetime(2022, 2, 11, 8, 16, 5)
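To use the value in a WHERE comparison, as the question asks, one option is to read it in Python and format it into the query text. The following is a minimal sketch, not a definitive implementation: it assumes `spark` is an active SparkSession (as in a Databricks notebook), and the helper names, `events`, and `updated_at` are hypothetical placeholders.

```python
from datetime import datetime


def get_last_modified(spark, table_name: str) -> datetime:
    """Return the lastModified value of the single row produced by DESCRIBE DETAIL."""
    row = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    return row["lastModified"]


def build_filter_query(source_table: str, ts_column: str, last_modified: datetime) -> str:
    """Build a query that keeps rows newer than last_modified (hypothetical names)."""
    # Render the datetime as a Spark SQL TIMESTAMP literal.
    literal = last_modified.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {source_table} WHERE {ts_column} > TIMESTAMP '{literal}'"
```

In a notebook this could be called as `spark.sql(build_filter_query("events", "updated_at", get_last_modified(spark, "database_name.table_name")))`.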

The DESCRIBE DETAIL command returns a DataFrame with one row, but Spark SQL on Databricks does not currently treat it as a proper table. You can work around this with a temp view, as @axel-r pointed out:
df = spark.sql("""DESCRIBE DETAIL database_name.table_name""")
df.createOrReplaceTempView("details")
%sql
SELECT lastModified FROM details
In my case, I wanted the last update date from the DESCRIBE DETAIL command, and it happens that DESCRIBE HISTORY is treated as a proper table in Spark SQL; adding LIMIT 1 returns the most recent record, so you can get the same information that way. The benefit is that you can save it as a permanent view, which you can't do with the temp view method above:
%sql
SELECT timestamp as lastModified
FROM (DESCRIBE HISTORY database_name.table_name LIMIT 1)

Related

Databricks table metadata through JDBC driver

The Spark JDBC driver (SparkJDBC42.jar) is unable to capture certain information from the below table structure:
table level comment
The TBLPROPERTIES key-value pair information
PARTITION BY information
However, it captures the column-level comment (e.g. the comment on the employee_number column), all columns of the employee table, and their technical data types.
Please advise if I need to configure any additional properties to be able to read/extract the information that the driver currently cannot extract.
create table default.employee(
employee_number INT COMMENT 'Unique identifier for an employee',
employee_name VARCHAR(50),
employee_age INT)
PARTITIONED BY (employee_age)
COMMENT 'this is a table level comment'
TBLPROPERTIES ('created.by.user' = 'Noor', 'created.date' = '10-08-2021');
You should be able to execute:
describe table extended default.employee
via the JDBC interface as well. It returns a table with 3 columns that you can parse into column-level and table-level properties. This shouldn't be very complex, as there are explicit delimiters between the column-level and table-level data.
You can also execute:
show create table default.employee
That will give you a table with one column, containing the SQL statement, which you can parse.
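Splitting the three-column describe table extended result into column-level and table-level parts could look roughly like the sketch below. It rests on assumptions that should be checked against the actual driver output: blank rows act as separators, a row whose first field starts with "#" and whose other fields are empty opens a new section, and a "#" row with non-empty fields is a repeated header. The function name and the sample rows are hypothetical.

```python
def split_describe_extended(rows):
    """Group (col_name, data_type, comment) rows into named sections."""
    sections = {"columns": []}
    current = "columns"
    for col_name, data_type, comment in rows:
        name = (col_name or "").strip()
        if not name:
            continue  # blank separator row
        if name.startswith("#"):
            if not (data_type or "").strip():
                # e.g. "# Detailed Table Information" opens a new section
                current = name.lstrip("#").strip()
                sections.setdefault(current, [])
            # otherwise it is a repeated header row such as "# col_name"; skip it
            continue
        sections[current].append((name, data_type, comment))
    return sections


# Hypothetical rows shaped like the three-column result; real output may differ.
sample_rows = [
    ("employee_number", "int", "Unique identifier for an employee"),
    ("employee_name", "varchar(50)", None),
    ("employee_age", "int", None),
    ("", "", ""),
    ("# Partition Information", "", ""),
    ("# col_name", "data_type", "comment"),
    ("employee_age", "int", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Comment", "this is a table level comment", ""),
]
sections = split_describe_extended(sample_rows)
# Table-level properties as a simple name -> value mapping.
table_info = {name: value for name, value, _ in sections["Detailed Table Information"]}
```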

Athena sub-query and LEFT JOIN data scanned optimization

There is a 20 GB table in Parquet format, and a simple query returns results by scanning only 1 GB of data.
select columns from table1 where id in (id1, id2, idn)
If the same query is executed with a sub-query, such as:
select columns from table1 where id in (select id from table2 limit n)
it scans 20 GB, the whole table, even when n is a very small number such as 10, 50 or 5000.
The same happens with LEFT JOIN.
SELECT table1.* FROM
table2 LEFT JOIN table1
ON table2.id=table1.id
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
Are there any best practices for running a LEFT JOIN or sub-query on Athena without a full table scan?
Similar questions- Question -1, Question -2
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
This is most commonly covered by "Dynamic filtering".
Currently there is no way to do this.
Athena is based on Presto and Presto doesn't support dynamic filtering yet, but will likely support it in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52
Athena is based on Presto 0.172 currently, so it still needs to upgrade.
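Until dynamic filtering is available, the two-step approach mentioned in the question remains the fallback: fetch the ids first, then pass them as literals so that only the matching data is scanned. Below is a minimal sketch of the second step, assuming integer ids; the function name is hypothetical, while `table1` and `id` are taken from the question.

```python
def build_in_list_query(ids, table="table1", column="id"):
    """Build the literal IN (...) form of the query from already-fetched ids."""
    if not ids:
        raise ValueError("ids must not be empty; IN () is not valid SQL")
    # int() raises on non-numeric values, so arbitrary text never reaches the SQL string.
    literals = ", ".join(str(int(i)) for i in ids)
    return f"SELECT * FROM {table} WHERE {column} IN ({literals})"
```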

Automatically Updating a Hive View Daily

I have a requirement I want to meet. I need to sqoop over data from a DB to Hive. I am sqooping on a daily basis since this data is updated daily.
This data will be used as lookup data from a spark consumer for enrichment. We want to keep a history of all the data we have received but we don't need all the data for lookup only the latest data (same day). I was thinking of creating a hive view from the historical table and only showing records that were inserted that day. Is there a way to automate the view on a daily basis so that the view query will always have the latest data?
Q: Is there a way to automate the view on a daily basis so that the
view query will always have the latest data?
No need to update/automate the process if you use a table partitioned by date.
Q: We want to keep a history of all the data we have received but we
don't need all the data for lookup only the latest data (same day).
NOTE: Whether you use a Hive view or a Hive table, you should always avoid a full table scan when fetching the latest partition's data.
Option 1: Hive approach to query data
If you want to adopt the Hive approach, you need a partitioned table in Hive with a partition column, for example partition_date:
select * from yourpartitionedTable where partition_date in
(select max(partition_date) from yourpartitionedTable)
or
select * from (select *,dense_rank() over (order by partition_date desc) dt_rnk from db.yourpartitionedTable ) myview
where myview.dt_rnk=1
Either query always returns the latest partition and its data from the partitioned table. (If today's date is present in the partition data, you get today's partition; otherwise you get the partition with the max partition_date.)
Option 2: Plain spark approach to query data
With the Spark show partitions command, i.e. spark.sql(s"show Partitions $yourpartitionedtablename"), get the result as an array and sort it to find the latest partition date. Using that, you can query only the latest partition as lookup data from your Spark component.
see my answer as an idea for getting latest partition date.
I prefer option 2, since no Hive query is needed and there is no full table query,
because we are using the show partitions command. There are no performance bottlenecks,
so it will be fast.
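Option 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: `spark` is an active SparkSession, each row returned by SHOW PARTITIONS holds one spec string such as `partition_date=2019-08-06` in its first field, and the dates are ISO formatted. The function names are hypothetical.

```python
def latest_partition_value(partition_specs, key="partition_date"):
    """Return the largest value of `key` among specs like 'partition_date=2019-08-06'."""
    values = []
    for spec in partition_specs:
        # A spec may hold several levels, e.g. 'country=IN/partition_date=2019-08-06'.
        for part in spec.split("/"):
            name, _, value = part.partition("=")
            if name == key:
                values.append(value)
    if not values:
        raise ValueError(f"no partition values found for {key!r}")
    # ISO-formatted dates (yyyy-MM-dd) sort correctly as plain strings.
    return max(values)


def latest_partition_from_spark(spark, table_name, key="partition_date"):
    """Fetch partition specs with SHOW PARTITIONS and pick the latest one."""
    rows = spark.sql(f"SHOW PARTITIONS {table_name}").collect()
    return latest_partition_value([row[0] for row in rows], key)
```

The returned value can then be used in a filter on the partition column, so that only the latest partition is read.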
One more different idea is querying with HiveMetastoreClient or with option2... see this and my answer and the other
I am assuming that you are loading daily transaction records into your history table with some last modified date. Every time you insert or update a record in your history table, your last_modified_date column gets updated. It could be a date or a timestamp.
You can create a view in Hive to fetch the latest data using an analytic function.
Here's some sample data:
CREATE TABLE IF NOT EXISTS db.test_data
(
user_id int
,country string
,last_modified_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
I am inserting a few sample records. As you can see, the same id has multiple records for different dates.
INSERT INTO TABLE db.test_data VALUES
(1,'India','2019-08-06'),
(2,'Ukraine','2019-08-06'),
(1,'India','2019-08-05'),
(2,'Ukraine','2019-08-05'),
(1,'India','2019-08-04'),
(2,'Ukraine','2019-08-04');
Creating a view in Hive:
CREATE VIEW db.test_view AS
select user_id, country, last_modified_date
from ( select user_id, country, last_modified_date,
max(last_modified_date) over (partition by user_id) as max_modified
from db.test_data ) as sub
where last_modified_date = max_modified
;
hive> select * from db.test_view;
1 India 2019-08-06
2 Ukraine 2019-08-06
Time taken: 5.297 seconds, Fetched: 2 row(s)
It shows only the rows with the max date for each user.
If you then insert another record with a newer last modified date:
hive> INSERT INTO TABLE db.test_data VALUES
> (1,'India','2019-08-07');
hive> select * from db.test_view;
1 India 2019-08-07
2 Ukraine 2019-08-06
For reference: Hive View manual

drop column in a table/view using spark sql only

I have 30 columns in a table, table_old.
I want to use 29 of those columns, all except one. That column is dynamic.
I am using string interpolation.
This is the Spark SQL query I am using:
drop_column=now_current_column
var table_new=spark.sql(s"""alter table table_old drop $drop_column""")
but it throws an error:
mismatched input expecting 'partition'
I don't want to drop the column using a DataFrame. My requirement is to drop the column in a table using Spark SQL only.
As mentioned in previous answer, DROP COLUMN is not supported by spark yet.
But, there is a workaround to achieve the same, without much overhead. This trick works for both EXTERNAL and InMemory tables. The code snippet below works for EXTERNAL table, you can easily modify it and use it for InMemory tables as well.
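import org.apache.spark.sql.types.StructType

val dropColumnName = "column_name"
val tableIdentifier = "table_name"
val tablePath = "table_path"
// Keep every field except the one to drop.
val newSchema = StructType(spark.read.table(tableIdentifier).schema.filter(col => col.name != dropColumnName))
// Dropping an EXTERNAL table removes only the metadata; the files at tablePath remain.
spark.sql(s"drop table ${tableIdentifier}")
// Re-register the table over the same files with the reduced schema.
spark.catalog.createTable(tableIdentifier, "orc", newSchema, Map("path" -> tablePath))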
orc is the file format, it should be replaced with the required format. For InMemory tables, remove the tablePath and you are good to go. Hope this helps.
DROP COLUMN (and, in general, the majority of ALTER TABLE commands) is not supported in Spark SQL.
If you want to drop a column, you should create a new table:
CREATE TABLE tmp_table AS
SELECT ... -- all columns except the one to drop
FROM table_old
and then drop the old table or view, and reclaim the name.
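Because the column to drop is dynamic, the SELECT list for the "create a new table" approach has to be generated. A minimal sketch of that step is below; the function name is hypothetical, and in practice the column list could come from something like `spark.table("table_old").columns`, assuming an active SparkSession.

```python
def build_ctas_without_column(columns, drop_column, source_table, target_table):
    """Build a CREATE TABLE ... AS SELECT statement that omits drop_column."""
    if drop_column not in columns:
        raise ValueError(f"{drop_column!r} is not a column of {source_table}")
    kept = [c for c in columns if c != drop_column]
    # Backticks guard column names that contain unusual characters.
    select_list = ", ".join(f"`{c}`" for c in kept)
    return f"CREATE TABLE {target_table} AS SELECT {select_list} FROM {source_table}"
```

The resulting string can be passed to `spark.sql(...)`, which keeps the operation in Spark SQL as the question requires.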
Dropping columns is now supported by Spark if you're using v2 tables. You can check this link:
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html

See if a rowkey exists in Cassandra

In HBase there's an HTable.exists(Get) method that allows me to see if a rowkey has (at least) one cell associated. What about Cassandra? It seems there's no corresponding feature.
Just make a request for one column. If anything is returned, the row exists.
For example, in pycassa you would do:
if column_family.get(key, column_count=1):
    print key, "exists"
Through CQL 3, depending on your schema, you would just do a simple select like:
SELECT * FROM mycf WHERE key = 'foo'
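The CQL variant can be wrapped in a small helper. This sketch assumes `session` behaves like a Session from the DataStax Python driver, where `execute` accepts `%s` placeholders and returns a result set with a `one()` method. The function name is hypothetical, and `mycf` and `key` follow the example above.

```python
def row_exists(session, key, table="mycf"):
    """Return True if at least one row with this key exists."""
    # LIMIT 1 keeps the read small; only existence matters here.
    query = f"SELECT key FROM {table} WHERE key = %s LIMIT 1"
    result = session.execute(query, (key,))
    return result.one() is not None
```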

Resources