Drop Databricks table if older than 30 days - databricks

I would like to drop Databricks SQL DB tables if they were created more than 30 days ago. How do I get the table creation datetime from Databricks?
Thanks

Given a tableName, the easiest way to get the creation time is as follows:
import org.apache.spark.sql.catalyst.TableIdentifier

val createdAtMillis = spark.sessionState.catalog
  .getTempViewOrPermanentTableMetadata(new TableIdentifier(tableName))
  .createTime
getTempViewOrPermanentTableMetadata() returns a CatalogTable that contains information such as:
CatalogTable(
Database: default
Table: dimension_npi
Owner: root
Created Time: Fri Jan 10 23:37:18 UTC 2020
Last Access: Thu Jan 01 00:00:00 UTC 1970
Created By: Spark 2.4.4
Type: MANAGED
Provider: parquet
Num Buckets: 8
Bucket Columns: [`npi`]
Sort Columns: [`npi`]
Table Properties: [transient_lastDdlTime=1578699438]
Location: dbfs:/user/hive/warehouse/dimension_npi
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Schema: root
|-- npi: integer (nullable = true)
...
)
You can list all tables in a database using spark.sessionState.catalog.listTables(database).
There are alternative ways of accomplishing the same thing, but with a lot more effort and a risk of errors due to Spark behavior changes: poking around table metadata using SQL and/or traversing the locations where tables are stored and looking at file timestamps. That's why it's best to go via the catalog APIs.
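Putting the pieces together, here is a minimal sketch of the 30-day cleanup, assuming a single database and that dropping via plain SQL is acceptable (the database name and threshold are illustrative):
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.catalyst.TableIdentifier

val database = "default"                       // illustrative: your database name
val maxAgeMillis = TimeUnit.DAYS.toMillis(30)  // 30-day threshold
val now = System.currentTimeMillis()

val catalog = spark.sessionState.catalog
catalog.listTables(database).foreach { tableId =>
  // createTime is the creation timestamp in milliseconds since the epoch
  val createdAtMillis = catalog.getTempViewOrPermanentTableMetadata(tableId).createTime
  if (now - createdAtMillis > maxAgeMillis) {
    spark.sql(s"DROP TABLE IF EXISTS ${tableId.quotedString}")
  }
}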
Hope this helps.

Assuming your table is a Delta table:
You can use DESCRIBE HISTORY <database>.<table> to retrieve all transactions made to that table, including timestamps. According to the Databricks documentation, history is only retained for 30 days. Depending on how you plan to implement your solution, that may just work.
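For example, a small sketch of checking a Delta table's age this way (the table name is illustrative; if older history has already been cleaned up, the earliest retained commit is only a lower bound on the table's age):
import org.apache.spark.sql.functions.min

// Earliest commit timestamp still retained in the Delta history (illustrative table name)
val history = spark.sql("DESCRIBE HISTORY mydb.mytable")
val earliestCommit = history.agg(min("timestamp")).first().getTimestamp(0)
val ageDays = (System.currentTimeMillis() - earliestCommit.getTime) / (1000L * 60 * 60 * 24)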

Related

Why is my Glue table being created with the wrong path?

I'm creating a table in AWS Glue using a Spark job orchestrated by Airflow. It reads from JSON and writes a table; the command I use within the job is the following:
spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")
The odd thing here is that I have other tables created using the same job (with different names) and they are created without problems, e.g. they have the location
s3://bucket_name/databases/my_db/my_perfectly_created_table
There is exactly one table that ends up with this location:
s3://bucket_name/databases/my_db/my_problematic_table-__PLACEHOLDER__
I don't know where that -__PLACEHOLDER__ is coming from. I already tried deleting the table and recreating it but it always does the same thing on this exact table. The data is in parquet format in the path:
s3://bucket_name/databases/my_db/my_problematic_table
so I know the problem is just in creating the table correctly, because all I get is a single col (array<string>) column when trying to query it in Athena (as there is no data in /my_problematic_table-__PLACEHOLDER__).
Have any of you guys dealt with this before?
Upon closer inspection in AWS Glue, this specific problematic table had the following config, which is specific to CSV files with custom delimiters:
Input Format org.apache.hadoop.mapred.SequenceFileInputFormat
Output Format org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Serde serialization library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
while my other tables had the config specific to Parquet:
Input Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
I tried to create the table forcing the Parquet config with the following command:
val path = "s3://bucket_name/databases/my_db/my_problematic_table/"
val my_table = spark.read.format("parquet").load(path)
val ddlSchema = my_table.toDF.schema.toDDL
spark.sql(s"""
|CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table($ddlSchema)
|ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
|STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
|OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
|LOCATION '$path'
|""".stripMargin
)
but it threw the following error:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column
So the problem was the naming of those fields, "1", "2" and "3", within that struct.
Given that this struct did not contain valuable info, I ended up dropping it and creating the table again. Now it works like a charm and has the correct (Parquet) config in Glue.
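For reference, a rough sketch of that fix, under the assumption that the offending column can simply be left out of the table definition (column and table names are illustrative):
// Build the DDL schema without the struct whose fields are named "1", "2", "3"
// (column/table names are illustrative)
val path = "s3://bucket_name/databases/my_db/my_problematic_table/"
val cleanedSchema = spark.read.format("parquet").load(path)
  .drop("problematic_column")
  .schema
  .toDDL

// USING PARQUET makes Glue register the Parquet input/output formats and SerDe
spark.sql(
  s"CREATE TABLE IF NOT EXISTS my_db.my_problematic_table ($cleanedSchema) " +
  s"USING PARQUET LOCATION '$path'"
)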
Hope this helps someone.

Partitioned table on Synapse

I'm trying to create a new partitioned table in my SQL DW (Synapse) based on a partitioned table in Spark (Synapse) with:
%%spark
val df1 = spark.sql("SELECT * FROM sparkTable")
df1.write.partitionBy("year").sqlanalytics("My_SQL_Pool.dbo.StudentFromSpak", Constants.INTERNAL )
Error : StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
java.sql.SQLException:
com.microsoft.sqlserver.jdbc.SQLServerException: External file access
failed due to internal error: 'File
/synapse/workspaces/test-partition-workspace/sparkpools/myspark/sparkpoolinstances/c5e00068-022d-478f-b4b8-843900bd656b/livysessions/2021/03/09/1/tempdata/SQLAnalyticsConnectorStaging/application_1615298536360_0001/aDtD9ywSeuk_shiw47zntKz.tbl/year=2000/part-00004-5c3e4b1a-a580-4c7e-8381-00d92b0d32ea.c000.snappy.parquet:
HdfsBridge::CreateRecordReader - Unexpected error encountered
creating the record reader: HadoopExecutionException: Column count
mismatch. Source file has 5 columns, external table definition has 6
columns.' at
com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper.executeUpdateStatement(SQLAnalyticsJDBCWrapper.scala:89)
at
thanks
The sqlanalytics() function name has been changed to synapsesql(). It does not currently support writing partitioned tables, but you could implement this yourself, e.g. by writing multiple tables back to the dedicated SQL pool and then using partition switching there.
The syntax is simply (as per the documentation):
df.write.synapsesql("<DBName>.<Schema>.<TableName>", <TableType>)
An example would be:
df.write.synapsesql("yourDb.dbo.yourTablePartition1", Constants.INTERNAL)
df.write.synapsesql("yourDb.dbo.yourTablePartition2", Constants.INTERNAL)
Now do the partition switching in the database using the ALTER TABLE ... SWITCH PARTITION syntax.
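A hedged sketch of that workaround, assuming df1 from the question and an integer year column (table names, years and partition numbers are illustrative; the final switch runs as T-SQL in the dedicated SQL pool, not in Spark):
// Write each year's slice to its own staging table in the dedicated SQL pool
// (assumes the Synapse connector is imported so .synapsesql is available)
val years = df1.select("year").distinct().collect().map(_.getInt(0))

years.foreach { y =>
  df1.filter(df1("year") === y)
     .write
     .synapsesql(s"My_SQL_Pool.dbo.StudentStaging_$y", Constants.INTERNAL)
}

// Then, in the dedicated SQL pool, switch each staging table into the
// partitioned target table, for example:
//   ALTER TABLE dbo.StudentStaging_2000 SWITCH TO dbo.Student PARTITION 1;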

How to create a Databricks database with read-only access on top of an existing database

I will use this image to visualize my question:
Databricks1 creates a database (and tables) in Databricks and stores its data in the storage account. In Databricks2 I want to read the data: Databricks2 only has read permissions. I can read the raw Delta files directly, but I would like to create a database and table that also show up in the Databricks UI.
I thought it would work the following way:
CREATE DATABASE IF NOT EXISTS datastore_panels
LOCATION '/mnt/readOnlyTraining/tmp/panels/';
But this gives a permission error, though the tmp/panels database is already in place.
Is there a way to create a database/table from existing resources on top of delta with read only access?
I have found the solution. I wasted quite some time on this and never found anyone with the same question.
The solution is actually simple, but you need to know it.
I have a Service Principal with read access to my storage account.
Create the database like this (don't specify the location):
CREATE DATABASE IF NOT EXISTS datastore_panels
Create the table (use a location, but do not set table properties or partitioning: it will read these from the Delta table metadata):
CREATE TABLE IF NOT EXISTS datastore_panels.customer_data
USING delta
LOCATION '/mnt/readOnlyTraining/delta/customer-data/'
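Once the table exists, you can sanity-check that the schema and partitioning really came from the Delta metadata, for example:
// Shows columns plus table details (provider, location) picked up from the Delta log
spark.sql("DESCRIBE EXTENDED datastore_panels.customer_data").show(100, truncate = false)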
For those who want to understand the question better, this is what I tried before:
%sql
CREATE TABLE IF NOT EXISTS datastore_panels.production_bazeilles_press_shopfloor (
reg_id INT,
year INT,
timestamp_utc TIMESTAMP,
unit STRING,
value DECIMAL (18,8),
descr_total STRING,
descr01 STRING,
descr02 STRING,
descr03 STRING,
descr04 STRING,
descr05 STRING,
descr06 STRING,
descr07 STRING,
descr08 STRING,
descr09 STRING,
descr10 STRING
)
USING delta
PARTITIONED BY (year)
LOCATION '/mnt/blob/panels/production/bazeilles/press/shopfloor'
TBLPROPERTIES ('delta.deletedFileRetentionDuration' = "interval 60 days",
'delta.autoOptimize.optimizeWrite' = 'true'
)
This doesn't work because, although Spark will in the end only read, the explicitly specified columns, TBLPROPERTIES and PARTITIONED BY clauses mean Spark needs to write to the Delta log in the storage account (to which it only has read access) to record that someone is changing these properties. That write is not possible, so it returns a "no permission" error.
We had a similar situation: we created a service principal with "Storage Read Access" on ADLS and mounted the Databricks 1 location in Databricks 2 with the read-only service principal. However, we additionally need to run an automated script to auto-create the tables in Databricks 2 based on the new mount location.
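A rough sketch of such a sync script, assuming each top-level folder under the read-only mount is a Delta table (the mount path and database name are illustrative):
// Register every Delta folder under the read-only mount as a table in Databricks 2
// (illustrative mount path and database name; assumes one Delta table per folder)
val mountRoot = "/mnt/readOnlyTraining/delta"
val targetDb  = "datastore_panels"

spark.sql(s"CREATE DATABASE IF NOT EXISTS $targetDb")

dbutils.fs.ls(mountRoot)
  .filter(_.name.endsWith("/"))   // directory entries end with "/"
  .foreach { dir =>
    // turn the folder name into a valid table name
    val tableName = dir.name.stripSuffix("/").replace("-", "_")
    spark.sql(
      s"CREATE TABLE IF NOT EXISTS $targetDb.$tableName USING delta LOCATION '${dir.path}'"
    )
  }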
We are still stuck with one problem, though: what if different groups want to access the same tables in the same Databricks account (Databricks 1), one group with read and write and the other with read only? And we don't want another service principal, because we want to give the groups table names instead of table locations, similar to database ACLs for managing privileges.

Azure Data Lake Analytics - Output dates as +0000 rather than -0800

I have a datetime column in an Azure Data Lake Analytics table.
All my incoming data is UTC +0000. When using the code below, all the CSV outputs convert the dates to -0800:
OUTPUT @data
TO @"/data.csv"
USING Outputters.Text(quoting : false, delimiter : '|');
An example datetime in the output:
2018-01-15T12:20:13.0000000-08:00
Are there any options for controlling the output format of the dates? I don't really understand why everything is suddenly in -0800 when the incoming data isn't.
Currently, ADLA does not store time zone information in DateTime values, meaning it will always default to the local time of the cluster machine when reading (-08:00 in your case). Therefore, you can either normalize your DateTime to this local time by running
DateTime.SpecifyKind(myDate, DateTimeKind.Local)
or use
DateTime.ConvertToUtc()
to output in UTC form (but note that the next time you ingest that same data, ADLA will still default to reading it with offset -0800). Examples below:
@getDates =
    EXTRACT
        id int,
        date DateTime
    FROM "/test/DateTestUtc.csv"
    USING Extractors.Csv();

@formatDates =
    SELECT
        id,
        DateTime.SpecifyKind(date, DateTimeKind.Local) AS localDate,
        date.ConvertToUtc() AS utcDate
    FROM @getDates;

OUTPUT @formatDates
TO "/test/dateTestUtcKind_AllUTC.csv"
USING Outputters.Csv();
You can file a feature request for DateTime with offset on our ADL feedback site. Let me know if you have other questions!

DataStax Cassandra File System - Fixed Width Text File - Hive Integration Issue

I'm trying to read a fixed-width text file stored in the Cassandra File System (CFS) using Hive. I'm able to query the file when I run from the Hive client. However, when I try to run it through the Hadoop Hive JDBC driver, it says the table is not available or the connection is bad. Below are the steps I followed.
Input file (employees.dat):
21736Ambalavanar Thirugnanam BOY-EAG 2005-05-091992-11-18
21737Anand Jeyamani BOY-AST 2005-05-091985-02-12
31123Muthukumar Rajendran BOY-EES 2009-08-121983-02-23
Starting Hive Client
bash-3.2# dse hive;
Logging initialized using configuration in file:/etc/dse/hive/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201209250900_157600446.txt
hive> use HiveDB;
OK
Time taken: 1.149 seconds
Creating a Hive external table pointing to the fixed-width text file:
hive> CREATE EXTERNAL TABLE employees (empid STRING, firstname STRING, lastname STRING, dept STRING, dateofjoining STRING, dateofbirth STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{25})(.{25})(.{15})(.{10})(.{10}).*" )
> LOCATION 'cfs://hostname:9160/folder/';
OK
Time taken: 0.524 seconds
Do a select * from table.
hive> select * from employees;
OK
21736 Ambalavanar Thirugnanam BOY-EAG 2005-05-09 1992-11-18
21737 Anand Jeyamani BOY-AST 2005-05-09 1985-02-12
31123 Muthukumar Rajendran BOY-EES 2009-08-12 1983-02-23
Time taken: 0.698 seconds
Doing a select with specific fields from the Hive table throws a permission error (first issue):
hive> select empid, firstname from employees;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:108)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:452)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
The second issue is that when I try to run the select * query from the JDBC Hive driver (outside the DSE/Cassandra nodes), it says the table employees is not available. The external table I created acts like a temporary table and does not get persisted. When I use 'hive> show tables;', the employees table is not listed. Can anyone please help me figure out the problem?
I don't have an immediate answer for the first issue, but the second looks like it's due to a known issue.
There is a bug in DSE 2.1 which drops external tables created from CFS files from the metastore when show tables is run. Only the table metadata is removed; the data remains in CFS, so if you recreate the table definition you shouldn't have to reload it. Tables backed by Cassandra column families are not affected by this bug. This has been fixed in the 2.2 release of DSE, which is due for release imminently.
I'm not familiar with the Hive JDBC driver, but if it issues a Show Tables command at any point, it could be triggering this bug.
