I have a Redshift table that stores a lot of data. Every weekend I go in and, manually using Workbench, TRUNCATE the last week of data that I no longer need.
I manually have to run:
DELETE FROM tableName WHERE created_date BETWEEN timeStamp1 AND timeStamp2;
Is there some way to tell the table, or set an expiration policy, so that the data is removed every Sunday for me?
If not, is there a way to automate the delete process every 7 days? Some sort of shell script, or a cron job in Node.js, that does this.
No, there is no in-built ability to run commands on a regular basis on Amazon Redshift. You could, however, run a script on another system that connects to Redshift and runs the command.
For example, a cron job that calls psql to connect to Redshift and execute the command. This could be done in a one-line script.
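For illustration, here is a minimal sketch of such a scheduled script in Python (the connection details, the retention window, and the use of the psycopg2 driver are all assumptions; a psql one-liner in the crontab works just as well):

#!/usr/bin/env python3
# Sketch of a weekly clean-up script for cron (e.g. "0 3 * * 0" runs it every Sunday at 03:00).
# The endpoint, credentials, table name, and retention window below are placeholders.
import datetime
import psycopg2

cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=7)  # adjust to your timeStamp1/timeStamp2 logic

conn = psycopg2.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="mydb", user="myuser", password="mypassword")
conn.autocommit = True  # VACUUM (below) cannot run inside a transaction block

with conn.cursor() as cur:
    cur.execute("DELETE FROM tableName WHERE created_date < %s;", (cutoff,))
    cur.execute("VACUUM tableName;")  # reclaim the space freed by the DELETE
conn.close()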
Alternatively, you could configure an AWS Lambda function to connect to Redshift and execute the command. (You would need to write the function yourself, but there are libraries that make this easier.) Then, you would configure Amazon CloudWatch Events to trigger the Lambda function on a desired schedule (e.g. once a week).
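As a sketch, one shape such a function could take is shown below, using the Redshift Data API through boto3 so that no database driver has to be bundled with the function (the cluster, database, and user names are placeholders, and the Data API is only one option):

# Sketch of a Lambda handler for the weekly clean-up, triggered by a CloudWatch Events / EventBridge schedule rule.
import datetime
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    cutoff = (datetime.datetime.utcnow() - datetime.timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",   # placeholder
        Database="mydb",                  # placeholder
        DbUser="myuser",                  # placeholder; a Secrets Manager ARN can be used instead
        Sql=f"DELETE FROM tableName WHERE created_date < '{cutoff}';",
    )
    return {"status": "submitted"}        # the statement runs asynchronously in Redshift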
A common strategy is to store data in separate tables per time period (e.g. a month, or in your case a week). Then, define a view that combines several tables. To delete a week of data, simply drop the table that contains that week's data, create a new table for this week's data, then update the view to point to the new table but not the old one.
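To illustrate the rotation (the table, view, and template names are all hypothetical), the weekly job would do something like this:

# Sketch of the weekly table rotation; all object names are hypothetical.
import datetime
import psycopg2

conn = psycopg2.connect(  # same placeholder connection details as in the first sketch
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword")

new_table = "events_" + datetime.date.today().strftime("%Y%m%d")  # e.g. events_20240107
old_table = "events_20231203"  # the week being expired (however you track it)

with conn.cursor() as cur:
    cur.execute(f"CREATE TABLE {new_table} (LIKE events_template);")  # copy the shared column layout
    # Re-point the view at the retained weeks plus the new table, dropping the expired week...
    cur.execute(f"""
        CREATE OR REPLACE VIEW events AS
        SELECT * FROM events_20231210
        UNION ALL SELECT * FROM events_20231217
        UNION ALL SELECT * FROM {new_table};
    """)
    # ...and only then drop the old table, now that the view no longer references it.
    cur.execute(f"DROP TABLE IF EXISTS {old_table};")
conn.commit()
conn.close()

A late-binding view (CREATE VIEW ... WITH NO SCHEMA BINDING) makes this swap even easier in Redshift, since the view is not tied to the underlying tables.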
By the way...
Your example uses the DELETE command, which is not the same as the TRUNCATE command.
TRUNCATE removes all data from a table. It is an efficient way to completely empty a table.
DELETE is good for removing part of a table but it simply marks rows as deleted. The data still occupies space on disk. Therefore, it is recommended that you VACUUM the table after deleting a significant quantity of data.
I'm fairly new to Databricks. I have an SQL query in a notebook and I want to download the full results (about 3000 rows) to a CSV file. However, when I run the query, it takes half an hour to display the first 1000 rows (which is useless to me) and then I have to click on "Download full results" which re-runs the query, hence the half hour it had just spent was completely wasted.
Is there a way to download the full results without first displaying the first 1000 rows in the browser?
Maybe this will help.
Create a variable and load your table into it:
Mytable = spark.table("tableName")
Then write the data out to CSV with options like:
Mytable.write.format("csv").option("delimiter", ",").option("header", "true").save("dbfs:/df/mytabledata")  # set the delimiter as per your requirement
Then you can download/access the output under the Databricks instance file system (note that save() writes a directory of part files rather than a single file).
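If the goal is simply one downloadable CSV of about 3,000 rows, a variant of the above (the query and paths are illustrative) is to collapse the result to a single part file and copy it under /FileStore, which Databricks exposes for download:

# Sketch: write the query result as a single CSV and make it downloadable.
# Assumes this runs in a Databricks notebook, where `spark` and `dbutils` are already available.
df = spark.table("tableName")  # or spark.sql("<your query here>")

(df.coalesce(1)                # ~3000 rows easily fit in a single partition / output file
   .write.mode("overwrite")
   .option("header", "true")
   .csv("dbfs:/tmp/my_query_result"))

# Spark writes a directory of part files; copy the single part file to a stable name under /FileStore.
part_file = [f.path for f in dbutils.fs.ls("dbfs:/tmp/my_query_result") if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, "dbfs:/FileStore/my_query_result.csv")
# The file can then be downloaded from https://<databricks-instance>/files/my_query_result.csv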
I have an ADF pipeline which executes a Data Flow.
The Data Flow has:
a Source, table A, which has around 1 million rows;
a Filter with a query to select only yesterday's records from the source table;
Alter Row settings which use upsert;
a Sink, which is the archival table into which the records are upserted.
The whole pipeline takes around 2 hours, which is not acceptable, even though only around 3,000 records are actually transferred/upserted.
The core count is 16. I tried round-robin partitioning with 20 partitions.
A similar archival for another table with around 100K records doesn't take more than 15 minutes.
I thought of creating a source which would select only yesterday's records, but in the dataset I can only select a table.
Please suggest anything I am missing to optimize this.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data in an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But for what you're describing here, that doesn't seem necessary.
I'm looking for a metadata table which holds all column names, table names, and creation timestamps within Spark SQL and Delta Lake. I need to be able to search by a given column name and list all the tables having that column.
This doesn't exist in baseline Spark. For this you would need to create an internal ABaC process that gathers specific metadata on process runs. For the last update time, you can parse the timestamp of an object in Hadoop when you run a 'hadoop fs -ls' command. Column names would require having a process run, recursively, a 'hive -e' with 'show create table' as input and then parsing out the header and footer; to get all table names, use the same tactic but run 'show tables'. If you have a robust YARN server running all the code, you can get the start and end times of jobs, but it is generally a nightmare to work with.
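If the tables are registered in a metastore, a rough way to do the column search from PySpark, in the spirit of the 'show tables' tactic above, is to walk the catalog (a sketch; the column name is only an example):

# Sketch: list every metastore table that has a column with a given name.
# Assumes a SparkSession `spark` with access to the metastore (e.g. a Hive or Databricks catalog).
target_column = "customer_id"  # example column name

matches = []
for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        if table.isTemporary:
            continue  # temporary views have no database and are skipped here
        columns = [c.name for c in spark.catalog.listColumns(table.name, db.name)]
        if target_column in columns:
            matches.append(f"{db.name}.{table.name}")

print(matches)

Creation and last-update timestamps are not exposed by this catalog listing, so for those you would still fall back on something like the 'hadoop fs -ls' parsing described above.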
Hi, I need help with the below requirement.
In Teradata, I have a set of SQL scripts which have to run one after another. Once they have all run, the data has to be stored in a file based on the run date. I have to automate this so that the same work is triggered once every day and the data gets stored in a new folder/file with the date included in the name.
Ex: after running the scripts one after another, one final table is created; that table has to be stored as tableYYYYMMDD. The scripts have to run every day and a new table has to be created, or else there should be only one table and the data needs to be appended to it every day.
I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).
I would like to handle the new data as follows:
insert the row if the key is not present;
if the key is present, update the row only if the timestamp column of the new row is more recent.
I think what you need is a left outer join of the new data with the existing table, the result of which you first have to save into a temporary table, and then move it to the original table, with SaveMode.Append.
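A rough PySpark sketch of that comparison logic (the key column, timestamp column, and Kudu connector options below are illustrative, not from the original setup):

# Sketch of the "insert if the key is new, update only if the new row is newer" logic.
# Assumes the kudu-spark connector is on the classpath; `new_data` is the DataFrame from the daily batch.
from pyspark.sql import functions as F

existing = (spark.read.format("kudu")
            .option("kudu.master", "kudu-master:7051")   # placeholder
            .option("kudu.table", "my_table")            # placeholder
            .load()
            .select("id", F.col("updated_at").alias("existing_updated_at")))

to_write = (new_data.alias("n")
            .join(existing.alias("e"), F.col("n.id") == F.col("e.id"), "left_outer")
            .where(F.col("e.id").isNull() |                                   # key not present yet
                   (F.col("n.updated_at") > F.col("e.existing_updated_at")))  # or the new row is more recent
            .select("n.*"))

# Write the filtered rows back; depending on the connector version this is done with an explicit
# upsert operation, or via the temporary table + SaveMode.Append approach described above.
(to_write.write.format("kudu")
         .option("kudu.master", "kudu-master:7051")
         .option("kudu.table", "my_table")
         .mode("append")
         .save())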
You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (didn't check how well it works, though, and whether it takes existing data into account).