Daily incremental copying of data from Amazon S3 into Amazon Redshift - python-3.x

I have an RDS database whose snapshot is taken every day and kept in an S3 bucket. I copy the RDS snapshot data from S3 into an Amazon Redshift database daily. I can use the COPY command to copy the tables, but instead of copying the whole table, I want to copy only the rows that were added since the last snapshot was taken (incremental copying).
For example, in RDS there is a table named "user" which looks like this on 25-05-2021:
id | username
1 | john
2 | cathy
When I run the data loader for the first time on 26-05-2021, it will copy these two rows into the Redshift table with the same name.
Now on 26-05-2021, the table in RDS looks like this:
id | username
1 | john
2 | cathy
3 | ola
4 | mike
When I run the data loader on 27-05-2021, instead of copying all four rows, I want to copy/take only the rows which have been newly added (id = 3 and id = 4), as I already have the other rows.
What should be the best way of doing this incremental loading?

The COPY command will always load the entire table. However, you could create an External Table using Redshift Spectrum that accesses the files without loading them into Redshift. Then, you could construct a query that does an INSERT where the ID is greater than the last ID used in the Redshift table.
Perhaps I should explain it a bit simpler...
Table existing_table in Redshift already has rows up to id = 2
CREATE EXTERNAL TABLE in_data to point at the files in S3 containing the data
Then use INSERT INTO existing_table SELECT * FROM in_data WHERE id > (SELECT MAX(id) FROM existing_table)
In theory, this should only load the new rows into the table.
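A minimal SQL sketch of those steps, assuming an external schema named spectrum_schema has already been set up for Redshift Spectrum and that the snapshot files are comma-delimited text under a hypothetical S3 prefix:

-- 1. External table pointing at the snapshot files in S3 (schema name, format and path are assumptions).
CREATE EXTERNAL TABLE spectrum_schema.in_data (
    id       BIGINT,
    username VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-snapshot-bucket/user/';

-- 2. Load only the rows that are newer than what Redshift already holds.
INSERT INTO existing_table
SELECT id, username
FROM spectrum_schema.in_data
WHERE id > (SELECT MAX(id) FROM existing_table);

This relies on id being monotonically increasing and the source table being append-only, which matches the example in the question.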

Related

Azure Data Factory Merge to files before inserting in to DB

We have two files, a ^-delimited file and a comma-separated txt file, which are stored in Blob Storage like below.
File1 fields are like
ItemId^Name^c1^type^count^code^Avail^status^Ready
File2 Fields are like
ItemId,Num,c2
Here the first column in both files is the key, and based on it I need to insert them into one table in the Azure DB using Azure Data Factory. Can anyone suggest how this can be done in ADF? Should we merge the two files into one file before inserting into the database?
AzureDB columns are
ItemId Name c1 type count code Avail status Ready Num c2
So it should be like
Item1 ABC(S) 1234 Toy 10 N N/A POOL N/A 19 EM
Item2 DEF(S) 5678 toy 7 X N/A POOL N/A 6 MP
I was referring to Merging two or more files from a storage account based on a column using Azure Data Factory but couldn't understand whether we can merge the two files before inserting into the DB.
You can use the 2 files to create 2 datasets, use a Join activity to join them together, and simply sink to the SQL table in a data flow.
Here an inner join is used; you can adapt it to use the type of join you prefer.
You can see in the preview that the join successfully merged the 2 files/data sources.
Adjust the field mapping in Sink if needed.
Here is the arrow-separated.csv I used:
ItemId^Name^c1^type^count^code^Avail^status^Ready
Item1^ABC(S)^1234^Toy^10^N^N/A^POOL^N/A
Item2^DEF(S)^5678^toy^7^X^N/A^POOL^N/A
Here is the comma-separated.csv I used:
ItemId,Num,c2
Item1,19,EM
Item2,6,MP
Result in DB: the sink table contains the merged rows shown in the question.
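For reference, the Join activity here is logically equivalent to an inner join on ItemId. A rough T-SQL sketch of the same merge, assuming the two files had been staged into hypothetical tables File1 and File2:

-- Column names follow the layout given in the question; a few are bracketed defensively.
SELECT
    f1.ItemId, f1.Name, f1.c1, f1.[type], f1.[count],
    f1.code, f1.Avail, f1.[status], f1.Ready,
    f2.Num, f2.c2
FROM File1 AS f1
INNER JOIN File2 AS f2
    ON f2.ItemId = f1.ItemId;   -- same key the Join activity uses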

Incremental load without date or primary key column using azure data factory

I have a source, let's say a SQL DB or an Oracle database, and I want to pull the table data into an Azure SQL database. But the problem is I don't have any date column on which data is getting inserted, nor a primary key column. So is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; then you can use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hashbytes, etc., as sketched below). On each load you store the comparison output in the partition metadata somewhere, to be able to compare it again the next time you load. That way you can reload only the partitions that changed since your last load.
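A minimal T-SQL sketch of that per-partition comparison, assuming a SQL Server source, a hypothetical table dbo.SourceTable, and RegionId as the fairly stable partitioning column:

-- One row per partition; persist these values and diff them against the previous load.
SELECT
    RegionId,
    COUNT_BIG(*)              AS row_count,          -- cheap first-pass comparison
    CHECKSUM_AGG(CHECKSUM(*)) AS partition_checksum  -- coarse change detection
FROM dbo.SourceTable
GROUP BY RegionId;

Partitions whose row_count or partition_checksum differ from the values stored on the last run are the ones to reload.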

Bulk copy from Cassandra table column to a file

I have a requirement to copy a Cassandra database column into a file.
The database has 15 million records with the below columns in it. I want to copy the payment column data into a file. Since it is a production environment, that would lead to stress on the Cassandra clusters.
userid | contract | payment | createdDate
Any suggestions?
Out of 15 million payment details we want to modify a few (based on some condition) and insert them into a different Cassandra table.
Copy to a file -> process it -> write it to a new database table: that is the plan. But first of all, how do we get a copy of the column from the Cassandra database?
Regards
Kiran
You can use Spark + Spark Cassandra Connector (SCC) to perform data loading, modification and writing back. SCC has a number of knobs that you can use to tune throughput, to not overload the cluster when reading & writing.
If you don't have Spark, you can still use a similar approach when fetching the data: instead of issuing select * from table (which will overload the node that handles the request), perform the load by specific token ranges, so the queries go to different servers and don't overload any of them too much. You can find a code example that does a scan by token ranges here.
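As a rough CQL illustration of one such token-range query (keyspace and table names are hypothetical, and userid is assumed to be the partition key):

-- One slice of the ring; the script iterates over consecutive token ranges until the full range is covered.
SELECT userid, payment
FROM payments_ks.payments
WHERE token(userid) > -9223372036854775808
  AND token(userid) <= -9151314442816847873;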

Group data and extract average in Cassandra cqlsh

Let's say we have a keyspace named sensors and a table named sensor_per_row.
This table has the following structure:
sensor_id | ts | value
In this case sensor_id is the partition key and ts (which is the date the record was created) is the clustering key.
select sensor_id, value , TODATE(ts) as day ,ts from sensors.sensor_per_row
The outcome of this select is
sensor_id | value | day | ts
-----------+-------+------------+---------------
Sensor 2 | 52.7 | 2019-01-04 | 1546640464138
Sensor 2 | 52.8 | 2019-01-04 | 1546640564376
Sensor 2 | 52.9 | 2019-01-04 | 1546640664617
How can I group the data by ts, or more specifically by date, and return the daily average value for each sensor using cqlsh? For instance:
sensor_id | system.avg(value) | day
-----------+-------------------+------------
Sensor 2 | 52.52059 | 2018-12-11
Sensor 2 | 42.52059 | 2018-12-10
Sensor 3 | 32.52059 | 2018-12-11
One way, I guess, is to use a UDF (user-defined function), but such a function runs only on one row. Is it possible to select data inside a UDF?
Another way is using Java etc., with multiple queries for each day, or processing the data at some other contact point such as a REST web service, but I don't know how efficient that would be... Any suggestions?
NoSQL Limitations
While working with NoSQL, we generally have to give up:
Some ACID guarantees.
Consistency from CAP.
Shuffling operations: JOIN, GROUP BY.
You may perform the above operations by reading the data (rows) from the table and doing the aggregation yourself.
You can also refer to the answer MAX(), DISTINCT and group by in Cassandra
So I found the solution; I will post it in case somebody else has the same question.
From what I read, data modeling seems to be the answer. Which means:
In a Cassandra DB we have partition keys and clustering keys. Cassandra has the ability to handle multiple inserts simultaneously. That gives us the possibility of inserting the data into more than one table at the same time, which pretty much means we can create different tables for the same data collection application, to be used in a way similar to materialized views (MySQL).
For instance, let's say we have the log schema {sensor_id, region, value}.
The first thing that comes to mind is to create a table called sensor_per_row like:
sensor_id | value | region | ts
-----------+-------+------------+---------------
This is a very efficient way of storing the data for a long time, but given Cassandra's functions it is not that simple to visualize the data and gain analytics out of it.
Because of that, we can create different tables with a TTL (TTL stands for time to live), which simply defines how long the data will be stored.
For instance, if we want the daily measurements of a specific sensor, we can create a table with day & sensor_id as partition keys and the timestamp as the clustering key in descending order.
If we add a TTL value of 24*60*60 = 86,400 seconds, which corresponds to one day, we keep only the daily data.
So creating, let's say, a table sensor_per_day with the above format and TTL will actually give us the daily measurements, as sketched below. And at the end of the day the table will be refreshed with the newer measurements, while the data remains stored in the previous table sensor_per_row.
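Here is a CQL sketch of such a sensor_per_day table; the column types are assumed from the examples above, and the keyspace name is reused from the question:

-- Partitioned by (sensor_id, day); newest measurement first; rows expire after one day.
CREATE TABLE sensors.sensor_per_day (
    sensor_id text,
    day       date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 86400;   -- 24*60*60 seconds = one day

-- The daily average for one sensor then becomes a single-partition aggregate:
SELECT sensor_id, avg(value)
FROM sensors.sensor_per_day
WHERE sensor_id = 'Sensor 2' AND day = '2019-01-04';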
I hope I gave you the idea.

Can you add more than one partition in one "ALTER TABLE" command?

I'm using Amazon Athena to query through some log files stored in an S3 bucket, and am using partitions to section off days of the year for the files I need to query. I was wondering -- since I have a large batch of days to add to my table, could I do it all in one ALTER TABLE command, or do I need to have as many ALTER TABLE commands as the number of partitions I would like to create?
This is an example of the command I am using at the moment:
ALTER TABLE logfiles
ADD PARTITION (day = '20170525')
LOCATION 's3://log-bucket/20170525/';
If I do have to use one ALTER TABLE command per partition, is there a way to create a range of days, and then have Athena loop through it to create the partitions, instead of me manually copy/pasting out this command 100+ times?
It appears that you can add many partitions in one ALTER TABLE command, per the Athena documentation at https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html (or go to the Athena documentation root and search for "add partition").
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
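Applied to the logfiles table from the question, one statement covering several days would therefore look like this (bucket layout taken from the single-day example above):

ALTER TABLE logfiles ADD
    PARTITION (day = '20170525') LOCATION 's3://log-bucket/20170525/'
    PARTITION (day = '20170526') LOCATION 's3://log-bucket/20170526/'
    PARTITION (day = '20170527') LOCATION 's3://log-bucket/20170527/';

The statement still has to list every partition explicitly, so for a long range of days the usual approach is to generate the text of the statement with a short script and submit it as a single query.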
