Using Snowpipe to read a parquet file's timestamp - snowpipe

I am using Snowpipe to ingest data from Azure Blob Storage into Snowflake. The issue I am facing is that there are multiple files (rows) of data with the same primary key. Is there a way for Snowpipe to capture each file's timestamp? Having the timestamp would allow me to keep the row with max(timestamp), as a workaround for the duplicate-key issue.
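One possible approach, sketched below with placeholder database, table, stage, and column names: Snowflake exposes metadata columns for staged files such as METADATA$FILENAME and METADATA$FILE_ROW_NUMBER, and newer releases also expose METADATA$FILE_LAST_MODIFIED; capturing one of these in the pipe's COPY statement makes a max-timestamp de-duplication possible downstream.

```
-- Hypothetical pipe: capture file-level metadata next to the parquet payload.
-- raw.events_staging and @raw.azure_stage are assumed to exist already.
CREATE OR REPLACE PIPE raw.events_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = 'AZURE_NOTIFICATION_INT'   -- hypothetical notification integration name
AS
  COPY INTO raw.events_staging (pk, payload, src_file, src_file_row, file_last_modified)
  FROM (
    SELECT
      $1:pk::STRING,                -- primary key field inside the parquet record
      $1,                           -- full record as VARIANT
      METADATA$FILENAME,            -- name of the source parquet file
      METADATA$FILE_ROW_NUMBER,     -- row number within that file
      METADATA$FILE_LAST_MODIFIED   -- file's last-modified time; check that your Snowflake release exposes this column
    FROM @raw.azure_stage
  )
  FILE_FORMAT = (TYPE = 'PARQUET');

-- Downstream: keep only the latest row per primary key.
SELECT *
FROM raw.events_staging
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY pk
  ORDER BY file_last_modified DESC
) = 1;
```

If METADATA$FILE_LAST_MODIFIED is not available on your account, ordering by METADATA$FILENAME can serve as a fallback when file names are generated in a sortable (e.g. timestamped) pattern.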

Related

How to load array<string> data type from parquet file stored in Amazon S3 to Azure Data Warehouse?

I am working with parquet files stored on Amazon S3. These files need to be extracted and the data from them loaded into Azure Data Warehouse.
My plan is:
Amazon S3 -> Use SAP BODS to move parquet files to Azure Blob -> Create External tables on those parquet files -> Staging -> Fact/ Dim tables
Now the problem is that one of the parquet files has a column stored as an array<string>. I am able to create an external table over it using the varchar data type for that column, but if I perform any SQL query operation (e.g. SELECT) on that external table, it throws the error below.
Msg 106000, Level 16, State 1, Line 3
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: optional group status (LIST) {
  repeated group bag {
    optional binary array_element (UTF8);
  }
} is not primitive
I have tried different data types but am unable to run a SELECT query on that external table.
Please let me know if there are any other options.
Thanks
On Azure there is a service named Azure Data Factory which I think can be used in your scenario, as the document Parquet format in Azure Data Factory says:
Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.
You can try to follow the tutorial Load data into Azure SQL Data Warehouse by using Azure Data Factory to set Amazon S3 with parquet format as the source and copy the data directly to Azure SQL Data Warehouse. Because Azure Data Factory reads the parquet files with automatic schema parsing, the task should be straightforward.
Hope it helps.

How to skip already copied files in Azure data factory, copy data tool?

I want to copy data from Blob Storage (parquet format) to Cosmos DB. I scheduled the trigger to run every hour, but all the files/data get copied in every run. How can I skip the files that have already been copied?
There is no unique key in the data. We should not copy the same file content again.
Based on your requirements, you could look at the modifiedDatetimeStart and modifiedDatetimeEnd properties in the Blob Storage dataset.
However, you would need to update the dataset configuration periodically via the SDK to move those property values forward.
Two other solutions you could consider:
1. Use a Blob Trigger Azure Function. It is triggered whenever a blob file is added or modified, and you could then transfer the data from Blob Storage to Cosmos DB with SDK code.
2. Use Azure Stream Analytics. You could configure the input as Blob Storage and the output as Cosmos DB (a minimal query sketch follows below).
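For the Stream Analytics route, the job query itself can stay minimal; a sketch, where [blob-input] and [cosmos-output] are hypothetical alias names configured on the job's input and output:

```
-- Pass every event from the Blob Storage input to the Cosmos DB output.
-- [blob-input] and [cosmos-output] are hypothetical job aliases.
SELECT
    *
INTO
    [cosmos-output]
FROM
    [blob-input]
```

If a Document ID column is configured on the Cosmos DB output, matching documents are upserted rather than duplicated; without any unique key in the data, though, that setting alone cannot fully prevent re-copies.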

Azure IoT data warehouse updates

I am building an Azure IoT solution for my BI project. For now I have an application that, once per set time window, sends a .csv blob to Azure Blob Storage with an incremental number in its name. So after some time I will have files in my storage such as 'data1.csv', 'data2.csv', 'data3.csv', etc.
Now I need to load these data into a database, which will be my warehouse, using an Azure Stream Analytics job. The issue is that the .csv files will have overlapping data. They will be sent every 4 hours and contain data for the past 24 hours. I need to always read only the last file (the one with the highest number) and prepare a lookup so it properly updates the data in the warehouse. What would be the best approach to make Stream Analytics read only the latest file and update records in the DB?
EDIT:
To clarify: I am fully aware that ASA is not meant to be an ETL job. My question is what the best approach would be for my case using IoT tools.
I would suggest one of these 2 ways:
1. Use ASA to write into a temporary SQL table, and then use a SQL trigger to update the main table of the DW with the diff (a sketch of this option is shown below).
2. Or remove duplicates by adding a unique constraint as described here: https://blogs.msdn.microsoft.com/streamanalytics/2017/01/13/how-to-achieve-exactly-once-delivery-for-sql-output/
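A sketch of the first option, assuming an Azure SQL Database as the warehouse (dedicated SQL pool / Azure SQL DW does not support triggers) and hypothetical table and column names:

```
-- Hypothetical staging table that the Stream Analytics job writes into.
-- dbo.DeviceReadings (the main DW table) is assumed to already exist with the same columns.
CREATE TABLE dbo.DeviceReadingsStaging (
    DeviceId    NVARCHAR(100) NOT NULL,
    ReadingTime DATETIME2     NOT NULL,
    Value       FLOAT         NOT NULL
);

-- Trigger: after each batch insert from ASA, merge the new rows into the main table,
-- so overlapping rows from consecutive CSV files update existing records instead of duplicating them.
CREATE TRIGGER dbo.trgMergeDeviceReadings
ON dbo.DeviceReadingsStaging
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    MERGE dbo.DeviceReadings AS target
    USING (SELECT DISTINCT DeviceId, ReadingTime, Value FROM inserted) AS src
        ON  target.DeviceId    = src.DeviceId
        AND target.ReadingTime = src.ReadingTime
    WHEN MATCHED THEN
        UPDATE SET target.Value = src.Value
    WHEN NOT MATCHED THEN
        INSERT (DeviceId, ReadingTime, Value)
        VALUES (src.DeviceId, src.ReadingTime, src.Value);
END;
```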
Thanks,
JS - Azure Stream Analytics

Can I save some JSON data from a file of 1MB in Azure table storage?

I want to store and retrieve some JSON data from a file of up to 1 MB in size. Should I use Azure Table Storage or Blob Storage?
An entity in Table Storage (equivalent to a row in a table in an RDBMS) can be up to 1 MB, but individual attributes in an entity (equivalent to columns) can only be 64 KB. You can spread your JSON over multiple attributes, but this only works if you can guarantee that every file will always stay well below 1 MB. (You will need some room for system attributes like PartitionKey, RowKey, etc.)
I would suggest looking into another store: DocumentDB, MongoDB, or perhaps even a Redis cache backed by another non-volatile store. Maybe an Azure SQL DB will suffice, now that it has support for querying JSON values.
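If the Azure SQL DB route is taken, a minimal sketch (table name and JSON paths are made up) of keeping each document in an NVARCHAR(MAX) column and reading it back with the built-in JSON functions:

```
-- Hypothetical table: one JSON document (NVARCHAR(MAX) holds far more than 1 MB) per row.
CREATE TABLE dbo.JsonDocuments (
    Id  INT IDENTITY PRIMARY KEY,
    Doc NVARCHAR(MAX) NOT NULL CHECK (ISJSON(Doc) = 1)
);

-- Extract a scalar value; '$.deviceId' is a hypothetical path into the document.
SELECT
    Id,
    JSON_VALUE(Doc, '$.deviceId') AS DeviceId
FROM dbo.JsonDocuments;

-- Expand a nested array; '$.readings' is likewise a hypothetical path.
SELECT d.Id, j.[key] AS ArrayIndex, j.[value] AS Reading
FROM dbo.JsonDocuments AS d
CROSS APPLY OPENJSON(d.Doc, '$.readings') AS j;
```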
Another solution would be saving the files in Blob Storage and referencing them from Table Storage. If you need to look up multiple files at once, this might be slower, though.
+1 for the solution of storing the data in Blob Storage and referencing the blob URI in the table. You can also update the blob metadata properties with the unique identifiers from the table, so that even if you only retrieve the blobs you can tell which entity each one belongs to.

Insert with AZCopy to Azure tables with original time stamp

I've successfully copied an Azure Storage Table from Azure to the local emulator using AzCopy. However, when looking at the local table, there are two columns named "Timestamp" and "TIMESTAMP". The latter contains the original timestamp, while the former is the timestamp from when the row was inserted locally.
I can't figure out whether it's possible to keep the original timestamp with AzCopy or not. The "Timestamp" column I get is quite useless.
I assume you took the following two steps to copy an Azure Storage Table to the local emulator with AzCopy:
Exporting the Azure Storage Table to local files or blobs;
Importing from the local files or blobs into the local emulator table.
Please correct me if my assumption is wrong.
About the column "TIMESTAMP": does your original Azure Storage Table contain this column? If not, it would be unexpected behavior to us, since AzCopy shouldn't introduce any additional columns ("TIMESTAMP" here) during exporting and importing. If that is the case, please share more information with us so that we can verify whether it's a bug in AzCopy.
Regarding your question "if it's possible to keep the original timestamp with AzCopy or not", the answer is NO. Timestamp is a property maintained by the Azure Table Storage service; users are not able to customize its value.
