How to create an Iceberg table built from multiple S3 buckets? - apache-spark

In my organization, it is a common practice not to hold more than 10-50k objects in a single S3 bucket. In Iceberg, however, I've only seen an option to configure the S3 bucket location of the data at the table level, not at the data-file level.
I wonder if there's an option for partitioning the data between multiple S3 buckets, so that data files are distributed across several buckets instead of all being written to the single bucket of the table location.
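For reference, a minimal PySpark sketch of the table-level setting referred to above: the LOCATION given when the table is created. This assumes an Iceberg catalog named my_catalog and the Iceberg Spark extensions are already configured; the bucket, database, and table names are placeholders.
```python
# Minimal sketch: the table-level location knob the question refers to.
# Assumes the Iceberg runtime, SQL extensions, and a catalog named "my_catalog"
# are already configured for this Spark session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-table-location").getOrCreate()

# LOCATION pins the whole table (data and metadata) to one bucket/prefix;
# there is no per-data-file bucket setting in this DDL.
spark.sql("""
    CREATE TABLE my_catalog.db.events (
        id      BIGINT,
        ts      TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
    LOCATION 's3://single-table-bucket/warehouse/db/events'
""")
```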

Related

Delta Lake Table Metadata Sync on Athena

Now that we can query Delta Lake tables from Athena without having to generate manifest files, does Athena also take care of automatically syncing the underlying partitions? I see that MSCK REPAIR is not supported for Delta tables. Or do I need to use a Glue crawler for this?
I am confused because of these two statements from the Athena documentation:
You can use Amazon Athena to read Delta Lake tables stored in Amazon S3 directly without having to generate manifest files or run the MSCK REPAIR statement.
Athena synchronizes table metadata, including schema, partition columns, and table properties, to AWS Glue if you use Athena to create your Delta Lake table. As time passes, this metadata can lose its synchronization with the underlying table metadata in the transaction log. To keep your table up to date, you can use the AWS Glue crawler for Delta Lake tables.
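For illustration, a hedged boto3 sketch of the setup those two statements describe: the Delta table is registered through Athena, which writes its metadata to Glue, and a Glue crawler would then be what keeps that metadata in sync over time. The database, bucket, and query-result location are placeholders.
```python
# Hedged sketch: register a Delta Lake table through Athena so its metadata
# lands in AWS Glue. Database, bucket, and result-location names are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE my_db.my_delta_table
LOCATION 's3://my-bucket/delta/my_delta_table/'
TBLPROPERTIES ('table_type' = 'DELTA')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
)
```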

Using snowpipe to read parquet file's timestamp

I am using Snowpipe to ingest data from Azure Blob into Snowflake. The issue I am facing is that there are multiple files (rows) of data with the same primary key. Is there a way for Snowpipe to capture each file's timestamp? Using the timestamp would let me keep the row with max(timestamp), as a workaround for the duplicate-key issue.
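For illustration, a hedged sketch of that timestamp idea, assuming the METADATA$FILE_LAST_MODIFIED pseudocolumn for staged files is available in your account; the stage, pipe, table, column names, and notification integration are all hypothetical.
```python
# Hedged sketch: have the pipe's COPY capture each staged file's name and
# last-modified timestamp alongside the data. All object names are hypothetical,
# and METADATA$FILE_LAST_MODIFIED is assumed to be available in your account.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)

conn.cursor().execute("""
    CREATE OR REPLACE PIPE raw_events_pipe
    AUTO_INGEST = TRUE
    INTEGRATION = 'MY_AZURE_NOTIFICATION_INT'
    AS
    COPY INTO raw_events (pk, payload, file_name, file_last_modified)
    FROM (
        SELECT $1:pk, $1, METADATA$FILENAME, METADATA$FILE_LAST_MODIFIED
        FROM @azure_blob_stage
    )
    FILE_FORMAT = (TYPE = PARQUET)
""")
```
Downstream, the duplicate-key workaround would then keep only the row with the latest file_last_modified per primary key.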

Multiple S3 credentials in a Spark Structured Streaming application

I want to migrate our Delta Lake from S3 to Parquet files in our own on-prem Ceph storage, both accessible through the S3-compatible s3a API in Spark. Is it possible to provide different credentials for readStream and writeStream to achieve this?
The s3a connector supports per-bucket configuration, so you can declare a different set of secrets, endpoint, etc. for your internal buckets than for your external ones.
Consult the Hadoop S3A documentation for the normative details and examples.
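A minimal PySpark sketch of that per-bucket approach, assuming hadoop-aws and the Delta connector are on the classpath; the bucket names, endpoint, and credentials are placeholders.
```python
# Minimal sketch: per-bucket S3A settings so readStream and writeStream can use
# different credentials/endpoints. Bucket names and secrets are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-to-ceph-migration")
    # External AWS bucket holding the Delta lake.
    .config("spark.hadoop.fs.s3a.bucket.aws-delta-bucket.access.key", "AWS_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.bucket.aws-delta-bucket.secret.key", "AWS_SECRET_KEY")
    # On-prem Ceph bucket, reached through its own S3-compatible endpoint.
    .config("spark.hadoop.fs.s3a.bucket.ceph-parquet-bucket.access.key", "CEPH_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.bucket.ceph-parquet-bucket.secret.key", "CEPH_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.bucket.ceph-parquet-bucket.endpoint", "https://ceph.internal:7480")
    .config("spark.hadoop.fs.s3a.bucket.ceph-parquet-bucket.path.style.access", "true")
    .getOrCreate()
)

# S3A picks the matching per-bucket settings from each path automatically.
stream = spark.readStream.format("delta").load("s3a://aws-delta-bucket/events")
query = (
    stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "s3a://ceph-parquet-bucket/checkpoints/events")
    .start("s3a://ceph-parquet-bucket/events")
)
```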

How to load array<string> data type from parquet file stored in Amazon S3 to Azure Data Warehouse?

I am working with Parquet files stored on Amazon S3. These files need to be extracted, and the data from them needs to be loaded into Azure Data Warehouse.
My plan is:
Amazon S3 -> Use SAP BODS to move Parquet files to Azure Blob -> Create external tables on those Parquet files -> Staging -> Fact/Dim tables
Now the problem is that in one of the Parquet files there is a column stored as an array<string>. I am able to create an external table over it using the varchar data type for that column, but if I run any SQL query (i.e. SELECT) against that external table, it throws the error below.
Msg 106000, Level 16, State 1, Line 3
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: optional group status (LIST) {
  repeated group bag {
    optional binary array_element (UTF8);
  }
} is not primitive
I have tried different data types but am unable to run a SELECT query on that external table.
Please let me know if there are any other options.
Thanks
On Azure, there is a service named Azure Data Factory which I think can be used in your current scenario, as the document Parquet format in Azure Data Factory says:
Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.
You can try to follow the tutorial Load data into Azure SQL Data Warehouse by using Azure Data Factory to set Amazon S3 with Parquet format as the source and copy the data directly into Azure SQL Data Warehouse. Because Azure Data Factory reads Parquet files with automatic schema parsing, it should make your task straightforward.
Hope it helps.

Query images in object storage by metadata

I have over 10GB of images for my ecommerce app. I am thinking of moving them to object storage (S3, Azure, Google, etc.).
That would give me the opportunity to attach custom metadata to each object (like NoSQL). For example, I have an image and its corresponding metadata: product_id, sku, tags.
I want to query my images by that metadata. For example: get all images from my object storage where meta_key = 'tag' and tag = 'nature'.
So the object storage should have indexing capabilities; I do not want to iterate over billions of images to find a single one.
I'm new to Amazon AWS, Azure, Google, and OpenStack. I know that Amazon S3 is able to store metadata, but it doesn't have indexes (like Apache Solr).
What service is best suited to query files|objects by custom metadata?
To do this in AWS, your best bet is going to be to pair the object store (S3) with a separate database that stores the metadata for easy querying.
Depending on your needs, DynamoDB or RDS (in the flavor of your choice) would be two AWS technologies to consider for metadata storage and retrieval.
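As an illustration of that pairing, a hedged boto3 sketch assuming a hypothetical DynamoDB table image_metadata with a global secondary index tag-index on a tag attribute (a real model with several tags per image might store one item per image/tag pair, or use a different index design).
```python
# Hedged sketch: S3 holds the images, DynamoDB holds the queryable metadata.
# Table, index, attribute, and bucket names are hypothetical.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("image_metadata")  # partition key: s3_key; GSI "tag-index" on tag

# On upload, write one metadata item per image object.
table.put_item(Item={
    "s3_key": "images/product-123/main.jpg",
    "bucket": "my-image-bucket",
    "product_id": "123",
    "sku": "SKU-123",
    "tag": "nature",
})

# Query by tag through the GSI instead of iterating over objects in S3.
response = table.query(
    IndexName="tag-index",
    KeyConditionExpression=Key("tag").eq("nature"),
)
for item in response["Items"]:
    print(f"s3://{item['bucket']}/{item['s3_key']}")
```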
