How to query different S3-compatible object storage with PrestoSQL

Background
PrestoSQL works well with data on S3 and S3-compatible object storage (e.g., IBM Cloud Object Storage) when using the s3a:// URI prefix and a single HMAC key pair configured via hive.s3.aws-access-key and hive.s3.aws-secret-key, as described in the PrestoSQL guide Amazon S3 Configuration - Hive Connector.
Question
When data is served from two different buckets across two cloud accounts, a client has to use two different HMAC key pairs to access the objects. Does that mean it has to configure two catalogs via the Hive connector in PrestoSQL?
This is a common case when using IBM Cloud, where object storage services are managed as separate instances per cloud account.

Yes, you need to configure two separate Hive catalogs.
Alternatively, you could use client-provided extra credentials (this is currently supported for GCS, but could easily be extended to S3-compatible storage).
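For the two-catalog approach, a rough sketch of what the catalog files could look like (the file names, metastore URIs, endpoints, and key placeholders below are illustrative, not taken from the question); each account gets its own properties file under etc/catalog/ with its own HMAC key pair:

    # etc/catalog/cos_account_a.properties
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore-a:9083
    hive.s3.endpoint=s3.us-south.cloud-object-storage.appdomain.cloud
    hive.s3.aws-access-key=<HMAC access key for account A>
    hive.s3.aws-secret-key=<HMAC secret key for account A>

    # etc/catalog/cos_account_b.properties
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore-b:9083
    hive.s3.endpoint=s3.us-east.cloud-object-storage.appdomain.cloud
    hive.s3.aws-access-key=<HMAC access key for account B>
    hive.s3.aws-secret-key=<HMAC secret key for account B>

Tables in the two accounts can then be addressed as cos_account_a.<schema>.<table> and cos_account_b.<schema>.<table>, even within a single query.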

Related

Securely Configure Azure Storage Credentials for Flink

The Apache Flink docs on the Azure filesystem say that it's discouraged to put storage account keys in flink-conf.yaml. Putting them in plain text wouldn't be secure, but it's not clear to me how to store them securely. The link to the Azure documentation doesn't help. Assuming I wanted to run the Flink word-count example (local:///opt/flink/examples/batch/WordCount.jar in the Flink container) on Azure, how would I set up secure storage access? (There is the option to specify a single key as an environment variable, which could be a Kubernetes secret, but if I have more than one storage account that wouldn't work.)
A good starting point might be this Flink on Azure quickstart or the Flink operator.

How to import data into Neo4j from Azure Blob Storage?

Is there a way to import data into Neo4j from Azure Blob Storage?
I don't think there are any free tools.
On the commercial side, GraphAware Hume Orchestra has Azure Blob Storage connectors.
There is also the possibility of creating your own protocol for Neo4j LOAD CSV (e.g. S3, Azure, etc.).
I have written an example here: https://github.com/ikwattro/neo4j-load-csv-s3-protocol
I got it done by using the Python azure-blob-storage and py2neo libraries. It worked like a charm.
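A minimal sketch of that approach (the container, blob, column, and connection details below are made up for illustration): download a CSV from Blob Storage with azure-storage-blob, then create nodes with py2neo.

    import csv
    import io

    from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob
    from py2neo import Graph                          # pip install py2neo

    # Placeholder connection details.
    blob_service = BlobServiceClient.from_connection_string("<azure-storage-connection-string>")
    blob = blob_service.get_blob_client(container="imports", blob="products.csv")
    graph = Graph("bolt://localhost:7687", auth=("neo4j", "<password>"))

    # Download the CSV from Blob Storage and create/update one node per row.
    rows = csv.DictReader(io.StringIO(blob.download_blob().readall().decode("utf-8")))
    for row in rows:
        graph.run(
            "MERGE (p:Product {sku: $sku}) SET p.name = $name",
            sku=row["sku"], name=row["name"],
        )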
There are a couple of options:
https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview - creates a URL with a signature that allows you to access files directly over HTTPS. You can then LOAD CSV WITH HEADERS FROM "<url>" AS row CREATE ..., etc. This has the benefit of not requiring any additional software or custom code.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-how-to-mount-container-linux - can be used to mount an Azure storage container to a folder in your Neo4j instance (e.g. /var/lib/neo4j/import/myazurecontainer). This folder can then be used to access files in blob storage as if they're local.
I'd be hesitant to install an orchestration framework (e.g. GraphAware's Hume Orchestra) or ETL tool if you only want to load some data from Azure Storage.

Query images in object storage by metadata

I have over 10 GB of images for my e-commerce app. I am thinking of moving them to object storage (S3, Azure, Google, etc.).
That would also give me the opportunity to attach custom metadata to each object (like a NoSQL store). For example, an image with corresponding metadata: product_id, sku, tags.
I want to query my images by that metadata. For example, get all images from my object storage where meta_key = 'tag' and tag = 'nature'.
So the object storage should have indexing capabilities; I do not want to iterate over billions of images to find just one of them.
I'm new to Amazon AWS, Azure, Google, and OpenStack. I know that Amazon S3 is able to store metadata, but it doesn't have indexes (like Apache Solr).
What service is best suited to querying files/objects by custom metadata?
To do this in AWS, your best bet is going to be to pair the object store (S3) with a traditional database that stores the metadata for easy querying.
Depending on your needs, DynamoDB or RDS (in the flavor of your choice) are two AWS technologies to consider for metadata storage and retrieval.
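A hedged sketch of that pairing (the bucket, table, index, and attribute names are illustrative, and a single tag attribute per image is assumed for simplicity): store each image in S3, record its metadata in a DynamoDB table, and query by tag through a global secondary index instead of scanning objects.

    import boto3
    from boto3.dynamodb.conditions import Key

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    # Assumes a table "image_metadata" with partition key "image_id" and a
    # global secondary index "tag-index" whose partition key is "tag".
    table = dynamodb.Table("image_metadata")

    def store_image(image_id, local_path, product_id, sku, tag):
        """Upload the image to S3 and record its metadata in DynamoDB."""
        key = f"images/{image_id}.jpg"
        s3.upload_file(local_path, "my-image-bucket", key)
        table.put_item(Item={
            "image_id": image_id,
            "s3_key": key,
            "product_id": product_id,
            "sku": sku,
            "tag": tag,
        })

    def images_with_tag(tag):
        """Look up images by tag via the index, without iterating over S3."""
        resp = table.query(
            IndexName="tag-index",
            KeyConditionExpression=Key("tag").eq(tag),
        )
        return [item["s3_key"] for item in resp["Items"]]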

What is equivalent of SAS (shared access signature) feature of Azure Storage on S3?

Azure provides shared access signatures ([1], [2], [3]) that can delegate access (read/write) to specific blobs/containers/tables/queues in an Azure Storage account using an access key generated through the REST API. Does AWS offer a similar feature?
Presigned URL in S3 is the equivalent:
https://docs.aws.amazon.com/AmazonS3/latest/dev/ShareObjectPreSignedURL.html
There are also ways to generate them with the SDKs that aren't mentioned in this documentation; you can google them.
Python for example:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html
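For example, a minimal boto3 sketch (the bucket and key are placeholders) that generates a time-limited GET URL which can be shared much like an Azure SAS URL:

    import boto3

    s3 = boto3.client("s3")

    # Returns a URL granting read access to this one object for an hour.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-bucket", "Key": "reports/2020/summary.csv"},
        ExpiresIn=3600,
    )
    print(url)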
The equivalent of a shared access signature in Amazon AWS is Query String Authentication; however, it is only available for Amazon S3 (the equivalent of Windows Azure Blob Storage). AWS does not have anything similar to shared access signatures for SimpleDB/DynamoDB (the counterpart of Windows Azure Table Storage) or Simple Queue Service (the counterpart of Windows Azure Queue Service).
I also did a comparison between Amazon AWS and Windows Azure Storage Services (S3 v/s Blob Storage etc.) in a series of blog posts which you can read here: http://gauravmantri.com/?s=Amazon+Comparing. Thought you might find it useful.
It is almost identical to what Azure provides, just without a special name like SAS. See http://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html for details.
I haven't tried it yet but looks like pre-signed URLs are the thing: See the "Amazon S3: Getting a pre-signed URL for a PUT operation with a specific payload" section of http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-examples.html

Alternative to Windows Azure tables out of the cloud

I'm developing a .NET app which needs to run both on Azure and on regular Windows Server 2003 machines. It needs to store a few GB of data, and SQL Azure is too expensive for me, so I'll use Azure Tables in the cloud version. Can you recommend a storage solution that will run on standalone servers and has an API and behavior similar to Azure Tables? From what I've seen, Server AppFabric does not include Tables.
If you think about what Windows Azure Table Storage is, it is a key-value-based non-relational database that is accessible through a REST API. Please download this document about Windows Azure and NoSQL database details.
If I were in your situation, my approach would be to find something similar to Azure Table Storage that I can access over REST with a similar API. So if you try to find a similar database to run on a machine, you really need to look for:
Key Value Pair DB
Support for basic operations, i.e. add, delete, insert, and modify an entity
Partition-key- and row-key-based access
A RESTful interface to connect to
If you want to try something, you can look at:
DBreeze (a C#-based key-value NoSQL DB); I just saw it and it looks exciting
Google's LevelDB (a key-value DB, open source and available on Windows); I have no idea about its API
Redis (a great key-value DB, but I'm not sure about its Windows compatibility and API)
Here is a list of key/value databases without additional indexing facilities:
Berkeley DB
HBase
MemcacheDB
Redis
SimpleDB
Tokyo Cabinet/Tyrant
Voldemort
Riak
If none of these work, you can take any open-source DB, modify it to fit your requirements, and then make it available to others as your contribution to the community.
ADDED
Now you can also use a Windows Azure Virtual Machine to run any kind of key-value DB on a Linux or Windows machine and connect it to your application.
I'm not sure which storage solution to recommend, but just about any database solution would work provided that you write an interface to abstract all your data storage code. Then write implementations of that interface for Azure Table Storage and whatever other database you want to use on the non-cloud server.
You should be doing that anyway so that your code isn't tightly coupled with Azure Table Storage APIs.
If you combine coding against that interface with an IoC container, then a single line of code or a single configuration setting would enable you to switch between data implementations based on which platform the code is running on.
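The question is about a .NET app, so a real implementation would be in C#; purely to illustrate the shape of that abstraction (with made-up class and method names), here is a small sketch in Python:

    from abc import ABC, abstractmethod

    class EntityStore(ABC):
        """Storage abstraction the rest of the app codes against."""

        @abstractmethod
        def put(self, partition_key, row_key, entity): ...

        @abstractmethod
        def get(self, partition_key, row_key): ...

    class AzureTableStore(EntityStore):
        """Cloud implementation; would call the Azure Table Storage API."""

        def put(self, partition_key, row_key, entity):
            raise NotImplementedError("call Azure Table Storage here")

        def get(self, partition_key, row_key):
            raise NotImplementedError("call Azure Table Storage here")

    class InMemoryStore(EntityStore):
        """Stand-in for Redis/LevelDB/etc. on a standalone server."""

        def __init__(self):
            self._data = {}

        def put(self, partition_key, row_key, entity):
            self._data[(partition_key, row_key)] = entity

        def get(self, partition_key, row_key):
            return self._data.get((partition_key, row_key))

    def make_store(platform):
        # The single switch point an IoC container or config setting provides.
        return AzureTableStore() if platform == "azure" else InMemoryStore()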

Resources