How to run a query over more than one Azure HDInsight (HBase) cluster installed in different regions?

I am new to Azure and HBase.
Say that I have 2 HDInsight (HBase) clusters, one installed in Asia and one in Europe, to get better read/write performance for users accessing from different countries.
But how do I run a query over all the data in these clusters? Do I need to run the query separately on each cluster and then combine the results? Or is there a built-in feature like Distributed Queries in SQL Server?

There is no distributed query across clusters in HBase. In your scenario the best solution would probably be to set up replication between the two HBase clusters and then query one of them. The data in both clusters will be complete, with the data from the other cluster a few minutes stale, since replication is asynchronous. You can also set up more complex replication topologies, for example a separate central cluster that holds the superset of the data while the two others hold their local subsets.
The HDInsight team is working on documentation for setting up replication in Azure. For now you would need to work out the configuration yourself: provision the clusters in VNets, connect the VNets, make sure name resolution is set up correctly, and then follow the HBase replication setup steps: http://hbase.apache.org/book.html#_cluster_replication
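For illustration, here is a minimal sketch of adding a replication peer programmatically, assuming the HBase 2.x Java client (hbase-client) and a hypothetical ZooKeeper quorum for the remote cluster; older HBase versions use the ReplicationAdmin class or the hbase shell's add_peer command instead:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddReplicationPeer {
    public static void main(String[] args) throws Exception {
        // Connects to the local (e.g. Asia) cluster using its hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Cluster key of the remote (e.g. Europe) cluster:
            // "<zk quorum>:<zk client port>:<znode parent>"; hostnames are hypothetical.
            ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
                    .setClusterKey("eu-zk1.example.net,eu-zk2.example.net,eu-zk3.example.net:2181:/hbase")
                    .build();
            admin.addReplicationPeer("europe", peer);
            // Note: column families you want replicated must be created/altered
            // with REPLICATION_SCOPE => 1.
        }
    }
}
```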
Without a replication setup you would need to query both clusters separately and combine the results yourself, as sketched below.
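As a rough sketch of that "query both and combine" approach, assuming the hbase-client library, a table named events present on both clusters, and hypothetical ZooKeeper quorums:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class MultiClusterScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper quorums for the Asia and Europe clusters.
        String[] quorums = {
                "asia-zk1.example.net,asia-zk2.example.net",
                "eu-zk1.example.net,eu-zk2.example.net"
        };
        List<Result> combined = new ArrayList<>();
        for (String quorum : quorums) {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", quorum);
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"));
                 ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    combined.add(row);   // client-side union of both clusters' rows
                }
            }
        }
        System.out.println("Rows fetched from both clusters: " + combined.size());
    }
}
```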

Related

Is Azure SQL Database a Distributed SQL database?

I am trying to understand the differences between the new CockroachDB and other distributed SQL databases, as compared to a cloud-managed database like Azure SQL Database.
It seems there is no difference in the use cases between them:
Like various NoSQL databases, SQL (in general) allows partitioning on keys.
I can add cores in Azure to increase performance as needed, and I can also switch to Hyperscale if I have an elastic workload.
I can have read replicas across multiple nodes over multiple availability zones (geo-locations).
I can configure data replication in Azure SQL Database too.
It seems to me that a cloud SQL database covers all the use cases the newer distributed databases cover, so why would I want to use a newer product?
Isn't Azure SQL Database basically a distributed database server?
Am I missing something ?
Is Azure SQL Server a Distributed SQL database?
No.
Like various NoSQL databases, SQL (in general) allows partitioning on keys.
Partitioning in NoSQL databases like Cassandra (and Azure Table Storage) is about distributing partitions to physically distinct nodes, and requires rows to have an explicitly set partition-key value.
Cassandra nodes are physically different machines that can run independently, which gives it excellent resiliency.
Partitioning in SQL Server, Azure SQL, and Azure SQL Managed Instance is about dividing data up into row-groups that exist in the same server for performance, not resiliency.
In on-premises MS SQL Server, these row-groups (well, partitions) can live in different FILEGROUPs, which means they can sit on different storage volumes to avoid IO bottlenecks, but Azure SQL does not support multiple FILEGROUPs.
The benefits of implementing partitioning, including on Azure SQL, are documented online - and the article explains how it's about performance, not resilience.
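To make the distinction concrete, here is a small sketch of that kind of partitioning executed over JDBC; the server name, table, and boundary values are made up, and on Azure SQL Database all partitions necessarily land on the single PRIMARY filegroup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionedTableDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical Azure SQL Database connection string.
        String url = "jdbc:sqlserver://myserver.database.windows.net:1433;"
                + "databaseName=mydb;user=demo;password=<secret>;encrypt=true";
        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement()) {
            // Rows are split into per-year partitions, but they all still live
            // on the same logical server (and the same filegroup on Azure SQL).
            st.execute("CREATE PARTITION FUNCTION pfOrderYear (int) "
                    + "AS RANGE RIGHT FOR VALUES (2021, 2022, 2023)");
            st.execute("CREATE PARTITION SCHEME psOrderYear "
                    + "AS PARTITION pfOrderYear ALL TO ([PRIMARY])");
            st.execute("CREATE TABLE Orders (OrderId int NOT NULL, OrderYear int NOT NULL) "
                    + "ON psOrderYear(OrderYear)");
        }
    }
}
```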
I can add cores in Azure to increase performance as needed, and I can also switch to Hyperscale if I have an elastic workload.
This fact has absolutely nothing to do with distributed databases.
I can have read replicas across multiple nodes over multiple availability zones (geo-locations).
I can configure data replication in Azure SQL Database too.
Replication isn't the same thing as a true distributed database:
In Cassandra and other distributed databases, all clients can connect to all nodes and accomplish the same tasks; and you can arbitrarily add and remove nodes while the system is running.
In SQL Server and Azure SQL's replication feature, the replica is strictly a "secondary" that is subordinate to your primary server.
Clients can connect to either the secondary or the primary, but the secondary server can only perform read-only queries, whereas if a client wants to do DML (INSERT/UPDATE/DELETE/MERGE) or DDL (CREATE/ALTER) then the client must connect to the primary server.
It seems to me that a cloud SQL database covers all the use cases the newer distributed databases cover, so why would I want to use a newer product?
It can't: because Azure SQL is not a distributed database it cannot allow any client to read and write to any node or endpoint and have that change replicated to all other nodes (using an eventual consistency model). Instead, Azure SQL requires writes to be performed by the single primary "server".
Note that an Azure SQL "server" or logical server is largely an abstraction that hides what Azure SQL really is: a distinct build of SQL Server's engine that runs in a high-availability Azure Service Fabric environment (which is how cores/RAM can be added and removed while it's running and provides for some kind of local resilience against hardware failure) in a single Azure datacenter.

Azure Databricks Cluster Questions

I am new to Azure and am trying to understand the points below. It would be helpful if anyone could share their knowledge on this.
Can a table created in Cluster A be accessed in Cluster B if Cluster A is down?
What is the connection between the cluster and the data in the tables?
You need to have a running process (cluster) to be able to access the metastore and read data, because the data is stored in the customer's own storage and is not directly accessible from the control plane that runs the UI.
When you write data into a table, that data should be available in another cluster under the following conditions (see the sketch after this list):
both clusters use the same metastore
the user has the correct permissions (which can be enforced via table ACLs)
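As a minimal sketch of what that looks like in practice, assuming both clusters share the same metastore and the first cluster has already written a table (the database and table names demo_db.events are hypothetical; on Databricks the SparkSession is normally provided for you):

```java
import org.apache.spark.sql.SparkSession;

public class ReadSharedTable {
    public static void main(String[] args) {
        // Cluster A may have created the table earlier, e.g.:
        //   spark.sql("CREATE TABLE demo_db.events AS SELECT ...");
        // Cluster B can then resolve it through the shared metastore,
        // provided the underlying storage path is reachable and ACLs allow it.
        SparkSession spark = SparkSession.builder()
                .appName("read-shared-table")
                .enableHiveSupport()
                .getOrCreate();
        spark.table("demo_db.events").show(10);
    }
}
```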

Common metadata in databricks cluster

I have 3-4 clusters in my Databricks workspace on the Azure cloud platform. I want to maintain a common metastore for all the clusters. Let me know if anyone has implemented this.
I recommend configuring an external Hive metastore. By default, Databricks spins up its own metastore behind the scenes, but you can create your own database (Azure SQL works, as do MySQL and Postgres) and specify it during cluster startup.
Here are detailed steps:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
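As a rough sketch of the settings involved (on Databricks these keys normally go into the cluster's Spark config or an init script rather than application code; the server name, credentials, and Hive version below are placeholders):

```java
import org.apache.spark.sql.SparkSession;

public class ExternalMetastoreExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("external-hive-metastore")
                // JDBC connection to the metastore database (Azure SQL here).
                .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                        "jdbc:sqlserver://myserver.database.windows.net:1433;database=hivemetastore")
                .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
                        "com.microsoft.sqlserver.jdbc.SQLServerDriver")
                .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hiveuser")
                .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "<secret>")
                // Hive metastore client version and jars to use.
                .config("spark.sql.hive.metastore.version", "2.3.7")
                .config("spark.sql.hive.metastore.jars", "builtin")
                .enableHiveSupport()
                .getOrCreate();
        // All clusters pointing at the same metastore database see the same tables.
        spark.sql("SHOW DATABASES").show();
    }
}
```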
Things to be aware of:
The Data tab in Databricks - you can choose the cluster and see the different metastores.
To avoid using a SQL username and password, look at managed identities: https://learn.microsoft.com/en-us/azure/stream-analytics/sql-database-output-managed-identity
Automate the external Hive metastore connection by using initialization scripts for your cluster.
Permissions management on your sources. In the case of ADLS Gen2, consider using credential passthrough.

How to configure Hazelcast clusters in a local environment for testing

I'm working on creating a Hazelcast backup cluster alongside the primary cluster.
I want to set up the primary cluster on one machine and a backup cluster on another machine so that it holds backup copies of the maps.
How do I do that?
If you want to synchronize an entire cluster from a remote cluster, you'll need Hazelcast's WAN Replication feature, which is available in the Enterprise version. Please see the documentation at https://hazelcast.com/product-features/wan-replication.
If you actually only want to maintain backups within the same cluster, by virtue of having multiple nodes in the same datacenter, then this is available out of the box in the open-source version. For example, Map has a default backup count of 1, so if you have 2 machines clustered, you already have a backup of every entry.
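Here is a minimal open-source sketch of that second option, assuming Hazelcast 5.x on the classpath; it starts two members in one JVM purely for local testing (on two machines you would start one member per machine and let them discover each other):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class BackupCountDemo {
    public static void main(String[] args) {
        Config config = new Config();
        config.setClusterName("demo-cluster");
        // One synchronous backup per entry (this is also the default for maps).
        config.getMapConfig("orders").setBackupCount(1);

        HazelcastInstance member1 = Hazelcast.newHazelcastInstance(config);
        HazelcastInstance member2 = Hazelcast.newHazelcastInstance(config);

        IMap<String, String> orders = member1.getMap("orders");
        orders.put("order-1", "pending");

        // If the member owning the entry goes away, the backup on the other
        // member is promoted and the data is still readable.
        member1.shutdown();
        IMap<String, String> ordersAfterFailover = member2.getMap("orders");
        System.out.println(ordersAfterFailover.get("order-1")); // prints "pending"

        member2.shutdown();
    }
}
```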

Accessing Raw Data for Hadoop

I am looking at the data.seattle.gov data sets and I'm wondering, in general, how all of this large raw data can get sent to Hadoop clusters. I am using Hadoop on Azure.
It looks like data.seattle.gov is a self-contained data service, not built on top of the public cloud.
They have their own RESTful API for data access.
Therefore I think the simplest way is to download the data you are interested in to your Hadoop cluster, or to S3 and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has relevant query capabilities, you can query the data on demand from your Hadoop cluster, passing data references as input. That will only work well if you are doing very serious data reduction in those queries; otherwise network bandwidth will limit performance.
In Windows Azure you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
Check out the blog post "Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster":
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
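A small sketch of reading such a file from the Hadoop side, assuming the hadoop-azure (wasb) connector and made-up storage account, container, and blob names (on an HDInsight cluster the account key is usually already configured in core-site.xml):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromAzureStorage {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Only needed if the key is not already present in core-site.xml.
        conf.set("fs.azure.account.key.mystorageacct.blob.core.windows.net", "<storage-account-key>");

        Path path = new Path("wasb://data@mystorageacct.blob.core.windows.net/seattle/police-reports.csv");
        try (FileSystem fs = FileSystem.get(URI.create(path.toString()), conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(reader.readLine()); // print the header row as a quick check
        }
    }
}
```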
You can also get your data from the Azure Marketplace, e.g. government data sets, etc.:
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx
