Elasticsearch Azure cluster: prevent public write - azure

We recently created an Azure Elasticsearch cluster. Everything around writing, querying, and fail-over works fine. But the cluster URL is public: from any machine I can write data to the index. What is the best way to secure the cluster and prevent public writes, so that only authenticated machines can write data to the index?
Thanks,
Manish

Related

WriteStream a DataFrame to Elasticsearch behind an Azure API Management that requires a client certificate?

We have an environment where Elasticsearch is protected behind Azure API Management (APIM). We have this locked down with client certificate requirements (as well as other security measures). Calls that come into APIM without the client cert are rejected.
I have a new system I am bringing online where data is stored in Delta Lake tables and processed with PySpark (using Azure Synapse). At the end of the processing, I want to push the final product to Elasticsearch. I know that I can write to Elasticsearch using org.elasticsearch.spark, but I don't see any way to include a client certificate so that I can get through APIM.
Are any of these possible?
Include a certificate when making the connection to Elasticsearch for the writeStream.
Use .Net to do the streaming reads and writes. I am not yet sure what capabilities Microsoft.Spark has and whether it can read from Delta tables with structured streaming. If it does work, I can use my existing libraries for calling Elasticsearch.
Find a way to peer the VNets so that I can call Elasticsearch via a local IP address. I am doing this in another system, but in that case I have access to both VNets. With Synapse, the Spark pool is managed and I don't think it supports the Azure VNet peering functionality.
Something else?
Thanks!
The only solution I have found is to enable a Private Link Service on the Load Balancer of the cluster where I am running Elasticsearch, and then create a Managed Private Endpoint in Synapse connected to that PLS.
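For anyone in the same situation, here is a minimal sketch of what the Synapse side can look like once the Managed Private Endpoint is in place. The endpoint hostname, paths, index name, and credentials are placeholders, and it assumes the elasticsearch-hadoop connector (org.elasticsearch.spark.sql) is available on the Spark pool:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the Delta table as a stream (path is a placeholder).
    stream_df = spark.readStream.format("delta").load("/delta/final_product")

    # Write the stream to Elasticsearch through the private endpoint.
    # "es.nodes" points at the hostname the Managed Private Endpoint resolves to,
    # so the traffic never goes through APIM or the public internet.
    query = (
        stream_df.writeStream
            .format("org.elasticsearch.spark.sql")
            .option("checkpointLocation", "/checkpoints/es_sink")  # required for structured streaming
            .option("es.nodes", "my-es-private-endpoint.internal")  # placeholder hostname
            .option("es.port", "9200")
            .option("es.nodes.wan.only", "true")  # only the endpoint host is reachable
            .option("es.net.http.auth.user", "elastic")  # placeholder credentials
            .option("es.net.http.auth.pass", "<password>")
            .start("final-product-index")  # target index
    )
    query.awaitTermination()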

Azure Databricks Cluster Questions

I am new to Azure and am trying to understand the things below. It would be helpful if anyone could share their knowledge on this.
Can a table created in Cluster A be accessed in Cluster B if Cluster A is down?
What is the connection between the cluster and the data in the tables?
You need a running process (a cluster) to access the metastore and read data, because the data is stored in the customer's storage location and is not directly accessible from the control plane that runs the UI.
When you write data into a table, that data will be available from another cluster under the following conditions:
both clusters use the same metastore
the user has the correct permissions (which can be enforced via Table ACLs)
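To illustrate the shared-metastore point, a minimal sketch (database, table, and group names are made up):

    # On cluster A: create a managed table. The table metadata goes to the shared
    # metastore and the data files go to the storage location in the customer's account.
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    df = spark.range(100).withColumnRenamed("id", "value")
    df.write.mode("overwrite").saveAsTable("analytics.events")

    # Optionally restrict access with Table ACLs (must be enabled on the workspace).
    spark.sql("GRANT SELECT ON TABLE analytics.events TO `readers`")

    # On cluster B (any running cluster attached to the same metastore), the table
    # can still be read even if cluster A has been terminated.
    spark.table("analytics.events").count()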

Deploying local Elasticsearch cluster on Azure

I am using Apache Nutch to crawl sites for one of my projects. The data and the content are successfully crawled. For indexing and search queries, I am using Elasticsearch clusters to process the data easily. I am really new to Elasticsearch clusters; locally everything is working fine. But now I want to deploy the same local clusters to Azure services so I can work with the data from another application I am building. I have seen some tutorials, but there is no option for deploying your local clusters to Azure.
Please guide me through this.
You cannot deploy a local cluster on Azure; you need to set up another Elasticsearch cluster on Azure nodes and point your writing application to that cluster.
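Once the Azure-hosted cluster is up, the writing side only needs to be repointed at the new endpoint. A minimal sketch with the official Python client (hostname, credentials, and index name are placeholders):

    from elasticsearch import Elasticsearch

    # Point the client at the Elasticsearch cluster running on the Azure nodes
    # instead of localhost (hostname and credentials are placeholders).
    es = Elasticsearch(
        "https://my-es-on-azure.example.com:9200",
        basic_auth=("elastic", "<password>"),  # 8.x client; older clients use http_auth=
    )

    # Index a crawled document exactly as you would against the local cluster.
    doc = {"url": "https://example.org/page", "content": "crawled text ..."}
    es.index(index="nutch-crawl", id="page-1", document=doc)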

Data Migration from AWS RDS to Azure SQL Data Warehouse

I have my application's database running in AWS RDS (PostgreSQL). I need to migrate the data from AWS to Azure SQL Data Warehouse.
This is a kind of ETL process, and I need to do some calculations/computations/aggregations on the data from PostgreSQL and put it in a different schema in Azure SQL Data Warehouse for reporting purposes.
Also, I need to sync the data on a regular basis without duplication.
I am new to this data migration concept, so kindly let me know the best possible ways to achieve this task.
Thanks!!!
Azure Data Factory is the option for you. It is a cloud data integration service that composes data storage, movement, and processing services into automated data pipelines.
Please find the PostgreSQL connector below:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-postgresql-connector
On the transform part you may have to put in some custom intermediate steps to do the data massaging.
Have you tried the Azure Data Factory suggestion?
Did it solve your issue?
If not, you can try using Alooma. This solution can replicate a PostgreSQL database hosted on Amazon RDS to Azure SQL Data Warehouse in near real time. (https://www.alooma.com/integrations/postgresql/)
Follow these steps to migrate from RDS to Azure SQL Data Warehouse:
Verify your host configuration
On the RDS dashboard under Parameter Groups, navigate to the group that's associated with your instance.
Verify that hot_standby and hot_standby_feedback are set to 1.
Verify that max_standby_archive_delay and max_standby_streaming_delay are greater than 0 (we recommend 30000).
If any of the parameter values need to be changed, click Edit Parameters.
Connect to Alooma
You can connect via an SSH server (https://support.alooma.com/hc/en-us/articles/214021869-Connecting-to-an-input-via-SSH) or whitelist access to Alooma's IP addresses:
52.35.19.31/32
52.88.52.130/32
52.26.47.1/32
52.24.172.83/32
Add and name your PostgreSQL input from the Plumbing screen and enter the following details:
Hostname or IP address of the PostgreSQL server (default port is 5432)
User name and Password
Database name
Choose the replication method you'd like to use for PostgreSQL database replication
For full dump/load replication, provide:
A space- or comma-separated list of the names of the tables you want to replicate.
The frequency at which you'd like to replicate your tables. The more frequent, the fresher your data will be, but the more load it puts on your PostgreSQL database.
For incremental dump/load replication, provide:
Table/update-indicator-column pairs for each table you want to replicate.
Don't have an update indicator column? Let us know! We can still make incremental load work for you.
Keep the mapping mode at the default of OneClick if you'd like Alooma to automatically map all PostgreSQL tables exactly to your target data warehouse. Otherwise, they'll have to be mapped manually from the Mapper screen.
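If you prefer to roll your own sync instead of a managed tool, the update-indicator idea is straightforward: keep a high-water mark in the warehouse and pull only rows whose indicator column is newer. A minimal sketch, with placeholder connection strings, table, and column names, reading from RDS PostgreSQL with psycopg2 and loading into Azure SQL Data Warehouse with pyodbc:

    import datetime
    import psycopg2
    import pyodbc

    # Source: RDS PostgreSQL. Target: Azure SQL Data Warehouse. Both connection
    # strings are placeholders.
    pg = psycopg2.connect("host=my-rds.example.amazonaws.com dbname=app user=etl password=<pw>")
    dw = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-dw.database.windows.net;"
        "DATABASE=reporting;UID=etl;PWD=<pw>"
    )

    # High-water mark: the newest update-indicator value already loaded.
    dw_cur = dw.cursor()
    dw_cur.execute("SELECT MAX(updated_at) FROM staging.orders")
    last_sync = dw_cur.fetchone()[0] or datetime.datetime(1970, 1, 1)

    # Pull only the rows that changed since the last sync (avoids duplicates).
    pg_cur = pg.cursor()
    pg_cur.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > %s",
        (last_sync,),
    )

    # Delete-then-insert keeps re-updated rows from being duplicated; aggregation
    # into the reporting schema can then be done in SQL on the warehouse side.
    for row_id, amount, updated_at in pg_cur.fetchall():
        dw_cur.execute("DELETE FROM staging.orders WHERE id = ?", (row_id,))
        dw_cur.execute(
            "INSERT INTO staging.orders (id, amount, updated_at) VALUES (?, ?, ?)",
            (row_id, amount, updated_at),
        )
    dw.commit()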

Accessing Raw Data for Hadoop

I am looking at the data.seattle.gov data sets and I'm wondering, in general, how all of this large raw data can be sent to Hadoop clusters. I am using Hadoop on Azure.
It looks like data.seattle.gov is a self-contained data service, not built on top of a public cloud.
They have their own RESTful API for data access.
Therefore I think the simplest way is to download the data you are interested in to your Hadoop cluster, or
to S3 and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has relevant query capabilities, you can query the data on demand from your Hadoop cluster, passing data references as input. That would only work if you are doing very serious data reduction in these queries; otherwise network bandwidth will limit the performance.
In Windows Azure you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
Check out the blog post: Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
You can also get your data from the Azure Marketplace, e.g. government data sets:
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx
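To make the "place your data sets in Azure Storage" suggestion concrete: data.seattle.gov exposes a Socrata-style REST API, so a small script can pull a dataset and stage it in blob storage where the Hadoop cluster can read it (for example via a wasb:// path). The dataset ID, container, and connection string below are placeholders:

    import requests
    from azure.storage.blob import BlobServiceClient

    # Pull a dataset from data.seattle.gov's REST API. Socrata endpoints take the
    # form /resource/<dataset-id>.json; the ID below is a placeholder.
    resp = requests.get(
        "https://data.seattle.gov/resource/xxxx-xxxx.json",
        params={"$limit": 50000},
    )
    resp.raise_for_status()

    # Stage the raw JSON in Azure Blob Storage so the Hadoop cluster can read it.
    # The connection string and container name are placeholders.
    blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    blob = blob_service.get_blob_client(container="raw-data", blob="seattle/dataset.json")
    blob.upload_blob(resp.content, overwrite=True)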
