[Question posted by a user on YugabyteDB Community Slack]
I've been looking at the docs and I've successfully taken a snapshot of my cluster.
Are there instructions on how to store a YugabyteDB backup on S3-compatible object storage?
Check the end of the page https://docs.yugabyte.com/latest/manage/backup-restore/snapshot-ysql/; there's a script there that can help you.
The yb_backup.py script automates all the steps needed to take a snapshot and move the files to NFS, AWS S3, Azure Blob Storage, or Google Cloud Storage.
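As a rough illustration, here is a minimal sketch of how the script is typically invoked to back up a YSQL database to an S3 bucket. The master addresses, bucket name, and database name are placeholders, and flag names can differ between versions, so check `yb_backup.py --help` for yours.

```python
# Hedged sketch: invoke yb_backup.py to snapshot a YSQL database and upload it to S3.
# Flags (--masters, --keyspace, --backup_location) follow recent versions of the
# script and may differ in yours; for S3-compatible stores the script relies on the
# standard AWS credential environment variables.
import subprocess

cmd = [
    "python3", "yb_backup.py",
    "--masters", "10.0.0.1:7100,10.0.0.2:7100,10.0.0.3:7100",  # placeholder master addresses
    "--keyspace", "ysql.yugabyte",                              # YSQL database to back up
    "--backup_location", "s3://my-backup-bucket/2024-01-01",    # placeholder bucket/prefix
    "create",
]
subprocess.run(cmd, check=True)
```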
Related
I am looking for a solution to mount local storage, which lives on an on-premises Hadoop cluster, so it can be used with Databricks by mounting it onto dbfs:/// directly, instead of loading it into Azure Blob Storage and then mounting that to Databricks. Any advice here would be helpful. Thank you.
I am on the research side and still have not figured out a way to come up with a solution. I am not sure if it's even possible without an Azure storage account.
Unfortunately, mounting an on-premises datastore to Azure Databricks is not supported.
You can try these alternative methods:
Method 1:
Connect local files to a remote Databricks Spark cluster and access them with DBFS. Refer to this MS document.
Method 2:
Alternatively, use the Azure Databricks CLI or REST API to push local data to a location on DBFS, where it can be read into Spark from within a Databricks notebook; a rough sketch follows below.
For more information, refer to this blog by Vikas Verma.
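A minimal sketch of Method 2, assuming the Databricks CLI is installed and configured (e.g. via `databricks configure --token`); all paths are placeholders:

```python
# Hedged sketch for Method 2: push local files to DBFS with the Databricks CLI,
# then read them into Spark from a notebook.
import subprocess

local_dir = "/data/exports"        # local directory on the machine running the CLI
dbfs_dir = "dbfs:/tmp/exports"     # target location on DBFS

subprocess.run(
    ["databricks", "fs", "cp", "--recursive", local_dir, dbfs_dir],
    check=True,
)

# Inside a Databricks notebook the data can then be read directly, e.g.:
# df = spark.read.parquet("dbfs:/tmp/exports")
```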
As per the ScyllaDB docs, scylla-manager is the tool we should use to take our backups.
But the latest version of scylla-manager (i.e. 2.2) only seems to support AWS S3 and Google Cloud Storage buckets.
Is there some way to use scylla-manager to upload our backups to Azure Blob Storage?
or
any other way that is at least as efficient as scylla-manager to upload backups to Azure Blob Storage?
We just coded and merged it; it's under test now. It will be part of Manager 2.3.
Azure Blob Storage is now supported as a backup location as of Scylla Manager 2.4, which was released last week.
Release notes
Setting up Azure Blob Storage as your backup location
Scylla Manager 2.4 also provides an Ansible playbook to automate restoring from a backup.
Note: the Scylla Manager metrics underwent some refactoring, so you will need Scylla Monitoring 3.8 (docs / repo tag) in order to view the Manager metrics in the Manager dashboard in Scylla Monitoring.
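For illustration, a minimal sketch of scheduling a backup to an Azure container with sctool. The `azure:<container>` location syntax is my reading of the Manager 2.4 docs and the cluster name and container are placeholders, so verify against the documentation linked above.

```python
# Hedged sketch: schedule a Scylla Manager backup to Azure Blob Storage via sctool.
# The azure:<container> location format is assumed from the 2.4 docs; names are placeholders.
import subprocess

subprocess.run(
    [
        "sctool", "backup",
        "-c", "prod-cluster",               # cluster name registered in Scylla Manager
        "-L", "azure:my-backup-container",  # Azure Blob Storage container as backup location
    ],
    check=True,
)
```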
Azure Blob Storage is not yet integrated into Scylla Manager. While it is on the near-term roadmap, we do not yet have it associated with a particular release or date.
I am working on a Spark project where the storage sink is Azure Blob Storage. I write data in Parquet format. I need some metrics around storage, e.g. numberOfFilesCreated, writtenBytes, etc. While searching online, I came across a metrics source in the hadoop-azure package called AzureFileSystemInstrumentation. I am not sure how to access it from Spark and can't find any resources on it. How would one access this instrumentation for a given Spark job?
Based on my experience, I think there are three solutions that can be used in your current scenario, as below.
Directly use the Hadoop API for HDFS to get HDFS metrics data in Spark, because hadoop-azure just implements the HDFS APIs on top of Azure Blob Storage. See the Hadoop official documentation for Metrics to find the particular metrics you want, such as CreateFileOps or FilesCreated, to get numberOfFilesCreated. There is also a similar SO thread, How do I get HDFS bytes read and write for Spark applications?, which you can refer to.
Directly use the Azure Storage SDK for Java (or another language you use) to write a program that computes statistics for the files stored in Azure Blob Storage as blobs, ordered by creation timestamp or other attributes. Please refer to the official document Quickstart: Azure Blob storage client library v8 for Java to learn how to use the SDK; a rough sketch of this approach follows this list.
Use an Azure Function with a Blob Trigger to monitor file-created events in Azure Blob Storage, and write code that computes statistics on every blob-created event. Please refer to the official document Create a function triggered by Azure Blob storage to learn how to use the Blob Trigger. You can even send the metrics you want to Azure Table Storage, Azure SQL Database, or other services from within the Blob Trigger function for later analysis.
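A minimal sketch of the second option, using the Python azure-storage-blob package instead of the Java SDK mentioned above. The connection string, container name, and prefix are placeholders.

```python
# Hedged sketch: list the parquet blobs written by the Spark job under a prefix
# and compute simple storage metrics from the blob properties.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="spark-output",
)

blobs = [
    b for b in container.list_blobs(name_starts_with="events/date=2024-01-01/")
    if b.name.endswith(".parquet")
]

number_of_files_created = len(blobs)
written_bytes = sum(b.size for b in blobs)

print(f"files created: {number_of_files_created}, bytes written: {written_bytes}")
```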
I'm sure this is posted somewhere or has been communicated, but I just can't seem to find anything about upgrading/migrating an HDInsight cluster from one version to the next.
A little background. We've been using Hive with HDInsight to store all of our IIS logs since 1/24/2014. We love it and it provides good insight to our teams.
I recently was reviewing http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/ and noticed that our version of HDInsight (2.1.3.0.432823) is no longer supported and will be deprecated in May. That got me thinking about how to get onto version 3.2. I just can't seem to find anything about how to go about doing this.
Does anyone have any insight into whether this is possible and, if so, how?
HDInsight uses Azure Storage for persistent data, so you should be able to create a new cluster and point to the old data, as long as you are using wasb://*/* for your storage locations. This article has a great overview of the storage architecture: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/
If you are using Hive and have not set up a customized metastore, then you may need to save or recreate some of the tables. Here's a blog post that covers some of those scenarios: http://blogs.msdn.com/b/bigdatasupport/archive/2014/05/01/hdinsight-backup-and-restore-hive-table.aspx
You can configure a new cluster and add the existing cluster's storage container as an "additional" storage account to test this out without first taking down the current cluster. Just be sure not to have both clusters using the same container as their default storage.
We have a logging system called Xtrace. We use this system to dump logs, exceptions, traces, etc. into a SQL Azure database. The Ops team then uses this data for debugging and for SCOM purposes. Considering the 150 GB limit that SQL Azure has, we are thinking of using the HDInsight (Big Data) service.
If we dump the data in Azure Table Storage, will HDInsight Service work against ATS?
Or will it work only against blob storage, meaning the log records need to be created as files on blob storage?
Last question. Considering the scenario I explained above, is it a good candidate to use HDInsight Service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
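To make the hourly-log pattern concrete, here is a minimal sketch using the current Python azure-storage-blob package for illustration. The connection string, container name, and blob naming scheme are placeholders.

```python
# Hedged sketch of the hourly-log pattern: write one blob per hour into a container
# that an HDInsight cluster attached to the same storage account can consume.
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("iis-logs")

now = datetime.now(timezone.utc)
blob_name = f"{now:%Y/%m/%d/%H}.log"   # one blob per hour, e.g. 2024/01/01/13.log

log_lines = "GET /index.html 200\nGET /about.html 200\n"  # placeholder log content
container.upload_blob(name=blob_name, data=log_lines, overwrite=True)

# The HDInsight job's input directory can then point at the corresponding path
# inside the container, e.g. the 2024/01/01/ "folder".
```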
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (though I think writing directly to text files will be more efficient, skipping the need for a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage, then extract the specific data you need whenever launching your HDInsight jobs.
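A minimal sketch of that converter idea, using the current Python azure-data-tables package for illustration. The connection string, table name, and property names are placeholders for whatever your Xtrace entities actually contain.

```python
# Hedged sketch: read log entities from Azure Table Storage and flatten them into a
# tab-separated text file that an HDInsight/Hive job can consume.
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    table_name="XtraceLogs",
)

with open("xtrace-logs.tsv", "w", encoding="utf-8") as out:
    for entity in table.list_entities():
        out.write(
            "\t".join([
                entity["PartitionKey"],
                entity["RowKey"],
                str(entity.get("Level", "")),    # placeholder property names
                str(entity.get("Message", "")),
            ])
            + "\n"
        )

# The resulting file(s) would then be uploaded to blob storage for the HDInsight job.
```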
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
The answer above is slightly misleading in regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least two implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball