Google Analytics data in Azure - azure

Has anybody ever moved Google Analytics data into Azure? I have seen a handful of ways to do it but I am not sure what I am getting myself into. The Google Analytics data is becoming quite large and I am wondering if it is best suited to leave it in google storage and access it from Azure or move it to something like HDInsight or Data Lake. I need to join the data across several disparate data stores, SQL Azure, Blob, and Table Storage. I was also looking into Apache Drill and Presto as a possible solution to unify the data access. Just looking to see if anybody out there has dealt with this same issue and has any experience to share. Thanks!

Preface
I don't have experience with Presto so I can only comment on the feasibility of doing this with Drill. Also I have not used Azure services so my advice is theoretical.
Drill Storage Plugins
Drill lets you run SQL queries on data originating from different sources, provided that each data source has a storage plugin. A storage plugin is simply a piece of code in Drill that lets it interface with a data source. Since you want to query 3 data sources, we need to determine whether each of them has a storage plugin.
SQL Azure
I assume SQL Azure has a JDBC driver for Java. If so, Drill can be configured to use SQL Azure by following these instructions.
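As a rough illustration, a Drill JDBC storage plugin is registered as a small JSON document. The sketch below builds one in Python. It assumes Microsoft's SQL Server JDBC driver (class `com.microsoft.sqlserver.jdbc.SQLServerDriver`) has been added to Drill's classpath; the helper name, server, database, and credentials are hypothetical placeholders, and the result is untested against a real SQL Azure instance.

```python
import json


def sql_azure_plugin_config(server, database, username, password):
    """Build a Drill JDBC storage plugin config as a plain dict.

    Assumption: the Microsoft SQL Server JDBC driver jar is on Drill's
    classpath. All argument values are placeholders supplied by the caller.
    """
    url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"databaseName={database};encrypt=true"
    )
    return {
        "type": "jdbc",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "url": url,
        "username": username,
        "password": password,
        "enabled": True,
    }


if __name__ == "__main__":
    # Print JSON that could be pasted into Drill's storage plugin editor.
    config = sql_azure_plugin_config("myserver", "mydb", "user", "secret")
    print(json.dumps(config, indent=2))
```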
Azure Blob
Azure Blob storage has an implementation of the Hadoop FileSystem API, which Drill uses to read data from file systems. So you could theoretically add the hadoop-azure jar (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.7.0) and its dependencies to Drill's classpath and configure Drill's DFS storage plugin to use it.
Additionally, the data in Azure Blob would have to be stored in a supported file format such as JSON, Parquet, CSV, or Hadoop sequence files.
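A minimal sketch of what such a DFS plugin configuration might look like, generated from Python. It assumes the `wasb://` URI scheme and the `fs.azure.account.key.<account>.blob.core.windows.net` property provided by hadoop-azure; the helper name, account, container, and key are hypothetical, and whether the key belongs in the plugin's `config` block or in `core-site.xml` is something to verify against your Drill version.

```python
import json


def azure_blob_plugin_config(account, container, account_key):
    """Build a Drill file (DFS) storage plugin config pointing at Azure Blob.

    Assumption: hadoop-azure and its dependencies are on Drill's classpath,
    so the wasb:// scheme resolves. All values are caller-supplied placeholders.
    """
    return {
        "type": "file",
        "enabled": True,
        "connection": f"wasb://{container}@{account}.blob.core.windows.net",
        "config": {
            # hadoop-azure reads the storage account key from this property.
            f"fs.azure.account.key.{account}.blob.core.windows.net": account_key,
        },
        "workspaces": {
            "root": {"location": "/", "writable": False, "defaultInputFormat": None},
        },
        # Only formats Drill can read; the blobs must already be stored this way.
        "formats": {
            "json": {"type": "json", "extensions": ["json"]},
            "parquet": {"type": "parquet"},
            "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
        },
    }


if __name__ == "__main__":
    print(json.dumps(azure_blob_plugin_config("myaccount", "mycontainer", "KEY"), indent=2))
```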
Azure Table
This looks like Microsoft's custom NoSQL database. Currently Drill does not support it.
Conclusion
With a bit of work you could use Drill to query data on both Azure SQL and Blob, but not Azure Table.

Related

Get data from Azure Synapse to Azure Machine Learning

I am trying to load data (tabular data in tables, in a schema named 'x') from a Spark pool in Azure Synapse, but I can't seem to find how to do that. So far I have only linked Synapse and my pool to the ML studio. How can I do that?
The Lake Database contents are stored as Parquet files and exposed via your Serverless SQL endpoint as External Tables, so you can technically just query them via the endpoint. This is true for any tool or service that can connect to SQL, like Power BI, SSMS, Azure Machine Learning, etc.
WARNING, HERE THERE BE DRAGONS: Due to the manner in which the serverless engine allocates memory for text queries, using this approach may result in significant performance issues, up to and including service interruption. Speaking from personal experience, this approach is NOT recommended. I recommend that you limit use of the Lake Database to Spark workloads or very limited investigation in the SQL pool. Fortunately, there are a couple of ways to sidestep these problems.
Approach 1: Read directly from your Lake Database's storage location. This will be in your workspace's root container (declared at creation time) under the following path structure:
synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
These are just Parquet files, so there are no special rules about accessing them directly.
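A small sketch of assembling that location in Python. The path layout is the one given above; the `abfss://` URL form is the standard ADLS Gen2 scheme, and the function names plus the account and container values are hypothetical placeholders.

```python
from pathlib import PurePosixPath


def lake_table_path(workspace, database, table):
    """Return the folder of a Lake Database table inside the root container."""
    path = PurePosixPath(
        "synapse", "workspaces", workspace, "warehouse", f"{database}.db", table
    )
    return f"{path}/"


def lake_table_url(account, container, workspace, database, table):
    """Return an abfss:// URL for the table folder (ADLS Gen2 URL form)."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{lake_table_path(workspace, database, table)}"
    )


if __name__ == "__main__":
    print(lake_table_url("myaccount", "root", "myworkspace", "sales", "orders"))
```

The resulting URL can be handed to any Parquet reader that understands `abfss://` locations and has access to the storage account.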
Approach 2: You can also create Views over your Lake Database (External Table) in a serverless database and use the WITH clause to explicitly assign properly sized schemas. Similarly, you can ignore the External Table altogether and use OPENROWSET over the same storage mentioned above. I recommend this approach if you need to access your Lake Database via the SQL Endpoint.
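To make the OPENROWSET variant concrete, here is a hedged sketch that generates the T-SQL for such a view from Python. The view name, storage URL, and column list are hypothetical; the column types are where you would assign the "properly sized" schema. Names are interpolated directly into the statement, so this is only suitable for trusted input.

```python
def openrowset_view_sql(view_name, storage_url, columns):
    """Build a CREATE VIEW statement over Parquet files using OPENROWSET.

    columns is a list of (name, sql_type) pairs giving the explicit schema
    for the WITH clause, e.g. ("city", "VARCHAR(50)").
    """
    column_defs = ",\n    ".join(f"[{name}] {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{storage_url}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") WITH (\n"
        f"    {column_defs}\n"
        f") AS [result];"
    )


if __name__ == "__main__":
    # Placeholder account, container, workspace, database, and table names.
    url = (
        "https://myaccount.dfs.core.windows.net/root/"
        "synapse/workspaces/myworkspace/warehouse/sales.db/orders/*.parquet"
    )
    print(openrowset_view_sql("dbo.orders_v", url, [("id", "INT"), ("city", "VARCHAR(50)")]))
```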

Which Azure storage technology for weather forecast data

I would like some advice/tips about the right technology to select in order to store some forecast data on Azure technologies.
My team and I scrape weather forecast data every day from various sources and store it as-is on Azure File Storage. The file format is "grib2", which is a standard format for weather forecast data.
We are able to extract the data from those "grib2" files using a Python script running on an Azure VM.
We now have several files that represent hundreds of gigabytes of data to store, and I'm struggling to find which Azure data store best suits our needs in terms of practicality and cost.
We started with "Azure Table Storage" because it's a cheap solution,
but I've read in many posts that it is a bit old and not well suited to our use case: for example, it does not allow more than 1,000 entities per query and offers no aggregation over the data.
I considered using Azure SQL DB, but it seems that it can become very expensive very fast.
I also considered Azure Data Lake Storage Gen2 (and HDInsight), but I'm not very comfortable with blob-style storage and can't really tell whether it would suit our needs in terms of practicality and whether it is "easy to query".
For now, we just plan to do the following:
1) Extract data from grib2 files thanks to a python script running on an Azure VM
2) Insert the transformed data into [Azure storage]
3) Query the [Azure storage] from Azure Machine Learning Service or a local R script (for example)
4) Insert the computed data into [Azure storage]
where the [Azure storage] technology is yet to be determined.
Any help or advice would be much appreciated, thanks.
A few suggestions:
Store the downloaded files in raw format (grib2 in your case) on good ol' Azure Blob Storage. It is cheap storage that fits exactly this need.
Use Azure Databricks to load the data from the storage account and unpack it into memory (Python or Scala).
With the data loaded in memory, still in Databricks, run your ML inferencing. You could also use SparkR if you really want to.
Store the computed files in a serving layer. This really depends on what you want to do with it later. Often Azure SQL Database is an obvious choice. There is a native Spark connector which efficiently writes data from Databricks to SQL DB.
In addition to using Databricks as your inferencing environment, it's also a good choice for ML training (e.g. utilizing Azure ML Service).

Which Azure products are needed for a staging database?

I have several external data APIs that I access using some Python scripts. My scripts run from an on-premises server, transform the data, and store it in a SQL Server database on the same server. I suppose it's a rudimentary ETL system run with Python and T-SQL.
The system is about to grow quite a bit with new APIs and will require more complex data pipelines (for example, some of the API data will be spun off to more than one table). I think this would be a good time to move the system onto Azure (we are heavily integrated with Microsoft so it will have to be Azure!).
I have spent a few days researching the Azure products that would let me run Python scripts to access data from web APIs and store the processed data in a cloud database. I'm looking for advice on what sort of Azure products other people have used for similar jobs. At the moment it seems I will need:
Azure SQL Database to hold the processed data that can be accessed by various colleagues.
Azure Data Factory to manage, log, and schedule the pipeline jobs and to run my custom Python scripts (is this even possible?).
Azure Batch to run the aforementioned Python scripts but I'm not sure about this.
I want to put together a proposal basically and start thinking about costs but it would be good to hear from someone who has done something similar - am I on the right track or completely off? Should I just stay on-premises? Thank you in advance.
Azure SQL Database and Azure SQL Data Warehouse are good for relational data. If you want to use NoSQL, you could go with Azure Cosmos DB. If you want to store data as files, you could use Azure Data Lake.
For Python scripts, you could use a custom activity or a Databricks activity in Azure Data Factory.
Azure SQL Data Warehouse should be used if the amount of data you want to load is in petabytes. Also, Azure SQL Data Warehouse is not meant for complex transformations. I would recommend it for plain data loads with PolyBase.

Could any one help me how to perform Azure table storage deployment through VSTS?

I am new to Azure. Could anyone explain what Table storage in Azure is and how I can do a Table storage deployment through VSTS? Please share your thoughts on what steps are involved and which plugin/task I can use in VSTS to perform this.
For Azure Table storage, you can refer to this article: Azure Table storage overview.
Regarding Azure Table storage with VSTS, you can manage Azure tables and table entities through the Azure PowerShell task.
Azure Table storage stores large amounts of structured data. The service is a NoSQL datastore which accepts authenticated calls from inside and outside the Azure cloud. Azure tables are ideal for storing structured, non-relational data. Common uses of Table storage include:
Storing TBs of structured data capable of serving web scale applications
Storing datasets that don't require complex joins, foreign keys, or stored procedures and can be denormalized for fast access
Quickly querying data using a clustered index
Accessing data using the OData protocol and LINQ queries with WCF Data Service .NET Libraries
You can use Table storage to store and query huge sets of structured, non-relational data, and your tables will scale as demand increases.
You'll have to install the Azure Storage Client Library for .NET to work with Azure Storage.
For more details, refer to the documentation pages Get started with Azure Table storage using .NET and Get started with Azure table storage and Visual Studio Connected Services (ASP.NET), in case you haven't checked them already.

Can we use HDInsight Service for ATS?

We have a logging system called Xtrace. We use this system to dump logs, exceptions, traces, etc. into a SQL Azure database. The Ops team then uses this data for debugging and SCOM purposes. Considering the 150 GB limitation that SQL Azure has, we are thinking of using the HDInsight (Big Data) Service.
If we dump the data in Azure Table Storage, will HDInsight Service work against ATS?
Or will it work only against blob storage, which means the log records need to be created as files on blob storage?
Last question. Considering the scenario I explained above, is it a good candidate to use HDInsight Service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
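The hourly-blob idea can be sketched as a naming function. The `logs/YYYY/MM/DD/HH.log` layout and the function name are assumptions chosen for illustration, not something HDInsight requires; any layout works as long as a job's input directory maps to a blob prefix.

```python
from datetime import datetime, timezone


def hourly_log_blob_name(timestamp, prefix="logs"):
    """Map a timezone-aware timestamp to the blob holding that hour's records.

    The layout (prefix/YYYY/MM/DD/HH.log) is a hypothetical convention.
    """
    utc = timestamp.astimezone(timezone.utc)
    return f"{prefix}/{utc:%Y/%m/%d}/{utc:%H}.log"


if __name__ == "__main__":
    # Every record written during the same UTC hour lands in the same blob.
    print(hourly_log_blob_name(datetime.now(timezone.utc)))
```

With this layout, a job that should process one day of logs would take the matching day prefix (for example `logs/2013/05/01`) as its input directory.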
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, skipping the need for doing a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage, then extracting the specific data you need whenever launching your HDInsight jobs.
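The converter described here could be as simple as the sketch below. It assumes the entities have already been fetched as Python dicts by some Table Storage client (the fetch itself is not shown), and the function name and column list are hypothetical.

```python
import csv
import io


def entities_to_tsv(entities, columns):
    """Serialize table entities (dicts) to tab-separated text, one per line.

    Properties missing from an entity are written as empty fields, since
    Table Storage entities in one table need not share a schema.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for entity in entities:
        writer.writerow([entity.get(column, "") for column in columns])
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        {"PartitionKey": "web01", "RowKey": "0001", "Message": "started"},
        {"PartitionKey": "web01", "RowKey": "0002"},
    ]
    # The resulting text would then be uploaded as a blob for the HDInsight job.
    print(entities_to_tsv(rows, ["PartitionKey", "RowKey", "Message"]), end="")
```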
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
The answer above is slightly misleading with regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least 2 implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball

Resources