Azure storage for saving tweets and related information

I am planning to write a Windows service that consumes the Twitter streaming API to save tweets and related information (sentiment score, Twitter user, date of creation) about a specific topic into Azure storage. I need a way to query this information later, e.g. "show me the average sentiment score of tweets from the last 24 hours", so SQL or LINQ must be available.
Some numbers:
Number of tweets saved per day: approx. 20,000
Save data for 3 months (20,000 tweets × 90 days)
Data saved: tweet text (140 chars), sentiment score, Twitter user name, date (maybe a few more properties)
Saving frequency: since I am using the streaming API, tweets arrive in real time and have to be saved to the storage as they come in.
Query frequency: about every 30 minutes.
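For a rough sense of scale (my own back-of-the-envelope figures, not hard requirements): 20,000 tweets/day × 90 days ≈ 1.8 million rows, and at an assumed ~250 bytes per row that is only around 450 MB of data, so either store handles the raw volume easily; the decision comes down to query capabilities and price.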
I wonder which Azure storage service is suited for this purpose. I think I have to decide between Azure Table Storage and Azure SQL Database.

There are two things to consider when choosing between these two:
1. Price:
SQL Azure: check the calculator: https://azure.microsoft.com/en-us/pricing/calculator/
Storage table: check the calculator: https://azure.microsoft.com/en-us/pricing/calculator/?service=storage
You should compare capacity, transactions, and the service tier to see which one is cheaper.
2. Performance:
If you design it right, Table Storage should in many cases be faster than SQL Azure because of its NoSQL/denormalized nature, but that depends on the queries you are going to run against it.
In SQL Azure you will use T-SQL, but in Table Storage you will query the data from C# with LINQ.
As David's comment below points out, Table Storage also has limitations on the kinds of queries you can run, so you have to be aware of those limitations as well.
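To make the Table Storage side concrete, here is a minimal sketch of the "average sentiment over the last 24 hours" query (my illustration only: the Tweets table, the CreatedAt/SentimentScore properties and a date-based PartitionKey are assumptions, not taken from the question). It uses the Python azure-data-tables SDK; a C#/LINQ version would follow the same shape. Note that the averaging has to happen client-side, since Table Storage has no server-side aggregation:

```python
# A minimal sketch, assuming tweets are stored with PartitionKey = UTC date
# ("2016-05-04"), a CreatedAt datetime property and a numeric SentimentScore.
# Table name, property names and the connection string are placeholders.
from datetime import datetime, timedelta, timezone
from azure.data.tables import TableClient

client = TableClient.from_connection_string("<connection-string>",
                                            table_name="Tweets")

now = datetime.now(timezone.utc)
since = now - timedelta(hours=24)

scores = []
# The last 24 hours can only live in today's and yesterday's partitions.
for pk in {now.strftime("%Y-%m-%d"), since.strftime("%Y-%m-%d")}:
    entities = client.query_entities(
        query_filter="PartitionKey eq @pk and CreatedAt ge @since",
        parameters={"pk": pk, "since": since})
    scores.extend(float(e["SentimentScore"]) for e in entities)

# Table Storage has no server-side AVG(), so aggregate on the client.
print("avg sentiment (24h):", sum(scores) / len(scores) if scores else None)
```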

Related

Storing IOT Data in Azure: SQL vs Cosmos vs Other Methods

The project I am working on as an architect has an IoT setup where lots of sensors send data such as water pressure and temperature to an FTP server (I can't change this, as we have no control over it for security reasons). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here are my observations about this architecture:
Problem 1: the 1 TB limit in Azure SQL. With a higher tier it can go to 4 TB, but that's the maximum, so it does not appear to be infinitely scalable, and as the size grows, query performance could become a problem. Columnstore indexes and partitioning seem to be options, but the size limit and DTUs are a deal breaker.
Problem 2: the IoT data and the SQL Database (the downstream storage) seem to be tightly coupled. If a customer wants to extract a few months of data, or even more, with millions of rows, the DB will get busy and possibly throttle other customers due to DTU exhaustion.
I would like some ideas on scaling this further. SQL DB is great, and with JSON support it is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
All the messages should be consumed from FTP by Azure IoT Hub by some means.
From the central hub, I want to push all messages to Azure Blob Storage in 128 MB files for later analysis at cheap cost (see the sketch after this list).
At the same time, I would like all messages to go to IoT Hub and from there to Azure Cosmos DB (for long-term storage) / Azure SQL DB (long term, but I'm not sure because of the size restriction).
I am keeping data in Blob Storage because if the client wants to hire a machine learning team to create some models, I would prefer them to pull data from Blob Storage rather than hitting my DB.
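As a rough, illustrative sketch of the batching step (the container name, connection string and JSON-lines format are my assumptions), I imagine buffering incoming messages and flushing a block blob once roughly 128 MB has accumulated:

```python
# Illustrative sketch only: buffer incoming JSON messages and write a block
# blob once ~128 MB has accumulated. Container name and connection string
# are placeholders.
import uuid
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>",
                                                   container_name="iot-raw")

BATCH_BYTES = 128 * 1024 * 1024
_buffer: list[str] = []
_size = 0


def handle_message(msg: str) -> None:
    """Called for every message pulled from the FTP / IoT Hub pipeline."""
    global _size
    _buffer.append(msg)
    _size += len(msg.encode("utf-8"))
    if _size >= BATCH_BYTES:
        flush()


def flush() -> None:
    global _buffer, _size
    if not _buffer:
        return
    container.upload_blob(name=f"batch-{uuid.uuid4()}.jsonl",
                          data="\n".join(_buffer))
    _buffer, _size = [], 0
```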
Kindly suggest a few ideas on this. Thanks in advance!
Chandan Jha
First, Azure SQL DB does have Hyperscale which is much larger than 4TB. That said, there is a tipping point where it makes sense to consider alternative architectures when you get to be bigger than what one machine can handle for your solution. While CosmosDB does give you a horizontal sharding solution, you can do the same with N SQL Databases (there are libraries to help there). Stepping back, it is actually pretty important to understand what you want to do with the data if it were in a database. Both CosmosDB and SQL DB are set up for OLTP-style operations (with some limited forms of broader queries - SQL DB supports columnstore and batch mode, for example, which means you could do a reasonably-sized data mart just fine there too). If you are just storing things in the database in the hope of needing to support future data scientists, then you may or may not really need either of these two OLTP stores.
Synapse SQL is set up for analytics and generally has support for reading data in open formats directly from Azure Storage. So, this may be a better strategy if you want to support arbitrarily large IoT data and do analytics/ML processing over it.
If you know your solution will never grow beyond what a single database can handle, you may not need to consider something like Synapse, but it is set up for those scenarios if you are of sufficient size.
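As a hedged illustration of that pattern (assuming the raw messages have been landed as Parquet files; the workspace, storage path, columns and credentials below are placeholders), a Synapse serverless SQL query over the files in storage can be issued straight from Python:

```python
# Sketch only: query raw IoT files in the data lake through Synapse
# serverless SQL ("SQL on-demand"), without loading them into a database.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;UID=<user>;PWD=<password>")

query = """
SELECT DeviceId, AVG(Pressure) AS AvgPressure
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/iot-raw/*.parquet',
        FORMAT = 'PARQUET') AS readings
GROUP BY DeviceId;
"""

for device_id, avg_pressure in conn.cursor().execute(query):
    print(device_id, avg_pressure)
```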
Option - 1:
Why don't you extract and serialize the data based on the partition id (device id) and send it over to the IoT Hub, where you can have an Azure Function or Logic App that deserializes the data into files stored in blob containers?
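A minimal sketch of what that function could look like (Python v1 programming model; the IoT Hub trigger and the blob output binding would be declared in function.json, and all names here are placeholders):

```python
# Sketch only: an Azure Function with an IoT Hub (Event Hub-compatible)
# trigger and a blob output binding, both declared in function.json.
import json

import azure.functions as func


def main(event: func.EventHubEvent, outputblob: func.Out[str]) -> None:
    # The device message arrives as raw bytes; deserialize it, then write it
    # back out as a JSON document to the bound blob path (e.g. one blob per
    # message, keyed by device id and timestamp in the binding path).
    message = json.loads(event.get_body().decode("utf-8"))
    outputblob.set(json.dumps(message))
```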
Option - 2:
You can also attempt to create a module that extracts the data into an Excel file, which is then sent to the IoT Hub to be stored in the storage containers.

Using Azure Metrics Advisor with workout data

So I saw Azure Metrics Advisor at Microsoft Ignite this year and decided to try using it against some workout data that I had from Endomondo.
I uploaded 4 years of running data (only about 400 records) into an Azure SQL instance.
Then, I ingested that as a data feed into an Azure Metrics Advisor instance.
I'm really just taking Metrics Advisor for a spin, so now I want to use this historical data as a baseline for new running data.
I'd like to be able to upload a new workout, run it against this data, and have anomalies in either my distance or duration detected.
Does that use case make sense?

How good is Azure Data Lake for storing an SQL database used for Power BI visualizations?

We have an Azure SQL database where we collect a large amount of sensor data, and we regularly extract the data from it and transform it a bit with a Python script. The end result is a pandas DataFrame. We would like to store the transformed data in an Azure database and use it as the source of a Power BI dashboard.
On the one hand, we want to show the "almost" real-time data on a dashboard (the latency due to the transformation etc. is acceptable, but the dashboard needs to refresh very frequently, say once a minute); on the other hand, we also want to store the transformed data and query it later, e.g. to visualize the data for a given day only.
Is it possible to convert the pandas DataFrame into SQL and store it on Data Lake and stream the data from there? I read that it is possible to store structured data on Data Lake and even query it, but I am unsure if this would be the best solution.
(My current task is to choose the best database for storing the transformed data so that it can be both streamed and queried later. I am very new to Azure products and I don't have a sandbox account yet to try things out and identify possible pitfalls. I've just figured out that Power BI does not support DirectQuery for Data Lake, and I feel like this could be an issue - meaning we would have to query the data in the Data Lake first and store it somewhere else if we wanted to visualize a subset. Is that correct?)
Azure Data Lake is not a database, just a store for data, both structured and unstructured, so as mentioned you can't DirectQuery it unless you have some compute capacity in front of it (Databricks, Azure Synapse, Azure Data Lake Analytics, Power BI Premium with enhanced compute).
Depending on your approach, it may be best to move from Azure SQL Database and pandas to Azure Databricks, which can ingest the streaming data, transform it, and produce an output table that is stored in the data lake. You would then connect Power BI to the Databricks instance and query that. The data will only be available while the cluster is running.
Moving to Databricks will involve rewriting your pandas code in Koalas, or preferably PySpark.
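For example, a minimal sketch of that step (assuming the transformed pandas DataFrame is called df, that spark is the session Databricks provides in a notebook, and that the ADLS Gen2 path below is a placeholder):

```python
# Sketch only: push the transformed pandas result into the data lake as a
# Delta table that Power BI can then query through the Databricks connector.
path = "abfss://curated@<storageaccount>.dfs.core.windows.net/sensor_data"

spark_df = spark.createDataFrame(df)          # pandas -> Spark DataFrame

(spark_df.write
    .format("delta")
    .mode("append")
    .save(path))

# Expose it as a table so it can be queried by name.
spark.sql(f"CREATE TABLE IF NOT EXISTS sensor_data USING DELTA LOCATION '{path}'")
```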
You also have the option of using Databricks to write the items back to an Azure SQL Database table. Depending on what transformations you are doing, you could keep it all in Azure SQL, or, if it is streaming sensor data, take the data through Azure Event Hubs to Azure Stream Analytics (which does the transformations) and into Azure SQL Database (storing both real-time and historical data).

Which Azure storage technology for weather forecast data

I would like some advice/tips on the right technology to pick for storing some forecast data on Azure.
My team and I scrape weather forecast data every day from various sources and store it as-is in Azure File Storage. The file format is "grib2", which is a standard format for weather forecast data.
We are able to extract the data from those "grib2" files using a Python script running on an Azure VM.
We now have many files representing hundreds of gigabytes of data to store, and I'm struggling to work out which Azure data store best suits our needs in terms of practicality and cost.
We started with Azure Table Storage first because it's a cheap solution,
but I've read in many posts that it is a bit dated and not well suited to our case, since, for example, it does not return more than 1,000 entities per query and offers no aggregation over the data.
I considered using Azure SQL DB, but it seems it can become very expensive very fast.
I also considered Azure Data Lake Storage Gen2 (and HDInsight), but I am not very comfortable with those blob-style stores and can't really say whether they would suit my needs in terms of practicality and whether they are "easy to query".
For now the plan is:
1) Extract data from the grib2 files with a Python script running on an Azure VM
2) Insert the transformed data into [Azure storage]
3) Query the [Azure storage] from Azure Machine Learning Service or a local R script (for example)
4) Insert the computed data into [Azure storage]
where the [Azure storage] technology is to be determined.
Any help or advice would be much appreciated, thanks.
A couple of things I would do here:
To store the downloaded files in raw format (grib2 in your case), place them on good ol' Azure Blob Storage. Cheap storage, exactly for your needs.
Use Azure Databricks to load the data from the storage account and unpack it into memory (Python or Scala).
With the data in memory - still in Databricks - run your ML inferencing. You could also use SparkR if you really want to.
Store the computed files in a serving layer. This really depends on what you want to do with them later; often Azure SQL Database is an obvious choice. There is a native Spark connector which efficiently writes data from Databricks to SQL DB (see the sketch below).
In addition to using Databricks as your inferencing environment, it's also a good choice for ML training (e.g. utilizing Azure ML Service).
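As a sketch of that serving-layer write (connection details, table name and the results_df DataFrame are placeholders; the generic Spark JDBC writer is shown, and the dedicated SQL connector mentioned above can be swapped in for better throughput):

```python
# Sketch only: write the computed results from Databricks to an Azure SQL
# Database table over JDBC. `results_df` stands in for the Spark DataFrame
# produced by the inferencing step; all connection details are placeholders.
jdbc_url = ("jdbc:sqlserver://<server>.database.windows.net:1433;"
            "database=<db>;encrypt=true")

(results_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.ForecastScores")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())
```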

Load data to Azure SQL DW

I have a large amount of data to be loaded into SQL DW. What is the best way to get the data to Azure? Should I use Import/Export or AzCopy? How long would each method take?
The process of loading data depends on the amount of data. For very small data sets (<100 GB) you can simply use the bulk copy command line utility (bcp.exe) to export the data from SQL Server and then import to Azure SQL Data Warehouse.
For data sets greater than 100 GB, you can export your data using bcp.exe, move the data to Azure Blob Storage using a tool like AzCopy, create an external table (via T-SQL) and then pull the data in via a Create Table As Select (CTAS) statement. This works well up to a TB or two, depending on your connectivity to the cloud.
For really large data sets, say greater than a couple of TBs, you can use the Azure Import/Export service to move the data into Azure Blob Storage and then load the data with PolyBase/CTAS.
Using the PolyBase/CTAS route will allow you to take advantage of multiple compute nodes and the parallel nature of data processing in Azure SQL Data Warehouse - an MPP-based system. This will greatly improve data ingestion performance, as each compute node is able to process a block of data in parallel with the other nodes.
Another consideration is to increase the number of DWUs (compute resources) available in SQL Data Warehouse at the time of the CTAS statement. This adds compute resources and therefore parallelism, which will decrease the total ingestion time.
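To make the external table + CTAS route concrete, here is a hedged sketch driven from Python with pyodbc; every object name, the blob location and the credential are placeholders, and a database-scoped credential is assumed to already exist:

```python
# Sketch only: external table + CTAS load into SQL Data Warehouse, driven
# from Python with pyodbc. Server, database, credential and object names are
# placeholders; a database-scoped credential is assumed to exist already.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<server>.database.windows.net;Database=<dw>;"
    "UID=<user>;PWD=<password>",
    autocommit=True)
cur = conn.cursor()

cur.execute("""
CREATE EXTERNAL DATA SOURCE staging_blob
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://staging@<account>.blob.core.windows.net',
      CREDENTIAL = blob_credential);""")

cur.execute("""
CREATE EXTERNAL FILE FORMAT pipe_delimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));""")

cur.execute("""
CREATE EXTERNAL TABLE dbo.ext_Readings (DeviceId INT, ReadingTime DATETIME2, Value FLOAT)
WITH (LOCATION = '/readings/',
      DATA_SOURCE = staging_blob,
      FILE_FORMAT = pipe_delimited);""")

# The CTAS statement pulls the data in, parallelized across the compute nodes.
cur.execute("""
CREATE TABLE dbo.Readings
WITH (DISTRIBUTION = HASH(DeviceId))
AS SELECT * FROM dbo.ext_Readings;""")
```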
You can go through the documentation below and figure out which option suits you best.
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-overview-load/
If you already have data in an on-premises SQL Server, you can use the migration wizard tool to load that data to Azure SQL DB.
http://sqlazuremw.codeplex.com/
