How to retrieve every nth row in Azure Storage? - node.js

I have the following scenario. Information collected every minute is sent to and stored in Azure Table storage. Now I am trying to display this data in a graph. If I only show data for the last day, it would be relatively easy to filter 1,440 (24 * 60) data points down to the 200 I display. However, showing data over a month means handling over 40,000 data points (24 * 60 * 30 = 43,200), of which I still only need to show 200. Assuming 40,000 points, I would select only every 200th data point, or row. Is this possible in Azure Storage? Or would I have to fetch the rows in batches, take every 200th element, and then move on to the next batch?

You should be able to use the 200th data point in your case. You could either use $top to take a number of entities from a set, or a $filter query to get the results you want. I'd rather suggest using Power BI, which is free and compatible with Table storage; it basically turns data into graphs, and you can apply whatever additional filters suit you.
You can read more about it here:
https://powerbi.microsoft.com/en-us/integrations/azure-table-storage/
https://powerbi.microsoft.com/en-us/desktop/
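As far as I know, $top and $filter select by count and by property value rather than by position, so thinning to every nth row would happen on the client (or be designed into the keys). Here is a minimal Python sketch of the client-side step, assuming the rows for the time window have already been fetched in time order. The names `every_nth` and `downsample` are illustrative and not part of any Azure SDK.

```python
import math
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def every_nth(rows: Iterable[T], step: int) -> Iterator[T]:
    """Yield rows 0, step, 2*step, ... from an ordered sequence."""
    if step < 1:
        raise ValueError("step must be at least 1")
    for index, row in enumerate(rows):
        if index % step == 0:
            yield row


def downsample(rows: List[T], target: int) -> List[T]:
    """Thin rows to at most `target` points by keeping every nth row."""
    if target < 1:
        raise ValueError("target must be at least 1")
    # Round the step up so the result never exceeds the target count.
    step = max(1, math.ceil(len(rows) / target))
    return list(every_nth(rows, step))
```

For a month of minute data (43,200 rows) and a target of 200, the step works out to 216. Because `every_nth` accepts any iterable, it can also be applied while paging through results instead of holding the whole month in memory.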

Related

Spotfire - limiting Information Link column expression

I have a column of data, [Sales ID], that is bringing duplicate data into an analysis. My goal is to limit the data to unique sales IDs for the max day of every month in the analysis only (instead of daily). Basically, I'm trying to get it to pull in only the unique [Sales ID] values for the last day of every month in the analysis, and if the current day is the latest day so far in the current month, it should pull that in. So it should pull in the MAX date in any given month. How do I write an expression with the [Sales ID] column and the [Date] column to achieve this?
Probably the two easiest options are to
1) Adjust the SQL as niko mentioned
2) Limit the visualization with the "Limit Data Using Expression" option, using the following:
Rank(Day([DATE]), "desc", Month([DATE]), Year([DATE])) = 1
If you had to do it in the Data on Demand section (maybe the IL itself is a usp or you don't have permission to edit it), my preference would be to create another data table that only has the max dates for each month, and then filter your first data table by that.
However, if you really need to do it in the Data on Demand section, then I'm guessing you don't have the ability to create your own information links. This would mean you can't key off additional data tables, and you're probably going to have to get creative.
Constraints of creativity include needing to know the "rules" of your data -- are you pulling the data in daily? Once a week? Do you have today's data, or today - 2? You could probably write a Python script to grab the last day of every month for the last 10 years, and then whatever yesterday's date was, and throw all those values into a document property. This would allow you to do a "Values from Property".
(Side Note: I want to say you could also do it directly in the expression portion with something like an extremely long
Date(DateTimeNow()),DateAdd("dd",-1,Date(Year(DateTimeNow()), Month(DateTimeNow()), 1))
But Spotfire is refusing to accept that as multiple values. Interestingly, when I pull the logic for a StringList property, it gives this: $map("${udDates}", ","), which suggests commas are an accurate methodology, but I get an error reading "Expected 'End of expression' but found ','" . Uncertain if this is a Spotfire issue, or related to my database connection)
tl;dr -- Doing it in the Data on Demand section is probably convoluted. Recommend adjusting in SQL if possible, and otherwise limiting in the visualization
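For anyone checking the result outside Spotfire, the rule that the Rank expression above encodes (keep only rows on the latest date present in each month) can be sketched in plain Python. This is only an illustration of the logic, assuming rows arrive as (sales ID, date) pairs. The function name `month_end_sales_ids` is made up for the example.

```python
from datetime import date
from typing import Dict, Iterable, Set, Tuple

MonthKey = Tuple[int, int]  # (year, month)


def month_end_sales_ids(rows: Iterable[Tuple[str, date]]) -> Dict[MonthKey, Set[str]]:
    """Return the unique sales IDs seen on the latest date of each month."""
    rows = list(rows)

    # First pass: find the latest date that actually appears in each month.
    latest: Dict[MonthKey, date] = {}
    for _, day in rows:
        key = (day.year, day.month)
        if key not in latest or day > latest[key]:
            latest[key] = day

    # Second pass: keep only the IDs recorded on that latest date.
    result: Dict[MonthKey, Set[str]] = {}
    for sales_id, day in rows:
        key = (day.year, day.month)
        if day == latest[key]:
            result.setdefault(key, set()).add(sales_id)
    return result
```

Because the latest date is taken from the data itself, the current month naturally resolves to the most recent day loaded so far.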

Azure Search Index - manage indexer for when database is erased/replaced every 24 hours?

We've set up an Azure Search Index on our Azure SQL Database of ~2.7 million records all contained in one Capture table. Every night, our data scrapers grab the latest data, truncate the Capture table, then rewrite all the latest data - most of which will be duplicates of what was just truncated, but with a small amount of new data. We don't have any feasible way of only writing new records each day, due to the large amounts of unstructured data in a couple fields of each record.
How should we best manage our index in this scenario? Running the indexer on a schedule requires you to indicate this "high watermark column." Because of the nature of our database (erase/replace once a day) we don't have any column that would apply here. Further, what really needs to happen for our Azure Search Index is either it also needs to go through a full daily erase/replace, or some other approach so that we don't keep adding 2.7 million duplicate records every day to the index. The former likely won't work for us because it takes 4 hours minimum to index our whole database. That's 4 hours where clients (worldwide) may not have a full dataset to query on.
Can someone from Azure Search make a suggestion here?
What's the proportion of the data that actually changes every day? If that proportion is small, then you don't need to recreate the search index. Simply reset the indexer after the SQL table has been recreated, and trigger reindexing (resetting an indexer clears its high water mark state, but doesn't change the target index). Even though it may take several hours, your index is still there with the mostly full dataset. Presumably if you update the dataset once a day, your clients can tolerate hours of latency for picking up latest data.
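A rough sketch of the reset-then-run sequence using the Reset Indexer and Run Indexer REST operations. The service name, indexer name, and API version below are placeholders I am assuming, so check them against your own service. The helper names `indexer_request` and `reset_then_run` are illustrative only.

```python
import urllib.request

# Assumption: substitute whichever api-version your search service supports.
API_VERSION = "2020-06-30"


def indexer_request(service: str, indexer: str, action: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for an indexer 'reset' or 'run'."""
    if action not in ("reset", "run"):
        raise ValueError("action must be 'reset' or 'run'")
    url = (
        f"https://{service}.search.windows.net"
        f"/indexers/{indexer}/{action}?api-version={API_VERSION}"
    )
    # Both operations take an empty body; the admin key goes in the api-key header.
    return urllib.request.Request(url, data=b"", headers={"api-key": api_key}, method="POST")


def reset_then_run(service: str, indexer: str, api_key: str) -> None:
    """Clear the indexer's change-tracking state, then start a fresh run."""
    for action in ("reset", "run"):
        request = indexer_request(service, indexer, action, api_key)
        with urllib.request.urlopen(request) as response:
            response.read()
```

You would call `reset_then_run` from the nightly job once the Capture table has been rewritten, so the existing index keeps serving queries while the indexer works through the table again.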

Questions about storing data in SQL or Table Storage

I have many questions on whether to store my data into SQL or Table Storage and the best way to store them for efficiency.
Use Case:
I have around 5 million rows of objects that are currently stored in a MySQL database. Currently only the metadata (Lat, Long, ID, Timestamp) is stored in the database. The other 150 columns about the object, which are not used much, were moved into Table Storage.
In Table Storage, should all of these be stored in one row, with the 150 rarely used columns packed into a single column, instead of in multiple rows?
For each of these 5 million objects in the database, there are certain information about them (temperature readings, trajectories, etc). The trajectory data used to be stored in SQL (~300 rows / object) but were moved to table storage to be cost effective. Currently they are stored in the table storage in a relational manner where each row looks like (PK: ID, RK: ID-Depth-Date, X, Y, Z).
Currently it takes a long time to grab the data for many trajectories; Table Storage seems to be pretty slow in our case. I want to improve the performance of the gets. Should the data be stored so that each object has 1 row for its trajectory, with all the XYZs stored in 1 column in JSON format? Instead of 300 rows to get, it would only need to get 1 row.
Is Table Storage the best place to store all of this data? If I wanted to get an X, Y, Z at a certain measured depth, I would have to get the whole row and parse through the JSON. This is probably a trade-off.
Is it feasible to keep the trajectory data, readings, etc. in a SQL database, where there can be 5,000,000 x 300 rows for the trajectory data? There is also some information about the objects where it can be 5,000,000 x 20,000 rows. This is probably too much for a SQL database and would have to be in Azure Cloud Storage. If so, would the JSON option be the best one? The trade-off is that if I want a portion of 1000 rows, I would have to get the whole table; however, isn't that faster than querying through 20,000 rows? I can probably split the data into sets of 1000 rows and use SQL as metadata for finding out which sets of data I need from Cloud Storage.
Pretty much I'm having trouble understanding how to group data and format it into Azure Cloud Tables to be efficient and fast when grabbing data for my application.
Here's an example of my data and how I am getting it: http://pastebin.com/CAyH4kHu
As an alternative to table storage, you can consider using Azure SQL DB Elastic Scale to spread trajectory data (and associated object metadata) across multiple Azure SQL DBs. This allows you to overcome capacity (and compute) limits of a single database. You would be able to perform object-specific queries or inserts efficiently, and have options to perform queries across multiple databases -- assuming you are working with a .Net application tier. You can find out more by looking at http://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-get-started/
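If you stay with Table Storage, the chunking idea from the question (several points per entity, with the depth range kept as metadata) could look roughly like the sketch below. It assumes points are (depth, x, y, z) tuples already sorted by depth. The property names `Points`, `FirstDepth` and `LastDepth` and the helper names are hypothetical. The chunk size defaults to 300 rather than 1000 because Table Storage caps the size of a single property (64 KiB, to my knowledge), so measure your serialized chunks before choosing a number.

```python
import json
from typing import Dict, List, Tuple, Union

Point = Tuple[float, float, float, float]  # (depth, x, y, z)
Entity = Dict[str, Union[str, float]]


def chunk_trajectory(object_id: str, points: List[Point], chunk_size: int = 300) -> List[Entity]:
    """Pack a depth-sorted trajectory into one entity per chunk of points."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    entities: List[Entity] = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        entities.append({
            "PartitionKey": object_id,
            # Zero-padded so RowKeys sort in chunk order.
            "RowKey": f"{start // chunk_size:06d}",
            "FirstDepth": chunk[0][0],
            "LastDepth": chunk[-1][0],
            "Points": json.dumps(chunk),
        })
    return entities


def row_keys_for_depth(entities: List[Entity], depth: float) -> List[str]:
    """Return the RowKeys of chunks whose depth range contains `depth`."""
    return [
        str(e["RowKey"]) for e in entities
        if e["FirstDepth"] <= depth <= e["LastDepth"]
    ]
```

The depth ranges are what you would keep in SQL as metadata: look up the RowKey there, then fetch a single entity by PartitionKey and RowKey instead of scanning 300 rows.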

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends. There will probably be less than 50 types. I need a way to show all the messages of each type that were sent (ex: show recent messages regardless of who sent it of type 0). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages though I don't want to let each partition grow indefinitely.
Unfortunately, you need to know precise Partition Keys and Row Keys in order to issue deletes. You do not need to retrieve entities from storage if you know precise RowKeys, but you do need to have them in order to issue batch delete. There is no magic "Delete from table where partitionkey = 10" command like there is in SQL.
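If the row keys in the range can be computed rather than looked up, the remaining work is grouping them into batches. A minimal sketch, assuming all keys belong to one partition and that a batch (entity group transaction) holds at most 100 operations. The function name `delete_batches` is illustrative, and turning each pair into an actual delete call is left to whichever SDK you use.

```python
from typing import Iterator, List, Tuple

# Assumption: entity group transactions accept at most 100 operations.
MAX_BATCH_SIZE = 100


def delete_batches(
    partition_key: str,
    row_keys: List[str],
    batch_size: int = MAX_BATCH_SIZE,
) -> Iterator[List[Tuple[str, str]]]:
    """Group (PartitionKey, RowKey) pairs into batches for deletion."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(row_keys), batch_size):
        yield [(partition_key, row_key) for row_key in row_keys[start:start + batch_size]]
```

Each pair would become one delete operation, issued unconditionally (ETag `*`) since no entity was read first.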
However, consider breaking your data up into tables that represent archivable time units. For example, in AzureWatch we store all of the metric data in tables that each represent one month of data, e.g. Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep your time ranges short, the number of unions stays small. Basically, this approach allows you to utilize the table name as another partitioning opportunity.
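To make the "union" step concrete, here is a small sketch that lists which monthly tables a query window touches, following the Metrics201401 naming pattern above. The function name `monthly_tables` is made up for the example.

```python
from datetime import date
from typing import List


def monthly_tables(prefix: str, start: date, end: date) -> List[str]:
    """List the per-month table names covering the range start..end inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    names: List[str] = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A query spanning mid-November 2013 to early February 2014 would touch four tables, and the same helper tells you which table to drop when a month ages out.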

How to perform operations inside Cassandra Trigger?

My application collects per-second data from devices and inserts it into a Cassandra table. My idea is to write a trigger for the per-second data table which will automatically convert the per-second data into hourly / daily data. I'll also store the hourly and daily data in the same table with a different key. To achieve this use case, I need to perform the operations below inside my trigger code.
How can I insert data into the same table, which will invoke the trigger again? (will be used for converting per hour to per day)
How can I insert data into a different table? (store accumulated data in a temp table)
How can I select data from a different table? (fetch the last data for accumulation)
If I know the above, my application will just insert per-second data and the rest (per-second to hour to day conversion) will be taken care of automatically by my trigger code.
Can you please help me with the above?
It would be great if you could give a code snippet for the same.
Unless you're comfortable with Cassandra internals, you should do this in a data abstraction layer instead of a trigger.
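As a rough idea of what the rollup might look like in the application layer instead, here is a sketch that buckets readings by hour or day in plain Python. It assumes readings are (device, timestamp, value) tuples and that summing is the right aggregation, so swap in an average or max as needed. The name `roll_up` is illustrative, and writing the results back to Cassandra is left to your driver code.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Reading = Tuple[str, datetime, float]  # (device, timestamp, value)


def roll_up(readings: Iterable[Reading], unit: str = "hour") -> Dict[Tuple[str, datetime], float]:
    """Sum readings per device into hourly or daily buckets."""
    totals: Dict[Tuple[str, datetime], float] = defaultdict(float)
    for device, timestamp, value in readings:
        if unit == "hour":
            bucket = timestamp.replace(minute=0, second=0, microsecond=0)
        elif unit == "day":
            bucket = timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
        else:
            raise ValueError("unit must be 'hour' or 'day'")
        totals[(device, bucket)] += value
    return dict(totals)
```

The hourly output can be fed back in to get daily totals, e.g. `roll_up([(d, t, v) for (d, t), v in hourly.items()], "day")`, which covers the second-to-hour-to-day chain without a trigger re-invoking itself.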
