We currently have an Azure Table filled with logs. We have no idea how many records are in it, but we know we performed roughly 3 million transactions, so in the worst case there will be 300 million rows.
We want to completely delete all the logs.
If we delete the table, will that count as 1 transaction, or will the service batch-delete all the rows it can and end up costing around 3 million transactions again?
I can't find any official documentation confirming that the Delete Table command actually counts as a single transaction.
Any help?
Thanks!
Transactions are billed as single REST requests.
As such you will be charged for 1 transaction to delete the table.
To be completely precise, that would (could) be two storage transactions:
One to drop the table.
One to re-create the table for continued logging.
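For illustration, here is a minimal sketch of that drop-and-recreate flow using the azure-data-tables Python SDK (the connection-string variable and the table name are placeholders, not your actual setup):

```python
import os
import time
from azure.core.exceptions import HttpResponseError
from azure.data.tables import TableServiceClient

# Placeholder connection string and table name.
service = TableServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

# One request to drop the whole table, regardless of how many rows it holds.
service.delete_table("logs")

# One more request to re-create it for continued logging. The service may
# return 409 (Conflict) for a short while after deletion, so retry briefly.
for _ in range(10):
    try:
        service.create_table("logs")
        break
    except HttpResponseError:
        time.sleep(30)
```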
I have a table in an Oracle DB that receives a high volume of transactions (let's say around 100 million inserts, updates, or deletes in a day). I want all the transactions happening on that table to be brought into Hive for processing through Spark or Hive.
For example:
let's say a record in that Oracle table goes through an initial insert, followed by 5 updates to the same or different columns, and finally gets deleted. I want to capture all such operations for all the records in that table and import them into Hive.
We want to find records whose number of operations on specific columns exceeds a threshold and pull a report on them.
Has anyone come across such a use case? Appreciate any help in achieving this.
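To make the reporting goal concrete, this is roughly the kind of PySpark aggregation we have in mind, assuming the captured operations had already landed in a Hive table; the table and column names here are just placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical Hive table holding one row per captured operation
# (insert/update/delete), with the record id and affected column.
changes = spark.table("oracle_table_changes")

THRESHOLD = 5  # example threshold for "too many" operations

# Count operations per record and per column, keep only those above the threshold.
report = (
    changes
    .groupBy("record_id", "column_name")
    .agg(F.count("*").alias("operation_count"))
    .filter(F.col("operation_count") > THRESHOLD)
)

report.write.mode("overwrite").saveAsTable("ops_threshold_report")
```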
I am using the following Cassandra data model
ruleid - bigint
patternid - bigint
key - string
value - string
time - timestamp
event_uuid -time based uuid
partition key - ruleid, patternid
clustering key - event_uuid order by descending
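For concreteness, the schema above corresponds roughly to the following CQL, shown here through the cassandra-driver Python client (the contact point, keyspace, and table name are placeholders):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# One possible CQL rendering of the model described above.
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ruleid     bigint,
        patternid  bigint,
        key        text,
        value      text,
        time       timestamp,
        event_uuid timeuuid,
        PRIMARY KEY ((ruleid, patternid), event_uuid)
    ) WITH CLUSTERING ORDER BY (event_uuid DESC)
""")
```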
Our ingestion rate is around 100 records per second per pattern id, and there might be 10,000+ pattern ids.
Our query is fairly straightforward: we query the last 100,000 records, ordered by the descending uuid, filtered by the partition key.
Also, for our use case we would need to perform around 5 deletes per second per pattern id on this table.
However, this leads to so-called tombstones and causes read timeouts when querying the datastore again.
How to overcome the above issue?
It sounds like you are storing records into the table, doing some transformation/processing on the records, then deleting them.
But since you're deleting rows within partitions (instead of the partitions themselves), you have to iterate over the deleted rows (tombstones) to get to the live records.
The real problem, though, is reading too many rows at once, which won't perform well. Retrieving 100K rows is going to be slow, so consider paging through the result set.
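As a rough sketch of paging with the DataStax Python driver (the contact point, keyspace, table, key values, and fetch size are all assumptions, not tuned recommendations):

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# Fetch the newest events for one (ruleid, patternid) partition in pages of
# 5,000 rows instead of asking for 100K rows in a single response.
query = SimpleStatement(
    "SELECT * FROM events WHERE ruleid = %s AND patternid = %s LIMIT 100000",
    fetch_size=5000,
)

processed = 0
for row in session.execute(query, (42, 7)):  # placeholder key values
    processed += 1  # the driver fetches the next page transparently

print(f"processed {processed} rows")
```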
With limited information you've provided, this is not an easy problem to solve. Cheers!
We've set up an Azure Search Index on our Azure SQL Database of ~2.7 million records all contained in one Capture table. Every night, our data scrapers grab the latest data, truncate the Capture table, then rewrite all the latest data - most of which will be duplicates of what was just truncated, but with a small amount of new data. We don't have any feasible way of only writing new records each day, due to the large amounts of unstructured data in a couple fields of each record.
How should we best manage our index in this scenario? Running the indexer on a schedule requires you to indicate a "high watermark column." Because of the nature of our database (erase/replace once a day), we don't have any column that would apply here. Further, our Azure Search index either needs to go through the same full daily erase/replace, or we need some other approach so that we don't keep adding 2.7 million duplicate records to the index every day. The former likely won't work for us because it takes a minimum of 4 hours to index our whole database, and that's 4 hours during which clients (worldwide) may not have a full dataset to query.
Can someone from Azure Search make a suggestion here?
What's the proportion of the data that actually changes every day? If that proportion is small, then you don't need to recreate the search index. Simply reset the indexer after the SQL table has been recreated, and trigger reindexing (resetting an indexer clears its high water mark state, but doesn't change the target index). Even though it may take several hours, your index is still there with the mostly full dataset. Presumably if you update the dataset once a day, your clients can tolerate hours of latency for picking up latest data.
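For illustration, the reset-and-run step amounts to a couple of calls; here is a sketch with the azure-search-documents Python SDK (the endpoint, admin key, and indexer name are placeholders):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

# Placeholder endpoint, admin key, and indexer name.
client = SearchIndexerClient(
    endpoint="https://<your-search-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

# Resetting clears the indexer's high water mark state but leaves the
# target index (and the documents already in it) untouched.
client.reset_indexer("capture-indexer")

# Kick off a full re-crawl of the rebuilt SQL table.
client.run_indexer("capture-indexer")
```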
We are investigating the performance of Azure Table storage. What we want to know is the maximum number of rows per single read and write transaction for a table. Is there any official documentation we can refer to?
Thanks a lot.
You can write up to 100 rows in a single table storage transaction, provided that all of the rows/entities share the same PartitionKey.
With respect to reading, you can read up to 1,000 rows in one storage transaction, once again assuming the same PartitionKey.
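As a quick sketch with the azure-data-tables Python SDK (the connection string, table name, and key values are made up), a batch of up to 100 entities in one partition goes through as a single transaction:

```python
import os
from azure.data.tables import TableClient

# Placeholder connection string and table name.
table = TableClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="metrics"
)

# All 100 entities share the same PartitionKey, which is what allows them
# to travel in a single entity-group transaction.
batch = [
    ("create", {"PartitionKey": "device-001", "RowKey": f"{i:05d}", "value": i})
    for i in range(100)
]
table.submit_transaction(batch)
```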
I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. do I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?
Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
Table storage (actually Azure Storage in general: tables, blobs, queues) has a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1,000 rows returned; to see all rows, you'd need additional fetches); adding a row key to narrow the search would, indeed, yield a single transaction. So would inserting a new row. You can also batch up multiple row inserts, within a single partition, and save them in a single transaction.
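Here's a minimal sketch of that existence check with the azure-data-tables Python SDK, assuming TelephoneNumber as the PartitionKey and some secondary identifier as the RowKey (the table name, key layout, and values are all hypothetical):

```python
import os
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient

# Placeholder connection string and table name.
table = TableClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="entities"
)

def already_exists(telephone_number: str, record_id: str) -> bool:
    """Point lookup by PartitionKey + RowKey: one fast, single-entity read."""
    try:
        table.get_entity(partition_key=telephone_number, row_key=record_id)
        return True
    except ResourceNotFoundError:
        return False

# Accept the incoming entity only if nothing with this key exists yet.
if not already_exists("15551234567", "order-42"):
    table.create_entity({
        "PartitionKey": "15551234567",
        "RowKey": "order-42",
    })
```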
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this MSDN blog post.
The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.