I'm just starting to study the Azure platform.
Say you create a row in Table storage.
You then create a blob in Blob storage.
Is there some way to correlate the row you just added to the table with the blob you just created?
Or do you have to use SQL Azure any time you have related entries?
That would be disappointing: say you wanted to store some video in a blob, but had a row in a SQL Azure table that you wanted to link to that blob.
If you cannot have such a link, do you have to store your video in SQL Azure somehow?
You can store the unique filename of the file held in blob storage in your Azure table row.
Storing the full video/binary content in an Azure table or a SQL Azure table is not recommended, as retrieval becomes slower and SQL Azure is more expensive compared to blob storage.
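For what it's worth, a minimal sketch of that pattern, using the older SDK types that appear elsewhere in this thread (the entity class, container name, and the blobClient / _context / customerId / videoId / videoStream variables are all assumptions of mine, not from the question):

public class VideoEntity : TableServiceEntity
{
    public VideoEntity() { }
    public VideoEntity(string customerId, string videoId) : base(customerId, videoId) { }
    public string BlobName { get; set; }    // pointer to the blob, not the video bytes
}

string blobName = videoId + ".mp4";
CloudBlobContainer container = blobClient.GetContainerReference("videos");
CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
blob.UploadFromStream(videoStream);          // the large binary content goes to blob storage

var entity = new VideoEntity(customerId, videoId) { BlobName = blobName };
_context.AddObject("Videos", entity);        // only the blob's name goes into the table
_context.SaveChangesWithRetries();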
As long as you choose your blob filename carefully, you should be able to simply reference that filename in your Azure table. However, be careful how you name it. It is very tempting to jump in with a timestamp-based name and reference that in the table, but with Azure you are obviously building a scalable solution that can end up highly parallel, so a timestamp alone is not guaranteed to be unique.
Use a GUID for the filename to guarantee uniqueness. Or, if you want to be able to browse the storage more easily based on when items were added, choose a date format such as YYYYMMDDHHMMSS with a GUID suffixed to it.
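As an illustration only, a name built that way might look like this (the exact format string is just one reasonable choice):

// sortable timestamp prefix for browsing, GUID suffix for guaranteed uniqueness
string blobName = string.Format("{0:yyyyMMddHHmmss}_{1}",
                                DateTime.UtcNow,
                                Guid.NewGuid().ToString("N"));
// e.g. "20201124083142_3f2c5e0a9d424c8f9a7e0c1b2d3e4f5a"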
It is perfectly possible and has been done many times; I personally implemented such an architecture for one of the projects I worked on.
I used the Blob Name as the Partition Key and the Blob Container Name as the Row Key for the entity I write to the table.
You can also do it the other way around and choose the Blob Container Name as the PK and the Blob Name as the RK. In that case you will want to make sure the partitions in your table do not get too big and cause perf issues down the road if your blob containers get too crowded.
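A rough sketch of such an entity (the class and property names here are mine, not from the original project); the sanitization it relies on is shown just below:

public class BlobIndexEntity : TableServiceEntity
{
    public BlobIndexEntity() { }
    public BlobIndexEntity(string sanitizedBlobName, string sanitizedContainerName)
        : base(sanitizedBlobName, sanitizedContainerName) { }   // PK = blob name, RK = container name

    // keep the raw values too, so the blob can still be located after key sanitization
    public string OriginalBlobName { get; set; }
    public string OriginalContainerName { get; set; }
}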
There is one thing you need to handle explicitly, though: Blob Names and Blob Container Names have different naming constraints than Table Partition Keys and Row Keys, so you need to sanitize the Blob Name and Blob Container Name before using them as PK and RK when writing the entity to the table or reading it back.
Here is how you can sanitize Blob Names and Blob Container Names to be used as table keys:
public static readonly Regex DisallowedCharsInTableKeys = new Regex(@"[\#%/?\u0000-\u001F\u007F-\u009F]");
string disallowedCharReplacement = "_";   // any placeholder character you prefer
string sanitizedKey = DisallowedCharsInTableKeys.Replace(blobName, disallowedCharReplacement);
At this stage you may also want to prefix the sanitized key (Partition Key or Row Key) with a hash of the original key (i.e. the Blob Name), to avoid false collisions where different invalid keys end up with the same sanitized value.
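One possible way to combine the sanitization and the hash prefix (MD5 is just one choice of hash here, and the helper name is mine):

static string ToSafeKey(string rawName, string disallowedCharReplacement)
{
    string sanitized = DisallowedCharsInTableKeys.Replace(rawName, disallowedCharReplacement);

    using (var md5 = System.Security.Cryptography.MD5.Create())
    {
        // hash the original value, so two different raw names that sanitize
        // to the same string still end up with distinct keys
        byte[] hash = md5.ComputeHash(System.Text.Encoding.UTF8.GetBytes(rawName));
        return BitConverter.ToString(hash).Replace("-", string.Empty) + "_" + sanitized;
    }
}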
Azure supports associating metadata with blobs. Metadata is essentially just name-value pairs. The following video might be useful: http://www.bestechvideos.com/2009/04/15/mix09-windows-azure-storage
So instead of pointing from the table to the blob, you might want to go the other way round and point from the blob back to the table by using metadata.
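A small sketch of that direction (the metadata key names and the blob name are invented for the example):

CloudBlockBlob blob = container.GetBlockBlobReference("video-0001.mp4");
blob.Metadata["TablePartitionKey"] = partitionKey;   // point back at the table row
blob.Metadata["TableRowKey"] = rowKey;
blob.SetMetadata();                                  // persists the name-value pairs

// later, starting from the blob:
blob.FetchAttributes();                              // pulls properties and metadata down
string linkedRowKey = blob.Metadata["TableRowKey"];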
Is there any snap available in SnapLogic to do the following?
1. Connect to Snowflake and get data via SELECT * FROM VIEW.
2. Connect to Azure Blob Storage and get the data from a CSV file: FILENAME_YYYYMMDD.csv.
3. Take only the data that is available in 1 but NOT available in 2, and write this delta back to Azure Blob Storage as FILENAME_YYYYMMDD.csv.
Is In-Memory Look-Up useful for this?
No, the In-Memory Lookup snap is used for cases where you need to look up the value corresponding to the value in a certain field of the incoming records. For example, say you want to look up a country name against the country ISO code. This snap generally fetches the lookup table once and stores it in memory, then uses this stored lookup table to provide data corresponding to the incoming records.
In your case, you have to use the Join snap and configure it to an inner join.
I created a table in a storage account and I can insert records with no problem.
I decided to use this type of table so I could share the URI (HTTPS) for another system to consume, and since it is a NoSQL table it lets me adapt what I need to store in it.
But I have the problem that I must truncate this table every time information is processed, and the Data Factory option that indicates the insert mode (Replace or Merge) does not work; it always performs an append.
I tried to do it from Databricks, but I don't know how to reference that table, since it is not inside Blob Storage and cannot be mounted as such. Any ideas?
(Screenshots: Azure Table Storage; Azure Data Factory, where "Tipo de Inserción" = Insert Type and "Reemplazar" = Replace; the table data; and the Data Factory configuration.)
How can I configure this so that the existing data is deleted first?
Thanks a lot.
Greetings.
I have been following Microsoft's tutorial to incrementally/delta load data from an SQL Server database.
It uses a watermark (timestamp) to keep track of changed rows since last time. The tutorial stores the watermark to an Azure SQL database using the "Stored Procedure" activity in the pipeline so it can be reused in the next execution.
It seems overkill to have an Azure SQL database just to store that tiny bit of meta information (my source database is read-only, by the way). I'd rather store it somewhere else in Azure, maybe in blob storage or wherever.
In short: Is there an easy way of keeping track of this type of data or are we limited to using stored procs (or Azure Functions et al) for this?
I had come across a very similar scenario, and from what I found you can't store any watermark information in ADF - at least not in a way that you can easily access.
In the end I just created a basic tier Azure SQL database to store my watermark / config information on a SQL server that I was already using in my pipelines.
The nice thing about this is when my solution scaled out to multiple business units, all with different databases, I could still maintain watermark information for each of them by simply adding a column that tracks which BU that specific watermark info was for.
Blob storage is indeed a cheaper option but I've found it to require a little more effort than just using an additional database / table in an existing database.
I agree it would be really useful to be able to maintain a small dataset in ADF itself for small config items - probably a good suggestion to make to Microsoft!
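If you do go the blob route, here is a hedged sketch of what the read/write side could look like, e.g. inside an Azure Function that the pipeline calls (the container name, blob name, and the assumption that the blob was seeded once with an initial value are all mine):

CloudBlobContainer container = blobClient.GetContainerReference("pipeline-config");
CloudBlockBlob watermarkBlob = container.GetBlockBlobReference("watermark.txt");

// read the watermark left behind by the previous run
DateTime lastWatermark = DateTime.Parse(watermarkBlob.DownloadText());

// ... copy only the rows changed after lastWatermark ...

// store the new high-water mark for the next run
watermarkBlob.UploadText(DateTime.UtcNow.ToString("o"));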
There is a way to achieve this by using a Copy activity, but it is complicated to get the latest watermark in 'LookupOldWaterMarkActivity'; the following is just for reference.
Dataset setting:
Copy activity setting:
The source and sink dataset are the same one. Change the expression in Additional columns to @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
Through this, you can save the watermark as a column in a .txt file. But it is difficult to get the latest watermark with the Lookup activity, because the output of 'LookupOldWaterMarkActivity' will look like this:
{
    "count": 1,
    "value": [
        {
            "Prop_0": "11/24/2020 02:39:14",
            "Prop_1": "11/24/2020 08:31:42"
        }
    ]
}
The key names are generated by ADF. If you want to get "11/24/2020 08:31:42", you need to get the column count and then use an expression of the form @activity('LookupOldWaterMarkActivity').output.value[0]['Prop_<column count - 1>']
How to get the latest watermark:
Use a Get Metadata activity to get the columnCount.
Use this expression: @activity('LookupOldWaterMarkActivity').output.value[0][concat('Prop_',string(sub(activity('Get Metadata1').output.columnCount,1)))]
I came across some weird behavior with an Azure Table Storage query. I used the following code to get a list of entities from Azure Table Storage:
query = context.CreateQuery<DomainData.Employee>(DomainData.Employee.TABLE_NAME).Where(strPredicate).Select(selectQuery);
where context is a TableServiceContext. I was trying to pull Employee entities from Azure Table Storage; my requirement is to construct the predicate and the projection dynamically.
So strPredicate is a string that contains the dynamically constructed predicate,
and selectQuery is the projection string, constructed dynamically based on the user-selected properties.
When the user selects all the properties of the Employee object (which has over 200 properties), the system constructs the dynamic projection string from all of them, and it takes 45 minutes to retrieve 60,000 records from Azure Table Storage.
Whereas when I query the object directly without a projection, i.e. like below,
query = context.CreateQuery<DomainData.Employee>(DomainData.Employee.TABLE_NAME).Where(strPredicate);
the query takes only 5 minutes to retrieve the same 60,000 records. Why this peculiar behavior? Both queries are essentially the same, one with a projection of columns/properties and the other without any projection, yet Azure Table Storage returns the same number of entities with the same properties and the same size per entity. Why does the first query take so long, and why is the second one faster? Please let me know.
The standard advice when dealing with perceived anomalies with Windows Azure Storage is to use Fiddler to identify the actual storage operation invoked. This will quickly allow you to see what the actual differences are with the two operations.
In my Table storage there are 10,000 elements per partition. Now I would like to load a whole partition into memory. However, this is taking a very long time. I was wondering if I am doing something wrong, or if there is a way to do this faster. Here is my code:
public List<T> GetPartition<T>(string partitionKey) where T : TableServiceEntity
{
    CloudTableQuery<T> partitionQuery = (from e in _context.CreateQuery<T>(TableName)
                                         where e.PartitionKey == partitionKey
                                         select e).AsTableServiceQuery<T>();

    return partitionQuery.ToList();
}
Is this the way it is supposed to be done, or is there anything equivalent to batch insertion for getting elements out of the table again?
Thanks a lot,
Christian
EDIT
We also have all the data available in blob storage. That means one partition is serialized completely as a byte[] and saved in a blob. When I retrieve that from blob storage and then deserialize it, it is way faster than taking it from the table. Almost 10 times faster! How can this be?
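For context, a rough sketch of what that blob-based approach might look like (the serializer choice and all names are assumptions on my part, not taken from the question):

CloudBlockBlob partitionBlob = container.GetBlockBlobReference("partition-" + partitionKey);

// write: serialize the whole partition once and upload it as a single blob
using (var ms = new MemoryStream())
{
    new BinaryFormatter().Serialize(ms, entities);   // entities is a List<T>, T marked [Serializable]
    ms.Position = 0;
    partitionBlob.UploadFromStream(ms);
}

// read: one download and one deserialize instead of many table requests
List<T> restored;
using (var ms = new MemoryStream())
{
    partitionBlob.DownloadToStream(ms);
    ms.Position = 0;
    restored = (List<T>)new BinaryFormatter().Deserialize(ms);
}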
In your case I think turning off change tracking could make a difference:
context.MergeOption = MergeOption.NoTracking;
Take a look at MSDN for other possible improvements: .NET and ADO.NET Data Service Performance Tips for Windows Azure Tables
Edit: To answer your question about why one big file in blob storage is faster: the maximum number of records you can get back in a single table request is 1,000. This means that to fetch 10,000 items you need at least 10 requests to Table Storage instead of a single request to blob storage. Also, when working with blob storage you don't go through WCF Data Services, which can also have a big impact.
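Putting both suggestions together, a sketch of how the GetPartition method from the question might look (unchanged apart from the two additions); Execute() on the CloudTableQuery follows the continuation tokens for you, so the whole partition comes back even though each response is capped at 1,000 entities:

public List<T> GetPartition<T>(string partitionKey) where T : TableServiceEntity
{
    _context.MergeOption = MergeOption.NoTracking;   // skip change tracking for read-only loads

    CloudTableQuery<T> partitionQuery = (from e in _context.CreateQuery<T>(TableName)
                                         where e.PartitionKey == partitionKey
                                         select e).AsTableServiceQuery<T>();

    // Execute() transparently issues the follow-up requests for each continuation token
    return partitionQuery.Execute().ToList();
}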
In addition, make sure you are on the second generation of Azure Storage; it's essentially a no-cost upgrade if you are in a data center that supports it. It uses SSDs and an upgraded network topology.
http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Microsoft will not migrate your existing account; simply re-create it and you get upgraded for free to 2nd-gen Azure Storage.