Possibility of GUID collision in MS CRM Data migration - dynamics-crm-2011

We are doing a CRM data migration to keep two CRM systems in sync while removing historical data from the primary CRM. The target CRM was created using the source as its base. When we migrate data, we keep each record's GUID the same in order to maintain data integrity. This approach assumes that every source GUID is still available in the target system to assign to the new record. No new records are created directly in the target system except emails, and those are very few in number.

However, there are cases where the system generates its own GUIDs. For example, when we move a newly created entity to the target using a solution, the platform does not preserve the GUIDs of the entity and its attributes; it creates its own, and we have no control over this. Some records that are created internally by the platform are also assigned new GUIDs. Since we do not control GUID creation in the target system (even if the number of such records is small), I fear a situation where the source system has a GUID that the target has already consumed, which would cause errors at migration time.
My question: is there any possibility that the above can happen? Because if it does, the whole migration solution will lose its value.
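For context, the mechanism the question relies on (creating the target record with the source record's GUID) looks roughly like the sketch below using the CRM 2011 SDK. The entity name, attribute values, and service setup are placeholders, not taken from the question.

```csharp
// Minimal sketch, assuming an already-configured IOrganizationService for the target org.
// "contact" and the attribute values are placeholders; the key point is that Entity.Id
// is set to the source record's GUID before Create, so the target reuses the same GUID.
using System;
using Microsoft.Xrm.Sdk;

public static class MigrationSketch
{
    public static Guid CopyRecord(IOrganizationService targetService, Guid sourceId)
    {
        var contact = new Entity("contact")
        {
            Id = sourceId                      // reuse the source GUID in the target system
        };
        contact["firstname"] = "Jane";         // placeholder attribute values
        contact["lastname"] = "Doe";

        // Create fails with an OrganizationServiceFault if a record with this GUID
        // already exists in the target, which is the collision scenario described above.
        return targetService.Create(contact);
    }
}
```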

SQL Server's NEWID() generates a 128-bit ID. All IDs generated on the same machine are guaranteed to be unique but because yours have been generated across multiple machines, there's no guarantee.
That being said, from this source on GUIDs:
...for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
So the answer is yes, there is a chance of collision, but it's so astronomically low that most consider the answer to effectively be no.
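To put a number on that, the collision probability can be estimated with the usual birthday-bound approximation p ≈ 1 - exp(-n^2 / (2 * 2^122)) for n random version-4 GUIDs (122 random bits). A minimal sketch:

```csharp
// Minimal sketch: birthday-bound approximation for the probability that at least
// two of n randomly generated version-4 GUIDs collide (122 random bits).
using System;

public static class GuidCollisionEstimate
{
    public static double CollisionProbability(double n)
    {
        double space = Math.Pow(2, 122);                 // number of possible v4 GUIDs
        return 1.0 - Math.Exp(-(n * n) / (2.0 * space));
    }

    public static void Main()
    {
        // Roughly 103 trillion GUIDs for a one-in-a-billion chance, as quoted above.
        Console.WriteLine(CollisionProbability(103e12)); // ~1e-9
        // A few hundred million migrated records is nowhere near that.
        Console.WriteLine(CollisionProbability(5e8));    // effectively 0
    }
}
```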

Related

Restricting access to Excel source data

I have an Excel template which reads data from a source Excel file using VLOOKUP and INDEX/MATCH functions. Is there a way to prevent the end user from accessing the source data file/sheet? For example, by storing the source file in a remote location and making the lookups read from there?
Depending on what resources are available to you, it may be difficult to prevent users from simply going around the restrictions you put in place. Even if the data is in a database table, you will need measures in place to prevent users from querying it outside of your Excel template. I don't know your situation, but ideally there would be someone (e.g., a database administrator, infosec, or a back-end developer) who could help engineer a proper solution.
Having said that, I do believe your idea of using MS SQL Server could be a good way to go. You could create stored procedures instead of raw SQL queries to limit access. See this link for more details:
Managing Permissions with Stored Procedures in SQL Server
In addition, I would be worried about users figuring out other user IDs and arbitrarily accessing data. You could implement some protection by using a mapping table so that information cannot be accessed directly with a user ID. The table would be as follows:
Columns: randomKey, userId, creationDateTime
randomKey is an x-digit random number/letter sequence
creationDateTime is a timestamp used for timeout purposes
Whenever someone needs a user ID, you would run a stored procedure that adds a record to the mapping table. You input the user ID, the procedure creates a record and returns the key. You provide the user with the key, which they enter in your template. A separate stored procedure takes the key, resolves it to the user ID (using the mapping table), and returns the requested information. These keys expire: either they are single use (the procedure deletes the record from the mapping table) or they use a timeout (if creationDateTime is more than x hours/days old, no data is returned).
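As an illustration only, the client side of that flow might look something like the sketch below; the stored procedure names (usp_IssueAccessKey, usp_ResolveAccessKey) and their parameters are hypothetical, not defined anywhere above.

```csharp
// Minimal sketch of issuing and resolving an expiring access key against a
// hypothetical mapping table, assuming stored procedures named usp_IssueAccessKey
// and usp_ResolveAccessKey exist (both names are made up for illustration).
using System.Data;
using System.Data.SqlClient;

public static class AccessKeySketch
{
    public static string IssueKey(string connectionString, int userId)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("usp_IssueAccessKey", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@userId", userId);
            conn.Open();
            // The procedure inserts (randomKey, userId, creationDateTime) and returns the key.
            return (string)cmd.ExecuteScalar();
        }
    }

    public static DataTable ResolveKey(string connectionString, string key)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("usp_ResolveAccessKey", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@key", key);
            // The procedure checks creationDateTime against the timeout (or deletes the
            // row for single-use keys) and returns the requested data, or nothing.
            var table = new DataTable();
            using (var adapter = new SqlDataAdapter(cmd))
            {
                adapter.Fill(table);
            }
            return table;
        }
    }
}
```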
For the keys, Mark Ransom shared an interesting solution for creating random IDs on which you could base your logic:
Generate 6 Digit unique number
Sounds like a lot of work, but if there is sensitivity around your data it's worth building a more robust process around it. There's probably a better way to approach this, but I hope it at least gives you food for thought.
No, it's not possible.
Moreover, you absolutely NEED these files open to refresh the values in formulas that refer to them. When you open a file with external references, their values are calculated from the local cache (which may not match the actual remote file contents). Only when you open the remote files will the values refresh.

What strategies exist to find unreachable keys in a key/value database?

TL;DR
How can you find "unreachable keys" in a key/value store with a large amount of
data?
Background
In comparison to relational databases that provide ACID guarantees, NoSQL
key/value databases provide fewer guarantees in order to handle "big data".
For example, they only provide atomicity in the context of a single key/value
pair, but they use techniques like distributed hash tables to "shard" the data
across an arbitrarily large cluster of machines.
Keys are often unfriendly for humans. For example, a key for a blob of data
representing an employee might be
Employee:39045e87-6c00-47a4-a683-7aba4354c44a. The employee might also have a
more human-friendly identifier, such as the username jdoe with which the
employee signs in to the system. This username would be stored as a separate
key/value pair, where the key might be EmployeeUsername:jdoe. The value for
key EmployeeUsername:jdoe is typically either an array of strings containing
the main key (think of it like a secondary index, which does not necessarily
contain unique values) or a denormalised version of the employee blob (perhaps
aggregating data from other objects in order to improve query performance).
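Concretely, the two key/value pairs described above might look like the following minimal sketch, with an in-memory dictionary standing in for the store and a made-up serialization format:

```csharp
// Minimal sketch: the primary key/value pair plus the secondary-index pair,
// using an in-memory dictionary to stand in for the key/value store.
using System;
using System.Collections.Generic;

public static class SecondaryIndexSketch
{
    public static void Main()
    {
        var store = new Dictionary<string, string>();
        var employeeId = Guid.Parse("39045e87-6c00-47a4-a683-7aba4354c44a");

        // Main key: serialized employee blob.
        store[$"Employee:{employeeId}"] = "{\"username\":\"jdoe\",\"name\":\"Jane Doe\"}";

        // Secondary index: username -> list of main keys (not necessarily unique).
        store["EmployeeUsername:jdoe"] = $"[\"Employee:{employeeId}\"]";

        // Lookup by username requires two reads: resolve the index, then fetch the blob.
        Console.WriteLine(store["EmployeeUsername:jdoe"]);
    }
}
```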
Problem
Now, given that key/value databases do not usually provide transactional
guarantees, what happens when a process inserts the key
Employee:39045e87-6c00-47a4-a683-7aba4354c44a (along with the serialized
representation of the employee) but crashes before inserting the
EmployeeUsername:jdoe key? The client does not know the key for the employee
data - he or she only knows the username jdoe - so how do you find the
Employee:39045e87-6c00-47a4-a683-7aba4354c44a key?
The only thing I can think of is to enumerate the keys in the key/value store
and once you find the appropriate key, "resume" the indexing/denormalisation.
I'm well aware of techniques like event sourcing, where an idempotent event
handler could respond to the event (e.g., EmployeeRegistered) in order to
recreate the username-to-employee-UUID secondary index, but using event
sourcing over a key/value store still requires enumeration of keys, which could
degrade performance.
Analogy
The more experience I have in IT, the more I see the same problems being
tackled in different scenarios. For example, Linux filesystems store both file
and directory contents in "inodes". You can think of these as key/value pairs,
where the key is an integer and the value is the file/directory contents. When
writing a new file, the system creates an inode and fills it with data THEN
modifies the parent directory to add the "filename-to-inode" mapping. If the
system crashes after creating the file but before referencing it in the parent
directory, your file "exists on disk" but is essentially unreadable. When the
system comes back online, hopefully it will place this file into the
"lost+found" directory (I imagine it does this by scanning the entire disk).
There are plenty of other examples (such as domain name to IP address mappings
in the DNS system), but I specifically want to know how the above problem is
tackled in NoSQL key/value databases.
EDIT
I found this interesting article on manual secondary indexes, but it doesn't address how to detect or repair "broken" or "dated" secondary indexes.
The solution I've come up with is to use a process manager (or "saga"),
whose key contains the username. This also guarantees uniqueness across
employees during registration. (Note that I'm using a key/value store
with compare-and-swap (CAS) semantics for concurrency control.)
1. Create an EmployeeRegistrationProcess with a key of
   EmployeeRegistrationProcess:jdoe. If a concurrency error occurs (i.e., the
   registration process already exists), then this is a duplicate username.
2. When started, the EmployeeRegistrationProcess allocates an employee UUID and
   attempts to create an Employee object using this UUID (e.g.,
   Employee:39045e87-6c00-47a4-a683-7aba4354c44a).
   - If the system crashes after starting the EmployeeRegistrationProcess but
     before creating the Employee, we can still locate the "employee" (or more
     accurately, the employee registration process) by the username "jdoe". We
     can then resume the "transaction".
   - If there is a concurrency error (i.e., an Employee with the generated UUID
     already exists), the EmployeeRegistrationProcess can flag itself as "in
     error" or "for review" or whatever process we decide is best.
3. After the Employee has successfully been created, the
   EmployeeRegistrationProcess creates the secondary index
   EmployeeUsernameToUuid:jdoe -> 39045e87-6c00-47a4-a683-7aba4354c44a.
   - Again, if this fails, we can still locate the "employee" by the username
     "jdoe" and resume the transaction.
   - And again, if there is a concurrency error (i.e., the
     EmployeeUsernameToUuid:jdoe key already exists), the
     EmployeeRegistrationProcess can take appropriate action.
4. When both commands have succeeded (the creation of the Employee and the
   creation of the secondary index), the EmployeeRegistrationProcess can be
   deleted.

At all stages of the process, the Employee (or EmployeeRegistrationProcess) is
reachable via its human-friendly identifier "jdoe". Event sourcing the
EmployeeRegistrationProcess is optional.
Note that using a process manager can also help in enforcing uniqueness across
usernames after registration. That is, we can create an
EmployeeUsernameChangeProcess object with a key containing the new username.
"Enforcing" uniqueness at either registration or username change hurts
scalability, so the value identified by "EmployeeUsernameToUuid:jdoe" could be
an array of employee UUIDs.
If you look at the question from the point of view of event-sourced entities, the responsibility of the event store includes guaranteeing that an event is saved to storage and published to the bus. From that point of view, it is guaranteed that an event is written completely, and since the database is append-only, there will never be a problem with an invalid event.
At the same time, of course, it isn't guaranteed that every command which generates events will be executed successfully; you can only guarantee ordering and protection against repeated execution of the same command, not the whole transaction.
From there it works as follows: the saga intercepts the original command and tries to execute the whole transaction. If any part of the transaction ends with an error, or for example does not complete within a preset time, the process is rolled back by generating so-called compensating events. Such events cannot delete an entity, but they bring the system to a consistent state, as if the command had never been executed.
Note: if your particular event database is arranged so that a key can guarantee the write of only one pair, just serialize the event and use the combination of the aggregate root's identifier and version as the key. The aggregate version in this case acts, in a sense, as an analogue of a CAS operation.
About concurrency you can read this article: http://danielwhittaker.me/2014/09/29/handling-concurrency-issues-cqrs-event-sourced-system/

purge "DeletedDatabaseRecords" from database

I was recently asked by one of my customers if there was a method to clean out records with the "DeletedDatabaseRecord" flag set.
They are in the process of implementing a new base company and have done several rounds of import/delete of key records, which has resulted in quite a few of these records that they'd prefer not to carry over to their actual live company.
Looking through the system, I didn't see a built-in method to clear these records out.
Is there a method of purging these records that is part of the system, be it from the ERP Configuration tools, stored procedures, or in the interface itself?
Jeff,
No, there is no special functionality to remove records flagged as DeletedDatabaseRecord, but you may always use a simple SQL script to loop over all the tables that have this column and remove from each of them the records that have it set to 1.
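For illustration, the same idea driven from C#/ADO.NET might look like the sketch below; it assumes the column is named DeletedDatabaseRecord as described, and it should obviously be reviewed and tested against a backup before being run for real.

```csharp
// Minimal sketch: find every table with a DeletedDatabaseRecord column and delete
// the rows where it is set to 1. Illustrative only; back up the database and review
// the generated statements before running anything like this against live data.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class PurgeDeletedRecordsSketch
{
    public static void Purge(string connectionString)
    {
        var tables = new List<string>();
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            const string findTables = @"
                SELECT TABLE_SCHEMA, TABLE_NAME
                FROM INFORMATION_SCHEMA.COLUMNS
                WHERE COLUMN_NAME = 'DeletedDatabaseRecord'";
            using (var cmd = new SqlCommand(findTables, conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    tables.Add($"[{reader.GetString(0)}].[{reader.GetString(1)}]");
            }

            foreach (var table in tables)
            {
                var delete = $"DELETE FROM {table} WHERE DeletedDatabaseRecord = 1";
                using (var cmd = new SqlCommand(delete, conn))
                {
                    Console.WriteLine($"{table}: {cmd.ExecuteNonQuery()} rows removed");
                }
            }
        }
    }
}
```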

Azure Table Storage Partition Key

Two somewhat related questions.
1) Is there any way to get an ID of the server a table entity lives on?
2) Will using a GUID give me the best partition key distribution possible? If not, what will?
We have been struggling for weeks with table storage performance. In short, it's really bad, but early on we realized that using a random-ish partition key will distribute the entities across many servers, which is exactly what we want as we are trying to achieve 8000 reads per second. Apparently our partition key wasn't random enough, so for testing purposes I have decided to just use a GUID. First impression is that it is way faster.
"Really bad" get performance is < 1000 per second. The partition key is Guid.NewGuid() and the row key is the constant "UserInfo". Gets are executed using TableOperation with pk and rk, nothing else, as follows: TableOperation retrieveOperation = TableOperation.Retrieve(pk, rk); return cloudTable.ExecuteAsync(retrieveOperation). We always use indexed reads and never table scans. Also, the VM size is medium or large, never anything smaller. Parallel: no; async: yes.
As other users have pointed out, Azure Tables are strictly controlled by the runtime and thus you cannot control or check which specific storage nodes are handling your requests. Furthermore, any given partition is served by a single server; that is, entities belonging to the same partition cannot be split between several storage nodes (see the quote below).
In Windows Azure table, the PartitionKey property is used as the partition key. All entities with same PartitionKey value are clustered together and they are served from a single server node. This allows the user to control entity locality by setting the PartitionKey values, and perform Entity Group Transactions over entities in that same partition.
You mention that you are targeting 8000 requests per second? If that is the case, you might be hitting a threshold that requires very good table/partitionkey design. Please see the article "Windows Azure Storage Abstractions and their Scalability Targets"
The following extract is applicable to your situation:
This will provide the following scalability targets for a single storage account created after June 7th 2012.
Capacity – Up to 200 TBs
Transactions – Up to 20,000 entities/messages/blobs per second
As other users pointed out, if your PartitionKey numbering follows an incremental pattern, the Azure runtime will recognize this and group some of your partitions within the same storage node.
Furthermore, if I understood your question correctly, you are currently assigning partition keys via GUIDs? If that is the case, every PartitionKey in your table will be unique, so every partition will contain no more than one entity. As per the articles above, the way Azure Table scales out is by grouping entities by their partition keys across independent storage nodes. If your partition keys are unique and thus contain no more than one entity, this means that Azure Table will scale out only one entity at a time! Now, we know Azure is not that dumb, and it groups partition keys when it detects a pattern in the way they are created. So if you are hitting this trigger and Azure is grouping your partition keys, your scalability is limited to the smartness of this grouping algorithm.
As per the scalability targets above for 2012, each partition should be able to give you 2,000 transactions per second. Theoretically, you should need no more than 4 partition keys in this case (assuming the workload is distributed equally between the four).
I would suggest designing your partition keys to group entities in such a way that no more than 2,000 entities per second per partition are reached, and dropping GUIDs as partition keys. This will allow you to better support features such as Entity Group Transactions, reduce the complexity of your table design, and get the performance you are looking for.
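For example, one way to get a bounded number of well-distributed partitions while keeping point reads is to hash the user ID into a fixed number of buckets. This is a hypothetical sketch; the bucket count, entity shape, and property names are assumptions, not from the answer above.

```csharp
// Minimal sketch: bucket user IDs into a fixed number of partitions so each
// partition stays under the ~2,000 transactions/second target while point reads
// remain possible (the partition key is derivable from the user ID alone).
using Microsoft.WindowsAzure.Storage.Table;

public class UserInfoEntity : TableEntity
{
    public UserInfoEntity() { }

    public UserInfoEntity(string userId, int bucketCount = 8)
    {
        PartitionKey = Bucket(userId, bucketCount);
        RowKey = userId;
    }

    public string Email { get; set; }   // example payload property

    // Simple stable hash (unlike string.GetHashCode, which can vary between runtimes),
    // so readers can recompute the partition key from the user ID alone.
    public static string Bucket(string userId, int bucketCount)
    {
        int hash = 0;
        foreach (char c in userId)
            hash = (hash * 31 + c) % bucketCount;
        return hash.ToString("D2");
    }
}
```

A point read then becomes TableOperation.Retrieve<UserInfoEntity>(UserInfoEntity.Bucket(userId, 8), userId), so no partition or table scan is ever needed.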
Answering #1: There is no concept of a server that a particular table entity lives on. There are no particular servers to choose from, as Table Storage is a massive-scale multi-tenant storage system. So... there's no way to retrieve a server ID for a given table entity.
Answering #2: Choose a partition key that makes sense for your application. Just remember that it's partition+row to access a given entity. If you do that, you'll have a fast, direct read. If you attempt to do a table- or partition-scan, your performance will certainly take a hit.
See http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx for more info on key selection (note the numbers are 3 years old, but the guidance is still good).
Also, this talk can be of some use as far as best practices: http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/WAD-B406#fbid=lCN9J5QiTDF.
In general a given partition can support up to 2000 tps, so spreading data across partitions will help achieve greater numbers. Something to consider is that atomic batch transactions only apply to entities that share the same partitionkey. Additionally, for smaller requests you may consider disabling Nagle as small requests may be getting held up at the client layer.
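Disabling Nagle is done through ServicePointManager before the first request goes out; a minimal sketch follows (the connection limit value is just an illustrative choice, not a recommendation from the answer).

```csharp
// Minimal sketch: client-side tweaks that are commonly applied before issuing any
// storage requests. The connection limit of 100 is an arbitrary illustrative value.
using System.Net;

public static class StorageClientTuning
{
    public static void Apply()
    {
        ServicePointManager.UseNagleAlgorithm = false;    // avoid small requests being buffered
        ServicePointManager.Expect100Continue = false;    // skip the extra 100-continue round trip
        ServicePointManager.DefaultConnectionLimit = 100; // allow enough parallel connections
    }
}
```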
From the client end, I would recommend using the latest client lib (2.1) and Async methods as you have literally thousands of requests per second. (the talk has a few slides on client best practices)
Lastly, the next release of storage will support JSON and JSON no metadata which will dramatically reduce the size of the response body for the same objects, and subsequently the cpu cycles needed to parse them. If you use the latest client libs your application will be able to leverage these behaviors with little to no code change.

Sync Framework Scope Versioning

We're currently using the Microsoft Sync Framework 2.1 to sync data between a cloud solution and thick clients. Syncing is initiated by the clients and is download only. Both ends are using SQL Server and I'm using the SqlSyncScopeProvisioning class to provision scopes. We cannot guarantee that the clients will be running the latest version of our software, but we have full control of the cloud part and this will always be up to date.
We're considering supporting versioning of scopes so that if, for example, I modify a table to have a new column then I can keep any original scope (e.g. 'ScopeA_V1'), whilst adding another scope that covers all the same data as the first scope but also with the new column (e.g. 'ScopeA_V2'). This would allow older versions of the client to continue syncing until they had been upgraded.
In order to achieve this I'm designing data model changes in a specific way so that I can only add columns and tables, never remove, and all new columns must be nullable. In theory I think this should allow older versions of my scopes to carry on functioning even if they aren't syncing the new data.
I think I'm almost there, but I've hit a stumbling block. When I provision the new versions of existing scopes I get correctly versioned copies of my SelectChanges stored procedure, but all the table-specific stored procedures (not specific to scopes, e.g. tableA_update, tableA_delete, etc.) are not being updated, as I think the provisioner sees them as existing and doesn't think they need to be updated.
Is there a way I can get the provisioner to update the relevant stored procedures (_update, _insert, etc) so that it adds in the new parameters for the new columns with default values (null), allowing both the new and old versions of the scopes to use them?
Also if I do this then when the client is upgraded to the newer version, will it resync the new columns even though the rows have already been synced (albeit with nulls in the new columns)?
Or am I going about this the completely wrong way? Is there another way to make scopes backwards compatible with older versions?
Sync Framework out of the box doesn't support updating scope definitions to accommodate schema changes.
Creating a new scope via SetCreateProceduresForAdditionalScopeDefault will only create a new scope and a new _selectchanges stored procedure, but will re-use all the other stored procedures, tracking tables, triggers and UDTs.
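For reference, provisioning the versioned scope described in the question looks roughly like the sketch below with Sync Framework 2.1; the connection string and table name are placeholders, and "ScopeA_V2" is the question's own example.

```csharp
// Minimal sketch: provisioning "ScopeA_V2" on a database that already has "ScopeA_V1",
// reusing the existing table-level stored procedures, tracking tables and triggers.
// The connection string and table name ("TableA") are placeholders.
using System.Data.SqlClient;
using Microsoft.Synchronization.Data;
using Microsoft.Synchronization.Data.SqlServer;

public static class ScopeProvisioningSketch
{
    public static void ProvisionV2(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            var scopeDescription = new DbSyncScopeDescription("ScopeA_V2");
            scopeDescription.Tables.Add(
                SqlSyncDescriptionBuilder.GetDescriptionForTable("TableA", connection));

            var provisioning = new SqlSyncScopeProvisioning(connection, scopeDescription);

            // Reuse the existing _update/_insert/_delete procedures, tracking tables and
            // triggers; only the scope entry and its _selectchanges procedure are created.
            provisioning.SetCreateProceduresForAdditionalScopeDefault(DbSyncCreationOption.Create);

            if (!provisioning.ScopeExists("ScopeA_V2"))
                provisioning.Apply();
        }
    }
}
```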
I wrote a series of blog posts on what needs to be changed to accommodate schema changes here: http://jtabadero.wordpress.com/2011/03/21/modifying-sync-framework-scope-definition-part-1-introduction/
The subsequent posts show some ways to hack the provisioning scripts.
To answer your other question about whether the addition of a new column will resync that column or the row: the answer is no. First, change tracking is at the row level. Second, adding a column will not fire the triggers that update the tracking tables that indicate whether there are changes to be synced.
