Sync Framework Scope Versioning

We're currently using the Microsoft Sync Framework 2.1 to sync data between a cloud solution and thick clients. Syncing is initiated by the clients and is download only. Both ends are using SQL Server and I'm using the SqlSyncScopeProvisioning class to provision scopes. We cannot guarantee that the clients will be running the latest version of our software, but we have full control of the cloud part and this will always be up to date.
We're considering supporting versioning of scopes so that if, for example, I modify a table to have a new column then I can keep any original scope (e.g. 'ScopeA_V1'), whilst adding another scope that covers all the same data as the first scope but also with the new column (e.g. 'ScopeA_V2'). This would allow older versions of the client to continue syncing until they had been upgraded.
In order to achieve this I'm designing data model changes in a specific way so that I can only add columns and tables, never remove, and all new columns must be nullable. In theory I think this should allow older versions of my scopes to carry on functioning even if they aren't syncing the new data.
I think I'm almost there, but I've hit a stumbling block. When I provision the new versions of existing scopes I get correctly versioned copies of my SelectChanges stored procedure, but the table-specific stored procedures (which are not scope-specific, i.e. tableA_update, tableA_delete, etc.) are not updated, because the provisioner sees them as already existing and decides they don't need to be updated.
Is there a way I can get the provisioner to update the relevant stored procedures (_update, _insert, etc) so that it adds in the new parameters for the new columns with default values (null), allowing both the new and old versions of the scopes to use them?
Also if I do this then when the client is upgraded to the newer version, will it resync the new columns even though the rows have already been synced (albeit with nulls in the new columns)?
Or am I going about this the completely wrong way? Is there another way to make scopes backwards compatible with older versions?

Sync Framework out of the box doesn't support updating scope definitions to accommodate schema changes.
Creating a new scope via SetCreateProceduresForAdditionalScopeDefault will only create a new scope and a new _selectchanges stored procedure; it will re-use all the other stored procedures, tracking tables, triggers and UDTs.
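For reference, here is roughly what provisioning the additional scope version can look like. This is only a sketch: the scope/table names ("ScopeA_V2", "TableA") and serverConnString are placeholders, and the DbSyncCreationOption defaults shown are just one way of telling the provisioner to reuse what already exists:

    using System.Data.SqlClient;
    using Microsoft.Synchronization.Data;
    using Microsoft.Synchronization.Data.SqlServer;

    using (var serverConn = new SqlConnection(serverConnString))   // placeholder connection string
    {
        // describe the V2 scope over the same table, which now includes the new nullable column
        var scopeDesc = new DbSyncScopeDescription("ScopeA_V2");
        scopeDesc.Tables.Add(
            SqlSyncDescriptionBuilder.GetDescriptionForTable("TableA", serverConn));

        var provisioning = new SqlSyncScopeProvisioning(serverConn, scopeDesc);

        // reuse the base table, tracking table, triggers and table-level procedures,
        // and only create the scope entry plus its _selectchanges procedure
        provisioning.SetCreateTableDefault(DbSyncCreationOption.Skip);
        provisioning.SetCreateTrackingTableDefault(DbSyncCreationOption.Skip);
        provisioning.SetCreateTriggersDefault(DbSyncCreationOption.Skip);
        provisioning.SetCreateProceduresDefault(DbSyncCreationOption.Skip);
        provisioning.SetCreateProceduresForAdditionalScopeDefault(DbSyncCreationOption.Create);

        if (!provisioning.ScopeExists("ScopeA_V2"))
            provisioning.Apply();
    }

Note that this does not touch the existing _update/_insert procedures for the new columns, which is exactly the problem described above; those are the scripts the posts below show how to hack.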
I wrote a series of blog posts on what needs to be changed to accommodate schema changes here: http://jtabadero.wordpress.com/2011/03/21/modifying-sync-framework-scope-definition-part-1-introduction/
The subsequent posts show some ways to hack the provisioning scripts.
To answer your other question about whether the addition of a new column will resync that column or the row: the answer is no. First, change tracking is at the row level. Second, adding a column will not fire the triggers that update the tracking tables indicating there are changes to be synced.

Related

Is there an idiomatic way of versioning clients to a database?

I'm supplying client drivers to a database I am maintaining. The DB has lots of tables with well defined schemas. (Cassandra in this case)
From time to time there will be some breaking changes (stemming from product and system requirements), and the clients will "break" in the sense that the queries they have been performing until now will no longer be correct with regard to the newer schemas.
I'm curious to know if there is a good clean way to "version" the clients to work with the corresponding tables?
For instance, a naive implementation could add the version number to the table name, i.e. for every table in the DB, append a version number to the table name.
The clients would always query tables that match this naming convention. Newer breaking versions would change the table name to match the newer version and clients would be upgraded accordingly.
Is there a better way to handle this?
It's also possible to keep one version number for your DB and one version number stored in your client; when a breaking change is made, you bump the database version.
When the client starts, a version check is performed, and if the versions mismatch an auto-upgrade can be done.
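A rough illustration of that start-up check, here with the DataStax C# driver; the schema_version table, its columns, the keyspace and the contact point are all invented for the example:

    using System.Linq;
    using Cassandra;

    const int ClientSchemaVersion = 3;   // the schema version this client build was written against

    var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
    var session = cluster.Connect("my_keyspace");

    var row = session.Execute(
        "SELECT version_no FROM schema_version WHERE id = 'current'").First();
    var dbVersion = row.GetValue<int>("version_no");

    if (dbVersion != ClientSchemaVersion)
    {
        // versions mismatch: run the auto-upgrade here, or refuse to start
    }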
I came across the same problem a few months ago. We have to load the schema according to the version that our client should support. The solution we found is as follows:
Along with the schema, one more table is created containing the following fields: version_no, ks_name, table_name, column_name, add/drop, is_loaded, with primary key (version_no, (ks_name, table_name, column_name)). Note: if you have a single keyspace you can remove that column, or the table name can itself be written as ks_name.table_name.
Then, whenever we want to load a new version, we log the changes in that table, and when we load the previous schema again the script makes sure the old alterations are applied so that it rolls back to the same previous version of the schema. Make sure that you update the is_loaded field, as it is the only way to tell whether a schema is half loaded or the script failed, so that it does not raise any further errors. Hope it helps!
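If it helps, one possible rendering of that table with the DataStax C# driver; column names such as change_type (instead of "add/drop") are just my choice, and the keyspace/contact point are made up:

    using Cassandra;

    var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
    var session = cluster.Connect("my_keyspace");

    // one row per alteration, keyed by the version it belongs to
    session.Execute(@"
        CREATE TABLE IF NOT EXISTS schema_changes (
            version_no  int,
            ks_name     text,
            table_name  text,
            column_name text,
            change_type text,
            is_loaded   boolean,
            PRIMARY KEY (version_no, ks_name, table_name, column_name))");

    // example log entry for an 'add column' change introduced in version 2
    session.Execute(
        "INSERT INTO schema_changes " +
        "(version_no, ks_name, table_name, column_name, change_type, is_loaded) " +
        "VALUES (2, 'my_keyspace', 'users', 'middle_name', 'add', false)");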

Possibility of GUID collision in MS CRM Data migration

We are doing a CRM data migration in order to keep two CRM systems in sync and to remove historical data from the primary CRM. The target CRM was created by taking the source as a base. When we migrate the data we keep the GUIDs of records the same in order to maintain data integrity. This approach assumes that each GUID is still available in the target system to assign to the new record. No new records are created directly in the target system except emails, and those are very low in number. Apart from that, there are ways in which the system creates its own GUIDs: e.g. when we move a newly created entity to the target using a solution, it will not keep the GUIDs of the entity and attributes and will create its own, and we have no control over this. Also, some records which are created internally will be created by the platform and assigned a new GUID. So, given that we do not have control over GUID creation in the target system (although the number is very small), I fear a situation where the source system has a GUID which the target has already consumed, and at data migration time this will give errors.
My question: is there any possibility that the above can happen? Because if that happens to us, the whole migration solution will lose its value.
SQL Server's NEWID() generates a 128-bit ID. All IDs generated on the same machine are guaranteed to be unique but because yours have been generated across multiple machines, there's no guarantee.
That being said, from this source on GUIDs:
...for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
So the answer is yes, there is a chance of collision, but it's so astronomically low that most consider the answer to effectively be no.
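For anyone who wants to sanity-check that figure: a version 4 UUID has 122 random bits, so there are 2^122 ≈ 5.3 * 10^36 possible values, and the usual birthday approximation gives

    p ≈ n^2 / (2 * 2^122)
      ≈ (1.03 * 10^14)^2 / (2 * 5.3 * 10^36)
      ≈ 1 * 10^-9

which is where the one-in-a-billion figure for 103 trillion UUIDs comes from.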

Updating Lucene index frequently causing performance degradation

I am trying to add Lucene.net to my project, where searching involves more complicated data, but the tables are modified frequently (new data is inserted, or fields used in the Lucene index are changed).
Is Lucene.net a good fit for searching here?
How can I find the modified fields and update the specific Lucene index entries that have already been created? If the Lucene index contains documents that have been deleted from the table, how can I remove them from the index?
Right now, while loading,
I remove index entries that are no longer present in the table, based on a unique field
I insert a document if it does not exist in the index, otherwise I update all index entries matching the table's unique field
Loading the page takes longer than normal because of these remove/insert/update index calls.
How should I proceed?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely threadsafe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are are to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also threadsafe, so you can also keep an instance of an IS through which all your searches go, then recreate it periodically (e.g. with a timer) and swap it in using Interlocked.Exchange.
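To make those pointers concrete, here is a rough sketch of the singleton writer + NRT reader pattern using a Lucene.Net 3.0.3-style API; the index path and field names are placeholders, and warm-up/error handling is left out:

    using System.Threading;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    public static class SearchEngine
    {
        // single IndexWriter for the whole app; never open a second one on this folder
        private static readonly IndexWriter Writer = new IndexWriter(
            FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\indexes\products")),   // placeholder path
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);

        // NRT searcher built from the writer's reader, never from a separate IndexReader
        private static IndexSearcher _searcher = new IndexSearcher(Writer.GetReader());

        // call from your data-change code or from the log/queue consumer
        public static void AddOrUpdate(string id, string text)
        {
            var doc = new Document();
            doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
            Writer.UpdateDocument(new Term("id", id), doc);   // insert-or-replace by unique id
        }

        public static void Delete(string id)
        {
            Writer.DeleteDocuments(new Term("id", id));
        }

        // run periodically (e.g. from a timer) to make recent changes visible to searches
        public static void RefreshSearcher()
        {
            Writer.Commit();
            var fresh = new IndexSearcher(Writer.GetReader());
            var old = Interlocked.Exchange(ref _searcher, fresh);
            old.Dispose();   // in real code, let in-flight searches on the old instance finish first
        }

        public static TopDocs Search(Query query, int maxHits)
        {
            return _searcher.Search(query, maxHits);
        }
    }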
I previously created a small framework to isolate this from the app and make it reusable.
Caveat
Having said that... hosting this inside IIS does raise some problems. IIS will occasionally restart your app. It will also (by default) start the new instance before stopping the existing one, then swap them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So, you are actually better off creating a separate service to host your Lucene bits. A simple self-hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk IO load). It also means the service can be accessed from other parts of your system, tested separately, etc.
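If it helps, a minimal sketch of such a self-hosted WebAPI front end (OWIN self-host; the URL, route and controller are placeholders, and the search action would simply delegate to the Lucene singletons described above):

    using System.Web.Http;
    using Microsoft.Owin.Hosting;
    using Owin;

    public class Startup
    {
        public void Configuration(IAppBuilder app)
        {
            var config = new HttpConfiguration();
            config.MapHttpAttributeRoutes();
            app.UseWebApi(config);
        }
    }

    public class SearchController : ApiController
    {
        [HttpGet, Route("search")]
        public IHttpActionResult Get(string q, int max = 20)
        {
            // parse q, run it through the shared IndexSearcher, return the hits
            return Ok(new string[0]);   // placeholder result
        }
    }

    // in the Windows service's OnStart (or a console host while testing):
    // WebApp.Start<Startup>("http://localhost:9123/");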
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service in-house and in C#, it stays simple, you keep control over the capabilities, and the shape of the API can be tuned to your needs.
No "right" answer. You'll have to make choices based on your situation.
You should preferably be indexing according to some schedule (periodically). The easiest approach is to keep the date of the last index run and then query for all the changes since then, indexing new records and updating or removing existing ones. In order to keep track of removed entries in the database you will need a log of deleted records with the date each was removed. You can then query using that date to find what needs to be removed from the Lucene index.
Now simply run that job every 2 minutes or so.
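As a rough sketch of that job: the table/column names (Products.ModifiedAt, DeletedProducts.DeletedAt) and the connection string are invented, SearchEngine stands for whatever writer/searcher singleton you use (e.g. the one sketched in the previous answer), and in real code the last-indexed timestamp should be persisted somewhere durable:

    using System;
    using System.Data.SqlClient;
    using System.Threading;

    public static class IndexSyncJob
    {
        private const string ConnectionString =
            "Server=.;Database=MyDb;Integrated Security=true";       // placeholder
        private static DateTime _lastIndexedUtc = DateTime.MinValue; // persist this for real
        private static Timer _timer;

        public static void Start()
        {
            // run every 2 minutes or so, as suggested above
            _timer = new Timer(_ => ReindexChanges(), null,
                               TimeSpan.Zero, TimeSpan.FromMinutes(2));
        }

        private static void ReindexChanges()
        {
            var since = _lastIndexedUtc;
            var now = DateTime.UtcNow;

            using (var conn = new SqlConnection(ConnectionString))
            {
                conn.Open();

                // new or changed rows since the last run -> add/update in the index
                using (var cmd = new SqlCommand(
                    "SELECT Id, Name FROM Products WHERE ModifiedAt > @since", conn))
                {
                    cmd.Parameters.AddWithValue("@since", since);
                    using (var reader = cmd.ExecuteReader())
                        while (reader.Read())
                            SearchEngine.AddOrUpdate(reader.GetInt32(0).ToString(), reader.GetString(1));
                }

                // rows logged as deleted since the last run -> remove from the index
                using (var cmd = new SqlCommand(
                    "SELECT Id FROM DeletedProducts WHERE DeletedAt > @since", conn))
                {
                    cmd.Parameters.AddWithValue("@since", since);
                    using (var reader = cmd.ExecuteReader())
                        while (reader.Read())
                            SearchEngine.Delete(reader.GetInt32(0).ToString());
                }
            }

            _lastIndexedUtc = now;
            SearchEngine.RefreshSearcher();   // make the batch visible to searches
        }
    }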
That said, Lucene.net is not really suited to a web application; you should consider using ElasticSearch, Solr or AzureSearch: basically, a server that can handle load and multithreading better.

How do production Cassandra DBAs do table changes & additions?

I am interested in how a production Cassandra DBA's processes change when performing many releases over a year. During the releases, columns in tables would change frequently and so would the number of Cassandra tables, as new features and queries are supported.
In the relational DB, in production, you create the 'view' and BOOM you get the data already there - loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND have to write/run a script to copy all the required data into that table? Can a production level Cassandra DBA provide some pointers on their processes?
We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current formats. When changes are made, I update that file. This lets other developers know what the current tables look like, without having to use SSH or DevCenter. This also has the added advantage of giving us a file that allows us to restore our schema with a single command.
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.

How to update fields automatically

In my CouchDB database I'd like all documents to have an 'updated_at' timestamp added when they're changed (and have this enforced).
I can't modify the document with validation functions
update functions won't run unless they're called specifically (so it would be possible to update the document and not call the specific update function)
How should I go about implementing this?
There is no way to do this now without triggering _update handlers. It's a nice idea for tracking document change times, but it runs into problems with replication.
Replication works on top of the public API, which means:
If you enforce such a trigger you'll break replication, since it will be impossible to sync data as-is without modifying the document. Since the document gets modified, it receives a new revision, which can easily lead to an endless loop if you replicate data from database A to B and B to A in continuous mode.
Otherwise, if replication is kept working, there will always be a way to work around your trigger.
I can suggest one workaround: you can create a view which emits the current date as a key (or a part of it):
    function (doc) {
      emit(new Date(), null);
    }
This will assign current dates to all documents as soon as view generation gets triggered (which happens on the first request to it) and will reassign a new date on each update of a specific document.
Although the above should solve your issue, I would advise against using it for the reasons already explained by Kxepal: if you're on a replicated network, each node will assign its own dates. So taking this into account, the best I can recommend is to solve the issue on the client side and just post the documents with a date already embedded.
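For what it's worth, a minimal sketch of doing that from the client, here with C#'s HttpClient against CouchDB's plain HTTP API (database and document names are made up; include the current _rev in the body when updating an existing document):

    using System;
    using System.Net.Http;
    using System.Text;

    var http = new HttpClient { BaseAddress = new Uri("http://localhost:5984/") };

    // stamp updated_at on the client before saving
    var json = "{ \"name\": \"example\", \"updated_at\": \"" +
               DateTime.UtcNow.ToString("o") + "\" }";

    var response = await http.PutAsync(
        "mydb/mydoc",
        new StringContent(json, Encoding.UTF8, "application/json"));

    response.EnsureSuccessStatusCode();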
