ArangoDB comparison of documents in different databases

I'm interested in whether it's possible to compare two documents with the same "_id" (same collection name and "_key") that are stored in different databases.
My use case is a custom "map / layout engine" that is "mainly" fed by "automatic import / conversion jobs" from an external geo-data system.
So far that works fine.
In some cases, however, it's necessary to manually adjust e.g. the
"x/y" coordinates of some objects to make them more usable.
When the import job is run again (e.g. to fetch the latest data),
all manual adjustments are lost, as they're simply overwritten by the
"auto" data.
Therefore I'm thinking of a system setup consisting of several identically
structured ArangoDB databases, used for different "stages" of the
data lifecycle, like:
"staging" - newly "auto imported" data is placed here.
"production" - the "final data" that's presented to the user
including all the latest manual adjustments is stored here.
The corresponding (simplified) lifecycle would look like this:
1. Auto-import into "staging".
2. Compare and import all manual adjustments from "production" into "staging".
3. Deploy the "merged" contents from 1. and 2. as the new "production" version.
So, this topic is all about step 2's "comparison phase" between the "production" and the "staging" data values.
In SQL I'd express it with something like this:
SELECT
  layoutA.x, layoutA.y
FROM databaseA.layout AS layoutA
JOIN databaseB.layout AS layoutB ON (layoutA.id = layoutB.id)
WHERE
  ...
Thanks for any hints on how to solve this in ArangoDB using an AQL query or a FOXX service!
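As far as I know, a single AQL query only sees collections inside the database it runs in, so the cross-database comparison has to happen outside of a single query, e.g. in application code that connects to both databases. A minimal client-side sketch with the python-arango driver (the "layout" collection and the "x"/"y" fields come from your example; host and credentials are placeholders):

from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")

# Connect to both databases (credentials are placeholders)
staging = client.db("staging", username="root", password="secret")
production = client.db("production", username="root", password="secret")

staging_layout = staging.collection("layout")
production_layout = production.collection("layout")

# Compare every production document with its staging counterpart
# (same collection name + same _key => same _id)
for prod_doc in production_layout.all():
    stag_doc = staging_layout.get(prod_doc["_key"])
    if stag_doc is None:
        continue  # the object no longer exists in the fresh import
    if (prod_doc.get("x"), prod_doc.get("y")) != (stag_doc.get("x"), stag_doc.get("y")):
        # Difference found; whether it counts as a manual adjustment that
        # should win over the freshly imported value is your merge rule.
        staging_layout.update({"_key": stag_doc["_key"],
                               "x": prod_doc["x"],
                               "y": prod_doc["y"]})

The same comparison could also be done from an arangosh script; the key point is that the two databases are read through separate connections/contexts and compared in code rather than in one AQL statement.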

Hypothetically, if you had a versioning graph database handy, you could do the following:
On first import, insert new data creating a fresh revision R0 for each inserted node.
Manually change some fields of a node, say N in this data, giving rise to a new revision of N, say R1. Your previous version R0 is not lost though.
Repeat steps 1 and 2 as many times as you like.
Finally, when you need to show this data to the end user, use custom application logic to merge as many previous versions as you want with the current version, doing an n-way merge rather than a 2-way merge.
If you think this could be a potential solution, you can take a look at CivicGraph, which is a version control layer built on top of ArangoDB.
Note: I am the creator of CivicGraph, and this answer could qualify as a promotion for the product, but I also believe it could help solve your problem.
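Independent of CivicGraph, the same idea can also be hand-rolled: keep each import as a new revision of a document and merge at read/deploy time. A rough sketch of the simplest (2-way) case of that merge step (field and key names are hypothetical; a real n-way merge would also need conflict rules):

def merge_revisions(auto_doc, manual_doc, protected_fields=("x", "y")):
    """Merge a freshly imported revision with the previous, manually
    adjusted revision: manual values win for the protected fields."""
    merged = dict(auto_doc)
    for field in protected_fields:
        if field in manual_doc:
            merged[field] = manual_doc[field]
    return merged

# R1: previous, manually adjusted revision; R2: next auto import
r1 = {"_key": "node1", "x": 12, "y": 21, "name": "Pump A"}
r2 = {"_key": "node1", "x": 10, "y": 25, "name": "Pump A2"}
print(merge_revisions(r2, r1))
# {'_key': 'node1', 'x': 12, 'y': 21, 'name': 'Pump A2'}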

Related

Is there a way to stop Azure ML throwing an error when exporting zero lines of data?

I am currently developing an Azure ML pipeline that, as one of its outputs, maintains a SQL table holding all of the unique items that are fed into it. There is no way to know in advance whether the data fed into the pipeline consists of new unique items or repeats of previous items, so before updating the table it pulls the data already in that table and drops any of the new items that already appear there.
However, because of this self-reference there are cases where zero new items are found, and so there is nothing to export to the SQL table. When this happens Azure ML throws an error, as having zero lines of data to export is treated as a failure. In my case, however, this is expected behaviour and absolutely fine.
Is there any way for me to suppress this error, so that when it has zero lines of data to export it just skips the export module and moves on?
It sounds as if you are struggling to orchestrate a data pipeline because the orchestration is happening in two places. My advice would be to either move more orchestration into Azure ML, or make the separation between the two greater. One way to do this would be to have a regular export to blob storage of the table you want to use for training. Then you can use a Logic App to trigger a pipeline whenever a non-empty blob lands in that location.
This issue has been resolved by an update to Azure Machine Learning; you can now run pipelines with a flag set to "Continue on Failure Step", which means that steps following the failed data export will continue to run.
This does mean you will need to design your pipeline so that its downstream modules can handle upstream failures; this must be done very carefully.
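If that flag is not an option (or you would rather not let a step fail at all), another workaround is to put the export behind a small script step that simply skips the write when there is nothing new. A rough sketch, assuming you can run a Python step, that the de-duplicated rows arrive as a pandas DataFrame, and that the connection string and table name are placeholders:

import pandas as pd
import sqlalchemy

def export_new_items(new_items: pd.DataFrame,
                     connection_string: str,
                     table_name: str = "unique_items") -> None:
    """Append new rows to the target table, or do nothing when empty."""
    if new_items.empty:
        print("No new unique items this run - skipping export.")
        return
    engine = sqlalchemy.create_engine(connection_string)
    new_items.to_sql(table_name, engine, if_exists="append", index=False)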

Azure Data Factory Data Migration

Not really sure whether this is an explicit question or just a request for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one not so often.
What I want to do is prepare a Data Factory pipeline that will grab the data from all 3 DB locations, combine it based on the common attributes, and produce a new dataset. Then push the data within this dataset to another SQL Server DB.
I'm a bit unclear on how this is to be done within Data Factory. There is a copy activity, but it only works on a single input dataset, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to what I want to do.
I did find that there is a special activity called a Custom Activity that is in effect a user-defined activity that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure if it is the optimal solution.
On top of that, I am also unclear about how the merging of the 3 data sources would work. Joining data from the 3 different sources is required, but I don't know how to do this if the datasets are just snapshots of the originating source data, which makes me think data could end up missing. I'm not sure if publishing some of the data somewhere would be required, but that seems like it would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS, but what you are trying to do is fairly common for either of these integration tools.
Your ADF diagram should look something like:
1. Define your 3 data sources as ADF datasets on top of a corresponding linked service.
2. Build a pipeline that brings the information from SQL Server into a temporary data store (an Azure Table, for example).
3. Build 2 pipelines that each take one of your NoSQL datasets and run a function to update the temporary data store, which is the output.
4. Build a pipeline that brings all your data from the temporary data store into your other SQL Server.
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break the task down into logical jobs and you should have no problem coming up with a solution.
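If you end up prototyping the combine step yourself (for example inside a Custom Activity, or just offline to validate the logic), it boils down to joining the three snapshots on their common attributes. A rough pandas sketch, with all column names made up for illustration:

import pandas as pd

# Hypothetical snapshots of the three sources after landing as tabular data
nosql_a = pd.DataFrame({"item_id": [1, 2], "prop_a": ["x", "y"]})
nosql_b = pd.DataFrame({"item_id": [1, 2], "prop_b": ["m", "n"]})
sql_rows = pd.DataFrame({"item_id": [1, 2], "price": [9.5, 3.2]})

# Combine on the common attribute to produce the new dataset
combined = (nosql_a
            .merge(nosql_b, on="item_id", how="inner")
            .merge(sql_rows, on="item_id", how="inner"))

# Push the result to the target SQL Server DB (connection string is a placeholder)
# engine = sqlalchemy.create_engine("mssql+pyodbc://...")
# combined.to_sql("combined_items", engine, if_exists="replace", index=False)
print(combined)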

How do production Cassandra DBAs do table changes & additions?

I am interested in how a production DBA's processes change when using Cassandra and performing many releases over a year. During those releases, columns in tables would change frequently, and so would the number of Cassandra tables, as new features and queries are supported.
In a relational DB, in production, you create the 'view' and BOOM, you get the data already there, loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND have to write/run a script to copy all the required data into that table? Can a production level Cassandra DBA provide some pointers on their processes?
We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current formats. When changes are made, I update that file. This lets other developers know what the current tables look like, without having to use SSH or DevCenter. This also has the added advantage of giving us a file that allows us to restore our schema with a single command.
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
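For the export/import part, a small driver script is often enough. A minimal sketch with the Python cassandra-driver that copies rows from the old table into its replacement (keyspace, table and column names are hypothetical):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# The new table is assumed to exist already, created from the versioned .cql file
insert = session.prepare(
    "INSERT INTO users_v2 (user_id, email, created_at) VALUES (?, ?, ?)")

# Stream rows out of the old table and write them into the new one
for row in session.execute("SELECT user_id, email, created_at FROM users"):
    session.execute(insert, (row.user_id, row.email, row.created_at))

cluster.shutdown()

For anything beyond a small table you would batch or parallelise the writes, or reach for cqlsh COPY / a bulk loader instead of a row-by-row loop.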
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.

SSRS: How can I run dataset1 first in datasource1 then run dataset2 in datasource2

I have 2 data sources (db1, db2) and 2 datasets. The 2 datasets are stored procedures, one from each data source.
Dataset1 must run first to create a table for dataset2 to update and show (dataset1 will show results too).
Because the data of that table must be based on some tables in db1, the stored procedure creates a table in db2 by using a linked server.
I have searched online and tried the "single transaction" setting on the data source, but it shows an error in dataset1 with no detail.
Is there any way to do it? I want to generate an Excel file with 2 sheets for this result.
Check out this post.
The default behavior of SSRS is to run the datasets at the same time. They are run in the order in which they are presented in your rdl (top down when looking at it in the report data area). Changing the behavior of a single data source with multiple datasets is as simple as clicking a checkbox in the data source dialog.
With multiple data sources it is a little bit more tricky!
Here is the explanation from the MSDN Blog posted above:
Serializing dataset executions when using multiple data sources:
Note that datasets using different data sources will still be executed in parallel; only datasets of the same data source are serialized when using the single transaction setting. If you need to chain dataset executions across different data sources, there are still other options to consider.
For example, if the source databases of your data sources all reside on the same SQL Server instance, you could use just one data source to connect (with single transaction turned on) and then use the three-part name (catalog.schema.object_name) to execute queries or invoke stored procedures in different databases.
Another option to consider is the linked server feature of SQL Server, and then use the four-part name (linked_server_name.catalog.schema.object_name) to execute queries. However, make sure to carefully read the documentation on linked servers to understand its performance and connection credential implications.
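If you want to sanity-check the three-part / four-part names outside of SSRS before pasting them into your dataset queries, a quick script against the shared data source's server works. A minimal sketch with pyodbc (server, database and procedure names are placeholders):

import pyodbc

# autocommit so any work table created by the first procedure is persisted
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=db1;Trusted_Connection=yes;",
    autocommit=True)
cursor = conn.cursor()

# Three-part names: both procedures reachable through one data source / connection
cursor.execute("EXEC db1.dbo.usp_build_work_table")
cursor.execute("EXEC db2.dbo.usp_read_work_table")

# Four-part name via a linked server, if db2 lives on another instance
# cursor.execute("EXEC linked_srv.db2.dbo.usp_read_work_table")

conn.close()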
This is an interesting question, and while I think there might be another way of doing it, it would take a bit of time playing around with your datasets, plus more information on how your data sources are set up.
Hope this helps though.

Sync Framework Scope Versioning

We're currently using the Microsoft Sync Framework 2.1 to sync data between a cloud solution and thick clients. Syncing is initiated by the clients and is download only. Both ends are using SQL Server and I'm using the SqlSyncScopeProvisioning class to provision scopes. We cannot guarantee that the clients will be running the latest version of our software, but we have full control of the cloud part and this will always be up to date.
We're considering supporting versioning of scopes so that if, for example, I modify a table to have a new column then I can keep any original scope (e.g. 'ScopeA_V1'), whilst adding another scope that covers all the same data as the first scope but also with the new column (e.g. 'ScopeA_V2'). This would allow older versions of the client to continue syncing until they had been upgraded.
In order to achieve this I'm designing data model changes in a specific way so that I can only add columns and tables, never remove, and all new columns must be nullable. In theory I think this should allow older versions of my scopes to carry on functioning even if they aren't syncing the new data.
I think I'm almost there but I've hit a stumbling block. When I provision the new versions of existing scopes I'm getting the correctly versioned copies of my SelectChanges stored procedure, but all the table specific stored procedures (not specific to scopes - i.e. tableA_update, tableA_delete, etc) are not being updated as I think the provisioner sees them as existing and doesn't think they need updated.
Is there a way I can get the provisioner to update the relevant stored procedures (_update, _insert, etc) so that it adds in the new parameters for the new columns with default values (null), allowing both the new and old versions of the scopes to use them?
Also if I do this then when the client is upgraded to the newer version, will it resync the new columns even though the rows have already been synced (albeit with nulls in the new columns)?
Or am I going about this the completely wrong way? Is there another way to make scopes backwards compatible with older versions?
Sync Framework out of the box doesn't support updating scope definitions to accommodate schema changes.
Creating a new scope via SetCreateProceduresForAdditionalScopeDefault will only create a new scope and a new _selectchanges stored procedure, but will re-use all the other stored procedures, tracking tables, triggers and UDTs.
I wrote a series of blog posts on what needs to be changed to accommodate schema changes here: http://jtabadero.wordpress.com/2011/03/21/modifying-sync-framework-scope-definition-part-1-introduction/
The subsequent posts show some ways to hack the provisioning scripts.
To answer your other question about whether the addition of a new column will resync that column or the row: the answer is no. First, change tracking is at the row level. Second, adding a column will not fire the triggers that update the tracking tables that indicate whether there are changes to be synced.
