How to add spark managed tables to Synapse git integration - apache-spark

Michał Wesołowski wants to draw more attention to this question:
Explain how to create tables in Synapse git mode using code, such that they appear in the Data panel. Some further tips on how to work with git integration in Synapse would be good as well.
I have been trying to come up with some ways of working and branching strategy for the git integration in Synapse.
The thing I don't understand about the git integration is how you add tables. I know you can use the Database "admin" panel to add tables, but I don't want to create them through the UI. Furthermore, you can only create tables there based on files that already exist, and I am not sure whether such a table is managed by Spark or treated as a Serverless SQL external table.
If I go ahead and create a table using spark like so:
spark.sql("CREATE TABLE database.Test (Column1 INT)")
Then this table doesn't appear in my Git working branch; it only appears in Live mode. I am fine with the table appearing in Live mode, as I know I have physically just created something, but why doesn't that table also appear in the Data panel when I switch to my branch in Git mode? This is counterintuitive and confusing, as users are creating tables while working on their git branches and are not able to see them appear in the Data panel.
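For context, the full sequence I run in a notebook looks roughly like this (the database and table names are just placeholders):

# Placeholder database/table names, run from a Synapse Spark notebook where `spark` is predefined.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.Test (Column1 INT)")

# The table is visible in the metastore (and in the Data panel in Live mode)...
spark.sql("SHOW TABLES IN demo_db").show()
# ...but nothing corresponding to it shows up in the Data panel in Git mode.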
I think that tables created in Git mode should appear in both live and git mode.
Any ideas on how to approach this?

Related

Can you drop Azure SQL Tables from Azure ML?

I am currently developing an Azure ML pipeline that is fed data and triggered using Power Automate and outputs to a couple of SQL Tables in Azure SQL. One of the tables that is generated by the pipeline needs to be refreshed each time the pipeline is run, and as such I need to be able to drop the entire table from the SQL database so that the only data present in the table after the run is the newly calculated data.
Now, at the moment I am dropping the table as part of the Power Automate flow that feeds the data into the pipeline initially. However, due to the size of the dataset, this means that there is a 2-6 hour period during which the analytics I am calculating are not available for the end user while the pipeline I created runs.
Hence, my question; is there any way to perform the "DROP TABLE" SQL command from within my Azure ML Pipeline? If this is possible, it would allow me to move the drop to immediately before the export, which would be a great improvement in performance.
EDIT: From discussions with Microsoft Support, it does appear that this is not possible due to how the current ML platform is designed. I'm not marking this as answered in case someone does solve it, but I'm adding this note so that people who come along with the same problem know.
Yes, you can do anything you want inside an Azure ML pipeline with a Python Script Step. I'd recommend using the pyodbc library; you'd just have to pass the credentials to your script as environment variables or script arguments.
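A minimal sketch of what that script step could look like, assuming the connection details arrive as environment variables (all names here are placeholders, not from your pipeline):

import os
import pyodbc

# Connection details supplied to the script step as environment variables (placeholder names).
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={os.environ['SQL_SERVER']};"
    f"DATABASE={os.environ['SQL_DATABASE']};"
    f"UID={os.environ['SQL_USER']};"
    f"PWD={os.environ['SQL_PASSWORD']}"
)

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
# Drop the output table right before re-exporting, so the analytics are unavailable
# for as short a time as possible.
cursor.execute("DROP TABLE IF EXISTS dbo.AnalyticsOutput")  # placeholder table name
conn.commit()
conn.close()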

Azure Data Sync: Failed to perform data sync operation: Table do not have clustered index

I'm getting the error shown in this screenshot: https://i.stack.imgur.com/EiSRL.png
I realize there is an earlier post about this topic with an answer that seems to have been suitable for at least 2 people so far. Azure Data Sync Clustered Index Error
However, I have tried implementing the suggestion and can't get the sync to accept the non-clustered index on the Azure side. I have tried refreshing the schema on both sides for the Azure tables within the sync group. I tried dropping the tables and letting Azure provision them on its own. Nothing has worked for me yet. Does anyone know if this option should still work, or is there something I am overlooking?
(Screenshots: HubTableStructure, MemberTableStructure, MemberTestTable, HubTestTable)
From the images you added to your question we do not see a clustered index created yet. The clustered index is a requirement to use SQL Data Sync.
A Microsoft engineer is asking if you recreated the table with the clustered index on both the hub and the member database. After that, he suggests you need to remove the table and then add it back to pick up the new changes; alternatively, recreating the sync group may be easier.
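For reference, this is a sketch of the kind of change involved, with placeholder connection, table, column and constraint names (the same T-SQL could be run from any client; pyodbc is used here only for illustration):

import pyodbc

# Placeholder connection string and names; the same change has to be applied to both
# the hub and the member database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."
)
cursor = conn.cursor()

# Replace the existing non-clustered primary key with a clustered one.
cursor.execute("ALTER TABLE dbo.MemberTestTable DROP CONSTRAINT PK_MemberTestTable")
cursor.execute(
    "ALTER TABLE dbo.MemberTestTable "
    "ADD CONSTRAINT PK_MemberTestTable PRIMARY KEY CLUSTERED (Id)"
)
conn.commit()
conn.close()

# Afterwards, remove the table from the sync group and add it back (or recreate the
# sync group) so that Data Sync picks up the new schema.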

Cassandra: Adding new denormalized query tables for existing keyspace/data

From the beginning of an application, you plan ahead and denormalize data at write-time for faster queries at read-time. Using Cassandra "BATCH" commands, you can ensure atomic updates across multiple tables.
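For reference, such a write path might look roughly like this with the Python driver (placeholder keyspace, tables and values):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

# Placeholder keyspace and tables: both tables carry the same columns but are partitioned
# differently; a logged batch makes the two inserts atomic (all or nothing).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

by_user = session.prepare(
    "INSERT INTO purchases_by_user (user_id, product_id, purchase_id, amount) VALUES (?, ?, ?, ?)")
by_id = session.prepare(
    "INSERT INTO purchases_by_id (purchase_id, user_id, product_id, amount) VALUES (?, ?, ?, ?)")

batch = BatchStatement()
batch.add(by_user, ("user-1", "prod-7", "p-42", 10))
batch.add(by_id, ("p-42", "user-1", "prod-7", 10))
session.execute(batch)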
But, what about when you add a new feature, and need a new denormalized table? Do you need to run a temporary script to populate this new table with data? Is this how people normally do it? Is there a feature in Cassandra that will do this for me?
I can't comment yet, hence the new answer. The answer is yes: you'd have to write a migration script and run it when you deploy your software upgrade with the new feature. That's a fairly typical DevOps release process, in my experience.
I've not seen anything like Code First Migrations (for MS SQL Server & Entity Framework) for Cassandra, which would generate the migration script automatically for you.
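As a rough sketch of such a one-off backfill, reusing the placeholder schema from the question above and assuming the new feature needs a purchases_by_product table (in practice you'd add retries, and for a very large table you might split the work by token range or use Spark):

from cassandra.cluster import Cluster

# Placeholder keyspace and tables: copy existing rows into the new denormalized table.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

insert = session.prepare(
    "INSERT INTO purchases_by_product (product_id, user_id, purchase_id, amount) VALUES (?, ?, ?, ?)")

# The driver pages through the full result set automatically.
for row in session.execute("SELECT user_id, product_id, purchase_id, amount FROM purchases_by_user"):
    session.execute(insert, (row.product_id, row.user_id, row.purchase_id, row.amount))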

Azure Web Site Migrations & Concurrency

I have two Azure Websites set up - one that serves the client application with no database, another with a database and WebApi solution that the client gets data from.
I'm about to add a new table to the database and populate it with data using a temporary Seed method that I only plan on running once. I'm not sure what the best way to go about it is though.
Right now I have the database initializer set to MigrateDatabaseToLatestVersion and I've tested this update locally several times. Everything seems good to go but the update / seed method takes about 6 minutes to run. I have some questions about concurrency while migrating:
What happens when someone performs CRUD operations against the database while business logic and tables are being updated in this 6-minute window? I mean - the time between when I hit "publish" from VS, and when the new bits are actually deployed. What if the seed method modifies every entry in another table, and a user adds some data mid-seed that doesn't get hit by this critical update? Should I lock the site while doing it just in case (far from ideal...)?
Any general guidance on this process would be fantastic.
Operations like creating a new table or adding new columns should have only minimal impact on the performance and be transparent, especially if the application applies the recommended pattern of dealing with transient faults (for instance by leveraging the Enterprise Library).
Mass updates or reindexing could cause contention and affect the application's performance or even cause errors. Depending on the case, transient fault handling could work around that as well.
Concurrent modifications to data that is being upgraded could cause problems that would be more difficult to deal with. These are some possible approaches:
Maintenance window
The simplest and safest approach is to take the application offline, back up the database, upgrade the database, update the application, test, and bring the application back online.
Read-only mode
This approach avoids making the application completely unavailable, by keeping it online but disabling any feature that changes the database. The users can still query and view data while the application is updated.
Staged upgrade
This approach is based on carefully planned sequences of changes to the database structure and data and to the application code so that at any given stage the application version that is online is compatible with the current database structure.
For example, let's suppose we need to introduce a "date of last purchase" field to a customer record. This sequence could be used:
Add the new field to the customer record in the database (without updating the application). Set the new field default value as NULL.
Update the application so that for each new sale, the date of last purchase field is updated. For old sales the field is left unchanged, and the application at this point does not query or show the new field.
Execute a batch job on the database to update this field for all customers where it is still NULL (see the sketch after this list). A delay could be introduced between updates so that the system is not overloaded.
Update the application to start querying and showing the new information.
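For step 3, the batch job could look something like this sketch, assuming direct SQL access via pyodbc and placeholder connection, table and column names:

import time
import pyodbc

# Placeholder connection string and schema: backfill the new LastPurchaseDate column
# in small batches, pausing between batches so production is not overloaded.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."
)
cursor = conn.cursor()

while True:
    cursor.execute("""
        UPDATE TOP (500) c
        SET c.LastPurchaseDate = s.LastSaleDate
        FROM Customers AS c
        JOIN (SELECT CustomerId, MAX(SaleDate) AS LastSaleDate
              FROM Sales GROUP BY CustomerId) AS s
          ON s.CustomerId = c.CustomerId
        WHERE c.LastPurchaseDate IS NULL
    """)
    updated = cursor.rowcount
    conn.commit()
    if updated == 0:
        break          # nothing left to backfill
    time.sleep(1)      # small delay between batches

conn.close()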
There are several variations of this approach, such as the concept of "expansion scripts" and "contraction scripts" described in Zero-Downtime Database Deployment. This could be used along with feature toggles to change the application's behavior dynamically as the upgrade stages are executed.
New columns could be added to records to indicate that they have been converted. The application logic could be adapted to deal with records in the old version and in the new version concurrently.
The Entity Framework may impose some additional limitations on these options, because it generates the SQL statements on behalf of the application, so you would have to take that into consideration when planning the stages.
Staging environment
Changing the production database structure and executing mass data changes is risky business, especially when it must be done in a specific sequence while data is being entered and changed by users. Your options to revert mistakes can be severely limited.
It would be necessary to do extensive testing and simulation in a separate staging environment before executing the upgrade procedures on the production environment.
I agree with the maintenance window idea from Fernando. But here is the approach I would take given your question.
Make sure your database is backed up before doing anything (I am assuming it's SQL Azure)
Put up a maintenance page on the Client Application
Run the migration against your database via Visual Studio (I am assuming you are doing this through the console) or via a unit test
Publish the website/web api websites
Verify your changes.
The main thing with the seed method via Entity Framework is that it's easy to get it wrong, and without a proper backup while running against prod you can get yourself in trouble real fast. I would probably run it through your test database/environment first (if you have one) to verify that what you want is actually happening.

aspnet_regsql and deployment to Azure

I'm pretty new to Azure and trying to work on deploying an already existing MVC 3 website (I'm late to the project).
It has membership information (where the tables should be genned from aspnet_regsql) and it links those tables to application specific tables. To get it into a working state I need to insert some form of "default data" as the code does (unfortunately) make some assumptions about what should be in the database.
No bother: I have an app that creates a default database and inserts the required data, which I can then import into Azure. Except this doesn't work, as Azure demands clustered indexes. This is because aspnet_regsql creates some auth table keys as non-clustered, so I'm now left having to alter these tables as part of the process to make the primary keys clustered.
I was just wondering if aspnet_regsql had been superseded somehow, given that Azure demands clustered indexes? Am I missing a trick here, or is writing a script to modify the clustering of these indexes the sensible approach?
Found the solution elsewhere here:
http://support.microsoft.com/kb/2006191/de
If you use the Universal Providers, you don't need the scripts.
Check out Hanselman's post. The Universal Providers will manage the database creation if you are working with SQL Server, Compact Edition, or Windows Azure Database.
There are a lot of references to updated scripts including some on my own blog that are no longer needed.
