Orleans - Bulk Data Import - persistent-storage

I am currently learning about MS Orleans for our organization.
I understand that Orleans Grains will remain syncronized with the DB as long as all DB updates are coming via the grains.
But what happens if there is some bulk process process (like processing of a data file) that updates / inserts / deletes records in DB?
Is there some process or pattern to use with Orleans to cater for this?
Or do we need to process all bulk process via Grains?
In case we process bulk operations via grains - do we go about it by updating each grain (seems very expensive if each grain updates itself into DB) or is there some bulk pattern to use to force all affected grains to 'refresh'?
Answer may be obvious. I didn't find anything in documentation about these scenarios.
We would be using Orleans as On-Premises installation with MS-SQL server.
Edit:
I am referring to a process that updates data of N grains. Having a single call that updates 1000 records is much better for sql then 1000 calls that update one record.. A concrete example would be stock update: each product stock would be a grain. Every 15 minutes or so, a file would arrive from 3rd party informing about stock quantity changes that happened outside of the application. This should update in the db and reflect in the grains. File may have 10k's records...

You have a couple of options.
1) Upload via grains - the grains will cache the data and also store in the DB. This partially defeats your goal of efficient DB upload, so may not be what you want.
2) Bulk upload directly into the DB, but use grains to access and process/serve the data. The grains will read the data from the DB upon first request, cache it, serve for further requests. Also, all future data updates will go via grains. This is usually the most common pattern.
3) If you also need periodic data processing, even in the absence of serving requests, then after bulk uploading the grain, you can "bulk-kick" grains to start periodic processing. In this option you will write a controller loop (client logic for example) that will just call "Init" into the set of grains, one by one (but in parallel: call Init on X (X ~ 100) in parallel and await them all together) and then do the next batch. Grains will start a timer or a reminder upon Init.

Related

Parallel read and write to postgres database slows down application (backend)

I have a backend in nestjs using typeorm and postgres. This backend saves and reads data frequently from the database. In this database we are dealing with row counts of 10k + at times that needs to get updated and saved or created.
In this particular case where I need some brain juice I have a table (lets call it table a)
the backend fetches data from table a every few seconds
the content in table A needs to get updated frequently (properties and values overwritten). I am doing this updating task from a several application backend solely for this use-case.
Example case
Table A holds 100K records
update-service splits these 100K records into chunks of 5 and parallell updates 25K records each. While doing so, the main application that retrieves data from the backend slows down.
What is the best way to have performant read and write in parallel? I am assuming the slow down comes from locks (main backend retrieves data while update service tries to update) but I am not sure as I have not that much experience working with databases.
Don't assume, assert.
While you experiencing bad performance, check how the operating system's resources are doing; in this case, mostly CPU and disk. If one of them is maxed out, you know what is going on, and you either have to reduce the degree of parallelism or make the system stronger.
It is also interesting to look at wait events in PostgreSQL:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;
That will show I/O related events if you are running out of disk bandwidth, but it will also show database-internal contention that you can potentially hit with very high degrees of parallelism.

How to best stage large amounts of data with Hibernate/JPA?

How can I best stage large amounts of data for migration into our database using Hibernate efficiently? Performance when dealing with >25K records that are 100+ columns are not ideal.
Let me explain:
Background
I'm working for a large company that operates around the world. I've been tasked with leading a team (at least for backend) to create a full stack application that allows for various levels of management to perform their tasks. The current tech stack for backend is Java, Spring Boot, Hibernate, and PostgreSQL. Management would like to upload Excel files to our application and have our application parse them so we can refresh the data in our database.
Unfortunately, these files range from 25K to 50K records. We're aware that these Excel files are generated using SQL queries from Excel. However, we are not permitted to access the database with this data directly. The security is very tight and will not permit us access to any APIs, DB calls, etc. to work around Excel. Due to memory constraints and scalability concerns, we're using SAX parsing to keep a low footprint. Once we parse the Excel files, we're mapping them to a Hibernate entity that represents a staging table. Then we're migrating data from it to our other tables.
Currently to stage 25K records and migrate all the data to our other tables takes 15 minutes, which is unacceptable in the eyes of management. Especially, since this will need to be done on a daily basis.
Things I've tried
Enabling batch processing in Hibernate by following Vlad's answer here. This knocked maybe 20 seconds off the overall time for staging.
Rewriting criteria and other queries for fetching data.
Reducing amount of data to process (most fields are required so the amount can't be too heavily reduced).
Indexing important columns in both the staging and destination tables. I'm doing the indexing as part of schema generation.
Optimize parts of code that clean parsed data of imperfections.
I cannot post code due to NDA
Summary of Constraints
This app needs strong support for generating reports on related data (one of the reasons we went with RDBMS. Also, the data fits well into a relational model).
Must maintain a complete audit history of all records (currently using Hibernate Envers).
We have to approve any new dependency/library through the company's cybersecurity team. This can result in days of lost production while we wait for approval. It's not ideal to request new dependencies for the project.
There are no ways of working around the Excel files at this time. An API call or simple database query would be nice, but that's not an option to us for security reasons.
Scalability is a growing concern. Another team under this project has to parse an Excel file of 50K rows with 100 rows. All of this is only data for the USA. The project owner has said the company eventually wants to expand this app's management capabilities abroad.
My Thoughts
Purely regarding the staging issue, I think it's best to get rid of the Hibernate entities responsible for staging. I'll rewrite the migration of staged data into our live tables in SQL using stored procedures. Despite it being vendor-specific (to my knowledge, anyway) I'll use Postgres' COPY command to do the heavy lifting with the large amounts of rows. I can rewrite the parser to direct data to a CSV or other delimited file instead. The only issue I have then is how to migrate the data to tables that use Hibernate sequences and generators. I haven't figured out how to synchronize Hibernate's sequences after a manual update to the database like that. It likes the throw errors about duplicate primary keys until it comes across an ID in the sequence that's not used. But I feel that's another question entirely.
Edit 1:
I should clarify. The 15 minutes is the total time for all of staging. This includes staging and migration. Just the staging of the 25K records takes around 1:30, which also isn't ideal. I've run session metrics a few times and get around the following numbers for Spring Data persisting the 25K records:
2451000 nanoseconds spent acquiring 1 JDBC connection;
0 nanoseconds spent releasing 0 JDBC connections;
96970800 nanoseconds spent preparing 24851 JDBC statements;
9534006000 nanoseconds spent executing 24849 JDBC statements;
21666942900 nanoseconds spent executing 830 JDBC statements;
23513568700 nanoseconds spent executing 2 flushes (flushing a total of 49696 entities and 0 collections)
211588700 nanoseconds spent executing 1 partial-flushes (flushing a total of 24848 entities and 24848 collections)
For this specific case, I'm staging the roughly 25K entities and then using a stored procedure to move only employee data from staging to live tables (a small fraction of what makes up the 15 total minutes). That procedure seems to run instantly. But there's other data that we have to determine via joins, group by statements, etc., which appear to be costly. I'm just not sure why it's taking Spring Data so long to persist that many records when it would take pure SQL significantly less.

DocumentDB: How to run a query without timing out

I am new to the documentDb. I wrote a stored procedure that checks all records and update them under certain circumstances.
Current scenario:
It would run 100 records at a time, updates them and after running few times( taking 100 records at a time and updating) it is timing out.
Expectation
Run the script on all the records without timing out.
The document has close to a million records. So, running the same script multiple times manually is not a the way I am looking for.
Can anyone please advise me how I can achieve that?
tl;dr; Keep calling the sproc with the query continuation token being passed back and forth.
A few thoughts:
There is no capacity of RUs for collections that will allow you to do all million in one call to the sproc.
Sprocs run in isolation on a single replica. This means that they can be transactional but their use will have lower throughput than a regular query that can use all replicas to satisfy the request, so unless you need it to be in a sproc, I recommend using direct queries for reads that don't need to be transactional with writes. Even then, with a million documents, your queries will max out and you'll have to run the query again with a continuation token.
If you must use a sproc... As you are probably aware since you have done the 100 at a time thing, each query returns a continuation token. You can actually add that to the package that you send back from your sproc when it times out. Then you can pass that back into another call to the same sproc and write your sproc to pick up where you left off. The documentdb-utils library for node.js automatically re-calls the sproc until done as long as you follow this pattern for writing your sprocs. If you are using node.js, you could use that (but it has not yet been upgraded to support partitioned collections) or you could write the equivalent in whatever platform you are using.

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo3.0.2 with wiredtiger, accepting both read and write ops. It collects logs from client, so write load is decent. Once a day I want to process this logs and calculate some metrics using aggregation framework, data set to process is something like all logs from last month and all calculation takes about 5-6 hours.
I'm thinking about splitting write and read to avoid locks on my collections (server continues to write logs while i'm reading, newly written logs may match my queries, but i can skip them, because i don't need 100% accuracy).
In other words, i want to make a setup with a secondary for read, where replication is not performing continuously, but starts in a configured time or better is triggered before all read operations are started.
I'm making all my processing from node.js so one option i see here is to export data created in some period like [yesterday, today] and import it to read instance by myself and make calculations after import is done. I was looking on replica set and master/slave replication as possible setups but i didn't get how to config it to achieve the described scenario.
So maybe i wrong and miss something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondaries optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

alternative to polling database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table and a python program running on each of the windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as they aren't working most of the time. Clients hpoll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
X
Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't made to do a push really (not easily anyway). There are things that you could use to push data out (replication service, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (sp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common, because they are, well... simple. In addition they are also low overhead, remotely stable, and error-tolerant. The down side is that they can hammer the database into dust if not carefully controlled.
A message queue might work well, as they're usually setup to be able to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row: the latest job ID. Then clients just need to compare that to their last ID to see if they need to run a full poll against the real table. This way the overhead should be greatly reduced, if it's an issue now.
Unlike Postgres and SQL Server (or object stores like CouchDb), MySQL does not emit database events. However there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
You might think that MyISAM is the best choice for the changes table since it's mostly performing writes (or even MEMORY if you don't need to persist the events between database server outages). However, keep in mind that both Memory and MEMORY and MyISAM have only full-table locks so your trigger on an InnoDB table might hit a bottle neck when performing an insert into a MEMORY and MyISAM table. You may also require InnoDB for the changes table if you're using a ON DELETE CASCADE with another InnoDB table (requires both tables to use the same engine).
You might also use SHOW TABLE STATUS to check the last update time of you changes table to check if there's something to perform. This feature wont work for InnoDB tables.
These articles describes in more depth some of alternative ways to implement queues in MySQL and even avoid polling!
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you

Resources