Inserting mappings for a large amount of existing data - azure

I am currently testing inserting a large number of mappings for existing data into a shard map using Elastic Scale. It turns out the whole process is time-consuming: it inserts around 10 mappings/second. Is there any way to speed up the insertion, e.g. by inserting batches of mappings or directly via stored procedures?

We know from our own testing that inserting mappings is time consuming. Here are a couple of options I'd suggest you try:
You can run multiple parallel threads inserting the mappings.
You can increase the service level objective for the shard map database while you do the bulk load.
I understand why you would want to load mappings in bulk for test scenarios. However, I am not sure I understand the reason why you will need so many mappings that this becomes a problem. Could you explain a bit more?
Thanks,
Torsten

Coming back to this question: we have now published the ShardManagement PowerShell module along with some sample scripts here: https://gallery.technet.microsoft.com/scriptcenter/Azure-SQL-DB-Elastic-731883db. This should help you set and query the existing range/list mappings quickly.

Related

Node JS architecture to handle huge amount of Data returned by DB in better possible way

We have a Node.js application and a SQL Server database, and there are a couple of badly written queries with a lot of inner joins.
Problem and Use Case
We have a use case of generating reports (15-20 thousand of them) in PDF / Excel format. The underlying query has a lot of joins and takes almost 8-9 seconds, because 2-3 of the tables it uses have a few million rows each.
For report generation we don't need real-time data; data that is a day or a week old is fine.
What I'm looking for: a few suggestions to handle this situation in the best possible way.
We have a few options on the table:
Dump the data from multiple queries into a separate table and use that (we are planning to do this periodically with the help of a scheduler or something along similar lines)
Use a time series DB to store the result of the query with the help of a scheduler, and use it at the time of report generation.
Limit report generation to at most the last year of data.
Implement sharding in SQL Server
And yes, improving the query is also something we are working on, but I think there is scope to make things better, and that's why I'm reaching out here for a few more suggestions.
Denormalization is a tried and true method of speeding up reporting. As Preben suggested, creating an indexed view in SQL Server is an efficient way to do this with minimal plumbing. Alternatively, it may be worth thinking about whether a data warehouse implementation is needed for future queries.
If this is a one-off issue, put together your indexed view (pay attention to the requirements), and move on. If this is the first of many reports that you need to optimize, think about creating a more substantial solution.
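For the "dump into a separate table on a schedule" option from the question, a minimal sketch might look like this. It uses Python's built-in sqlite3 purely as a stand-in for SQL Server so it runs anywhere. The `customers`, `orders` and `report_orders` tables and their columns are made up for illustration, and the scheduler that would call `refresh_report_table` periodically is left out.

```python
import sqlite3

def refresh_report_table(conn):
    """Drop and rebuild the denormalized reporting table from the source tables."""
    conn.execute("DROP TABLE IF EXISTS report_orders")
    conn.execute(
        """
        CREATE TABLE report_orders AS
        SELECT c.custid, c.name,
               COUNT(o.orderid) AS order_count,
               SUM(o.amount)    AS total_amount
        FROM customers c
        JOIN orders o ON o.custid = c.custid
        GROUP BY c.custid, c.name
        """
    )
    conn.commit()

# Tiny in-memory data set so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (custid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, custid INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
    """
)
refresh_report_table(conn)

# Reports now read the pre-joined table instead of re-running the join.
rows = conn.execute("SELECT * FROM report_orders ORDER BY custid").fetchall()
```

The expensive join runs once per refresh rather than once per report, which fits the stated tolerance for day-old or week-old data.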

Elasticsearch indexing speed with Nodejs

I have an Elasticsearch index with a large number of documents. I've been using JavaScript with Node.js up to this point. I made a cron job that runs every 24 hours and updates the documents individually based on any metadata changes. As some of you may know, this is probably the slowest possible way to do it: single-threaded Node.js with individual indexing requests to Elasticsearch. When I run the cron job, it runs at a snail's pace. I can update a document maybe every 1.5-2 seconds. This means it would take around 27 hours to update all the documents. I'm using a free-tier AWS ES instance, so I don't have access to certain features that would help me speed up the process.
Does anyone know of a way to speed up the indexing? If I were to call for a bulk update, how would that manifest in JavaScript? If I were to use another language to multi-thread it, what would be the fastest option?
I did not understand your question "If I were to call for a bulk update, how would that manifest in javascript?".
Bulk update should be the best solution irrespective of language or framework. Of course, you can explore other languages like Ruby and use threads to distribute the bulk updates and make them faster.
From experience, a bulk update with a batch size between 4,000 and 7,000 documents works just fine. You may want to fine-tune the size within this range.
Ensure that refresh_interval is set to a large value, so that the index is not refreshed (which is what makes newly indexed documents searchable) too frequently during the update. IMO the default value will also do.
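To make the batching concrete, here is a rough sketch of how the request bodies for the `_bulk` endpoint could be assembled. It is written in Python rather than JavaScript and only builds the newline-delimited JSON; sending it over HTTP is left out. The index name `articles`, the `title` field and the batch size of 5000 are assumptions for illustration, and older Elasticsearch versions may also require a `_type` in the action line.

```python
import json

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def bulk_update_bodies(index, updates, batch_size=5000):
    """Yield one newline-delimited JSON body per batch for the _bulk endpoint.

    `updates` is a list of (doc_id, partial_doc) pairs.
    """
    for batch in chunked(updates, batch_size):
        lines = []
        for doc_id, partial_doc in batch:
            # Action line, followed by the partial document to merge in.
            lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
            lines.append(json.dumps({"doc": partial_doc}))
        # The bulk API expects the body to end with a newline.
        yield "\n".join(lines) + "\n"

updates = [(str(i), {"title": f"doc {i}"}) for i in range(12000)]
bodies = list(bulk_update_bodies("articles", updates, batch_size=5000))
```

Each body would be sent as one request, so 12,000 updates become three round trips instead of 12,000.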

Multiple Cursors versus Multiple Connections

I'm building an automation in Python which fetches some data from a database table and populates an Excel sheet. I'm using the cx_Oracle module to set up a connection. There are around 44 queries, and around 2 million rows of data are fetched for each query, which makes this script run for an hour. So I'm planning to use the threading module to speed up the process. However, I'm not sure whether to use multiple connections (around 4) or fewer connections (say, 2) with multiple cursors per connection.
The queries are independent of each other. They are select statements to fetch the data and are not manipulating the table in any way.
I just need some pros and cons of using both approaches so that I can decide how to go about the script. I tried searching for it a lot, but curiously I'm not able to find any relevant piece of information at all. If you point me to any kind of blog post, even that will be really helpful.
Thanks.
An Oracle connection can really do just one thing at a time. Specifically while a database session can have multiple open cursors at any one time, it can only be executing one of them.
As such, you won't see any improvement by having multiple cursors in a single connection.
That said, depending on the bottleneck, you MIGHT not see any improvement from going with multiple connections either. It might be choked on bandwidth in returning the data, disk access etc. If you can code in such a way as to keep the number of threads / connections variable, then you can tweak until you find the best result.

Azure SQL virtual machine performance - Inserts very slow

I'm trying out different pricing tiers on SQL Server.
I'm inserting 4000 rows distributed over 4 tables in 10 seconds.
My problem: I don't see any performance improvement going from a small D2S_V3 to a D8S_V3.
My application needs to insert many rows (bulking is not an option), and this kind of performance is not acceptable.
I wonder why I don't see improvements.
So my noob question: Do I need to configure something to see improvements? My naive thinking says I should see some difference :-)
What am I doing wrong?
Without knowing much about your schema, it looks like you are storage bound or network bound.
Storage:
Try placing the database on the local (temporary) disk and see if you notice any difference. If it is faster, then your bottleneck is the mounted disk.
Network bound:
Where is the client that's inserting these transactions? On the same machine? On Azure?
I suggest you set up a client in the same region and run the tests there.
"Inserting 4000 rows distributed over 4 tables in 10 seconds. I don't see any performance improvement from a small D2S_V3 to D8S_V3."
I would approach this problem by looking at wait stats first, rather than throwing hardware at it without knowing what the problem is.
For example, running the insert below
INSERT INTO #t
SELECT orderid
FROM orders o
JOIN Customers c
    ON c.custid = o.custid
showed me the following wait stats:
<Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="1" WaitCount="167" />
<Wait WaitType="PAGEIOLATCH_SH" WaitTimeMs="12" WaitCount="3" />
<Wait WaitType="MEMORY_ALLOCATION_EXT" WaitTimeMs="21" WaitCount="4975" />
Most of the wait time was spent on:
PAGEIOLATCH_SH: getting data from disk into memory
MEMORY_ALLOCATION_EXT: allocating memory for the query to run
Based on this, I would check whether there is memory pressure on the system, since this query has to read data from disk.
This is just one example, but hopefully it gives you an idea.
Next, I would check whether the SELECT on its own returns data quickly.
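If you collect wait stats for several statements, ranking them by hand gets tedious. As a small illustration, the sketch below parses the three wait entries from the example and sorts them by wait time. The `<WaitStats>` wrapper element is added here only so the fragment parses as one XML document.

```python
import xml.etree.ElementTree as ET

# The wait entries from the example, wrapped in a root element so they parse.
WAIT_STATS_XML = """
<WaitStats>
  <Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="1" WaitCount="167" />
  <Wait WaitType="PAGEIOLATCH_SH" WaitTimeMs="12" WaitCount="3" />
  <Wait WaitType="MEMORY_ALLOCATION_EXT" WaitTimeMs="21" WaitCount="4975" />
</WaitStats>
"""

def rank_waits(xml_text):
    """Return (wait_type, wait_ms, wait_count) tuples, longest wait first."""
    root = ET.fromstring(xml_text.strip())
    waits = [
        (w.get("WaitType"), int(w.get("WaitTimeMs")), int(w.get("WaitCount")))
        for w in root.iter("Wait")
    ]
    return sorted(waits, key=lambda w: w[1], reverse=True)

ranked = rank_waits(WAIT_STATS_XML)
for wait_type, wait_ms, wait_count in ranked:
    print(f"{wait_type}: {wait_ms} ms over {wait_count} waits")
```

The wait types at the top of the list are the ones worth investigating first.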
Performance can be directly linked to your hardware or configuration, but it's more likely that it has to do with the structures and the queries. Take a look at the execution plan for the INSERT operation to see how it is being resolved by the optimizer. Also, capture the query metrics using extended events to see how many resources are being used by the operation. These are more likely to lead to a resolution on why the query is performing slowly and enable you to scale the hardware to best serve the query.

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about my server setup.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write operations. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time or, better, is triggered before the read operations start.
I'm doing all my processing from Node.js, so one option I see is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I couldn't work out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and re-enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell in advance how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month, followed by an $out. The $out operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't be inconsistent with each other.
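The $match plus $out pipeline described above could be built like this. This is a sketch in Python rather than Node.js, and it only constructs the pipeline. The date field name `createdAt`, the target collection name `logs_last_month` and the 30-day window are assumptions, so substitute whatever your log documents actually use.

```python
from datetime import datetime, timedelta

def snapshot_pipeline(now, days=30, date_field="createdAt", target="logs_last_month"):
    """Build a pipeline that copies recent documents into a separate collection."""
    cutoff = now - timedelta(days=days)
    return [
        # Keep only documents newer than the cutoff (field name is an assumption).
        {"$match": {date_field: {"$gte": cutoff}}},
        # Write the result to `target`, replacing its previous contents.
        {"$out": target},
    ]

pipeline = snapshot_pipeline(datetime(2015, 6, 1))
# With a driver such as pymongo this would be run as: db.logs.aggregate(pipeline)
```

The same two-stage structure translates directly to a JavaScript array passed to aggregate() from Node.js.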
If you are certain that you need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend that you build a proper replica set (consisting of a primary, a secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
