Node.js and SQLite: performing long queries

I have to perform two queries: query A is slow (20 seconds) and query B is fast (1 second).
I want to guarantee that query B stays fast even while query A is running.
How can I achieve this behaviour?

It may not be easy to do because of how SQLite does locking.
From official Appropriate Uses For SQLite documentation:
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
[...]
SQLite only supports one writer at a time per database file. But in most cases, a write transaction only takes milliseconds and so multiple writers can simply take turns. SQLite will handle more write concurrency than many people suspect. Nevertheless, client/server database systems, because they have a long-running server process at hand to coordinate access, can usually handle far more write concurrency than SQLite ever will.
As the SQLite documentation itself suggests, SQLite may not be the best fit when you have so much data that a single query takes this long.
There is no easy fix for that, other than moving to a client/server RDBMS like PostgreSQL.
And since you didn't include the queries that take so long, it's impossible to say much more than that. Perhaps your queries could be optimized, but we can't tell without seeing them.
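That said, if both queries are read-only, the "unlimited simultaneous readers" point quoted above gives you one lever: run each query on its own connection (and its own thread or worker), so the fast query isn't serialized behind the slow one inside a single connection. Here is a minimal sketch of the idea, shown with Python's built-in sqlite3 module rather than a Node driver, with hypothetical table names and timings; the same principle applies to Node drivers that let you open more than one database handle.

    # Assumes both queries are reads; a writer would still block other writers.
    import sqlite3
    import threading

    DB_PATH = "app.db"  # hypothetical database file

    def run_query(sql):
        # One connection per thread: SQLite allows many concurrent readers.
        conn = sqlite3.connect(DB_PATH)
        try:
            rows = conn.execute(sql).fetchall()
            print(len(rows), "rows from:", sql)
        finally:
            conn.close()

    slow = threading.Thread(target=run_query, args=("SELECT * FROM big_table",))             # ~20 s
    fast = threading.Thread(target=run_query, args=("SELECT id FROM small_table LIMIT 10",)) # ~1 s
    slow.start()
    fast.start()   # runs concurrently; finishes long before the slow query
    slow.join()
    fast.join()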

Related

Consistent reads from multiple readers

Consider a scenario where a web request makes N database requests. If I know that all, or most, of those requests can be sent to db-readers, then with Vitess's architecture, when multiple readers are set up, wouldn't those N db requests get distributed across different db-readers?
When different readers have different replication lag, those N db requests could return mutually inconsistent results.
Does Vitess have a special way of handling this?
Or how should an application deal with such a situation?
Vitess now supports replica transactions. So, that's what I'd recommend you use if you want consistent reads from replicas. There's a longer answer below if you don't want to use transactions.
The general idea of a replica read is that it's a dirty read. Even if you hit the same replica, the data could have changed from the previous read.
The only difference is that time moves forward if you went back to the same replica.
In reality, this is not very different from cases where you read older data from a different replica. Essentially, you have to deal with the fact that two pieces of data you read are potentially inconsistent with each other.
In other words, if you wrote the application to tolerate inconsistency between two reads, that code would likely tolerate reads that go back in time also. But it all depends on the situation.

Multiple Cursors versus Multiple Connections

I'm building an automation in Python which fetches some data from a database table and populates an Excel sheet. I'm using the cx_Oracle module to set up a connection. There are around 44 queries, and around 2 million rows of data are fetched for each query, which makes this script run for about an hour. So I'm planning to use the threading module to speed up the process, but I'm not sure whether to use multiple connections (around 4) or fewer connections (say, 2) with multiple cursors per connection.
The queries are independent of each other. They are select statements to fetch the data and are not manipulating the table in any way.
I just need some pros and cons of both approaches so that I can decide how to go about the script. I've searched a lot, but curiously I can't find any relevant information at all. Even a pointer to a blog post would be really helpful.
Thanks.
An Oracle connection can really only do one thing at a time. Specifically, while a database session can have multiple open cursors at any one time, it can only be executing one of them.
As such, you won't see any improvement by having multiple cursors in a single connection.
That said, depending on the bottleneck, you MIGHT not see any improvement from going with multiple connections either. The job might be limited by network bandwidth returning the data, disk access, etc. If you can code in such a way as to keep the number of threads/connections configurable, then you can tweak it until you find the best result.
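To make the "multiple connections, tunable thread count" suggestion concrete, here is a hedged sketch using cx_Oracle and a thread pool; the connection details, query list and pool size are placeholders to adjust.

    from concurrent.futures import ThreadPoolExecutor
    import cx_Oracle

    DSN = cx_Oracle.makedsn("dbhost", 1521, service_name="ORCLPDB1")  # placeholder
    QUERIES = ["SELECT * FROM table_a", "SELECT * FROM table_b"]      # stand-ins for the 44 queries
    NUM_WORKERS = 4  # keep this configurable and measure 2, 4, 8, ...

    def fetch(query):
        # Each worker thread opens its own connection; a single connection can
        # only execute one statement at a time, so sharing one would not help.
        with cx_Oracle.connect(user="scott", password="tiger", dsn=DSN) as conn:
            cur = conn.cursor()
            cur.arraysize = 10000  # larger fetch batches mean fewer round trips
            cur.execute(query)
            return cur.fetchall()

    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        results = list(pool.map(fetch, QUERIES))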

Count distinct in infinite stream

I am looking for a way to create a streaming application that can withstand millions of events per second and output a distinct count of those events in real time. As this stream is unbounded by any time window, it obviously has to be backed by some storage. However, I cannot find the best way to do this while maintaining a good level of abstraction (meaning that I want a framework to handle storing and counting for me; otherwise I don't need a framework at all). My preferred storage options are Cassandra and Redis (ideally both).
The options I've considered are Flink, Spark and Kafka Streams. I know the differences between them, but I still can't pick the best solution. Can someone advise? Thanks in advance.
Regardless of which solution you choose, if you can live with results that are not 100% accurate (but very, very close), you can have your operator use HyperLogLog (there are Java implementations available). This means you don't have to keep data about each individual item, drastically reducing your memory usage.
Assuming Flink, the necessary state is quite small (< 1 MB), so you can easily use the FsStateBackend, which is heap-based and checkpoints to the file system, allowing you to reduce serialization overhead.
Again assuming you go with Flink, using a ContinuousEventTimeTrigger you can also get a view into how many unique items are currently being tracked.
I'd suggest reconsidering the choice of storage system. Using an external system is significantly slower than using local state. Flink applications maintain state locally on the JVM heap or in RocksDB (on disk) and can checkpoint it at regular intervals to persistent storage such as HDFS. This state can grow very big (tens of TBs) and still be maintained efficiently because checkpoints can be done incrementally and asynchronously. This gives much better performance than sending a query to an external system for each record.
If you still prefer Redis or Cassandra, you can use Flink's AsyncIO operator to send asynchronous requests to improve the throughput of your application.
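To make the HyperLogLog idea concrete, here is a tiny sketch using the Python datasketch library (in Flink you would hold an equivalent Java sketch in the operator's state); the precision parameter and the sample events are illustrative.

    from datasketch import HyperLogLog

    hll = HyperLogLog(p=14)  # 2^14 registers, roughly 1% relative error, a few KB of memory

    def on_event(event_id: str):
        # Called for every event in the stream; only the fixed-size sketch is kept.
        hll.update(event_id.encode("utf-8"))

    for e in ("user-1", "user-2", "user-1", "user-3"):  # stand-in for the real stream
        on_event(e)

    print(int(hll.count()))  # estimated number of distinct events seen so far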

How to Create slowness in Cassandra?

I want to create slowness in Cassandra to test my application. Are there any specific ways to induce slowness in Cassandra? In an RDBMS we use locking, so one operation waits until the lock is released. As Cassandra doesn't have locking, is there any other way to create deadlocks, slowness, etc.?
You could use the cassandra-stress tool.
You could check out our project, Simulacron: https://github.com/datastax/simulacron
This is a C*/DSE simulator that was written specifically to test things like race conditions and error conditions. You would have to prime all your relevant queries ahead of time, but it allows you to introduce a wait time, or errors, into your responses. You can also simulate a large cluster on your local machine.
There is also a similar tool called scassandra, which does much of the same thing.
http://www.scassandra.org/
There are many ways to do it; I'll list two:
Create a UDF with a sleep/wait function inside, if your version of Cassandra supports it.
Link to the docs:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDF.html
Create a large table (the larger it is, the slower it will run), and run:
select some_column from table where other_column = 'something' allow filtering;
where other_column is not a partition key of the table. It will result in a full table scan, and since Cassandra isn't built for that, it will take some time (and cost I/O and CPU); see the sketch below for driving it from the application side.
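If you take the full-scan route, here is a hedged sketch of running it from the application with the Python cassandra-driver so you can measure how much latency it injects; the contact point, keyspace, table and column names are placeholders.

    import time
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")

    slow_query = "SELECT some_column FROM my_table WHERE other_column = 'something' ALLOW FILTERING"

    start = time.monotonic()
    rows = list(session.execute(slow_query, timeout=120))  # full scan, no partition key
    print(len(rows), "rows in", round(time.monotonic() - start, 1), "s")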
It may be easier to just throttle the network on the nodes. Depending on the OS you're using, there are different options.

Alternative to polling the database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table and a python program running on each of the windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as most of the time there's nothing for them to do. Clients poll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't made to do a push really (not easily anyway). There are things that you could use to push data out (replication service, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (xp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common because they are, well... simple. In addition, they are low-overhead, reasonably stable, and error-tolerant. The down side is that they can hammer the database into dust if not carefully controlled.
A message queue might work well, as queues are usually set up to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row holding the latest job ID. Then clients only need to compare that to their last-seen ID to decide whether to run a full poll against the real table (see the sketch below). This way the overhead should be greatly reduced, if it's even an issue now.
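A hedged sketch of that cheap "has anything changed?" check, in Python with mysql-connector-python; the table, column and helper names are invented for illustration.

    import time
    import mysql.connector

    conn = mysql.connector.connect(host="dbhost", user="app", password="secret",
                                   database="jobs", autocommit=True)
    last_seen_id = 0

    while True:
        cur = conn.cursor()
        cur.execute("SELECT latest_job_id FROM job_watermark LIMIT 1")  # single tiny row
        (latest_id,) = cur.fetchone()
        cur.close()

        if latest_id != last_seen_id:
            # Something new arrived: only now run the full poll against the real
            # jobs table, lock out the new jobs and process them.
            last_seen_id = latest_id
            process_new_jobs(conn)  # hypothetical helper in the existing polling client

        time.sleep(5)  # same interval as before, but the common case is now very cheap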
Unlike Postgres and SQL Server (or object stores like CouchDB), MySQL does not emit database events. However, there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
You might think that MyISAM is the best choice for the changes table since it mostly receives writes (or even MEMORY if you don't need to persist the events across database server outages). However, keep in mind that both MEMORY and MyISAM support only full-table locks, so your trigger on an InnoDB table might hit a bottleneck when inserting into a MEMORY or MyISAM table. You may also require InnoDB for the changes table if you're using an ON DELETE CASCADE with another InnoDB table (that requires both tables to use the same engine).
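A hedged sketch of the trigger-plus-changes-table pattern, with the DDL sent through mysql-connector-python; every table, column and trigger name here is invented for illustration.

    import mysql.connector

    conn = mysql.connector.connect(host="dbhost", user="app", password="secret",
                                   database="jobs", autocommit=True)
    cur = conn.cursor()

    # Persistent queue of events, written by the trigger and consumed by workers.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS changes (
            id BIGINT AUTO_INCREMENT PRIMARY KEY,
            event_type VARCHAR(32) NOT NULL,
            job_id BIGINT NOT NULL,
            created_at DATETIME NOT NULL
        ) ENGINE=InnoDB
    """)

    # Single-statement trigger, so no DELIMITER handling is needed.
    cur.execute("""
        CREATE TRIGGER jobs_after_insert
        AFTER INSERT ON jobs
        FOR EACH ROW
        INSERT INTO changes (event_type, job_id, created_at)
        VALUES ('new_job', NEW.id, NOW())
    """)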
You might also use SHOW TABLE STATUS to check the last update time of your changes table to see whether there's something to process. This feature won't work for InnoDB tables.
These articles describe in more depth some alternative ways to implement queues in MySQL, and even how to avoid polling:
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you
