I have a program that uses 5 serial ports to get data from hardware, save it in a MySQL database, and also forward it to a third party over TCP/IP.
For the serial ports I use asynchronous I/O, so each port has its own thread and cannot make the system UI laggy. But if I use the MySQL connection inside the threads created by the serial port component, it takes time away from reading the port buffer and causes a buffer-overload error.
On the other hand, if I hand the data processing and MySQL storage off to an anonymous thread, I end up with many worker threads piling up in a queue...
What is the best way to handle this kind of application?
The problem you describe can be solved by placing a FIFO queue between the COM-port threads and the MySQL threads. There is a data flow from the COM devices (call it the upstream side); push this data into a thread-safe FIFO (Delphi has one implementation: System.Generics.Collections.TThreadedQueue) and allocate N MySQL threads that read portions of data from the FIFO queue (the downstream side). This turns your problem into the classic producer/consumer task. If the MySQL threads are too slow, the FIFO will grow; in that case you have to increase the number of MySQL threads. You can find the optimal MySQL thread count by watching the average FIFO size.
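Here is a minimal sketch of that producer/consumer shape. The answer above has Delphi's TThreadedQueue in mind, but the structure is language-neutral, so the sketch uses C++; the Record type, the queue bound, the thread counts, and the storeInMySQL() stub are placeholders for illustration, not part of the original setup.

```cpp
// Producer/consumer sketch: serial-reader threads push, DB threads pop.
// Record, QUEUE_LIMIT and storeInMySQL() are illustrative placeholders.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Record { std::string payload; };           // one data piece from a COM port

class BoundedQueue {
public:
    void push(Record r) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [this] { return q_.size() < QUEUE_LIMIT; });
        q_.push(std::move(r));
        notEmpty_.notify_one();
    }
    bool pop(Record& out) {                        // returns false once closed and drained
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        notEmpty_.notify_all();
    }
private:
    static constexpr std::size_t QUEUE_LIMIT = 10000;
    std::queue<Record> q_;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
    bool closed_ = false;
};

void storeInMySQL(const Record& r) {               // placeholder for the real INSERT
    std::cout << "stored: " << r.payload << '\n';
}

int main() {
    BoundedQueue fifo;

    // Upstream: stand-ins for the serial-port threads producing records.
    std::vector<std::thread> readers;
    for (int port = 0; port < 5; ++port)
        readers.emplace_back([&fifo, port] {
            for (int i = 0; i < 100; ++i)
                fifo.push({"port" + std::to_string(port) + " sample " + std::to_string(i)});
        });

    // Downstream: N database threads draining the queue.
    std::vector<std::thread> dbThreads;
    for (int n = 0; n < 2; ++n)
        dbThreads.emplace_back([&fifo] {
            Record r;
            while (fifo.pop(r)) storeInMySQL(r);
        });

    for (auto& t : readers) t.join();
    fifo.close();                                  // let the consumers drain and exit
    for (auto& t : dbThreads) t.join();
}
```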
The driver buffer-overload problem occurs when the physical device delivers data faster than your program can read and process it. In your case it happens because the MySQL part is the slow link.
Truth be told, a COM port is very slow compared to the CPU and threads. I am fairly sure the core of your problem is that you read a large number of tiny data pieces from the COM devices and your MySQL library wraps every single little piece in its own SQL operation. Shift your attention to BATCHING: combine many small pieces of data into larger batches, so you can reduce the number of SQL operations going over TCP/IP.
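As a rough illustration of batching (not taken from the question's code), the sketch below collects small samples and emits one multi-row INSERT per 50 records instead of one statement per record. The table name, column name, and flush threshold are made up, and a real program would escape the values or use a prepared statement.

```cpp
// Batching sketch: collect small records and build one multi-row INSERT.
// Table/column names are hypothetical; the statement is only printed here.
#include <iostream>
#include <string>
#include <vector>

std::string buildBatchInsert(const std::vector<std::string>& samples) {
    std::string sql = "INSERT INTO readings (payload) VALUES ";
    for (std::size_t i = 0; i < samples.size(); ++i) {
        sql += "('" + samples[i] + "')";
        if (i + 1 < samples.size()) sql += ", ";
    }
    return sql;
}

int main() {
    std::vector<std::string> batch;
    for (int i = 0; i < 200; ++i) {
        batch.push_back("sample " + std::to_string(i));
        if (batch.size() == 50) {                  // flush every 50 records
            std::cout << buildBatchInsert(batch) << ";\n";
            batch.clear();
        }
    }
    if (!batch.empty()) std::cout << buildBatchInsert(batch) << ";\n";
}
```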
I have developed a multithreaded system that creates a child process for each client request to read and send files to clients over TCP sockets.
I have difficulty deciding whether using a mutex around every file read will improve performance, or whether it is better to let the child processes read files from the hard disk concurrently without mutexes.
The files are 500 KB on average, and we estimate the simultaneous TCP connections at a maximum of 2000 per minute.
PS: the program reads each file in chunks of 2000 bytes, sends each buffer, and loops until the transmission is finished.
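For reference, a rough sketch of the loop described in the PS (the 2000-byte buffer comes from the question; the function name and the already-open descriptors are assumptions, and error handling is minimal):

```cpp
// Chunked file-to-socket loop as described in the PS: read up to 2000 bytes,
// send them, repeat until EOF. fd and sock are assumed to be opened elsewhere.
#include <sys/socket.h>
#include <unistd.h>

int send_file(int fd, int sock) {
    char buf[2000];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                          // send() may accept fewer bytes than asked
            ssize_t sent = send(sock, buf + off, static_cast<size_t>(n - off), 0);
            if (sent < 0) return -1;
            off += sent;
        }
    }
    return n < 0 ? -1 : 0;                         // 0 on clean EOF, -1 on read error
}
```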
I have been approaching these sorts of issues by asking: which operations are being performed, on which resources, and by whom?
Some of the operations being performed are:
reads on files (potentially shared?)
Does the OS (Linux?) guarantee that file reads are thread-safe? Yes, it looks like it. If this is truly the case (something I'm not super familiar with), then coordinating mutexes across processes for file access would be unnecessary overhead.
Writes to TCP per process
This shouldn't be a concurrency issue, because each process handles its own TCP connection.
If the post linked above reflects reality, then there shouldn't be any need for cross-process coordination (mutexes).
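To make the "no mutex needed" point concrete, here is a small sketch of several readers working on the same file concurrently with no coordination at all: each one has its own descriptor and its own offset (via pread), so the kernel handles the actual disk access. The file path, chunk size, and reader count are arbitrary choices for illustration.

```cpp
// Concurrent readers of the same file, no mutex: each reader has a private
// descriptor and tracks its own offset with pread().
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <thread>
#include <vector>

void read_whole_file(const char* path, int id) {
    int fd = open(path, O_RDONLY);                 // private descriptor per reader
    if (fd < 0) { std::perror("open"); return; }
    char buf[2000];
    off_t offset = 0;
    ssize_t n;
    while ((n = pread(fd, buf, sizeof buf, offset)) > 0)
        offset += n;                               // each reader advances only its own offset
    std::printf("reader %d read %lld bytes\n", id, static_cast<long long>(offset));
    close(fd);
}

int main() {
    std::vector<std::thread> readers;
    for (int i = 0; i < 4; ++i)
        readers.emplace_back(read_whole_file, "/etc/hostname", i);
    for (auto& t : readers) t.join();
}
```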
Recently I have been learning network programming. I know that for a server to handle multiple clients, you need to use select or threads (at least in Python/C/C++; I don't know of anything similar to select in Java, where I only know the thread approach).
I have read that select is better from a performance point of view and threads are better for small servers. However, yesterday I found this page: http://www.assembleforce.com/2012-08/how-to-write-a-multi-threading-server-in-python.h and I do not understand why the code there uses both select and threads. It's difficult for me to understand how exactly it works and why it is better than the other methods I mentioned; I do not understand the idea behind this code.
Thank you.
Threads and select are not mutually exclusive.
Multi-threading is a form of parallel processing, allowing a single process to seemingly perform multiple tasks in an asynchronous manner.
Using select allows your program to monitor a file descriptor (e.g., a socket), waiting for an event.
Both can be (and, to my knowledge, frequently are) used together. In a network server environment, threading can be used to service multiple clients, while select is used so that a thread does not hog CPU time while idling.
Imagine that you are receiving data from multiple clients. A thread is waiting for data from client1, which is taking too long; meanwhile, client2 is sending data like crazy. You have three options:
Without select, using blocking calls: Block waiting for data from client1, and leave client2 waiting.
Without select, using non-blocking calls: Continuously poll client1, giving up after n tries without any data transfer.
With select: Monitor the clients' sockets. If they have data to transfer, read it; otherwise, relinquish the current thread's CPU time.
This is a simple non-blocking approach to network servers that tries to give clients a low-latency response. There are other approaches, and for those I recommend the book UNIX Network Programming.
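A bare-bones sketch of the third option, where a single thread uses select() to watch the listening socket plus all client sockets and only reads from the ones that are actually ready (the port, backlog, and buffer size are arbitrary, and error handling is trimmed to the essentials):

```cpp
// select()-based loop: one thread waits on many sockets and only reads from
// those that have data. Port and buffer size are arbitrary illustration values.
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(5000);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
    listen(listener, 16);

    std::vector<int> clients;
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(listener, &readable);
        int maxfd = listener;
        for (int fd : clients) { FD_SET(fd, &readable); maxfd = std::max(maxfd, fd); }

        select(maxfd + 1, &readable, nullptr, nullptr, nullptr);   // block until something is ready

        if (FD_ISSET(listener, &readable))                         // new connection arrived
            clients.push_back(accept(listener, nullptr, nullptr));

        for (auto it = clients.begin(); it != clients.end(); ) {
            char buf[1024];
            if (FD_ISSET(*it, &readable)) {
                ssize_t n = recv(*it, buf, sizeof buf, 0);
                if (n <= 0) { close(*it); it = clients.erase(it); continue; }
                std::printf("got %zd bytes\n", n);
            }
            ++it;
        }
    }
}
```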
I'll first describe my task and then present my questions below.
I am trying to implement the "one thread one connection" scheme for our distributed DAQ system. I have used Boost for threads (thread_group) and ASIO for sockets, on a Linux platform.
We have 320 networked DAQ modules. Approximately once every 0.25 ms, about half of them will each generate a packet of data (smaller than the standard MTU) and send it to a Linux server. Each of the modules has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads with 320 synchronous TCP receivers, on a 1 GbE NIC and 8 CPU cores.
The 320 threads do not have to do any computation on the incoming data; they only receive the data, generate and attach a timestamp, and store the data in thread-owned memory. The sockets are all synchronous, so threads that have no incoming data are blocked. Sockets are kept open for the duration of a run.
Our requirement is that the threads read their individual socket connections with as little time lag as possible. Having read about C10K and this post, I expected each thread to easily process the equivalent of at least 1K MTU-sized packets every second.
My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets arrive less than a few microseconds apart). When the number of data packets is very small (fewer than 10), I find that the threads' timestamps are separated by a few microseconds. However, with more than 10 packets, the timestamps are spread by as much as 0.7 s.
My questions are:
Have I totally misunderstood the C10K issue and messed up the implementation? 320 does seem trivial compared to C10K.
Any hints as to what's going wrong?
Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
320 threads is chump change in terms of resources, but the scheduling may pose issues.
About half of 320 modules each sending a packet every 0.25 ms works out to roughly 640,000 packets per second, implying a comparable number of context switches because you decided you must have each connection on a thread.
I'd simply suggest: don't do this. It's well known that thread-per-connection doesn't scale. And it almost always implies further locking contention on any shared resources (assuming that all the responses aren't completely stateless).
Q. Having read about C10K and this post, I expected each thread to easily process the equivalent of at least 1K MTU-sized packets every second
Yes. A single thread can easily sustain that (on most systems). But that is no longer true, obviously, if you have hundreds of threads trying to do the same, all competing for the physical cores.
So for maximum throughput and low latency, it's hardly ever useful to have more threads than there are available (!) physical cores.
Q. Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
The good news is that Boost Asio makes it very easy to use a single thread (or a limited pool of threads) to service the asynchronous tasks from its service queue.
That is, assuming you are already using the *_async versions of the ASIO API functions.
I think the vast majority, if not all, of the Boost Asio asynchronous I/O examples show how to run the service on a limited number of threads only.
http://www.boost.org/doc/libs/1_57_0/doc/html/boost_asio/examples.html
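As a sketch of that idea (not your DAQ code), the example below runs one io_service on a small pool of threads and services every connection's asynchronous reads from that pool. The Session class, port 9000, and the pool size are illustrative choices, written against the older io_service-style API that the linked 1.57 documentation describes.

```cpp
// Boost.Asio sketch: a limited thread pool running one io_service services
// all connections, instead of one thread per connection.
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <array>
#include <iostream>
#include <memory>

using boost::asio::ip::tcp;

// One Session per connection; its handlers run on whichever pool thread is free.
class Session : public std::enable_shared_from_this<Session> {
public:
    explicit Session(tcp::socket socket) : socket_(std::move(socket)) {}
    void start() { doRead(); }
private:
    void doRead() {
        auto self = shared_from_this();
        socket_.async_read_some(boost::asio::buffer(buf_),
            [this, self](boost::system::error_code ec, std::size_t n) {
                if (ec) return;                    // peer closed or error: drop the session
                std::cout << "received " << n << " bytes\n";
                doRead();                          // chain the next read
            });
    }
    tcp::socket socket_;
    std::array<char, 1500> buf_;                   // roughly one MTU-sized packet
};

// Accept loop: each accepted socket becomes a Session, then accept again.
void doAccept(boost::asio::io_service& io, tcp::acceptor& acceptor) {
    auto socket = std::make_shared<tcp::socket>(io);
    acceptor.async_accept(*socket,
        [&io, &acceptor, socket](boost::system::error_code ec) {
            if (!ec) std::make_shared<Session>(std::move(*socket))->start();
            doAccept(io, acceptor);
        });
}

int main() {
    boost::asio::io_service io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 9000));
    doAccept(io, acceptor);

    // Limited pool: a handful of threads service every connection's handlers.
    unsigned poolSize = boost::thread::hardware_concurrency();
    if (poolSize == 0) poolSize = 2;
    boost::thread_group pool;
    for (unsigned i = 0; i < poolSize; ++i)
        pool.create_thread([&io] { io.run(); });
    pool.join_all();
}
```

Each Session has only one outstanding read at a time, so no strand is needed here; if a session issued concurrent operations, you would wrap its handlers in a strand.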
I'm currently working on an audio recording application that fetches up to 8 audio streams from the network and saves the data to the disk (simplified ;) ).
Right now, each stream is handled by one thread, and the same thread also does the saving to disk.
That means I have 8 different threads performing writes on the same disk, each into a different file.
Do you think disk I/O performance would increase if all the writing were done by one common thread (which would sequentially write the data into the individual files)?
OS is an embedded Linux, the "disk" is a CF card, the application is written in C.
Thanks for your ideas
Nick
The short answer: Given that you are writing to a Flash disk, I wouldn't expect the number of threads to make much difference one way or another. But if it did make a difference, I would expect multiple threads to be faster than a single thread, not slower.
The longer answer:
I wrote a similar program to the one you describe about 6 years ago -- it ran on an embedded PowerPC Linux card and read/wrote multiple simultaneous audio files to/from a SCSI hard drive. I originally wrote it with a single thread doing I/O, because I thought that would give the best throughput, but it turned out that that was not the case.
In particular, when multiple threads were reading/writing at once, the SCSI layer was aware of all the pending requests from all the different threads, and was able to reorder the I/O requests such that seeking of the drive head was minimized. In the single-thread-IO scenario, on the other hand, the SCSI layer knew only about the single "next" outstanding I/O request and thus could not do that optimization. That meant extra travel for the drive head in many cases, and therefore lower throughput.
Of course, your application is not using SCSI or a rotating drive with heads that need seeking, so that may not be an issue for you -- but there may be other optimizations that the filesystem/hardware layer can do if it is aware of multiple simultaneous I/O requests. The only real way to find out is to try various models and measure the results.
My suggestion would be to decouple your disk I/O from your network I/O by moving your disk I/O into a thread-pool. You can then vary the maximum size of your I/O-thread-pool from 1 to N, and for each size measure the performance of the system. That would give you a clear idea of what works best on your particular hardware, without requiring you to rewrite the code more than once.
If it's embedded Linux, I guess your machine has only one processor/core. In that case threads won't improve I/O performance at all. Of course the Linux block subsystem works well in a concurrent environment, but in your case (if my guess about the number of cores is right) there can never be a situation where several threads are actually doing something simultaneously.
If my guess is wrong and you have more than one core, then I'd suggest benchmarking disk I/O: write one program that writes a lot of data from different threads and another that does the same from a single thread. The results will show you everything you want to know.
I think there is no big difference between a multithreaded and a single-threaded solution in your case, but with multithreading the receiving threads can synchronize among themselves, and no thread can hold up the others if it blocks in some system call.
I did practically the same thing on an embedded system. The problem was high CPU usage when the kernel flushed many cached dirty pages to the CF card: the pdflush kernel process took all the CPU time at that moment, and if you are receiving a stream via UDP it can be dropped because the CPU is busy when the UDP data arrives. I solved that problem by calling fdatasync() every time a modest amount of data had been received.
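A minimal sketch of that trick, assuming a plain POSIX write loop: call fdatasync() every time a modest amount of data has been written, so dirty pages are flushed in small doses instead of in one large pdflush burst. The 64 KB threshold, packet size, and file name are arbitrary.

```cpp
// Periodic fdatasync sketch: flush the file's data every FLUSH_THRESHOLD bytes
// so the kernel never accumulates a large pile of dirty pages for the CF card.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main() {
    const std::size_t FLUSH_THRESHOLD = 64 * 1024; // flush every 64 KB written
    int fd = open("stream0.raw", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    char packet[1500];
    std::memset(packet, 0, sizeof packet);         // stand-in for received audio data
    std::size_t sinceFlush = 0;

    for (int i = 0; i < 10000; ++i) {              // pretend packets arriving from the network
        if (write(fd, packet, sizeof packet) < 0) break;
        sinceFlush += sizeof packet;
        if (sinceFlush >= FLUSH_THRESHOLD) {
            fdatasync(fd);                         // push dirty pages out now, in small doses
            sinceFlush = 0;
        }
    }
    fdatasync(fd);
    close(fd);
}
```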
I'm planning a multithreaded server written in Qt. Each connection would be handled in a separate thread. Each of those threads would run its own event loop and use asynchronous sockets. I would like to dispatch a const value (for instance, a QString containing an event string) from the main thread to all the client threads in the most efficient way possible. The value should obviously be deleted once all the client threads have read it.
If I simply pass the data in a queued signal/slot connection, would this introduce considerable overhead? Would it be more efficient to pass a QSharedPointer<QString>? What about passing a const QString* together with a QAtomicInt* for reference counting, letting each thread decrement it and delete the string when the counter reaches 0?
Somewhat off-topic, but please be aware that the one-thread-per-connection model lets anyone who can connect mount a highly efficient denial-of-service attack against the system running the server, since the maximum number of threads that can be created on any system is limited. Also, on a 32-bit system you can starve the address space, since each thread gets its own stack. The default stack size varies across systems. On Win32 it's 1 MB, IIRC, so 2048 connections kept open and alive will eat 2 GB, i.e. the entire address space reserved for userspace (you can bump it up to 3 GB, but that doesn't help much).
For more details, check The C10K Problem, specifically the I/O Strategies -> Serve one client with each server thread chapter.
According to the documentation:
Behind the scenes, QString uses implicit sharing (copy-on-write) to reduce memory usage and to avoid the needless copying of data.
Based on this, you shouldn't have any more overhead sending copies of strings through the queued signal/slot connections than you would with your other proposed solutions. So I wouldn't worry about it until and unless it is a demonstrable performance problem.
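For illustration, a small sketch of the queued signal/slot variant. The class names, the event text, and the crude teardown are made up; a cross-thread connect defaults to Qt::QueuedConnection, and QString's implicit sharing means the queued copies share one character buffer until someone modifies it.

```cpp
// main.cpp - broadcast one QString to several worker threads via queued
// signal/slot connections. Build with qmake/CMake so moc processes this file.
#include <QCoreApplication>
#include <QDebug>
#include <QObject>
#include <QString>
#include <QThread>

class Broadcaster : public QObject {
    Q_OBJECT
public:
    void broadcast(const QString& event) { emit eventReady(event); }
signals:
    void eventReady(const QString& event);
};

class ClientWorker : public QObject {
    Q_OBJECT
public slots:
    void onEvent(const QString& event) {           // runs in the worker's own thread
        qDebug() << "worker in" << QThread::currentThread() << "got" << event;
    }
};

int main(int argc, char* argv[]) {
    QCoreApplication app(argc, argv);
    Broadcaster broadcaster;

    const int workerCount = 4;
    QThread threads[workerCount];
    ClientWorker workers[workerCount];
    for (int i = 0; i < workerCount; ++i) {
        workers[i].moveToThread(&threads[i]);
        // Sender and receiver live in different threads, so this connection
        // defaults to Qt::QueuedConnection: the QString is copied into the
        // receiver's event queue, but implicit sharing keeps the data shared.
        QObject::connect(&broadcaster, &Broadcaster::eventReady,
                         &workers[i], &ClientWorker::onEvent);
        threads[i].start();
    }

    broadcaster.broadcast(QStringLiteral("server shutting down at 23:00"));

    QThread::sleep(1);                             // crude: let the queued calls run
    for (auto& t : threads) { t.quit(); t.wait(); }
    return 0;
}

#include "main.moc"
```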