I'm creating a Windows console application that will read a text file line by line and extract data from fixed-length strings. The application is written as a Windows application for now but will be converted to a Windows console app later on. I've noticed that it takes a while for the application to run: reading the text file, inserting into the database, and exporting out of the database.
Would it help speed up the process if I used multiple threads? I'm thinking one thread to read the data and another thread to insert the data into the database.
Any suggestions?
Edit: the application is going to be written in VB.NET.
I will assume this is an SQL database.
Your problem is likely to be that you are doing one item at a time. SQL hates that. SQL and SQL databases operate on sets of items.
So, open a transaction, read and insert 1,000 items, then commit. Keep those items around in case the transaction commit fails for some reason, so that you can retry.
I have managed to speed up some Perl scripts doing work that sounds similar to your description by over 20x with this technique.
I do not know the Microsoft library that you are using, but here is a sample in Perl using DBI. The parts that make it work are AutoCommit => 0 and $dbh->commit.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbname = 'urls';
my $user   = 'postgres';
my $pass   = '';

# AutoCommit => 0 is what lets us batch many inserts into one transaction.
my $dbh = DBI->connect(
    "DBI:Pg:dbname=$dbname",
    $user,
    $pass,
    { 'RaiseError' => 1, AutoCommit => 0 }
);

my $insert = $dbh->prepare('
    INSERT INTO todo (domain, path)
    VALUES (?, ?)
');

my $count = 0;
while (<>) {
    # Commit every 1,000 rows instead of once per row.
    if ($count++ % 1000 == 0) {
        $dbh->commit;
    }
    chomp;
    my ($one, $two) = split;
    $insert->execute($one, $two);
}
$dbh->commit;
$dbh->disconnect;
With multiple threads, you may be able to get some overlap - one thread is reading from disk while another thread is doing a database insert. I'm guessing that you probably won't see that much of an improvement - unless you're reading very large files, most of your time is probably spent inserting into the database, and the time in disk I/O is just noise.
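For what it's worth, here is a minimal producer/consumer sketch of that overlap in Perl (the question is about VB.NET, where a thread pool or a blocking collection plays the same role); the table, connection string, parsing, and 1,000-row commit batch are placeholders carried over from the DBI example above, not the asker's actual schema:

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use DBI;

my $queue = Thread::Queue->new;

# Consumer thread: opens its own connection and inserts whatever the reader queues up.
my $inserter = threads->create(sub {
    my $dbh = DBI->connect('DBI:Pg:dbname=urls', 'postgres', '',
                           { RaiseError => 1, AutoCommit => 0 });
    my $sth   = $dbh->prepare('INSERT INTO todo (domain, path) VALUES (?, ?)');
    my $count = 0;
    while (defined(my $line = $queue->dequeue)) {
        my ($domain, $path) = split ' ', $line;   # placeholder parsing; the real app extracts fixed-length fields
        $sth->execute($domain, $path);
        $dbh->commit if ++$count % 1000 == 0;     # batch the commits, as in the example above
    }
    $dbh->commit;
    $dbh->disconnect;
});

# Producer: the main thread reads the file and feeds the queue.
while (my $line = <>) {
    chomp $line;
    $queue->enqueue($line);
}
$queue->end;        # no more work coming
$inserter->join;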
It's impossible to say in general - the only way to find out is to build the app and test the performance. The bottleneck is likely to be the DB insert, but whether multi-threading will speed things up depends on a host of factors:
are your app and the db server running on the same machine?
do they use the same disk?
can one insert cause contention with another?
You get the idea. Having said that, I have written servers in the finance industry where multi-threading the DB access did make a huge difference. But these were talking to a gigantic Sun enterprise server which had database I/Os to spare, so flooding it with requests from a multi-threaded app made sense.
Committing data to the database is a time-intensive operation. Try collecting items in batches (say 1,000) and submitting these batches to the database rather than submitting the items one by one. This should improve your performance. Multithreading is overkill for this type of application.
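One way to do that with Perl's DBI, sketched against the same placeholder schema as the example above, is to buffer rows and hand them to execute_array so each batch is submitted in one go rather than row by row:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:Pg:dbname=urls', 'postgres', '',
                       { RaiseError => 1, AutoCommit => 0 });
my $sth = $dbh->prepare('INSERT INTO todo (domain, path) VALUES (?, ?)');

my (@domains, @paths);

sub flush_batch {
    return unless @domains;
    # Bind whole columns at once; the driver decides how to ship them efficiently.
    $sth->execute_array({ ArrayTupleStatus => \my @status }, \@domains, \@paths);
    $dbh->commit;
    @domains = ();
    @paths   = ();
}

while (my $line = <>) {
    chomp $line;
    my ($domain, $path) = split ' ', $line;   # placeholder parsing
    push @domains, $domain;
    push @paths,   $path;
    flush_batch() if @domains == 1000;        # submit in batches of 1,000
}
flush_batch();                                # submit the final partial batch
$dbh->disconnect;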
You probably wouldn't gain much from that, as the task you're outlining here is pretty much sequential in nature.
You won't know if multithreading will help until you build the application, but it seems that you really just want better performance. Before doing anything you need to measure the performance of the application. Perhaps there is some code that is inefficient, so use a profiler to identify bottlenecks.
Multiple threads do not always improve performance. Basic multithreading only helps if the activities can truly be executed in parallel. If a lot of I/O is being done while reading the data, then it is worth a try. The best way to find out is to prototype and verify.
What are you using to build the Windows app? If you are using .NET, use the thread pool. There is a nice library called Power Threading developed by Jeff Richter.
Also, understand how threads work in the Windows OS. Adding multiple threads sometimes may not help, and I often do not encourage it.
I have never used threads before, but think I may have encountered an opportunity:
I have written a script that chews through an array of ~500 Excel files, and uses Spreadsheet::ParseExcel to pull values from specific sheets in the workbook (on average, two sheets per workbook; one cell extracted per sheet).
Running it now, where I just go through the array of files one by one and extract the relevant info from the file, it takes about 45 minutes to complete.
My question is: is this an opportunity to use threads, and have more than one file get hit at a time*, or should I maybe just accept the 45 minute run time?
(* - if this is a gross misunderstanding of what I can do with threads, please say so!)
Thanks in advance for any guidance you can offer!
Edit - adding example code. The code below is a sub that is called in a foreach loop for each file location stored in an array:
# Init the parser
my $parser = Spreadsheet::ParseExcel->new;
my $workbook = $parser->parse($inputFile)
    or die("Unable to load $inputFile: " . $parser->error());

# Get a list of any sheets that have 'QA' in the sheet name
my @sheetsToScan;
foreach my $sheet ($workbook->worksheets) {
    if ($sheet->get_name =~ m/QA/) {
        push @sheetsToScan, $sheet->get_name;
    }
}
shift @sheetsToScan;    # drop the first matching sheet

# Extract the value from the appropriate cell
foreach (@sheetsToScan) {
    my $worksheet = $workbook->worksheet($_);
    my ($cell, $value);
    if ($_ =~ m/Production/ or $_ =~ m/Prod/) {
        $cell  = $worksheet->get_cell(1, 1);
        $value = $cell ? $cell->value : undef;
        if (not defined $value) {
            $value = "Not found.";
        }
    } else {
        $cell  = $worksheet->get_cell(6, 1);
        $value = $cell ? $cell->value : undef;
        if (not defined $value) {
            $value = "Not found.";
        }
    }
    push(@outputBuffer, $line);    # $line is assembled from $value elsewhere (not shown)
}
Threads (or using multiple processes with fork) allow your script to utilize more than one CPU at a time. For many tasks, this can save a lot of "user time" but will not save "system time" (and may even increase system time to handle the overhead of starting and managing threads and processes). Here are the situations where threading/multiprocessing will not be helpful:
the task of your script does not lend itself to parallelization -- when each step of your algorithm depends on the previous steps
the task your script performs is fast and lightweight compared to the overhead of creating and managing a new thread or new process
your system only has one CPU or your script is only enabled to use one CPU
your task is constrained by a different resource than CPU, such as disk access, network bandwidth, or memory -- if your task involves processing large files that you download through a slow network connection, then your network is the bottleneck, and processing the file on multiple CPUs will not help. Likewise, if your task consumes 70% of your system's memory, then using a second and third thread will require paging to your swap space and will not save any time. Parallelization will also be less effective if your threads compete for some synchronized resource -- file locks, database access, etc.
you need to be considerate of other users on your system -- if you are using all the cores on a machine, then other users will have a poor experience
[added, threads only] your code uses any package that is not thread-safe. Most pure Perl code will be thread-safe, but packages that use XS may not be
[added] when you are still actively developing your core task. Debugging is a lot harder in parallel code
Even if none of these apply, it is sometimes hard to tell how much a task will benefit from parallelization, and the only way to be sure is to actually implement the parallel task and benchmark it. But the task you have described looks like it could be a good candidate for parallelization.
It seems to me that your task should benefit from multiple streams of execution (processes or threads), as it seems to have a very roughly even blend of I/O and CPU. I would expect a speedup of a factor of a few, but it is hard to tell without knowing details.
One way is to break the list of files into groups, as many as there are cores that you can spare. Then process each group in a fork, which assembles its results and passes them back to the parent once done, via a pipe or files. There are modules that do this and much more, for example Forks::Super or Parallel::ForkManager. They also offer a queue, another approach you can use.
I do this regularly when a lot of data in files is involved and get near linear speedup with up to 4 or 5 cores (on NFS), or even with more cores depending on the job details and on hardware.
I would cautiously assert that this may be simpler than threads, so it may be worth trying first.
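As a rough sketch of the fork approach (not a drop-in solution): this uses Parallel::ForkManager's data-structure retrieval to pass each child's results back to the parent, and assumes a process_file() sub that wraps the Spreadsheet::ParseExcel code from the question.

use strict;
use warnings;
use Parallel::ForkManager;

my @files   = @ARGV;                      # the ~500 workbook paths
my $workers = 4;                          # however many cores you can spare
my %results;

my $pm = Parallel::ForkManager->new($workers);

# Collect each child's results in the parent as the child finishes.
$pm->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $data) = @_;
    %results = (%results, %$data) if $data;
});

# Hand each worker a whole group of files, not one file at a time.
my $group_size = int(@files / $workers) + 1;
while (@files) {
    my @group = splice(@files, 0, $group_size);
    $pm->start and next;                  # parent keeps looping; the child runs the code below
    my %partial = map { $_ => process_file($_) } @group;   # process_file(): assumed parsing sub
    $pm->finish(0, \%partial);            # serialized back to run_on_finish above
}
$pm->wait_all_children;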
Another way would be to create a thread queue (Thread::Queue) and feed it the filename groups. Note that Perl's threads are not the lightweight "threads" one might expect; quite the opposite, they are heavy, they copy everything to each thread (so start them up front, before there is much data in the program), and they come with other subtleties as well. Have a small number of workers with a sizable job (a nice list of files) each, instead of many threads rapidly working with the queue.
In this approach, too, be careful about how to pass results back since frequent communication poses a significant overhead for (Perl's) threads.
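A corresponding sketch with threads and Thread::Queue, under the same assumption of a process_file() sub: the workers start before any large data is loaded, each one pulls sizable batches of filenames, and results travel back only once per thread, via join.

use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new;

# Start a small number of workers up front, before the program accumulates data.
my @workers = map {
    threads->create(sub {
        my %partial;
        while (defined(my $batch = $queue->dequeue)) {
            $partial{$_} = process_file($_) for @$batch;   # process_file(): assumed parsing sub
        }
        return \%partial;                                  # handed back to the parent on join()
    });
} 1 .. 4;

# Feed the queue batches of filenames, then signal that no more work is coming.
my @files = @ARGV;
while (@files) {
    my @batch = splice(@files, 0, 50);
    $queue->enqueue(\@batch);
}
$queue->end;

# Merge results; communication with each thread happens exactly once.
my %results = map { %{ $_->join } } @workers;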
In either case it is important that the groups are formed so as to provide a balanced workload per thread/process. If this is not possible (you may not know which files will take much longer than others), then have threads take smaller batches, while for forks use a queue from one of the modules.
Handing only a file or a few to a thread or a process is most likely far too light a workload, in which case the management overhead may erase (or reverse) possible speed gains. The I/O overlap across threads/processes would also increase, which is the main limit to speedup here.
The optimal number of files to pass to a thread/process is hard to estimate, even with all the details on hand; you just have to try. I assume that the reported runtime (over 5 seconds per file) is due to some inefficiency which can be removed, so first check your code for undue inefficiencies. If a file somehow really takes that long to process, then start by passing a single file at a time to the queue.
Also, please consider mob's answer carefully. And note that these are advanced techniques.
What you do is just change "for ...." into "mce_loop ...." and you'll see the boost, although I suggest you take a look at MCE::Loop first.
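For instance, a sketch of what that change might look like with MCE::Loop, again assuming a process_file() sub that does the per-workbook parsing; max_workers and chunk_size are knobs to tune:

use strict;
use warnings;
use MCE::Loop;

MCE::Loop::init {
    max_workers => 4,        # number of worker processes
    chunk_size  => 25,       # files handed to a worker at a time
};

my @files = @ARGV;

# mce_loop replaces the plain foreach; MCE->gather ships results back to the parent.
my %results = mce_loop {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    for my $file (@$chunk_ref) {
        MCE->gather($file, process_file($file));   # process_file(): assumed parsing sub
    }
} @files;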
I have written a Sybase stored procedure to move data from certain tables [~50] on the primary db for a given id to an archive db. Since it's taking a very long time to archive, I am thinking of executing the same stored procedure in parallel, with a unique input id for each call.
I manually ran the stored proc twice at the same time with different inputs and it seems to work. Now I want to use Perl threads [maximum 4 threads], with each thread executing the same procedure with different input.
Please advise whether this is the recommended way, or if there is another, more efficient way to achieve this. If the experts' choice is threads, any pointers or examples would be helpful.
What you do in Perl does not really matter here: what matters is what happens on the side of the Sybase server. Assuming each client task creates its own connection to the database, then it's all fine, and how the client achieves this makes no difference to the Sybase server. But do not use a model where the different client tasks try to use the same client-server connection, as that will never run in parallel.
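To illustrate (only a sketch; the server name, credentials, and procedure name archive_for_id are made up, and DBD::Sybase is assumed as the driver): each Perl thread opens its own connection and runs the proc for one id, so the four calls really do run in parallel on the server.

use strict;
use warnings;
use threads;
use DBI;       # load DBI before creating any threads

my @ids = (101, 102, 103, 104);    # placeholder input ids, at most 4 in flight

my @workers = map {
    my $id = $_;
    threads->create(sub {
        # Each thread gets its OWN connection; never share a handle across threads.
        my $dbh = DBI->connect('dbi:Sybase:server=PRIMARY', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 1 });
        $dbh->do("exec archive_for_id $id");   # $id comes from our own list, not user input
        $dbh->disconnect;
        return $id;
    });
} @ids;

$_->join for @workers;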
No 'answer' per se, but some questions/comments:
Can you quantify "taking a very long time to archive"? Assuming your archive process consists of a mix of insert/select and delete operations, do query plans and MDA data show fast, efficient operations? If you're seeing table scans, sort merges, deferred inserts/deletes, etc., then it may be worth the effort to address said performance issues.
Can you expand on the comment that running two stored proc invocations at the same time seems to work? Again, any sign of performance issues for the individual proc calls? Any sign of contention (eg, blocking) between the two proc calls? If the archival proc isn't designed properly for parallel/concurrent operations (eg, eliminate blocking), then you may not be gaining much by running multiple procs in parallel.
How many engines does your dataserver have, and are you planning on running your archive process during a period of moderate-to-heavy user activity? If the current archive process runs at/near 100% cpu utilization on a single dataserver engine, then spawning 4 copies of the same process could see your archive process tying up 4 dataserver engines with heavy cpu utilization ... and if your dataserver doesn't have many engines ... combined with moderate-to-heavy user activity at the same time ... you could end up invoking the wrath of your DBA(s) and users. Net result is that you may need to make sure your archive process doesn't hog the dataserver.
One other item to consider, and this may require input from the DBAs ... if you're replicating out of either database (source or archive), increasing the volume of transactions per a given time period could have a negative effect on replication throughput (ie, an increase in replication latency); if replication latency needs to be kept at a minimum, then you may want to rethink your entire archive process from the point of view of spreading out transactional activity enough so as to not have an effect on replication latency (eg, single-threaded archive process that does a few insert/select/delete operations, sleeps a bit, then does another batch, then sleeps, ...).
It's been my experience that archive processes are not considered high-priority operations (assuming they're run on a regular basis, and before the source db fills up); this in turn means the archive process is usually designed so that it's efficient while at the same time putting a (relatively) light load on the dataserver (think: running as a trickle in the background) ... ymmv ...
I can't access the database from multiple threads. I get an exception: database is locked or database is busy. I don't understand why the database is locked when I read or write in different tables.
I tried the code below to enable multithreading:
SQLite3.Config(SQLite3.ConfigOption.MultiThread);
It's not working. Does anyone know why? I really need this!
If you have a multi-threaded application, then both threads are free to update the DB. But inside the DB, the first update will take a lock on the rows you are trying to update, and if the second update also tries to work on the locked rows, then you have the possibility of getting "locked" or "busy" if the first update request takes more than x amount of time, where "x" is configurable.
From the SQLite web site:
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
So, you can use SQLite from different threads for reading, but not for writing concurrently. There are many answers for this on Stack Overflow. See for instance: How to use SQLite in a multi-threaded application?
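If it helps, the usual mitigations are a busy timeout (sqlite3_busy_timeout in the C API, exposed by most bindings) plus keeping write transactions short. A minimal sketch using Perl's DBD::SQLite, with a made-up table, just to show the shape of it:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=app.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Wait up to 5 seconds for a competing writer instead of failing with "database is busy".
$dbh->sqlite_busy_timeout(5000);

# Keep each write transaction short so the single writer slot frees up quickly.
$dbh->begin_work;
$dbh->do('INSERT INTO log (message) VALUES (?)', undef, 'hello');   # log: placeholder table
$dbh->commit;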
I have a project where I should use multiple tables to avoid keeping duplicated data in my SQLite file (even though I knew using several tables would be a nightmare).
In my application I am reading data from one table in one method and inserting data into another table in another method. When I do this, I am getting error code 21, which is SQLITE_MISUSE, from the sqlite step function.
According to my research, that was because I was not able to reach the tables from multiple threads.
Up to now, I have read the SQLite website and learned that there are 3 modes for configuring an SQLite database:
1) single-thread: you have no chance to call it from several threads.
2) multi-thread: yes, multiple threads, but there are some obstacles.
3) serialized: this is the best match for multithreaded database applications.
If sqlite3_threadsafe() == 2 returns true, then yes, your SQLite is serialized; this returned true, so I proved it for myself.
Then I have code to configure my SQLite database as serialized, to take it under guarantee:
sqlite3_config(SQLITE_CONFIG_SERIALIZED);
When I use the above code in the class where I read and insert data from 1 table, it works perfectly :). But if I try to use it in the class where I read and insert data from 2 tables (actually where I really need it), the SQLITE_MISUSE problem comes up.
I checked my code where I open and close the database; there is no problem with them. They work unless I delete the other one.
I am using iOS 5 and this is really a big problem for my project. I heard that Instagram uses PostgreSQL; maybe this was the reason? Would you suggest PostgreSQL or SQLite in the first place?
It seems to me like you've got two things mixed up.
Single vs. multi-threaded
Single threaded builds are only ever safe to use from one thread of your code because they lack the mechanisms (mutexes, critical sections, etc.) internally that permit safe use from several. If you are using multiple threads, use a multi-threaded build (or expect “interesting” trouble; you have been warned).
SQLite's thread support is pretty simple. With a multi-threaded build, particular connections should only be used from a single thread (except that they can be initially opened in another).
All recent (last few years?) SQLite builds are happy with access to a single database from multiple processes, but the degree of parallelism depends on the…
Transaction type
SQL in general supports multiple types of transaction. SQLite supports only a subset of them, and its default is SERIALIZABLE. This is the safest mode of access; it simulates what you would see if only one thing could happen at a time. (Internally, it's implemented using a scheme that lets many readers in at once, but only one writer; there's some cleverness to prevent anyone from starving anyone else.)
SQLite also supports read-uncommitted transactions. This increases the amount of parallelism available to code, but at the risk of readers seeing information that's not yet been guaranteed to persist. Whether this matters to you depends on your application.
I'm returning A LOT (500k+) of documents from a MongoDB collection in Node.js. It's not for display on a website, but rather for some number crunching. If I grab ALL of those documents, the system freezes. Is there a better way to grab them all?
I'm thinking pagination might work?
Edit: This is already outside the main node.js server event loop, so "the system freezes" does not mean "incoming requests are not being processed"
After learning more about your situation, I have some ideas:
Do as much as you can in a Map/Reduce function in Mongo - perhaps if you throw less data at Node that might be the solution.
Perhaps this much data is eating all the memory on your system. Your "freeze" could be V8 stopping the system to do a garbage collection (see this SO question). You could use the V8 flag --trace-gc to log GCs and prove this hypothesis (thanks to another SO answer about V8 and garbage collection).
Pagination, like you suggested, may help. Perhaps even splitting up your data further into worker queues (create one worker task with references to records 1-10, another with references to records 11-20, etc.), depending on your calculation.
Perhaps pre-processing your data - ie: somehow returning much smaller data for each record. Or not using an ORM for this particular calculation, if you're using one now. Making sure each record has only the data you need in it means less data to transfer and less memory your app needs.
I would put your big fetch+process task on a worker queue, background process, or forking mechanism (there are a lot of different options here).
That way you do your calculations outside of your main event loop and keep that free to process other requests. While you should be doing your Mongo lookup in a callback, the calculations themselves may take up time, thus "freezing" node - you're not giving it a break to process other requests.
Since you don't need them all at the same time (that's what I've deduced from your asking about pagination), perhaps it's better to separate those 500k documents into smaller chunks to be processed on nextTick?
You could also use something like Kue to queue the chunks and process them later (thus not processing everything at the same time).