Massive Multithreading Operations

EDITED WITH NEW CODE BELOW
I'm relatively new to multithreading, but to achieve my goal quickly and learn something new, I decided to build a multithreaded app.
The goal: parse a huge number of strings from a file and save every word into the SQLite database using Core Data.
Huge because the number of words is around 300.000 ...
So this is my approach.
Step 1. Parse all the words in the file, placing them into a huge NSArray. (Done quickly.)
Step 2. Create the NSOperationQueue and insert the NSBlockOperations.
The main problem is that the process starts very quickly but slows down soon after. I'm using an NSOperationQueue with the max concurrent operation count set to 100. I have a Core 2 Duo processor (dual core without HT).
I've seen that using NSOperationQueue there is a lot of overhead in creating the NSOperations (with the queue's dispatch suspended, it takes about 3 minutes just to create the 300k NSOperations).
CPU usage goes to 170% when I start dispatching the queue.
I also tried removing the NSOperationQueue and using GCD (the 300k loop then completes instantaneously; see the commented lines), but CPU usage is only 95% and the problem is the same as with NSOperations: the process slows down very soon.
Any tips on how to do this well?
Here is some code (the original question's code):
- (void)inserdWords:(NSArray *)words insideDictionary:(Dictionary *)dictionary {
    NSDate *creationDate = [NSDate date];
    __block NSUInteger counter = 0;
    NSArray *dictionaryWords = [dictionary.words allObjects];
    NSMutableSet *coreDataWords = [NSMutableSet setWithCapacity:words.count];
    NSLog(@"Begin Adding Operations");
    for (NSString *aWord in words) {
        void(^wordParsingBlock)(void) = ^(void) {
            @synchronized(dictionary) {
                NSManagedObjectContext *context = [(PRDGAppDelegate*)[[NSApplication sharedApplication] delegate] managedObjectContext];
                [context lock];
                Word *toSaveWord = [NSEntityDescription insertNewObjectForEntityForName:@"Word" inManagedObjectContext:context];
                [toSaveWord setCreated:creationDate];
                [toSaveWord setText:aWord];
                [toSaveWord addDictionariesObject:dictionary];
                [coreDataWords addObject:toSaveWord];
                [dictionary addWordsObject:toSaveWord];
                [context unlock];
                counter++;
                [self.countLabel performSelectorOnMainThread:@selector(setStringValue:) withObject:[NSString stringWithFormat:@"%lu/%lu", counter, words.count] waitUntilDone:NO];
            }
        };
        [_operationsQueue addOperationWithBlock:wordParsingBlock];
        // dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        // dispatch_async(queue, wordParsingBlock);
    }
    NSLog(@"Operations Added");
}
Thank you in advance.
Edit...
Thanks to Stephen Darlington I rewrote my code and figured out the problem. The most important thing is: do not share Core Data objects between threads... that is, do not mix Core Data objects retrieved from different contexts.
That is what had driven me to use @synchronized(dictionary), which resulted in slow-motion code execution!
Then I removed the massive NSOperation creation, using just MAXTHREADS instances (2 or 4 instead of 300k... a huge difference).
Now I can parse 300k+ strings in just 30-40 seconds. Impressive!!
I still have some issues (it seems to parse more words than there are with just 1 thread, and it doesn't parse all the words when there is more than 1 thread... I need to figure that out), but now the code is really efficient. Maybe the next step could be using OpenCL and pushing the work onto the GPU :)
Here is the new code:
- (void)insertWords:(NSArray *)words forLanguage:(NSString *)language {
    NSDate *creationDate = [NSDate date];
    NSPersistentStoreCoordinator *coordinator = [(PRDGAppDelegate*)[[NSApplication sharedApplication] delegate] persistentStoreCoordinator];
    // The number of words to be parsed by a single thread.
    NSUInteger wordsPerThread = (NSUInteger)ceil((double)words.count / (double)MAXTHREADS);
    NSLog(@"Start Adding Operations");
    // Here I minimized the number of threads. Every thread parses and converts a finite number of words instead of 1 word per thread.
    for (NSUInteger threadIdx = 0; threadIdx < MAXTHREADS; threadIdx++) {
        // The NSBlockOperation.
        void(^threadBlock)(void) = ^(void) {
            // A new context for the current thread.
            NSManagedObjectContext *context = [[NSManagedObjectContext alloc] init];
            [context setPersistentStoreCoordinator:coordinator];
            // Dictionary is now in accordance with the thread's context.
            Dictionary *dictionary = [PRDGMainController dictionaryForLanguage:language usingContext:context];
            // Stat variables. Needed to update the UI.
            NSTimeInterval beginInterval = [[NSDate date] timeIntervalSince1970];
            NSUInteger operationPerInterval = 0;
            // The NSOperation core. It creates a Core Data Word.
            for (NSUInteger wordIdx = 0; wordIdx < wordsPerThread && wordsPerThread * threadIdx + wordIdx < words.count; wordIdx++) {
                // The string to convert.
                NSString *aWord = [words objectAtIndex:wordsPerThread * threadIdx + wordIdx];
                // Some exceptions to skip certain words.
                if (...) {
                    continue;
                }
                // Core Data conversion.
                Word *toSaveWord = [NSEntityDescription insertNewObjectForEntityForName:@"Word" inManagedObjectContext:context];
                [toSaveWord setCreated:creationDate];
                [toSaveWord setText:aWord];
                [toSaveWord addDictionariesObject:dictionary];
                operationPerInterval++;
                NSTimeInterval endInterval = [[NSDate date] timeIntervalSince1970];
                // Update case.
                if (endInterval - beginInterval > UPDATE_INTERVAL) {
                    NSLog(@"Thread %lu processed %lu words", threadIdx, wordIdx);
                    // UI update. It is performed only by the first queue.
                    if (threadIdx == 0) {
                        // UI update code.
                    }
                    beginInterval = endInterval;
                    operationPerInterval = 0;
                }
            }
            // When the NSOperation finishes, the thread's Core Data context is saved.
            [context save:nil];
            NSLog(@"Operation %lu finished", threadIdx);
        };
        // Add the NSBlockOperation to the queue.
        [_operationsQueue addOperationWithBlock:threadBlock];
    }
    NSLog(@"Operations Added");
}

A few thoughts:
Setting max concurrent operations so high is not going to have much effect. It's unlikely to be more than two if you have two cores.
It looks as though you're using the same NSManagedObjectContext for all your operations. This is Not Good.
Let's assume that your max concurrent operations really was 100. The bottleneck would be the main thread, where you're trying to update a label for every operation. Try updating the main thread every n records instead of for every one.
You shouldn't need to lock the context if you're using Core Data correctly... which means using a different context for each thread (see the sketch below).
You don't ever seem to save the context?
Batching operations is a good way to improve performance... but see the previous point.
As you suggest, there's an overhead in creating a GCD operation. Creating a new one for each word is probably not optimal. You need to balance the overhead of creating new operations against the benefits of parallelisation.
In short, threading is hard, even when you use something like GCD.
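To make the per-thread-context and batched-update points concrete, here is a minimal sketch, assuming the shared persistent store coordinator, the words array and the count label from the question (the batch size and variable names are illustrative):

NSUInteger batchSize = 500; // illustrative value: update the UI every 500 words
[operationsQueue addOperationWithBlock:^{
    // One context per operation/thread, sharing the same coordinator.
    NSManagedObjectContext *context = [[NSManagedObjectContext alloc] init];
    [context setPersistentStoreCoordinator:coordinator];
    NSUInteger done = 0;
    for (NSString *aWord in words) {
        Word *toSaveWord = [NSEntityDescription insertNewObjectForEntityForName:@"Word"
                                                         inManagedObjectContext:context];
        [toSaveWord setText:aWord];
        if (++done % batchSize == 0) {
            NSUInteger processed = done; // captured by value in the UI block
            [[NSOperationQueue mainQueue] addOperationWithBlock:^{
                [countLabel setStringValue:[NSString stringWithFormat:@"%lu", (unsigned long)processed]];
            }];
        }
    }
    [context save:NULL]; // save once at the end instead of per word
    [context release];
}];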

It's hard to say without measuring and profiling, but what looks suspicious to me is that you're saving the full dictionary of all the words saved so far along with the save of each word. So the amount of data per save gets successively larger.
// the dictionary at this point contains all words saved so far,
// each of which contains the full dictionary
[toSaveWord addDictionariesObject:dictionary];
// one more word is added each time, so the dictionary keeps growing
[dictionary addWordsObject:toSaveWord];
So, each save is saving more and more data. Why save a dictionary of all words with each word?
Some other thoughts:
why build up coreDataWords when you never use it?
I wonder if you're getting the concurrency you think you are, since you're synchronizing the full block of work.
Things to try:
comment out the dictionary on the toSaveWord, in addition to the dictionary you're building up, and try again - see whether the problem is your data/data structures or the DB/Core Data.
Do the first, but also create a serial version of it, to see whether you're actually getting concurrency benefits.

Related

How to get fields (attributes) out of a single CoreData record without using [index]?

I have one Core Data record that contains all of the app's settings. When I read that single record (using MagicalRecord), I get an array back. My question is: can I get addressability to the individual fields in the record without using "[0]" (a field index), but rather using [@"shopOpens"]?
I was thinking something like this, but I don't think it's right:
NSPredicate *predicate = [NSPredicate predicateWithFormat:@"aMostRecentFlag == 1"]; // find old records
preferenceData = [PreferenceData MR_findAllWithPredicate:predicate inContext:defaultContext]; // source
NSUserDefaults *userDefaults = [NSUserDefaults standardUserDefaults];
NSMutableDictionary *preferencesDict = [[userDefaults dictionaryForKey:@"preferencesDictionary"] mutableCopy]; // target
// start filling the userDefaults from the last Preferences record
/*
Printing description of preferencesDict: {
apptInterval = 15;
colorScheme = Saori;
servicesType = 1;
shopCloses = 2000;
shopOpens = 900;
showServices = 0;
syncToiCloud = 0;
timeFormat = 12;
}
*/
[preferencesDict setObject: preferenceData.colorScheme forKey:@"shopOpens"];
UPDATE
This is how I finally figured it out, for those who have a similar question:
NSPredicate *filter = [NSPredicate predicateWithFormat:@"aMostRecentFlag == 0"]; // find old records
NSFetchRequest *freqest = [PreferenceData MR_requestAllWithPredicate: filter];
[freqest setResultType: NSDictionaryResultType];
NSArray *preferenceData = [PreferenceData MR_executeFetchRequest:freqest]; // an array of NSDictionary, one per record
Disclaimer: I've never used MagicalRecord, so the very first part is just an educated guess.
I imagine that preferenceData is an instance of NSArray: firstly because the method name uses findAll, which indicates that it will return multiple instances; secondly, a normal Core Data fetch returns an array, and there is no obvious reason for that find method to return anything different; thirdly, you referenced using an index operation in your question.
So, preferenceData is most likely an array of all objects in the store that match the specified predicate. You indicated that there is only one such object, which means you can just grab the first one.
PreferenceData *preferenceData = [[PreferenceData
MR_findAllWithPredicate:predicate inContext:defaultContext] firstObject];
Now, unless it is nil, you have the object from the Core Data store.
You should be able to reference it in any way you like to access its attributes.
Note, however, that you can fetch objects from Core Data as dictionaries using NSDictionaryResultType, which may be a better alternative for you.
Also, you can send dictionaryWithValuesForKeys: to a managed object to get a dictionary of specific attributes.
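As a hedged sketch of those last two suggestions, using the attribute names from the dictionary printed above:

// Grab the single matching object, then pull named attributes into a dictionary.
PreferenceData *prefs = [[PreferenceData MR_findAllWithPredicate:predicate
                                                       inContext:defaultContext] firstObject];
NSDictionary *values = [prefs dictionaryWithValuesForKeys:@[@"shopOpens", @"shopCloses", @"colorScheme"]];
id shopOpens = [values objectForKey:@"shopOpens"]; // addressed by name, not by index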

Core Data: iterating over a fetch request in chunks with setFetchLimit only processes half the records

I am trying to process a lot of objects in chunks of a certain size (batchSize). The loop seems to work, but it processes only half the records. The relevant piece of code is:
{
    // Prepare fetching products without images in the database
    NSFetchRequest *productFetchRequest = [NSFetchRequest fetchRequestWithEntityName:@"Product"];
    // Sort by last changed photo first
    NSSortDescriptor *sortDescriptor = [[NSSortDescriptor alloc] initWithKey:@"photoModificationDate" ascending:NO];
    [productFetchRequest setSortDescriptors:@[sortDescriptor]];
    NSPredicate *predicate = [NSPredicate predicateWithFormat: predicateString];
    [productFetchRequest setPredicate:predicate];
    // First get the total count
    NSUInteger numberOfProducts = [self.backgroundMOC countForFetchRequest: productFetchRequest error: &error];
    NSLog(@"Getting images for: %d products", numberOfProducts);
    // Then set the batch size to get chunks of data
    NSUInteger batchSize = 25;
    [productFetchRequest setFetchBatchSize: batchSize];
    [productFetchRequest setFetchLimit:batchSize];
    // Fetch the products in batches
    for (NSUInteger offset = 0; offset < numberOfProducts; offset += batchSize) {
        @autoreleasepool {
            [productFetchRequest setFetchOffset: offset];
            NSArray *products = [self.backgroundMOC executeFetchRequest:productFetchRequest error:&error];
            NSLog(@"Offset: %d, number of products: %d", offset, [products count]);
            if (!products) {
                return NO;
            }
            for (Product *product in products) {
                NSLog(@"Downloading photo for product: %@", product.number);
                [self downLoadAndStoreImageForProduct:product];
            }
            [self saveAndResetBackgroundMOC];
        }
    }
    return YES;
}
The log shows that for the first half of the count (numberOfProducts), it works as expected, and chunks of 25 products are processed. After that first half, the fetch request in the loop returns 0 records.
If I retry the same code, again only half of the (remaining) records are processed, so 3/4 in total.
What am I doing wrong?
Note that the managed object context is not only saved, but also reset after the save to reduce memory. If I do not do this in chunks, the program crashes consistently after downloading about 3000 pictures.
First point: maybe there is some basic misunderstanding about what fetchLimit and fetchBatchSize do.
fetchLimit and fetchOffset determine which and how many records are fetched.
fetchBatchSize indicates how many records should be retrieved during one trip to the persistent store. Thus if (with or without fetchBatchSize) the number of records that would be retrieved is 100, a fetchBatchSize of 25 would result in 4 trips to the store. (In other words, 4 executed SQL statements for the typical SQLite store. However, this all happens behind the scenes.)
Thus, the code snippet
request.fetchLimit = x;
request.fetchBatchSize = x;
is redundant. The number of trips to the store will always be one anyway.
Second point: I am not sure your setup with the second MOC makes a lot of sense. I suppose you are on a background thread already. As far as I know, resetting the MOC is quite expensive, and it is not really necessary if you disable the undo manager of the MOC. As for the looping, I believe you can just fetch all records and let fetchBatchSize take care of the discrete "chunking". Because of Core Data's faulting behavior, the @autoreleasepool in the loop may bring only limited advantage.
Where the @autoreleasepool is useful is when you download the images. Perhaps it is enough to batch that part of the process.
That being said, you might not want to change something that is (sort of) working.
Third point: you calculate the number of records based on an unknown (to us) predicate string. Is it dynamic? Not sure if this might not also be part of the issue. After all, not knowing what it is, it is surprising that the number of records changes.
Finally: check if you can do without resetting your MOC.
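A minimal sketch of that suggestion, reusing the fetch setup and helper names from the question, with periodic saves instead of resets (the save interval is illustrative):

// One fetch; fetchBatchSize pages rows in from the store 25 at a time.
NSFetchRequest *productFetchRequest = [NSFetchRequest fetchRequestWithEntityName:@"Product"];
[productFetchRequest setPredicate:[NSPredicate predicateWithFormat:predicateString]];
[productFetchRequest setFetchBatchSize:25];
NSError *error = nil;
NSArray *products = [self.backgroundMOC executeFetchRequest:productFetchRequest error:&error];
if (!products) {
    return NO;
}
NSUInteger processed = 0;
for (Product *product in products) {
    @autoreleasepool { // the image downloads are where the pool really helps
        [self downLoadAndStoreImageForProduct:product];
    }
    if (++processed % 25 == 0) {
        [self.backgroundMOC save:&error]; // save periodically, without resetting
    }
}
return [self.backgroundMOC save:&error]; // final save for the last partial chunk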
The problem is in the predicate. It fetches all products without an image. If I download images, the result set for the predicate changes on a subsequent fetch and gets smaller every time. The solution is to process the result set in reverse order. So change:
for (NSUInteger offset = 0; offset < numberOfProducts; offset += batchSize)
Into:
for (NSInteger offset = MAX((NSInteger)numberOfProducts - (NSInteger)batchSize, 0); offset >= 0; offset -= batchSize)

Memory leak that I can't seem to solve

So the analyzer is now telling me I have a memory leak. In the function below it says 'potential leak of an object allocated into theAudio'.
I think it speaks the truth, because the app works well for a few minutes and then slowly crashes.
I've tried 'autorelease', but it tells me 'object sent autorelease too many times'.
Sorry to be a pest, but does anybody have any ideas on this?
- (void)playFile:(NSString *)nameOfFile { // plays audio file passed in by a string
    fileLocation = nameOfFile;
    NSString *path = [[NSBundle mainBundle] pathForResource:nameOfFile ofType:@"mp3"];
    AVAudioPlayer *theAudio = [[AVAudioPlayer alloc] initWithContentsOfURL:[NSURL fileURLWithPath:path] error:NULL];
    [theAudio play];
    [fileLocation release];
}
I haven't used this, but you probably need to keep a retain on the player (as you do) and then release it when you're done with it, e.g., when you get one of the AVAudioPlayerDelegate methods (so you need to implement and set the player's delegate).
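Something like this, as a rough sketch under manual reference counting; the theAudio ivar and the AVAudioPlayerDelegate conformance are additions for illustration, not part of the original code:

- (void)playFile:(NSString *)nameOfFile {
    NSString *path = [[NSBundle mainBundle] pathForResource:nameOfFile ofType:@"mp3"];
    [theAudio release]; // drop any previous player before replacing it
    theAudio = [[AVAudioPlayer alloc] initWithContentsOfURL:[NSURL fileURLWithPath:path] error:NULL];
    theAudio.delegate = self; // so we hear about the end of playback
    [theAudio play];
}

// AVAudioPlayerDelegate: playback has finished, so the retain can be dropped.
- (void)audioPlayerDidFinishPlaying:(AVAudioPlayer *)player successfully:(BOOL)flag {
    if (player == theAudio) {
        [theAudio release];
        theAudio = nil;
    }
}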

CoreData, NSManagedObject fetch or create if not exists

I am trying to parse a lot of text files and organize their contents as managed objects. There are a lot of duplicates in the text files, so one of the "collateral" tasks is to get rid of them.
What I am trying to do in this respect is to check whether an entity with the given content exists, and if it doesn't, create one. However, I have different entities with different attributes and relationships. What I want is a kind of function that would take a number of attributes as input and return a new NSManagedObject instance, without my having to worry about whether it was inserted into the data store or fetched from it.
Is there one?
I must also say that I am a noob at Core Data.
Some more detail, if you want:
I am trying to write a sort of dictionary. I have words (Word{NSString *word, <<-> Rule rule}), rules (Rule{NSString name, <->>Word word, <<->PartOfSpeech partOfSpeech, <<-> Ending endings}), and parts of speech (PartOfSpeech{NSString name, <<-> Rule rule}) (I hope the notation is clear).
Two words are equal if they have the same word property and are "linked" to the same rule. Two rules are the same if they have the same endings and part of speech.
So far I've written a method that takes an NSPredicate, an NSManagedObjectContext and an NSEntityDescription as input, queries the data store first, and returns an entity if it finds one, or else creates a new one, inserts it into the data store and returns it. However, in that case I cannot populate the new entity with the necessary data (within that method), so I have to either pass in an NSDictionary with the attribute names and values and insert them, or return by reference a flag saying whether I created a new object or fetched an old one, so that I can populate it outside.
But it looks kind of ugly. I'm sure there must be something more elegant than that, I just couldn't find it. Please help me if you can.
You're basically on the right path. Core Data is an object graph; there's not a lot of dynamism built in. There's also no "upsert": as you surmise, you have to fetch, and if the object doesn't exist, you insert one.
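By way of illustration, a hedged sketch of the kind of generic helper the question describes, taking the attributes as a dictionary and reporting through an out-parameter whether a new object was created (the method and its signature are invented for this example, not an existing Core Data API):

+ (NSManagedObject *)objectForEntityName:(NSString *)entityName
                              attributes:(NSDictionary *)attributes
                                 context:(NSManagedObjectContext *)context
                                 created:(BOOL *)created {
    // Build a predicate requiring every attribute to match.
    NSMutableArray *subpredicates = [NSMutableArray array];
    for (NSString *key in attributes) {
        [subpredicates addObject:[NSPredicate predicateWithFormat:@"%K == %@",
                                  key, [attributes objectForKey:key]]];
    }
    NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
    [request setEntity:[NSEntityDescription entityForName:entityName inManagedObjectContext:context]];
    [request setPredicate:[NSCompoundPredicate andPredicateWithSubpredicates:subpredicates]];
    [request setFetchLimit:1];

    NSError *error = nil;
    NSManagedObject *object = [[context executeFetchRequest:request error:&error] lastObject];
    if (created) {
        *created = (object == nil);
    }
    if (object == nil) {
        // Not found: insert and populate from the same attributes dictionary.
        object = [NSEntityDescription insertNewObjectForEntityForName:entityName
                                               inManagedObjectContext:context];
        [object setValuesForKeysWithDictionary:attributes];
    }
    return object;
}

Relationships would still have to be set by the caller, but passing the attributes as a dictionary at least keeps the fetch and the insert consistent.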
Here is what I have just started using to handle a fetch-or-create scenario. I am using a top-level managed object which contains a few to-many relationships to subordinate objects. I have a class that houses a few arrays of data (not shown here). This class is responsible for saving to and retrieving from Core Data. When the class is created, I do a fetch-or-create to access my top-level NSManagedObject.
@implementation MyDataManagerClass
...
@synthesize MyRootDataMO;

- (MyDataManagerClass *)init {
    if ((self = [super init])) {
        // Init managed object context
        NSManagedObjectContext *managedObjectContext = [(MyAppDelegate *)[[UIApplication sharedApplication] delegate] managedObjectContext];
        // Fetch or create root user data managed object
        NSEntityDescription *entityDescription = [NSEntityDescription entityForName:@"MyRootDataMO" inManagedObjectContext:managedObjectContext];
        NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
        [request setEntity:entityDescription];
        NSError *error = nil;
        NSArray *result = [managedObjectContext executeFetchRequest:request error:&error];
        if (result == nil) {
            NSLog(@"fetch result = nil");
            // Handle the error here
        } else {
            if ([result count] > 0) {
                NSLog(@"fetch saved MO");
                MyRootDataMO = (MyRootDataMO *)[result objectAtIndex:0];
            } else {
                NSLog(@"create new MO");
                MyRootDataMO = (MyRootDataMO *)[NSEntityDescription insertNewObjectForEntityForName:@"MyRootDataMO" inManagedObjectContext:managedObjectContext];
            }
        }
    }
    return self;
}
...

Multi-threading on a foreach loop?

I want to process some data. I have about 25k items in a Dictionary. In a foreach loop, I query a database to get results for each item. The results are added as values to the Dictionary.
foreach (KeyValuePair<string, Type> pair in allPeople)
{
    MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
    MySqlDataReader reader2 = comd.ExecuteReader();
    Dictionary<string, Dictionary<int, Log>> allViews = new Dictionary<string, Dictionary<int, Log>>();
    while (reader2.Read())
    {
        if (!allViews.ContainsKey(reader2.GetString("src")))
        {
            allViews.Add(reader2.GetString("src"), reader2.GetInt32("time"));
        }
    }
    reader2.Close();
    reader2.Dispose();
    allPeople[pair.Key].View = allViews;
}
I was hoping to be able to do this faster by multi-threading. I have 8 threads available, and CPU usage is about 13%. I just don't know if it will work, because it's relying on the MySQL server. On the other hand, maybe 8 threads would open 8 DB connections and so be faster.
Anyway, if multi-threading would help in my case, how? o.O I've never worked with (multiple) threads, so any help would be great :D
MySqlDataReader is stateful - you call Read() on it and it moves to the next row, so each thread needs its own reader, and you need to concoct a query so they get different values. That might not be too hard, as you naturally have many queries with different values of pair.Key.
You also need to either have a temp dictionary per thread, and then merge them, or use a lock to prevent concurrent modification of the dictionary.
The above assumes that MySQL will allow a single connection to perform concurrent queries; otherwise you may need multiple connections too.
First, though, I'd see what happens if you only ask the database for the data you need ("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src") and use GetString(0) and GetInt32(1) instead of using the names to look up the src and time; also, only get the values once from the result.
I'm also not sure on the logic - you are not ordering the log events by time, so which one is the first returned (and so is stored in the dictionary) could be any of them.
Something like this logic - where each of N threads only operates on the Nth pair, each thread has its own reader, and nothing actually changes allPeople, only the properties of the values in allPeople:
private void RunSubQuery(Dictionary<string, Type> allPeople, MySqlConnection con, int threadNumber, int threadCount)
{
    int hoppity = 0; // used to hop over the keys not processed by this thread
    foreach (var pair in allPeople)
    {
        // each of the (threadCount) threads only processes every (threadCount)th key
        if ((hoppity % threadCount) == threadNumber)
        {
            // you may need a con per thread, or it might be that you can share con; I don't know
            MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
            using (MySqlDataReader reader = comd.ExecuteReader())
            {
                var allViews = new Dictionary<string, Dictionary<int, Log>>();
                while (reader.Read())
                {
                    string src = reader.GetString(0);
                    int time = reader.GetInt32(1);
                    // do whatever to allViews with src and time
                }
                // no thread will be modifying the same pair.Value, so this is safe
                pair.Value.View = allViews;
            }
        }
        ++hoppity;
    }
}
This isn't tested - I don't have MySQL on this machine, nor do I have your database and the other types you're using. It's also rather procedural (kind of how you would do it in Fortran with OpenMPI) rather than wrapping everything up in task objects.
You could launch threads for this like so:
void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
{
    lock (allPeople)
    {
        const int threadCount = 8; // the number of threads
        // if it takes 18 seconds currently and you're not at .net 4 yet, then you may as well create
        // the threads here as any saving of using a pool will not matter against 18 seconds
        //
        // it could be more efficient to use a pool so that each thread takes a pair off of
        // a queue, as doing it this way means that each thread has the same number of pairs to process,
        // and some pairs might take longer than others
        Thread[] threads = new Thread[threadCount];
        for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
        {
            int localThreadNumber = threadNumber; // copy, so the lambda doesn't capture the mutating loop variable
            threads[threadNumber] = new Thread(new ThreadStart(() => RunSubQuery(allPeople, connection, localThreadNumber, threadCount)));
            threads[threadNumber].Start();
        }
        // wait for all threads to finish
        for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
        {
            threads[threadNumber].Join();
        }
    }
}
The extra lock held on allPeople is done so that there is a write barrier after all the threads return; I'm not quite sure if it's needed. Any object would do.
Nothing in this guarantees any performance gain - it might be that the MySQL libraries are single threaded, but the server certainly can handle multiple connections. Measure with various numbers of threads.
If you're using .net 4, then you don't have to mess around creating the threads or skipping the items you aren't working on:
// this time using .net 4 parallel; assumes that connection is thread safe
static void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
{
    Parallel.ForEach(allPeople, pair => RunPairQuery(pair, connection));
}

private static void RunPairQuery(KeyValuePair<string, Type> pair, MySqlConnection connection)
{
    MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", connection);
    using (MySqlDataReader reader = comd.ExecuteReader())
    {
        var allViews = new Dictionary<string, Dictionary<int, Log>>();
        while (reader.Read())
        {
            string src = reader.GetString(0);
            int time = reader.GetInt32(1);
            // do whatever to allViews with src and time
        }
        // no iteration will be modifying the same pair.Value, so this is safe
        pair.Value.View = allViews;
    }
}
The biggest problem that comes to mind is that you are going to use multithreading to add values to a dictionary, which isn't thread safe.
You'll have to do something like this to make it work, and you might not get that much of a benefit from implementing it this way, as it still has to lock the dictionary object to add a value.
Assumptions:
There is a table People in your database.
There are a lot of people in your database.
Each database query adds overhead, and you are doing one DB query for each of the people in your database. I would suggest it is faster to get all the data back in one query than to make repeated calls:
select l.ip,l.time,l.src
from logs l, people p
where l.ip = p.ip
group by l.ip, l.src
Try this with a loop in a single thread; I believe this will be much faster than your existing code.
Within your existing code, another thing you can do is to take the creation of the MySqlCommand out of the loop, prepare it in advance and just change the parameter. This should speed up execution of the SQL. See http://dev.mysql.com/doc/refman/5.0/es/connector-net-examples-mysqlcommand.html#connector-net-examples-mysqlcommand-prepare
MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = ?key GROUP BY src", con);
comd.Prepare();
comd.Parameters.Add("?key", "example");
foreach (KeyValuePair<string, Type> pair in allPeople)
{
    comd.Parameters[0].Value = pair.Key;
If you are using multiple threads, each thread will still need its own command; at least in MS SQL this would still be faster even if you recreated and prepared the statement every time, due to the SQL server's ability to cache the execution plan of a parameterised statement.
Before you do anything else, find out exactly where the time is being spent. Check the execution plan of the query. The first thing I'd suspect is a missing index on logs.IP.
18 minutes for something like this seems much too long to me. Even if you can cut the execution time in eight by adding more threads (which is unlikely!) you still end up using more than 2 minutes. You could probably read the whole 25k rows into memory in less than five seconds and do the necessary processing in memory...
EDIT: Just to clarify, I'm not advocating actually doing this in memory, just saying that it looks like there's a bigger bottleneck here that can be removed.
I think if you are running this on a multi-core machine you could gain benefits from multi-threading.
However, the way I would approach it is to first look at unblocking the thread you are currently using by making asynchronous database calls. The callbacks will execute on background threads, so you will get some multi-core benefit there, and you won't be blocking threads waiting for the DB to come back.
For IO-intensive apps like this example, you are likely to see improved throughput depending on what load the DB can handle. Assuming the DB scales to handle more than one concurrent request, you should be good.
Thanks everyone for your help. Currently I am using this:
for (int i = 0; i < 8; i++)
{
    ThreadPool.QueueUserWorkItem(addDistinctScres, i);
}
ThreadPool to run all the threads. I use the method provided by Pete Kirkham, and I'm creating a new connection per thread.
Times went down to 4 minutes.
Next I'll make something wait for the callback of the thread pool before performing other functions.
I think the bottleneck now is the MySQL server, because the CPU usage has dropped.
@odd parity I thought about that, but the real thing is waaay more than 25k rows. Idk if that'd work.
This sounds like the perfect job for map/reduce. I am not a .NET programmer, but this seems like a reasonable guide:
http://ox.no/posts/minimalistic-mapreduce-in-net-4-0-with-the-new-task-parallel-library-tpl
