azure-sdk-for-java eventhubs Partition has been lost

We recently deployed the Azure Event Hubs Java receiver/listener client by following the Azure docs.
I observed the following error raised from processError and also from processPartitionClose:
Error occurred in partition14 - connectionId[MF_5fba9c_1636350888640] sessionName[eventhub-name/ConsumerGroups/consumer-group-name/Partitions/14] entityPath[eventhub-name/ConsumerGroups/consumer-group-name/Partitions/14] linkName[14_500701_1636350888641] Cannot create receive link from a closed session., errorContext[NAMESPACE: namespace.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: eventhub-name/ConsumerGroups/consumer-group-name/Partitions/14]
ERROR | Partition has been lost 14 reason LOST_PARTITION_OWNERSHIP
Questions:
Does the azure-sdk-for-java eventhubs client reconnect automatically after such a partition loss?
If not, what is the best practice before restarting manually?
Do I need to update the checkpoint manually?
Do I need to do anything about the ownership?
This is our SDK setup, with sample code:
EventProcessorClientBuilder eventProcessorClientBuilder = new EventProcessorClientBuilder()
        .checkpointStore(new BlobCheckpointStore(blobContainerAsyncClient))
        .connectionString(getEventHubConnectionString(), getEventHubName())
        .consumerGroup(getConsumerGroup())
        .initialPartitionEventPosition(initialPartitionEventPosition)
        .processEvent(PARTITION_PROCESSOR)
        .processError(ERROR_HANDLER)
        .processPartitionClose(CLOSE_HANDLER);
EventProcessorClient eventProcessorClient = eventProcessorClientBuilder.buildEventProcessorClient();
// Starts the event processor
eventProcessorClient.start();

private final Consumer<ErrorContext> ERROR_HANDLER = errorContext -> {
    log.error("Error occurred in partition" + errorContext.getPartitionContext().getPartitionId()
            + " - " + errorContext.getThrowable().getMessage());
};

private final Consumer<CloseContext> CLOSE_HANDLER = closeContext -> {
    log.error("Partition has been lost " + closeContext.getPartitionContext().getPartitionId()
            + " reason " + closeContext.getCloseReason());
    EventContext lastContext = lastEvent.get();
    if (lastContext != null && (lastContext.getEventData().getSequenceNumber() % 10) != 0) {
        lastContext.updateCheckpoint();
    }
};
jdk : 1.8
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-eventhubs-checkpointstore-blob</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-eventhubs</artifactId>
    <version>5.10.1</version>
</dependency>
I did come across github-issue-15164, but could not find it mentioned anywhere.

Does the azure-sdk-for-java eventhubs client reconnect automatically after such a partition loss?
Yes, the EventProcessorClient in the azure-messaging-eventhubs library will reconnect to such partitions automatically. You don't need to change anything manually.
If there are multiple instances of EventProcessorClient running, all processing events from the same Event Hub and using the same consumer group, then you may see this LOST_PARTITION_OWNERSHIP error on one processor because ownership of a partition has been claimed by another processor. The checkpoints are read from the checkpoint store (Blob Storage in your code sample above) and processing resumes from the next sequence number.
Please refer to partition ownership and checkpointing for more details.
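As an illustration (not something the SDK requires), here is a minimal sketch of a close handler that always checkpoints the last processed event before the partition is handed off. lastEventPerPartition is a hypothetical map you would populate inside PARTITION_PROCESSOR; the logger and field names simply mirror the sample above, and the usual java.util / com.azure.messaging.eventhubs.models imports are omitted as in that sample:
// Hypothetical field, updated in processEvent for each partition.
private final Map<String, EventContext> lastEventPerPartition = new ConcurrentHashMap<>();

private final Consumer<CloseContext> CLOSE_HANDLER = closeContext -> {
    String partitionId = closeContext.getPartitionContext().getPartitionId();
    log.info("Partition " + partitionId + " closed, reason " + closeContext.getCloseReason());
    // Persist the latest position so whichever processor claims this partition next
    // resumes from the checkpoint store instead of reprocessing older events.
    EventContext lastContext = lastEventPerPartition.remove(partitionId);
    if (lastContext != null) {
        lastContext.updateCheckpoint();
    }
};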

Related

MongoDB - Error: getMore command failed: Cursor not found

I need to create a new field sid on each document in a collection of about 500K documents. Each sid is unique and based on that record's existing roundedDate and stream fields.
I'm doing so with the following code:
var cursor = db.getCollection('snapshots').find();
var iterated = 0;
var updated = 0;

while (cursor.hasNext()) {
    var doc = cursor.next();

    if (doc.stream && doc.roundedDate && !doc.sid) {
        db.getCollection('snapshots').update({ "_id": doc['_id'] }, {
            $set: {
                sid: doc.stream.valueOf() + '-' + doc.roundedDate,
            }
        });
        updated++;
    }

    iterated++;
};

print('total ' + cursor.count() + ' iterated through ' + iterated + ' updated ' + updated);
It works well at first, but after a few hours and about 100K records it errors out with:
Error: getMore command failed: {
"ok" : 0,
"errmsg": "Cursor not found, cursor id: ###",
"code": 43,
}: ...
EDIT - Query performance:
As @NeilLunn pointed out in his comments, you should not be filtering the documents manually, but use .find(...) for that instead:
db.snapshots.find({
    roundedDate: { $exists: true },
    stream: { $exists: true },
    sid: { $exists: false }
})
Also, using .bulkWrite(), available as of MongoDB 3.2, will be far more performant than doing individual updates.
It is possible that, with that, you are able to execute your query within the 10-minute lifetime of the cursor. If it still takes longer than that, your cursor will expire and you will have the same problem anyway, which is explained below:
What is going on here:
Error: getMore command failed may be due to a cursor timeout, which is related to two cursor attributes:
Timeout limit, which is 10 minutes by default. From the docs:
By default, the server will automatically close the cursor after 10 minutes of inactivity, or if client has exhausted the cursor.
Batch size, which is 101 documents or 16 MB for the first batch, and 16 MB, regardless of the number of documents, for subsequent batches (as of MongoDB 3.4). From the docs:
find() and aggregate() operations have an initial batch size of 101 documents by default. Subsequent getMore operations issued against the resulting cursor have no default batch size, so they are limited only by the 16 megabyte message size.
Probably you are consuming those initial 101 documents and then getting a 16 MB batch, which is the maximum, with a lot more documents. As it is taking more than 10 minutes to process them, the cursor on the server times out and, by the time you are done processing the documents in the second batch and request a new one, the cursor is already closed:
As you iterate through the cursor and reach the end of the returned batch, if there are more results, cursor.next() will perform a getMore operation to retrieve the next batch.
Possible solutions:
I see 5 possible ways to solve this: 3 good ones, with their pros and cons, and 2 bad ones:
👍 Reducing the batch size to keep the cursor alive.
👍 Remove the timeout from the cursor.
👍 Retry when the cursor expires.
👎 Query the results in batches manually.
👎 Get all the documents before the cursor expires.
Note they are not numbered following any specific criteria. Read through them and decide which one works best for your particular case.
1. 👍 Reducing the batch size to keep the cursor alive
One way to solve this is to use cursor.batchSize to set the batch size on the cursor returned by your find query, matching what you can process within those 10 minutes:
const cursor = db.collection.find()
.batchSize(NUMBER_OF_DOCUMENTS_IN_BATCH);
However, keep in mind that setting a very conservative (small) batch size will probably work, but will also be slower, as now you need to access the server more times.
On the other hand, setting it to a value too close to the number of documents you can process in 10 minutes means that, if some iterations take a bit longer for any reason (other processes may be consuming more resources), the cursor will expire anyway and you will get the same error again.
2. 👍 Remove the timeout from the cursor
Another option is to use cursor.noCursorTimeout to prevent the cursor from timing out:
const cursor = db.collection.find().noCursorTimeout();
This is considered a bad practice as you would need to close the cursor manually or exhaust all its results so that it is automatically closed:
After setting the noCursorTimeout option, you must either close the cursor manually with cursor.close() or by exhausting the cursor's results.
As you want to process all the documents in the cursor, you wouldn't need to close it manually, but it is still possible that something else goes wrong in your code and an error is thrown before you are done, thus leaving the cursor open.
If you still want to use this approach, use a try-catch to make sure you close the cursor if anything goes wrong before you consume all its documents.
Note I don't consider this a bad solution (therefore the 👍), as even though it is considered a bad practice...:
It is a feature supported by the driver. If it were so bad, given that there are alternative ways to get around timeout issues, as explained in the other solutions, it wouldn't be supported.
There are ways to use it safely, it's just a matter of being extra cautious with it.
I assume you are not running this kind of query regularly, so the chances that you start leaving open cursors everywhere are low. If this is not the case, and you really need to deal with these situations all the time, then it does make sense not to use noCursorTimeout.
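If you take the noCursorTimeout route from application code, for instance with the MongoDB Java driver (a sketch only; the connection string, database name and batch size are assumptions, not values from the question), you can combine it with an explicit batch size and a try/finally that guarantees the cursor is closed even if processing throws:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

public class SnapshotMigration {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> snapshots =
                    client.getDatabase("test").getCollection("snapshots");

            // noCursorTimeout keeps the server-side cursor alive; batchSize keeps each
            // round trip small enough to be processed well within the 10-minute window.
            MongoCursor<Document> cursor = snapshots.find()
                    .noCursorTimeout(true)
                    .batchSize(500)
                    .iterator();
            try {
                while (cursor.hasNext()) {
                    Document doc = cursor.next();
                    // ... process the document ...
                }
            } finally {
                cursor.close(); // always release the cursor, even on failure
            }
        }
    }
}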
3. 👍 Retry when the cursor expires
Basically, you put your code in a try-catch and when you get the error, you get a new cursor skipping the documents that you have already processed:
let processed = 0;
let updated = 0;

while (true) {
    const cursor = db.snapshots.find().sort({ _id: 1 }).skip(processed);
    try {
        while (cursor.hasNext()) {
            const doc = cursor.next();
            ++processed;
            if (doc.stream && doc.roundedDate && !doc.sid) {
                db.snapshots.update({
                    _id: doc._id
                }, { $set: {
                    sid: `${ doc.stream.valueOf() }-${ doc.roundedDate }`
                }});
                ++updated;
            }
        }
        break; // Done processing all, exit outer loop
    } catch (err) {
        if (err.code !== 43) {
            // Something else than a timeout went wrong. Abort loop.
            throw err;
        }
    }
}
Note you need to sort the results for this solution to work.
With this approach, you are minimizing the number of requests to the server by using the maximum possible batch size of 16 MB, without having to guess how many documents you will be able to process in 10 minutes beforehand. Therefore, it is also more robust than the previous approach.
4. 👎 Query the results in batches manually
Basically, you use skip(), limit() and sort() to do multiple queries with a number of documents you think you can process in 10 minutes.
I consider this a bad solution because the driver already has the option to set the batch size, so there's no reason to do this manually; just use solution 1 and don't reinvent the wheel.
Also, it is worth mentioning that it has the same drawbacks as solution 1.
5. 👎 Get all the documents before the cursor expires
Probably your code is taking some time to execute due to results processing, so you could retrieve all the documents first and then process them:
const results = db.snapshots.find().toArray();
This will retrieve all the batches one after another and close the cursor. Then, you can loop through all the documents inside results and do what you need to do.
However, if you are having timeout issues, chances are that your result set is quite large, thus pulling everything in memory may not be the most advisable thing to do.
Note about snapshot mode and duplicate documents
It is possible that some documents are returned multiple times if intervening write operations move them due to a growth in document size. To solve this, use cursor.snapshot(). From the docs:
Append the snapshot() method to a cursor to toggle the "snapshot" mode. This ensures that the query will not return a document multiple times, even if intervening write operations result in a move of the document due to the growth in document size.
However, keep in mind its limitations:
It doesn't work with sharded collections.
It doesn't work with sort() or hint(), so it will not work with solutions 3 and 4.
It doesn't guarantee isolation from insertion or deletions.
Note with solution 5 the time window to have a move of documents that may cause duplicate documents retrieval is narrower than with the other solutions, so you may not need snapshot().
In your particular case, as the collection is called snapshot, it is probably not going to change, so you probably don't need snapshot(). Moreover, you are doing updates on documents based on their data and, once the update is done, that same document will not be updated again even if it is retrieved multiple times, as the if condition will skip it.
Note about open cursors
To see a count of open cursors use db.serverStatus().metrics.cursor.
It's a bug in MongoDB server session management. A fix is currently in progress and should land in 4.0+:
SERVER-34810: Session cache refresh can erroneously kill cursors that are still in use
(reproduced in MongoDB 3.6.5)
Adding collection.find().batchSize(20) helped me, with only a tiny reduction in performance.
I also ran into this problem, but for me it was caused by a bug in the MongoDB driver.
It happened in version 3.0.x of the npm package mongodb, which is used e.g. in Meteor 1.7.0.x, where I also recorded this issue. It's further described in this comment, and the thread contains a sample project which confirms the bug: https://github.com/meteor/meteor/issues/9944#issuecomment-420542042
Updating the npm package to 3.1.x fixed it for me, because I had already taken into account the good advice given here by @Danziger.
When using the Java v3 driver, noCursorTimeout should be set in the DBCollectionFindOptions:
DBCollectionFindOptions options =
new DBCollectionFindOptions()
.maxTime(90, TimeUnit.MINUTES)
.noCursorTimeout(true)
.batchSize(batchSize)
.projection(projectionQuery);
cursor = collection.find(filterQuery, options);
In my case, it was a load-balancing issue. I had the same issue running a Node.js service with mongos as a pod on Kubernetes.
The client was using the mongos service with default load balancing.
Changing the Kubernetes service to use sessionAffinity: ClientIP (stickiness) resolved the issue for me.
noCursorTimeout will NOT work
It is now 2021, and for
cursor id xxx not found, full error: {'ok': 0.0, 'errmsg': 'cursor id xxx not found', 'code': 43, 'codeName': 'CursorNotFound'}
the official docs say:
Consider an application that issues a db.collection.find() with cursor.noCursorTimeout(). The server returns a cursor along with a batch of documents defined by the cursor.batchSize() of the find(). The session refreshes each time the application requests a new batch of documents from the server. However, if the application takes longer than 30 minutes to process the current batch of documents, the session is marked as expired and closed. When the server closes the session, it also kills the cursor despite the cursor being configured with noCursorTimeout(). When the application requests the next batch of documents, the server returns an error.
That means: even if you have set
noCursorTimeout=True
a smaller batchSize
you will still get cursor id not found after the default 30 minutes.
How to fix/avoid cursor id not found?
Make sure of two points:
(explicitly) create new session, get db and collection from this session
refresh session periodically
code:
(official) js
var session = db.getMongo().startSession()
var sessionId = session.getSessionId().id

var cursor = session.getDatabase("examples").getCollection("data").find().noCursorTimeout()
var refreshTimestamp = new Date() // take note of time at operation start

while (cursor.hasNext()) {
    // Check if more than 5 minutes have passed since the last refresh
    if ( (new Date() - refreshTimestamp) / 1000 > 300 ) {
        print("refreshing session")
        db.adminCommand({"refreshSessions" : [sessionId]})
        refreshTimestamp = new Date()
    }
    // process cursor normally
}
(mine) python
import logging
from datetime import datetime

import pymongo

mongoClient = pymongo.MongoClient('mongodb://127.0.0.1:27017/your_db_name')

# refresh the session once every RefreshSessionPerSeconds seconds
# Note: should be less than 30 minutes = Mongo session default timeout
# https://docs.mongodb.com/v5.0/reference/method/cursor.noCursorTimeout/
# RefreshSessionPerSeconds = 10 * 60
RefreshSessionPerSeconds = 8 * 60

def mergeHistorResultToNewCollection():
    mongoSession = mongoClient.start_session()  # <pymongo.client_session.ClientSession object at 0x1081c5c70>
    mongoSessionId = mongoSession.session_id  # {'id': Binary(b'\xbf\xd8\xd...1\xbb', 4)}
    mongoDb = mongoSession.client["your_db_name"]  # Database(MongoClient(host=['127.0.0.1:27017'], document_class=dict, tz_aware=False, connect=True), 'your_db_name')
    mongoCollectionOld = mongoDb["collecion_old"]
    mongoCollectionNew = mongoDb['collecion_new']

    # historyAllResultCursor = mongoCollectionOld.find(session=mongoSession)
    historyAllResultCursor = mongoCollectionOld.find(no_cursor_timeout=True, session=mongoSession)

    lastUpdateTime = datetime.now()  # datetime.datetime(2021, 8, 30, 10, 57, 14, 579328)
    for curIdx, oldHistoryResult in enumerate(historyAllResultCursor):
        curTime = datetime.now()  # datetime.datetime(2021, 8, 30, 10, 57, 25, 110374)
        elapsedTime = curTime - lastUpdateTime  # datetime.timedelta(seconds=10, microseconds=531046)
        elapsedTimeSeconds = elapsedTime.total_seconds()  # 2.65892
        isShouldUpdateSession = elapsedTimeSeconds > RefreshSessionPerSeconds
        # if (curIdx % RefreshSessionPerNum) == 0:
        if isShouldUpdateSession:
            lastUpdateTime = curTime
            cmdResp = mongoDb.command("refreshSessions", [mongoSessionId], session=mongoSession)
            logging.info("Called refreshSessions command, resp=%s", cmdResp)

        # do what you want
        existedNewResult = mongoCollectionNew.find_one({"shortLink": "http://xxx"}, session=mongoSession)

    # mongoSession.close()
    mongoSession.end_session()
Reference docs:
MongoDB
ClientSession
refreshSessions
pymongo
find
command

How can we set sequential execution of Azure Service Bus queues?

I have three queues in my project:
1. Verify email and number.
2. Register user.
3. Perform investment operations like deposit, withdraw, invest, etc.
I want the flow of execution to be: first, then second; while the second is running, the first processes the next record; and when the second is completed, then the third, because we have data dependencies across all of them.
How do I create this kind of sequence of queues?
queue 1
Trace.TraceInformation("verification is started");
BrokeredMessage verificationqueuedata = Client.Receive();
try
{
    if (verificationqueuedata != null)
    {
        UserModel userModel = verificationqueuedata.GetBody<UserModel>();
        if (userModel == null)
        {
            verificationqueuedata.Abandon();
        }
        else
        {
            // project code
            verificationqueuedata.Complete();
        }
    }
All three queues are created in the same manner.
Please help me create this sequence.
I have three queue in my project. 1.verify email and number.
2.register user. 3.perform investment operation like. deposit, withdrawn, invest etc.
If you mean three separate queues for separate tasks: pick up an item from Queue-1 and, once it is completed, put a message into Queue-2, and so on. There is no race condition here.
If you are using the same queue for three types of messages: you need to maintain a correlation id with each of your messages and use some kind of state mechanism (a database) to find whether the previous operation for this correlation id has completed or not.
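For the separate-queues approach, here is a rough sketch using the newer Java SDK (azure-messaging-servicebus); the connection string, queue names and payload handling are placeholders, not taken from your code, and the same chaining pattern applies to the older .NET BrokeredMessage API in your sample (send to the next queue, then Complete() the current message):
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusMessage;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;
import com.azure.messaging.servicebus.ServiceBusSenderClient;

public class VerificationStage {
    public static void main(String[] args) {
        String connectionString = "<service-bus-connection-string>"; // placeholder

        // Sender for the next stage: messages only reach "register-user" once
        // verification has completed successfully.
        ServiceBusSenderClient registerSender = new ServiceBusClientBuilder()
                .connectionString(connectionString)
                .sender()
                .queueName("register-user")
                .buildClient();

        ServiceBusProcessorClient verificationProcessor = new ServiceBusClientBuilder()
                .connectionString(connectionString)
                .processor()
                .queueName("verify-email-and-number")
                .disableAutoComplete() // settle messages explicitly below
                .processMessage(context -> {
                    String body = context.getMessage().getBody().toString();
                    // ... verify email and number here ...
                    // Hand the work to the next stage, then settle the current message.
                    registerSender.sendMessage(new ServiceBusMessage(body));
                    context.complete();
                })
                .processError(context -> System.err.println("Error: " + context.getException()))
                .buildProcessorClient();

        verificationProcessor.start();
    }
}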

Repeating 503's messages when querying DBpedia

I'm conducting a series of queries to DBpedia SPARQL endpoint (from inside a loop). The code looks more or less like this:
for (String citySplit : citiesSplit) {
    RepositoryConnection conn = dbpediaEndpoint.getConnection();
    String sparqlQueryLat = " SELECT ?lat ?lon WHERE { "
            + "<http://dbpedia.org/resource/" + citySplit.trim().replaceAll(" ", "_") + "> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat . "
            + "<http://dbpedia.org/resource/" + citySplit.trim().replaceAll(" ", "_") + "> <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?lon ."
            + "}";
    TupleQuery queryLat = conn.prepareTupleQuery(QueryLanguage.SPARQL, sparqlQueryLat);
    TupleQueryResult resultLat = queryLat.evaluate();
}
The problem is that, after a few iterations, I get a 503 message:
httpclient.wire.header - << "HTTP/1.1 503 Service Temporarily Unavailable[\r][\n]"
(...)
org.openrdf.query.QueryInterruptedException
at org.openrdf.http.client.HTTPClient.getTupleQueryResult(HTTPClient.java:1041)
at org.openrdf.http.client.HTTPClient.sendTupleQuery(HTTPClient.java:438)
at org.openrdf.http.client.HTTPClient.sendTupleQuery(HTTPClient.java:413)
at org.openrdf.repository.http.HTTPTupleQuery.evaluate(HTTPTupleQuery.java:41)
If I understand correctly, this 503 message is from DBpedia. Am I right?
The number of consecutive queries that manage to succeed is variable. Sometimes it runs for 13 seconds before getting the message, sometimes 15 minutes.
In any case, I don't think this is normal.
What could be happening?
The Accessing the DBpedia Data Set over the Web page of the DBpedia wiki says, in section 1.1. Public SPARQL Endpoint:
Fair Use Policy: Please read this post for information about restrictions on the public DBpedia endpoint. These might also be usefull [sic]: 1, 2.
The linked post says that the public DBpedia SPARQL endpoint implements rate limiting.
The http://dbpedia.org/sparql endpoint has both rate limiting on the number of connections/sec you can make, as well as restrictions on resultset and query time, as per the following settings:
[SPARQL]
ResultSetMaxRows = 2000
MaxQueryExecutionTime = 120
MaxQueryCostEstimationTime = 1500
These are in place to make sure that everyone has a equal chance to de-reference data from dbpedia.org, as well as to guard against badly written queries/robots.
I think that it is likely that you are hitting that limit.
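One pragmatic mitigation is to pace the loop and back off when a query is interrupted. This is a sketch only, using the same org.openrdf types as in your snippet; the retry count and delays are arbitrary values, not figures from DBpedia's policy:
// Call this from inside the loop instead of evaluating the query directly,
// and sleep briefly between cities to stay under the connections-per-second limit.
private TupleQueryResult evaluateWithBackoff(RepositoryConnection conn, String sparql)
        throws Exception {
    int attempts = 0;
    while (true) {
        try {
            TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, sparql);
            return query.evaluate();
        } catch (QueryInterruptedException e) {
            attempts++;
            if (attempts >= 3) {
                throw e; // give up after a few tries
            }
            Thread.sleep(2000L * attempts); // wait longer after each failure
        }
    }
}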

Multi-threading on a foreach loop?

I want to process some data. I have about 25k items in a Dictionary. In a foreach loop, I query a database to get results for that item. The results are added as values to the Dictionary.
foreach (KeyValuePair<string, Type> pair in allPeople)
{
    MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
    MySqlDataReader reader2 = comd.ExecuteReader();
    Dictionary<string, Dictionary<int, Log>> allViews = new Dictionary<string, Dictionary<int, Log>>();
    while (reader2.Read())
    {
        if (!allViews.ContainsKey(reader2.GetString("src")))
        {
            allViews.Add(reader2.GetString("src"), reader2.GetInt32("time"));
        }
    }
    reader2.Close();
    reader2.Dispose();
    allPeople[pair.Key].View = allViews;
}
I was hoping to be able to do this faster by multi-threading. I have 8 threads available, and CPU usage is about 13%. I just don't know if it will work because it's relying on the MySQL server. On the other hand, maybe 8 threads would open 8 DB connections, and so be faster.
Anyway, if multi-threading would help in my case, how? o.O I've never worked with (multiple) threads, so any help would be great :D
MySqlDataReader is stateful - you call Read() on it and it moves to the next row, so each thread needs its own reader, and you need to concoct a query so they get different values. That might not be too hard, as you naturally have many queries with different values of pair.Key.
You also need to either have a temp dictionary per thread, and then merge them, or use a lock to prevent concurrent modification of the dictionary.
The above assumes that MySQL will allow a single connection to perform concurrent queries; otherwise you may need multiple connections too.
First though, I'd see what happens if you only ask the database for the data you need ("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src") and use GetString(0) and GetInt32(1) instead of using the names to look up the src and time; also only get the values once from the result.
I'm also not sure on the logic - you are not ordering the log events by time, so which one is the first returned (and so is stored in the dictionary) could be any of them.
Something like this logic - where each of N threads only operates on the Nth pair, each thread has its own reader, and nothing actually changes allPeople, only the properties of the values in allPeople:
private void RunSubQuery(Dictionary<string, Type> allPeople, MySqlConnection con, int threadNumber, int threadCount)
{
    int hoppity = 0; // used to hop over the keys not processed by this thread
    foreach (var pair in allPeople)
    {
        // each of the (threadCount) threads only processes the (threadCount)th key
        if ((hoppity % threadCount) == threadNumber)
        {
            // you may need con per thread, or it might be that you can share con; I don't know
            MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
            using (MySqlDataReader reader = comd.ExecuteReader())
            {
                var allViews = new Dictionary<string, Dictionary<int, Log>>();
                while (reader.Read())
                {
                    string src = reader.GetString(0);
                    int time = reader.GetInt32(1);
                    // do whatever to allViews with src and time
                }
                // no thread will be modifying the same pair.Value, so this is safe
                pair.Value.View = allViews;
            }
        }
        ++hoppity;
    }
}
This isn't tested - I don't have MySQL on this machine, nor do I have your database and the other types you're using. It's also rather procedural (kind of how you would do it in Fortran with OpenMPI) rather than wrapping everything up in task objects.
You could launch threads for this like so:
void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
{
    lock (allPeople)
    {
        const int threadCount = 8; // the number of threads

        // if it takes 18 seconds currently and you're not at .net 4 yet, then you may as well create
        // the threads here as any saving of using a pool will not matter against 18 seconds
        //
        // it could be more efficient to use a pool so that each thread takes a pair off of
        // a queue, as doing it this way means that each thread has the same number of pairs to process,
        // and some pairs might take longer than others
        Thread[] threads = new Thread[threadCount];
        for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
        {
            int localThreadNumber = threadNumber; // capture a copy so each thread sees its own number
            threads[threadNumber] = new Thread(new ThreadStart(() => RunSubQuery(allPeople, connection, localThreadNumber, threadCount)));
            threads[threadNumber].Start();
        }
        // wait for all threads to finish
        for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
        {
            threads[threadNumber].Join();
        }
    }
}
The extra lock held on allPeople is done so that there is a write barrier after all the threads return; I'm not quite sure if it's needed. Any object would do.
Nothing in this guarantees any performance gain - it might be that the MySQL libraries are single threaded, but the server certainly can handle multiple connections. Measure with various numbers of threads.
If you're using .net 4, then you don't have to mess around creating the threads or skipping the items you aren't working on:
// this time using .net 4 parallel; assumes that connection is thread safe
static void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
{
    Parallel.ForEach(allPeople, pair => RunPairQuery(pair, connection));
}

private static void RunPairQuery(KeyValuePair<string, Type> pair, MySqlConnection connection)
{
    MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", connection);
    using (MySqlDataReader reader = comd.ExecuteReader())
    {
        var allViews = new Dictionary<string, Dictionary<int, Log>>();
        while (reader.Read())
        {
            string src = reader.GetString(0);
            int time = reader.GetInt32(1);
            // do whatever to allViews with src and time
        }
        // no iteration will be modifying the same pair.Value, so this is safe
        pair.Value.View = allViews;
    }
}
The biggest problem that comes to mind is that you are going to use multithreading to add values to a dictionary, which isn't thread safe.
You'll have to do something like this to make it work, and you might not get that much of a benefit from implementing it this way, as it still has to lock the dictionary object to add a value.
Assumptions:
There is a table People in your database.
There are a lot of people in your database.
Each database query adds overhead. You are doing one DB query for each of the people in your database; I would suggest it is faster to get all the data back in one query than to make repeated calls:
select l.ip,l.time,l.src
from logs l, people p
where l.ip = p.ip
group by l.ip, l.src
Try this with a loop in a single thread; I believe this will be much faster than your existing code.
Within your existing code, another thing you can do is take the creation of the MySqlCommand out of the loop, prepare it in advance and just change the parameter. This should speed up execution of the SQL. See http://dev.mysql.com/doc/refman/5.0/es/connector-net-examples-mysqlcommand.html#connector-net-examples-mysqlcommand-prepare
MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = ?key GROUP BY src", con);
comd.Prepare();
comd.Parameters.Add("?key", "example");

foreach (KeyValuePair<string, Type> pair in allPeople)
{
    comd.Parameters[0].Value = pair.Key;
If you are using multiple threads, each thread will still need its own Command; at least in MS-SQL this would still be faster even if you recreated and prepared the statement every time, due to the ability of the SQL server to cache the execution plan of a parameterised statement.
Before you do anything else, find out exactly where the time is being spent. Check the execution plan of the query. The first thing I'd suspect is a missing index on logs.IP.
18 minutes for something like this seems much too long to me. Even if you can cut the execution time in eight by adding more threads (which is unlikely!) you still end up using more than 2 minutes. You could probably read the whole 25k rows into memory in less than five seconds and do the necessary processing in memory...
EDIT: Just to clarify, I'm not advocating actually doing this in memory, just saying that it looks like there's a bigger bottleneck here that can be removed.
I think if you are running this on a multi-core machine you could gain benefits from multi-threading.
However, the way I would approach it is to first look at unblocking the thread you are currently using by making asynchronous database calls. The callbacks will execute on background threads, so you will get some multi-core benefit there and you won't be blocking threads waiting for the DB to come back.
For IO-intensive apps like this example, you are likely to see improved throughput depending on what load the DB can handle. Assuming the DB scales to handle more than one concurrent request, you should be good.
Thanks everyone for your help. Currently I am using this
for (int i = 0; i < 8; i++)
{
ThreadPool.QueueUserWorkItem(addDistinctScres, i);
}
ThreadPool to run all the threads. I use the method provided by Pete Kirkham, and I'm creating a new connection per thread.
Times went down to 4 minutes.
Next I'll make something wait for the callback of the threadpool? before performing other functions.
I think the bottleneck now is the MySQL server, because the CPU usage has dropped.
@odd parity I thought about that, but the real thing is waaay more than 25k rows. Idk if that'd work.
This sounds like the perfect job for map/reduce. I am not a .NET programmer, but this seems like a reasonable guide:
http://ox.no/posts/minimalistic-mapreduce-in-net-4-0-with-the-new-task-parallel-library-tpl

Question on implementing IQueryCancelAutoPlay in a Windows service

I am implementing the IQueryCancelAutoPlay COM interface and registering it with the Running Object Table (ROT) from a Windows service*.
My problem is that it never gets called when I insert a mass storage device (or any device really). Here's some more information:
My code for registering with the ROT:
Text::string clsIdString = Text::to_string(Com::CLSID_QCAListener);
// remove curly braces
clsIdString = clsIdString.substr(1, clsIdString.length() - 2);

// set registry key to make sure we get notifications from windows
Reg::SetValue(HKEY_LOCAL_MACHINE,
    _T("SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Explorer\\AutoplayHandlers\\CancelAutoplay\\CLSID"),
    clsIdString, _T(""));

HRESULT result = S_OK;

// create class moniker ...
CComPtr<IMoniker> moniker;
result = CreateClassMoniker(Com::CLSID_QCAListener, &moniker);
if( !ValidateResult(result, "Error creating class moniker") )
    return;

DBG << _T("Getting IRunningObjectTable pointer ...") << std::endl;

// get running object table ...
CComPtr<IRunningObjectTable> runningObjectTable;
result = GetRunningObjectTable(0, &runningObjectTable);
if( !ValidateResult(result, "Error getting running object table") )
    return;

// create an instance of the QCAListener class ...
Com::QCAListener * listenerInstance = new Com::QCAListener();
if(!ValidateResult( listenerInstance != 0,
        "Error creating QueryCancelAutoplayListener"))
    return;

// ... and set the pointer in the _qcaListener variable
CComPtr<IQueryCancelAutoPlay> qcaListener;
listenerInstance->QueryInterface(IID_IQueryCancelAutoPlay, reinterpret_cast<void**>(&qcaListener));

DBG << _T("Registering IQueryCancelAutoPlay with ROT ...") << std::endl;

result = runningObjectTable->Register(
    ROTFLAGS_REGISTRATIONKEEPSALIVE,
    listenerInstance,
    moniker,
    &_qcaRegistration);

ValidateResult(result, "Error registering QueryCancelAutoplayListener with the ROT");
runningObjectTable->Register returns S_OK, and at the end of the code block's execution the ref-count for listenerInstance is 1 (if I remove the call to runningObjectTable->Register completely, the ref-count remains 0 when qcaListener goes out of scope so this means an instance of my class remains active in the ROT).
More details: In development, my service runs with my account credentials (local administrator). Although this will probably change, it should work as it is with the current configuration.
Can anyone shed any light on this?
*- I know the documentation says I shouldn't implement IQueryCancelAutoPlay in a service but I need to do this for various reasons (business requirement, etc).
I figured it out (for those who stumble upon this answer when having a similar problem):
The service runs under a different window station and a different desktop. When the IQueryCancelAutoPlay implementation is registered in the ROT, this is done for a different desktop.
The current user's desktop shell (explorer) will not find this registration when a new USB device is inserted (as it is not registered with the current desktop).
