I understand why the first for loop produces one request to the database but why does the second for loop produce 5 requests to the database?
class Program
{
    static void Main(string[] args)
    {
        TAXIDBEntities1 ctx = new TAXIDBEntities1();
        List<TestObject> Tests;

        //This block makes 1 request to the database
        Tests = ctx.TestObjects.ToList();
        for (int i = 0; i < 5; i++)
        {
            foreach (TestObject item in Tests)
            {
                System.Diagnostics.Debug.WriteLine(item.id);
            }
            System.Threading.Thread.Sleep(2000);
        }

        //This block makes 5 requests to the database
        var x = ctx.TestObjects;
        for (int i = 0; i < 5; i++)
        {
            foreach (TestObject item in x)
            {
                System.Diagnostics.Debug.WriteLine(item.id);
            }
            System.Threading.Thread.Sleep(2000);
        }
    }
}
I don't understand what is going on in the second for loop. Could someone explain why SQL Profiler shows 5 requests to the database?
The ToList call in your first example instructs EF to read all of the objects from the database and convert them into an in-memory List.
You then iterate over this list.
In the second block you are querying the database 5 times: each time the foreach loop calls GetEnumerator on the DbSet, a new query is executed. Entity Framework has no way of knowing that nothing in the database has changed since the last call to GetEnumerator, so it has to make 5 database reads.
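To make that concrete, here is roughly what each pass of the second loop expands to (a sketch; query is just a local name for the DbSet, and the comments mark where the SQL is issued):

IQueryable<TestObject> query = ctx.TestObjects;
using (IEnumerator<TestObject> enumerator = query.GetEnumerator()) // enumeration starts: EF sends the SELECT
{
    while (enumerator.MoveNext())
    {
        System.Diagnostics.Debug.WriteLine(enumerator.Current.id);
    }
}
// A second GetEnumerator() call re-executes the same SELECT;
// materializing once with query.ToList() is what avoids the repeated reads.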
Meta-Question:
We're pulling data from EventHub, running some logic, and saving it to Cosmos. Currently the Cosmos inserts are our bottleneck. How do we maximize our throughput?
Details
We're trying to optimize our Cosmos throughput and there seems to be some contention in the SDK that makes parallel inserts only marginally faster than serial inserts.
We're logically doing:
for (int i = 0; i < insertCount; i++)
{
    taskList.Add(InsertCosmos(sdkContainerClient));
}
var parallelTimes = await Task.WhenAll(taskList);
Here are the results comparing serial inserts, parallel inserts, and "faking" an insert (with Task.Delay):
Serial took: 461ms for 20
- Individual times 28,8,117,19,14,11,10,12,5,8,9,11,18,15,79,23,14,16,14,13
Cosmos Parallel
Parallel took: 231ms for 20
- Individual times 17,15,23,39,45,52,72,74,80,91,96,98,108,117,123,128,139,146,147,145
Just Parallel (no cosmos)
Parallel took: 27ms for 20
- Individual times 27,26,26,26,26,26,26,25,25,25,25,25,25,24,24,24,23,23,23,23
The serial total is obvious (just add the individual times).
The no-cosmos total (the last timing) is also obvious (roughly the longest individual time).
But parallel Cosmos doesn't parallelize nearly as well, indicating there's some contention.
We're running this on a VM in Azure (same datacenter as Cosmos), have enough RUs so we aren't getting 429s, and are using Microsoft.Azure.Cosmos 3.2.0.
Full Code Sample
class Program
{
    public static void Main(string[] args)
    {
        CosmosWriteTest().Wait();
    }

    public static async Task CosmosWriteTest()
    {
        var cosmosClient = new CosmosClient("todo", new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct });
        var database = cosmosClient.GetDatabase("<ourcontainer>");
        var sdkContainerClient = database.GetContainer("<ourcontainer>");

        int insertCount = 25;

        //Warmup
        await sdkContainerClient.CreateItemAsync(new TestObject());

        //---Serially inserts into Cosmos---
        List<long> serialTimes = new List<long>();
        var serialTimer = Stopwatch.StartNew();
        Console.WriteLine("Cosmos Serial");
        for (int i = 0; i < insertCount; i++)
        {
            serialTimes.Add(await InsertCosmos(sdkContainerClient));
        }
        serialTimer.Stop();
        Console.WriteLine($"Serial took: {serialTimer.ElapsedMilliseconds}ms for {insertCount}");
        Console.WriteLine($" - Individual times {string.Join(",", serialTimes)}");

        //---Parallel inserts into Cosmos---
        Console.WriteLine(Environment.NewLine + "Cosmos Parallel");
        var parallelTimer = Stopwatch.StartNew();
        var taskList = new List<Task<long>>();
        for (int i = 0; i < insertCount; i++)
        {
            taskList.Add(InsertCosmos(sdkContainerClient));
        }
        var parallelTimes = await Task.WhenAll(taskList);
        parallelTimer.Stop();
        Console.WriteLine($"Parallel took: {parallelTimer.ElapsedMilliseconds}ms for {insertCount}");
        Console.WriteLine($" - Individual times {string.Join(",", parallelTimes)}");

        //---Testing parallelism minus cosmos---
        Console.WriteLine(Environment.NewLine + "Just Parallel (no cosmos)");
        var justParallelTimer = Stopwatch.StartNew();
        var noCosmosTaskList = new List<Task<long>>();
        for (int i = 0; i < insertCount; i++)
        {
            noCosmosTaskList.Add(InsertCosmos(sdkContainerClient, true));
        }
        var justParallelTimes = await Task.WhenAll(noCosmosTaskList);
        justParallelTimer.Stop();
        Console.WriteLine($"Parallel took: {justParallelTimer.ElapsedMilliseconds}ms for {insertCount}");
        Console.WriteLine($" - Individual times {string.Join(",", justParallelTimes)}");
    }

    //inserts
    private static async Task<long> InsertCosmos(Container sdkContainerClient, bool justDelay = false)
    {
        var timer = Stopwatch.StartNew();
        if (!justDelay)
            await sdkContainerClient.CreateItemAsync(new TestObject());
        else
            await Task.Delay(20);
        timer.Stop();
        return timer.ElapsedMilliseconds;
    }

    //Test object to save to Cosmos
    public class TestObject
    {
        public string id { get; set; } = Guid.NewGuid().ToString();
        public string pKey { get; set; } = Guid.NewGuid().ToString();
        public string Field1 { get; set; } = "Testing this field";
        public double Number { get; set; } = 12345;
    }
}
This is the scenario for which Bulk is being introduced. Bulk mode is in preview at this moment and available in the 3.2.0-preview2 package.
What you need to do to take advantage of Bulk is turn the AllowBulkExecution flag on:
new CosmosClient(endpoint, authKey, new CosmosClientOptions() { AllowBulkExecution = true } );
This mode was made to benefit exactly the scenario you describe: a list of concurrent operations that need throughput.
We have a sample project here: https://github.com/Azure/azure-cosmos-dotnet-v3/tree/master/Microsoft.Azure.Cosmos.Samples/Usage/BulkSupport
We are still working on the official documentation, but the idea is this: when concurrent operations are issued, instead of executing them as individual requests (as you are seeing right now), the SDK groups them based on partition affinity and executes them as batched operations. This reduces the number of backend service calls and can potentially increase throughput by 50%-100%, depending on the volume of operations. This mode consumes more RU/s, since it pushes a higher volume of operations per second than issuing the operations individually (so if you hit 429s, it means the bottleneck is now the provisioned RU/s).
var cosmosClient = new CosmosClient("todo", new CosmosClientOptions { AllowBulkExecution = true });
var database = cosmosClient.GetDatabase("<ourcontainer>");
var sdkContainerClient = database.GetContainer("<ourcontainer>");

//The more operations the better; just 25 might not yield a great difference vs non-bulk
int insertCount = 10000;
//Don't do any warmup

List<Task> operations = new List<Task>();
var timer = Stopwatch.StartNew();
for (int i = 0; i < insertCount; i++)
{
    operations.Add(sdkContainerClient.CreateItemAsync(new TestObject()));
}
await Task.WhenAll(operations);
timer.Stop();
Important: this is a feature that is still in preview. Since this mode is optimized for throughput (not latency), no single individual operation will have great operational latency.
If you want to optimize even further, and your data source lets you access streams (avoiding serialization), you can use the CreateItemStream SDK methods for even better throughput.
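As an illustration, here is a minimal sketch of a stream-based insert, reusing the TestObject and container from the question (the JSON is produced manually with Newtonsoft.Json just to have a payload; in a real pipeline you would already hold the raw bytes and skip serialization entirely):

private static async Task InsertCosmosStream(Container sdkContainerClient)
{
    var item = new TestObject();
    var payload = new MemoryStream(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(item)));

    // CreateItemStreamAsync hands the raw stream to the service; the partition
    // key must be supplied explicitly and match the document's pKey property.
    using (ResponseMessage response = await sdkContainerClient.CreateItemStreamAsync(payload, new PartitionKey(item.pKey)))
    {
        if (!response.IsSuccessStatusCode)
            Console.WriteLine($"Insert failed: {response.StatusCode}");
    }
}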
for (int i = 0; i < 100000; i++)
{
    threadEvent.Invoke(i, new EventArgs()); // tell the progress bar the current value
}
threadEvent += new EventHandler(method_threadEvent);

void method_threadEvent(object sender, EventArgs e)
{
    int nowValue = Convert.ToInt32(sender);
    nowValueDelegate now = new nowValueDelegate(setNow);
    this.Invoke(now, nowValue);
}

private void setNow(int nowValue)
{
    this.progressBar1.Value = nowValue;
}

private delegate void nowValueDelegate(int nowValue);
In the loop I do nothing else, but it still wastes a lot of time!
Why does threadEvent.Invoke take so much time?
Invoking is an expensive operation, because it has to cross thread boundaries.
It's best to reduce the number of invokes, for instance by only updating the progress bar for each percentage of work done rather than for each iteration of the loop. That way only 100 updates need to be processed, rather than one per iteration.
The first thing you need to do is calculate or estimate the current progress.
For a typical loop
for (int i = 0; i < someValue; ++i)
{
    ... // Work here
}
A good estimate of progress is (i * 100) / someValue, which gives the percentage of the loop that has been completed (multiply before dividing, otherwise integer division truncates the result to zero). To update the progress on the UI thread only when the next percentage has been reached, you could do something along the lines of:
int percentCompleted = 0;
threadEvent.Invoke(percentCompleted, new EventArgs()); // Initial progress bar value
for (int i = 0; i < someValue; ++i)
{
    int newlyCompleted = (i * 100) / someValue;
    if (newlyCompleted > percentCompleted)
    {
        percentCompleted = newlyCompleted;
        threadEvent.Invoke(percentCompleted, new EventArgs());
    }

    ... // Work here
}
Finally, you could use BeginInvoke instead of Invoke so that the worker thread doesn't wait for threadEvent to complete (PostMessage behaviour). This works well here because threadEvent has no return value that you need.
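For example, the handler from the question could post the update instead of blocking on it (same delegate, just BeginInvoke in place of Invoke):

void method_threadEvent(object sender, EventArgs e)
{
    int nowValue = Convert.ToInt32(sender);
    // BeginInvoke queues the update on the UI thread and returns immediately,
    // so the worker never waits for the progress bar to repaint.
    this.BeginInvoke(new nowValueDelegate(setNow), nowValue);
}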
I am running into some strange behavior in the BackgroundWorker class that leads me to believe I don't fully understand how it works. I assumed that the following code sections were more or less equal, except for some extra features that the BackgroundWorker implements (like progress reporting, etc.):
Section 1:
void StartSeparateThread()
{
    BackgroundWorker bw = new BackgroundWorker();
    bw.DoWork += new DoWorkEventHandler(bw_DoWork);
    bw.RunWorkerAsync();
}

void bw_DoWork(object sender, DoWorkEventArgs e)
{
    //Execute some code asynchronous to the thread that owns the function
    //StartSeparateThread() but synchronous to itself.
    var SendCommand = "SomeCommandToSend";
    var toWaitFor = new List<string>() { "Various", "Possible", "Outputs to wait for" };
    var SecondsToWait = 30;
    //This calls a function that sends the command over the NetworkStream and waits
    //for various responses.
    var Result = SendAndWaitFor(SendCommand, toWaitFor, SecondsToWait);
}
Section 2:
void StartSeparateThread()
{
    Thread pollThread = new Thread(new ThreadStart(DoStuff));
    pollThread.Start();
}

void DoStuff()
{
    //Execute some code asynchronous to the thread that owns the function
    //StartSeparateThread() but synchronous to itself.
    var SendCommand = "SomeCommandToSend";
    var toWaitFor = new List<string>() { "Various", "Possible", "Outputs to wait for" };
    var SecondsToWait = 30;
    //This calls a function that sends the command over the NetworkStream and waits
    //for various responses.
    var Result = SendAndWaitFor(SendCommand, toWaitFor, SecondsToWait);
}
I was using Section 1 to run some code that sent a string over a NetworkStream and waited for a desired response string, capturing all output during that time. I wrote a function to do this that would return the NetworkStream output, the index of the sent string, and the index of the desired response string.
I was seeing some strange behavior with this, so I changed the function to only return when both the sent string and the response string were found, and when the index of the found string was greater than the index of the sent string; otherwise it would loop forever (just for testing). I found that the function would indeed return, but the index of both strings was -1 and the output string was null, or sometimes filled with the expected output of the previous call.
If I had to guess at what is happening, it would be that external functions called from within bw_DoWork() run asynchronously to the thread that owns bw_DoWork(). Since my SendAndWaitFor() function was called multiple times in succession, the second call would run before the first call finished, overwriting the results of the first call after they were returned but before they could be evaluated. This seems to fit, because the first call always runs correctly and successive calls show the strange behavior described above, but it seems counterintuitive to how the BackgroundWorker class should behave. Also, if I break within the SendAndWaitFor function, things behave properly. This again leads me to believe there is some multi-threading going on within bw_DoWork itself.
When I change the code in the first section above to the code of the second section, things work entirely as expected. So, can anyone who understands the BackgroundWorker class explain what could be going on? Below are some related functions that may be relevant.
Thanks!
public Dictionary<string, string> SendAndWaitFor(string sendString, List<string> toWaitFor, int seconds)
{
    var toReturn = new Dictionary<string, string>();
    var enc = new ASCIIEncoding();
    var output = "";

    //wait for the current buffer to clear
    output = this.SynchronousRead();
    while (!string.IsNullOrEmpty(output))
    {
        output = SynchronousRead();
    }

    //output should be null at this point and the buffer should be clear.
    //send the desired data
    this.write(enc.GetBytes(sendString));

    //look for all desired strings until the timeout is reached
    int sendIndex = -1;
    int foundIndex = -1;
    output += SynchronousRead();
    for (DateTime start = DateTime.Now; DateTime.Now - start < new TimeSpan(0, 0, seconds); )
    {
        //wait for a short period to allow the buffer to fill with new data
        Thread.Sleep(300);

        //read the buffer and add it to the output
        output += SynchronousRead();
        foreach (var s in toWaitFor)
        {
            sendIndex = output.IndexOf(sendString);
            foundIndex = output.LastIndexOf(s);
            if (foundIndex > sendIndex)
            {
                toReturn["sendIndex"] = sendIndex.ToString();
                toReturn["foundIndex"] = foundIndex.ToString();
                toReturn["Output"] = output;
                toReturn["FoundString"] = s;
                return toReturn;
            }
        }
    }

    //Set this to loop infinitely while debugging to make sure the function was only
    //returning above
    while (true)
    {
    }

    toReturn["sendIndex"] = "";
    toReturn["foundIndex"] = "";
    toReturn["Output"] = output;
    toReturn["FoundString"] = "";
    return toReturn;
}

public void write(byte[] toWrite)
{
    var ns = connection.GetStream();
    ns.Write(toWrite, 0, toWrite.Length);
}

public string SynchronousRead()
{
    var ns = connection.GetStream();
    var sb = new StringBuilder();
    while (ns.DataAvailable)
    {
        var buffer = new byte[4096];
        var numberOfBytesRead = ns.Read(buffer, 0, buffer.Length);
        sb.Append(Encoding.ASCII.GetString(buffer, 0, numberOfBytesRead));
    }
    return sb.ToString();
}
All data to be used by a background worker should be passed in through the DoWorkEventArgs and nothing should be pulled off of the class (or GUI interface).
In looking at your code, I could not identify where the property(?) connection is being created. My guess is that connection is created on a different thread, or may be pulling read information from a GUI(?), and either of those could cause problems.
I suggest that you create the connection instance in the DoWork event rather than pulling an existing one off of a different thread. Also verify that the connection does not access any info from the GUI; instead, pass that info in when the worker is started.
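A minimal sketch of that pattern (ConnectionInfo is a hypothetical carrier type for this illustration, and the TcpClient stands in for however your connection is built):

void StartSeparateThread()
{
    var bw = new BackgroundWorker();
    bw.DoWork += bw_DoWork;
    // Everything the worker needs travels through the argument.
    bw.RunWorkerAsync(new ConnectionInfo { Host = "device01", Port = 23 });
}

void bw_DoWork(object sender, DoWorkEventArgs e)
{
    var info = (ConnectionInfo)e.Argument; // passed in, not pulled off the class or GUI
    // The connection is created and owned entirely by the worker thread.
    using (var connection = new TcpClient(info.Host, info.Port))
    {
        // ... send the command and read the responses on this thread-local connection ...
    }
}

class ConnectionInfo
{
    public string Host { get; set; }
    public int Port { get; set; }
}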
I discuss an issue with the Background worker on my blog C# WPF: Linq Fails in BackgroundWorker DoWork Event which might show you where the issue lies in your code.
I am writing very large (both in size and count) documents to a Solr index (hundreds of fields, with many numeric and some text fields). I am using Tomcat 7 on Windows 7 x64.
Based on @Maurico's suggestion for indexing millions of documents, I parallelize the write operation (see code sample below).
The write-to-Solr method is "Task"ed out from a main loop (note: I task it out because the write operation takes too long and holds up the main app).
The problem is that memory consumption grows uncontrollably; the culprit is the Solr write operations (when I comment them out, the run works fine). How do I handle this issue? Via Tomcat? Or SolrNet?
Thanks for your suggestions.
//main loop:
{
    :
    :
    :
    //indexDocsList is the list I create in the main loop and "chunk" out to send to the task.
    List<IndexDocument> indexDocsList = new List<IndexDocument>();
    for (int n = 0; n < N; n++)
    {
        indexDocsList.Add(new IndexDocument { X = 1, Y = 2..... });
        if (n % 5 == 0) //every 5th time we write to solr
        {
            var chunk = new List<IndexDocument>(indexDocsList);
            indexDocsList.Clear();
            Task.Factory.StartNew(() => WriteToSolr(chunk)).ContinueWith(task => chunk.Clear());
            GC.Collect();
        }
    }
}

private void WriteToSolr(List<IndexDocument> indexDocsList)
{
    try
    {
        if (indexDocsList == null) return;
        if (indexDocsList.Count <= 0) return;

        int fromInclusive = 0;
        int toExclusive = indexDocsList.Count;
        int subRangeSize = 25;

        //TO DO: This is still leaking some serious memory, need to fix this
        ParallelLoopResult results = Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive, subRangeSize), (range) =>
        {
            _solr.AddRange(indexDocsList.GetRange(range.Item1, range.Item2 - range.Item1));
            _solr.Commit();
        });

        indexDocsList.Clear();
        GC.Collect();
    }
    catch (Exception ex)
    {
        logger.ErrorException("WriteToSolr()", ex);
    }
    finally
    {
        GC.Collect();
    }
    return;
}
You are manually committing after each batch. This is the most expensive operation for Solr. In your case, I would recommend autoCommit every x seconds along with the softAutoCommit (Solr 4.0) feature. That should take care of Solr's side of things. You'll also have to tweak your JVM garbage collection options so that you don't get stop-the-world GC pauses.
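On the client side the change is simply dropping the per-batch commit; here is a sketch of WriteToSolr under the assumption that autoCommit/autoSoftCommit are configured in solrconfig.xml:

private void WriteToSolr(List<IndexDocument> indexDocsList)
{
    if (indexDocsList == null || indexDocsList.Count == 0) return;

    Parallel.ForEach(Partitioner.Create(0, indexDocsList.Count, 25), range =>
    {
        _solr.AddRange(indexDocsList.GetRange(range.Item1, range.Item2 - range.Item1));
        // No _solr.Commit() here: the server-side <autoCommit> (and <autoSoftCommit>
        // on Solr 4.0) flushes on its own schedule, which is far cheaper than a
        // hard commit per batch.
    });
}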
This is an interesting error I've come across while implementing IEnumerable on a class. It appears to be similar to an "access to modified closure" issue, but I'm at a loss as to how to fix it.
Here is a simple example that demonstrates the issue:
void Main()
{
    var nodeCollection = new NodeCollection();
    nodeCollection.MyItems = new List<string>() { "a", "b", "c" };
    foreach (var node in nodeCollection)
    {
        node.Dump();
    }
}

public class NodeCollection : IEnumerable<Node>
{
    public List<string> MyItems;

    public IEnumerator<Node> GetEnumerator()
    {
        // This isn't necessary, but it should prove that it's not an "access to modified closure" issue.
        var items = MyItems;
        for (var i = 0; i < 3; i++)
        {
            var node = new Node();
            // I want the node to contain the items in MyItems.
            node.Items = items;
            // Plus an additional item. Note that I am adding the item to the node, NOT to MyItems.
            node.Items.Add(string.Format("iteration: {0}", i));
            yield return node;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

public class Node
{
    public List<string> Items;
}
As you can see from the Dump() statement, I'm running this in LINQPad, but the issue will present itself in any IDE.
When I run the snippet, the output shows each node accumulating the "iteration" entries from every earlier pass as well as its own.
Because I am adding the item to Items on the newly instantiated Node, I would NOT expect the item to be added to MyItems, but this is obviously what is occurring.
It seems that Items in Node is pointing to MyItems in NodeCollection.
Can anyone tell me:
Why this is happening?
How to make it not happen?
You are creating new nodes each iteration, but then setting the same items instance to the Items property of each node. Then you are adding the iteration string to the items instance stored in the Items collection (which is always the same instance), resulting in each subsequent node having more and more "iteration" entries. If you kept all of the nodes, you'd find that all of them have exactly the same Items value.
I think the basic misunderstanding here was that you were assuming that setting the Items property of the Node (node.Items = items;) would copy the items list into the node. In fact, all it does is set node.Items to point to the already-existing list that you call items.
This should give you an idea where you went wrong:
// This same instance of items is being reused each time
var items = MyItems;
for (var i = 0; i < 3; i++)
{
var node = new Node();
// I want the node to contains the items in MyItems.
// Assuming node.Items is a List<String>
node.Items = new List<String>();
node.Items.AddRange(items);
node.Items.Add(string.Format("iteration: {0}", i));
yield return node;
}
node.Items = items; sets node.Items to be a reference to the items list. There is just one list, with several references to it.
I suppose that what you want is a separate list in each node, containing a copy of the elements in items. To do that, create a new list which contains all of the elements from items:
node.Items = new List<string>(items);
When you do:
var items = MyItems;
you just create a reference to MyItems and store it in the variable items. Then when you do:
node.Items = items;
you just take the same reference and store it in node.Items. If you need node.Items to be a new list (pointing to a different memory location), initialize it again:
node.Items = new List<string>();
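Putting the answers together, here is a sketch of the corrected enumerator, in which each Node gets its own copy of MyItems:

public IEnumerator<Node> GetEnumerator()
{
    for (var i = 0; i < 3; i++)
    {
        var node = new Node();
        // A fresh list per node: copying MyItems breaks the shared reference,
        // so the "iteration" entry is added only to this node's list.
        node.Items = new List<string>(MyItems);
        node.Items.Add(string.Format("iteration: {0}", i));
        yield return node;
    }
}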