I am using Visual Studio 2010 Beta 2.
In Parallel.For loop I execute the same method with different parameter values. After execution processed data must be stored in the database.
But I've got an exception hat says that I could not work with the same data context from different threads.
So the question will be how to work with data context and SubmitChanges() from multiple threads?
I would recommend creating a threadsafe structure for storing your results. Once your parallel for has completed you can read these out of the structure and push them into your linq dataset.
Related
Can documentDb stored procedures run in parallel and update the same object? Will documentDb process them sequentially?
Consider the following scenario.
I have an app and I have 10000 coins to give away to my users when they complete a task. And I have the following object
{
remainingPoints: 10000
}
I have a stored procedure that subtracts 10 points from this object and adds them to the users' points.
Now lets say 10 users complete the task at the same time and I call the stored procedure 10 times at the same time, will DocDb execute them sequentially? Or will I have to execute the stored procedures sequentially?
I had similar questions when I first started using DocumentDB and got good answers here and in email from the DocumentDB product managers. Quoting:
Stored procedures ... get an isolated snapshot of the database for transactional support. The snapshot reflects the current state of the world (no stale data) at the time the sproc begins execution (strongly consistent).
Caveat – since stored procedures are operating on a snapshot, you can still get a stale read in a sproc if a new write come in from the outside world during execution.
Also, stored procedures will ALWAYS read their owns writes.
Sprocs are DocumentDB’s mechanism for multi-document transactions. Sproc writes are committed when a sproc successfully complete execution. If an exception is thrown, all work done in a sproc gets rolled back.
So if two are sprocs are running concurrently, they won’t see eachother’s writes.
If both sprocs happen to write to the same document (replace) – then the 2nd one will fail due to an etag mismatch when it attempts to commit writes.
From that, I went forward with my design making sure to use ETags in my writes as #Julian suggests. I also automatically retry up to 3 times each sproc execution to handle the case where they fail due to parallel operations among other reasons. In practice, I've never exceed the 3 retries (except in cases where my sproc had a bug) and I rarely even get a single retry.
I assume from the behavior that I observe that it sends each new sproc execution to a different replica until it runs out of replicas and then it queues them for sequential execution, so it's a hybrid of parallel and serial execution.
One other tip that I learned through experimentation is that you are better off doing pure read operations (no writes and no significant aggregation) client-side rather than in a sproc when you are on a heavily loaded system. I assume the advantage is because DocumentDB can satisfy different reads from different replicas in parallel. I have modularized my sproc code using the expandScript functionality of documentdb-utils to make sure that I use the exact same code for write validation, intra-document consistency, and derived fields both client-side and server-side, which is possible using node.js. Even if you are mostly .NET, you may want to use expandScripts to build your sprocs in a modular DRY way. You'll still need to run node.js in your build process to pre-process your sprocs or use Edge.NET (node running inside of .NET) to do so on the fly.
It will depend on the consistency you have choose for your collection. But the idea is that DocumentDb handle concurrency using etag and executes stored procedure on a snapshot of a document version, and commit the result only if the execution succeed.
See: https://azure.microsoft.com/en-us/documentation/articles/documentdb-faq/#develop
This thread may help too: Atomically increment an integer in a document in Azure DocumentDB
Being new to Snowflake I am trying to understand how to write JavaScript based Stored Procedures (SP) to take advantage of multi-thread/parallel processing.
My background is SQL Server and writing SP, taking advantage of performance feature such as degrees of parallelism, worker threads, indexing, column store segment elimination.
I am started to get accustomed to setting up the storage and using clustering keys, micro partitioning, and any other performance feature available, but I don't get how Snowflake SPs break down a given SQL statement into parallel streams. I am struggling to find any documentation to explain the internal workings.
My concern is producing SPs that serialise everything on one thread and become bottlenecks.
I am wondering if I am applying the correct technique/ need a different mindset to developing SPs.
I hope I have explained my concern sufficiently. In essence I am building a PoC to migrate an on-premise SQL Server DWH ETL solution to Snowflake/Matillion ELT solution, one aspect being evaluating the compute virtual warehouse size I need.
stateless UDF will run in parallel by default, this what I observed when do large amount of binary data importing via base64 encoding.
stateful UDF's run in parallel on the date as controlled by the PARTITION BY and ORDER BY clauses used on the data. The only trick to remember is to always force initialize your data, as the javascript instance can be used on subsequent PARTITON BY batches, thus don't rely on check for undefined to know if it's the first row.
Is it possible to run a table to table mapping scenario in parallel (multi threading)
we have a huge table and we already created table mapping and scenario on the mapping.
we also executing it from loadplan.
but is there way I can run the scenario in multiple threads to make the data transfer faster.
I am using groovy to script all these task.
It will be better if I get someway to script it in groovy.
A load plan with Parallel steps or a packages with scenarios in asynchronous mode will do for the parallelism part.
An issue you might run in, depending on which KMs are used, is that the same name will be used by temporary tables in all mappings. To avoid that, select the "Use Unique Temporary Object Names" checkbox appears in the Physical tab of your mapping. It will generate a different name for these objects for each execution.
It is possible on the ODI side, you may need some modifications on the mapping to not load any duplicate data. We have a similar flow where we use modula function on a numeric key to split source data into partitions. Then this data gets loaded into target.
To run this interface in multi-thread way, we have a package with a loop that executes the scenario asynchronously of this mapping with a MODULO_VALUE variable.
For loading data we are using oracle sqlloader utility, it is able to work in a parallel way to load data into one target table. I am not sure about if data pump utility also has this ability. But I know if you try to load data by SQL as a multithread approach you would get a ORA-00054: resource busy and acquire with NOWAIT specified error.
As you see there is no Groovy code included in this flow, all handled by ODI mappings, packages and KMs. I hope this helps.
I currently have a spark app that reads a couple of files and forms a data frame out of them and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double check that since all machines on the cluster can access the files (which is a requirement by spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app but since it only shows what actions were performed by which machines and since "sc.textFile(filePath)" is not an action I couldn't be sure what machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute them at all?
In other words, should I be doing maximum amount of my work transforming and performing actions on RDDs than converting them into arrays or some other data structure and then implementing the logic like I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the master. After that it's just a plain array. Indeed the idea is that you should not use collect if you want to perform distributed computations.
Assuming the decision to delete is independent and threadsafe, is it threadsafe to call DeleteOnSubmit in parallel?
All entities will be added to the delete in parallel, then the change will be submitted afterwards.
My testing hasn't shown a problem, but that doesn't inherently mean it's safe....
No.
LINQ to SQL itself is not thread-safe; nor are any of its methods.