Performance difference between Runspace and Job - multithreading

I built a file processor script to convert some files to JSON. It works but not fast enough, so I am multithreading it. I prefer to use runspace pools since you can specify a max thread limit and it will run that many threads at a time and add new work as it completes other threads, spiffy. But I've found that if I have, say, 6 threads of work to complete, using runspaces takes ~50 minutes and keeps my computer at 40% CPU, while just using Start-Job for each piece of work pegs my computer at 100% CPU, and the work completes in 15 minutes. Am I misconfiguring the runspace pool in some way? Here are simplified examples of each:
### Using Start-Job ###
$files = Get-ChildItem -Path C:\temp -Filter '*.xel' # returns 6 items
foreach ($file in $files) {
#simplified
Start-Job -ScriptBlock { C:\temp\FileProcessor.ps1 -filepath $using:file.fullname }
}
### Using Runspace Pool ###
$files = Get-ChildItem -Path C:\temp -Filter '*.xel' # returns 6 items
$Code = {
param ($filepath)
#simplified
C:\temp\FileProcessor.ps1 -filepath $filepath
}
$rsPool = [runspacefactory]::CreateRunspacePool(1,100)
$rsPool.Open()
$threads = @()
foreach ($file in $files) {
$PSinstance = [powershell]::Create().AddScript($Code).AddArgument($file.FullName)
$PSinstance.RunspacePool = $rsPool
$threads += $PSinstance.BeginInvoke()
}
while ($threads.IsCompleted -contains $False) {}
$rsPool.Dispose()
I may also be misunderstanding runspaces compared to jobs, any help is welcome. Thank you!

Jobs use multiple processes...
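As a rough illustration of that difference (a minimal sketch, not taken from the original answer), comparing process IDs shows that a Start-Job script block runs in a separate child process, while a script run through [powershell]::Create() stays inside the current process. The variable names here are made up for the example, and Receive-Job -Wait -AutoRemoveJob assumes PowerShell 3.0 or later.

```powershell
# Start-Job: the script block runs in a separate child PowerShell process,
# so $PID inside the job differs from the PID of the calling session.
$jobPid = Start-Job -ScriptBlock { $PID } | Receive-Job -Wait -AutoRemoveJob

# Runspace: the script runs inside the current process,
# so $PID is the same as in the calling session.
$ps = [powershell]::Create()
$runspacePid = $ps.AddScript('$PID').Invoke()[0]
$ps.Dispose()

"Host PID: $PID, job PID: $jobPid, runspace PID: $runspacePid"
```

Because each job gets its own process, its work is scheduled independently of the host session, at the cost of process start-up time and serialization of input and output.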

Related

Runspace performance improvement less than expected

I am profiling some code for performance, and not getting the results from Runspaces that I would expect.
My source files are 7 Autodesk Revit journal files, ranging from 13MB and 150K lines to 90MB and 900K lines. Each file contains references to its own name some number of times, so I am getting that count as a proxy for some real work I want to do later. In the code below, I process all the files with a simple foreach, and then again using runspaces throttled to 8. In both cases I am using a stream reader to parse the files since the files can get rather larger than the ones I am testing with. I wouldn't expect the runspace example to be 25% of the time of the loop, but I certainly would expect it to be closer to 25% than even 50%. Instead, I am seeing less than a 50% improvement. The last run was 14.26 seconds for the single thread and 8.74 seconds for 8 runspaces. Am I doing something wrong in my code, or are my expectations incorrect? FWIW I am testing on a VM at the moment. I have tried assigning 4, 6, 8 & 12 cores to the VM with little difference in results. That last test was 12 cores assigned, runspaces throttled to 8. This is with a 10-core hyper-threaded Xeon on the host machine.
EDIT: I modified the code to copy the resource files to temp, to remove the network variable, and I added a Jobs based test, again constrained to the same 8 concurrent threads the Runspaces are throttled to. Times are along the lines of 16.8 vs 9.6 vs 7.3. So, Jobs are consistently better, even though my understanding was that runspaces are more efficient and should be faster, and still performance is barely better than a 50% savings, even with 8 threads.
$source = '\\Px\Support\Profiling\Source'
$localSource = "$env:TEMP\Px"
Clear-Host
if (Test-Path $localSource) {
Remove-Item "$localSource\*" -Recurse -force
} else {
New-Item $localSource -ItemType:Directory > $null
}
Copy-Item "$source\*" $localSource
$journals = Get-ChildItem $localSource
Write-Host "Single Thread"
(Measure-Command {
foreach ($journal in $journals) {
$count = 0
#$reader = [IO.StreamReader]::New($journal.fullName, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList $journal.fullName
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journal) {
$count ++
}
}
Write-Host "$journal $count"
$reader.Close()
$reader.Dispose()
}
}).totalSeconds
Write-Host
Write-Host "Runspace 1,8"
(Measure-Command {
$runspacePool = [RunspaceFactory]::CreateRunspacePool(1,8)
$runspacePool.Open()
$runspaceCollection = New-Object system.collections.arraylist
$scriptBlock = {
param (
[string]$journal
)
$journalName = Split-Path $journal -leaf
$count = 0
#$reader = [IO.StreamReader]::New($journal, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList $journal
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journalName) {
$count ++
}
}
$reader.Close()
$reader.Dispose()
"$journalName $count"
}
foreach ($journal in $journals) {
$parameters = @{
journal = $journal.fullName
}
$powershell = [PowerShell]::Create()
$powershell.RunspacePool = $RunspacePool
$powershell.AddScript($scriptBlock) > $null
$powershell.AddParameters($parameters) > $null
$runspace = New-Object -TypeName PSObject -Property @{
runspace = $powershell.BeginInvoke()
powerShell = $powershell
}
$runspaceCollection.Add($runspace) > $null
}
while($runspaceCollection){
foreach($runspace in $runspaceCollection.ToArray()){
if($runspace.RunSpace.IsCompleted -eq $true){
Write-Host "$($runspace.Powershell.EndInvoke($runspace.RunSpace))"
$runspace.Powershell.dispose()
$runspaceCollection.Remove($runspace)
#[System.GC]::Collect()
Start-Sleep -m:100
}
}
}
}).totalSeconds
Write-Host
Write-Host "Jobs 8"
Remove-Job *
(Measure-Command {
$scriptBlock = {
param (
[string]$journal
)
$journalName = Split-Path $journal -leaf
$count = 0
#$reader = [IO.StreamReader]::New($journal, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList:$journal
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journalName) {
$count ++
}
}
$reader.Close()
$reader.Dispose()
Write-Output "$journalName $count"
}
foreach ($journal in $journals) {
Start-Job -ScriptBlock:$scriptBlock -argumentlist:$journal.fullName
While($(Get-Job -State 'Running').Count -ge 8) {
sleep -m:100
}
}
Get-Job | Wait-Job
foreach ($job in Get-Job) {
Write-Host "$(Receive-Job $job)"
Remove-Job $job
}
}).totalSeconds
Write-Host
Remove-Item $localSource -Recurse -force
It's interesting that the Start-Sleep command improves your performance - that suggests that your while($runspaceCollection) loop is part of what's bottlenecking the run speed. After your scripts have all been set running, this loop is constantly re-checking every RunSpace, and only pauses for 100ms whenever one has completed. I think you've built this step the wrong way round - it's probably more important for it to sleep when it hasn't found anything to do:
while($runspaceCollection){
foreach($runspace in $runspaceCollection.ToArray()){
if($runspace.RunSpace.IsCompleted -eq $true){
Write-Host "$($runspace.Powershell.EndInvoke($runspace.RunSpace))"
$runspace.Powershell.dispose()
$runspaceCollection.Remove($runspace)
}else{
Start-Sleep -m:100
}
}
}
Alternatively, you could refactor this part of the code using something like if($RunspacePool.GetAvailableRunspaces() -eq 8) to only begin cleanup of the runspaces once they have all completed. I've left over a thousand runspaces waiting for decommissioning at the end of a script without any noticeable performance drop, so you may be wasting system resources cleaning up earlier than you need to.
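A minimal sketch of that deferred-cleanup idea, assuming PowerShell 3.0 or later: the trivial doubling script block and the names $pool, $work and $results are placeholders for illustration, not part of the original script. The main thread sleeps until the pool reports all 8 runspaces available again and only then collects the results.

```powershell
$pool = [runspacefactory]::CreateRunspacePool(1, 8)
$pool.Open()

# Queue four placeholder work items; each doubles its input after a short delay.
$work = foreach ($n in 1..4) {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    [void]$ps.AddScript({ param($n) Start-Sleep -Milliseconds 200; $n * 2 }).AddArgument($n)
    [pscustomobject]@{ PowerShell = $ps; Handle = $ps.BeginInvoke() }
}

# Sleep while any runspace is still busy instead of re-checking every handle in a tight loop.
while ($pool.GetAvailableRunspaces() -lt 8) {
    Start-Sleep -Milliseconds 100
}

# Collect and clean up only at the end. EndInvoke blocks if a pipeline
# has somehow not finished yet, so no result is lost.
$results = foreach ($w in $work) {
    $w.PowerShell.EndInvoke($w.Handle)
    $w.PowerShell.Dispose()
}
$pool.Dispose()
$results
```

The design choice is that the main thread does nothing but sleep while the workers run, so it does not compete with them for CPU time.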
Beyond this I'd suggest monitoring CPU and RAM usage on the local machine as you're running the script, to see if there are any obvious performance bottlenecks there that would indicate it slowing down.

Powershell Throttle Multi thread jobs via job completion

All the tutorials I have found use a predefined sleep time to throttle jobs.
I need the throttle to wait until a job is completed before starting a new one.
Only 4 jobs can be running at one time.
So the script starts 4 jobs, currently pauses for 10 seconds, and then starts the rest.
What I want is for the script to only allow 4 jobs to be running at one time and as a job is completed a new one is kicked off.
Jobs are initialised via a list of server names.
Is it possible to achieve this?
$servers = Get-Content "C:\temp\flashfilestore\serverlist.txt"
$scriptBlock = { <# DO STUFF #> }
$MaxThreads = 4
foreach($server in $servers) {
Start-Job -ScriptBlock $scriptBlock -argumentlist $server
While($(Get-Job -State 'Running').Count -ge $MaxThreads) {
sleep 10 #Need this to wait until a job is complete and kick off a new one.
}
}
Get-Job | Wait-Job | Receive-Job
You can test the following :
$servers = Get-Content "C:\temp\flashfilestore\serverlist.txt"
$scriptBlock = { <# DO STUFF #> }
invoke-command -computerName $servers -scriptblock $scriptBlock -jobname 'YourJobSpecificName' -throttlelimit 4 -AsJob
This command uses the Invoke-Command cmdlet and its AsJob parameter to start a background job that runs a scriptblock on numerous computers. Because the command must not be run more than 4 times concurrently, the command uses the ThrottleLimit parameter of Invoke-Command to limit the number of concurrent commands to 4.
Be careful that the file contains the computer names in a domain.
To avoid reinventing the wheel, I would recommend using one of the existing tools.
One of them is the script Invoke-Parallel.ps1. It is written in PowerShell, so you can see directly how it is implemented. It is easy to get and does not require any installation.
Another one is the module SplitPipeline. It may work faster because it is written in C#. It also covers some more use cases, for example slow or infinite input and the use of initialization and cleanup scripts.
In the latter case the code with 4 parallel pipelines will be
$servers | Split-Pipeline -Count 4 {process{ <# DO STUFF on $_ #> }}
I wrote a blog article which covers multithreading any given script via actual threads. You can find the full post here:
http://www.get-blog.com/?p=189
The basic setup is:
$ISS = [system.management.automation.runspaces.initialsessionstate]::CreateDefault()
$RunspacePool = [runspacefactory]::CreateRunspacePool(1, $MaxThreads, $ISS, $Host)
$RunspacePool.Open()
$Code = [ScriptBlock]::Create($(Get-Content $FileName))
$PowershellThread = [powershell]::Create().AddScript($Code)
$PowershellThread.RunspacePool = $RunspacePool
$Handle = $PowershellThread.BeginInvoke()
$Job = "" | Select-Object Handle, Thread, object
$Job.Handle = $Handle
$Job.Thread = $PowershellThread
$Job.Object = $Object.ToString()
$Job.Thread.EndInvoke($Job.Handle)
$Job.Thread.Dispose()
Instead of sleep 10 you could also just wait on a job (-any job):
Get-Job | Wait-Job -Any | Out-Null
When there are no more jobs to kick off, start printing the output. You can also do this within the loop immediately after the above command. The script will receive jobs as they finish instead of waiting until the end.
Get-Job -State Completed | % {
Receive-Job $_ -AutoRemoveJob -Wait
}
So your script would look like this:
$servers = Get-Content "C:\temp\flashfilestore\serverlist.txt"
$scriptBlock = { <# DO STUFF #> }
$MaxThreads = 4
foreach ($server in $servers) {
Start-Job -ScriptBlock $scriptBlock -argumentlist $server
While($(Get-Job -State Running).Count -ge $MaxThreads) {
Get-Job | Wait-Job -Any | Out-Null
}
Get-Job -State Completed | % {
Receive-Job $_ -AutoRemoveJob -Wait
}
}
While ($(Get-Job -State Running).Count -gt 0) {
Get-Job | Wait-Job -Any | Out-Null
}
Get-Job -State Completed | % {
Receive-Job $_ -AutoRemoveJob -Wait
}
Having said all that, I prefer runspaces (similar to Ryan's post) or even workflows if you can use them. These are far less resource intensive than starting multiple PowerShell processes.
Your script looks good, try and add something like
Write-Host ("current count:" + ($(Get-Job -State 'Running').Count) + " on server:" + $server)
after your while loop to work out whether the job count is going down where you wouldn't expect it.
I noticed that every Start-Job command resulted in an additional conhost.exe process in the task manager. Knowing this, I was able to throttle using the following logic, where 5 is my desired number of concurrent threads (so I use 4 in my -gt statement since I am looking for a count greater than):
while((Get-Process conhost -ErrorAction SilentlyContinue).Count -gt 4){Start-Sleep -Seconds 1}

powershell how to implement worker threads

I have a little performance issue in my script, so I would like to implement some sort of worker threads, but so far I have not been able to find a solution.
What I'm hoping for is something like this:
Start a pool of worker threads; these threads take "commands" from a queue and process them.
The main script will write "commands" to the queue as it runs.
Once complete, the main script will tell each thread to stop.
The main script will wait for all workers to end before exiting.
Does anybody have an idea on how to do this?
You can do this with Powershell workflows.
From Windows PowerShell: What is Windows PowerShell Workflow?
Workflows can also execute things in parallel, if you like. For
example, if you have a set of tasks that can run in any order, with no
interdependencies, then you can have them all run at more or less the
same time
Just do a search on "Powershell workflows" and you will find a good amount of documentation to get you started.
The basic approach to using a job is this:
$task1 = { ls c:\windows\system32 -r *.dll -ea 0 | where LastWriteTime -gt (Get-Date).AddDays(-21) }
$task2 = { ls E:\Symbols -r *.dll | where LastWriteTime -gt (Get-Date).AddDays(-21) }
$task3 = { Invoke-WebRequest -Uri http://blogs.msdn.com/b/mainfeed.aspx?Type=BlogsOnly | % Content }
$job1 = Start-Job $task1; $job2 = Start-Job $task2; $job3 = Start-Job $task3
Wait-Job $job1,$job2,$job3
$job1Data = Receive-Job $job1
$job2Data = Receive-Job $job2
$job3Data = Receive-Job $job3
If you need to have those background jobs waiting in a loop to do work as the main script dictates have a look at this SO answer to see how to use MSMQ to do this.
With some help from the pointers made by Keith Hill, I got it working. Thanks a bunch!
Here is a snippet of the code from my proof of concept:
function New-Task([int]$Index,[scriptblock]$ScriptBlock) {
$ps = [Management.Automation.PowerShell]::Create()
$res = New-Object PSObject -Property @{
Index = $Index
Powershell = $ps
StartTime = Get-Date
Busy = $true
Data = $null
async = $null
}
[Void] $ps.AddScript($ScriptBlock)
[Void] $ps.AddParameter("TaskInfo",$Res)
$res.async = $ps.BeginInvoke()
$res
}
$ScriptBlock = {
param([Object]$TaskInfo)
$TaskInfo.Busy = $false
Start-Sleep -Seconds 1
$TaskInfo.Data = "test $($TaskInfo.Data)"
}
$a = New-Task -Index 1 -ScriptBlock $ScriptBlock
$a.Data = "i was here"
Start-Sleep -Seconds 5
$a
And here is the result proving that the data was communicated into the thread and back again:
Data : test i was here
Busy : False
Powershell : System.Management.Automation.PowerShell
Index : 1
StartTime : 11/25/2013 7:37:07 AM
async : System.Management.Automation.PowerShellAsyncResult
As you can see, $a.Data now has "test" in front.
So thanks a lot...
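The four steps listed in the question (a pool of workers, a shared command queue, a stop signal, and a final wait) could also be sketched with a thread-safe .NET queue shared between runspaces. This is only one possible approach, assuming PowerShell 5.0 or later for the ::new() syntax. The names $queue, $results and $worker are hypothetical, and the "command" is just an integer to keep the example self-contained.

```powershell
# Thread-safe queue of "commands" and a thread-safe bag for the output.
$queue   = [System.Collections.Concurrent.BlockingCollection[int]]::new()
$results = [System.Collections.Concurrent.ConcurrentBag[string]]::new()

$worker = {
    param($queue, $results)
    # GetConsumingEnumerable blocks until an item is available and
    # ends once CompleteAdding() has been called and the queue is empty.
    foreach ($item in $queue.GetConsumingEnumerable()) {
        $results.Add("processed $item")
    }
}

# Start three workers, each in its own runspace, all sharing the same queue.
$workers = foreach ($i in 1..3) {
    $ps = [powershell]::Create()
    [void]$ps.AddScript($worker).AddArgument($queue).AddArgument($results)
    [pscustomobject]@{ PowerShell = $ps; Handle = $ps.BeginInvoke() }
}

# The main script writes commands to the queue as it runs...
1..10 | ForEach-Object { $queue.Add($_) }

# ...then tells the workers to stop once the queue has drained...
$queue.CompleteAdding()

# ...and waits for all workers to end before exiting.
foreach ($w in $workers) {
    $w.PowerShell.EndInvoke($w.Handle)
    $w.PowerShell.Dispose()
}
$results.Count
```

Using CompleteAdding() as the stop signal means no separate "stop" command has to be sent to each worker.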

Calling multiple URLs at a time using multithreading in powershell

I have one URL for which its query changes. The queries are stored in an array so changing the URLs isn't a problem within a loop (I'm not interested in any particular query).
I'm having a hard time creating jobs for each URL and starting a group of jobs at the same time and monitoring them.
I figure to start to iterate through the array of queries 5 at a time, I'd be calling 5 new URLs so every iteration needs to have an array of jobs whose elements are the URLs for that iteration.
Is my approach right? Any pointers will be appreciated!
This is sample code to demonstrate my approach:
$queries = 1..10
$jobs = @()
foreach ($i in $queries) {
if ($jobs.Count -lt 5) {
$ScriptBlock = {
$query = $queries[$i]
$path = "http://mywebsite.com/$query"
Invoke-WebRequest -Uri $path
}
$jobs += Start-Job -ScriptBlock $ScriptBlock
} else {
$jobs | Wait-Job -Any
}
}
You will run into a couple of issues with the code above. The scriptblock gets transferred to a different PowerShell.exe process to execute, so it won't have access to $queries. You will need to pass it in like so:
...
$scriptblock = {param($queries)
...
}
...
$jobs += Start-Job $scriptblock -Arg $queries
The other issue is that you never remove a completed job from $jobs, so once the $jobs.Count -lt 5 expression evaluates to false because the count has reached 5, you'll never add any more jobs. Try something like this:
$jobs | Wait-Job -Any
$jobs = $jobs | Where {$_.State -eq 'Running'}
Then you'll wind up with only the running jobs in $jobs which will allow you to start more jobs as previous jobs complete (or fail).
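Putting both points together, a throttled loop might look roughly like the sketch below. It is a sketch under stated assumptions rather than the answerer's exact code: the script block returns a placeholder string instead of calling Invoke-WebRequest, and $maxJobs, $running, $finished and $work are hypothetical names. Finished jobs are moved to a second list rather than dropped, so their output can still be received at the end.

```powershell
$queries  = 1..10
$maxJobs  = 5
$running  = @()
$finished = @()

# Stand-in for the real web request; the query is passed in as an argument.
$work = { param($query) "result for query $query" }

foreach ($query in $queries) {
    if ($running.Count -ge $maxJobs) {
        # Block until at least one running job finishes.
        $null = $running | Wait-Job -Any
        # Take one snapshot of the finished jobs so none is lost between two checks.
        $done      = @($running | Where-Object { $_.State -ne 'Running' })
        $finished += $done
        $running   = @($running | Where-Object { $done -notcontains $_ })
    }
    $running += Start-Job -ScriptBlock $work -ArgumentList $query
}

# Wait for the stragglers, then collect everything.
$finished += $running | Wait-Job
$output = $finished | Receive-Job
$finished | Remove-Job
$output
```

The output order follows job completion, not the order of $queries.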

Powershell to wake up multiple media drives simultaneously

I have a server with lots of media drives ~43TB. An areca 1882ix-16 is set to spin the drives down after 30 minutes of inactivity since most days an individual drive is not even used. This works nicely to prevent unnecessary power and heat. In this case the drives still show up in windows explorer but when you click to access them it takes about 10 seconds for the folder list to show up since it has to wait for the drive to spin up.
For administrative work I have a need to spin up all the drives to be able to search among them. Clicking on each drive in windows explorer and then waiting for it to spin up before clicking the next drive is very tedious. Obviously multiple explorer windows makes it faster but it is still tedious. I thought a powershell script may ease the pain.
So I started with the following:
$mediaDrives = @('E:', 'F:', 'G:', 'H:', 'I:', 'J:', 'K:', 'L:',
'M:','N:', 'O:', 'P:', 'Q:', 'R:', 'S:')
get-childitem $mediaDrives | foreach-object -process { $_.Name }
This is just requesting that each drive in the array have its root folder name listed. That works to wake the drive but it is again a linear function. The script pauses for each drive before printing. Looking for a solution as to how to wake each drive simultaneously. Is there a way to multi-thread or something else?
Here's a script that will do what you want, but it must be run under powershell using the MTA threading mode (which is the default for powershell.exe 2.0, but powershell.exe 3.0 must be launched with the -MTA switch.)
#requires -version 2.0
# if running in ISE or in STA console, abort
if (($host.runspace.apartmentstate -eq "STA") -or $psise) {
write-warning "This script must be run under powershell -MTA"
exit
}
$mediaDrives = @('E:', 'F:', 'G:', 'H:', 'I:', 'J:', 'K:', 'L:',
'M:','N:', 'O:', 'P:', 'Q:', 'R:', 'S:')
# create a pool of 8 runspaces
$pool = [runspacefactory]::CreateRunspacePool(1, 8)
$pool.Open()
$jobs = @()
$ps = @()
$wait = @()
$count = $mediaDrives.Length
for ($i = 0; $i -lt $count; $i++) {
# create a "powershell pipeline runner"
$ps += [powershell]::create()
# assign our pool of 8 runspaces to use
$ps[$i].runspacepool = $pool
# add wake drive command
[void]$ps[$i].AddScript(
"dir $($mediaDrives[$i]) > `$null")
# start script asynchronously
$jobs += $ps[$i].BeginInvoke();
# store wait handles for WaitForAll call
$wait += $jobs[$i].AsyncWaitHandle
}
# wait 5 minutes for all jobs to finish (configurable)
$success = [System.Threading.WaitHandle]::WaitAll($wait,
(new-timespan -Minutes 5))
write-host "All completed? $success"
# end async call
for ($i = 0; $i -lt $count; $i++) {
write-host "Completing async pipeline job $i"
try {
# complete async job
$ps[$i].EndInvoke($jobs[$i])
} catch {
# oops-ee!
write-warning "error: $_"
}
# dump info about completed pipelines
$info = $ps[$i].InvocationStateInfo
write-host "State: $($info.state) ; Reason: $($info.reason)"
}
So, for example, save as warmup.ps1 and run like: powershell -mta c:\scripts\warmup.ps1
To read more about runspace pools and the general technique above, take a look at my blog entry about runspacepools:
http://nivot.org/blog/post/2009/01/22/CTP3TheRunspaceFactoryAndPowerShellAccelerators
I chose 8 pretty much arbitrarily for the parallelism factor - experiment yourself with lower or higher numbers.
Spin up a separate powershell instance for each drive or use workflows in PowerShell 3.0.
Anyhow, you can pass drives directly to the Path parameter and skip ForEach-Object altogether:
Get-ChildItem $mediaDrives
Have you considered approaching this with the Start-Job cmdlet:
$mediaDrives = @('E:', 'F:', 'G:', 'H:', 'I:', 'J:', 'K:')
$mediaDrives | ForEach-Object {
Start-Job -ArgumentList $_ -ScriptBlock {param($drive)
Get-ChildItem $drive
}
}
The only clever part is that you need to use the -ArgumentList parameter on the Start-Job cmdlet to pass the correct value through for each iteration. This will create a background task that runs in parallel with the execution of the script. If you are curious
If you don't want to wait, well, don't wait: start those wake-up calls in the background.
In bash one would write
for drive in "${mediadrives[@]}"; do tickle_and_wake "$drive" & done
(note the ampersand, which means: start the command in the background, don't wait for it to complete)
In PowerShell that would translate to something like
foreach ($drive in $mediadrives) {
Start-Job {param($d) tickle_and_wake $d} -Arg $drive
}
If you want confirmation that all background jobs have completed, use wait in bash or Wait-Job in Powershell

Resources