Runspace performance improvement less than expected - multithreading

I am profiling some code for performance, and not getting the results from Runspaces that I would expect.
My source files are 7 Autodesk Revit journal files, ranging from 13MB and 150K lines to 90MB and 900K lines. Each file contains reference to it's own name some number of times, so I am getting that count as a proxy for some real work I want to do later. In the code below, I process all the files with a simple foreach, and then again using runspaces throttled to 8. In both cases I am using a stream reader to parse the files since the files can get rather larger than the ones I am testing with. I wouldn't expect the runspace example to be 25% the time of the loop, but I certainly would expect it to be closer to 25% than even 50%. Instead, I am seeing less than a 50% improvement. The last run was 14.26 seconds for the single thread and 8.74 seconds for 8 runspaces. Am I doing something wrong in my code, or are my expectations incorrect? FWIW I am testing on a VM at the moment. I have tried assigning 4, 6, 8 & 12 cores to the VM with little difference in results. That last test was 12 cores assigned, runspaces throttled to 8. This with a 10 cores hyper threaded Xeon on the host machine.
EDIT: I modified the code to copy the resource files to temp, to remove the network variable, and I added a Jobs based test, again constrained to the same 8 concurrent threads the Runspaces are throttled to. Times are along the lines of 16.8 vs 9.6 vs 7.3. So, Jobs are consistently better, even though my understanding was that runspaces are more efficient and should be faster, and still performance is barely better than a 50% savings, even with 8 threads.
$source = '\\Px\Support\Profiling\Source'
$localSource = "$env:TEMP\Px"
Clear-Host
if (Test-Path $localSource) {
Remove-Item "$localSource\*" -Recurse -force
} else {
New-Item $localSource -ItemType:Directory > $null
}
Copy-Item "$source\*" $localSource
$journals = Get-ChildItem $localSource
Write-Host "Single Thread"
(Measure-Command {
foreach ($journal in $journals) {
$count = 0
#$reader = [IO.StreamReader]::New($journal.fullName, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList $journal.fullName
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journal) {
$count ++
}
}
Write-Host "$journal $count"
$reader.Close()
$reader.Dispose()
}
}).totalSeconds
Write-Host
Write-Host "Runspace 1,8"
(Measure-Command {
$runspacePool = [RunspaceFactory]::CreateRunspacePool(1,8)
$runspacePool.Open()
$runspaceCollection = New-Object system.collections.arraylist
$scriptBlock = {
param (
[string]$journal
)
$journalName = Split-Path $journal -leaf
$count = 0
#$reader = [IO.StreamReader]::New($journal, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList $journal
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journalName) {
$count ++
}
}
$reader.Close()
$reader.Dispose()
"$journalName $count"
}
foreach ($journal in $journals) {
$parameters = #{
journal = $journal.fullName
}
$powershell = [PowerShell]::Create()
$powershell.RunspacePool = $RunspacePool
$powershell.AddScript($scriptBlock) > $null
$powershell.AddParameters($parameters) > $null
$runspace = New-Object -TypeName PSObject -Property #{
runspace = $powershell.BeginInvoke()
powerShell = $powershell
}
$runspaceCollection.Add($runspace) > $null
}
while($runspaceCollection){
foreach($runspace in $runspaceCollection.ToArray()){
if($runspace.RunSpace.IsCompleted -eq $true){
Write-Host "$($runspace.Powershell.EndInvoke($runspace.RunSpace))"
$runspace.Powershell.dispose()
$runspaceCollection.Remove($runspace)
#[System.GC]::Collect()
Start-Sleep -m:100
}
}
}
}).totalSeconds
Write-Host
Write-Host "Jobs 8"
Remove-Job *
(Measure-Command {
$scriptBlock = {
param (
[string]$journal
)
$journalName = Split-Path $journal -leaf
$count = 0
#$reader = [IO.StreamReader]::New($journal, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList:$journal
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journalName) {
$count ++
}
}
$reader.Close()
$reader.Dispose()
Write-Output "$journalName $count"
}
foreach ($journal in $journals) {
Start-Job -ScriptBlock:$scriptBlock -argumentlist:$journal.fullName
While($(Get-Job -State 'Running').Count -ge 8) {
sleep -m:100
}
}
Get-Job | Wait-Job
foreach ($job in Get-Job) {
Write-Host "$(Receive-Job $job)"
Remove-Job $job
}
}).totalSeconds
Write-Host
Remove-Item $localSource -Recurse -force

It's interesting that the Start-Sleep command improves your performance - that suggests that your while($runspaceCollection) loop is part of what's bottlenecking the run speed. After your scripts have all been set running, this loop is constantly re-checking every RunSpace, and only pauses for 100ms whenever one has completed. I think you've built this step the wrong way round - it's probably more important for it to sleep when it hasn't found anything to do:
while($runspaceCollection){
foreach($runspace in $runspaceCollection.ToArray()){
if($runspace.RunSpace.IsCompleted -eq $true){
Write-Host "$($runspace.Powershell.EndInvoke($runspace.RunSpace))"
$runspace.Powershell.dispose()
$runspaceCollection.Remove($runspace)
}else{
Start-Sleep -m:100
}
}
}
Alternatively, you could refactor this part of the code using something like if($RunspacePool.GetAvailableRunspaces() -eq 8) to only begin cleanup of the runspaces once they have all completed. I've left over a thousand runspaces waiting for decomissioning at the end of a script without any noticeable performance drop, so you may be wasting system resources cleaning up earlier than you need to.
Beyond this I'd suggest monitoring CPU and RAM usage on the local machine as you're running the script, to see if there are any obvious performance bottlenecks there that would indicate it slowing down.

Related

Not able to move to the next loop of vm list

I have a code that starts VM's in batches. I am able to start vm's in batches. I am facing an issue here where ('*adds*','*DB*','*') are vm's containing the following wildcard names and need to be started in the following order.The script only starts the first list of vms with 'adds' and continues to restart it in loop it does not move to the next. Any help guys.
$bubbleName="VmList"
do{
$vname=#('*adds*','*DB*','*')
$i=0
Write-Host "Starting VM with "$vname[$i]
$vmList = Get-AzVM -ResourceGroupName $bubbleName -Name $vname[$i]
$batch = #{
Skip = 0
First = 2
}
do{
do{
foreach($vm in ($vmList | Select-Object #batch)){
$params = #($vm.Name, $vm.ResourceGroupName)
$job = Start-Job -ScriptBlock {
param($ComputerName,$serviceName)
Start-AzVM -Name $ComputerName -ResourceGroupName $serviceName
} -ArgumentList $params
}
Wait-Job -Job $job
Get-Job | Receive-Job
Write-Host $batch
$batch.Skip += 2
}
until($batch.skip -ge $vmList.count)
}while($job.state -ne "Completed")
$i++
}while($vname[$i] -ne $null)
You changed this a bit from an earlier (now deleted) version. However, I continued working on the old one and I'll share my thoughts.
If I understand correctly you want to start VMs in orders
*adds*
*db*
* (everything else)
Honestly, I think you're going about your loops are a little convoluted. If it were me, I'd sort the a list of all VMs according to the above order. Then code a single traditional for loop to start 2 at a time and use Wait-Job in the loop.
I don't have an Azure environment to test with but here are my notes from earlier:
Step 1: get the list in the order you need it. This will prevent the funky stuff you're doing with -First & -Skip:
# Example data, substituting for VM objects
$Objects =
#(
[PSCustomObject]#{ Name = "adds1" }
[PSCustomObject]#{ Name = "DB1" }
[PSCustomObject]#{ Name = "adds2" }
[PSCustomObject]#{ Name = "DB2" }
[PSCustomObject]#{ Name = "something" }
[PSCustomObject]#{ Name = "another" }
[PSCustomObject]#{ Name = "last" }
)
$Objects =
$Objects.Where( { $_.Name -match 'adds'} ) +
$Objects.Where( { $_.Name -match 'db'} ) +
$Objects.Where( { $_.Name -notmatch '(adds|db)'} )
Obviously this is a simulation that you'd have to adjust to your VMs. Maybe Objects will actually be defined by Get-AzVM -ResourceGroupName $bubbleName -Name *. At any rate, you should be able to get the VMs in the desired order.
Step 2: a loop to start 2 at a time, waiting for them to start before moving on to the next 2:
$Step = 1
# Note: 1 means 2 at a time, $i and $i+$Step, in this case 1.
# So the first iteration will work on index 0 & 1
# You can modify $Step to different size chunks.
For($i = 0; $i -lt $Objects.Count; ++$i )
{
$ii = $i + $Step
$Jobs =
$Objects[$i..$ii] |
ForEach-Object{
# Run the Start-Job commands against the appropriate machines
}
# Note: The assignment of $Jobs solves another problem. You were only waiting
# for the 2nd of the 2 jobs that were started. Now Wait-Job -Job $Job
# will be an array of jobs, which is allowed by the cmdlet without piping..
# Wait-Job -Job $Jobs
# ...
# I also didn't understand why one of the loops was conditioned on While
# $job.State -ne 'Completed'. Wait job should be enough and the above code
# assumed that.
# Reassign $i so when the loop increments naturally it's 1 past $ii.
# Effectively this chunks 2 elements per iteration.
$i = $ii
}
Again, I don't have the right environment to test all this. But conceptually this should be enough to get you through. I'm sorry about the variable renaming, but given the disadvantage, I had to hack my way through purely on concept.

Performance difference between Runspace and Job

I built a file processor script to convert some files to json. It works but not fast enough, so I am multithreading it. I prefer to use runspace pools since you can specify a max thread limit and it will run that many threads at a time and add new work as it completes other threads, spiffy. But I've found that if I have, say, 6 threads of work to complete, using runspaces takes ~50 minutes and keeps my computer at 40% CPU, while just using Start-Job for each piece of work pegs my computer at 100% CPU, and the work completes in 15 minutes. Am I misconfiguring the runspacepool in some way? Here are simplified examples of each
### Using Start-Job ###
$files = C:\temp | Get-Childitem -filter '*.xel' # returns 6 items
foreach ($file in $files) {
#simplified
Start-Job -ScriptBlock { C:\temp\FileProcessor.ps1 -filepath $using:file.fullname }
}
### Using Runspace Pool ###
$files = C:\temp | Get-Childitem -filter '*.xel' # returns 6 items
$Code = {
param ($filepath)
#simplified
C:\temp\FileProcessor.ps1 -filepath $filepath
}
$rsPool = [runspacefactory]::CreateRunspacePool(1,100)
$rsPool.Open()
$threads = #()
foreach ($file in $files) {
$PSinstance = [powershell]::Create().AddScript($Code).AddArgument($file.FullName)
$PSinstance.RunspacePool = $rsPool
$threads += $PSinstance.BeginInvoke()
}
while ($threads.IsCompleted -contains $False) {}
$rsPool.Dispose()
I may also be misunderstanding runspaces compared to jobs, any help is welcome. Thank you!
Jobs use multiple processes...

Decreased output with PowerShell multithreading than with singlethread script

I am using PowerShell 2.0 on a Windows 7 desktop. I am attempting to search the enterprise CIFS shares for keywords/regex. I already have a simple single threaded script that will do this but a single keyword takes 19-22 hours. I have created a multithreaded script, first effort at multithreading, based on the article by Surly Admin.
Can Powershell Run Commands in Parallel?
Powershell Throttle Multi thread jobs via job completion
and the links related to those posts.
I decided to use runspaces rather than background jobs as the prevailing wisdom says this is more efficient. Problem is, is I am only getting partial resultant output with the multithreaded script I have. Not sure if it is an I/O thing or a memory thing, or something else. Hopefully someone here can help. Here is the code.
cls
Get-Date
Remove-Item C:\Users\user\Desktop\results.txt
$Throttle = 5 #threads
$ScriptBlock = {
Param (
$File
)
$KeywordInfo = Select-String -pattern KEYWORD -AllMatches -InputObject $File
$KeywordOut = New-Object PSObject -Property #{
Matches = $KeywordInfo.Matches
Path = $KeywordInfo.Path
}
Return $KeywordOut
}
$RunspacePool = [RunspaceFactory]::CreateRunspacePool(1, $Throttle)
$RunspacePool.Open()
$Jobs = #()
$Files = Get-ChildItem -recurse -erroraction silentlycontinue
ForEach ($File in $Files) {
$Job = [powershell]::Create().AddScript($ScriptBlock).AddArgument($File)
$Job.RunspacePool = $RunspacePool
$Jobs += New-Object PSObject -Property #{
File = $File
Pipe = $Job
Result = $Job.BeginInvoke()
}
}
Write-Host "Waiting.." -NoNewline
Do {
Write-Host "." -NoNewline
Start-Sleep -Seconds 1
} While ( $Jobs.Result.IsCompleted -contains $false)
Write-Host "All jobs completed!"
$Results = #()
ForEach ($Job in $Jobs) {
$Results += $Job.Pipe.EndInvoke($Job.Result)
$Job.Pipe.EndInvoke($Job.Result) | Where {$_.Path} | Format-List | Out-File -FilePath C:\Users\user\Desktop\results.txt -Append -Encoding UTF8 -Width 512
}
Invoke-Item C:\Users\user\Desktop\results.txt
Get-Date
This is the single threaded version I am using that works, including the regex I am using for socials.
cls
Get-Date
Remove-Item C:\Users\user\Desktop\results.txt
$files = Get-ChildItem -recurse -erroraction silentlycontinue
ForEach ($file in $files) {
Select-String -pattern '[sS][sS][nN]:*\s*\d{3}-*\d{2}-*\d{4}' -AllMatches -InputObject $file | Select-Object matches, path |
Format-List | Out-File -FilePath C:\Users\user\Desktop\results.tx -Append -Encoding UTF8 -Width 512
}
Get-Date
Invoke-Item C:\Users\user\Desktop\results.txt
I am hoping to build this answer over time as I dont want to over comment. I dont know yet why you are losing data from the multithreading but i think we can increase performace with an updated regex. For starters you have many greedy quantifiers that i think we can shrink down.
[sS][sS][nN]:*\s*\d{3}-*\d{2}-*\d{4}
Select-String is case insensitive by default so you dont need the portion in the beginning. Do you have to check for multiple colons? Since you looking for 0 or many :. Same goes for the hyphens. Perhaps these would be better with ? which matches 0 or 1.
ssn:?\s*\d{3}-?\d{2}-?\d{4}
This is assuming you are looking for mostly proper formatted SSN's. If people are hiding them in text maybe you need to look for other delimiters as well.
I would also suggest adding the text to separate files and maybe combining them after execution. If nothing else just to test.
Hoping this will be the start of a proper solution.
It turns out that for some reason the Select-String cmdlet was having problems with the multithreading. I don't have enough of a developer background to be able to tell what is happening under the hood. However I did discover that by using the -quiet option in Select-String, which turns it into a boolean output, I was able to get the results I wanted.
The first pattern match in each document gives a true value. When I get a true then I return the Path of the document to an array. When that is finished I run the pattern match against the paths that were output from the scriptblock. This is not quite as effective performance wise as I had hoped for but still a pretty dramatic improvement over singlethread.
The other issue I ran into was the read/writes to disk by trying to output results to a document at each stage. I have changed that to arrays. While still memory intensive, it is much quicker.
Here is the resulting code. Any additional tips on performance improvement are appreciated:
cls
Remove-Item C:\Users\user\Desktop\output.txt
$Throttle = 5 #threads
$ScriptBlock = {
Param (
$File
)
$Match = Select-String -pattern 'ssn:?\s*\d{3}-?\d{2}-?\d{4}' -Quiet -InputObject $File
if ( $Match -eq $true ) {
$MatchObjects = Select-Object -InputObject $File
$MatchOut = New-Object PSObject -Property #{
Path = $MatchObjects.FullName
}
}
Return $MatchOut
}
$RunspacePool = [RunspaceFactory]::CreateRunspacePool(1, $Throttle)
$RunspacePool.Open()
$Jobs = #()
$Files = Get-ChildItem -Path I:\ -recurse -erroraction silentlycontinue
ForEach ($File in $Files) {
$Job = [powershell]::Create().AddScript($ScriptBlock).AddArgument($File)
$Job.RunspacePool = $RunspacePool
$Jobs += New-Object PSObject -Property #{
File = $File
Pipe = $Job
Result = $Job.BeginInvoke()
}
}
$Results = #()
ForEach ($Job in $Jobs) {
$Results += $Job.Pipe.EndInvoke($Job.Result)
}
$PathValue = #()
ForEach ($Line in $Results) {
$PathValue += $Line.psobject.properties | % {$_.Value}
}
$UniqValues = $PathValue | sort | Get-Unique
$Output = ForEach ( $Path in $UniqValues ) {
Select-String -Pattern '\d{3}-?\d{2}-?\d{4}' -AllMatches -Path $Path | Select-Object -Property Matches, Path
}
$Output | Out-File -FilePath C:\Users\user\Desktop\output.txt -Append -Encoding UTF8 -Width 512
Invoke-Item C:\Users\user\Desktop\output.txt

powershell how to implement worker threads

I have a little performance issue in my script, so i would like to implement some sort of worker theads. but so far i have not been able to find a solution..
what im hoping for is something like this:
start a pool of worker threads - these threads takes "commands" from a queue and process them
the main script will write "commands" to the queue as it runs
once complete the main will tell each thread to stop
main will wait for all workers to end before exiting.
does anybody have en idea on how to do this?
You can do this with Powershell workflows.
From Windows PowerShell: What is Windows PowerShell Workflow?
Workflows can also execute things in parallel, if you like. For
example, if you have a set of tasks that can run in any order, with no
interdependencies, then you can have them all run at more or less the
same time
Just do a search on "Powershell workflows" and you will find a good amount of documentation to get you started.
The basic approach to using a job is this:
$task1 = { ls c:\windows\system32 -r *.dll -ea 0 | where LastWriteTime -gt (Get-Date).AddDays(-21) }
$task2 = { ls E:\Symbols -r *.dll | where LastWriteTime -gt (Get-Date).AddDays(-21) }
$task3 = { Invoke-WebRequest -Uri http://blogs.msdn.com/b/mainfeed.aspx?Type=BlogsOnly | % Content }
$job1 = Start-Job $task1; $job2 = Start-Job $task2; $job3 = Start-Job $task3
Wait-Job $job1,$job2,$job3
$job1Data = Receive-Job $job1
$job2Data = Receive-Job $job2
$job3Data = Receive-Job $job3
If you need to have those background jobs waiting in a loop to do work as the main script dictates have a look at this SO answer to see how to use MSMQ to do this.
With some help from the pointers made by Keith hill - i got it working - thanks a bunch...
Here is a snipping of the code that did my prove of concept:
function New-Task([int]$Index,[scriptblock]$ScriptBlock) {
$ps = [Management.Automation.PowerShell]::Create()
$res = New-Object PSObject -Property #{
Index = $Index
Powershell = $ps
StartTime = Get-Date
Busy = $true
Data = $null
async = $null
}
[Void] $ps.AddScript($ScriptBlock)
[Void] $ps.AddParameter("TaskInfo",$Res)
$res.async = $ps.BeginInvoke()
$res
}
$ScriptBlock = {
param([Object]$TaskInfo)
$TaskInfo.Busy = $false
Start-Sleep -Seconds 1
$TaskInfo.Data = "test $($TaskInfo.Data)"
}
$a = New-Task -Index 1 -ScriptBlock $ScriptBlock
$a.Data = "i was here"
Start-Sleep -Seconds 5
$a
And here is the result proving that the data was communicated into the thread and back again:
Data : test i was here
Busy : False
Powershell : System.Management.Automation.PowerShell
Index : 1
StartTime : 11/25/2013 7:37:07 AM
async : System.Management.Automation.PowerShellAsyncResult
as you can see the $a.data now have "test" in front
So thanks a lot...

Powershell v2.0 Using multiple threads

Basic script idea:
Hello. I've created a powershell script which I use to check the filesizes of certain executables, and then keep them in a text file. Next time the script runs, if a filesize differs it will replace the one in the text file with the new one.
The structure:
I have a main script and a folder which contains many scripts, each for every executable of which I want to check the filesize. So the scripts in the folder will return a string containing the link to the executable, which will be fed to the main script.
The code:
$progdir = "C:\script\programms"
$items = Get-ChildItem -filter *.ps1 -Path $progdir
$webclient = New-Object System.Net.WebClient
$filesizes = get-content C:\updatechecker\programms\filesizes
if ($filesizes.length -ne $items.length) {
if ($filesizes.length -eq $null) {
Write-Host ("Building filesize database...") -nonewline
}
else {
Write-Host ("Rebuilding filesize database...") -nonewline
}
clear-content C:\programms\filesizes
for ($i=0; $i -le $items.length-1; $i++) {
$command = "c:\programms\" + $items[$i].name
$link = & $command
$webclient.OpenRead($link) | Out-Null
$filesize = $webclient.ResponseHeaders["Content-Length"]
$filesize >> C:\programms\filesizes
}
echo "Done."
}
else {
...
Question:
This for loop is the one I want to run in parallel. I need your advice on how to do this since I'm new to powershell. I tried to implement a few things I found but they didn't work correctly (took very long to finish, output errors, multiple entries of filesizes in my filesizes file). I suspect it's a synchronization issue and somehow I need to lock the critical parts. Isn't there anything like omp parallel for in powershell? :P
Any help,advice on how to achieve this would be appreciated :)
edit:
Get-Job | Remove-Job -Force
$progdir = "C:\programms"
$items = Get-ChildItem -filter *.ps1 -Path $progdir
$webclient = New-Object System.Net.WebClient
$filesizes = get-content C:\programms\filesizes
$jobWork = {
param ($MyInput)
$command = "c:\programms\" + $MyInput
$link = & $command
$webclient.OpenRead($link) | Out-Null
$filesize = $webclient.ResponseHeaders["Content-Length"]
$filesize >> C:\programms\filesizes
}
foreach ($item in $items) {
Start-Job -ScriptBlock $jobWork -ArgumentList $item.name | out-null
}
Get-Job | Wait-Job
Get-Job | Receive-Job | Out-GridView | out-null
echo "Done."
Edit 2: Used code I found here: http://ryan.witschger.net/?p=22
$mutex = new-object -TypeName System.Threading.Mutex -ArgumentList $false, “RandomGlobalMutexName”;
$MaxThreads = 4
$SleepTimer = 500
$jobWork = {
param ($MyInput)
$webclient = New-Object System.Net.WebClient
$command = "c:\programms\" + $MyInput
$link = & $command
$webclient.OpenRead($link) | Out-Null
$result = $mutex.WaitOne();
$file = $webclient.ResponseHeaders["Content-Length"]
$file >> C:\programms\filesizes
$mutex.ReleaseMutex();
}
$progdir = "C:\programms"
$items = Get-ChildItem -filter *.ps1 -Path $progdir
$webclient = New-Object System.Net.WebClient
$filesizes = get-content C:\programms\filesizes
Get-Job | Remove-Job -Force
$i = 0
ForEach ($item in $items){
While ($(Get-Job -state running).count -ge $MaxThreads){
Start-Sleep -Milliseconds $SleepTimer
}
$i++
Start-Job -ScriptBlock $jobWork -ArgumentList $item.name | Out-Null
}
You can run each iteration of the loop in a background job which is not the same a seperate thread in that it is a whole other PowerShell.exe process. Data is passed from the background processes through serialization.
To approach it using background jobs you'll need to define a script block that will do that actual work and then call the script block with parameters in each iteration of the loop. The script block can report back status via Write-Output or by throwing an exception.
You'll probably want to throttle how many concurrent background jobs are running. Here's an example of how to throttle:
$jobItems = "a", "b", "c", "d", "e"
$jobMax = 2
$jobs = #()
$jobWork = {
param ($MyInput)
if ($MyInput -eq "d") {
throw "an example of an error"
} else {
write-output "Processed $MyInput"
}
}
foreach ($jobItem in $jobItems) {
if ($jobs.Count -le $jobMax) {
$jobs += Start-Job -ScriptBlock $jobWork -ArgumentList $jobItem
} else {
$jobs | Wait-Job -Any
}
}
$jobs | Wait-Job
As an alternative you might try eventing. Take a look at this thread for some examples of how to implement concurrency using events.
PowerShell: Runspace problem with DownloadFileAsync
You might be able to replace DownloadFileAsync with OpenReadAsync

Resources