I have this PowerShell script that reads lines of integers from a file and, for every $CHUNK_SIZE lines, creates a new job that sums the prime numbers in that chunk, continuing until the end of the file. $MAX_THREADS is the number of jobs that can be running at the same time; I'm testing with 1, but will change to 2, 4, and 8 later. I wait for all the jobs to complete and then receive the subtotals from the jobs to get the sum of all primes in the file. My problem is that the accumulated total at the end should be 2844292, but I keep getting 2766271. I have checked my prime function and what's being read from the file and sent to the job, but they do not seem to be the problem. When I look at my output, I notice that I receive the same value twice in a row for the last two jobs. I'm not sure why that's happening.
These photos show my output:
1.Get-Job Info
2.What I get back from Receive-Job and my total increasing
Any help as to why my total is off would be greatly appreciated! Thanks!
Set-StrictMode -Version latest
$CHUNK_SIZE = 1024
$MAX_THREADS = 1
$scriptBlock = {
param($chunkArr)
function isPrime([int]$data){
if($data -lt 2){return $FALSE}
if($data -eq 2){return $TRUE}
if($data % 2 -eq 0){return $FALSE}
for($i=3;$i*$i-le$data;$i+=2){
if($data % $i -eq 0){return $FALSE}
}
return $TRUE
}
$total = 0
foreach($line in $chunkArr){
$data = [int]$line
if(isPrime $data){
$total += $data
}
}
$total
}
$eof = $FALSE
$reader = New-Object System.IO.StreamReader("$PWD/ass2-20000.txt")
$chunkArr = New-Object System.Collections.ArrayList
while(!$eof){
$chunkArr.Clear()
for($i=0;$i -lt $CHUNK_SIZE;$i++){
$line = $reader.ReadLine()
if($line -eq $NULL){
$eof = $TRUE
break
}
$chunkArr.add($line) | Out-Null
}
While(@(Get-Job -state running).count -ge $MAX_THREADS){
Start-Sleep -Seconds .2
}
Start-Job -ArgumentList (,$chunkArr) -ScriptBlock $scriptBlock
}
While(@(Get-Job -state running).count -ge 1){
Start-Sleep -Seconds .2
}
Get-Job
$total = 0
foreach($job in Get-Job){
$tmp = Receive-job $job
Write-Output ("Recieved: " + $tmp)
$total += $tmp
Write-Output ("Total: " + $total)
Remove-Job $job
}
$reader.Close()
$total
EDIT: This image shows what the jobs should be returning, with the last line showing the correct total. It seems my first job is never being received and my last one is being received twice.
EDIT: Compare What I should be getting from each Job to What I actually get
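One thing worth ruling out, since the symptoms (a missing first subtotal and a repeated last one) would be consistent with it: the script reuses a single `$chunkArr` and clears it for every chunk, so a job may end up seeing a later chunk if its arguments are consumed after the list has been refilled. That cause is an assumption, not something confirmed here. A minimal sketch of the alternative, giving every job its own freshly allocated array, is below; the in-memory `$lines` stand-in and the plain sum (instead of the prime filter) are hypothetical simplifications.

```powershell
$CHUNK_SIZE = 3
# Hypothetical stand-in for the lines read from the input file
$lines = 1..8 | ForEach-Object { "$_" }

# Simplified worker: sums every value in its chunk (no prime test here)
$sumBlock = {
    param($chunk)
    ($chunk | ForEach-Object { [int]$_ } | Measure-Object -Sum).Sum
}

$jobs = for ($i = 0; $i -lt $lines.Count; $i += $CHUNK_SIZE) {
    $end = [Math]::Min($i + $CHUNK_SIZE, $lines.Count) - 1
    # A new array per job, so later iterations cannot change what this job receives
    $chunk = [string[]]$lines[$i..$end]
    Start-Job -ScriptBlock $sumBlock -ArgumentList (,$chunk)
}

# Add up the subtotals returned by the jobs
$total = ($jobs | Wait-Job | Receive-Job | Measure-Object -Sum).Sum
$jobs | Remove-Job
$total
```

Keeping the job objects in `$jobs` also avoids depending on whatever other jobs `Get-Job` happens to return in the session.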
I am profiling some code for performance, and not getting the results from Runspaces that I would expect.
My source files are 7 Autodesk Revit journal files, ranging from 13MB and 150K lines to 90MB and 900K lines. Each file contains a reference to its own name some number of times, so I am getting that count as a proxy for some real work I want to do later. In the code below, I process all the files with a simple foreach, and then again using runspaces throttled to 8. In both cases I am using a stream reader to parse the files, since the real files can get rather larger than the ones I am testing with. I wouldn't expect the runspace example to take 25% of the time of the loop, but I certainly would expect it to be closer to 25% than to 50%. Instead, I am seeing less than a 50% improvement. The last run was 14.26 seconds for the single thread and 8.74 seconds for 8 runspaces. Am I doing something wrong in my code, or are my expectations incorrect? FWIW I am testing on a VM at the moment. I have tried assigning 4, 6, 8 & 12 cores to the VM with little difference in results. That last test was with 12 cores assigned and runspaces throttled to 8, on a host machine with a 10-core hyper-threaded Xeon.
EDIT: I modified the code to copy the resource files to temp, to remove the network variable, and I added a Jobs based test, again constrained to the same 8 concurrent threads the Runspaces are throttled to. Times are along the lines of 16.8 vs 9.6 vs 7.3. So, Jobs are consistently better, even though my understanding was that runspaces are more efficient and should be faster, and still performance is barely better than a 50% savings, even with 8 threads.
$source = '\\Px\Support\Profiling\Source'
$localSource = "$env:TEMP\Px"
Clear-Host
if (Test-Path $localSource) {
Remove-Item "$localSource\*" -Recurse -force
} else {
New-Item $localSource -ItemType:Directory > $null
}
Copy-Item "$source\*" $localSource
$journals = Get-ChildItem $localSource
Write-Host "Single Thread"
(Measure-Command {
foreach ($journal in $journals) {
$count = 0
#$reader = [IO.StreamReader]::New($journal.fullName, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList $journal.fullName
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journal) {
$count ++
}
}
Write-Host "$journal $count"
$reader.Close()
$reader.Dispose()
}
}).totalSeconds
Write-Host
Write-Host "Runspace 1,8"
(Measure-Command {
$runspacePool = [RunspaceFactory]::CreateRunspacePool(1,8)
$runspacePool.Open()
$runspaceCollection = New-Object system.collections.arraylist
$scriptBlock = {
param (
[string]$journal
)
$journalName = Split-Path $journal -leaf
$count = 0
#$reader = [IO.StreamReader]::New($journal, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList $journal
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journalName) {
$count ++
}
}
$reader.Close()
$reader.Dispose()
"$journalName $count"
}
foreach ($journal in $journals) {
$parameters = @{
journal = $journal.fullName
}
$powershell = [PowerShell]::Create()
$powershell.RunspacePool = $RunspacePool
$powershell.AddScript($scriptBlock) > $null
$powershell.AddParameters($parameters) > $null
$runspace = New-Object -TypeName PSObject -Property @{
runspace = $powershell.BeginInvoke()
powerShell = $powershell
}
$runspaceCollection.Add($runspace) > $null
}
while($runspaceCollection){
foreach($runspace in $runspaceCollection.ToArray()){
if($runspace.RunSpace.IsCompleted -eq $true){
Write-Host "$($runspace.Powershell.EndInvoke($runspace.RunSpace))"
$runspace.Powershell.dispose()
$runspaceCollection.Remove($runspace)
#[System.GC]::Collect()
Start-Sleep -m:100
}
}
}
}).totalSeconds
Write-Host
Write-Host "Jobs 8"
Remove-Job *
(Measure-Command {
$scriptBlock = {
param (
[string]$journal
)
$journalName = Split-Path $journal -leaf
$count = 0
#$reader = [IO.StreamReader]::New($journal, $true)
$reader = New-Object -typeName:System.IO.StreamReader -argumentList:$journal
while (-not ($reader.EndOfStream)) {
$line = ($reader.ReadLine()).Trim()
if ($line -match $journalName) {
$count ++
}
}
$reader.Close()
$reader.Dispose()
Write-Output "$journalName $count"
}
foreach ($journal in $journals) {
Start-Job -ScriptBlock:$scriptBlock -argumentlist:$journal.fullName
While($(Get-Job -State 'Running').Count -ge 8) {
sleep -m:100
}
}
Get-Job | Wait-Job
foreach ($job in Get-Job) {
Write-Host "$(Receive-Job $job)"
Remove-Job $job
}
}).totalSeconds
Write-Host
Remove-Item $localSource -Recurse -force
It's interesting that the Start-Sleep command improves your performance - that suggests that your while($runspaceCollection) loop is part of what's bottlenecking the run speed. After your scripts have all been set running, this loop is constantly re-checking every RunSpace, and only pauses for 100ms whenever one has completed. I think you've built this step the wrong way round - it's probably more important for it to sleep when it hasn't found anything to do:
while($runspaceCollection){
foreach($runspace in $runspaceCollection.ToArray()){
if($runspace.RunSpace.IsCompleted -eq $true){
Write-Host "$($runspace.Powershell.EndInvoke($runspace.RunSpace))"
$runspace.Powershell.dispose()
$runspaceCollection.Remove($runspace)
}else{
Start-Sleep -m:100
}
}
}
Alternatively, you could refactor this part of the code using something like if($RunspacePool.GetAvailableRunspaces() -eq 8) to only begin cleanup of the runspaces once they have all completed. I've left over a thousand runspaces waiting for decommissioning at the end of a script without any noticeable performance drop, so you may be wasting system resources cleaning up earlier than you need to.
Beyond this I'd suggest monitoring CPU and RAM usage on the local machine as you're running the script, to see if there are any obvious performance bottlenecks there that would indicate it slowing down.
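As a rough illustration of deferring the cleanup until the work is done, the sketch below queues everything first and then collects the results in submission order. It relies on EndInvoke blocking until its pipeline has finished, so no polling loop or sleep is needed. The squaring script block and the 1..4 inputs are hypothetical placeholders for the real per-file work.

```powershell
$pool = [RunspaceFactory]::CreateRunspacePool(1, 8)
$pool.Open()

# Queue all the work up front; the pool limits concurrency to 8
$work = foreach ($n in 1..4) {
    $ps = [PowerShell]::Create()
    $ps.RunspacePool = $pool
    [void]$ps.AddScript({ param($n) $n * $n }).AddArgument($n)
    [pscustomobject]@{ PowerShell = $ps; Handle = $ps.BeginInvoke() }
}

# EndInvoke waits for that pipeline to complete, then returns its output
$results = foreach ($w in $work) {
    $w.PowerShell.EndInvoke($w.Handle)
    $w.PowerShell.Dispose()
}

$pool.Close()
$pool.Dispose()
$results
```

The trade-off is that output only appears once each item's turn comes up in the collection loop, rather than as soon as any runspace finishes.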
I am having issue with threading and hoping someone can clear this up for me. My thread is returning duplicate entries in an array. I have been going in circles trying to figure out why. Here is the code:
$arrayofinfo | Start-RSJob -Name {"Command_$($_)"} -throttle 10 -ScriptBlock {
$command = $_
$array_1 = @()
$array_1 = Invoke-Expression " & $command" -EA SilentlyContinue
if(($array_1.count) -gt 20)
{
$array_1 += $command
$array_1 += $array_1
return $array_1
}
} ## end of scriptblock
get-rsjob | wait-rsjob #-Timeout 7
$array_complete = get-rsjob -HasMoreData -ErrorAction SilentlyContinue | Receive-RSJob -ErrorAction SilentlyContinue | Select-Object -ErrorAction SilentlyContinue
What is happening is either the $command is executed twice or results are put in $array_1 twice. Somehow... $array_complete is double in size and contains duplicate entries for each entry. HOW?????? Anything else that looks like it can be improved please comment on. thanks.
I have access to a single-core single-processor VM with which to do logging for my team. I have the following code:
$sb = {
Param($_)
if($_.CONTROLLER -ne ".xx" ){
$posIP = "10." + $_.IP + $_.CONTROLLER
if (Test-Connection -ComputerName $posIP -Count 1 -Quiet) {
$mapPath = "\\" + $posIP + "\c$"
net use $mapPath $password /user:$userName | Out-Null
if(Test-Path $mapPath$dataFile) {
[xml]$periods = Get-Content $mapPath$dataFile
$endDate = $periods.IndataDbf.ingredient.PeriodDetail.PeriodEndDate | select -last 1
$output = "$($_.STORE);$endDate" }
else {
$outPut = $_.STORE + ';' + "$dataFile Not Found" }
net use $mapPath /de | Out-Null
}
else {
$outPut = $_.STORE + ';' + "Map FAILED" }
Write-Output $OutPut
}
}
Import-Csv $inFile | ForEach-Object {
while ((Get-Job -State Running).Count -ge 100) {
Start-Sleep -Seconds 5;
}
Write-Output $_.STORE
Start-Job -Scriptblock $sb -ArgumentList $_ | Write-Verbose
Get-Job -State Completed -HasMoreData 1 | Receive-Job | Out-File -Append -FilePath $outLog
}
Get-Job | Wait-Job | Receive-Job | Out-File -Append -FilePath $outLog
This runs well, but takes the same amount of time as running the same code without Start-Job and just a loop. However, the previous logging command used BATCH files and automatically opened a couple dozen child command windows to process data, then return, and it runs in under half the time. The code used is the same, so I don't understand why adding more threads didn't make the script run faster. Can anyone tell me why a BATCH file program with a couple dozen child windows runs so much faster with arguably the same code? And why does the Start-Job command not improve the speed at all? I would think it would try to execute multiple threads simultaneously.
Because there is a lot of overhead when using Start-Job and whenever you use the pipeline.
If you use runspaces instead, it may be faster. Take a look at http://newsqlblog.com/2012/05/22/concurrency-in-powershell-multi-threading-with-runspaces/
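To get a feel for that overhead on a given machine, a rough comparison like the one below can help. It times a trivial operation run directly and then run through Start-Job, where each job is started as a separate PowerShell process. The doubling workload is a hypothetical placeholder and the timings will vary by system, so treat the output as indicative only.

```powershell
# Trivial work done in the current session
$direct = Measure-Command {
    1..5 | ForEach-Object { $_ * 2 } | Out-Null
}

# The same work, one background job per item
$viaJobs = Measure-Command {
    $jobs = 1..5 | ForEach-Object {
        Start-Job -ScriptBlock { param($n) $n * 2 } -ArgumentList $_
    }
    $jobs | Wait-Job | Receive-Job | Out-Null
    $jobs | Remove-Job
}

"Direct: {0:N0} ms; via Start-Job: {1:N0} ms" -f $direct.TotalMilliseconds, $viaJobs.TotalMilliseconds
```

If the per-job startup cost is large compared with the work each job does, batching several items into one job is one way to amortize it.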
I have a script that pulls one line of data from a file on multiple servers. I have a single-threaded version that works just fine, but I want to get it to run faster. Since I only need one line of one file from each server, I'm sure I could run this in parallel. I pulled code from multiple places to get a multi-threaded script running, but when I try to get all the results to print to one output file, nothing prints. I wonder if anyone can look at my code to tell me why this same script, without the Jobs, works fine, but after adding jobs, it doesn't.
$sb = {
Param($computer, $fileName, $outLog)
net use "\\$computer\c$" **** /user:****
if(test-path \\$computer\c$\sc\$fileName){
[xml]$periods = Get-Content \\$computer\c$\sc\$fileName
$endDate = $periods.PeriodDetail | select -last 1
$output = "$computer;$endDate"
}
Else {
$output = "$computer;$fileName Not Found"
}
#Synchronize file usage
$mutex = new-object System.Threading.Mutex $false,'SomeUniqueName'
$mutex.WaitOne() > $null
#Write data to log
Out-File -Append -InputObject $output -FilePath $outLog
#Release file hold
$mutex.ReleaseMutex()
net use "\\$computer\c$" /de
}
foreach($computer in $computerName){
while ((Get-Job -State Running).Count -ge 20) {
Start-Sleep -Seconds 5;
}
Start-Job -Scriptblock $sb -ArgumentList $computer,$fileName,$outLog
}
Get-Job | Wait-Job | Receive-Job
Thank you for all the assistance. Here is the resulting code that works pretty well:
$sb = {
Param($computer, $fileName, $outLog)
net use "\\$computer\c$" $password /user:$userName | Out-Null
if(test-path \\$computer\c$\sc\$fileName){
[xml]$periods = Get-Content \\$computer\c$\sc\$fileName
$endDate = $periods.IndataDbf.ingredient.PeriodDetail.PeriodEndDate | select -last 1
$output = "$computer;$endDate"
}
Else {
$output = "$computer;$fileName Not Found"
}
Write-Output -InputObject $output
net use "\\$computer\c$" /de | Out-Null
}
foreach($computer in $computerName){
while ((Get-Job -State Running).Count -ge 20) {
Start-Sleep -Seconds 5;
}
Start-Job -Scriptblock $sb -ArgumentList $computer,$fileName,$outLog
}
Get-Job | Wait-Job | Receive-Job | Out-File -Append -FilePath $outLog
I'm thinking of doing another Get-Job right before the Start-Job, getting only jobs that are complete with more data, but I haven't tested it yet.
Basic script idea:
Hello. I've created a powershell script which I use to check the filesizes of certain executables, and then keep them in a text file. Next time the script runs, if a filesize differs it will replace the one in the text file with the new one.
The structure:
I have a main script and a folder which contains many scripts, each for every executable of which I want to check the filesize. So the scripts in the folder will return a string containing the link to the executable, which will be fed to the main script.
The code:
$progdir = "C:\script\programms"
$items = Get-ChildItem -filter *.ps1 -Path $progdir
$webclient = New-Object System.Net.WebClient
$filesizes = get-content C:\updatechecker\programms\filesizes
if ($filesizes.length -ne $items.length) {
if ($filesizes.length -eq $null) {
Write-Host ("Building filesize database...") -nonewline
}
else {
Write-Host ("Rebuilding filesize database...") -nonewline
}
clear-content C:\programms\filesizes
for ($i=0; $i -le $items.length-1; $i++) {
$command = "c:\programms\" + $items[$i].name
$link = & $command
$webclient.OpenRead($link) | Out-Null
$filesize = $webclient.ResponseHeaders["Content-Length"]
$filesize >> C:\programms\filesizes
}
echo "Done."
}
else {
...
Question:
This for loop is the one I want to run in parallel. I need your advice on how to do this since I'm new to powershell. I tried to implement a few things I found but they didn't work correctly (took very long to finish, output errors, multiple entries of filesizes in my filesizes file). I suspect it's a synchronization issue and somehow I need to lock the critical parts. Isn't there anything like omp parallel for in powershell? :P
Any help or advice on how to achieve this would be appreciated :)
edit:
Get-Job | Remove-Job -Force
$progdir = "C:\programms"
$items = Get-ChildItem -filter *.ps1 -Path $progdir
$webclient = New-Object System.Net.WebClient
$filesizes = get-content C:\programms\filesizes
$jobWork = {
param ($MyInput)
$command = "c:\programms\" + $MyInput
$link = & $command
$webclient.OpenRead($link) | Out-Null
$filesize = $webclient.ResponseHeaders["Content-Length"]
$filesize >> C:\programms\filesizes
}
foreach ($item in $items) {
Start-Job -ScriptBlock $jobWork -ArgumentList $item.name | out-null
}
Get-Job | Wait-Job
Get-Job | Receive-Job | Out-GridView | out-null
echo "Done."
Edit 2: Used code I found here: http://ryan.witschger.net/?p=22
$mutex = new-object -TypeName System.Threading.Mutex -ArgumentList $false, "RandomGlobalMutexName";
$MaxThreads = 4
$SleepTimer = 500
$jobWork = {
param ($MyInput)
$webclient = New-Object System.Net.WebClient
$command = "c:\programms\" + $MyInput
$link = & $command
$webclient.OpenRead($link) | Out-Null
$result = $mutex.WaitOne();
$file = $webclient.ResponseHeaders["Content-Length"]
$file >> C:\programms\filesizes
$mutex.ReleaseMutex();
}
$progdir = "C:\programms"
$items = Get-ChildItem -filter *.ps1 -Path $progdir
$webclient = New-Object System.Net.WebClient
$filesizes = get-content C:\programms\filesizes
Get-Job | Remove-Job -Force
$i = 0
ForEach ($item in $items){
While ($(Get-Job -state running).count -ge $MaxThreads){
Start-Sleep -Milliseconds $SleepTimer
}
$i++
Start-Job -ScriptBlock $jobWork -ArgumentList $item.name | Out-Null
}
You can run each iteration of the loop in a background job, which is not the same as a separate thread in that it is a whole other PowerShell.exe process. Data is passed from the background processes through serialization.
To approach it using background jobs you'll need to define a script block that will do that actual work and then call the script block with parameters in each iteration of the loop. The script block can report back status via Write-Output or by throwing an exception.
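Because each job is its own process, variables created in the calling script (such as a mutex or a WebClient object) are not visible inside the script block unless they are passed in as arguments. One way to sidestep file locking entirely is to have each job return its value with Write-Output and let the parent write the file once. The sketch below assumes that approach; the placeholder string result, the item names, and the temp-file name are hypothetical stand-ins for the real Content-Length lookup and the filesizes file.

```powershell
# Each job only computes and returns a value; it never touches the shared file
$jobWork = {
    param ($MyInput)
    Write-Output "size-of-$MyInput"   # placeholder for the real lookup
}

$jobs = foreach ($name in 'a.ps1', 'b.ps1') {
    Start-Job -ScriptBlock $jobWork -ArgumentList $name
}

# The parent collects every result and performs a single write
$results = @($jobs | Wait-Job | Receive-Job)
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'filesizes-demo.txt'
$results | Set-Content -Path $outFile
$jobs | Remove-Job
```

With only the parent writing, there is no concurrent access to the file and so no need for a mutex.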
You'll probably want to throttle how many concurrent background jobs are running. Here's an example of how to throttle:
$jobItems = "a", "b", "c", "d", "e"
$jobMax = 2
$jobs = @()
$jobWork = {
    param ($MyInput)
    if ($MyInput -eq "d") {
        throw "an example of an error"
    } else {
        write-output "Processed $MyInput"
    }
}
foreach ($jobItem in $jobItems) {
    # Wait for a free slot before starting, so that no item is skipped
    while (@($jobs | Where-Object { $_.State -eq 'Running' }).Count -ge $jobMax) {
        $jobs | Where-Object { $_.State -eq 'Running' } | Wait-Job -Any | Out-Null
    }
    $jobs += Start-Job -ScriptBlock $jobWork -ArgumentList $jobItem
}
$jobs | Wait-Job
As an alternative you might try eventing. Take a look at this thread for some examples of how to implement concurrency using events.
PowerShell: Runspace problem with DownloadFileAsync
You might be able to replace DownloadFileAsync with OpenReadAsync