PowerShell | How can I use multithreading for my file deleter PowerShell script?

So I've written a script to delete files in a specific folder once they are more than 5 days old. I'm currently running this against a directory with hundreds of thousands of files, and it's taking a lot of time.
This is currently my code:
# Variables
$path = "G:\AdeptiaSuite\AdeptiaSuite-6.9\AdeptiaServer\ServerKernel\web\repository\equalit\PFRepository"
$age = (Get-Date).AddDays(-5) # Defines the 'x days old' (today's date minus x days)

# Get all the files in the folder and subfolders | foreach file
Get-ChildItem $path -Recurse -File | foreach {
    # if creationtime is 'le' (less or equal) than $age
    if ($_.CreationTime -le $age) {
        Write-Output "Older than $age days - $($_.name)"
        Remove-Item $_.fullname -Force -Verbose # remove the item
    }
    else {
        Write-Output "Less than $age days old - $($_.name)"
    }
}
I've searched around the internet for some time now to find out how to use Runspaces, but I find it very confusing and I'm not sure how to implement it with this script. Could anyone please give me an example of how to use Runspaces for this code?
Thank you very much!
EDIT:
I've found this post: https://adamtheautomator.com/powershell-multithreading/
And ended up changing my script to this:
$Scriptblock = {
    # Variables
    $path = "G:\AdeptiaSuite\AdeptiaSuite-6.9\AdeptiaServer\ServerKernel\web\repository\equalit\PFRepository"
    $age = (Get-Date).AddDays(-5) # Defines the 'x days old' (today's date minus x days)

    # Get all the files in the folder and subfolders | foreach file
    Get-ChildItem $path -Recurse -File | foreach {
        # if creationtime is 'le' (less or equal) than $age
        if ($_.CreationTime -le $age) {
            Write-Output "Older than $age days - $($_.name)"
            Remove-Item $_.fullname -Force -Verbose # remove the item
        }
        else {
            Write-Output "Less than $age days old - $($_.name)"
        }
    }
}
$MaxThreads = 5
$RunspacePool = [runspacefactory]::CreateRunspacePool(1, $MaxThreads)
$RunspacePool.Open()
$Jobs = @()
1..10 | Foreach-Object {
    $PowerShell = [powershell]::Create()
    $PowerShell.RunspacePool = $RunspacePool
    $PowerShell.AddScript($ScriptBlock).AddArgument($_)
    $Jobs += $PowerShell.BeginInvoke()
}
while ($Jobs.IsCompleted -contains $false) {
    Start-Sleep 1
}
However, I'm not sure if this works correctly now. I don't get any errors, but the terminal doesn't do anything, so I'm not sure whether it works or just does nothing.
I'd love any feedback on this!

The easiest answer is: get PowerShell v7.2.5 (look in the assets for PowerShell-7.2.5-win-x64.zip), download and extract it. It's a no-install PowerShell 7 which has easy multithreading and lets you change foreach { to foreach -Parallel { (the foreach alias resolves to ForEach-Object, which gained a -Parallel switch in PowerShell 7). The executable is pwsh.exe.
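For reference, a minimal sketch of what the same loop could look like with ForEach-Object -Parallel in PowerShell 7+ (this is my illustration, not the OP's script; note the $using: prefix needed to reach variables from the calling scope, and the per-file logging is dropped here for brevity):
$path = "G:\AdeptiaSuite\AdeptiaSuite-6.9\AdeptiaServer\ServerKernel\web\repository\equalit\PFRepository"
$age = (Get-Date).AddDays(-5)
Get-ChildItem $path -Recurse -File | ForEach-Object -Parallel {
    # delete anything created on or before the cutoff; newer files are left alone
    if ($_.CreationTime -le $using:age) {
        Remove-Item $_.FullName -Force
    }
} -ThrottleLimit 5   # number of concurrent threads; 5 is also the default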
But, if it's severely overloading the server, running it several times will only make things worse, right? And I think the Get-ChildItem will be the slowest part, putting the most load on the server, and so doing the delete in parallel probably won't help.
I would first try changing the script to this shape:
$path = "G:\AdeptiaSuite\AdeptiaSuite-6.9\AdeptiaServer\ServerKernel\web\repository\equalit\PFRepository"
$age = (Get-Date).AddDays(-5)
$logOldFiles = [System.IO.StreamWriter]::new('c:\temp\log-oldfiles.txt')
$logNewFiles = [System.IO.StreamWriter]::new('c:\temp\log-newfiles.txt')
Get-ChildItem $path -Recurse -File | foreach {
if ($_.CreationTime -le $age){
$logOldFiles.WriteLine("Older than $age days - $($_.name)")
$_ # send file down pipeline to remove-item
}
else{
$logNewFiles.WriteLine("Less than $age days old - $($_.name)")
}
} | Remove-Item -Force
$logOldFiles.Close()
$logNewFiles.Close()
So it pipelines into remove-item and doesn't send hundreds of thousands of text lines to the console (also a slow thing to do).
If that doesn't help, I would switch to robocopy /L and maybe look at robocopy /L /MINAGE... to do the file listing, then process that to do the removal.
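A rough sketch of that robocopy approach (the switch combination here is my assumption, and note that /MINAGE filters on last-write time rather than CreationTime, so results can differ from the original script):
# list-only run (/L copies nothing): files at least 5 days old, printed as full paths
# the destination folder is a dummy that robocopy's syntax requires but never writes to
robocopy $path C:\temp\robocopy-dummy /L /S /MINAGE:5 /FP /NS /NC /NDL /NJH /NJS |
    Where-Object { $_.Trim() } |                              # drop blank lines
    ForEach-Object { Remove-Item $_.Trim() -Force -WhatIf }   # preview; remove -WhatIf to delete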
(I also removed the comments which just repeat the lines of code # removed comments which repeat what the code says.
The code already tells you what the code says # read the code to see what the code does. Comments should tell you why the code does things: who wrote the script and what business case it was solving, what the PFRepository is, why there is a 5-day cutoff, or whatever.)

Related

Filter items that are older than a specific date

I have a portion of a script that will currently provide me items that are up to a specific number of days old. I would like it instead to go back that many days and then get anything older than that date. How should I modify this to achieve that result?
If ($null -notlike $UpdatedSinceDays) {
    $filterDate = ("(LastUpdatedDateTime gt {0})" -f (Get-Date (Get-Date).AddDays($UpdatedSinceDays) -UFormat %y-%m-%dT00:00:00z))
    If ($null -eq $filterbuilder) {
        $filterbuilder = $filterDate
    }
    Else {
        Rest of filter statement
    }
}
$filterbuilder gets fed into $ParamCollection.Filter to add several filters to a command.
The Get-ChildItem cmdlet gets files from the specified directory, and with the -Recurse parameter it also gets files from all subfolders.
The Where-Object cmdlet then filters for files whose CreationTime is older than 15 days.
Example script to find all files older than 15 days in the C:\temp directory:
# Search Path
$Folder = "C:\temp\"
# Search using gci cmdlet
Get-ChildItem -Path $folder -Recurse | Where-Object { $_.CreationTime -lt (Get-Date).AddDays(-15) }
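If the goal is deletion rather than just a listing, the same pipeline can be extended with Remove-Item; the -File and -WhatIf switches below are my additions so the run can be previewed safely first:
# preview which files older than 15 days would be deleted; drop -WhatIf to delete for real
Get-ChildItem -Path $Folder -Recurse -File |
    Where-Object { $_.CreationTime -lt (Get-Date).AddDays(-15) } |
    Remove-Item -Force -WhatIf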

Performance difference between Runspace and Job

I built a file processor script to convert some files to JSON. It works, but not fast enough, so I am multithreading it. I prefer to use runspace pools since you can specify a max thread limit, and it will run that many threads at a time and add new work as other threads complete, spiffy. But I've found that if I have, say, 6 pieces of work to complete, using runspaces takes ~50 minutes and keeps my computer at 40% CPU, while just using Start-Job for each piece of work pegs my computer at 100% CPU and the work completes in 15 minutes. Am I misconfiguring the runspace pool in some way? Here are simplified examples of each:
### Using Start-Job ###
$files = Get-ChildItem C:\temp -Filter '*.xel' # returns 6 items
foreach ($file in $files) {
    #simplified
    Start-Job -ScriptBlock { C:\temp\FileProcessor.ps1 -filepath $using:file.fullname }
}
### Using Runspace Pool ###
$files = Get-ChildItem C:\temp -Filter '*.xel' # returns 6 items
$Code = {
    param ($filepath)
    #simplified
    C:\temp\FileProcessor.ps1 -filepath $filepath
}
$rsPool = [runspacefactory]::CreateRunspacePool(1,100)
$rsPool.Open()
$threads = @()
foreach ($file in $files) {
    $PSinstance = [powershell]::Create().AddScript($Code).AddArgument($file.FullName)
    $PSinstance.RunspacePool = $rsPool
    $threads += $PSinstance.BeginInvoke()
}
while ($threads.IsCompleted -contains $False) {}
$rsPool.Dispose()
I may also be misunderstanding runspaces compared to jobs, any help is welcome. Thank you!
Jobs use multiple processes...

using threads to delete files with specific extensions ignoring files over a certain date

In my profession I make forensic images from "foreign" PCs, which I later extract onto my local storage.
To clean up the data I want to delete all files that aren't relevant to me (not limited to: audio, movies, system files, ...).
Since we're speaking of multiple TB of data, I'd like to use threads, especially since my storage is all flash and the disk is less of a limitation.
To speed the process up after an initial manual run, I want the script to exclude files older than 1 day (since I have already handled those in the manual run).
What I have so far:
$IncludeFiles = "*.log", "*.sys", "*.avi", "*.mpg", "*.mkv", ".mp3", "*.mp4",
"*.mpeg", "*.mov", "*.dll", "*.mof", "*.mui", "*.zvv", "*.wma",
"*.wav", "*.MPA", "*.MID", "*.M4A", "*.AIF", "*.IFF", "*.M3U",
"*.3G2", "*.3GP", "*.ASF", "*.FLV", "*.M4V", "*.RM", "*.SWF",
"*.VOB"
$ScriptBlock = {
    Param($mypath = "D:\")
    Get-ChildItem -Path $mypath -Recurse -File -Include $file | Where-Object {
        $_.CreationTime -gt (Get-Date).AddDays(-1)
    }
}
foreach ($file in $IncludeFiles) {
    Start-Job -ScriptBlock $ScriptBlock -ArgumentList $file
}
Get-Job | Wait-Job
$out = Get-Job | Receive-Job
Write-Host $out
The only thing that doesn't work is the limitation that it should only look at files "younger" than 1 day. If I run the script without it, it seems to work perfectly (it gives me a list of files with the extensions I want to remove).
Parameter passing doesn't work the way you seem to expect. Param($mypath = "D:\") defines a parameter mypath with a default value of D:\. That default value is superseded by the value you pass into the scriptblock via -ArgumentList. Also the variable $file inside the scriptblock and the variable $file outside the scriptblock are not the same. Because of that an invocation
Start-Job -ScriptBlock $ScriptBlock -ArgumentList '*.log'
will run the command
Get-ChildItem -Path '*.log' -Recurse -File -Include $null | ...
Change your code to something like this to make it work:
$ScriptBlock = {
    Param($extension)
    $mypath = "D:\"
    Get-ChildItem -Path $mypath -Recurse -File -Filter $extension | Where-Object {
        $_.CreationTime -gt (Get-Date).AddDays(-1)
    }
}
foreach ($file in $IncludeFiles) {
    Start-Job -ScriptBlock $ScriptBlock -ArgumentList $file
}
Get-Job | Wait-Job | Receive-Job
Using -Filter should provide better performance than -Include, but it accepts only a single string (not a list of strings like -Include), so you can only filter one extension at a time.

Remove known Excel passwords with PowerShell

I have this PowerShell code that loops through Excel files in a specified directory; references a list of known passwords to find the correct one; and then opens, decrypts, and saves that file to a new directory.
But it's not executing as quickly as I'd like (it's part of a larger ETL process and it's a bottleneck). At this point I can remove the passwords faster manually as the script takes ~40 minutes to decrypt 40 workbooks while referencing a list of ~50 passwords.
Is there a cmdlet or function (or something) that's missing which would speed this up, an overlooked flaw in the processing, or is PowerShell, perhaps, just not the right tool for this job?
Original Code (updated code can be found below):
$ErrorActionPreference = "SilentlyContinue"
CLS
# Paths
$encrypted_path = "C:\PoShTest\Encrypted\"
$decrypted_Path = "C:\PoShTest\Decrypted\"
$original_Path = "C:\PoShTest\Originals\"
$password_Path = "C:\PoShTest\Passwords\Passwords.txt"
# Load Password Cache
$arrPasswords = Get-Content -Path $password_Path
# Load File List
$arrFiles = Get-ChildItem $encrypted_path
# Create counter to display progress
[int] $count = ($arrfiles.count - 1)
# Loop through each file
$arrFiles | % {
    $file = Get-Item -Path $_.fullname
    # Display current file
    Write-Host "Processing" $file.name -f "DarkYellow"
    Write-Host "Items remaining: " $count `n
    # Excel xlsx
    if ($file.Extension -eq ".xlsx") {
        # Loop through password cache
        $arrPasswords | % {
            $passwd = $_
            # New Excel Object
            $ExcelObj = $null
            $ExcelObj = New-Object -ComObject Excel.Application
            $ExcelObj.Visible = $false
            # Attempt to open file
            $Workbook = $ExcelObj.Workbooks.Open($file.fullname, 1, $false, 5, $passwd)
            $Workbook.Activate()
            # If password is correct - save new file without password to $decrypted_Path
            if ($Workbook.Worksheets.count -ne 0) {
                $Workbook.Password = $null
                $savePath = $decrypted_Path + $file.Name
                Write-Host "Decrypted: " $file.Name -f "DarkGreen"
                $Workbook.SaveAs($savePath)
                # Close document and application
                $ExcelObj.Workbooks.Close()
                $ExcelObj.Application.Quit()
                # Move original file to $original_Path
                Move-Item $file.fullname -Destination $original_Path -Force
            }
            else {
                # Close document and application
                Write-Host "PASSWORD NOT FOUND: " $file.name -f "Magenta"
                $ExcelObj.Close()
                $ExcelObj.Application.Quit()
            }
        }
    }
    $count--
    # Next file
}
Write-Host "`n Processing Complete" -f "Green"
Write-host "`n Processing Complete" -f "Green"
Updated code:
# Get current Excel process IDs so they are not affected by the script's cleanup
# SilentlyContinue in case there are no active Excels
$currentExcelProcessIDs = (Get-Process excel -ErrorAction SilentlyContinue).Id
$a = Get-Date
$ErrorActionPreference = "SilentlyContinue"
CLS
# Paths
$encrypted_path = "C:\PoShTest\Encrypted"
$decrypted_Path = "C:\PoShTest\Decrypted\"
$processed_Path = "C:\PoShTest\Processed\"
$password_Path = "C:\PoShTest\Passwords\Passwords.txt"
# Load Password Cache
$arrPasswords = Get-Content -Path $password_Path
# Load File List
$arrFiles = Get-ChildItem $encrypted_path
# Create counter to display progress
[int] $count = ($arrfiles.count - 1)
# New Excel Object
$ExcelObj = $null
$ExcelObj = New-Object -ComObject Excel.Application
$ExcelObj.Visible = $false
# Loop through each file
$arrFiles | % {
    $file = Get-Item -Path $_.fullname
    # Display current file
    Write-Host "`n Processing" $file.name -f "DarkYellow"
    Write-Host "`n Items remaining: " $count `n
    # Excel xlsx
    if ($file.Extension -like "*.xls*") {
        # Loop through password cache
        $arrPasswords | % {
            $passwd = $_
            # Attempt to open file
            $Workbook = $ExcelObj.Workbooks.Open($file.fullname, 1, $false, 5, $passwd)
            $Workbook.Activate()
            # If password is correct, remove $passwd from array and save new file without password to $decrypted_Path
            if ($Workbook.Worksheets.count -ne 0) {
                $Workbook.Password = $null
                $savePath = $decrypted_Path + $file.Name
                Write-Host "Decrypted: " $file.Name -f "DarkGreen"
                $Workbook.SaveAs($savePath)
                # Added to keep Excel process memory utilization in check
                $ExcelObj.Workbooks.Close()
                # Move original file to $processed_Path
                Move-Item $file.fullname -Destination $processed_Path -Force
            }
            else {
                # Close document
                $ExcelObj.Workbooks.Close()
            }
        }
    }
    $count--
    # Next file
}
# Close document and application
$ExcelObj.Workbooks.Close()
$ExcelObj.Application.Quit()
Write-Host "`nProcessing Complete!" -f "Green"
Write-Host "`nFiles w/o a matching password can be found in the Encrypted folder."
Write-Host "`nTime Started : " $a.ToShortTimeString()
Write-Host "Time Completed : " $(Get-Date).ToShortTimeString()
Write-Host "`nTotal Duration : "
New-TimeSpan -Start $a -End $(Get-Date)
# Remove any stale Excel processes created by this script's execution
Get-Process excel -ErrorAction SilentlyContinue | Where-Object { $currentExcelProcessIDs -notcontains $_.id } | Stop-Process
If nothing else, I do see one glaring performance issue that should be easy to address. You are opening a new Excel instance to test each individual password for each document. 40 workbooks with 50 passwords means you have opened 2000 Excel instances, one at a time.
You should be able to keep using the same one without a functionality hit. Get this code out of your innermost loop:
# New Excel Object
$ExcelObj = $null
$ExcelObj = New-Object -ComObject Excel.Application
$ExcelObj.Visible = $false
as well as the snippet that would close the process. It would need to be out of the loop as well.
$ExcelObj.Close()
$ExcelObj.Application.Quit()
If that does not help enough you would have to consider doing some sort of parallel processing with jobs etc. I have a basic solution in a CodeReview.SE answer of mine doing something similar.
Basically what it does is run several Excel instances at once, where each one works on a chunk of documents, which runs faster than one Excel instance doing them all. Just like in the linked answer, I caution against automating Excel COM with PowerShell: COM objects don't always get released properly and locks can be left on files or processes.
You are also looping over all 50 passwords regardless of success. That means you could find the right password on the first try but still go on to test the other 49! Set a flag in the loop to break out of that inner loop when that happens, as sketched below.
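A minimal sketch of that flag-and-skip pattern, reusing names from the script above ($decrypted is my own variable, not part of the original):
# inside the per-file loop: stop trying passwords once one has worked
$decrypted = $false
$arrPasswords | % {
    if ($decrypted) { return }   # 'return' in ForEach-Object just skips to the next password
    $passwd = $_
    # ... attempt to open the workbook with $passwd ...
    if ($Workbook.Worksheets.count -ne 0) {
        # ... save and move the file ...
        $decrypted = $true       # remaining passwords for this file are skipped
    }
}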
As far as the password logic goes you say that
At this point I can remove the passwords faster manually since the script takes ~40 minutes
Why can you do it faster? What do you know that the script does not? I don't see how you could outperform the script by hand while doing exactly what it does.
With what I see, another suggestion would be to keep track of successful passwords and the associated file names, so that when a file gets processed again you know the first password to try.
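Something along these lines could do the tracking (a sketch under my own assumptions; the PasswordCache.csv path and column names are illustrative, not from the original script):
# load previously successful file/password pairs into a lookup table
$cachePath = 'C:\PoShTest\Passwords\PasswordCache.csv'
$cache = @{}
if (Test-Path $cachePath) {
    Import-Csv $cachePath | ForEach-Object { $cache[$_.FileName] = $_.Password }
}
# per file: try the cached password first, then the rest of the list
$candidates = @()
if ($cache.ContainsKey($file.Name)) { $candidates += $cache[$file.Name] }
$candidates += $arrPasswords | Where-Object { $_ -ne $cache[$file.Name] }
# after a successful open, record which password worked for next time
# [pscustomobject]@{ FileName = $file.Name; Password = $passwd } |
#     Export-Csv $cachePath -Append -NoTypeInformation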
This solution uses the modules ImportExcel for easier working with Excel files, and PoshRSJob for multithreaded processing.
If you do not have these, install them by running:
Install-Module ImportExcel -scope CurrentUser
Install-Module PoshRSJob -scope CurrentUser
I've raised an issue on the ImportExcel module GitHub page where I've proposed a solution to open encrypted Excel files. The author may propose a better solution (and consider the impact across other functions in the module), but this works for me. For now, you'll need to make a modification to the Import-Excel function yourself:
Open: C:\Username\Documents\WindowsPowerShell\Modules\ImportExcel\2.4.0\ImportExcel.psm1 and scroll to the Import-Excel function. Replace:
[switch]$DataOnly
With
[switch]$DataOnly,
[String]$Password
Then replace the following line:
$xl = New-Object -TypeName OfficeOpenXml.ExcelPackage -ArgumentList $stream
With the code suggested here. This will let you call the Import-Excel function with a -Password parameter.
Next we need a function that repeatedly tries to open a single Excel file using a known set of passwords. Open a PowerShell window and paste in the following function (note: this function has a default output path defined, and also outputs passwords in the verbose stream - make sure no-one is looking over your shoulder, or just remove that if you'd prefer):
function Remove-ExcelEncryption
{
    [CmdletBinding()]
    Param
    (
        [Parameter(Mandatory=$true)]
        [String]
        $File,

        [Parameter(Mandatory=$false)]
        [String]
        $OutputPath = 'C:\PoShTest\Decrypted',

        [Parameter(Mandatory=$true)]
        [Array]
        $PasswordArray
    )
    $filename = Split-Path -Path $file -Leaf

    foreach ($Password in $PasswordArray)
    {
        Write-Verbose "Attempting to open $file with password: $Password"
        try
        {
            $ExcelData = Import-Excel -Path $file -Password $Password -ErrorAction Stop
            Write-Verbose "Successfully opened file."
        }
        catch
        {
            Write-Verbose "Failed with error $($Error[0].Exception.Message)"
            continue
        }

        try
        {
            $null = $ExcelData | Export-Excel -Path $OutputPath\$filename
            return "Success"
        }
        catch
        {
            Write-Warning "Could not save to $OutputPath\$filename"
        }
    }
}
Finally, we can run code to do the work:
$Start = Get-Date
$PasswordArray = @('dj7F9vsm','kDZq737b','wrzCgTWk','DqP2KtZ4')
$files = Get-ChildItem -Path 'C:\PoShTest\Encrypted'
$files | Start-RSJob -Name {$_.Name} -ScriptBlock {
    Remove-ExcelEncryption -File $_.Fullname -PasswordArray $Using:PasswordArray -Verbose
} -FunctionsToLoad Remove-ExcelEncryption -ModulesToImport ImportExcel | Wait-RSJob | Receive-RSJob
$end = Get-Date
New-TimeSpan -Start $Start -End $end
For me, if the correct password is first in the list it runs in 13 seconds against 128 Excel files. If I call the function in a standard foreach loop, it takes 27 seconds.
To view which files were successfully converted we can inspect the output property on the RSJob objects (this is the output of the Remove-ExcelEncryption function where I've told it to return "Success"):
Get-RSJob | Select-Object -Property Name,Output
Hope that helps.

Powershell - Optimizing a very, very large csv and text file search and replace

I have a directory with ~ 3000 text files in it, and I'm doing periodic search and replaces on those text files as I transition a program to a new server.
Each text file may have an average of ~3000 lines, and I need to search the files for maybe 300 - 1000 terms at a time.
I'm replacing the server prefix, which is related to the string I'm searching for. So for every one of the CSV entries, I'm looking for Search_String and \\Old_Server\Search_String, and making sure that after the program completes, the result is \\New_Server\Search_String.
I cobbled together a PowerShell program, and it works. But it's so slow I've never seen it complete.
Any suggestions for making it faster?
EDIT 1:
I changed Get-Content as suggested, but it still took 3 minutes to search two files (~8000 lines) for 9 separate search terms. I must still be screwing up; a Notepad++ search and replace would still be way faster if done manually 9 times.
I'm not sure how to get rid of the first (Get-Content) because I want to make a copy of the file for backup before I make any changes to it.
EDIT 2:
So this is an order of magnitude faster; it's searching a file in maybe 10 seconds. But now it doesn't write changes to files, and it only searches the first file in the directory! I didn't change that code, so I don't know why it broke.
EDIT 3:
Success! I adapted a solution posted below to make it much, much faster. It's searching each file in a couple of seconds now. I may reverse the loop order, so that it loads the file into the array and then searches and replaces each entry in the CSV rather than the other way around. I'll post that if I get it to work.
Final script is below for reference.
# Get input from the user
$old = Read-Host 'Enter the old cimplicity qualifier (F24, IRF3 etc)'
$new = Read-Host 'Enter the new cimplicity qualifier (CB3, F24_2 etc)'

$DirName = Get-Date -Format "yyyy_MM_dd_hh_mm"
New-Item -ItemType directory -Path $DirName -Force
New-Item "$DirName\log.txt" -ItemType file -Force -Value "`nMatched CTX files on $dirname`n"
$logfile = "$DirName\log.txt"
$VerbosePreference = "SilentlyContinue"

$points = Import-Csv SearchAndReplace.csv -Header find # Import CSV file
#$ctxfiles = Get-ChildItem . -include *.ctx | select -expand fullname # Import local directory of CTX files

$points | ForEach-Object { # For each row of points in the CSV file
    $findvar = $_.find # Store column 1 as the string to search for
    $OldQualifiedPoint = "\\\\" + $old + "\\" + $findvar # Escape each individual backslash so it's not read as regex
    $NewQualifiedPoint = "\\" + $new + "\" + $findvar # Escape slashes are NOT required in the replacement string
    $DuplicateNew = "\\\\" + $new + "\\" + "\\\\" + $new + "\\"
    $QualifiedNew = "\\" + $new + "\"

    dir . *.ctx | # Grab all CTX files
    select -expand fullname | # grab all of those file names and...
    foreach { # iterate through each file
        $DateTime = Get-Date -Format "hh:mm:ss"
        $FileName = $_
        Write-Host "$DateTime - $FindVar - Checking $FileName"
        $FileCopied = 0
        # Check file contents, and copy matching files to the newly created directory
        If (Select-String -Path $_ -Pattern $findvar -Quiet) {
            If (!($FileCopied)) {
                Copy $FileName -Destination $DirName
                $FileCopied = 1
                Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
                Write-Host "$DateTime - Found $Findvar in $filename"
            }
            $FileContent = Get-Content $Filename -ReadCount 0
            $FileContent = $FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
            $FileContent | Set-Content $FileName
        }
    }
    $File.Dispose()
}
If I'm reading this correctly, you should be able to read a 3000 line file into memory, and do those replaces as an array operation, eliminating the need to iterate through each line. You can also chain those replace operations into a single command.
dir . *.ctx | # Grab all CTX files
select -expand fullname | # grab all of those file names and...
foreach { # iterate through each file
    $DateTime = Get-Date -Format "hh:mm:ss"
    $FileName = $_
    Write-Host "$DateTime - $FindVar - Checking $FileName"
    # Check file contents, and copy matching files to the newly created directory
    If (Select-String -Path $_ -Pattern $findvar -Quiet) {
        Copy $FileName -Destination $DirName
        Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
        Write-Host "$DateTime - Found $Findvar in $filename"
        $FileContent = Get-Content $Filename -ReadCount 0
        $FileContent = $FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
        $FileContent | Set-Content $FileName
    }
}
On another note, Select-String will take the filepath as an argument, so you don't have to do a Get-Content and then pipe that to Select-String.
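For illustration, the two forms look like this (using the variable names from the script above):
# slower: read the whole file into memory, then search the resulting strings
Get-Content $FileName | Select-String -Pattern $findvar -Quiet
# faster: let Select-String open the file itself
Select-String -Path $FileName -Pattern $findvar -Quiet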
Yes, you can make it much faster by not using Get-Content... use a StreamReader instead.
$file = New-Object System.IO.StreamReader -Arg "test.txt"
while (($line = $file.ReadLine()) -ne $null) {
    # $line has your line
}
$file.Dispose()
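Building on that, a rough sketch of doing the search and replace while streaming, writing to a temporary file and swapping it in afterwards (the temp-file approach and the .tmp name are my assumptions layered on the script above, not part of the original answer):
$reader = New-Object System.IO.StreamReader -Arg $FileName
$writer = New-Object System.IO.StreamWriter -Arg "$FileName.tmp"   # temp output next to the original
while (($line = $reader.ReadLine()) -ne $null) {
    # apply the same chained -replace operations as in the script above
    $writer.WriteLine(($line -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew))
}
$reader.Dispose()
$writer.Dispose()
Move-Item "$FileName.tmp" $FileName -Force   # swap the rewritten file into place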
I wanted to use PowerShell for this and created a script like the one below:
$filepath = "input.csv"
$newfilepath = "input_fixed.csv"
filter num2x { $_ -replace "aaa","bbb" }
measure-command {
    Get-Content -ReadCount 1000 $filepath | num2x | Add-Content $newfilepath
}
It took 19 minutes on my laptop to process a 6.5 GB file. The code above reads the file in batches (using -ReadCount) and uses a filter, which should optimize performance.
But then I tried FART and it did the same thing in 3 minutes! Quite a difference!
