PowerShell - Optimizing a very, very large CSV and text file search and replace

I have a directory with ~ 3000 text files in it, and I'm doing periodic search and replaces on those text files as I transition a program to a new server.
Each text file may have an average of ~3000 lines, and I need to search the files for maybe 300 - 1000 terms at a time.
I'm replacing the server prefix attached to each string I'm searching for. So for every one of the CSV entries, I'm looking for either Search_String or \\Old_Server\Search_String, and making sure that after the program completes, the result is \\New_Server\Search_String.
I cobbled together a powershell program, and it works. But it's so slow I've never seen it complete.
Any suggestions for making it faster?
EDIT 1:
I changed get-content as suggested, but it still took 3 minutes to search two files (~8000 lines) for 9 separate search terms. I must still be screwing up; a notepad++ search and replace would still be way faster if done manually 9 times.
I'm not sure how to get rid of the first (Get-Content) because I want to make a copy of the file for backup before I make any changes to it.
EDIT 2:
So this is an order of magnitude faster; it's searching a file in maybe 10 seconds. But now it doesn't write changes to files, and it only searches the first file in the directory! I didn't change that code, so I don't know why it broke.
EDIT 3:
Success! I adapted a solution posted below to make it much, much faster. It's searching each file in a couple of seconds now. I may reverse the loop order, so that it loads the file into the array and then searches and replaces each entry in the CSV rather than the other way around. I'll post that if I get it to work.
Final script is below for reference.
#get input from the user
$old = Read-Host 'Enter the old cimplicity qualifier (F24, IRF3 etc)'
$new = Read-Host 'Enter the new cimplicity qualifier (CB3, F24_2 etc)'
$DirName = Get-Date -format "yyyy_MM_dd_hh_mm"
New-Item -ItemType directory -Path $DirName -force
New-Item "$DirName\log.txt" -ItemType file -force -Value "`nMatched CTX files on $dirname`n"
$logfile = "$DirName\log.txt"
$VerbosePreference = "SilentlyContinue"
$points = import-csv SearchAndReplace.csv -header find #Import CSV File
#$ctxfiles = Get-ChildItem . -include *.ctx | select -expand fullname #Import local directory of CTX Files
$points | foreach-object { #For each row of points in the CSV file
$findvar = $_.find #Store column 1 as string to search for
$OldQualifiedPoint = "\\\\"+$old+"\\" + $findvar #Double each individual backslash so the regex engine reads it as a literal backslash
$NewQualifiedPoint = "\\"+$new+"\" + $findvar #escape slashes are NOT required on the new string
$DuplicateNew = "\\\\" + $new + "\\" + "\\\\" + $new + "\\"
$QualifiedNew = "\\" + $new + "\"
dir . *.ctx | #Grab all CTX Files
select -expand fullname | #grab all of those file names and...
foreach {#iterate through each file
$DateTime = Get-Date -Format "hh:mm:ss"
$FileName = $_
Write-Host "$DateTime - $FindVar - Checking $FileName"
$FileCopied = 0
#Check file contents, and copy matching files to newly created directory
If (Select-String -Path $_ -Pattern $findvar -Quiet ) {
If (!($FileCopied)) {
Copy $FileName -Destination $DirName
$FileCopied = 1
Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
Write-Host "$DateTime - Found $Findvar in $filename"
}
$FileContent = Get-Content $Filename -ReadCount 0
$FileContent =
$FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
$FileContent | Set-Content $FileName
}
}
}
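
A sketch of the loop reversal mentioned in EDIT 3: read each file once, apply every CSV term to the text in memory, and write the file back only if something changed. Update-QualifiedPoints and Update-CtxFiles are hypothetical helper names, not part of the script above. The sketch assumes PowerShell 5 or later (for Get-Content -Raw and Set-Content -NoNewline) and, unlike the script above, escapes each term with [regex]::Escape. The three chained replaces are folded into a single pattern with an optional \\Old\ or \\New\ prefix, so a point that is already converted is left unchanged.

```powershell
# Hypothetical helper: apply every search term to one block of text in memory.
function Update-QualifiedPoints {
    param(
        [string]   $Text,    # whole file content as one string
        [string[]] $Terms,   # search strings from the CSV
        [string]   $Old,     # old qualifier, e.g. F24
        [string]   $New      # new qualifier, e.g. CB3
    )
    $o = [regex]::Escape($Old)
    $n = [regex]::Escape($New)
    foreach ($term in $Terms) {
        if (-not $term) { continue }   # skip blank CSV rows
        $t = [regex]::Escape($term)
        # The term, optionally preceded by \\Old\ or \\New\
        $pattern = "(?:\\\\(?:$o|$n)\\)?$t"
        # '$' is special in a regex replacement string, so double it
        $replacement = ("\\$New\$term").Replace('$', '$$')
        $Text = $Text -replace $pattern, $replacement
    }
    $Text
}

# Hypothetical driver: back up and rewrite only the files that actually change.
function Update-CtxFiles {
    param(
        [string]   $Directory,   # folder holding the .ctx files
        [string]   $BackupDir,   # existing folder that receives the backups
        [string[]] $Terms,
        [string]   $Old,
        [string]   $New
    )
    Get-ChildItem -Path $Directory -Filter *.ctx | ForEach-Object {
        $text = Get-Content -Path $_.FullName -Raw
        if (-not $text) { return }   # empty file: move on to the next one
        $updated = Update-QualifiedPoints -Text $text -Terms $Terms -Old $Old -New $New
        if ($updated -cne $text) {
            Copy-Item -Path $_.FullName -Destination $BackupDir
            Set-Content -Path $_.FullName -Value $updated -NoNewline
        }
    }
}
```

With the CSV loaded as (Import-Csv SearchAndReplace.csv -Header find).find, each file is read and written at most once no matter how many terms there are. Like the script above, this still matches a term that is a substring of a longer point name, so it is worth checking the CSV for such overlaps first.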

If I'm reading this correctly, you should be able to read a 3000 line file into memory, and do those replaces as an array operation, eliminating the need to iterate through each line. You can also chain those replace operations into a single command.
dir . *.ctx | #Grab all CTX Files
select -expand fullname | #grab all of those file names and...
foreach {#iterate through each file
$DateTime = Get-Date -Format "hh:mm:ss"
$FileName = $_
Write-Host "$DateTime - $FindVar - Checking $FileName"
#Check file contents, and copy matching files to newly created directory
If (Select-String -Path $_ -Pattern $findvar -Quiet ) {
Copy $FileName -Destination $DirName
Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
Write-Host "$DateTime - Found $Findvar in $filename"
$FileContent = Get-Content $Filename -ReadCount 0
$FileContent =
$FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
$FileContent | Set-Content $FileName
}
}
On another note, Select-String will take the filepath as an argument, so you don't have to do a Get-Content and then pipe that to Select-String.

Yes, you can make it much faster by not using Get-Content. Use a StreamReader instead.
$file = New-Object System.IO.StreamReader -Arg "test.txt"
while (($line = $file.ReadLine()) -ne $null) {
# $line has your line
}
$file.dispose()
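
To turn that reading loop into a search and replace, the lines also have to be written somewhere. A minimal sketch, assuming the output goes to a second file through a System.IO.StreamWriter (an addition that is not in the snippet above); Convert-FileByLine and its parameter names are hypothetical.

```powershell
# Hypothetical helper: stream a file line by line, writing replaced lines to a new file.
# Pass full paths: .NET resolves relative paths against the process working
# directory, which is not necessarily PowerShell's current location.
function Convert-FileByLine {
    param(
        [string]$Path,          # full path of the input file
        [string]$Destination,   # full path of the output file (must differ from $Path)
        [string]$Pattern,       # regex to search for
        [string]$Replacement    # replacement text
    )
    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader -ArgumentList $Path
        $writer = New-Object System.IO.StreamWriter -ArgumentList $Destination
        while ($null -ne ($line = $reader.ReadLine())) {
            $writer.WriteLine(($line -replace $Pattern, $Replacement))
        }
    }
    finally {
        # Dispose the writer first so buffered output is flushed to disk
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

Only one line is held in memory at a time, which matters more for multi-gigabyte files than for 3000-line ones.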

I wanted to use PowerShell for this and created a script like the one below:
$filepath = "input.csv"
$newfilepath = "input_fixed.csv"
filter num2x { $_ -replace "aaa","bbb" }
measure-command {
Get-Content -ReadCount 1000 $filepath | num2x | add-content $newfilepath
}
It took 19 minutes on my laptop to process a 6.5 GB file. The code above reads the file in batches (using -ReadCount) and uses a filter, which should optimize performance.
But then I tried FART (the Find And Replace Text command-line utility) and it did the same thing in 3 minutes! Quite a difference!

Related

Powershell: Replace string in File1 based on string in File2

I am being forced to use Powershell because of my work. I have used it to do a couple of things but one of my codes is now trash because I have to update a string in a file to include a year that is in a second file. Here is what I'm working with:
File1: Contains a few strings, but among them are 48 strings that say:
Jenga_Sequence-XXXX.consensus_Bob_0.6_quality_20
The main point of the string is Sequence-XXXX, sorry for the random place holders.
File2: is a table that has the strings:
John/USA/Sequence-XXXX/Year
I need to replace the strings in File1 with the corresponding Strings in File2.
Sample Text of File1:
Jenga_Sequence-0001.consensus_Bob_0.6_quality_20
AAAAAAAAAAAAAAAAAAAAAAAAA
Jenga_Sequence-0002.consensus_Bob_0.6_quality_20
aaaaaaaaaaaaaaaaaaaaaaaaa
Jenga_Sequence-0003.consensus_Bob_0.6_quality_20
bbbbbbbbbbbbbbbbbbbbbbbbb
Jenga_Sequence-0004.consensus_Bob_0.6_quality_20
BBBBBBBBBBBBBBBBBBBBBBBBB
Jenga_Sequence-0005.consensus_Bob_0.6_quality_20
QQQQQQQQQQQQQQQQQQQQQ
Sample Table of File2:
|Sequence_ID|Date|
|---------------------------|----------|
|John/USA/Sequence-0003/2020|10/11/2020|
|John/USA/Sequence-0001/2021|1/5/2021|
|John/USA/Sequence-0005/2021|1/10/2021|
|John/USA/Sequence-0004/2020|12/23/2020|
|John/USA/Sequence-0002/2021|1/6/2021|
So, I need a Powershell code that replaces
Jenga_Sequence-0001.consensus_Bob_0.6_quality_20 with John/USA/Sequence-0001/2021,
Jenga_Sequence-0002.consensus_Bob_0.6_quality_20 with John/USA/Sequence-0002/2021,
Jenga_Sequence-0003.consensus_Bob_0.6_quality_20 with John/USA/Sequence-0003/2020, and so on. There are typically 48 of these in a file.
My previous code simply replaced "Jenga_" with "John/USA/" and ".consensus_Bob_0.6_quality_20" with "/2020" but now that we are seeing "/2021" the static code will not work.
I am still open to replacing pieces of the string and having a code that sets the year replacement to the correct year.
That was the angle I was doing a broad search on but I could never find anything specific enough to help.
Any help will be appreciated!
EDIT: Here is the part of my previous code that dealt with the finding and replacing, even though I feel it needs to be trashed:
$filePath = 'Jenga_Combined.txt'
$tempFilePath = "$env:TEMP\$($filePath | Split-Path -Leaf)"
$find = 'Jenga_'
$replace = 'John/USA/'
$find2 = '.consensus_Bob_0.6_quality_20'
$replace2 = '/2020'
(Get-Content -Path $filePath) -replace $find, $replace -replace $find2, $replace2 | Add-Content -Path $tempFilePath
Remove-Item -Path $filePath
Move-Item -Path $tempFilePath -Destination $filePath
EDIT2: The "Real Data" from file2. File2 is a Tab Delimited .txt file which makes it not "look great" when copy and pasting. Hopefully this helps. File1 is exactly like above (although the AAAAA stuff is roughly 30,000 letters long)
Sequence_ID date
John/USA/Sequence-0003/2020 2020-10-11
John/USA/Sequence-0001/2021 2021-01-05
John/USA/Sequence-0005/2021 2021-01-10
John/USA/Sequence-0004/2020 2020-12-23
John/USA/Sequence-0002/2021 2021-01-06
Dan
The common factor here is the Sequence_ID number in both files.
You can do this like:
$csvData = Import-Csv -Path 'D:\Test\File2.txt' -Delimiter "`t"
$result = switch -Regex -File 'D:\Test\Jenga_Combined.txt' {
'^Jenga_Sequence-(\d+).*' {
$replace = $csvData | Where-Object { $_.Sequence_ID -like "*Sequence-$($matches[1])*" }
if (!$replace) { Write-Warning "No corresponding Sequence_ID $($matches[1]) found!"; $_ }
else { $replace.Sequence_ID }
}
default { $_ }
}
# output on screen
$result
# output to new file
$result | Set-Content -Path 'D:\Test\Jenga_Combined_NEW.txt' -Force
Output on screen:
John/USA/Sequence-0001/2021
AAAAAAAAAAAAAAAAAAAAAAAAA
John/USA/Sequence-0002/2021
aaaaaaaaaaaaaaaaaaaaaaaaa
John/USA/Sequence-0003/2020
bbbbbbbbbbbbbbbbbbbbbbbbb
John/USA/Sequence-0004/2020
BBBBBBBBBBBBBBBBBBBBBBBBB
John/USA/Sequence-0005/2021
QQQQQQQQQQQQQQQQQQQQQ
Of course, you need to change the file paths to match your environment.
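
The Where-Object filter above scans the whole table once per header line, which is fine for 48 entries. If the table grows, one possible variation is to index it in a hashtable keyed on the sequence number first. This is only a sketch working on in-memory arrays; Convert-SequenceHeaders is a hypothetical name, and it assumes every Sequence_ID contains "Sequence-" followed by digits, as in the sample data.

```powershell
# Hypothetical helper: swap Jenga_Sequence-NNNN header lines for the matching Sequence_ID.
function Convert-SequenceHeaders {
    param(
        [string[]]$Lines,        # lines of File1
        [string[]]$SequenceIds   # Sequence_ID column of File2
    )
    # Build the lookup once: '0001' -> 'John/USA/Sequence-0001/2021'
    $lookup = @{}
    foreach ($id in $SequenceIds) {
        if ($id -match 'Sequence-(\d+)') { $lookup[$Matches[1]] = $id }
    }
    foreach ($line in $Lines) {
        if ($line -match '^Jenga_Sequence-(\d+)' -and $lookup.ContainsKey($Matches[1])) {
            $lookup[$Matches[1]]   # header line with a known ID: emit the replacement
        }
        else {
            $line                  # anything else passes through unchanged
        }
    }
}
```

It could be fed with (Import-Csv -Path 'D:\Test\File2.txt' -Delimiter "`t").Sequence_ID and Get-Content on File1. Header lines without a matching ID are passed through unchanged rather than raising a warning.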

Automate creating configurations from template and excel

I'm having trouble in automation of a configuration.
I have a template of a configuration and need to change every hostname (marked as YYY) and IP (marked as XXX; only the 3rd octet needs replacing) according to a list of Excel values.
I have a list of 100 different sites and IPs, and I want 100 different configurations, one per site.
A friend suggested the following PowerShell code, but it doesn't create any files:
$replaceValues = Import-Csv -Path "\\ExcelFile.csv"
$file = "\\Template.txt"
$contents = Get-Content -Path $file
foreach ($replaceValue in $replaceValues)
{
$contents = $contents -replace "YYY", $replaceValue.hostname
$contents = $contents -replace "XXX", $replaceValue.site
Copy-Item $file "$($file.$replaceValue.hostname)"
Set-Content -Path "$($file.$replaceValue.hostname)" -Value $contents
echo "$($file.$replaceValue.hostname)"
}
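Your code overwrites the same $contents variable on every pass through the loop, so once the values are replaced in the first iteration, there are no YYY or XXX placeholders left to replace. The file name expression "$($file.$replaceValue.hostname)" is a second problem: it looks up a property on the $file string instead of building a path, so it evaluates to an empty string and no file is written.
You need to keep the template text intact and create a new copy from the template inside the loop. That copy can then be altered the way you want. Every following iteration will then start off with a fresh copy of the template.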
There is no need to first copy the template text to a new location and then overwrite this file with the new contents. Set-Content is happy to create a new file for you if it does not already exist.
Try
$replaceValues = Import-Csv -Path 'D:\Test\Values.csv'
$template = Get-Content -Path 'D:\Test\Template.txt'
foreach ($item in $replaceValues) {
$content = $template -replace 'YYY', $item.hostname -replace 'XXX', $item.site
$newFile = Join-Path -Path 'D:\Test' -ChildPath ('{0}.txt' -f $item.hostname)
Write-Host "Creating file '$newFile'"
$content | Set-Content -Path $newFile
}

PowerShell does not replace string although you can see it in cmd

I normally find the answer to my problem by going through the site, but this time I have read every question yet still I am in despair and really need an experienced eye.
What I have is basically a structural health monitoring system. I measure strains and receive raw data. This raw data is processed by a MATLAB executable that I wrote myself and then uploaded to an ftp-server. We had a student that automated this with a PowerShell script which was working perfectly until I changed literally one small line in MATLAB and recompiled the code.
I do not understand much about PowerShell, so please be patient with me. The error I receive is you cannot call a method on a null-valued expression. This occurs when I try to replace a set of strings (just called xxx_xxx) with a date that exists as a variable in PowerShell. I can see xxx_xxx in the command window (see attached image), I can print out the date that I want to use as replacement, but somehow it does not work.
I cannot provide a working code snippet because you would need the DAQ to generate data, and as I said, I don't understand the language much. But below is the code. For easier reading, the line that I am receiving the error is the following:
$outData = $cmdOutput.Replace("xxx_xxx",$snaps[$i].Substring(6,4)+"-"+$snaps[$i].Substring(3,2)+"-"+$snaps[$i].Substring(0,2)+" "+$snaps[$i].Substring(11,8)+";")
If anyone could help me with this, I would be eternally grateful!
$retry=3
while(1){
#$dir = "C:\Users\Petar\Documents\Zoo\PetarData\INPUT DATA\New folder\"
$dir = "C:\Users\Yunus\Documents\Micron Optics\ENLIGHT\Data\" + $(get-date -f yyyy) + "\" + $(get-date -f MM) + "\"
#$outdir = "C:\Users\Petar\Documents\Zoo\PetarData\OUTPUT DATA\New folder\"
$archivedirin = "C:\Users\Yunus\Documents\Elefantenhaus\Archive\IN\"
$archivedirout = "C:\Users\Yunus\Documents\Elefantenhaus\Archive\OUT\"
$tempdir = "C:\Users\Yunus\Documents\Elefantenhaus\Archive\TEMP\"
$prefix = "EHZZ";
$filecount=(Get-ChildItem $dir).Count
$latest = Get-ChildItem -Path $dir | Sort-Object LastAccessTime -Descending | Select-Object -First 1
if($filecount -gt 1){
$exclude = $latest.name
$Files = GCI -path $dir | Where-object {$_.name -ne $exclude}
$dest = $archivedirin + "batch_"+$(get-date -f MM-dd-yyyy_HH_mm_ss)+"\"
new-item -type directory $dest
foreach ($file in $Files){move-item -path ($dir+$file) -destination $dest}
$latest = Get-ChildItem -Path $dest | Sort-Object LastAccessTime -Descending | Select-Object -First 1
$filename = $dest + $latest.name
$s=Get-Content $filename
while($s -eq $null){
if($retry -lt 0){break}
write-host "could not read file"
$retry = $retry -1
$s=Get-Content $filename
}
#read content of input file
$snaps = $s
#loop through the lines in the file until the first occurrence of a timestamp, that is our desired line
for ($i = 0; $i -lt $snaps.length; $i++)
{
$ismatch =[regex]::Matches($snaps[$i], '^(\d\d.\d\d.\d\d\d\d\s\d\d+)')
if ( $ismatch -ne $null -and $ismatch[0].Groups[1].Value)
{
$temp=Get-Content $filename | select -skip $i
$filenametemp = $tempdir+"\temp.txt" #temp file path, don't change the filename "temp.txt"
#$filename3 = $tempdir+"\test.txt"
Add-Content $filenametemp $temp
$filename = $archivedirout+$prefix+"_"+$snaps[$i].Substring(8,2)+$snaps[$i].Substring(3,2)+$snaps[$i].Substring(0,2)+"_"+$snaps[$i].Substring(11,2)+$snaps[$i].Substring(14,2)+$snaps[$i].Substring(17,2)+".txt"
$cmdOutput = (cmd /c new_modified.exe $tempdir) | Out-String
write-output $cmdOutput #"$cmdOutput is:"
#IF ([string]::IsNullOrWhitespace($cmdOutput)){
# break
#}
$outData = $cmdOutput.Replace("xxx_xxx",$snaps[$i].Substring(6,4)+"-"+$snaps[$i].Substring(3,2)+"-"+$snaps[$i].Substring(0,2)+" "+$snaps[$i].Substring(11,8)+";")
Add-Content $filename $outData
remove-item -path $filenametemp
break
}
}
#break
}
else
{
write-host "waiting for file"
}
Start-Sleep -s 30
}
I think what is happening is that the output of the external program isn't being piped into a variable correctly. I haven't had a chance to test this but Tee-Object looks like the appropriate method for you.
I would suggest you try replacing...
$cmdOutput = (cmd /c new_modified.exe $tempdir) | Out-String
with the following (note that -Variable takes the variable name without the leading $)...
cmd /c new_modified.exe $tempdir | Tee-Object -Variable cmdOutput
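
A minimal, self-contained sketch of capturing output with Tee-Object and guarding the Replace call against empty output. Get-FakeProgramOutput is a hypothetical stand-in for new_modified.exe, and the timestamp line is made-up sample data in the dd.MM.yyyy HH:mm:ss layout that the Substring offsets in the question imply.

```powershell
# Stand-in for the external program: two lines carrying the placeholder.
# (Assumption: the real output of new_modified.exe is plain text lines.)
function Get-FakeProgramOutput { 'xxx_xxx 0.12'; 'xxx_xxx 0.15' }

# -Variable takes the variable NAME, without a leading '$'
Get-FakeProgramOutput | Tee-Object -Variable cmdOutput | Out-Null
$text = $cmdOutput | Out-String

# Hypothetical timestamp line, rearranged to yyyy-MM-dd HH:mm:ss; as in the question
$snap  = '02.03.2015 10:11:12'
$stamp = $snap.Substring(6,4) + '-' + $snap.Substring(3,2) + '-' +
         $snap.Substring(0,2) + ' ' + $snap.Substring(11,8) + ';'

if ([string]::IsNullOrWhiteSpace($text)) {
    # Nothing captured: skip the replace instead of failing later
    Write-Warning 'The program produced no output; nothing to replace.'
    $outData = $null
}
else {
    $outData = $text.Replace('xxx_xxx', $stamp)
}
```

Dropping the Out-Null lets the output still appear on screen while it is being captured, which is the point of Tee-Object.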

Rename-Item : Object reference not set to an instance of an object

I have a script that I've been working on, which reads a specified directory, locates multiple .CSV files, and executes some logic for each of the .CSV files and ultimately renames them .csv.archived . I'm trying to handle this as cleanly as possible, but I am making a mess.
The issue at hand, is that I cannot seem to figure out how to pass the individual file names through to strings for purposes of renaming the existing file. The process loops through fine, and the files ultimately get renamed but I get the following error:
#set the location where the .CSV files will be pulled from
$Filecsv = get-childitem "\\SERVER\Audit Test\" -recurse | where {$_.extension -eq ".csv"} | % {
$_.Name
}
In the code above, my thoughts are that $_.Name is where I (believe I) am pulling the file name of each file. At the end of this next block, the file is renamed with the file name.
#for each file found in the directory
ForEach ($item in $Filecsv) {
#count the times we've looped through
"Iterations : " + $iterations
# get the date and time from the system
$datetime = get-date -f MMddyy-hhmmtt
# rename the file
rename-item -path ("\\SERVER\Audit Test\" + $_.Name ) -newname ($filename + $datetime + ".csv.archived")
$iterations ++
}
I think the process is fubar'd here:
rename-item -path ("\\SERVER\Audit Test\" + $_.Name )
I've gutted the irrelevant code for testing purposes, and would be happy if someone could tell me that I am doing something wrong, and that I am not crazy.
I am not sure that I properly understand the way that the ForEach loop works, TechNet helps: http://social.technet.microsoft.com/Forums/en-US/e8da8249-ea91-4772-ae85-582a4b37425b/powershell-foreachobject-vs-foreach?forum=smallbusinessserver
But doesn't answer my particular question.
Anyone care to shed some light?
Thanks! Here's the full script:
$iterations = 1
#set the location where the .CSV files will be pulled from
$Filecsv = get-childitem "\\SERVER\Audit Test\" -recurse | where {$_.extension -eq ".csv"} | % {
$_.Name
}
#check to see if files exist, if not exit cleanly
#for each file found in the directory
ForEach ($item in $Filecsv) {
#count the times we've looped through
"Iterations : " + $iterations
# get the date and time from the system
$datetime = get-date -f MMddyy-hhmmtt
# rename the file
rename-item -path ("\\SERVER\Audit Test\" + $_.Name ) -newname ($_.Name + $datetime + ".csv.archived")
$iterations ++
}
If it were me I'd work with the files as objects and not just a string for their name. So I'd do something like this:
#set the location where the .CSV files will be pulled from
$Filecsv = get-childitem "\\SERVER\Audit Test\" -recurse | where {$_.extension -eq ".csv"}
#for each file found in the directory
ForEach ($item in $Filecsv) {
#count the times we've looped through
"Iterations : $iterations"
# get the date and time from the system
$datetime = get-date -f MMddyy-hhmmtt
# rename the file
$NewName = $item.fullname -replace '\.csv$',"$datetime.csv.archived"
$Item.MoveTo($NewName)
$iterations ++
}
That takes each file, sets up a new name by replacing the .csv at the end with the $datetime.csv.archived that you want, and then moves the file to the new name effectively renaming it.
Also, if you want the Name property for each item instead of doing a ForEach{$_.Name} you are probably better off doing Select -Expand Name
This works for me, I changed $_.Name to $item (which is a string at this point so doesn't have the name property) and it works
$iterations = 1
#set the location where the .CSV files will be pulled from
$Filecsv = get-childitem "c:\AuditTest\" -recurse | where {$_.extension -eq ".csv"} | % {
$_.Name
}
#check to see if files exist, if not exit cleanly
#for each file found in the directory
ForEach ($item in $Filecsv)
{
#count the times we've looped through
"Iterations : " + $iterations
# get the date and time from the system
$datetime = get-date -f MMddyy-hhmmtt
# rename the file
rename-item -path ("c:\AuditTest\" + $item ) -newname ("c:\AuditTest\" + $item + $datetime + ".csv.archived")
$iterations ++
}
Depending on exactly what you're doing though I'd change it like this
$Filecsv = get-childitem "c:\AuditTest\" -recurse | where {$_.extension -eq ".csv"}
This will give you the actual file object rather than a string which gives you more options in terms of parsing the names etc.
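
Once the pipeline carries file objects, Rename-Item can take them directly, with a script block computing each new name. A sketch under that assumption; Rename-CsvToArchived is a hypothetical name, and the timestamp format is the one used in the question.

```powershell
# Hypothetical helper: rename every .csv under $Root to <name><timestamp>.csv.archived
function Rename-CsvToArchived {
    param([string]$Root)   # folder to search, including subfolders
    $stamp = Get-Date -Format 'MMddyy-hhmmtt'
    # Parentheses make Get-ChildItem finish enumerating before any rename happens
    (Get-ChildItem -Path $Root -Recurse -Filter *.csv) |
        Where-Object { -not $_.PSIsContainer -and $_.Extension -eq '.csv' } |
        Rename-Item -NewName { $_.BaseName + $stamp + '.csv.archived' }
}
```

Because each object carries its own full path, this also works for files found in subfolders by -Recurse, where prepending a fixed folder to the bare name would point at the wrong location. Adding -WhatIf to the Rename-Item call previews the renames without performing them.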

Renaming files using strings from a txt file

The question might sound confusing but all I really need is the ability to change a file name using an array of strings.
For example:
File 1 contains:
abc1234cd.jpg
abc2543ac.jpg
...
File 2 contains (array/reference)
1234c
2543a
...
The new file name for abc1234cd.jpg should now be 1234c.jpg and so forth.
Is this possible with powershell or any other language to do?
Thanks,
This should do it, assuming the files have a one-to-one match.
# Get contents of file 1
$File1 = Get-Content -Path $PSScriptRoot\File1.txt;
# Get contents of file 2
$File2 = Get-Content -Path $PSScriptRoot\File2.txt;
# Iterate over each item in $File1
foreach ($Item in $File1) {
# File1 holds complete file names, so look each one up directly
$FileList = Get-ChildItem -Path "c:\test\$Item";
foreach ($File in $FileList) {
# Determine file's new name: the corresponding value in File2 plus the original extension
$NewName = $File2[$File1.IndexOf($Item)] + $File.Extension;
Write-Host -Object ('Old name: {0}, new name: {1}' -f $Item, $NewName);
}
}
After countless hours, I finally got the code to work.
$dir = 'file.txt'
$backup = 'C:\Users\all users\Desktop\backup'
$file = 'file2.txt'
$files = Import-Csv -Header Name -Path $file
foreach ($line in $files){
$linefiles = Get-ChildItem $dir | where {$_.BaseName.Contains($line.Name)}
$count = 0
foreach ($linefile in $linefiles) {
#do stuff to each file here
$name = $linefile.BaseName
$extension = $linefile.Extension
$newName = $line.Name
Copy-Item -Path "$dir\$linefile" -Destination "$backup\$linefile"
if ($count -gt 0){
Rename-Item -NewName "$newName-$count$extension" -Path "$dir\$linefile"}
else{
Rename-Item -NewName "$newName$extension" -Path "$dir\$linefile"}
$count++}
}
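
If the two lists are not guaranteed to line up one-to-one, another option is to match each file name against the reference strings by content. A sketch working purely on strings; Get-RenameMap is a hypothetical name, the match is a case-sensitive substring test, and [pscustomobject] assumes PowerShell 3 or later.

```powershell
# Hypothetical helper: pair each file name with the first reference string it contains.
function Get-RenameMap {
    param(
        [string[]]$Names,   # existing file names, e.g. abc1234cd.jpg
        [string[]]$Keys     # reference strings, e.g. 1234c
    )
    foreach ($name in $Names) {
        # Ignore blank keys; take the first key found inside the name
        $key = $Keys | Where-Object { $_ -and $name.Contains($_) } | Select-Object -First 1
        if ($key) {
            [pscustomobject]@{
                OldName = $name
                NewName = $key + [System.IO.Path]::GetExtension($name)
            }
        }
    }
}
```

The resulting objects can be reviewed first and then passed to Rename-Item, for example with -WhatIf for a dry run. Names that match no key are simply left out of the map.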
