Azure PowerShell: How do I search for files in BLOB storage quickly?

We store log files in an Azure storage account, sorted in directories, by date and customer, like this:
YYYY/MM/DD/customerNo/.../.../somestring.customerNo.applicationID.log
I need to parse some of these files automatically every day, which works fine. However, all I know is the prefix mentioned above and the suffix; the files might be in different subdirectories.
So this is how I did it:
$files = (Get-AzStorageBlob -Container logfiles -Context $context) | Where-Object { $_.Name -like "*$customerId.$appID.txt" }
This was fast while there weren't many log files, but now, after a year, this search takes ages. I read somewhere that it would be faster to search by prefix than by suffix. Unfortunately, I have to use the suffix, but I now use the date as a prefix as well. I tried to improve it by doing this:
$date = Get-Date -UFormat "%Y/%m/%d"
$prefix = "$date/$customerId/"
$files = (Get-AzStorageBlob -Container logfiles -Context $context) | Where-Object { $_.Name -like "$prefix*$customerId.$appID.txt" }
However, there is no improvement whatsoever; it takes just as long as before. And it feels like the time the search takes increases exponentially with the number of log files (a few hundred thousand files totaling a few tens of GB).
I get a status message which stays on screen for literally half an hour.
From what I understand, Azure's BLOB storage does not have a hierarchical file system with real folders, so the "/" characters are part of the BLOB name and are merely interpreted as folders by client software.
However, that does not help me speed up the search. Any suggestions on how to improve the situation?

Azure Blob Storage supports server-side filtering of blobs by prefix; however, your code is not taking advantage of that.
$files = (Get-AzStorageBlob -Container logfiles -Context $context) | Where-Object { $_.Name -like "$prefix*$customerId.$appID.txt" }
Essentially, the code above lists all blobs and then does the filtering on the client side.
To speed up the search, please modify your code to something like:
$files = (Get-AzStorageBlob -Container logfiles -Prefix $prefix -Context $context) | Where-Object { $_.Name -like "$prefix*$customerId.$appID.txt" }
I simply passed the prefix via the -Prefix parameter. Now only the blobs whose names start with the prefix are returned.
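Putting the two together, a minimal sketch using the variables from the question (the -Prefix parameter narrows the listing on the server; Where-Object then applies the suffix filter to the much smaller result set):
# Server-side prefix filter, then client-side suffix filter
$date   = Get-Date -UFormat "%Y/%m/%d"
$prefix = "$date/$customerId/"
$files = Get-AzStorageBlob -Container logfiles -Prefix $prefix -Context $context |
    Where-Object { $_.Name -like "*$customerId.$appID.txt" }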

Related

Looking to validate that a certain string is present in a text file, send warning if not

I have a process where files containing data are generated in separate locations, saved to a networked location, and merged into a single file.
At the end of the process, I would like to check that all locations are present in that merged file, and notify me if not.
I am having a problem finding a way to identify that a string specific to each location isn't present, to be used in an if statement; my attempt doesn't seem to identify the string correctly.
I have tried :
get-childitem -filter *daily.csv.ready \\x.x.x.x\data\* -recurse | where-object {$_ -notin 'D,KPI,KPI,1,'}
I know it's probably easier to do nothing if it is present, and perform the warning action if not, but I'm curious if this can be done in the reverse.
Thank you,
As Doug Maurer points out, your command does not search through the content of the files output by the Get-ChildItem command, because what that cmdlet emits are System.IO.FileInfo (or, potentially, System.IO.DirectoryInfo) instances containing metadata about the matching files (directories) rather than their content.
In other words: the automatic $_ variable in your Where-Object command refers to an object describing a file rather than its content.
However, you can pipe System.IO.FileInfo instances to the Select-String cmdlet, which indeed searches the input files' content:
Get-ChildItem -Filter *daily.csv.ready \\x.x.x.x\data\* -Recurse |
    Where-Object { $_ | Select-String -Quiet -NotMatch 'D,KPI,KPI,1,' }
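To get the warning the question asks for, one possible wrapper around the same pipeline (the Write-Warning message text is an assumption about the desired output):
# Collect files in which the location string was not found, then warn once per file
$missing = Get-ChildItem -Filter *daily.csv.ready \\x.x.x.x\data\* -Recurse |
    Where-Object { $_ | Select-String -Quiet -NotMatch 'D,KPI,KPI,1,' }
if ($missing) {
    $missing | ForEach-Object { Write-Warning "String not found in $($_.FullName)" }
}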

Compare two files and list only the differences of 2nd file using PowerShell

On the first run of the script, I'm trying to get the current list of Azure VMs and store it in a CSV file in a Storage Account.
On the 2nd run, the current list should be compared with the existing CSV file in the Storage Account; in case any VMs were decommissioned, that should be recorded and stored in a 2nd file in the Storage Account.
This works fine for me, but the issue is that when we create a new Azure VM, it also gets added to the decommissioned-VMs CSV list.
$Difference = Compare-Object $existingVmCsv $vmrecordFile -Property VmName -PassThru | Select-Object VmName,ResourceGroupName,SubscriptionName
I tried a couple of side indicators, but they didn't work:
$Difference = Compare-Object -ReferenceObject @($vmrecordFile | Select-Object) -DifferenceObject @($existingVmCsv | Select-Object) -PassThru -Property VmName,ResourceGroupName,SubscriptionName | Where-Object {$_sideIndicator -eq "<="}
$Difference = Compare-Object -ReferenceObject $vmrecordFile -DifferenceObject $existingVmCsv -PassThru -Property VmName,ResourceGroupName,SubscriptionName | Where-Object {$_sideIndicator -eq "<="}
Thank you, user Cpt.Whale. Posting your suggestions as an answer to help other community members.
It seems you have a typo in the syntax: object property references should use a ".", like Where-Object { $_.sideIndicator -eq '<=' }
'<=' indicates that the property value appears only in the -ReferenceObject set.
References: powershell - compare two files and update the differences to 2nd file - Stack Overflow, Powershell : How to Compare Two Files, and List Differences | Dotnet Helpers (dotnet-helpers.com), and compare-object not working : PowerShell (reddit.com)
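Applied to the snippet above, the corrected call would look like this sketch (property names taken from the question):
# '.' (not '_') is the property accessor; '<=' keeps objects found only in the reference set
$Difference = Compare-Object -ReferenceObject $vmrecordFile -DifferenceObject $existingVmCsv -Property VmName,ResourceGroupName,SubscriptionName -PassThru |
    Where-Object { $_.sideIndicator -eq '<=' } |
    Select-Object VmName,ResourceGroupName,SubscriptionName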

Need to parse thousands of files for thousands of results - prefer PowerShell

I am consistently getting pinged by our government contract holder to search for IP addresses in our logs. I have three firewalls, 30-plus servers, etc., so you can imagine how unwieldy it becomes. To amplify the problem, I have been provided a list of over 1500 IP addresses for which I am to search all log files...
I have all of the logs downloaded and can use PowerShell to go through them one by one, but it takes forever. I need to be able to run the search multithreaded in PowerShell but cannot figure out the logic to do so. Here's my one-by-one script...
Any help would be appreciated!
$log = (Import-Csv C:\temp\FWLogs\IPSearch.csv)
$ip = ($log.IP)
ForEach ($log in $log) {
    Get-ChildItem -Recurse -Path C:\temp\FWLogs -Filter *.log | Select-String $ip -List | Select Path
}
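One way to parallelize this, as a sketch assuming PowerShell 7+ is available: ForEach-Object -Parallel fans the files out across runspaces, and since Select-String accepts an array of patterns, each log file only needs to be read once (the throttle limit of 8 is an arbitrary choice; -SimpleMatch keeps the dots in the IP addresses from acting as regex metacharacters):
$ips = (Import-Csv C:\temp\FWLogs\IPSearch.csv).IP
Get-ChildItem -Recurse -Path C:\temp\FWLogs -Filter *.log |
    ForEach-Object -Parallel {
        # $using: makes the pattern list visible inside the parallel runspace
        $_ | Select-String -Pattern $using:ips -SimpleMatch -List | Select-Object Path
    } -ThrottleLimit 8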

PowerShell: update O365 AD bulk attributes through csv file

We are trying to bulk-update our Azure Active Directory. We have an Excel CSV list of UserPrincipalNames for which we will update the Title, Department, and Office attributes.
# Get List of Clinical CMs
$PATH = "C:\Users\cs\Documents\IT Stuff\Project\Azure AD Update\AD-Update-ClinicalCMs-Test.csv"
$CMs = Import-csv $PATH
# Pass CMs into Function
ForEach ($UPN in $CMs) {
    # Do AD Update Task Here
    Set-MsolUser -UserPrincipalName $UPN -Title "Case Manager" -Department "Clinical" -Office "Virtual"
}
The CSV:
User.1@domain.com
User.2@domain.com
User.3@domain.com
The Set-MsolUser command will work on its own, but it is not working as intended in this ForEach loop. Any help or insight is greatly appreciated.
As Jim Xu commented, here is my comment as an answer.
The input file you show us is not a CSV file; instead, it is a list of UPN values, each on its own line.
To read these values as a string array, the easiest thing to do is to use Get-Content:
$PATH = "C:\Users\cs\Documents\IT Stuff\Project\Azure AD Update\AD-Update-ClinicalCMs-Test.csv"
$CMs = Get-Content -Path $PATH
Of course, although it is massive overkill, it can also be done using the Import-Csv cmdlet:
$CMs = (Import-Csv -Path $PATH -Header upn).upn
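Put together with the original loop, a sketch using the same path and attribute values as the question:
# Read one UPN per line, then update each user's attributes
$PATH = "C:\Users\cs\Documents\IT Stuff\Project\Azure AD Update\AD-Update-ClinicalCMs-Test.csv"
$CMs = Get-Content -Path $PATH
ForEach ($UPN in $CMs) {
    Set-MsolUser -UserPrincipalName $UPN -Title "Case Manager" -Department "Clinical" -Office "Virtual"
}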

PowerShell multithreading

I have a PowerShell script that converts Office documents to PDF. I would like to multithread it, but cannot figure out how based on other examples I have seen. The main script (OfficeToPDF.ps1) scans through a list of files and calls a separate script for each file type/Office application (e.g., for .doc files, WordToPDF.ps1 is called to convert). The main script passes one file name at a time to the child script (I did this for a couple of reasons).
Here is an example of the main script:
$documents_path = "C:\Documents\Test_Docs"
$pdf_out_path = "C:\Documents\Converted_PDFs"
$failed_path = "C:\Documents\Failed_to_Convert"
# Sets the root directory of this script
$PSScriptRoot = Split-Path -parent $MyInvocation.MyCommand.Definition
$date = Get-Date -Format "MM_dd_yyyy"
$Logfile = "$PSScriptRoot\logs\OfficeToTiff_$Date.log"
$word2PDF = "$PSScriptRoot\WordToPDF.ps1"
$arguments = "'$documents_path'", "'$pdf_out_path'", "'$Logfile'"
# Function to write to log file
Function LogWrite
{
    Param ([string]$logstring)
    $time = Get-Date -Format "hh:mm:ss:fff"
    Add-Content $Logfile -Value "$date $time $logstring"
}
################################################################################
# Word to PDF #
################################################################################
LogWrite "*** BEGIN CONVERSION FROM DOC, DOCX, RTF, TXT, HTM, HTML TO PDF ***"
Get-ChildItem -Path $documents_path\* -Include *.docx, *.doc, *.rtf, *.txt, *.htm? -Recurse | ForEach-Object {
    $original_document = "$($_.FullName)"
    # Verifies that a document exists before calling the convert script
    If ($original_document -ne $null)
    {
        Invoke-Expression "$word2PDF $arguments"
        # Checks to see if the document was successfully converted and deleted. If not, the doc is moved to another directory
        If (Test-Path -Path $original_document)
        {
            Move-Item $original_document $failed_path
        }
    }
}
$original_document = $null
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Here is the script (WordToPDF.ps1) that is called by the main script:
Param($documents, $pdf_out_path, $Logfile)
# Function to write to the log file
Function LogWrite
{
    Param ([string]$logstring)
    $time = Get-Date -Format "hh:mm:ss:fff"
    Add-Content $Logfile -Value "$date $time $logstring"
}
$word_app = New-Object -ComObject Word.Application
$document = $word_app.Documents.Open($_.FullName)
$original_document = "$($_.FullName)"
# Creates the output file name with path
$pdf_document = "$($pdf_out_path)\$($_.BaseName).pdf"
LogWrite "Converting: $original_document to $pdf_document"
$document.SaveAs([ref] $pdf_document, [ref] 17)
$document.Close()
# Deletes the original document after it has been converted
Remove-Item $original_document
LogWrite "Deleting: $original_document"
$word_app.Quit()
Any suggestions would be appreciated.
Thanks.
I was just going to comment and link you to this question: Can PowerShell run commands in Parallel. I then noted the date of that question and the answers, and with PowerShell v3.0 there are some new features that might work better for you.
The question goes over the use of PowerShell jobs, which can work but require you to keep track of job status, so they can add a bit of extra coding to manage.
PowerShell v3 opened the door a bit more with workflow, which is based on Windows Workflow Foundation. A good article on the basics of how this new feature works can be found on the Scripting Guy blog here. You can basically adjust your code to run your conversion via a workflow, and it will perform the work in parallel:
workflow foreachfile {
    foreach -parallel ($f in $files) {
        # Put your code here that does the work
    }
}
From what I can find, the thread limit for this is 5 threads at a time. I am not sure how accurate that is, but a blog post here noted the limitation. However, given that the COM Application objects for Word and Excel can be very CPU intensive, 5 threads at a time would probably work well.
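For the script in the question, the shape might be something like this sketch (Convert-FilesParallel is a hypothetical name, and the child script is assumed to take the file path as its first parameter; InlineScript is required to invoke an external script from a workflow, with $Using: passing workflow variables into it):
workflow Convert-FilesParallel {
    param([string[]]$Files, [string]$ConverterScript)
    foreach -parallel ($f in $Files) {
        InlineScript {
            # Each conversion runs in its own parallel activity
            & $Using:ConverterScript $Using:f
        }
    }
}
# Hypothetical usage:
# Convert-FilesParallel -Files $fileList -ConverterScript "$PSScriptRoot\WordToPDF.ps1"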
I have a multithreaded PowerShell environment for indicator-of-compromise scanning on all AD devices, threaded 625 times with Gearman (http://gearman.org).
It is open source and offers the option to go cross-platform. It threads with a server/worker flow and runs via Python. Highly recommended by yours truly, someone who has abused threading in PowerShell. This isn't so much an answer as something I had never heard of but love and use daily. Pass it forward. Open source for the win :)
I have also used PSJobs before, and they are great up to a certain point of magnitude. Maybe it is my lack of .NET expertise, but PowerShell has some quirky, subtle memory nuances that at large scale can create some nasty effects.
