PowerShell - parsing a PDF file for a literal or image - Excel

I'm using PowerShell (running in PowerGUI). I have a PDF file that I need to search through in order to find out whether an attachment is referenced within the content of a particular page. Failing that, I need to search for images within the document, such as a Microsoft Word, Excel, or PDF icon.
I am using the following code to read in the page:
Add-Type -Path "c:\itextsharp-all-5.4.5\itextsharp-dll-core\itextsharp.dll"
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList "c:\files\searchfile.pdf"
for ($page = 1; $page -le 3; $page++) {
    # Split the raw page content stream into lines
    $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
    foreach ($line in $lines) {
        if ($line -match "^\[") {
            # Unescape backslash-escaped characters, then strip the TJ operator syntax
            $line = $line -replace "\\([\S])", $matches[1]
            $line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
        }
    }
}
However, the above yields a few bits of readable text but mostly unprintable characters.
How can you search a PDF file with PowerShell for a literal string (like ".doc" or ".xlsx")? And can a PDF be searched for a graphic (like the Excel or Word icon)?

Without seeing the raw PDF content it's not easy to give specific help, so if you can share a sample PDF or its contents, that would be helpful.
Once you know what to look for in the stream, you can search by reading in the file line by line and using the -match operator:
$file = [io.file]::ReadAllLines('C:\test.pdf')
$title = ($file -match "<rdf:li")[0].Split(">")[1].Split("<")[0]
$description = ($file -match "<rdf:li")[2].Split(">")[1].Split("<")[0]
write-host ("Title: " + $title)
write-host ("Description: " + $description)
I doubt very much that the contents of the file will tell you much more than that an image exists at particular page coordinates (although I'm by no means a PDF expert). The file may also include the binary image stream, in which case you may be able to save that stream out as a file (I haven't tried it yet).
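As an aside, rather than decoding the raw content stream by hand, iTextSharp's parser namespace can extract a page as plain text, which is far easier to search for literals like ".doc" or ".xlsx". A minimal sketch using the same DLL from the question (note that embedded icons/images will not appear in the extracted text, so this only finds textual references):
Add-Type -Path "c:\itextsharp-all-5.4.5\itextsharp-dll-core\itextsharp.dll"
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList "c:\files\searchfile.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
    # Decode the page's content stream into searchable plain text
    $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page)
    if ($text -match '\.docx?|\.xlsx?') {
        Write-Host "Page $page references a Word/Excel attachment"
    }
}
$reader.Close()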

Related

Automate creating configurations from template and Excel

I'm having trouble automating a configuration.
I have a configuration template and need to change all the hostnames (marked as YYY) and IPs (marked as XXX; only the 3rd octet needs replacement) according to a list of Excel values.
I have a list of 100 different sites and IPs, and I want to end up with 100 different configurations.
A friend suggested the following PowerShell code, but it doesn't create any files:
$replaceValues = Import-Csv -Path "\\ExcelFile.csv"
$file = "\\Template.txt"
$contents = Get-Content -Path $file
foreach ($replaceValue in $replaceValues)
{
    $contents = $contents -replace "YYY", $replaceValue.hostname
    $contents = $contents -replace "XXX", $replaceValue.site
    Copy-Item $file "$($file.$replaceValue.hostname)"
    Set-Content -Path "$($file.$replaceValue.hostname)" -Value $contents
    echo "$($file.$replaceValue.hostname)"
}
Your code keeps overwriting the same $contents string inside the loop, so once the values are replaced on the first iteration, there are no YYY or XXX values left to replace.
You need to keep the template text intact and create a new copy of it inside the loop. That copy can then be altered the way you want, and every subsequent iteration will start off with a fresh copy of the template.
There is also no need to first copy the template to a new location and then overwrite this file with the new contents. Set-Content is happy to create a new file for you if it does not already exist.
Try
$replaceValues = Import-Csv -Path 'D:\Test\Values.csv'
$template = Get-Content -Path 'D:\Test\Template.txt'
foreach ($item in $replaceValues) {
$content = $template -replace 'YYY', $item.hostname -replace 'XXX', $item.site
$newFile = Join-Path -Path 'D:\Test' -ChildPath ('{0}.txt' -f $item.hostname)
Write-Host "Creating file '$newFile'"
$content | Set-Content -Path $newFile
}
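For reference, a hypothetical Values.csv to feed the code above would need a hostname and a site column (the names the loop reads via $item.hostname and $item.site), e.g.:
hostname,site
SITE001,10
SITE002,11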

How can I modify this PowerShell script to continue looking for one string after another?

I want this PowerShell script to search for the occurrences of multiple strings, one after the other, and to append the results to a .txt file.
Currently I specify the string I want to look for, wait for the script to finish looking for that string, and transfer the results into a spreadsheet. This takes a lot of time because I have to keep specifying the next string, especially since there are well over 100 that I need to look for.
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "C:\Users\username\Documents\FileName"
$files = Get-Childitem $path -Include *.docx, *.doc, *.ppt, *.xls, *.xlsx, *.pptx, *.eap -Recurse | Where-Object { !($_.psiscontainer) }
$output = "C:\Users\username\Documents\FileName\wordfiletry.txt"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First_String"
Function getStringMatch
{
    # Loop through all *.doc files in the $path directory
    Foreach ($file In $files)
    {
        $document = $application.documents.open($file.FullName,$false,$true)
        $range = $document.content
        $wordFound = $range.find.execute($findText)
        if ($wordFound)
        {
            "$($file.FullName) has found the string called $findText and it is $wordFound" | Out-File $output -Append
        }
    }
    $document.close()
    $application.quit()
}
getStringMatch
This script looks for 'First_String' successfully; I was hoping to be able to specify 'Second_String', 'Third_String', etc., rather than replace First_String every time.
As an alternative to the suggestion from @Mathias, you could use regex to query the document text instead.
Read the content of the document as a string ($text = $document.content.text) and then use Select-String $findtext -AllMatches to evaluate the matches, with $findtext being the string representation of a regular expression.
Example:
# pipe-delimited string as a regular expression
$findtext = "First_String|Second_String|Third_String"
Function getStringMatch
{
    # Loop through all *.doc files in the $path directory
    Foreach ($file In $files)
    {
        $document = $application.documents.open($file.FullName,$false,$true)
        $text = $document.content.text
        $result = $text | Select-String $findtext -AllMatches
        if ($result)
        {
            "$($file.FullName) has found the strings called $($result.Matches.Value) at indexes $($result.Matches.Index)" | Out-File $output -Append
        }
        # close each document once it has been searched
        $document.close()
    }
    $application.quit()
}
Note that if you're trying to find strings that contain reserved regex characters, you'll need to escape them first.
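For example, a sketch that builds the pattern safely from a plain list of terms, using [regex]::Escape to handle the escaping (the term list here is made up):
# hypothetical list of search terms, some containing regex metacharacters
$terms = 'First_String', 'Price (USD)', 'C++_Module'
# escape each term, then join them into a single alternation pattern
$findtext = ($terms | ForEach-Object { [regex]::Escape($_) }) -join '|'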

Capture Get-ChildItem results into string array for HTML output

I'm writing a PowerShell script and I want to capture the list of files in a directory and store it in a variable. Once captured, I'd like to print it to an HTML file. I've captured the data like this:
$listDownloads = Get-ChildItem -Force "C:\Users\Someuser\Downloads"
Now, I want to convert $listDownloads into a string and split it. How do I go about doing that?
I've tried this:
$listSplit = $listDownloads.ToString().Split(" ")
but I get this output:
System.Object[]
Updated:
$html = '<html>'
$body = '<body>'
$p = '<p>'
$pClose = '</p>'
$StatusDownloads = 'Directory of the Downloads Folder'
$pTags = $p.ToString() + $StatusDownloads + $pClose.ToString();
$listDownloads = (Get-ChildItem -Force "C:\Users\Someuser\Downloads").Name;
function createHTML($addtoFile){
    $addtoFile | Add-Content 'status.html'
}
function printData($printData){
    Write-Output $printData
}
$html | Set-Content 'status.html'
createHTML($html)
createHTML($body)
createHTML($pTags)
createHTML($listDownloads)
When I try to print $listDownloads it gives me the data all on one line (i.e. File1 File2). I want each file or folder displayed on a new line. How would I do that? If I were to type Get-ChildItem -Force "C:\Users\Someuser\Downloads" into PowerShell it would list the files line by line; I want that type of output on my HTML page.
The only reason I can think of that you would want to split is to get an array of filenames.
If that's the case, you can do:
$list = (Get-ChildItem -Force "C:\Users\Someuser\Downloads").FullName
From your update, you would need to pass it into your function as:
createHTML($listDownloads -join "`n")
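One caveat: "`n" only adds line breaks in the raw HTML source, and browsers collapse those into spaces when rendering. To make each file name render on its own line, you could join with <br/> tags instead:
# join with HTML line-break tags so each name renders on a new line
createHTML(($listDownloads -join '<br/>'))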

PowerShell - Optimizing a very, very large CSV and text file search and replace

I have a directory with ~3000 text files in it, and I'm doing periodic search-and-replaces on those files as I transition a program to a new server.
Each text file may have an average of ~3000 lines, and I need to search the files for maybe 300-1000 terms at a time.
I'm replacing the server prefix, which is related to the string I'm searching for. So for every one of the CSV entries, I'm looking for Search_String or \\Old_Server\Search_String and making sure that after the program completes, the result is \\New_Server\Search_String.
I cobbled together a PowerShell program, and it works, but it's so slow I've never seen it complete.
Any suggestions for making it faster?
EDIT 1:
I changed Get-Content as suggested, but it still took 3 minutes to search two files (~8000 lines) for 9 separate search terms. I must still be screwing up; a Notepad++ search and replace done manually 9 times would still be way faster.
I'm not sure how to get rid of the first Get-Content, because I want to make a backup copy of the file before I make any changes to it.
EDIT 2:
So this is an order of magnitude faster; it's searching a file in maybe 10 seconds. But now it doesn't write the changes to the files, and it only searches the first file in the directory! I didn't change that code, so I don't know why it broke.
EDIT 3:
Success! I adapted a solution posted below to make it much, much faster; it's searching each file in a couple of seconds now. I may reverse the loop order, so that it loads the file into the array and then searches and replaces each entry in the CSV rather than the other way around. I'll post that if I get it to work.
Final script is below for reference.
#get input from the user
$old = Read-Host 'Enter the old cimplicity qualifier (F24, IRF3 etc)'
$new = Read-Host 'Enter the new cimplicity qualifier (CB3, F24_2 etc)'
$DirName = Get-Date -format "yyyy_MM_dd_hh_mm"
New-Item -ItemType directory -Path $DirName -force
New-Item "$DirName\log.txt" -ItemType file -force -Value "`nMatched CTX files on $dirname`n"
$logfile = "$DirName\log.txt"
$VerbosePreference = "SilentlyContinue"
$points = Import-Csv SearchAndReplace.csv -Header find   #Import CSV file
#$ctxfiles = Get-ChildItem . -Include *.ctx | select -expand FullName   #Import local directory of CTX files
$points | ForEach-Object {   #For each row of points in the CSV file
    $findvar = $_.find   #Store column 1 as the string to search for
    $OldQualifiedPoint = "\\\\" + $old + "\\" + $findvar   #Escape each individual backslash so it's not read as regex
    $NewQualifiedPoint = "\\" + $new + "\" + $findvar      #Escapes are NOT required in the replacement string
    $DuplicateNew = "\\\\" + $new + "\\" + "\\\\" + $new + "\\"
    $QualifiedNew = "\\" + $new + "\"
    dir . *.ctx |                 #Grab all CTX files
    select -expand FullName |     #grab all of those file names and...
    foreach {   #iterate through each file
        $DateTime = Get-Date -Format "hh:mm:ss"
        $FileName = $_
        Write-Host "$DateTime - $FindVar - Checking $FileName"
        $FileCopied = 0
        #Check file contents, and copy matching files to the newly created directory
        If (Select-String -Path $_ -Pattern $findvar -Quiet) {
            If (!($FileCopied)) {
                Copy $FileName -Destination $DirName
                $FileCopied = 1
                Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
                Write-Host "$DateTime - Found $Findvar in $filename"
            }
            $FileContent = Get-Content $Filename -ReadCount 0
            $FileContent = $FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
            $FileContent | Set-Content $FileName
        }
    }
}
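For illustration, a hypothetical (untested) sketch of that reversed loop order, reusing the $old/$new prompts above: each file is read into memory once, then every CSV entry is applied to the in-memory content before the file is written back.
$points = Import-Csv SearchAndReplace.csv -Header find
dir . *.ctx | select -expand FullName | foreach {
    $FileName = $_
    $FileContent = Get-Content $FileName -ReadCount 0   # read the whole file once
    foreach ($point in $points) {
        $findvar = $point.find
        $OldQualifiedPoint = "\\\\" + $old + "\\" + $findvar   # escaped for the regex match
        $NewQualifiedPoint = "\\" + $new + "\" + $findvar      # replacement text is literal
        $FileContent = $FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint
    }
    $FileContent | Set-Content $FileName   # write the updated content back once
}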
If I'm reading this correctly, you should be able to read a 3000-line file into memory and do those replaces as an array operation, eliminating the need to iterate through each line. You can also chain those replace operations into a single command.
dir . *.ctx |                 #Grab all CTX files
select -expand fullname |     #grab all of those file names and...
foreach {   #iterate through each file
    $DateTime = Get-Date -Format "hh:mm:ss"
    $FileName = $_
    Write-Host "$DateTime - $FindVar - Checking $FileName"
    #Check file contents, and copy matching files to the newly created directory
    If (Select-String -Path $_ -Pattern $findvar -Quiet) {
        Copy $FileName -Destination $DirName
        Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
        Write-Host "$DateTime - Found $Findvar in $filename"
        $FileContent = Get-Content $Filename -ReadCount 0
        $FileContent = $FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
        $FileContent | Set-Content $FileName
    }
}
On another note, Select-String will take the filepath as an argument, so you don't have to do a Get-Content and then pipe that to Select-String.
Yes, you can make it much faster by not using Get-Content; use a StreamReader instead.
$file = New-Object System.IO.StreamReader -Arg "test.txt"
while (($line = $file.ReadLine()) -ne $null) {
    # $line has your line
}
$file.Dispose()
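To turn that into a streaming search-and-replace, a minimal sketch could pair the reader with a StreamWriter so neither file is ever held fully in memory ($searchPattern and $replacement are placeholder names for your own values, and the file names are made up):
$reader = New-Object System.IO.StreamReader -Arg "test.txt"
$writer = New-Object System.IO.StreamWriter -Arg "test_new.txt"
while (($line = $reader.ReadLine()) -ne $null) {
    # apply the replacement line by line as the file streams through
    $writer.WriteLine(($line -replace $searchPattern, $replacement))
}
$writer.Dispose()
$reader.Dispose()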
I wanted to use PowerShell for this and created a script like the one below:
$filepath = "input.csv"
$newfilepath = "input_fixed.csv"
filter num2x { $_ -replace "aaa","bbb" }
measure-command {
    Get-Content -ReadCount 1000 $filepath | num2x | Add-Content $newfilepath
}
It took 19 minutes on my laptop to process a 6.5 GB file. The code above reads the file in batches (using -ReadCount) and pipes it through a filter, which should optimize performance.
But then I tried FART and it did the same thing in 3 minutes! Quite a difference!

PowerShell: Searching the content of files and writing the results to a text file

I'm new to PowerShell, so I don't know where to start. I want a script that searches all (PDF, Word, Excel, PowerPoint, ...) file content for a specific string combination.
I tried this script, but it doesn't work:
function WordSearch ($sample, $staining, $sampleID, $patientID, $folder)
{
    $objConnection = New-Object -com ADODB.Connection
    $objRecordSet = New-Object -com ADODB.Recordset
    $objConnection.Open("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';")
    $objRecordSet.Open("SELECT System.ItemPathDisplay FROM SYSTEMINDEX WHERE ((Contains(Contents,'$sample')) or (Contains(Contents,'$sampleID') and Contains(Contents,'$staining')) or (Contains(Contents,'$staining') and Contains(Contents,'$patientID'))) AND System.ItemPathDisplay LIKE '$folder\%'", $objConnection)
    if ($objRecordSet.EOF -eq $false) { $objRecordSet.MoveFirst() }
    while ($objRecordset.EOF -ne $true) {
        $objRecordset.Fields.Item("System.ItemPathDisplay").Value
        $objRecordset.MoveNext()
    }
}
Can someone help me?
You should try this, but first make sure you're in the folder where you want to start searching. (If you're trying to search your whole computer, start in C:\, but I imagine the script will take a decent amount of time to run.)
$Paths = @()
$Paths = gci . *.* -rec | where { ! $_.PSIsContainer } | ? { ($_.Extension -eq ".doc") -or ($_.Extension -eq ".ppt") -or ($_.Extension -eq ".pdf") -or ($_.Extension -eq ".xls") } | resolve-path
This will retrieve all the file paths for those file types. If you have Microsoft Office 2007 or above, you may want to add searches for ".xlsx", ".docx", or ".pptx".
Then you can begin looking through those files for your specific string combination:
$array = @()
foreach ($path in $Paths)
{ $array += Select-String -Path $Path -Pattern "Search String" }
This will give you all the lines, and the paths they occur on, where that string exists in those files. The actual line output you get may be a little distorted, though, due to the binary format of the Office files. Use $array | Get-Member -MemberType Property to find the items you can index into, and the Select-Object cmdlet to pull those items out.
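For example (a sketch; Select-String emits MatchInfo objects whose handy properties include Path, LineNumber, and Line):
# keep just the file path and line number for each hit, and save them to a file
$array | Select-Object Path, LineNumber | Out-File 'results.txt'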
