Strange characters found in XML file and PowerShell output after exporting from Excel: ​

I have an XML file that I'm trying to read with PowerShell. However when I read it, the output of some of the XML objects have the following characters in them: ​
I simply downloaded an XML file I needed from a third-party, which opens in Excel. Then I grab the columns I need and paste them into a new Excel Workbook. Then I map the fields with an XML Schema and then export it as an XML file, which I then use for scripting.
In the Excel spreadsheet my data looks clean, but then when I export it and run the PS script, these strange characters appear in the output. The characters even appear in the actual XML file after exporting. What am I doing wrong?
I tried using -Encoding UTF8, but I'm relatively new to PowerShell and am not sure how to appropriately apply it to my script. Appreciate any help!
PowerShell
$xmlpath = 'Path\To\The\File.xml'
[xml]$xmldata = (Get-Content $xmlpath)
$xmldata.applications.application.name
Example of Output
​ABC_DEF_GHI​.com​​
​JKL_MNO_PQRS​.com​
TUV_WXY_Z.com
AB_CD_EF_GH​.com

This is a prime example of why you shouldn't use the idiom [xml]$xmldata = (Get-Content $xmlpath) - as convenient as it is.[1] The problem is indeed one of character encoding: your file is UTF-8-encoded, but Windows PowerShell's Get-Content cmdlet interprets it as ANSI-encoded in the absence of a BOM - this answer explains the encoding part in detail.
Instead, to ensure that the XML file's character encoding is interpreted correctly, use the following:
# Note: If you know that $xmlPath contains a *full*, native path,
# you don't need the Convert-Path call.
($xmlData = [xml]::new()).Load((Convert-Path -LiteralPath $xmlPath))
This delegates interpretation of the character encoding to the System.Xml.XmlDocument.Load .NET API method, which not only assumes the proper default for XML (UTF-8), but also respects any explicit encoding specification as part of the XML declaration, if present (e.g., <?xml version="1.0" encoding="iso-8859-1"?>).
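To see where the stray characters come from, here is a minimal sketch that reproduces them. It assumes the invisible character in the Excel data is a zero-width space (U+200B); that is an inference from the "​" pattern, not something the question states, and code page 1252 merely stands in for the system's ANSI code page.

```powershell
# Assumption: the invisible character is U+200B (ZERO WIDTH SPACE).
$zwsp = [string][char]0x200B

# In UTF-8, U+200B is stored as the three bytes E2 80 8B.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($zwsp)
$hex = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '

# Decoding those bytes as Windows-1252 (a common ANSI code page) turns
# each byte into a separate, visible character.
$misread = [System.Text.Encoding]::GetEncoding(1252).GetString($utf8Bytes)

"UTF-8 bytes : $hex"
"Misread as  : $misread"
```

If that assumption holds, the characters are present in the Excel data itself (most likely copied along from the third-party file), and reading the file with the correct encoding makes them invisible again rather than removing them.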
See also:
the bottom section of this answer for background information.
GitHub proposal #14505, which proposes introducing a New-Xml cmdlet that robustly parses XML files.
[1] If you happen to know the encoding of the input file ahead of time, you can get away with using Get-Content's -Encoding parameter in your original approach ([xml]$xmldata = (Get-Content -Encoding utf8 $xmlpath)), but the .Load()-based approach is much more robust.

Related

Mass Conversion of (macintosh) .csv to (ms-dos) .csv

I am using a program to export hundreds of rows in an Excel sheet into separate documents, but the problem is that a PLC will be reading the files and they only save in (macintosh).csv with no option for windows. Is there a way to bulk convert multiple files with different names into the correct format?
I have used this code for a single file but I do not have the knowledge to use it for multiple in a directory
$path = 'c:\filename.csv';
[System.IO.File]::WriteAllText($path.Remove($path.Length-3)+'txt',[System.IO.File]::ReadAllText($path).Replace("`n","`r`n"));
Thank you
The general PowerShell idiom for processing multiple files one by one:
Use Get-ChildItem (or Get-Item) to enumerate the files of interest, as System.IO.FileInfo instances.
Pipe the result to a ForEach-Object call, whose script-block argument ({ ... }) is invoked once for each input object received via the pipeline, reflected in the automatic $_ variable.
Specifically, since you're calling .NET API methods, be sure to pass full, file-system-native file paths to them, because .NET's working directory usually differs from PowerShell's. $_.FullName does that.
Therefore:
Get-ChildItem -LiteralPath C:\ -Filter *.csv |
ForEach-Object {
[IO.File]::WriteAllText(
[IO.Path]::ChangeExtension($_.FullName, 'txt'),
[IO.File]::ReadAllText($_.FullName).Replace("`n", "`r`n")
)
}
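One caveat with .Replace("`n", "`r`n"): it assumes every line ends in a bare LF. If a file already contains CRLF, the CR ends up doubled, and if it uses CR-only line endings, as classic Macintosh-style text files do, nothing is replaced at all. A more defensive normalization could look like the sketch below; ConvertTo-CrLf is a hypothetical helper name, not a built-in command.

```powershell
function ConvertTo-CrLf {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string] $Text
    )
    # Try CRLF first so an existing pair is kept as one unit,
    # then fall back to a lone CR or a lone LF.
    $Text -replace "`r`n|`r|`n", "`r`n"
}

# Mixed input: CR-only, LF-only, and CRLF line endings.
$normalized = ConvertTo-CrLf "a`rb`nc`r`nd"
$normalized.Length   # 10: four letters plus three CRLF pairs
```

Inside the ForEach-Object block, the call would take the place of the .Replace() expression, e.g. ConvertTo-CrLf ([IO.File]::ReadAllText($_.FullName)).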
Note:
In PowerShell type literals such as [System.IO.File], the System. part is optional and can be omitted, as shown above.
[System.IO.Path]::ChangeExtension(), as used above, is a more robust way to obtain a copy of a path with the original file-name extension changed to a given one.
While Get-ChildItem -Path C:\*.csv or even Get-ChildItem C:\*.csv would work too (Get-ChildItem's first positional parameter is -Path), -Filter, as shown above, is usually preferable for performance reasons.
Caveat: While -Filter is typically sufficient, it does not use PowerShell's wildcard language, but delegates matching to the host platform's file-system APIs. This means that range or character-set expressions such as [0-9] and [fg] are not supported, and, on Windows, several legacy quirks affect the matching behavior - see this answer for more information.
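When a pattern does need a character-set expression, one option is to let -Filter do the coarse matching and refine the result with the -like operator, which uses PowerShell's wildcard language. The folder and file names below are made up for illustration.

```powershell
# Hypothetical sample files in a throwaway folder, for illustration only.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.IO.Path]::GetRandomFileName())
$null = New-Item -ItemType Directory -Path $dir
'data1.csv', 'data2.csv', 'notes.csv' | ForEach-Object {
    $null = New-Item -ItemType File -Path (Join-Path $dir $_)
}

# -Filter narrows the candidates at the file-system level;
# -like then applies PowerShell's own wildcards, including [0-9].
$matched = Get-ChildItem -LiteralPath $dir -Filter *.csv |
    Where-Object { $_.Name -like 'data[0-9].csv' }

$matched.Name   # data1.csv and data2.csv

# Clean up the throwaway folder.
Remove-Item -LiteralPath $dir -Recurse
```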

Convert a string in PowerShell (in Europe) to UTF-8

For a REST call I need the German "Stück" in UTF-8, as read from an Access database with
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$filename;Persist Security Info=False;")
and try to convert it.
I have found out that PowerShell ISE seems to encode string constants in ANSI.
So I tried as a minimum test without database and got the same result:
$Text1 = "Stück" # entered via ISE, this is also what I get from the database
# ($StringFromDatabase -eq $Text1) shows $true
$enc = [System.Text.Encoding]::GetEncoding(1252).GetBytes($Text1)
# also tried [System.Text.Encoding]::GetEncoding("ISO-8859-1") # = 28591
$Text1 = [System.Text.Encoding]::UTF8.GetString($enc)
$Text1
$Text1 = "Stück" # = UTF-8, entered here with Notepad++, encoding set to UTF-8
"must see: $Text1"
So I get two outputs: the converted one shows "St?ck", but I need to see "Stück".
Regarding the claim "that PowerShell ISE seems to encode string constants in ANSI":
That only applies when communicating with external programs, whereas you're using in-process .NET APIs.
As an aside: this discrepancy with regular console windows, which use the active OEM code page, is one of the reasons that make the obsolescent ISE problematic - see the bottom section of this answer for more information.
String literals in memory are always .NET strings, which are UTF-16-encoded (composed of 16-bit Unicode code units), capable of representing all Unicode characters.[1]
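A short sketch of what this means for the code in the question: the string is already correct in memory, and an encoding only comes into play when it is turned into bytes. The string is built from a code point here so that the encoding of the script file itself cannot interfere; the variable names are illustrative.

```powershell
# "Stück", with U+00FC (ü) supplied as a code point.
$text = "St$([char]0xFC)ck"

# Encoding the in-memory string as UTF-8 yields two bytes (C3 BC) for the ü.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($text)
$hex = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex   # 53 74 C3 BC 63 6B

# The question's conversion: Windows-1252 bytes decoded as if they were UTF-8.
# In Windows-1252 the ü is the single byte 0xFC, which is not valid UTF-8,
# so the decoder substitutes U+FFFD (the replacement character).
$cp1252Bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($text)
$broken = [System.Text.Encoding]::UTF8.GetString($cp1252Bytes)
$broken.Contains([string][char]0xFFFD)   # True
```

In other words, the "conversion" is what damages the text; the original $Text1 can be passed to the web cmdlets as-is.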
Character encoding in web-service calls (Invoke-RestMethod, Invoke-WebRequest):
To send UTF-8 strings, specify charset=utf-8 as part of the -ContentType argument; e.g.:
Invoke-RestMethod -ContentType 'text/plain; charset=utf-8' ...
On receiving strings, PowerShell automatically decodes them either based on an explicitly specified charset field (character encoding) in the response's content header or, in its absence, using ISO-8859-1 (which is closely related to, but in effect a subset of Windows-1252).
If a given response doesn't specify a charset but actually uses a different encoding from ISO-8859-1 - say UTF-8 - PowerShell will misinterpret the strings received, which requires re-encoding after the fact - see this answer.
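A minimal sketch of such after-the-fact re-encoding, assuming the response was UTF-8 but got decoded as ISO-8859-1. The misdecoded string is simulated locally instead of coming from a real web service, so all variable names are illustrative.

```powershell
$original = "St$([char]0xFC)ck"   # "Stück"
$latin1   = [System.Text.Encoding]::GetEncoding(28591)   # ISO-8859-1

# Simulate the problem: UTF-8 bytes wrongly decoded as ISO-8859-1.
$misdecoded = $latin1.GetString([System.Text.Encoding]::UTF8.GetBytes($original))
$misdecoded.Length   # 6, because the ü arrived as two separate characters

# Repair: recover the original bytes via ISO-8859-1, then decode them as UTF-8.
$repaired = [System.Text.Encoding]::UTF8.GetString($latin1.GetBytes($misdecoded))
$repaired -ceq $original   # True
```

This works because ISO-8859-1 maps every byte value to exactly one character, so the round trip back to bytes loses nothing.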
Character encoding when communicating with external programs:
If you need to send a string with a particular encoding to an external program (via the pipeline, which the target program receives via stdin), set the $OutputEncoding preference variable to that encoding, and PowerShell will automatically convert your .NET strings to the specified encoding.
To send UTF-8-encoded strings to external programs via the pipeline:
$OutputEncoding = [System.Text.UTF8Encoding]::new()
Note, however, that this alone isn't sufficient in order to correctly receive UTF-8 output from external programs; for that, you need to set [Console]::OutputEncoding to the same encoding.
To make your PowerShell session fully UTF-8-aware (irrespective of whether in the ISE or a regular console window):
# Needed in the ISE only:
chcp >$null # Dummy console-program call that ensures that a console is allocated.
# Set all encodings relevant to communicating with external programs to UTF-8.
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding =
[System.Text.UTF8Encoding]::new()
See this answer for more information.
[1] Note, however, that Unicode characters with a code point greater than 0xFFFF, i.e. those outside the so-called BMP (Basic Multilingual Plane), must be represented with two 16-bit code units ([char]), namely so-called surrogate pairs.
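A brief illustration of a surrogate pair, using U+1F600 as an arbitrary example of a code point outside the BMP:

```powershell
# Build the string from its code point so no literal emoji is needed.
$emoji = [char]::ConvertFromUtf32(0x1F600)

$length = $emoji.Length                          # 2: two UTF-16 code units
$isHigh = [char]::IsHighSurrogate($emoji[0])     # True
$isLow  = [char]::IsLowSurrogate($emoji[1])      # True

# Reassemble the single code point from the pair.
$codePoint = [char]::ConvertToUtf32($emoji, 0)   # 128512, i.e. 0x1F600

"Length: $length, code point: 0x{0:X}" -f $codePoint
```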

Replace strings in text files with string literals and file names in powershell

My google-fu has failed me, so I'd love to get some help with this issue. I have a directory full of markup files (extension .xft). I need to modify these files by adding string literals and the filename (without the file extension) to each file.
For example, I currently have:
<headerTag>
<otherTag>Some text here </otherTag>
<finalTag> More text </finalTag>
What I need to end up with is:
<modifiedHeaderTag>
<secondTag> filenameGoesHere </secondTag>
<otherTag>Some text here </otherTag>
<finalTag> More text </finalTag>
So in this example,
"<modifiedHeaderTag>
<secondTag>"
would be my first string literal (this is a constant that gets inserted into each file in the same place),
filenameGoesHere
would be the variable string (the name of each file) and,
"</secondTag>"
would be my second constant string literal.
I was able to successfully replace text using:
(Get-Content *.xft).Replace("<headerTag>", "<modifiedHeaderTag>")
However, when I tried
(Get-Content *.xft).Replace("<headerTag>", "<modifiedHeaderTag> `n
<secondTag> $($_.Name) </secondTag>")
I just got an error message. Replacing $($_.Name) with ${$_.Name) also had no effect.
I've tried other things, but this method was the closest that I had gotten to success. I would appreciate any help that I can get. It's probably simple and I'm just not seeing something due to inexperience with Powershell, so a helping hand would be great.
If the above isn't clear enough, I'd be happy to provide more info, just let me know. Thanks everyone!
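Here's my approach, assuming you have all of the XFT files in one folder and you want to write the updates back to the same file:
$path = "C:\XFTs_to_Modify"
# -Filter works without -Recurse; -Include on its own typically matches nothing here.
$xfts = Get-ChildItem -LiteralPath $path -Filter "*.xft"
foreach ($xft in $xfts) {
    # BaseName is the file name without its extension.
    $replace = "<modifiedHeaderTag>
<secondTag> $($xft.BaseName) </secondTag>"
    # Read only the current file (not *.xft); the parentheses make sure it is
    # read completely before Set-Content overwrites it.
    (Get-Content -LiteralPath $xft.FullName).Replace("<headerTag>", $replace) | Set-Content -LiteralPath $xft.FullName -Force
}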
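As for why the attempt with $($_.Name) failed: the automatic $_ variable is only defined inside script blocks that receive pipeline input, such as the one passed to ForEach-Object, so at the top level it expands to nothing. A pipeline-based sketch of the same idea follows; the sample folder and file are made up, and -Raw and -NoNewline require PowerShell 3 and 5 or later, respectively.

```powershell
# Hypothetical sample file in a throwaway folder, for illustration only.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.IO.Path]::GetRandomFileName())
$null = New-Item -ItemType Directory -Path $dir
$sample = Join-Path $dir 'report42.xft'
Set-Content -LiteralPath $sample -Value '<headerTag>', '<otherTag>Some text here </otherTag>'

Get-ChildItem -LiteralPath $dir -Filter *.xft | ForEach-Object {
    $file = $_   # $_ refers to the current file only inside this script block
    $nl = [Environment]::NewLine
    $replacement = "<modifiedHeaderTag>$nl<secondTag> $($file.BaseName) </secondTag>"

    # -Raw reads the file as one string, so the original line endings survive.
    $updated = (Get-Content -LiteralPath $file.FullName -Raw).Replace('<headerTag>', $replacement)

    # -NoNewline avoids appending an extra newline to the rewritten file.
    Set-Content -LiteralPath $file.FullName -Value $updated -NoNewline
}

$result = Get-Content -LiteralPath $sample
$result[1]   # <secondTag> report42 </secondTag>

# Clean up the throwaway folder.
Remove-Item -LiteralPath $dir -Recurse
```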

Adding a header to a '|' delimited CSV file in Powershell?

I was wondering if anybody knows a way to achieve this without breaking/messing with the data itself?
I have a CSV file delimited by '|', which was created by retrieving data from SharePoint using an SPQuery and exported using Out-File (Export-Csv is not an option, since I would have to store the data in a variable, which would eat at the RAM of the server; querying remotely unfortunately will not work either, so I have to do this on the server itself). Nevertheless, I have the data I need, but I want to perform some manipulations, move and auto-calculate certain data within an Excel file, and then export said Excel file.
The problem I have right now is that I sort of need a header to the file. I have tried using the following code:
$header ="Name|GF_GZ|GF_Title|GF_UniqueId|GF_OldId|GFURL|GF_ITEMREDIRECTID"
$file = Import-Csv inputfilename.csv -Header $header | Export-Csv D:\outputfilename.csv
This is in PowerShell, but the issue is that when I perform the second Export-Csv, it delimits at anything that has a comma and thus removes it; I need the data to remain intact.
I have tried playing with the -Delimiter '|' setting on both the import and the export, but no matter what I do it seems to be cutting off the data. Is there a better way to simply add a row at the top (a header) without messing with the already existing file structure?
I have found out that using a delimiter such as -Delimiter '°' or any other special character removes my problem entirely, but I can never be sure whether such a character is going to show up in the dataset, and thus (as stated already) I am looking for a more "elegant" solution.
Thanks
One option you have is to create the original CSV with the headers first. Then when you are exporting the SharePoint data, use the switch -Append in the Out-File command to append the SP data to the CSV.
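A sketch of that header-first approach. The $rows array is a made-up stand-in for the SharePoint query output, and the output path is a temporary file; the one thing to watch is that both Out-File calls use the same -Encoding, since mixing encodings within one file would corrupt it.

```powershell
$header  = 'Name|GF_GZ|GF_Title|GF_UniqueId|GF_OldId|GFURL|GF_ITEMREDIRECTID'
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) ([System.IO.Path]::GetRandomFileName())

# Hypothetical stand-in for the rows produced by the SharePoint query.
$rows = 'Doc, final|GZ1|Title A|id-1|old-1|http://example.invalid/a|0',
        'Plan|GZ2|Title B|id-2|old-2|http://example.invalid/b|0'

# 1. Create the file with just the header line.
$header | Out-File -LiteralPath $outFile -Encoding utf8

# 2. Append the data as it is produced, using the same encoding.
$rows | Out-File -LiteralPath $outFile -Encoding utf8 -Append

$lines = Get-Content -LiteralPath $outFile
$lines.Count   # 3: the header plus two data rows

Remove-Item -LiteralPath $outFile
```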
I wouldn't even bother messing with it in csv format.
$header ="Name|GF_GZ|GF_Title|GF_UniqueId|GF_OldId|GFURL|GF_ITEMREDIRECTID"
$in_file = '.\inputfilename.csv'
$out_file = '.\outputfilename.csv'
$x = Get-Content $in_file
Set-Content $out_file -Value $header,$x
There's probably a more eloquent/refined two-liner for some of this, but this should get you what you need.
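Since memory use on the server is a concern, a streaming variant may be preferable: it sends the header and then the existing lines through the pipeline one at a time, so it should not need to hold the whole file in a variable. The file contents below are made up; the final Import-Csv call only checks that a comma inside a field survives when '|' is given as the delimiter.

```powershell
$header  = 'Name|GF_GZ|GF_Title|GF_UniqueId|GF_OldId|GFURL|GF_ITEMREDIRECTID'
$tempDir = [System.IO.Path]::GetTempPath()
$inFile  = Join-Path $tempDir ([System.IO.Path]::GetRandomFileName())
$outFile = Join-Path $tempDir ([System.IO.Path]::GetRandomFileName())

# Hypothetical header-less input, including a field that contains a comma.
Set-Content -LiteralPath $inFile -Value 'Doc, final|GZ1|Title A|id-1|old-1|http://example.invalid/a|0'

# Emit the header first, then stream the existing lines after it.
& { $header; Get-Content -LiteralPath $inFile } | Set-Content -LiteralPath $outFile

# Reading it back with the matching delimiter leaves the comma untouched.
$records = @(Import-Csv -LiteralPath $outFile -Delimiter '|')
$records[0].Name   # Doc, final

Remove-Item -LiteralPath $inFile, $outFile
```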

How can I search in PDF documents/PDX catalog in powershell

I have a vendor that supplies their documentation library as a series of PDF files (and some CHM files) and include a .PDX catalog also.
I want to write a powershell script to front end it (using either powershell forms, or hosting powershell in asp.net).
I'm in the early stages, I've worked out how to get document information from the PDF stream (the xmpmeta XML metadata block near the end of the PDF file - one of the few streams in the file that's in plaintext) which looks like this:
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04
"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="
" xmlns:pdf="http://ns.adobe.com/pdf/1.3/"><pdf:Producer>GPL Ghostscript 8.64</pdf:Producer><pdf:Keywo
rds>86000056-413</pdf:Keywords></rdf:Description><rdf:Description rdf:about="" xmlns:xmp="http://ns.ad
obe.com/xap/1.0/"><xmp:ModifyDate>2011-03-03T17:38:34-05:00</xmp:ModifyDate><xmp:CreateDate>2011-01-28
T23:12:07+05:30</xmp:CreateDate><xmp:CreatorTool>PScript5.dll Version 5.2</xmp:CreatorTool><xmp:Metada
taDate>2011-03-03T17:38:34-05:00</xmp:MetadataDate></rdf:Description><rdf:Description rdf:about="" xml
ns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"><xmpMM:DocumentID>6cb2263d-2d61-11e0-0000-1390d57dcfcb</xmp
MM:DocumentID><xmpMM:InstanceID>uuid:1a0e68ba-14ad-4a03-b7a1-0a0e127b8753</xmpMM:InstanceID></rdf:Desc
ription><rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:format>applicati
on/pdf</dc:format><dc:title><rdf:Alt><rdf:li xml:lang="x-default">I/O Subsystem Programming Guide</rdf
:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>Unisys Information Development</rdf:li></rdf:Seq
></dc:creator><dc:description><rdf:Alt><rdf:li xml:lang="x-default">ClearPath MCP 13.1,Application Dev
elopment,Administration,ClearPath MCP</rdf:li></rdf:Alt></dc:description></rdf:Description></rdf:RDF><
/x:xmpmeta>
using the following code (powershell v3, in v2 you need to select and expand the properties thus [string]$title = ($rdf.GetElementsByTagName('dc:title')| Select -expand Alt|Select -expand li)."#text"):
$file = ".\Downloads\68698703-007\PDF\86000056-413.pdf"
#determine what line in file the xmpmeta string starts
[int]$startln = (select-string -pattern '^<x:' $file).ToString().Split(":")[2]
#determine what line in file the xmpmeta string ends
[int]$endln = (select-string -pattern '^</x:' $file).ToString().Split(":")[2]
$startln--
#grab the xmpmeta and cast as type xml
[xml]$xmp = (gc $file)["$startln".."$endln"]
[xml]$rdf = $xmp.xmpmeta.InnerXml
#get title/creator/description element text
[string]$title = $rdf.GetElementsByTagName('dc:title').Alt.li."#text"
[string]$creator = $rdf.GetElementsByTagName('dc:creator').Alt.li."#text"
[string]$description = $rdf.GetElementsByTagName('dc:description').Alt.li."#text"
That's crucial because the filenames are in the format 12345678-123.pdf, the actual title is in the metadata itself, as well as document category etc.
So, I can produce a list of documents (displaying their proper titles, not the real filename) and allow them to be launched, but I also want to be able to search in all the documents using PDX file, but it's by no means plaintext!
I guess I could use one of a number of tools out there to convert each PDF into text, search it, repeat for each document and then return results for each document.
But, it strikes me that Adobe Reader already does that, so can I either start AcroRd32.exe with switches that will start the search, with search terms I've passed in to the AcroRd32 program, or can I use Adobe Search.API from within Powershell?
Any ideas specifically on automating load of the .PDX in Adobe Reader and firing off the search, or using adobe's API in powershell?
EDIT:
I can now launch acrobat from command line and search (so could mimic this in powershell) but the search only works when searching a PDF, not a PDX catalog. Both bring up the search pane, but only in a PDF document does the search field get populated and the search executed.
C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\00_home.pdx"
Or
C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\86000056-413.pdf"
Regards,
Graham
This is an old post, but be aware that the searching you do is potentially dangerous and that there is a better way to find the XMP metadata in a PDF file. XMP was designed specifically to be "findable" by text search. To that end, it has well-defined begin and end markers, which exist specifically so that you can extract the XMP data without having to parse the PDF format (or any other format the XMP metadata blob might be embedded in).
You can download the XMP specification here: http://www.adobe.com/devnet/xmp.html. Part 1 is the part where the explanation about XMP Packets explains how a text scanner can find the XMP packet with more accuracy.
Finally, PDF has an additional quirk that allows it to be incrementally updated. This might cause multiple XMP packets to appear in the file (where the last packet is normally the correct one). But annoyingly, when the PDF is exported from applications like InDesign, images in the PDF (and other objects) might also have their own "object" XMP attached to them.
So consider where your files come from and how many of these oddities you might encounter and want to provide for. Reading the XMP specification is certainly not a bad idea.
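A rough sketch of such a packet scan, under several assumptions: the metadata stream is stored uncompressed (as in the question), the packet text is UTF-8, and the markers are the <?xpacket begin=...?> and <?xpacket end=...?> processing instructions that wrap an XMP packet. Get-XmpPacket is a hypothetical helper name, and the input is synthetic bytes rather than a real PDF.

```powershell
function Get-XmpPacket {
    param([Parameter(Mandatory)][byte[]] $Bytes)

    # ISO-8859-1 maps each byte to one character, so binary data survives
    # the conversion to a string and back.
    $latin1 = [System.Text.Encoding]::GetEncoding(28591)
    $text   = $latin1.GetString($Bytes)

    # (?s) lets . span line breaks; the lazy quantifiers keep each match
    # within a single begin/end pair.
    $pattern = '(?s)<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>'
    foreach ($match in [regex]::Matches($text, $pattern)) {
        # Recover the packet's bytes and decode them as UTF-8 (assumed).
        [System.Text.Encoding]::UTF8.GetString($latin1.GetBytes($match.Groups[1].Value))
    }
}

# Synthetic stand-in for an incrementally updated file with two packets.
$begin = '<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
$end   = '<?xpacket end="w"?>'
$fake  = "%PDF-1.4`n$begin<x:xmpmeta>old</x:xmpmeta>$end`nother data`n$begin<x:xmpmeta>new</x:xmpmeta>$end"

$packets = @(Get-XmpPacket -Bytes ([System.Text.Encoding]::UTF8.GetBytes($fake)))
$packets.Count   # 2
$packets[-1]     # <x:xmpmeta>new</x:xmpmeta>
```

For a real file, the bytes could come from [System.IO.File]::ReadAllBytes() with a full path. The sketch returns every packet it finds and leaves the choice to the caller, since the last packet is not necessarily the document-level one when embedded objects carry their own XMP.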
