I am very new to PowerShell and have a small amount of Linux bash scripting experience. I have been looking for a way to get a list of files on a server that contain Social Security numbers. I found the following in my research, and it performed exactly as I wanted when I tested it on my home computer, except that it did not return results from my Word and Excel test documents. Is there a way to use a PowerShell command to get results from the various Office documents as well? This server holds almost all Word and Excel files, with a few PowerPoints.
PS C:\Users\Stephen> Get-ChildItem -Path C:\Users -Recurse -Exclude *.exe, *.dll | `
Select-String "\d{3}[- ]\d{2}[- ]\d{4}"
Documents\SSN:1:222-33-2345
Documents\SSN:2:111-22-1234
Documents\SSN:3:111 11 1234
PS C:\Users\Stephen> Get-ChildItem -Recurse | ?{ findstr.exe /mprc:. $_.FullName } | `
Select-String "[0-9]{3}[- ][0-9]{2}[- ][0-9]{4}"
Documents\SSN:1:222-33-2345
Documents\SSN:2:111-22-1234
Documents\SSN:3:111 11 1234
When interacting with MS Office files, the best approach is to use the COM interfaces to grab the information you need.
If you are new to PowerShell, COM will definitely involve somewhat of a learning curve, as very little "beginner" documentation exists on the internet.
I therefore strongly advise starting small:
First, focus on opening a single Word document and reading its contents into a string.
Once you have that working, focus on extracting the relevant info (the PowerShell -match operator is very helpful here).
Once you are able to work with a single Word document, locate all files named *.docx in a folder and repeat your process on each of them: foreach ($file in (ls *.docx)) { # work on $file }
Here's some reading (admittedly, all of this is for Excel, as I build automated Excel charting tools, but the lessons will be very helpful for automating any Office application):
Powershell and Excel - Introduction
A useful document from a deleted link (the link points to the Google cache of that doc) - http://dev.donet.com/automating-excel-spreadsheets-with-powershell
Introduction to working with "Objects" in PS - CodeProject
If you only need to cover .docx and .xlsx, you might also consider simply unzipping the files and searching through their contents, ignoring any XML tags (i.e. allowing one or more XML elements between the digit groups).
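That unzip-and-search idea can be sketched in a few lines; here is a minimal Python example (the function name and regex are mine) that treats a .docx as a ZIP archive, strips the XML tags out of word/document.xml, and then looks for SSN-shaped strings in what remains:

```python
import re
import zipfile

SSN_PATTERN = re.compile(r"\d{3}[- ]\d{2}[- ]\d{4}")

def find_ssns_in_docx(path):
    """Return all SSN-shaped strings found in a .docx file.

    A .docx is a ZIP archive; the body text lives in word/document.xml.
    Stripping the XML tags first means a number that is split across
    several formatting runs is still matched as one string.
    """
    with zipfile.ZipFile(path) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    text = re.sub(r"<[^>]+>", "", xml)  # drop every XML tag
    return SSN_PATTERN.findall(text)
```

For .xlsx the cell text mostly lives in xl/sharedStrings.xml instead, but the same strip-tags-then-match approach applies.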
I have recently been experimenting with Perl and some modules to read Excel files, and in particular the format of their cells.
For example, I wrote a piece of Perl code that used the module Spreadsheet::ParseExcel to read a cell's background colour. However, while testing I noticed that for certain files the colour returned by my Perl program did not match the colour reported by Excel. Eventually I found the reason: the file I was reading was an .xls file saved in compatibility mode. Basically, the creator of the file had used functionality of the Excel .xlsx format (2007+) to colour some of the cells, and then saved the file with the old .xls extension, which does not support the colours chosen.
So my question: is there any way to tell whether a given .xls file (or any other old Excel file format) has been saved in compatibility mode, without using Excel to find out? The reason I ask is that I am working in a Linux environment and can't use any Windows tools to analyse the files.
Furthermore, if one could identify that a given Excel file has indeed been saved in compatibility mode, is there any way of knowing how the original colours were mapped to the ones my program is reporting?
Many thanks for any help on this.
I do not think you can do this using Spreadsheet::ParseExcel. I tried saving an .xls file with a colour taken from an .xlsx, saving it with 2003 compatibility, and then comparing it with an empty 2003 .xls, and I do not see any difference between the files.
You can try the following code to debug this with your own files, looking for a difference you could use:
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Data::Dumper;
use JSON;
use Test::More tests => 1;
my $file_1 = 'test_xls.xls';
my $file_2 = 'compat_xls.xls';
my @files = (
    $file_1,
    $file_2,
);
my @workbooks;
foreach my $file (@files) {
    print("\n\nReading $file\n");
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse($file);
    # print Dumper($workbook->{PkgStr});

    # Strip out the parts that legitimately differ between any two files,
    # so that only formatting-related structures are left to compare:
    delete $workbook->{PkgStr};
    delete $workbook->{File};
    delete $workbook->{Worksheet}->[0]->{MinRow};
    delete $workbook->{Worksheet}->[0]->{RowHeight};
    delete $workbook->{Worksheet}->[0]->{_Pos};
    delete $workbook->{Worksheet}->[0]->{MinCol};
    delete $workbook->{Worksheet}->[0]->{MaxCol};
    delete $workbook->{Worksheet}->[0]->{MaxRow};
    delete $workbook->{Worksheet}->[0]->{Cells};
    delete $workbook->{Format}->[62];
    push @workbooks, $workbook;
}
is_deeply($workbooks[0], $workbooks[1], 'parsed workbooks are identical')
    or print Dumper($workbooks[0], $workbooks[1]);
I've changed the paths (polling URIs) to the XML data, but Windows still requests the old XML URL.
I tried updating the XML URL with the following steps:
Turn live tiles off
Unpin the tile
Clear the MS Edge browser cache and history
Delete all content within C:\Users\user_name\AppData\Local\Packages\Microsoft.MicrosoftEdge_randomized_hash\LocalState\PinnedTiles
Delete the file iconcache.db inside C:\Users\user_name\AppData\Local
Run Disk Cleanup
Then I start MS Edge again and pin the tiles to the Start menu, and the server logs show that Windows still requests the old XML path.
How do I update it? There must be some system cache, I suppose ...
I've spent a lot of time on this and would appreciate any advice!
Microsoft Support replied to my question; the cause was the MS Edge cache.
These steps helped me, and I hope they'll help someone else.
Please try the steps below to reset the Edge browser and check. Please be aware that resetting Microsoft Edge will remove your bookmarks and history. Follow the instructions provided below and check.
a. Navigate to the location:
C:\Users\%username%\AppData\Local\Packages\Microsoft.MicrosoftEdge_8wekyb3d8bbwe
b. Delete everything in this folder.
c. Type Windows Powershell in search box.
d. Right click on Windows Powershell and select Run as administrator.
e. Copy and paste the following command.
Get-AppXPackage -AllUsers -Name Microsoft.MicrosoftEdge | Foreach {
    Add-AppxPackage -DisableDevelopmentMode -Register "$($_.InstallLocation)\AppXManifest.xml" -Verbose
}
The above worked for me. However, I have since discovered that after unpinning the tile(s) you don't want, you can simply delete the unrequired tile folder(s) here: C:\Users\YOUR USERNAME\AppData\Local\Packages\Microsoft.MicrosoftEdge_8wekyb3d8bbwe\LocalState\PinnedTiles
Not to hijack the topic, but if you'd like to see all your bookmarks/favorites, here's a starter PowerShell script to give you that information.
You'll need Newtonsoft.Json.dll:
cls;
# Load Newtonsoft.Json (adjust the path to wherever you placed the DLL):
[Reflection.Assembly]::LoadFile("C:\Users\<YOUR USER FOLDER>\Documents\WindowsPowerShell\Newtonsoft.Json.dll") | Out-Null;
$source = "C:\Users\<YOUR USER FOLDER>\AppData\Local\Packages\Microsoft.MicrosoftEdge_8wekyb3d8bbwe\RoamingState";
$filter = "{*}.json";
$files = Get-ChildItem -Recurse -Path $source -Filter $filter -File;
foreach ($f in $files)
{
    # -Raw returns the file as one string instead of an array of lines:
    $json = Get-Content -Path $f.FullName -Raw;
    $result = [Newtonsoft.Json.JsonConvert]::DeserializeObject($json);
    $result.Title.ToString();
    $result.URL.ToString();
}
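For comparison, the same directory walk takes only a few lines of stdlib Python; note that the {GUID}.json filename pattern and the Title/URL keys are assumptions carried over from the script above:

```python
import json
from pathlib import Path

def read_favorites(roaming_state):
    """Yield (title, url) pairs from Edge's {GUID}.json favorite files."""
    for f in Path(roaming_state).rglob("{*}.json"):
        data = json.loads(f.read_text(encoding="utf-8"))
        yield data.get("Title"), data.get("URL")
```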
I'm using psloglist to analyse a saved event log from my Windows 2003 server; however, the critical information I need is not retrieved properly, and "message text not available. insertion strings" is appended instead. I've been searching for a long while and am still unable to find a solution or the root cause. Has anybody come across the same issue who could offer some help? Thanks.
psloglist \\localhost -d 7 application -o "Source" | find "MessageText"
I'm processing a data set and running into a problem: although I xlswrite all the relevant output variables to a big, timestamped Excel file, I don't save the code that actually generated that result. So if I try to recreate a certain set of results, I can't do it without relying on memory (which is obviously not a good plan). I'd like to know if there is a command (or commands) that would help me save the m-files used to generate the output Excel file, together with the Excel file itself, in a folder I can name and timestamp, so I don't have to do this manually.
In my perfect world I would run the master code file that calls 4 or 5 other function m-files, and then all those m-files would be saved along with the Excel output to a folder named results_YYYYMMDDTIME. Does this functionality exist? I can't seem to find it.
There's no such functionality built in.
You could build a dependency tree of your main function by using depfun with mfilename.
depfun(mfilename()) returns a list of all functions/m-files called by the currently executing m-file.
This list will include files that ship with MATLAB; you might want to remove those (and only record the MATLAB version in your Excel sheet).
For example:
% get all files the currently executing m-file depends on:
dependencies = depfun(mfilename(), '-quiet');
for k = 1:numel(dependencies)
    % skip anything that ships with MATLAB itself:
    if isempty(strfind(dependencies{k}, matlabroot))
        copyfile(dependencies{k}, your_folder);
    end
end
As a longer-term solution, you might want to check whether a version control system such as Subversion or Mercurial (or one of many others) would be applicable in your case.
In larger projects this is the preferred way to record the version of the source code used to produce a certain result.
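The copy-into-a-timestamped-folder part of the answer is easy to express in any language; as an illustration (the function and argument names here are made up), a small Python sketch of the same idea:

```python
import shutil
from datetime import datetime
from pathlib import Path

def archive_run(sources, output_file, prefix="results_"):
    """Copy the scripts and the output they produced into a
    freshly created results_YYYYMMDDHHMMSS folder."""
    folder = Path(prefix + datetime.now().strftime("%Y%m%d%H%M%S"))
    folder.mkdir()
    for src in list(sources) + [output_file]:
        shutil.copy(src, folder)
    return folder
```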
I'm trying to create a Node.js web app hosted on a Linux server. The app must read and parse a table in a Word document.
I've looked around and seen that PowerShell can accomplish this trivially. The problem is that PowerShell is an MS scripting language, and its Mac port (Pash) is very unstable and chokes whenever I try to execute something as simple as this:
$wd = New-Object -ComObject Word.Application
$wd.Visible = $true
$doc = $wd.Documents.Open($filename)
$doc.Tables | ForEach-Object {
$_.Cell($_.Rows.Count, $_.Columns.Count).Range.Text
}
I've looked into other solutions like Docsplit, but it's too generic (i.e. it converts an entire Word doc to plain text, which is not granular enough for my purposes).
Some have suggested the Saaspose API, but it costs a lot of money! I think I can do this myself.
Ideas?
Here's a python module that can read/write docx files:
https://github.com/mikemaccana/python-docx
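python-docx will expose the tables directly, but if you would rather avoid dependencies entirely, a .docx is just zipped XML, so the table cells can also be pulled out with Python's standard library alone. A rough sketch (the function name is mine):

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the tags in word/document.xml:
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_tables(docx_path):
    """Return each table in the document as a list of rows of cell strings."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    tables = []
    for tbl in root.iter(W + "tbl"):          # w:tbl = table
        rows = []
        for tr in tbl.iter(W + "tr"):         # w:tr = table row
            cells = []
            for tc in tr.iter(W + "tc"):      # w:tc = table cell
                # cell text is spread over one or more w:t text runs
                cells.append("".join(t.text or "" for t in tc.iter(W + "t")))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Since this only needs the standard library, it is easy to call from a Node.js app as a child process, printing the result as JSON.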
If you're deploying on a Linux machine, it's probably best to use Docsplit and then parse the output text, or you could try Apache POI.
Another option would be to try MS COM API running on Wine, but I'm not sure if it's compatible.