I have a single file that contains about 1 GB worth of data. This data is actually tens of thousands of individual mini files concatenated together.
I need to extract each individual file and place it in its own separate, distinct file.
So essentially, I need to go from a single file to 30K+ separate files.
Here is a sample of what my file looks like.
FILENAM1 VER 1 32 D
10/15/87 09/29/87
PREPARED BY ?????
REVISED BY ?????
DESCRIPTION USER DOMAIN
RECORD FILENAM1 VER 1 D SUFFIX -4541
100 05 ST-CTY-CDE-FMHA-4541 DISPLAY
200 10 ST-CDE-FMHA-4541 9(2) DISPLAY
300 10 CTY-CDE-FMHA-4541 9(3) DISPLAY
400 05 NME-CTY-4541 X(20) DISPLAY
500 05 LST-UPDTE-DTE-4541 9(06) DISPLAY
600 05 FILLER X DISPLAY 1REPORT NO. 08
DATA DICTIONARY REPORTER REL 17.0 09/23/21
PAGE 2 DREPORT 008
RECORD REPORT
-************************************************************************************************************************************
RECORD RECORD ---- D A T E ----
RECORD NAME LENGTH BUILDER TYPE
OCCURRENCES UPDATED CREATED
************************************************************************************************************************************ 0
FILENAM2 VER 1 176 D
03/09/98 02/21/84
PREPARED BY ??????
REVISED BY ??????
DEFINITION
I need to split the files out based upon a match of "VER" in positions 68, 69, and 70. I also need to name each file uniquely; that information is stored on the same line in positions 2-9. In the example above those strings are "FILENAM1" and "FILENAM2".
So, just using the example above, I would create two output files, named FILENAM1.txt and FILENAM2.txt.
Since I have 30K+ files I need to split, doing this manually is impossible.
I do have a script that will split a file into multiple files but it will not search for strings by position.
Would anyone be able to assist me with this?
Here is a script that does NOT work. Hopefully I can butcher it and get some valid results...
$InputFile = "C:\COPIES.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$OPName = @()
While (($Line = $Reader.ReadLine()) -ne $null) {
If ($Line -match "VER"(67,3)) {
$OPName = $Line.(2,8)
$FileName = $OPName[1].Trim()
Write-Host "Found ... $FileName" -foregroundcolor green
$OutputFile = "$FileName.txt"
}
Add-Content $OutputFile $Line
}
Thank you in advance,
-Ron
I suggest using a switch statement, which offers both convenient and fast line-by-line reading of files via -File and regex-matching via -Regex:
$streamWriter = $null
switch -CaseSensitive -Regex -File "C:\COPIES.txt" {
'^.(.{8}).{58}VER' { # Start of a new embedded file.
if ($streamWriter) { $streamWriter.Close() } # Close previous output file.
# Create a new output file.
$fileName = $Matches[1].Trim() + '.txt'
$streamWriter = [System.IO.StreamWriter] (Join-Path $PWD.ProviderPath $fileName)
$streamWriter.WriteLine($_)
}
default { # Write subsequent lines to the same file.
if ($streamWriter) { $streamWriter.WriteLine($_) }
}
}
$streamWriter.Close()
Note: A solution using the .Substring() method of the [string] type is possible too, but would be more verbose.
The ^.(.{8}).{58} portion of the regex matches the first 67 characters on each line, while capturing those in (1-based) columns 2 through 9 (the file name) via capture group (.{8}), which makes the captured text available in index [1] of the automatic $Matches variable. The VER portion of the regex then ensures that the line only matches if VER is found at column position 68.
For efficient output-file creation, [System.IO.StreamWriter] instances are used, which is much faster than line-by-line Add-Content calls. Additionally, with Add-Content you'd have to ensure that a target file doesn't already exist, as the existing content would then be appended to.
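For reference, the same position-based splitting logic can be sketched language-neutrally; here is a minimal Python version of the idea (the sample lines below are made up to match the described layout: file name in 1-based columns 2-9, "VER" at columns 68-70):

```python
import re

# Header lines: any char in column 1, the 8-char name in columns 2-9,
# 58 filler chars, then "VER" starting at column 68 (all 1-based).
header = re.compile(r'^.(.{8}).{58}VER')

def split_by_position(lines):
    """Group lines into (name, chunk) pairs, starting a new chunk at each header."""
    name, chunk, out = None, [], []
    for line in lines:
        m = header.match(line)
        if m:
            if name is not None:
                out.append((name, chunk))
            name, chunk = m.group(1).strip(), []
        chunk.append(line)
    if name is not None:
        out.append((name, chunk))
    return out

# Hypothetical sample input in the described fixed-column layout.
sample = [
    " FILENAM1" + " " * 58 + "VER 1",
    "  100 05 ST-CTY-CDE-FMHA-4541          DISPLAY",
    " FILENAM2" + " " * 58 + "VER 1",
]
print([name for name, _ in split_by_position(sample)])
```

Each `(name, chunk)` pair would then be written to `name + '.txt'`, mirroring what the StreamWriter does in the switch-based solution.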
Related
How can I force conversion to type System.Version in PowerShell, or more likely, better understand why I cannot arbitrarily assign number strings type System.Version?
We ingest some software updates in folders whose titles include version numbers. In trying to get reports on what the latest versions ingested are, I have been doing the following quick and dirty:
ForEach ($Folder in $(Get-ChildItem -Path $SoftwareDirectory -Directory))
{
$CurrentVersion = $Folder -Replace "[^0-9.]"
If ($CurrentVersion -ne $null)
{
If ([System.Version]$CurrentVersion -gt [System.Version]$MaxVersion)
{
$MaxVersion = $CurrentVersion
$MaxFolder = $Folder
}
}
}
This would be fed directory titles like the following,
foo-tools-1.12.file
bar-static-3.4.0.file
Most of the time, this is acceptable. However, when encountering some oddballs with longer numbers, like the following,
applet-4u331r364.file
In this case, System.Version rejects the resulting string:
Cannot convert value "4331364" to type "System.Version". Error: "Version string portion was too short or too long."
You need to ensure that your version strings have at least two components in order for a cast to [version] to succeed:
(
@(
'foo-tools-1.12.file'
'bar-static-3.4.0.file'
'applet-4u331r364.file'
) -replace '[^0-9.]'
).TrimEnd('.') -replace '^[^.]+$', '$&.0' | ForEach-Object { [version] $_ }
The above transforms 'applet-4u331r364.file' into '4331364.0', which works when cast to [version].
Note that you can avoid the need for .TrimEnd('.') if you exclude the filename extension to begin with: $Folder.BaseName -replace '[^0-9.]'
-replace '^[^.]+$', '$&.0' matches only strings that contain no . chars., in full, i.e. only those that don't already have at least two components; replacement expression $&.0 appends literal .0 to the matched string ($&).
Output (via Format-Table -AutoSize):
Major Minor Build Revision
----- ----- ----- --------
1 12 -1 -1
3 4 0 -1
4331364 0 -1 -1
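The normalization steps (strip non-digit/non-dot characters, trim any trailing dot, and append .0 when only one component remains) translate directly to other languages; a Python sketch of the same idea, using the sample names from above:

```python
import re

def normalize_version(name):
    """Keep only digits and dots, trim trailing dots, ensure at least two components."""
    v = re.sub(r'[^0-9.]', '', name).rstrip('.')
    if '.' not in v:        # single component: append ".0" so parsing succeeds
        v += '.0'
    return v

names = ['foo-tools-1.12.file', 'bar-static-3.4.0.file', 'applet-4u331r364.file']
print([normalize_version(n) for n in names])
```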
We have a text file of students and their grades, and we have to count how many "1" grades all the students got.
My code shows how many lines contain a "1" grade, but when it finds a "1", it jumps to the next line.
Could you help me, please?
for example:
Huckleberry Finn 2 1 4 1 1
Tom Sawyer 3 2 1 4 1
It should be 5, but it gets 2.
$ones = 0
$file= Get-Content notes.txt
foreach ($i in $file) {
if ($i.Split(' ') -eq 1){
$ones ++
}
}
$ones
If all the 1 tokens are whitespace-separated in your input file, as the sample content suggests, try:
# With your sample file content, $ones will receive the number 5.
$ones = (-split (Get-Content -Raw notes.txt) -eq '1').Count
The above uses the unary form of -split, the string splitting operator and the -Raw switch of the Get-Content cmdlet, which loads a file into memory as a single, multi-line string.
That is, the command above splits the entire file into white-space separated tokens, and counts all tokens that equal '1', across all lines.
If, instead, you meant to count the number of '1' tokens per line, use the following approach:
# With your sample file content, $ones will receive the following *array*:
# 3, 2
$ones = Get-Content notes.txt | ForEach-Object { ((-split $_) -eq '1').Count }
As for what you tried:
if ($i.Split(' ') -eq 1)
While $i.Split(' ') does return all (single-)space-separated tokens contained in a single input line stored in $i, using that expression in a conditional expression of an if statement only results in one invocation of the associated { ... } block and therefore increments $ones only by a value of 1, not the number of 1 tokens in the line at hand.
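The same whole-file versus per-line counting logic in a quick Python sketch, using the sample data from the question:

```python
text = """Huckleberry Finn 2 1 4 1 1
Tom Sawyer 3 2 1 4 1"""

# Whole-file count: split on any whitespace, count tokens equal to '1'.
total = text.split().count('1')

# Per-line counts, analogous to the ForEach-Object variant.
per_line = [line.split().count('1') for line in text.splitlines()]

print(total, per_line)   # 5 [3, 2]
```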
Solved!
Thank you mklement0!
I don't understand why, but it works like this:
$ones = 0
$file = Get-Content notes.txt
foreach ($i in $file) {
$ones += ((-split $i) -eq '1').Count
}
$ones
I want to do the following:
I need to check the content of a (text) file. If a defined string is not there, it has to be inserted at a specific position.
I.e.:
My textfile is a configuration file with different sections, example:
[default]
name=bob
alias=alice
foo=bar
example=value
[conf]
name=value
etc=pp
I want to check if the strings "foo=bar" and "example=value" exist in this file. If not, they have to be inserted, but I can't just append the new lines, since they have to be in a certain section (here [default]), not at the end of the file. The position within the section doesn't matter.
I tried the following PowerShell script, which just looks for a string that definitely exists and adds the new lines after it. That ensures the new lines get inserted in the right section, but it can't prevent duplicates, since the script doesn't check whether they already exist.
$InputFile = "C:\Program Files (x86)\Path\to\file.ini"
$find = [regex]::Escape("alias=alice")
$addcontent1 = "foo=bar"
$addcontent2 = " example=value `n"
$InputFileData = Get-Content $InputFile
$matchedLineNumber = $InputFileData |
Where-Object{$_ -match $find} |
Select-Object -Expand ReadCount
$InputFileData | ForEach-Object{
$_
if ($_.ReadCount -eq ($matchedLineNumber)) {
$addcontent1
$addcontent2
}
} | Set-Content $InputFile
As mentioned by Bill_Stewart, Ansgar Wiechers and LotPings there are multiple modules to work with .ini files available in the web.
Let's take this one for example. Once you download it and import you can see how it imports your file (I removed foo=bar to demonstrate):
PS C:\SO\51291727> $content = Get-IniContent .\file.ini
PS C:\SO\51291727> $content
Name Value
---- -----
default {name, alias, example}
conf {name, etc}
From here what you want to do is pretty simple - check if key exists and if not - add:
if ($content.default.foo -ne 'bar') {
$content.default.foo='bar'
}
Verify that the value has been inserted:
PS C:\SO\51291727> $content.default
Name Value
---- -----
name bob
alias alice
example value
foo bar
And export:
$content | Out-IniFile .\out.ini
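The same check-if-missing-then-insert pattern works with any INI library; for comparison, a sketch with Python's standard-library configparser (note that its writer emits `key = value` with spaces by default):

```python
import configparser
import io

# The sample file from the question, with foo=bar removed to demonstrate.
ini_text = """[default]
name=bob
alias=alice
example=value

[conf]
name=value
etc=pp
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)

# Insert foo=bar into [default] only if the key is missing.
if not cfg.has_option('default', 'foo'):
    cfg.set('default', 'foo', 'bar')

out = io.StringIO()
cfg.write(out)
print(out.getvalue())
```

Because the key is set on the section object, it lands inside [default] regardless of where the section sits in the file, which is exactly what appending to the end of the file cannot guarantee.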
I've been using this (and this) script to delete older SharePoint backups, but it deletes all backups rather than only the 14+ day old ones.
I ran it through powershell_ise.exe and put a breakpoint under the line that has $_.SPStartTime in it, and it shows $_.SPStartTime as empty, as if the date isn't populated. I looked inside $sp.SPBackupRestoreHistory.SPHistoryObject and that contains the data I expect.
The part that is the issue is this line:
# Find the old backups in spbrtoc.xml
$old = $sp.SPBackupRestoreHistory.SPHistoryObject |
? { $_.SPStartTime -lt ((get-date).adddays(-$days)) }
I get all of the dates output (which I would expect). This tells me the problem is in the 'where' or '?' (I understand they are interchangeable). Regardless, $old always appears to be null.
As Requested:
<?xml version="1.0" encoding="utf-8"?>
<SPBackupRestoreHistory>
<SPHistoryObject>
<SPId>a8a03c50-6bc2-4af4-87b3-caf60e750fa0</SPId>
<SPRequestedBy>ASERVER\AUSER</SPRequestedBy>
<SPBackupMethod>Full</SPBackupMethod>
<SPRestoreMethod>None</SPRestoreMethod>
<SPStartTime>01/09/2011 00:00:13</SPStartTime>
<SPFinishTime>01/09/2011 00:05:22</SPFinishTime>
<SPIsBackup>True</SPIsBackup>
<SPConfigurationOnly>False</SPConfigurationOnly>
<SPBackupDirectory>E:\Backups\spbr0003\</SPBackupDirectory>
<SPDirectoryName>spbr0003</SPDirectoryName>
<SPDirectoryNumber>3</SPDirectoryNumber>
<SPTopComponent>Farm</SPTopComponent>
<SPTopComponentId>689d7f0b-4f64-45d4-ac58-7ab225223625</SPTopComponentId>
<SPWarningCount>0</SPWarningCount>
<SPErrorCount>0</SPErrorCount>
</SPHistoryObject>
<SPHistoryObject>
<SPId>22dace04-c300-41d0-a9f1-7cfe638809ef</SPId>
<SPRequestedBy>ASERVER\AUSER</SPRequestedBy>
<SPBackupMethod>Full</SPBackupMethod>
<SPRestoreMethod>None</SPRestoreMethod>
<SPStartTime>01/08/2011 00:00:13</SPStartTime>
<SPFinishTime>01/08/2011 00:05:26</SPFinishTime>
<SPIsBackup>True</SPIsBackup>
<SPConfigurationOnly>False</SPConfigurationOnly>
<SPBackupDirectory>E:\Backups\spbr0002\</SPBackupDirectory>
<SPDirectoryName>spbr0002</SPDirectoryName>
<SPDirectoryNumber>2</SPDirectoryNumber>
<SPTopComponent>Farm</SPTopComponent>
<SPTopComponentId>689d7f0b-4f64-45d4-ac58-7ab225223625</SPTopComponentId>
<SPWarningCount>0</SPWarningCount>
<SPErrorCount>0</SPErrorCount>
</SPHistoryObject>
</SPBackupRestoreHistory>
I believe the issue had to do with the date formatting.
Final working script:
# Location of spbrtoc.xml
$spbrtoc = "E:\Backups\spbrtoc.xml"
# Days of backup that will be remaining after backup cleanup.
$days = 14
# Import the Sharepoint backup report xml file
[xml]$sp = gc $spbrtoc
# Find the old backups in spbrtoc.xml
$old = $sp.SPBackupRestoreHistory.SPHistoryObject |
? { ( (
[datetime]::ParseExact($_.SPStartTime, "MM/dd/yyyy HH:mm:ss", [System.Globalization.CultureInfo]::InvariantCulture)
) -lt (get-date).adddays(-$days)
)
}
if ($old -eq $Null) { write-host "No reports of backups older than $days days found in spbrtoc.xml.`nspbrtoc.xml isn't changed and no files are removed.`n" ; break}
# Delete the old backups from the Sharepoint backup report xml file
$old | % { $sp.SPBackupRestoreHistory.RemoveChild($_) }
# Delete the physical folders in which the old backups were located
$old | % { Remove-Item $_.SPBackupDirectory -Recurse }
# Save the new Sharepoint backup report xml file
$sp.Save($spbrtoc)
Write-host "Backup(s) entries older than $days days are removed from spbrtoc.xml and harddisc."
It looks to me like you're ending up with a string-based comparison rather than a date-based one, so for example:
"10/08/2007 20:20:13" -lt (Get-Date -Year 1900)
Here the DateTime object is cast to a string for the comparison, and the leading digit of your timestamp will always sort before the "Sunday" or "Monday" or whatever you would get at the front of that string...
I don't have access to a set of backups I could test this on, but for starters, you should fix that, and at the same time, make sure that you're not deleting the backup just because the value is null:
# Find the old backups in spbrtoc.xml
$old = $sp.SPBackupRestoreHistory.SPHistoryObject |
Where { (Get-Date $_.SPStartTime) -lt ((get-date).adddays(-$days)) }
The date string format in the XML file (according to the docs page) is one that Get-Date can readily parse, so that should work without any problems.
Incidentally, your assumption is right about $_ being the current iteration object from the array ;)
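The string-versus-date pitfall is easy to reproduce in any language; a small Python sketch of the difference, using two hypothetical timestamps in the spbrtoc.xml format (MM/dd/yyyy HH:mm:ss):

```python
from datetime import datetime

a = "01/09/2011 00:00:13"
b = "12/31/2010 23:59:59"

# Lexicographic comparison gets the order backwards: "01..." sorts before "12...".
string_says_older = a < b

# Parsing first compares actual points in time: January 2011 is after December 2010.
fmt = "%m/%d/%Y %H:%M:%S"
date_says_older = datetime.strptime(a, fmt) < datetime.strptime(b, fmt)

print(string_says_older, date_says_older)   # True False
```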
I have to process single PDFs that have each been created by merging multiple PDFs. In each merged PDF, the places where the component PDFs start are marked with bookmarks.
Is there any way to automatically split this up by bookmarks with a script?
We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.
pdftk can be used to split the PDF file and extract the page numbers of the bookmarks.
To get the page numbers of the bookmarks do
pdftk in.pdf dump_data
and make your script read the page numbers from the output.
Then use
pdftk in.pdf cat A-B output out_A-B.pdf
to get the pages from A to B into out_A-B.pdf.
The script could be something like this:
#!/bin/bash
infile=$1 # input pdf
outputprefix=$2
[ -e "$infile" -a -n "$outputprefix" ] || exit 1 # Invalid args
pagenumbers=( $(pdftk "$infile" dump_data | \
grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | uniq)
end )
for ((i=0; i < ${#pagenumbers[@]} - 1; ++i)); do
a=${pagenumbers[i]} # start page number
b=${pagenumbers[i+1]} # end page number
[ "$b" = "end" ] || b=$((b-1))
pdftk "$infile" cat $a-$b output "${outputprefix}"_$a-$b.pdf
done
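The page-range arithmetic in the loop above (each part runs from its bookmark's page to one page before the next bookmark, with the last part running to the end) can be sketched on its own; here in Python, with made-up BookmarkPageNumber values:

```python
def page_ranges(bookmark_pages):
    """Turn sorted bookmark start pages into (start, end) ranges;
    the last range is open-ended ('end')."""
    ranges = []
    for i, start in enumerate(bookmark_pages):
        if i + 1 < len(bookmark_pages):
            ranges.append((start, bookmark_pages[i + 1] - 1))
        else:
            ranges.append((start, 'end'))
    return ranges

print(page_ranges([1, 5, 12]))   # [(1, 4), (5, 11), (12, 'end')]
```

Each resulting range maps to one `pdftk in.pdf cat A-B output ...` invocation.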
There's a command-line tool written in Java called Sejda, where you can find the splitbybookmarks command that does exactly what you asked. It's Java, so it runs on Linux, and being a command-line tool, you can write a script around it.
Disclaimer
I'm one of the authors
There are programs built for PDF splitting that can do this for you:
A-PDF Split is a very simple, lightning-quick desktop utility program that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control in terms of how files are split and how the split output files are uniquely named. A-PDF Split provides numerous alternatives for how your large files are split - by pages, by bookmarks and by odd/even page. Even you can extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for use with repetitive file-splitting tasks. A-PDF Split represents the ultimate in file splitting flexibility to suit every need.
A-PDF Split works with password-protected pdf files, and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.
A-PDF Split does NOT require Adobe Acrobat, and produces documents compatible with Adobe Acrobat Reader Version 5 and above.
Edit: I also found a free, open-source program Here if you do not want to pay.
Here's a little Perl program I use for the task. Perl isn't special; it's just a wrapper around pdftk to interpret its dump_data output to turn it into page numbers to extract:
#!perl
use v5.24;
use warnings;
use Data::Dumper;
use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);
my $pdftk = '/usr/local/bin/pdftk';
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';
die "Can't find $ARGV[0]\n" unless -e $file;
# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data';
my @chapters;
while( <$pdftk_fh> ) {
state $chapter = 0;
next unless /\ABookmark/;
if( /\ABookmarkBegin/ ) {
my( $title ) = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
my( $level ) = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;
my( $page_number ) = <$pdftk_fh> =~ /\ABookmarkPageNumber:\s+(.+)/;
# I only want to split on chapters, so I skip higher
# level numbers (higher means more nesting, 1 is lowest).
next unless $level == 1;
# If you have front matter (preface, etc) then this numbering
# will be off. Chapter 1 might be called Chapter 3.
push @chapters, {
title => $title,
start_page => $page_number,
chapter => $chapter++,
};
}
}
# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
my $last_page = $chapters[$i+1]->{start_page} - 1;
$chapters[$i]->{last_page} = $last_page;
}
$chapters[$#chapters]->{last_page} = 'end';
make_path $split_dir;
foreach my $chapter ( @chapters ) {
my( $start, $end ) = $chapter->@{qw(start_page last_page)};
# slugify the title so use it as a filename
my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );
my $path = catfile( $split_dir, "$title.pdf" );
say "Outputting $path";
# Use pdftk to extract that part of the PDF
system $pdftk, $file, 'cat', "$start-$end", 'output', $path;
}