Shell script - convert Excel (xlsx) to CSV - remove blank space / tab space - excel

I receive an Excel file (xlsx) with multiple sheets for my project. The number of records on these sheets ranges from 15k to 70k per sheet. I need to perform the following tasks on this data and then convert it to CSV. Or convert to CSV first and then process the data; either way is fine.
Input Example:
call_no  uniq_no  Type  Strength   Description
2456     15       TX    SomeSting  SomeSting
5263     15       BLL   SomeSting  SomeSting
4263     162      TX               SomeSting
2369     215      LH    SomeSting
4269     426      BLL   SomeSting  SomeSting
7412     162      TX    SomeSting  SomeSting
As per the requirement, I need to:
Find duplicate values in column 'uniq_no' and delete all duplicate records except the original record (the first one).
Replace blanks with data. (Just simple find-blank-and-replace-with-value logic.)
Remove spaces/tabs in any cell. (This point is not important; it's just a side quest.)
Output Example:
call_no  uniq_no  Type  Strength     Description
2456     15       TX    SomeSting    SomeSting
4263     162      TX    **NewDATA**  SomeSting
2369     215      LH    SomeSting    **NewDATA**
4269     426      BLL   SomeSting    SomeSting
This is a routine task for me. I have fair knowledge of shell scripting, so if anyone can guide me with even a rough outline of a script for this, I can do the tweaks at my end. Please help.

Update: the desired platform for the script has been clarified and a VBScript response is no longer applicable. However, I will leave this response here in case a future viewer of this question stumbles upon it and finds it useful. Anyone writing a shell script on Ubuntu may be able to port over some aspects of this VBScript as well.
Here is something to get you started. If you record actions with Excel's macro recorder remember that using the same commands in a VBS means you have to get rid of all of the named parameters.
prep_xlsx.vbs
Set objExcel = WScript.CreateObject("Excel.Application")
objExcel.Visible = True 'False 'True for testing
strFileName = "c:\tmp\vbs_test.xlsx"
Set objWb = objExcel.Workbooks.Open(strFileName)
Set objWs = objWb.Worksheets(1)
With objWs
    With .Cells(1, 1).CurrentRegion
        .Cells.SpecialCells(4) = "**NewDATA**" ' 4 is xlCellTypeBlanks
        .Cells.RemoveDuplicates 2, 1 ' Columns:=2, Header:=xlYes
        For c = 1 To .Columns.Count
            With .Columns(c)
                .TextToColumns .Cells(1), 2 ', Array(0, 1) 'Range("C1"), DataType:=xlFixedWidth, FieldInfo:=Array(0, 1)
            End With
        Next 'next c
    End With
End With
objWb.Close True 'save on close
objExcel.Quit
Set objExcel = Nothing
It should be noted that removing leading/trailing spaces with the Range.TextToColumns method and xlFixedWidth can lead Excel to try to split the column in two if there are too many leading spaces. Currently, this will halt the process, because Excel asks for confirmation before overwriting the next column's values (which you do not want to do). There has to be a significant number of spaces for Excel to guess that the text belongs in two columns, so unless there are more spaces than the length of a typical word there is nothing to worry about; it is just something to be aware of. For example, if there were twice as many leading spaces in D6, Excel might want to split it across two columns.
            vbs_test.xlsx before prep_xlsx.vbs
            vbs_test.xlsx after prep_xlsx.vbs
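For readers on Linux, as the question asks, the same three steps can be sketched outside Excel once a sheet has been exported to CSV. The following is only a rough outline in Python, not a drop-in solution: it assumes the export step has already been done by some other tool, that "remove space/tab" means trimming leading and trailing whitespace in each cell, and that the filler value is the `**NewDATA**` string from the example. The function name `clean_csv` and its parameters are made up for illustration.

```python
import csv
import io


def clean_csv(text, key="uniq_no", filler="**NewDATA**"):
    """Trim cells, drop rows whose `key` value was already seen, fill blanks."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    key_idx = header.index(key)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)

    seen = set()
    for row in reader:
        if not row:
            continue  # skip completely empty lines
        cells = [c.strip() for c in row]             # trim spaces and tabs
        cells += [""] * (len(header) - len(cells))   # pad short rows
        if cells[key_idx] in seen:
            continue  # keep only the first record for each key
        seen.add(cells[key_idx])
        writer.writerow([c or filler for c in cells])  # fill blanks
    return out.getvalue()
```

Reading the file with `open(path, newline="")` and passing its contents to `clean_csv` would apply this to one exported sheet; whether duplicates should be detected before or after trimming is a choice the sketch makes (after), not something the question specifies.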

Related

Using VBA Regex To Remove a space within a pattern

I am very new to VBA and I'm working with data from a chemistry instrument which outputs values that are not uniformly delimited and contain special characters. I am trying to import these values into Excel and have solved pretty much all of the problems except for one. When I import these values into Excel they are read in line by line. Each line that is read in is contained within its own cell in column A. There can be anywhere from 50 to roughly 1000 columns of data, with the associated identifiers and metadata above. Below is a copy/paste of the first 5 lines of data.
1 7.724 1190 1231 1292 PV 4 724391 121434659 49.60% 9.688%
2 9.272 1451 1481 1484 VB 3961552 186833117 76.32% 14.905%
3 10.968 1732 1754 1816 VV 2673526 111034313 45.36% 8.858%
4 15.249 2382 2445 2453 PV 296082 33844178 13.82% 2.700%
5 15.384 2453 2466 2500 VV 219908 14461812 5.91% 1.154%
The problem I am having is that sometimes multiple peaks make up one value, and these are recorded as two letters, a space, and one or two digits (0-9), whereas peak types with only one peak are just two letters. For an example, look at line 1, where there is "PV 4". I am trying to use regular expressions to loop through column A, starting at row 18 and ending around row 1000, to find the letters and associated numbers and remove the interstitial space so that the cell will look like this:
1 7.724 1190 1231 1292 PV4 724391 121434659 49.60% 9.688%
Once it is in that form, I can use the space delimiter to separate the cells without frame shifting the ones that have the multiple peak types.
Here is the code I've written so far, but I am unsure how to proceed:
Sub PKTYRegexRemoveSpace()
    Dim StrPattern As String: StrPattern = "[A,B,H,M,N,P,S,T,U,V,X,\+][A,B,H,M,N,P,S,T,U,V,X,\+]\s[0-9]{1,2}\s"
    Dim StrInput As String
    Dim MyRange As Range
    Dim regEx As New RegExp
    Dim Cell As Range
    Set MyRange = ActiveSheet.Range("A22:A24")
    For Each Cell In MyRange
        If StrPattern <> "" Then
            StrInput = Cell.Value
            With regEx
                .Pattern = StrPattern
                .Global = False
                .IgnoreCase = False
            End With
            If regEx.Test(StrInput) Then
                MsgBox (regEx.Replace(StrInput, *this is where I need help*))
            Else
                MsgBox ("Not matched")
            End If
        End If
    Next
End Sub
I am using a MsgBox during development in order to avoid having to re-import the file for every failed replacement attempt.
Any help would be greatly appreciated!
I suggest changing the regex pattern to use capturing groups and word-boundary tokens:
\b([A,B,H,M,N,P,S,T,U,V,X,\+][A,B,H,M,N,P,S,T,U,V,X,\+])\s([0-9]{1,2})\b
Then, for the replace string:
$1$2
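To see the capture-group idea in action outside Excel, here is a small Python sketch applied to the sample lines from the question. It is an illustration rather than the VBA fix itself: Python writes the replacement as `\1\2` instead of `$1$2`, the commas have been dropped from the character class (inside brackets they would match literal commas), and the helper name `join_peak_type` is made up. `count=1` mirrors `.Global = False`.

```python
import re

# Two peak-type letters, one whitespace character, then one or two digits.
PEAK_TYPE = re.compile(r"\b([ABHMNPSTUVX+]{2})\s([0-9]{1,2})\b")


def join_peak_type(line: str) -> str:
    """Remove the space between a peak type and its peak count, e.g. 'PV 4' -> 'PV4'."""
    return PEAK_TYPE.sub(r"\1\2", line, count=1)


if __name__ == "__main__":
    print(join_peak_type("1 7.724 1190 1231 1292 PV 4 724391 121434659 49.60% 9.688%"))
```

The trailing `\b` is what keeps lines such as `VB 3961552` untouched: after the first two digits comes another digit, so there is no word boundary and the pattern does not match.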

Fastest way to insert multiple page breaks in Excel with VBA

I need to insert multiple page breaks into a document (a few hundred) and am trying to find the fastest way. I know that with most sheet operations in Excel it is fastest to batch them as an array or range and then perform them all at once. But with page breaks I can't see a way around performing them individually within a loop.
The code below works but is slow
With shtBagCheckList
    .Cells.PageBreak = xlPageBreakNone
    For i = 13 To lngArrayRows Step 12
        .Cells(i, 1).PageBreak = xlPageBreakManual
    Next i
End With
A technique I sometimes use to apply formatting is to combine all the ranges that require the same formatting into one range and then apply the formatting to the combined range. Like this:
Set rngToBold = Union(rngToBold, Rng2) 'etc in a loop
rngToBold.Font.Bold = True
But when I do that with page breaks, the break is only applied to the first element in the range. For example, the code below only creates a page break at row 12, and none at row 24:
set rng = range("A12", "A24")
rng.PageBreak = xlPageBreakManual
What is the fastest way to add many page breaks into a sheet?
Update
Here are the results I got using tips from user11121185 and Tim Williams.
It turns out that using HPageBreaks.Add is far faster:
1 rng.Page Break, calculation on: 20.594 (45%)
2 rng.Page Break, calculation off: 13.219 (29%)
3 rng.Page Break, calculation off, display breaks off: 11.922 (26%)
4 hbreaks.add,calculation on: 0.018 (0%)
5 hbreaks.add, calculation off : 0.203 (0%)
6 hbreaks.add, calculation off, display page breaks off: 0.312 (1%)
Total Time: 46.267
(calculation off means automatic calculation and screen updating off)
Usage is like this:
sheet.HPageBreaks.Add sheet.range("A10")
Or:
With shtBagCheckList
    .Cells.PageBreak = xlPageBreakNone
    For i = 13 To lngArrayRows Step 12
        .HPageBreaks.Add .Cells(i, 1)
    Next i
End With

How to insert page breaks to keep certain data ranges together?

I'm trying to create a monthly report for my company and I'm having trouble with the aesthetics of it. I have tables (with lots more data) that look like this:
As text, it looks like this (in case the picture is unavailable):
Customer 1 - 2018-###-#33
Customer 2 - 2017-###-###
----------------2018-###-###
Customer 3- 2018-###-###
etc
But sometimes there is a page break in the middle of a customer's data, so that the title shows up on the first page but the last row(s) are on a new page.
I am trying to come up with code to sense whether or not the data range will be split up, but I cannot. I had thought the page break could perhaps be anticipated from the row number (as that's how I'm grouping the data), but one sheet has the break on row 101 (because its data starts at row 46) and the other has the page break on row 106.
I have tried doing something similar to this:
If i >= 102 And start.row <= 101 Then
    Worksheets(storenumber).HPageBreaks.Add before:=Rows(start.row)
Else
    If i >= 176 And start.row <= 166 Then
        Worksheets(storenumber).HPageBreaks.Add before:=Rows(start.row)
    End If
End If
(with varying values of i and combinations) but I haven't had any success.
Does anyone have any other ideas? There is normally only one page break per storenumber sheet, with the exception of one sheet (and some sheets don't need a page break at all, as they only have, say, 5 customers).
I'm looking for simpler code (I'm an Excel beginner/intermediate) but I'm open to any ideas. Thanks for your help!
This code seems to work for me:
Option Explicit
Sub BreakPagesCleanly()
    Dim pb As HPageBreak
    Dim rng As Range
    Dim i As Long
    ActiveWindow.View = xlPageBreakPreview
    ActiveSheet.ResetAllPageBreaks
    For Each pb In ActiveSheet.HPageBreaks
        If pb.Location.Value = "" Then
            'go back to the last value
            Set rng = pb.Location.End(xlUp)
            ActiveSheet.HPageBreaks.Add before:=rng
            DoEvents 'Seems to be required to give Excel time to catch up
        End If
    Next pb
End Sub
Here's the before view, vs the after:

Excel macro to move column

Basically I'm looking to move a column from one position to another.
561 DISK_GROUP_003 0 545 1
561 Disk_Group_iS 95 84144 80210
561 DISK_GROUP_iS 99 26335 26304
1415 t1_200ea 93 8804 8203
1415 t2_30010k 35 59846 21121
1415 t3_1tb72k 19 184941 36590
1415 t3_3tb72k 86 258635 224328
5018 t1_200ea 98 9905 9802
5018 t2_30015k 89 39987 35986
5018 t2_60015k 67 59984 40700
5018 t3_1tb72k 89 87567 78807
5018 t3_2tb72k 84 94412 79620
I need to move the 3rd column to the end at the right.
This is what I have tried so far:
Sub moveColumn()
    With ActiveSheet
        Excel.Columns(3).Cut
        Excel.Columns(6).PasteSpecial
    End With
End Sub
But this method doesn't work; it raises run-time error '1004'.
Any help would be much appreciated.
For those wondering, it's possible to do this without replacing the contents of the destination column.
For example, to cut column B and insert it to the left of column F, you can use
Columns("B").Cut
Columns("F").Insert Shift:=xlToRight
You can also replace the column letters with column indices, to taste (so Columns("B") becomes Columns(2)).
PasteSpecial doesn't work with Cut. You can do this:
Columns(3).Cut Range("F1")
Columns(3).Delete Shift:=xlToLeft 'if you want to delete the empty column
The problem with the other answers given is that the cut/paste technique uses the clipboard—overwriting whatever is in it, and making it impossible for the program to operate correctly if another program that also uses the clipboard is running (such as another instance of the same VBA project).
Instead, do this:
Application.CutCopyMode = False ' don't want an existing operation to interfere
Columns("F").Insert XlDirection.xlToRight
Columns("F").Value = Columns("B").Value ' this would be one greater if to the right of F
Columns("B").Delete
Just note that if there are references to the existing column, they will break and not be updated.

Using VBA, how can I select every other cell in a row range (to be copied and pasted vertically)?

I have a 2200+ page text file. It is delivered to us from a customer through a data exchange, with asterisks to separate values and tildes (~) to denote the end of a row. The file is sent to me as a text file in Word. Most rows are split in two (1 row covers a full line and part of a second line). I transfer segments (10-page chunks) of it at a time into Excel where, unfortunately, any zeroes that occur at the end of a row get discarded in the "text to columns" procedure. So I eyeball every "long" row to ensure that zeroes were not lost, and manually re-enter any that were.
Here is a small bit of sample data:
SDQ EA 92 1551 378 1601 151 1603 157 1604 83
The "SDQ, EA, and 92" are irrelevant (artifacts of data transmission). I want to use Excel and/or VBA to select 1551, 1601, 1603, and 1604 (these are store numbers) so that I can copy those values, and transpose paste them vertically. I will then go back and copy 378, 151, 157, and 83 (sales values) so that I can transpose paste them next to the store numbers. The next two rows of data contain the same store numbers but give the corresponding dollar values. I will only need to copy the dollar values so they can be transpose pasted vertically next to unit values (e.g. 378, 151, 157, and 83).
Just being able to put my cursor on the first cell of interest in the row and run a macro to copy every other cell would speed up my work tremendously. I have tried using ActiveCell and Offset references to select a range to copy, but have not been successful. Does anyone have any suggestions for me? Thanks in advance for the help.
It's hard to give a complete answer without more information about the file.
I think if your input data is 2200+ pages long, it's unlikely that opening it with the default Excel opening functions is the way to go, especially since Excel has a maximum number of rows and columns. If the file is a text file (.txt), I would suggest opening it with VBA, reading each line one at a time, and processing the data.
Here's an example to get you started. Just keep in mind that this transposes each row of text into a column of data, so you will fill all the columns of Excel long before you run through 2200 pages of text. But it's just an example.
Sub getData()
    dFile = FreeFile
    sFile = "c:\code\test.txt"
    Open sFile For Input As #dFile
    c = 1
    'keep doing this until end of file
    Do While Not EOF(dFile)
        'read the whole line into dataLine
        '(Line Input is used because plain Input # stops at commas)
        Line Input #dFile, dataLine
        ' break up line into words based on spaces
        j = Split(dataLine, " ")
        jLength = UBound(j)
        If jLength > 2 Then
            r = 1
            'ignore first 3 words
            'and get every other word
            'transpose rows of text into columns
            For word = 3 To jLength Step 2
                Cells(r, c) = j(word)
                r = r + 1
            Next word
        End If
        c = c + 1
    Loop
    'close the same file number that was opened above
    Close #dFile
End Sub
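The "skip the first three fields, then take every other one" logic can also be tried out away from Excel. The Python sketch below is only an illustration under assumptions: it guesses that the raw file looks like `SDQ*EA*92*1551*378*...~`, using the asterisk and tilde delimiters described in the question, and the function name `parse_records` is made up. Values are kept as strings, so trailing zeroes are not lost.

```python
def parse_records(text, field_sep="*", record_end="~", skip=3):
    """Return one list of (store, value) pairs per record in `text`."""
    records = []
    for raw in text.split(record_end):
        # Rows may wrap across physical lines, so remove the line breaks first.
        record = raw.replace("\r", "").replace("\n", "")
        if not record:
            continue  # nothing left after the final terminator
        fields = [f.strip() for f in record.split(field_sep)][skip:]
        stores = fields[0::2]   # 1st, 3rd, 5th ... remaining field
        values = fields[1::2]   # 2nd, 4th, 6th ... remaining field
        records.append(list(zip(stores, values)))
    return records
```

Each inner list is already "transposed": one (store, value) pair per row, which is the layout the question was trying to reach by copying and transpose-pasting every other cell.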
