I have a file that looks like this, containing a huge amount of data
>ENSMUSG00000020333|ENSMUST00000000145|Acsl6
AGCTCCAGGAGGGCCCGTCTCAGTCCGATGAACTTTGCAGCAATATTATAGTTATTCGTG
GTTCACAGAATTCCATTAAACATAAAGAAAAAACATAA
>ENSMUSG00000000001|ENSMUST00000000001|Gnai3
GAGGATGGCATAGTAAAAGCTATTACAGGGAGGAGTGTTGAGACCAGATGTCATCTACTG
CTCTGTAATCTAATGTTTAGGGCATATTGAAGTTGAGGTGCTGCCTTCCAGAACTTAAAC
the columns should be transformed so that lines always contain:
ENSMUSG*** ENSMUST*** GeneName Sequence (four separate columns)
the Sequence column should be the lines starting with either A,C,G,or T fused into one text cell, the number of cells to fuse varies from gene to gene.
does anyone have advice how to solve this?
thank you so much for your help!
best wishes
kk
Use the Text to Columns button on the Data tab. Choose Delimited , click Next, then select Other and in the box type the pipe symbol |. Then click Next and Finish.
I believe it is only those with Office 365 subscriptions that have the worksheet function CONCAT, which might be useful in this situation. So I would do this with a VBA macro.
First line -- split using the pipe | delimiter
Then concatenate the next lines until we get to one that does not start with "A", "C", "G", "T"
Store the results in a Collection object
Write the results back to the worksheet.
Since you have a large database, the "work" is done in VBA arrays as this will process much more rapidly.
It is assumed that your data is in Column A, starting in A1; and that your results will be written in columns B:E
If your database is clean, and formatted as you show, it should work OK. If it does not fall into the format you have presented, some error-checking might need to be added.
Option Explicit
Sub Organize()
Dim COL As Collection
Dim vSrc As Variant, vRes As Variant
Dim WS As Worksheet, rRes As Range
Dim V As Variant, W As Variant, S As String
Dim I As Long, J As Long
Set WS = ActiveSheet
With WS
Set rRes = .Cells(1, 2)
vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With
Set COL = New Collection
For J = 1 To UBound(vSrc, 1)
ReDim vRes(0 To 3)
W = Split(vSrc(J, 1), "|") 'First line
For I = 0 To 2
vRes(I) = W(I)
Next I
S = ""
'Concatenate subsequent lines
'Could look for the "<" but OP gave specifice starting letters
' So will use that
Do
Select Case Left(vSrc(J + 1, 1), 1)
Case "A", "C", "G", "T"
S = S & vSrc(J + 1, 1)
Case Else
Exit Do
End Select
J = J + 1
Loop Until J = UBound(vSrc, 1)
vRes(3) = S
COL.Add vRes
Next J
ReDim vRes(1 To COL.Count, 1 To 4)
I = 0
For Each W In COL
I = I + 1
For J = 1 To 4
vRes(I, J) = W(J - 1)
Next J
Next W
Set rRes = rRes.Resize(rowsize:=UBound(vRes, 1), columnsize:=UBound(vRes, 2))
With rRes
.EntireColumn.Clear
.Value = vRes
.EntireColumn.AutoFit
End With
End Sub
Related
I am new to VBA and am trying to copy the column from Row 2 onwards where the column header (in Row 1) contains a certain word- "Unique ID".
Currently what I have is:
Dim lastRow As Long
lastRow = ActiveWorkbook.Worksheets("Sheets1").Range("A" & Rows.Count).End(xlUp).Row
Sheets("Sheets1").Range("D2:D" & lastRow).Copy
But the "Unique ID" is not always in Column D
You can try following code, it loops through first row looking for a specified header:
Sub CopyColumnWithHeader()
Dim i As Long
Dim lastRow As Long
For i = 1 To Columns.Count
If Cells(1, i) = "Unique ID" Then
lastRow = Cells(Rows.Count, i).End(xlUp).Row
Range(Cells(2, i), Cells(lastRow, i)).Copy Range("A2")
Exit For
End If
Next
End Sub
When you want to match info in VBA you should use a dictionary. Additionally, when manipulating data in VBA you should use arrays. Although it will require some learning, below code will do what you want with minor changes. Happy learning and don't hesitate to ask questions if you get stuck:
Option Explicit
'always add this to your code
'it will help you to identify non declared (dim) variables
'if you don't dim a var in vba it will be set as variant wich will sooner than you think give you a lot of headaches
Sub DictMatch()
'Example of match using dictionary late binding
'Sourcesheet = sheet1
'Targetsheet = sheet2
'colA of sh1 is compared with colA of sh2
'if we find a match, we copy colB of sh1 to the end of sh2
'''''''''''''''''
'Set some vars and get data from sheets in arrays
'''''''''''''''''
'as the default is variant I don't need to add "as variant"
Dim arr, arr2, arr3, j As Long, i As Long, dict As Object
'when creating a dictionary we can use early and late binding
'early binding has the advantage to give you "intellisense"
'late binding on the other hand has the advantage you don't need to add a reference (tools>references)
Set dict = CreateObject("Scripting.Dictionary") 'create dictionary lateB
dict.CompareMode = 1 'textcompare
arr = Sheet1.Range("A1").CurrentRegion.Value2 'load source, assuming we have data as of A1
arr2 = Sheet2.Range("A1").CurrentRegion.Value2 'load source2, assuming we have data as of A1
'''''''''''''''''
'Loop trough source, calculate and save to target array
'''''''''''''''''
'here we can access each cell by referencing our array(<rowCounter>, <columnCounter>
'e.g. arr(j,i) => if j = 1 and i = 1 we'll have the values of Cell A1
'we can write these values anywhere in the activesheet, other sheet, other workbook, .. but to limit the number of interactions with our sheet object we can also create new, intermediant arrays
'e.g. we could now copy cel by cel to the new sheet => Sheets(arr(j,1).Range(... but this would create significant overhead
'so we'll use an intermediate array (arr3) to store the results
'We use a "dictionary" to match values in vba because this allows to easily check the existence of a value
'Together with arrays and collections these are probably the most important features to learn in vba!
For j = 1 To UBound(arr) 'traverse source, ubound allows to find the "lastrow" of the array
If Not dict.Exists(arr(j, 1)) Then 'Check if value to lookup already exists in dictionary
dict.Add Key:=arr(j, 1), Item:=arr(j, 1) 'set key if I don't have it yet in dictionary
End If
Next j 'go to next row. in this simple example we don't travers multiple columns so we don't need a second counter (i)
'Before I can add values to a variant array I need to redim it. arr3 is a temp array to store matching col
'1 To UBound(arr2) = the number of rows, as in this example we'll add the match as a col we just keep the existing nr of rows
'1 to 1 => I just want to add 1 column but you can basically retrieve as much cols as you want
ReDim arr3(1 To UBound(arr2), 1 To 1)
For j = 1 To UBound(arr2) 'now that we have all values to match in our dictionary, we traverse the second source
If dict.Exists(arr2(j, 1)) Then 'matching happens here, for each value in col 1 we check if it exists in the dictionary
arr3(j, 1) = arr(j, 2) 'If a match is found, we add the value to find back, in this example col. 2, and add it to our temp array (arr3).
'arr3(j, 2) = arr(j, 3) 'As explained above, we could retrieve as many columns as we want, if you only have a few you would add them manually like in this example but if you have many we could even add an additional counter (i) to do this.
End If
Next j 'go to the next row
'''''''''''''''''
'Write to sheet only at the end, you could add formatting here
'''''''''''''''''
With Sheet2 'sheet on which I want to write the matching result
'UBound(arr2, 2) => ubound (arr2) was the lastrow, the ubound of the second dimension of my array is the lastcolumn
'.Cells(1, UBound(arr2, 2) + 1) = The startcel => row = 1, col = nr of existing cols + 1
'.Cells(UBound(arr2), UBound(arr2, 2) + 1)) = The lastcel => row = number of existing rows, col = nr of existing cols + 1
.Range(.Cells(1, UBound(arr2, 2) + 1), .Cells(UBound(arr2), UBound(arr2, 2) + 1)).Value2 = arr3 'write target array to sheet
End With
End Sub
I have some lines (rows) in excel in the following form :
I need to transform it to the following form :
I'm not a good user of MS Excel and I am using the french version.
Thanks
Give this a try
Option Explicit
Public Sub MergeRows()
Dim rng As Range
Dim dict As Object
Dim tmp As Variant
Dim i As Long, j As Long
Dim c, key
Set dict = CreateObject("Scripting.dictionary")
dict.CompareMode = vbTextCompare
' Change this to where your source data is
With Sheet17
Set rng = .Range(.Cells(2, 10), .Cells(.Cells(.Rows.Count, 10).End(xlUp).Row, 10))
End With
For Each c In rng
If Not dict.exists(c.Value2) Then
ReDim tmp(1 To 3)
dict.Add key:=c.Value2, Item:=tmp
End If
j = 1
tmp = dict(c.Value2)
Do
If Not c.Offset(0, j).Value2 = vbNullString Then tmp(j) = c.Offset(0, j).Value2
j = j + 1
Loop Until j > UBound(tmp)
dict(c.Value2) = tmp
Next c
' Change this to where you want your output
With Sheet17.Range("A2")
i = 0
For Each key In dict.keys
.Offset(i, 0).Value2 = key
.Offset(i, 1).Resize(, UBound(dict(key))) = dict(key)
i = i + 1
Next key
End With
End Sub
Fun problem, here's my take using formulae but there's sure to be solutions out there that are better.
It uses three helper columns to the right of your options' columns:
1-Identify which option is chosen by finding the (first) non-blank cell in the row (as per these instructions). Put the first range as your header row (option1, option2, ... optionn) and the second range as your row from option1 through optionn. It should look something like this: =INDEX($K$1:$M$1,MATCH(FALSE,ISBLANK(K2:M2),0)) and note that it should be an array formula (ctrl+shift+enter)
2-Use a simple index+match to register the choice. Assuming the first helper column is in column N, this is: =INDEX(k2:m2,MATCH(n2,$k$1:$m$1,0))
3-Concatenate the address and option name to make it possible to look up the address-option combination. This is simply: =J2&N2
Once this is done, you create a simple table with only single Addresses in the left most column and Options as a header row (depending on the number of addresses, you might want to use a pivot table to populate them). Then you have an index-match to find your results:
=INDEX($O$2:$O$6,MATCH($J9&K$8,$P$2:$P$6,0)).
And you should be done.
There is an excel issue where we have one column with values like below and we want the respective values to go into corresponding new columns like allocation, primary purpose etc.
data is like
Allocation: Randomized|Endpoint Classification: Safety/Efficacy Study|Intervention Model: Parallel Assignment|Masking: Double Blind (Subject, Caregiver)|Primary Purpose: Treatment
Allocation: Randomized|Primary Purpose: Treatment
Allocation: Randomized|Intervention Model: Parallel Assignment|Masking: Open Label|Primary Purpose: Treatment
There are many such rows like this.
First use text to columns to split data using | delimiter.
Assuming data layout as in screenshot:
Add the following in A6 and drag across/down as required:
=IFERROR(MID(INDEX(1:1,0,(MATCH("*"&A$5&"*",1:1,0))),FIND(":",INDEX(1:1,0,(MATCH("*"&A$5&"*",1:1,0))),1)+2,1000),"")
It uses the MATCH/INDEX function to get the text of cell containing the heading, then uses MID/FIND function to get the text after the :. The whole formula is then enclosed in IFERROR so that if certain rows do not contain a particular header item, it returns a blank instead of #N/A's
You did not ask for a VBA solution, but here is one anyway.
Determine the column headers by examining each line and generate a unique list of the headers, storing it in a dictionary
You can add a routine to sort or order the headers
Create a "results" array and write the headers to the first row, using the dictionary to store the column number for later lookup
examine each line again and pull out the value associated with each column header, populating the correct slot in the results array.
write the results array to a "Results" worksheet.
In the code below, you may need to rename the worksheet where the source data resides. The Results worksheet will be added if it does not already exist -- feel free to rename it.
Test this on a copy of your data first, just in case.
Be sure to set the reference to Microsoft Scripting Runtime (Tools --> References) as indicated in the notes in the code.
Option Explicit
'Set References
' Microsoft Scripting Runtime
Sub MakeColumns()
Dim vSrc As Variant, vRes As Variant
Dim wsSrc As Worksheet, wsRes As Worksheet, rRes As Range
Dim dHdrs As Dictionary
Dim V As Variant, W As Variant
Dim I As Long, J As Long
Set wsSrc = Worksheets("Sheet1")
'Get source data
With wsSrc
vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With
'Set results sheet and range
On Error Resume Next
Set wsRes = Worksheets("Results")
If Err.Number = 9 Then
Worksheets.Add.Name = "Results"
End If
On Error GoTo 0
Set wsRes = Worksheets("Results")
Set rRes = wsRes.Cells(1, 1)
'Get list of headers
Set dHdrs = New Dictionary
dHdrs.CompareMode = TextCompare
'Split each line on "|" and then ":" to get header/value pairs
For I = 1 To UBound(vSrc, 1)
V = Split(vSrc(I, 1), "|")
For J = 0 To UBound(V)
W = Split(V(J), ":") 'W(0) will be header
If Not dHdrs.Exists(W(0)) Then _
dHdrs.Add W(0), W(0)
Next J
Next I
'Create results array
ReDim vRes(0 To UBound(vSrc, 1), 1 To dHdrs.Count)
'Populate Headers and determine column number for lookup when populating
'Could sort or order first if desired
J = 0
For Each V In dHdrs
J = J + 1
vRes(0, J) = V
dHdrs(V) = J 'column number
Next V
'Populate the data
For I = 1 To UBound(vSrc, 1)
V = Split(vSrc(I, 1), "|")
For J = 0 To UBound(V)
'W(0) is the header
'The dictionary will have the column number
'W(1) is the value
W = Split(V(J), ":")
vRes(I, dHdrs(W(0))) = W(1)
Next J
Next I
'Write the results
Set rRes = rRes.Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))
With rRes
.EntireColumn.Clear
.Value = vRes
With .Rows(1)
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
.EntireColumn.AutoFit
End With
End Sub
If you have not used macros before, to enter this Macro (Sub), alt-F11 opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.
To use this Macro (Sub), opens the macro dialog box. Select the macro by name, and RUN.
I have the following column K in Excel:
K
2 Apps -
3 Appointed CA - Apps - Assist - Appointed NA - EOD Efficiency
4 Appointed CA -
5 Appointed CA -
I want to split at - and count the number occurrences of the specific words in the string.
I tried the following formula which splits my string and returns everything LEFT to the -
=LEFT( K2, FIND( "-", K2 ) - 2 )
But the ideal output should be:
Apps Appointed CA Assist Appointed NA EOD Efficiency
1
1 1 1 1 1
1
1
Based on the above data.
Regards,
Here is a VBA macro that will
Generate a unique list of phrases from all of the data
Create the "header row" containing the phrases for the output
Go through the original data again and generate the counts for each phrase
As written, the macro is case insensitive. To make it case sensitive, one would have change the method of generating the unique list -- using the Dictionary object instead of a collection.
To enter this Macro (Sub), alt-F11 opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and paste the code below into the window that opens. It should be obvious where to make changes to handle variations in where your source data is located, and where you want the results.
To use this Macro (Sub), alt-F8 opens the macro dialog box. Select the macro by name, and RUN.
It will generate results as per your ideal output above
Option Explicit
Option Compare Text
Sub CountPhrases()
Dim colP As Collection
Dim wsSrc As Worksheet, wsRes As Worksheet, rRes As Range
Dim vSrc As Variant, vRes() As Variant
Dim I As Long, J As Long, K As Long
Dim V As Variant, S As String
'Set Source and Results worksheets and ranges
Set wsSrc = Worksheets("sheet1")
Set wsRes = Worksheets("sheet2")
Set rRes = wsRes.Cells(1, 1) 'Results will start in A1 on results sheet
'Get source data and read into array
With wsSrc
vSrc = .Range("K2", .Cells(.Rows.Count, "K").End(xlUp))
End With
'Collect unique list of phrases
Set colP = New Collection
On Error Resume Next 'duplicates will return an error
For I = 1 To UBound(vSrc, 1)
V = Split(vSrc(I, 1), "-")
For J = 0 To UBound(V)
S = Trim(V(J))
If S <> "" Then colP.Add S, CStr(S)
Next J
Next I
On Error GoTo 0
'Dimension results array
'Row 0 will be for the column headers
ReDim vRes(0 To UBound(vSrc, 1), 1 To colP.Count)
'Populate first row of results array
For J = 1 To colP.Count
vRes(0, J) = colP(J)
Next J
'Count the phrases
For I = 1 To UBound(vSrc, 1)
V = Split(vSrc(I, 1), "-")
For J = 0 To UBound(V)
S = Trim(V(J))
If S <> "" Then
For K = 1 To UBound(vRes, 2)
If S = vRes(0, K) Then _
vRes(I, K) = vRes(I, K) + 1
Next K
End If
Next J
Next I
'write results
Set rRes = rRes.Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))
With rRes
.EntireColumn.Clear
.Value = vRes
.EntireColumn.AutoFit
End With
End Sub
Assuming result range starts in column L:
L2: =IF(FIND("Apps", K2, 1) <> 0, 1, "")
M2: =IF(FIND("Appointed CA", K2, 1) <> 0, 1, "")
etc.
Autofill downwards.
EDIT:
Assuming all possible string combinations we're looking for are known ahead of time, the following should work. If the possible string combinations are not known, I would recommend building a UDF to sort it all out.
Anyway, assuming the strings are known, following the same principle as above:
L2: =IF(FIND("Apps", K2, 1) <> 0, (LEN(K2) - LEN(SUBSTITUTE(K2, "Apps", "")) / LEN(K2)), "")
M2: =IF(FIND("Appointed CA", K2, 1) <> 0, (LEN(K2) - LEN(SUBSTITUTE(K2, "Appointed CA", "")) / LEN(K2)), "")
Increase for as many strings as you like, autofill downwards.
I need to find a way to split some data on excel: e.g.
If a cell has the following in: LWPO0001653/1654/1742/1876/241
All of the info after the / should be LWPO000... with that number.
Is there anyway of separating them out and adding in the LWPO000in? So they come out as LWPO0001653
LWPO0001654
etc etc
I could do manually yes, but i have thousands to do so would take a long time.
Appreciate your help!
Here is a solution using Excel Formulas.
With your original string in A1, and assuming the first seven characters are the one's that get repeated, then:
B1: =LEFT($A1,FIND("/",$A1)-1)
C1: =IF(LEN($A1)-LEN(SUBSTITUTE($A1,"/",""))< COLUMNS($A:A),"",LEFT($A1,7)&TRIM(MID(SUBSTITUTE(MID($A1,8,99),"/",REPT(" ",99)),(COLUMNS($A:A))*99,99)))
Select C1 and fill right as far as required. Then Fill down from Row 1
EDIT: For a VBA solution, try this code. It assumes the source data is in column A, and puts the results adjacent starting in Column B (easily changed if necessary). It works using arrays within VBA, as doing multiple worksheet read/writes can slow things down. It will handle different numbers of splits in the various cells, although could be shortened if we knew the number of splits was always the same.
Option Explicit
Sub SplitSlash()
Dim vSrc As Variant
Dim rRes As Range, vRes() As Variant
Dim sFirst7 As String
Dim V As Variant
Dim COL As Collection
Dim I As Long, J As Long
Dim lMaxColCount As Long
Set rRes = Range("B1") 'Set to A1 to overwrite
vSrc = Range("a1", Cells(Rows.Count, "A").End(xlUp))
'If only a single cell, vSrc won't be an array, so change it
If Not IsArray(vSrc) Then
ReDim vSrc(1 To 1, 1 To 1)
vSrc(1, 1) = Range("a1")
End If
'use collection since number of columns can vary
Set COL = New Collection
For I = 1 To UBound(vSrc)
sFirst7 = Left(vSrc(I, 1), 7)
V = Split(vSrc(I, 1), "/")
For J = 1 To UBound(V)
V(J) = sFirst7 & V(J)
Next J
lMaxColCount = IIf(lMaxColCount < UBound(V), UBound(V), lMaxColCount)
COL.Add V
Next I
'Results array
ReDim vRes(1 To COL.Count, 1 To lMaxColCount + 1)
For I = 1 To UBound(vRes, 1)
For J = 0 To UBound(COL(I))
vRes(I, J + 1) = COL(I)(J)
Next J
Next I
'Write results to sheet
Set rRes = rRes.Resize(UBound(vRes, 1), UBound(vRes, 2))
With rRes
.EntireColumn.Clear
.Value = vRes
.EntireColumn.AutoFit
End With
End Sub
I'm clearly missing the point :-) but anyway, in B1 and copied down to suit:
=SUBSTITUTE(A1,"/","/"&LEFT(A1,7))
Select ColumnB, Copy and Paste Special, Values over the top.
Apply Text to Columns to ColumnB, Delimited, with / as the delimiter.
There's a couple of ways to solve this. The quickest is probably:
Assuming that the data is in column A:
Highlight the column, go to Data>>Text To Columns
Choose "Delimited" and in the "Other" box, put /
Click ok. You'll have your data split into multiple cells
Insert a column at B and put in the formula =Left(A1, 7)
Insert a column at C and pit in formula =Right(A1, Length(A1)-7)
You'll now have Column B with your first 7 characters, and columns B,C,D,E,F, etc.. with the last little bit. You can concatenate the values back together for each column you have with =Concatenate(B1,C1), =Concatenate(B1,D1), etc..
A quick VBa, which does nearly the same thing that #Kevin's does as well. I wrote it before I saw his answer, and I hate to throw away work ;)
Sub breakUpCell()
Dim rngInput As Range, rngInputCell As Range
Dim intColumn As Integer
Dim arrInput() As String
Dim strStart As String
Dim strEnd As Variant
'Set the range for the list of values (Assuming Sheet1 and A1 is the start)
Set rngInput = Sheet1.Range("A1").Resize(Sheet1.Range("A1").End(xlDown).Row)
'Loop through each cell in the range
For Each rngInputCell In rngInput
'Split up the values after the first 7 characters using "/" as the delimiter
arrInput = Split(Right(rngInputCell.Value, Len(rngInputCell.Value) - 7), "/")
'grab the first 7 characters
strStart = Left(rngInputCell.Value, 7)
'We'll be writing out the values starting in column 2 (B)
intColumn = 2
'Loop through each split up value and assign to strEnd
For Each strEnd In arrInput
'Write the concatenated value out starting at column B in the same row as rngInputCell
Sheet1.Cells(rngInputCell.Row, intColumn).Value = strStart & strEnd
'Head to the next column (C, then D, then E, etc)
intColumn = intColumn + 1
Next strEnd
Next rngInputCell
End Sub
Here is how you can do it with a macro:
This is what is happening:
1) Set range to process
2) Loop through each cell in range and check it isn't blank
3) If the cell contains the slash character then split it and process
4) Skip the first record and concatenate "LWPO000" plus the current string to adjacent cells.
Sub CreateLWPO()
On Error Resume Next
Application.ScreenUpdating = False
Dim theRange
Dim cellValue
Dim offset As Integer
Dim fields
'set the range of cells to be processed here
Set theRange = range("A1:A50")
'loop through each cell and if not blank process
For Each c In theRange
offset = 0 'this will be used to offset each item found 1 cell to the right (change this number to this first column to be populated)
If c.Value <> "" Then
cellValue = c.Value
If InStr(cellValue, "/") > 0 Then
fields = Split(cellValue, "/")
For i = 1 To UBound(fields)
offset = offset + 1
cellValue = "LWPO000" & fields(i)
'if you need to pad the number of zeros based on length do this and comment the line above
'cellValue = "LWPO" & Right$(String(7, "0") & fields(i), 7)
c.offset(0, offset).Value = cellValue
Next i
End If
End If
Next
Application.ScreenUpdating = True
End Sub