How to insert the missing line breaks in an R data frame


I need to insert some missing line breaks in a one-column R data frame. The line breaks were lost during the data collection phase.
The data looks like:
V1
Apple
OrangeBanana
BananaBananaBanana
Watermelon
GrapeBanana
So all the line breaks before "Banana" are missing.
I want to search for "Banana" and add the missing line breaks so that it looks like:
V1
Apple
Orange
Banana
Banana
Banana
Banana
Watermelon
Grape
Banana

Here's a slightly more general solution, but one that can easily be adapted to work specifically with "Banana".
V1 <- c("Apple", "OrangeBanana", "BananaBananaBanana", "Watermelon", "GrapeBanana")
First, find every uppercase letter that is not at a word boundary and replace it with a space followed by that same letter:
splits <- gsub("(?:\\B)([[:upper:]])"," \\1" , V1, perl=TRUE)
[1] "Apple" "Orange Banana" "Banana Banana Banana" "Watermelon" "Grape Banana"
Then split by the space character and convert from list to vector:
unlist(strsplit(splits, " "))
[1] "Apple" "Orange" "Banana" "Banana" "Banana" "Banana" "Watermelon" "Grape" "Banana"
Or in one line:
unlist(strsplit(gsub("(?:\\B)([[:upper:]])"," \\1" , V1, perl=TRUE), " "))
EDIT: For a regex that works explicitly with "Banana":
gsub("(?:\\B)(Banana)"," \\1" , V1, perl=TRUE)
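For comparison, the same idea can be sketched in Python. This is only an illustrative translation of the R answer above, not part of it: the function name `split_on_inner_capitals` is made up for the example, and `[A-Z]` covers ASCII capitals only, whereas `[[:upper:]]` in R is locale-aware.

```python
import re

def split_on_inner_capitals(values):
    """Split each string before any uppercase letter that is not at a word start."""
    pieces = []
    for value in values:
        # \B matches only between two word characters, so a leading capital is left alone.
        spaced = re.sub(r"\B([A-Z])", r" \1", value)
        pieces.extend(spaced.split(" "))
    return pieces

v1 = ["Apple", "OrangeBanana", "BananaBananaBanana", "Watermelon", "GrapeBanana"]
print(split_on_inner_capitals(v1))
```

Replacing `[A-Z]` with the literal `Banana` gives the variant that only splits before that word.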

Related

Excel Formula - Match substrings of List to List

I have two Lists in an excel spreadsheet.
The first list has strings such as
1234 blue 6 abc
xyz blue/white 1234
abc yellow 123
The other list contains substrings of the first list
yellow
blue/white
blue
Desired result (each string followed by its matching substring):
1234 blue 6 abc blue
xyz blue/white 1234 blue/white
abc yellow 123 yellow
Now I need some kind of match formula to assign the correct value from the second list to the first. One problem is that there is no specific pattern that determines where the color substring is positioned. The other problem is that the values are not entirely unique. As my example above shows, the lookup needs to happen in order (checking for "blue/white" before checking for "blue").
I played around with formulas like MATCH and FIND, also using the * wildcard, but couldn't get any result.
A similar question asked here on SO covers the opposite case: How to find if substring exists in a list of strings (and return full value in list if so)
Any help is appreciated. A formula would be cool, but using VBA is also okay.

=INDEX(D$7:D$9, AGGREGATE(15, 7, ROW($1:$3)/ISNUMBER(SEARCH(D$7:D$9, A2)), 1))

Here is a solution with VBA
List 1 (strings) is in column A
List 2 (substrings) is in column C
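The code basically consists of two nested loops that check whether each substring occurs inside the string. It stops at the first hit, so list 2 must be ordered with the most specific values first ("blue/white" before "blue"); otherwise a later "blue" would overwrite "blue/white".
Sub MatchSubstrings()
    Dim row_1 As Long, row_2 As Long, color As String
    ' Assumption: both lists are on the active sheet and start in row 1
    With ActiveSheet
        row_1 = 1
        Do While .Cells(row_1, "A") <> ""
            row_2 = 1
            Do While .Cells(row_2, "C") <> ""
                color = .Cells(row_2, "C").Value
                If InStr(1, .Cells(row_1, "A"), color, vbBinaryCompare) > 0 Then
                    .Cells(row_1, "B") = color
                    Exit Do ' keep the first match only
                End If
                row_2 = row_2 + 1
            Loop
            row_1 = row_1 + 1
        Loop
    End With
End Sub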
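The ordering requirement from the question (check "blue/white" before "blue") amounts to a first-match search over an ordered list. A minimal Python sketch of that logic, using the sample data from the question; the function name `first_matching_substring` is hypothetical, and the candidate list is assumed to be ordered with the most specific entries first:

```python
def first_matching_substring(text, candidates):
    """Return the first candidate contained in text, or None if none matches."""
    for candidate in candidates:
        if candidate in text:
            return candidate
    return None

strings = ["1234 blue 6 abc", "xyz blue/white 1234", "abc yellow 123"]
# Order matters: "blue/white" must come before "blue".
colors = ["yellow", "blue/white", "blue"]

for s in strings:
    print(s, "->", first_matching_substring(s, colors))
```

If the list cannot be ordered by hand, sorting the candidates by length in descending order is one way to make longer values win over their own substrings.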

How to format sublists in a tabular way automatically?

I have the following list, which contains sublists
tableData = [['apples', 'oranges', 'cherries', 'banana'],['Alice', 'Bob', 'Carol', 'David'], ['dogs', 'cats', 'moose', 'goose']]
My aim is to format them this way:
apples Alice dogs
oranges Bob cats
cherries Carol moose
banana David goose
The code below does that:
for nested_list in zip(*tableData):
    print("{:>9} {:>9} {:>9}".format(*nested_list))
Yet what bugs me is that I need to specify the format of each sublist manually.
I've been trying to find a way to do it automatically with a for loop, but I did not find anything relevant on how to do it.
Any tips are more than welcome.
Thanks.

How about this:
for line in zip(*tableData):
    for word in line:
        print("{:>9}".format(word), end=' ')
    print()
Explanation
Without the bare print(), all the sublists would end up on a single line like this:
apples Alice dogs oranges Bob cats cherries Carol moose banana David goose
The bare print() ends each row with a newline.

If you just want to use {:>9} as the format code with an arbitrary number of columns, try this:
fieldFormat = ' '.join(['{:>9}'] * len(tableData))
for nestedList in zip(*tableData):
    print(fieldFormat.format(*nestedList))
This just creates a list of {:>9} format specifiers, one for each column in tableData, then joins them together with spaces.
If you want to automatically calculate the field widths as well, you can do this:
fieldWidths = [max(len(word) for word in col) for col in tableData]
fieldFormat = ' '.join('{{:>{}}}'.format(wid) for wid in fieldWidths)
for nestedList in zip(*tableData):
    print(fieldFormat.format(*nestedList))
fieldWidths is generated by a list comprehension that calculates the length of the longest word in each column. From the inside:
(len(word) for word in col)
This is a generator that will produce the length of each word in col.
max(len(word) for word in col)
Feeding the generator (or any iterable) into max will calculate the maximum value of everything produced by the iterable.
[max(len(word) for word in col) for col in tableData]
This list comprehension produces the maximum length of all words in each column col of data in tableData.
fieldFormat is then produced by transforming fieldWidths into format specifiers. Again from the inside:
'{{:>{}}}'.format(wid)
This formats wid into the {:>#} format. {{ is a way to have a format specifier produce a {; similarly, }} produces }. The {} in the middle is what actually gets formatted with wid.
('{{:>{}}}'.format(wid) for wid in fieldWidths)
This is a generator expression that does the above formatting for each width listed in fieldWidths.
fieldFormat = ' '.join('{{:>{}}}'.format(wid) for wid in fieldWidths)
This just joins those formats together with spaces in between to create the fieldFormat format specifier.
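As an alternative sketch, the width can also be passed to `str.format` through a nested replacement field, which avoids building the format string with doubled braces. The helper name `format_table` is made up for this example, and it returns the lines instead of printing them so the result is easy to check:

```python
table_data = [
    ["apples", "oranges", "cherries", "banana"],
    ["Alice", "Bob", "Carol", "David"],
    ["dogs", "cats", "moose", "goose"],
]

def format_table(columns):
    """Right-align every column to the width of its longest word."""
    widths = [max(len(word) for word in col) for col in columns]
    lines = []
    for row in zip(*columns):
        # "{:>{}}" takes the value first and the field width second.
        cells = ["{:>{}}".format(word, width) for word, width in zip(row, widths)]
        lines.append(" ".join(cells))
    return lines

for line in format_table(table_data):
    print(line)
```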

How do I replace a text string with a number, based on a key word contained in the cell

I have a string variable with short text strings. I want to replace all the text strings with numbers based on key words contained inside the individual cells.
Example: Some cells state "I like cats", while others state "I dont like the smell of wet dog".
I want to assign the value 1 to all cells containing the word cat, and the number 2 to all cells containing the word dog.
How do I do this?

This will put 1 in NewVar when "cat" appears in OldVar, 2 for "dog", 3 for "mouse":
do repeat wrd="cat" "dog" "mouse"/val= 1 2 3.
if index(OldVar, wrd)>0 NewVar=val.
end repeat.
This is only good if there will never be a cat AND a dog in the same sentence. If you do have such cases you should go this way:
do repeat wrd="cat" "dog" "mouse"/NewVar=cat dog mouse.
compute NewVar=char.index(OldVar, wrd)>0.
end repeat.
This will create a new variable for each of the possible words, putting 1 in cases where the word appears in OldVar, 0 when it doesn't.
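The two strategies above (one code per cell versus one indicator per keyword) can be sketched outside SPSS as well. The Python below is only an illustration under stated assumptions: the names `KEYWORD_CODES`, `code_for` and `indicators` are hypothetical, matching is made case-insensitive by lowercasing, and `code_for` returns the first keyword found, which is a design choice for this sketch rather than the behaviour of the syntax above.

```python
KEYWORD_CODES = {"cat": 1, "dog": 2, "mouse": 3}

def code_for(text, codes=KEYWORD_CODES):
    """Return the code of the first keyword found in text, or None."""
    lowered = text.lower()
    for word, code in codes.items():
        if word in lowered:
            return code
    return None

def indicators(text, codes=KEYWORD_CODES):
    """Return a 0/1 flag per keyword, for cells that mention several keywords."""
    lowered = text.lower()
    return {word: int(word in lowered) for word in codes}

print(code_for("I like cats"))
print(indicators("My cat chased the dog"))
```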

You have to open a syntax window and enter this command:
COMPUTE newvar=CHAR.INDEX(UPCASE(VAR1),"ABCD")>0.
newvar is the name of the new variable.
VAR1 is the name of the variable to be searched.
ABCD is the text to be searched for. NOTE: This must be in CAPITAL letters, because VAR1 is converted to upper case before the search.
newvar will receive a value of 1 if the text is found.

How to find the line offset containing a specific number

I am trying to use the following code
put 7 into lFoodID
put lineoffset(lFoodID,gArrFood) into tArrFoodLine
to find the line that contains the number 7 in the array below
17 Banana
20 Beans
2 Beef
1 Bread
8 Cabagge
6 Chicken
5 Eggs
15 Ice Cream
3 Mango
7 Pork
18 Rice
4 Salad
19 fried fish
It's returning 1. I know that this is because 17 contains the digit 7. I have tried
set the wholeMatches to true
but that does not work either. I believe that a regex such as (^(7) should work, but I can't figure out how to use regex in lineoffset.

I'm not sure what you're really after, and I wonder whether your data really look like what you have provided here. I assume that your data are as displayed, but this may lead to a solution that's slightly different from what you really want.
If you want to get the product associated with an index, you can use the following script
put fld 1 into myList
replace space&space with tab in myList
repeat until (tab&tab is not in myList and space&space is not in myList)
replace space&space with space in myList
replace tab&tab with tab in myList
end repeat
split myList by cr and tab
put myList[7] into myProduct
put myProduct
myProduct contains the product name. Note that you won't need the repeat loop if your data is formatted properly. If you really want to have the line number, use this:
put fld 1 into myList
put 7 into myIndex
if word 1 of myList is not myIndex then
   put number of lines of char 1 to offset(cr & myIndex & space ,myList) of myList into myLine
else
   put 1 into myLine
end if
put myLine
myLine contains the line number of the matching record in your list.
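The underlying idea, comparing the ID against the whole first field of each line rather than searching for it anywhere in the text, can be sketched in Python. This is an illustration only, not LiveCode: the function name `line_number_for_id` is hypothetical, and returning 0 when nothing matches is chosen here to mirror how lineOffset reports a miss.

```python
FOOD_LIST = """17 Banana
20 Beans
2 Beef
1 Bread
8 Cabagge
6 Chicken
5 Eggs
15 Ice Cream
3 Mango
7 Pork
18 Rice
4 Salad
19 fried fish"""

def line_number_for_id(food_list, food_id):
    """Return the 1-based line whose first field equals food_id, or 0 if absent."""
    for number, line in enumerate(food_list.splitlines(), start=1):
        fields = line.split()
        # Compare the complete first field, so 7 does not match 17.
        if fields and fields[0] == str(food_id):
            return number
    return 0

print(line_number_for_id(FOOD_LIST, 7))
```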

Removing duplicates within a cell

I can't find a way to remove duplicate values within a single cell in Excel. For example, in A1, I have:
DOG DOG DOG
I want to have only DOG.
Code output: my code writes values like the following to the Excel cell (37, 4):
2000 3000 0300 0300 2000
I am lost as to how to delete the repeated values in the cell.

You can use a regex with a backreference to match duplicated words or phrases. The non-greedy pattern ^(.+?)\s(\1\s*)+$ will match any repeated phrase with whitespace in between.
Function RemoveDupes(strText)
    ' Return the input unchanged when it is not a pure repetition
    RemoveDupes = strText
    With New RegExp
        .Pattern = "^(.+?)\s(\1\s*)+$"
        If .Test(strText) Then RemoveDupes = .Replace(strText, "$1")
    End With
End Function
Tests:
WScript.Echo RemoveDupes("Dog Cat Dog Cat") ' => Dog Cat
WScript.Echo RemoveDupes("Dog Dog") ' => Dog
WScript.Echo RemoveDupes("Dog Dog Dog") ' => Dog
WScript.Echo RemoveDupes("Dog Dog Dog Dog") ' => Dog
WScript.Echo RemoveDupes("Dog Cat Dog Cat Dog Cat") ' => Dog Cat
Edit:
I see you've added some additional examples that don't repeat perfectly as your original examples did. In that case, you'll need to use an alternative method. If your values are always separated by a space, consider splitting your values into an array and storing them into a Dictionary to keep track of unique values. For example:
Dim d, a, i
Set d = CreateObject("Scripting.Dictionary")
a = Split(objExcel.ActiveSheet.Range("A1"), " ")
For i = 0 To UBound(a)
If Not d.Exists(a(i)) Then d.Add a(i), ""
Next
Now your Dictionary should have unique values from your cell and you can recombine them:
objExcel.ActiveSheet.Range("A1") = Join(d.Keys, " ")
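The split, de-duplicate and rejoin approach can be sketched in a few lines of Python. This is an illustration rather than a replacement for the VBScript above: the function name `remove_duplicate_words` is made up, values are assumed to be separated by whitespace, and it relies on Python dictionaries preserving insertion order (guaranteed since Python 3.7).

```python
def remove_duplicate_words(cell_text):
    """Keep the first occurrence of each whitespace-separated value, in order."""
    # dict.fromkeys drops repeats while keeping the order of first appearance.
    return " ".join(dict.fromkeys(cell_text.split()))

print(remove_duplicate_words("DOG DOG DOG"))
print(remove_duplicate_words("2000 3000 0300 0300 2000"))
```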

Something like the following should work. We need to know where the first space is in "DOG DOG".
Excel:
=LEFT(A1,SEARCH(" ",A1)-1)
VBS:
Left(string, length)
Ref. https://msdn.microsoft.com/en-us/library/sk3xcs8k%28v=vs.84%29.aspx
InStr:
SOMESTRING = "DOG DOG DOG"
InStr (1, SOMESTRING, " ") '-- look for the position of the space
Ref. https://msdn.microsoft.com/en-us/library/wybb344c%28v=vs.84%29.aspx
Putting the two together we get:
SOMESTRING = "DOG DOG DOG"
RESULTS = Left(SOMESTRING, InStr (1, SOMESTRING, " ")-1)
Hope this helps!
