Find elements in a character array that contain a string

I want to extract the elements of a character array that contain some particular string. For example:
x <- c('aa', 'ab', 'ac', 'bb', 'bc')
I want a function that, given x and 'a' (in general, this can be any string), returns 'aa', 'ab', 'ac'. I have experimented with combinations of %in%, match, which, etc., but have not been able to make them work. Any ideas?

Just use grep:
grep('a', x, value=TRUE)
[1] "aa" "ab" "ac"

For a table or a list, we can use dplyr::pull from the dplyr/tidyverse package to convert the values in a column to a vector first, and then search that vector for the value of interest. For instance, in the LEGO example, we can do the following to find any theme starting with "s" or "S":
inventory_parts_themes <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  arrange(desc(quantity)) %>%
  select(-id, -version) %>%
  inner_join(sets, by = "set_num") %>%
  inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
all_theme_names <- dplyr::pull(inventory_parts_themes, name_theme)
all_theme_names[grep("^[sS].*", all_theme_names)]

Related

Change dot (.) to comma (,) in R

I have a dataset I am working on in R. However, one of the columns has dots (.) instead of commas (,) in its values, and I think this might be messing things up when I run the regression. Does anyone know what code I should run to change all the dots to commas?
Thanks beforehand.
Assuming you have a data frame named df and have loaded dplyr and stringr:
df %>% mutate(across(everything(), ~ str_replace_all(.x, "\\.", ",")))
If it's for one column only:
df %>% mutate(col1 = gsub("\\.", ",", col1))
Assuming your data is a character vector inside a dataframe
df <- data.frame(var = c("5.1", "30", "..", "75.234.4423.5"))
With gsub
df$var <- gsub("\\.", ",", df$var)
With stringi and purrr
library(stringi)
library(purrr)
df$var <- modify_if(df$var, stri_detect_fixed(df$var, "."),
                    ~stri_sub_replace_all(., stri_locate_all_fixed(., "."), replacement=","))
Output
df
var
1 5,1
2 30
3 ,,
4 75,234,4423,5
I used purrr::modify_if with the predicate stri_detect_fixed(df$var, ".") so that the values without any dots (in my example, 30) are not converted to NA by stringi::stri_sub_replace_all.
The stringi version is more flexible for other purposes: you can pass functions to the replacement argument when you want a dynamic replacement value.
I cannot comment on whether it will help your regression analysis. I have simply given ways to change dots to commas in a character vector.

Replace an item in list if it starts with a precise character

I am fairly new to Python and have a task to solve.
I have a list of strings, each of which is a hexadecimal number. I want to replace some items with '0' if they do not start with the right characters.
So, for example, I have
List = ['0800096700000000', '090000000000025d', '0b0000000000003c', '0500051b014f0000']
and I want, say, to only have the data that starts with "0b" and "05", and I want to replace the others by "0".
For now, I have this:
multiplex = ('0b', '05')
List = ['0800096700000000', '090000000000025d', '0b0000000000003c', '0500051b014f0000']
List = [x for x in List if x.startswith(multiplex)]
This gives me the following result:
['0b0000000000003c', '0500051b014f0000']
Although I would like the following result:
['0', '0', '0b0000000000003c', '0500051b014f0000']
I cannot index the specific item I wish to change because the actual data is way too large for that...
Can someone help?
You should use an if/else to decide what value to return for each item, not a trailing if that decides whether the item is kept in the list.
my_list = ['0800096700000000', '090000000000025d', '0b0000000000003c', '0500051b014f0000']
multiplex = ('0b', '05')
my_new_list = [x if x.startswith(multiplex) else '0' for x in my_list]
print(my_new_list)
# Sample output:
# ['0', '0', '0b0000000000003c', '0500051b014f0000']
Your multiplex prefixes are two characters long, so a single-character string can never start with one of them. Try either of these conditions:
if x.startswith(multiplex) or (len(str(x)) < 2 and x.startswith("0"))
if x.startswith(multiplex) or str(x) == "0"
List = [x if x.startswith(multiplex) else '0' for x in List]
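The difference between filtering and replacing may be easier to see side by side. This is a minimal sketch that reuses the data from the question; the helper name mask_unwanted and its placeholder parameter are made up for illustration and do not appear in the answers above.

```python
def mask_unwanted(items, prefixes, placeholder="0"):
    """Return a list of the same length, replacing items that lack a wanted prefix."""
    # str.startswith accepts a tuple of prefixes and is True if any of them matches.
    return [x if x.startswith(prefixes) else placeholder for x in items]


data = ['0800096700000000', '090000000000025d', '0b0000000000003c', '0500051b014f0000']
multiplex = ('0b', '05')

# A trailing `if` filters: items that do not match are dropped entirely.
filtered = [x for x in data if x.startswith(multiplex)]

# A conditional expression before `for` maps: every item produces a value.
masked = mask_unwanted(data, multiplex)

print(filtered)  # two items
print(masked)    # four items, same length as data
```

Because the conditional expression produces one output per input, positions are preserved, which matters if the list is later lined up against other data by index.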

How to Sort Alphabets

Input : abcdABCD
Output : AaBbCcDd
ms = []
n = input()
for i in n:
    ms.append(i)
ms.sort()
print(ms)
It gives me ABCDabcd.
How to sort this in python?
Without having to import anything, you could probably do something like this:
arr = "abcdeABCDE"
temp = sorted(arr, key = lambda i: (i.lower(), i))
result = "".join(temp)
print(result) # AaBbCcDdEe
The key will take in each element of arr and sort it first by lower-casing it, then if it ties, it will sort it based on its original value. It will group all similar letters together (A with a, B with b) and then put the capital first.
Use a sorting key:
ms = "abcdABCD"
sorted_ms = sorted(ms, key=lambda letter:(letter.upper(), letter.islower()))
# sorted_ms = ['A', 'a', 'B', 'b', 'C', 'c', 'D', 'd']
sorted_str = ''.join(sorted_ms)
# sorted_str = 'AaBbCcDd'
Why this works:
You can specify the criterion to sort by using the key argument of the sorted function or the list.sort() method. It expects a function or lambda that takes the element in question and returns the value to sort it by. If that value is a tuple, then the first element takes precedence; if it is equal, the second element is compared, and so on.
So, the lambda I provided here returns a 2-tuple:
(letter.upper(), letter.islower())
letter.upper() as the first element here means that the strings are going to be sorted lexicographically, but case-insensitively (as it will sort them as if they were all uppercase). Then, I use letter.islower() as the second element, which is True if the letter was lowercase and False otherwise. When sorting, False comes before True, which means that if you give a capital letter and a lowercase letter, the capital letter will come first.
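The tuple ordering described above can be checked directly. This is a small sketch; the function name case_key is hypothetical and simply wraps the lambda from the answer so that the keys can be printed.

```python
def case_key(letter):
    # Same key as the lambda above: (case-insensitive letter, is it lowercase?)
    return (letter.upper(), letter.islower())


# Tuples compare element by element, left to right.
assert ('A', False) < ('A', True)   # tie on the first element, so the second decides
assert ('A', True) < ('B', False)   # the first element decides on its own
assert False < True                 # bool is a subclass of int: False == 0, True == 1

keys = [case_key(c) for c in "bBaA"]
result = "".join(sorted("bBaA", key=case_key))

print(keys)
print(result)
```

Printing the keys next to the result shows which tuple each character is sorted by, which can help when debugging a key function that does not order things as expected.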
Try this:
>>> s = 'abcdABCD'
>>> ''.join(sorted(s, key=lambda x: x.lower()))
'aAbBcCdD'
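Note that this output puts the lowercase letter first, which is not the order the question asks for. The reason is that Python's sort is stable: characters with equal keys keep their input order. A short sketch of that behaviour follows; the variable names are chosen only for this illustration.

```python
lower_first_input = 'abcdABCD'
upper_first_input = 'ABCDabcd'

# With key=str.lower, 'a' and 'A' compare equal, so input order decides.
r1 = ''.join(sorted(lower_first_input, key=str.lower))
r2 = ''.join(sorted(upper_first_input, key=str.lower))

# Adding the character itself as a tie-breaker removes the dependence on
# input order, because 'A' sorts before 'a' in code point order.
r3 = ''.join(sorted(lower_first_input, key=lambda c: (c.lower(), c)))

print(r1)
print(r2)
print(r3)
```

So a key of x.lower() alone only gives the requested result if the capitals already come first in the input; a two-part key, as in the earlier answers, does not have that dependence.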

Remove middle B from each element in the list

Suppose I have a list called name:
name=['ACCBCDB','CCABACB','CAABBCB']
I want to use Python to remove the middle B from each element in the list.
The output should be:
['ACCCDB','CCAACB','CAABCB']
def ter(s):
    return s[3:-3]
name=['ACBBDBA','CCABACB','CABBCBB']
xx=[ter(s) for s in name]
z=xx
print(z)
output
['B', 'B', 'B']
I did the reverse: I want to delete the middle B and keep the other parts of each element.
name = ['ACCBCDB','CCABACB','CAABBCB']
name_without_middle = []
for oldstr in name:
    # Index of the middle character (integer division)
    midlen = len(oldstr) // 2
    # Keep everything before and after the middle character
    newstr = oldstr[:midlen] + oldstr[midlen+1:]
    name_without_middle.append(newstr)
print(name_without_middle)
Returns
['ACCCDB', 'CCAACB', 'CAABCB']
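The same slicing can be wrapped in a small function and applied with a list comprehension. This is a sketch only; the name remove_middle is made up here, and it removes whatever character sits at the middle index without checking that it is a B.

```python
def remove_middle(s):
    """Drop the character at index len(s) // 2 and keep the rest."""
    mid = len(s) // 2
    return s[:mid] + s[mid + 1:]


names = ['ACCBCDB', 'CCABACB', 'CAABBCB']
without_middle = [remove_middle(s) for s in names]
print(without_middle)
```

For strings of even length there is no single middle character; with this indexing the character just right of the centre is removed, so 'abcd' becomes 'abd'.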

Insert a character at a specific location in a string

I would like to insert an extra character (or a new string) at a specific location in a string. For example, I want to insert d at the fourth location in abcefg to get abcdefg.
Now I am using:
old <- "abcefg"
n <- 4
paste(substr(old, 1, n-1), "d", substr(old, n, nchar(old)), sep = "")
I could write a one-line simple function for this task, but I am just curious if there is an existing function for that.
You can do this with regular expressions and gsub.
gsub('^([a-z]{3})([a-z]+)$', '\\1d\\2', old)
# [1] "abcdefg"
If you want to do this dynamically, you can create the expressions using paste:
letter <- 'd'
lhs <- paste0('^([a-z]{', n-1, '})([a-z]+)$')
rhs <- paste0('\\1', letter, '\\2')
gsub(lhs, rhs, old)
# [1] "abcdefg"
As per DWin's comment, you may want this to be more general.
gsub('^(.{3})(.*)$', '\\1d\\2', old)
This way any three characters will match rather than only lower case. DWin also suggests using sub instead of gsub. This way you don't have to worry about the ^ as much since sub will only match the first instance. But I like to be explicit in regular expressions and only move to more general ones as I understand them and find a need for more generality.
As Greg Snow noted, you can use another form of regular expression that looks behind matches:
sub( '(?<=.{3})', 'd', old, perl=TRUE )
You could also build the dynamic gsub pattern above using sprintf rather than paste0:
lhs <- sprintf('^([a-z]{%d})([a-z]+)$', n-1)
or for his sub regular expression:
lhs <- sprintf('(?<=.{%d})',n-1)
The stringi package to the rescue once again! This is the simplest and most elegant solution among those presented.
The stri_sub function allows you to extract parts of a string and to substitute parts of it, like this:
library(stringi)
x <- "abcde"
stri_sub(x, 1, 3) # from first to third character
# [1] "abc"
stri_sub(x, 1, 3) <- 1 # substitute from first to third character
x
# [1] "1de"
But if you do this:
x <- "abcde"
stri_sub(x, 3, 2) # from 3 to 2 so... zero ?
# [1] ""
stri_sub(x, 3, 2) <- 1 # substitute from 3 to 2 ... hmm
x
# [1] "ab1cde"
then no characters are removed but a new one is inserted. Isn't that cool? :)
@Justin's answer is the way I'd actually approach this because of its flexibility, but this could also be a fun approach.
You can treat the string as "fixed width format" and specify where you want to insert your character:
paste(read.fwf(textConnection(old),
               c(4, nchar(old)), as.is = TRUE),
      collapse = "d")
Particularly nice is the output when using sapply, since you get to see the original string as the "name".
newold <- c("some", "random", "words", "strung", "together")
sapply(newold, function(x) paste(read.fwf(textConnection(x),
                                          c(4, nchar(x)), as.is = TRUE),
                                 collapse = "-WEE-"))
#            some          random           words          strung        together
#   "some-WEE-NA"   "rand-WEE-om"    "word-WEE-s"   "stru-WEE-ng" "toge-WEE-ther"
Your original way of doing this (i.e. splitting the string at an index and pasting in the inserted text) could be made into a generic function like so:
split_str_by_index <- function(target, index) {
  index <- sort(index)
  substr(rep(target, length(index) + 1),
         start = c(1, index),
         stop = c(index - 1, nchar(target)))
}
# Taken from https://stat.ethz.ch/pipermail/r-help/2006-March/101023.html
interleave <- function(v1, v2)
{
  ord1 <- 2*(1:length(v1))-1
  ord2 <- 2*(1:length(v2))
  c(v1,v2)[order(c(ord1,ord2))]
}
insert_str <- function(target, insert, index) {
  insert <- insert[order(index)]
  index <- sort(index)
  paste(interleave(split_str_by_index(target, index), insert), collapse="")
}
Example usage:
> insert_str("1234567890", c("a", "b", "c"), c(5, 9, 3))
[1] "12c34a5678b90"
This allows you to insert a vector of characters at the locations given by a vector of indexes. The split_str_by_index and interleave functions are also useful on their own.
Edit:
I revised the code to allow for indexes in any order. Before, indexes needed to be in ascending order.
I've made a custom function called substr1 to handle extracting, replacing and inserting characters in a string. Run this code at the start of every session. Feel free to try it out and let me know if it needs to be improved.
# extraction
substr1 <- function(x,y) {
  z <- sapply(strsplit(as.character(x),''),function(w) paste(na.omit(w[y]),collapse=''))
  dim(z) <- dim(x)
  return(z) }
# substitution + insertion
`substr1<-` <- function(x,y,value) {
  names(y) <- c(value,rep('',length(y)-length(value)))
  z <- sapply(strsplit(as.character(x),''),function(w) {
    v <- seq(w)
    names(v) <- w
    paste(names(sort(c(y,v[setdiff(v,y)]))),collapse='') })
  dim(z) <- dim(x)
  return(z) }
# demonstration
abc <- 'abc'
substr1(abc,1)
# "a"
substr1(abc,c(1,3))
# "ac"
substr1(abc,-1)
# "bc"
substr1(abc,1) <- 'A'
# "Abc"
substr1(abc,1.5) <- 'A'
# "aAbc"
substr1(abc,c(0.5,2,3)) <- c('A','B')
# "AaB"
It took me some time to understand the regular expression, but afterwards I found a way to apply it to the numbers I had.
The end result was:
old <- "89580000"
gsub('^([0-9]{5})([0-9]+)$', '\\1-\\2', old)
similar to yours!
No extra package is needed for this: paste0 and substr are both base R functions.
Here is the exact code:
paste0(substr(old, 1,3), "d", substr(old,4,6))
In base you can use regmatches to insert a character at a specific location in a string.
old <- "abcefg"
n <- 4
regmatches(old, `attr<-`(n, "match.length", 0)) <- "d"
old
#[1] "abcdefg"
This could also be used with a regex to find the location to insert.
s <- "abcefg"
regmatches(s, regexpr("(?<=c)", s, perl=TRUE)) <- "d"
s
#[1] "abcdefg"
It also works for multiple matches, with an individual replacement at each match.
s <- "abcefg abcefg"
regmatches(s, gregexpr("(?<=c)", s, perl=TRUE)) <- list(1:2)
s
#[1] "abc1efg abc2efg"

Resources