Insert a character at a specific location in a string

I would like to insert an extra character (or a new string) at a specific location in a string. For example, I want to insert d at the fourth location in abcefg to get abcdefg.
Now I am using:
old <- "abcefg"
n <- 4
paste(substr(old, 1, n-1), "d", substr(old, n, nchar(old)), sep = "")
I could write a one-line simple function for this task, but I am just curious if there is an existing function for that.

You can do this with regular expressions and gsub.
gsub('^([a-z]{3})([a-z]+)$', '\\1d\\2', old)
# [1] "abcdefg"
If you want to do this dynamically, you can create the expressions using paste:
letter <- 'd'
lhs <- paste0('^([a-z]{', n-1, '})([a-z]+)$')
rhs <- paste0('\\1', letter, '\\2')
gsub(lhs, rhs, old)
# [1] "abcdefg"
As per DWin's comment, you may want this to be more general.
gsub('^(.{3})(.*)$', '\\1d\\2', old)
This way any three characters will match rather than only lower case. DWin also suggests using sub instead of gsub. This way you don't have to worry about the ^ as much since sub will only match the first instance. But I like to be explicit in regular expressions and only move to more general ones as I understand them and find a need for more generality.
As Greg Snow noted, you can use another form of regular expression, a lookbehind assertion, which matches a position without consuming characters:
sub( '(?<=.{3})', 'd', old, perl=TRUE )
You could also build the dynamic gsub pattern above using sprintf rather than paste0:
lhs <- sprintf('^([a-z]{%d})([a-z]+)$', n-1)
Or, for his sub regular expression:
lhs <- sprintf('(?<=.{%d})',n-1)

The stringi package to the rescue once again! It offers what is probably the simplest and most elegant of the solutions presented here.
The stri_sub function lets you extract parts of a string and substitute parts of it, like this:
library(stringi)
x <- "abcde"
stri_sub(x, 1, 3) # from first to third character
# [1] "abc"
stri_sub(x, 1, 3) <- 1 # substitute from first to third character
x
# [1] "1de"
But if you do this:
x <- "abcde"
stri_sub(x, 3, 2) # from 3 to 2 so... zero ?
# [1] ""
stri_sub(x, 3, 2) <- 1 # substitute from 3 to 2 ... hmm
x
# [1] "ab1cde"
then no characters are removed, but the new ones are inserted. Isn't that cool? :)

Justin's answer is the way I'd actually approach this because of its flexibility, but this could also be a fun approach.
You can treat the string as "fixed width format" and specify where you want to insert your character:
paste(read.fwf(textConnection(old),
c(4, nchar(old)), as.is = TRUE),
collapse = "d")
Particularly nice is the output when using sapply, since you get to see the original string as the "name".
newold <- c("some", "random", "words", "strung", "together")
sapply(newold, function(x) paste(read.fwf(textConnection(x),
c(4, nchar(x)), as.is = TRUE),
collapse = "-WEE-"))
# some random words strung together
# "some-WEE-NA" "rand-WEE-om" "word-WEE-s" "stru-WEE-ng" "toge-WEE-ther"

Your original way of doing this (i.e. splitting the string at an index and pasting in the inserted text) could be made into a generic function like so:
split_str_by_index <- function(target, index) {
  index <- sort(index)
  substr(rep(target, length(index) + 1),
         start = c(1, index),
         stop = c(index -1, nchar(target)))
}
#Taken from https://stat.ethz.ch/pipermail/r-help/2006-March/101023.html
interleave <- function(v1,v2)
{
  ord1 <- 2*(1:length(v1))-1
  ord2 <- 2*(1:length(v2))
  c(v1,v2)[order(c(ord1,ord2))]
}
insert_str <- function(target, insert, index) {
  insert <- insert[order(index)]
  index <- sort(index)
  paste(interleave(split_str_by_index(target, index), insert), collapse="")
}
Example usage:
> insert_str("1234567890", c("a", "b", "c"), c(5, 9, 3))
[1] "12c34a5678b90"
This allows you to insert a vector of characters at the locations given by a vector of indexes. The split_str_by_index and interleave functions are also useful on their own.
Edit:
I revised the code to allow for indexes in any order. Before, indexes needed to be in ascending order.

I've made a custom function called substr1 to deal with extracting, replacing and inserting characters in a string. Run this code at the start of every session. Feel free to try it out and let me know if it needs to be improved.
# extraction
substr1 <- function(x,y) {
z <- sapply(strsplit(as.character(x),''),function(w) paste(na.omit(w[y]),collapse=''))
dim(z) <- dim(x)
return(z) }
# substitution + insertion
`substr1<-` <- function(x,y,value) {
names(y) <- c(value,rep('',length(y)-length(value)))
z <- sapply(strsplit(as.character(x),''),function(w) {
v <- seq(w)
names(v) <- w
paste(names(sort(c(y,v[setdiff(v,y)]))),collapse='') })
dim(z) <- dim(x)
return(z) }
# demonstration
abc <- 'abc'
substr1(abc,1)
# "a"
substr1(abc,c(1,3))
# "ac"
substr1(abc,-1)
# "bc"
substr1(abc,1) <- 'A'
# "Abc"
substr1(abc,1.5) <- 'A'
# "aAbc"
substr1(abc,c(0.5,2,3)) <- c('A','B')
# "AaB"

It took me some time to understand the regular expression, but afterwards I found my way with the numbers I had.
The end result was:
old <- "89580000"
gsub('^([0-9]{5})([0-9]+)$', '\\1-\\2', old)

Similar to yours!
No extra package is needed here: paste0 and substr are both part of base R.
Here is the exact code:
paste0(substr(old, 1, 3), "d", substr(old, 4, 6))

In base you can use regmatches to insert a character at a specific location in a string.
old <- "abcefg"
n <- 4
regmatches(old, `attr<-`(n, "match.length", 0)) <- "d"
old
#[1] "abcdefg"
This could also be used with a regex to find the location to insert.
s <- "abcefg"
regmatches(s, regexpr("(?<=c)", s, perl=TRUE)) <- "d"
s
#[1] "abcdefg"
This also works for multiple matches, with individual replacements for the different matches.
s <- "abcefg abcefg"
regmatches(s, gregexpr("(?<=c)", s, perl=TRUE)) <- list(1:2)
s
#[1] "abc1efg abc2efg"

Related

Checking whether a "string" expression in a specified variable is contained in several other variables

This is a very basic question, and I'm sure it's answered here somewhere. I'm trying to create a new variable that tells me whether there is a Mendelian violation in a trio (genetics data).
For example (for those who don't know genetics):
For one family (one per row), I have the Father's, Mother's and Child's genotypes represented as A/G, G/A, G/G (as separate variables). I want to create a new 0/1 or False/True variable that tells me whether Allele 1 of the Child is seen in either allele of the mother's genotype or in either allele of the father's genotype. The same for Allele 2.
I tried using regexpr as follows, in R:
vcf_GT$MVLR <- regexpr(c(sapply(strsplit(as.character(vcf_GT[,10]),"/"),function(x) x[1])),
(sapply(strsplit(as.character(vcf_GT[,10]),"/"),function(x) x[2])),
(c(c(sapply(strsplit(as.character(vcf_GT[,9]),"/"),function(x) x[1])),
(sapply(strsplit(as.character(vcf_GT[,9]),"/"),function(x) x[2])),
c(sapply(strsplit(as.character(vcf_GT[,8]),"/"),function(x) x[1])),
(sapply(strsplit(as.character(vcf_GT[,8]),"/"),function(x) x[2]))))) > 0
Column 10 represents the child's genotype, and columns 9 and 8 represent the mother's and father's respectively. This is tedious and I've probably forgotten a parenthesis somewhere in here.
There has to be an easier way to check the child's genotype against the mother's and father's.
Thanks in advance!
P.S. If I'm not making sense - I'll try to add some more detail.
Edit: Although my code is actually one huge line, as requested I've added returns so it's easier to read (though it's kind of hard to read regardless :))
First off, if you find yourself doing the same thing over and over, write a function. So instead of
c(sapply(strsplit(as.character(vcf_GT[,10]),"/"),function(x) x[1]))...
Write a little wrapper:
myfun <- function(var1, var2, dat=vcf_GT) {
  # split column var1 on '/' and keep allele number var2 from each row
  sapply(strsplit(as.character(dat[,var1]), '/'),
         function(x) x[var2])
}
Now the stuff you pasted above becomes something like:
regexpr(c(myfun(10, 1),
myfun(10, 2)...
However, I think there is an easier way...
To solve a problem like this (or of any kind) I generally break it into chunks. Start with a single "row" like you've given and write some functions that do what you want (sorry if I've gotten this wrong, but that was confusing code!)...
dad = 'A/G'
mom = 'G/A'
kid = 'G/G'
splt <- function(x) unlist(strsplit(x, '/'))
comp <- function(x, y) c(x[1] %in% y, x[2] %in% y)
comp(splt(kid), splt(dad))
From there you are an apply away from doing this on a data.frame:
## make some data
possible <- expand.grid(c('C', 'T', 'A', 'G'),
c('C', 'T', 'A', 'G'))
gen <- function(n, pos=possible) {
  # draw n random rows from pos (with replacement) and join the alleles
  res <- pos[sample(1:nrow(pos), n, replace=TRUE),]
  return (paste(res[,1], res[,2], sep='/'))
}
n <- 10
dat <- data.frame(mom=gen(n), dad=gen(n), kid=gen(n))
# put both functions together
splt_and_comp <- function(x, y) {
x <- splt(x)
y <- splt(y)
comp(x, y)
}
# you could do this with `apply` as well...
mapply(splt_and_comp, dat$kid, dat$mom)
FWIW, your current code is calling regexpr with three arguments, as follows. It may well work, but it is impossible to read and has extra parentheses all over:
first_arg <- c(sapply(strsplit(as.character(vcf_GT[,10]), "/"),
function(x) x[1]))
second_arg <- (sapply(strsplit(as.character(vcf_GT[, 10]), "/"),
function(x) x[2]))
third_arg <- (c(c(sapply(strsplit(as.character(vcf_GT[,9]),"/"),
function(x) x[1])),
(sapply(strsplit(as.character(vcf_GT[,9]),"/"),
function(x) x[2])),
c(sapply(strsplit(as.character(vcf_GT[,8]),"/"),
function(x) x[1])),
(sapply(strsplit(as.character(vcf_GT[,8]),"/"),
function(x) x[2]))))
If what you want is simply to look at the alleles that match either the mother or the father, you do not necessarily need regular expressions to do this. You can do it with the %in% operator (which is built on the match() function, but I prefer this syntax).
Let's set up our data frame of genotypes. Note that the last "family" is one where the child has a different allele from the mother.
x <- data.frame(list(mom = c("A/G", "C/C", "C/A"),
dad = c("G/A", "T/T", "A/A"),
child = c("G/G", "T/T", "A/T")
), stringsAsFactors = FALSE)
Now, we can set up our function to check the alleles of the child. You will have to change the c(1,2,3) to c(8,9,10) for it to work on your data set, but it should work. This is the function we will use on each row of your data frame. It will split all of the genotypes of the family, compare the child to both mother and father and then determine if the child's genotype matches either parent.
check_child_allele <- function(x) {
fam <- strsplit(as.character(x[c(1, 2, 3)]), "/")
names(fam) <- c("mom", "dad", "child")
mom_query <- fam[["child"]] %in% fam[["mom"]]
dad_query <- fam[["child"]] %in% fam[["dad"]]
fam_matrix <- matrix(c(mom = mom_query, dad = dad_query), nrow = 2)
child_match_parents <- rowSums(fam_matrix)
child_geno <- ifelse(child_match_parents < 1, FALSE, TRUE)
return(child_geno)
}
Checking the example.
apply(x, 1, check_child_allele)
## [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE FALSE
Changing the data frame to represent a child that matches neither of the parents.
y <- x
y[2, 3] <- "A/G" # Adding a child that has no alleles in common with parents
apply(y, 1, check_child_allele)
## [,1] [,2] [,3]
## [1,] TRUE FALSE TRUE
## [2,] TRUE FALSE FALSE
A side note that might not be relevant to your work:
One thing that you might be concerned about is the fact that this will check if the allele is present in either parent, but it will not check if both of the parents are indeed possible parents. The second set of genotypes in the first data frame is an example since the child is "T/T", but the mother is "C/C".
Hope that helps!

How can I determine if a string is a concatenation of a string list

Suppose we are given a string S, and a list of some other strings L.
How can we know whether S is one of the possible concatenations of L?
For example:
S = "abcdabce"
L = ["abcd", "a", "bc", "e"]
S is "abcd" + "a" + "bc" + "e", then S is a concatenation of L, whereas "ababcecd" is not.
In order to solve this question, I tried to use DFS/backtracking. The pseudo code is as follows:
boolean isConcatenation(S, L) {
    if (L.length == 1 && S == L[0]) return true;
    for (String s: L) {
        if (S.startsWith(s)) {
            markAsVisited(s);
            if (isConcatenation(S.exclude(s), L.exclude(s)))
                return true;
            markAsUnvisited(s);
        }
    }
    return false;
}
However, DFS/backtracking is not an efficient solution. I am curious what the fastest algorithm for this question is, or whether there is any other algorithm that solves it faster. I hope there are algorithms like KMP that can solve it in O(n) time.
In Python:
>>> import itertools
>>> yes = 'abcdabce'
>>> no = 'ababcecd'
>>> L = ['abcd','a','bc','e']
>>> yes in [''.join(p) for p in itertools.permutations(L)]
True
>>> no in [''.join(p) for p in itertools.permutations(L)]
False
Edit: as pointed out, this has O(n!) complexity, so it is inappropriate for large L. But hey, development time under 10 seconds.
You can instead build your own permutation generator, starting with the basic permutator:
def all_perms(elements):
    if len(elements) <= 1:
        yield elements
    else:
        for perm in all_perms(elements[1:]):
            for i in range(len(elements)):
                yield perm[:i] + elements[0:1] + perm[i:]
And then discard branches that you don't care about by tracking what the concatenation of the elements would be and only iterating if it adds up to your target string.
def all_perms(elements, conc=''):
...
for perm in all_perms(elements[1:], conc + elements[0]):
...
if target.startswith(''.join(conc)):
...
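One possible way to flesh out the pruning idea is sketched below. It is not the answerer's elided code: the name is_concatenation and the helper extend are hypothetical, and instead of generating permutations it extends a matched prefix one string at a time, abandoning a branch as soon as the next string does not match the target. It assumes every string in L must be used exactly once.

```python
def is_concatenation(target, parts):
    """Return True if target is some ordering of all parts joined together."""
    # Quick rejection: the total length has to match.
    if sum(len(p) for p in parts) != len(target):
        return False

    def extend(pos, remaining):
        # pos: number of characters of target matched so far
        # remaining: the parts that have not been used yet
        if not remaining:
            return pos == len(target)
        tried = set()
        for i, part in enumerate(remaining):
            if part in tried:
                continue  # identical strings lead to identical branches
            tried.add(part)
            # Prune: only recurse when part matches target at position pos.
            if target.startswith(part, pos):
                if extend(pos + len(part), remaining[:i] + remaining[i + 1:]):
                    return True
        return False

    return extend(0, list(parts))


print(is_concatenation('abcdabce', ['abcd', 'a', 'bc', 'e']))
print(is_concatenation('ababcecd', ['abcd', 'a', 'bc', 'e']))
```

The worst case is still exponential, but branches that cannot match the target are cut off early.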
A dynamic programming approach would be to work left to right, building up an array A[x] where A[x] is true if the first x characters of the string form one of the possible concatenations of L. You can work out A[n] from the earlier entries A[0..n-1] by checking each candidate string in the list: if the last k of the first n characters of S match a candidate string of length k and A[n-k] is true, then you can set A[n] to true.
I note that you can use https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to find the matches you need as input to the dynamic program - the matching costs will be linear in the size of the input string, the total size of all candidate strings, and the number of matches between the input string and candidate strings.
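The dynamic programming idea above might look roughly like the following sketch. The name can_segment is hypothetical, and the sketch assumes that strings from L may be reused any number of times, since the array A alone does not record which strings have already been consumed. It uses plain startswith checks rather than Aho-Corasick to find the matches.

```python
def can_segment(s, words):
    """Return True if s can be written as a concatenation of words.

    Assumption: a word may be used any number of times.
    """
    n = len(s)
    # reachable[x] plays the role of A[x]: s[:x] can be built from words.
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is trivially buildable
    for end in range(1, n + 1):
        for w in words:
            k = len(w)
            # w must fit, the prefix before it must be buildable,
            # and w must match the k characters ending at position end.
            if 0 < k <= end and reachable[end - k] and s.startswith(w, end - k):
                reachable[end] = True
                break
    return reachable[n]


print(can_segment('abcdabce', ['abcd', 'a', 'bc', 'e']))
print(can_segment('ababcecd', ['abcd', 'a', 'bc', 'e']))
```

The running time of this version is proportional to the length of S times the total length of the candidate strings.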
I would try the following:
find all positions of the L_i patterns in S
let n = length(S) + 1
create a graph with n nodes
for every position i at which L_i matches, add directed edges: node_i --> L_i match node --> node_{i+length(L_i)}
to enforce the permutation constraint, you have to add some more nodes/edges to exclude multiple uses of the same pattern
Now I can ask a new question: does a directed path from 0 to n exist?
Notes:
if there is a node (0 < i < n) with degree < 2, then no match is possible
all nodes which have in-degree 1 and out-degree 1 (d- = 1, d+ = 1) are part of the permutation
use breadth-first search or Dijkstra to look for the solution
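A simplified sketch of the graph idea follows, with some loudly stated assumptions: the name path_exists is hypothetical, the nodes are numbered 0 to len(S), each match becomes a single direct edge (the intermediate "match" nodes are left out), and the extra nodes/edges needed to forbid reusing a pattern are not modelled at all. It only answers whether some path from the first node to the last exists.

```python
from collections import deque


def path_exists(s, patterns):
    """Breadth-first search from node 0 to node len(s) over match edges.

    Assumption: the rule that each pattern is used only once is ignored.
    """
    n = len(s)
    # edges[i] lists the nodes reachable from i by matching one pattern at i.
    edges = [[] for _ in range(n + 1)]
    for i in range(n):
        for p in patterns:
            if p and s.startswith(p, i):
                edges[i].append(i + len(p))
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        if node == n:
            return True
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


print(path_exists('abcdabce', ['abcd', 'a', 'bc', 'e']))
print(path_exists('ababcecd', ['abcd', 'a', 'bc', 'e']))
```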
You can use the Trie data structure. First, construct a trie from strings in L.
Then, for the input string S, search for the S in the trie.
During the search, for every visited node which is the end of one of the words in L, start a new search on the trie (from the root) with the remaining (not yet matched) suffix of S. So we are using recursion. If you consume all characters of S in that process, then you know that S is a concatenation of some strings from L.
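A minimal sketch of the trie search, under a few assumptions: the names build_trie and is_concatenation_of_words are hypothetical, the trie is a plain nested dict, strings from L may be reused, and the memoization via lru_cache is an addition that is not part of the description above (it keeps the recursion from re-exploring the same suffix).

```python
from functools import lru_cache

END = ""  # sentinel key marking the end of a word; never equals a character


def build_trie(words):
    """Build a nested-dict trie from the given words."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def is_concatenation_of_words(s, words):
    """Return True if s splits into words from the list (reuse allowed)."""
    root = build_trie(w for w in words if w)

    @lru_cache(maxsize=None)
    def search(start):
        # True if the suffix s[start:] can be split into words of the trie.
        if start == len(s):
            return True
        node = root
        for i in range(start, len(s)):
            node = node.get(s[i])
            if node is None:
                return False  # no word continues along this path
            # A word ends here: restart from the root on the rest of s.
            if END in node and search(i + 1):
                return True
        return False

    return search(0)


print(is_concatenation_of_words('abcdabce', ['abcd', 'a', 'bc', 'e']))
print(is_concatenation_of_words('ababcecd', ['abcd', 'a', 'bc', 'e']))
```

For very long inputs the recursion depth could become a problem; an iterative version would avoid that.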
I would suggest this solution:
Take an array of size 256 which stores the occurrence count of each character across all strings of L. Now try to match that against the count of each character of S. If the two are unequal, then we can confidently say that the strings of L cannot form the given string.
If the counts are the same, do the following: using the KMP algorithm, try to find each string of L in S simultaneously. Whenever there is a match, we remove that string from L and continue searching for the other strings in L. If at any time we don't find a match, we just print that S cannot be represented. If L is empty at the end, we conclude that S is indeed a concatenation of L.
This assumes that L is a set of unique strings.
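The character-count pre-check from the first step can be sketched as below. The function name same_character_counts is hypothetical, and collections.Counter stands in for the fixed array of size 256. Note that equal counts are only a necessary condition, as the second call shows.

```python
from collections import Counter


def same_character_counts(s, parts):
    """Necessary (but not sufficient) condition for s to be built from all parts."""
    return Counter(s) == Counter("".join(parts))


parts = ['abcd', 'a', 'bc', 'e']
print(same_character_counts('abcdabce', parts))  # counts agree
print(same_character_counts('ababcecd', parts))  # counts agree, yet it is not a concatenation
print(same_character_counts('abcdabcf', parts))  # counts differ, so it can be rejected at once
```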
Two Haskell propositions:
There may be some counter examples to this...just for fun...sort L by a custom sort:
import Data.List (sortBy,isInfixOf)
h s l = (concat . sortBy wierd $ l) == s where
  wierd a b | isInfixOf (a ++ b) s = LT
            | isInfixOf (b ++ a) s = GT
            | otherwise = EQ
More boring...attempt to build S from L:
import Data.List (delete,isPrefixOf)
f s l = g s l [] where
  g str subs result
    | concat result == s = [result]
    | otherwise =
        if null str || null subs'
          then []
          else do sub <- subs'
                  g (drop (length sub) str) (delete sub subs) (result ++ [sub])
    where subs' = filter (flip isPrefixOf str) subs
Output:
*Main> f "abcdabce" ["abcd", "a", "bc", "e", "abc"]
[["abcd","a","bc","e"],["abcd","abc","e"]]
*Main> h "abcdabce" ["abcd", "a", "bc", "e", "abc"]
False
*Main> h "abcdabce" ["abcd", "a", "bc", "e"]
True
Your algorithm has complexity N^2 (N is the length of the list). Let's see it in actual C++:
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
using namespace std;
typedef pair<string::const_iterator, string::const_iterator> stringp;
typedef vector<string> strings;
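bool isConcatenation(stringp S, const strings &L) {
    for (strings::const_iterator p = L.begin(); p != L.end(); ++p) {
        // the four-iterator overload (C++14) never reads past the end of S
        auto M = mismatch(p->begin(), p->end(), S.first, S.second);
        if (M.first == p->end()) {  // *p is a prefix of the remaining input
            if (L.size() == 1) {
                // last string: succeed only if it consumes the rest of S
                if (M.second == S.second)
                    return true;
                continue;
            }
            // T is L without the string that was just matched
            strings T;
            T.insert(T.end(), L.begin(), p);
            strings::const_iterator v = p;
            T.insert(T.end(), ++v, L.end());
            if (isConcatenation(make_pair(M.second, S.second), T))
                return true;
        }
    }
    return false;
}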
Instead of looping over the entire vector, we could sort it and then reduce the search to O(log N) steps in the best case, where all strings start with different characters. The worst case will remain O(N^2).

Non character argument in R string split function (strsplit)

This works
x <- "0.466:1.187:2.216:1.196"
y <- as.numeric(unlist(strsplit(x, ":")))
Values of blat$LRwAvg all look like x above, but this doesn't work:
for (i in 1:50){
y <- as.numeric(unlist(strsplit(blat$LRwAvg[i], "\\:")))
blat$meanLRwAvg[i]=mean(y)
}
Because of:
Error in strsplit(blat$LRwAvg[i], "\:") : non-character argument
It doesn't matter whether I use one, two, or no backslashes.
What's my problem? (Not generally, I mean in this special task, technically)
As agstudy implied, running blat$LRwAvg <- as.character(blat$LRwAvg) before the loop fixed it:
blat$meanLRwAvg <- blat$gtFrqAvg #or some other variable in data frame with equal length
blat$LRwAvg <- as.character(blat$LRwAvg)
for (i in 1:50){
y <- as.numeric(unlist(strsplit(blat$LRwAvg[i], "\\:")))
blat$meanLRwAvg[i]=mean(y)
}

R: How to replace a character in a string after sampling and print out character instead of index?

I'd like to replace a character in a string with another character, by first sampling by the character. I'm having trouble having it print out the character instead of the index.
Example data, is labelled "try":
L 0.970223325 - 0.019851117 X 0.007444169
K 0.962779156 - 0.027295285 Q 0.004962779
P 0.972704715 - 0.027295285 NA 0
C 0.970223325 - 0.027295285 L 0.00248139
V 0.970223325 - 0.027295285 T 0.00248139
I'm trying to sample a character for a given row using weighted probabilities.
samp <- function(row) {
sample(try[row,seq(1, length(try), 2)], 1, prob = try[row,seq(2, length(try), 2)])
}
Then, I want to use the selected character to replace a position in a given string.
subchar <- function(string, pos, new) {
paste(substr(string, 1, pos-1), new , substr(string, pos+1, nchar(string)), sep='')
}
My question is - if I do, for example
> subchar("KLMN", 3, samp(4))
[1] "KL1N"
But I want it to read "KLCN". as.character(samp(4)) doesn't work either. How do I get it to print out the character instead of the index?
The problem arises because your letters are stored as factors rather than characters, and samp is returning a data.frame.
C is the first level in your factor so that is stored as 1 internally, and as.character (which gets invoked by the paste statement) pulls this out when working on the mini-data.frame:
samp(4)
V1
4 C
as.character(samp(4))
[1] "1"
You can solve this in 2 ways, either dropping the data.frame of the samp output in your call to subchar, or modifying samp to do so:
subchar("KLMN", 3, samp(4)[,1])
[1] "KLCN"
samp2 <- function(row) {
  sample(try[row,seq(1, length(try), 2)], 1, prob = try[row,seq(2, length(try), 2)])[,1]
}
subchar("KLMN",3,samp2(4))
[1] "KLCN"
You may also find it easier to sample within your subsetting, and you can drop the data.frame from there:
samp3 <- function(row){
try[row,sample(seq(1,length(try),2),1,prob=try[row,seq(2,length(try),2)]),drop=TRUE]
}

Merging through fuzzy matching of variables in R

I have two data frames (x & y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white spaces, etc.), about 60% of the values do not align, though I can eyeball the data and see that they should. Is there a way to reduce the level of non-matching so that manual editing becomes at least feasible? The data frames have about 700K observations.
R would be best. I know a little bit of python, and some basic unix tools. P.S. I read up on agrep(), but don't understand how that can work on actual datasets, especially when the match is over more than one variable.
update (data for posted bounty):
Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename or c) both.
You can source the file test_sites.R, which is posted as a gist.
Ideally the answer would end with
merge(sites_a, sites_b, by = **magic**)
The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution. But this is a suggestion... It records matches in a separate list (if there are multiple equally good matches, then these are recorded as well). Let's say that your data.frame is called df:
l <- vector('list',nrow(df))
matches <- list(mother = l,father = l)
for(i in 1:nrow(df)){
father_id <- with(df,which(student_name[i] == father_name))
if(length(father_id) == 1){
matches[['father']][[i]] <- father_id
} else {
old_father_id <- NULL
## try to find the total
for(m in 10:1){ ## m is the maximum distance
father_id <- with(df,agrep(student_name[i],father_name,max.dist = m))
if(length(father_id) == 1 || m == 1){
## if we find a unique match or if we are in our last round, then stop
matches[['father']][[i]] <- father_id
break
} else if(length(father_id) == 0 && length(old_father_id) > 0) {
## if we can't do better than multiple matches, then record them anyway
matches[['father']][[i]] <- old_father_id
break
} else if(length(father_id) == 0 && length(old_father_id) == 0) {
## if the nearest match is more than 10 different from the current pattern, then stop
break
}
## remember this round's matches before tightening the distance
old_father_id <- father_id
}
}
}
The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.
This takes a list of common column names, matches based on agrep of all those columns combined, and then if all.x or all.y equals TRUE it appends non-matching records filling in missing columns with NA. Unlike merge, the column names to match on need to be the same in each data frame. The challenge would seem to be setting the agrep options correctly to avoid spurious matches.
agrepMerge <- function(df1, df2, by, all.x = FALSE, all.y = FALSE,
ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) {
df1$index <- apply(df1[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
df2$index <- apply(df2[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
matches <- lapply(seq_along(df1$index), function(i, ...) {
agrep(df1$index[i], df2$index, ignore.case = ignore.case, value = value,
max.distance = max.distance, useBytes = useBytes)
})
df1_match <- rep(1:nrow(df1), sapply(matches, length))
df2_match <- unlist(matches)
df1_hits <- df1[df1_match,]
df2_hits <- df2[df2_match,]
df1_miss <- df1[setdiff(seq_along(df1$index), df1_match),]
df2_miss <- df2[setdiff(seq_along(df2$index), df2_match),]
remove_cols <- colnames(df2_hits) %in% colnames(df1_hits)
df_out <- cbind(df1_hits, df2_hits[,!remove_cols])
if(all.x) {
missing_cols <- setdiff(colnames(df_out), colnames(df1_miss))
df1_miss[missing_cols] <- NA
df_out <- rbind(df_out, df1_miss)
}
if(all.y) {
missing_cols <- setdiff(colnames(df_out), colnames(df2_miss))
df2_miss[missing_cols] <- NA
df_out <- rbind(df_out, df2_miss)
}
df_out[,setdiff(colnames(df_out), "index")]
}

Resources