Automatic acronyms of strings in R

Long strings in plots aren't always attractive. What's the shortest way of making an acronym in R? E.g., "Hello world" to "HW", and preferably to have unique acronyms.
There's the function abbreviate, but it just removes some letters from the phrase instead of taking the first letter of each word.

An easy way would be to use a combination of strsplit, substr, and make.unique.
Here's an example function that can be written:
makeInitials <- function(charVec) {
  make.unique(vapply(strsplit(toupper(charVec), " "),
                     function(x) paste(substr(x, 1, 1), collapse = ""),
                     vector("character", 1L)))
}
Test it out:
X <- c("Hello World", "Home Work", "holidays with children", "Hello Europe")
makeInitials(X)
# [1] "HW" "HW.1" "HWC" "HE"
That said, I do think that abbreviate should suffice, if you use some of its arguments:
abbreviate(X, minlength=1)
# Hello World Home Work holidays with children Hello Europe
# "HlW" "HmW" "hwc" "HE"

Using regex you can do the following. The pattern ((?<=\\s).|^.) matches any character that is preceded by whitespace, or the first character of the string. We then paste the matches together with the collapse argument to get an acronym built from the first letters. And as Ananda suggested, if you want the results to be unique, pass them through make.unique.
X <- c("Hello World", "Home Work", "holidays with children")
sapply(regmatches(X, gregexpr(pattern = "((?<=\\s).|^.)", text = X, perl = T)), paste, collapse = ".")
## [1] "H.W" "H.W" "h.w.c"
# If you want to make unique
make.unique(sapply(regmatches(X, gregexpr(pattern = "((?<=\\s).|^.)", text = X, perl = T)), paste, collapse = "."))
## [1] "H.W" "H.W.1" "h.w.c"
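For comparison outside R, the same idea (first letter of each word, then a numeric suffix for repeats) can be sketched in Python. This is only an illustration: make_initials is a hypothetical helper, and its suffixing is a simplified imitation of make.unique, not a reimplementation of it.

```python
from collections import Counter


def make_initials(phrases):
    """Return upper-case initials for each phrase, suffixing repeats with .1, .2, ..."""
    seen = Counter()
    result = []
    for phrase in phrases:
        # str.split() with no argument splits on any run of whitespace.
        initials = "".join(word[0] for word in phrase.split()).upper()
        count = seen[initials]
        # The first occurrence is kept as-is; later ones get a numeric suffix.
        result.append(initials if count == 0 else f"{initials}.{count}")
        seen[initials] += 1
    return result


print(make_initials(["Hello World", "Home Work", "holidays with children", "Hello Europe"]))
# ['HW', 'HW.1', 'HWC', 'HE']
```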

Related

Matlab: Find string pattern with a list of words and replace in text with one word of the list

In MATLAB, consider the string:
str = 'text text text [[word1,word2,word3]] text text'
I want to isolate randomly one word of the list ('word1','word2','word3'), say 'word2', and then write, in a possibly new file, the string:
strnew = 'text text text word2 text text'
My approach is as follows (certainly pretty bad):
Isolating the string '[[word1,word2,word3]]' can be achieved via
str2=regexp(str,'\[\[(.*?)\]\]','match','once') % 'once' returns a char vector instead of a cell
Removing the opening and closing square brackets in the string is achieved via
str3=str2(3:end-2)
Finally we can split str3 into a list of words (stored in a cell)
ListOfWords = split(str3,',')
which outputs {'word1'}{'word2'}{'word3'} and I am stuck there. How can I pick one of the entries and plug it back into the initial string (or a copy of it...)? Note that the delimiters [[ and ]] could both be changed to || if it can help.
You can do it as follows:
1) Use regexp with the 'split' option;
2) Split the middle part into words;
3) Select a random word;
4) Concatenate back.
str = 'text text text [[word1,word2,word3]] text text'; % input
str_split = regexp(str, '\[\[|\]\]', 'split'); % step 1
list_of_words = split(str_split{2}, ','); % step 2
chosen_word = list_of_words{randi(numel(list_of_words))}; % step 3
strnew = [str_split{1} chosen_word str_split{3}]; % step 4
I have a horrible solution. I was trying to see if I could do it in one function call. You can... but at what cost! Abusing dynamic regular expressions like this barely counts as one function call.
I use a dynamic expression to process the comma separated list. The tricky part is selecting a random element. This is made exceedingly difficult because MATLAB's syntax doesn't support paren indexing off the result of a function call. To get around this, I stick it in a struct so I can dot index. This is terrible.
>> regexprep(str,'\[\[(.*)\]\]',"${struct('tmp',split(string($1),',')).tmp(randi(count($1,',')+1))}")
ans =
'text text text word3 text text'
Luis definitely has the best answer, but I think it could be simplified a smidge by not using regular expressions.
str = 'text text text [[word1,word2,word3]] text text'; % input
tmp = extractBetween(str,"[[","]]"); % step 1
tmp = split(tmp, ','); % step 2
chosen_word = tmp(randi(numel(tmp))) ; % step 3
strnew = replaceBetween(str,"[[","]]",chosen_word,"Boundaries","Inclusive") % step 4
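The same four steps (locate the bracketed list, split it, pick a random entry, substitute it back) can also be sketched in Python, where a regex substitution accepts a function as the replacement. This is a hedged illustration rather than a translation of any answer above; pick_one and its rng parameter are hypothetical names.

```python
import random
import re


def pick_one(text, rng=random):
    """Replace every [[a,b,c]] group in text with one randomly chosen item."""
    def choose(match):
        # group(1) is the comma-separated list between the double brackets.
        return rng.choice(match.group(1).split(","))
    return re.sub(r"\[\[(.*?)\]\]", choose, text)


print(pick_one("text text text [[word1,word2,word3]] text text"))
```

Passing a seeded random.Random instance as rng makes the choice reproducible.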

Finding the "difference" between two string texts (Lua example)

I'm trying to find the difference in text between two string values in Lua, and I'm just not quite sure how to do this effectively. I'm not very experienced in working with string patterns, and I'm sure that's my downfall on this one. Here's an example:
-- Original text
local text1 = "hello there"
-- Changed text
local text2 = "hello.there"
-- Finding the alteration of original text with some "pattern"
print(text2:match("pattern"))
In the example above, I'd want to output the text ".", since that's the difference between the two texts. Same goes for cases where the difference could be sensitive to a string pattern, like this:
local text1 = "hello there"
local text2 = "hello()there"
print(text2:match("pattern"))
In this example, I'd want to print "(" since at that point the new string is no longer consistent with the old one.
If anyone has any insight on this, I'd really appreciate it. Sorry I couldn't give more to work with code-wise, I'm just not sure where to begin.
Just iterate over the strings and find when they don't match.
function StringDifference(str1,str2)
    for i = 1,#str1 do --Loop over strings
        if str1:sub(i,i) ~= str2:sub(i,i) then --If that character is not equal to its counterpart
            return i --Return that index
        end
    end
    return #str1+1 --No mismatch within str1: return the index just past its end as fallback.
end
print(StringDifference("hello there", "hello.there"))
local function get_inserted_text(old, new)
    local prv = {}
    for o = 0, #old do
        prv[o] = ""
    end
    for n = 1, #new do
        local nxt = {[0] = new:sub(1, n)}
        local nn = new:sub(n, n)
        for o = 1, #old do
            local result
            if nn == old:sub(o, o) then
                result = prv[o-1]
            else
                result = prv[o]..nn
                if #nxt[o-1] <= #result then
                    result = nxt[o-1]
                end
            end
            nxt[o] = result
        end
        prv = nxt
    end
    return prv[#old]
end
Usage:
print(get_inserted_text("hello there", "hello.there")) --> .
print(get_inserted_text("hello there", "hello()there")) --> ()
print(get_inserted_text("hello there", "hello htere")) --> h
print(get_inserted_text("hello there", "heLlloU theAre")) --> LUA
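As a point of comparison, here is a small Python sketch that gets a similar result from the standard difflib module: it collects every stretch of the new string that has no counterpart in the old one. The name changed_text is hypothetical, and difflib uses its own matching heuristic, so its output is not guaranteed to agree with the Lua function above on every input.

```python
from difflib import SequenceMatcher


def changed_text(old, new):
    """Return the parts of new that are not matched by anything in old."""
    pieces = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' both mean new[j1:j2] has no counterpart in old.
        if tag in ("insert", "replace"):
            pieces.append(new[j1:j2])
    return "".join(pieces)


print(changed_text("hello there", "hello.there"))   # .
print(changed_text("hello there", "hello()there"))  # ()
```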

Generating substrings and random strings in R

Please bear with me, I come from a Python background and I am still learning string manipulation in R.
Ok, so let's say I have a string of length 100 with random A, B, C, or D letters:
> df<-c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
> df
[1] "ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD"
I would like to do the following two things:
1) Generate a '.txt' file that is comprised of 20-length subsections of the above string, each starting one letter after the previous with their own unique name on the line above it, like this:
NAME1
ABCBDBDBCBABABDBCBCB
NAME2
BCBDBDBCBABABDBCBCBD
NAME3
CBDBDBCBABABDBCBCBDB
NAME4
BDBDBCBABABDBCBCBDBD
... and so forth
2) Take that generated list and build from it another list that has exactly the same substrings, the only difference being that one or two of the A, B, C, or Ds are changed to another A, B, C, or D (any of those four letters only).
So, this:
NAME1
ABCBDBDBCBABABDBCBCB
Would become this:
NAME1.1
ABBBDBDBCBDBABDBCBCB
As you can see, the "C" in the third position became a "B" and the "A" in position 11 became a "D", with no implied relationship between those changed letters. Purely random.
I know this is a convoluted question, but like I said, I am still learning basic text and string manipulation in R.
Thanks in advance.
Create a text file of substrings:
n <- 20 # length of substrings
starts <- seq(nchar(df) - n + 1)
v1 <- mapply(substr, starts, starts + n - 1, MoreArgs = list(x = df))
names(v1) <- paste0("NAME", seq_along(v1), "\n")
write.table(v1, file = "filename.txt", quote = FALSE, sep = "",
col.names = FALSE)
Randomly replace one or two letters (A-D):
myfun <- function() {
idx <- sample(seq(n), sample(1:2, 1))
rep <- sample(LETTERS[1:4], length(idx), replace = TRUE)
return(list(idx = idx, rep = rep))
}
new <- replicate(length(v1), myfun(), simplify = FALSE)
v2 <- mapply(function(x, y, z) paste(replace(x, y, z), collapse = ""),
strsplit(v1, ""),
lapply(new, "[[", "idx"),
lapply(new, "[[", "rep"))
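# names(v2) is inherited from v1 and ends in "\n", so rebuild the names rather than appending to them
names(v2) <- paste0("NAME", seq_along(v2), ".1")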
write.table(v2, file = "filename2.txt", quote = FALSE, sep = "\n",
col.names = FALSE)
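Since the question mentions a Python background, here is a minimal Python sketch of the same two steps for comparison: sliding windows of length n, then one or two random substitutions per window. The helper names (windows, mutate, to_text) are hypothetical, the input is a shortened prefix of the string from the question, and unlike the R code above this version always picks a letter different from the current one.

```python
import random


def windows(text, n):
    """All length-n substrings, each starting one character after the previous."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def mutate(sub, alphabet="ABCD", rng=random):
    """Change one or two random positions to a different letter from alphabet."""
    chars = list(sub)
    for pos in rng.sample(range(len(chars)), rng.choice([1, 2])):
        # Exclude the current letter so the position really changes.
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


def to_text(subs, suffix=""):
    """Format as alternating NAME<k><suffix> and substring lines."""
    return "".join(f"NAME{k}{suffix}\n{sub}\n" for k, sub in enumerate(subs, start=1))


subs = windows("ABCBDBDBCBABABDBCBCBDBDBCB", 20)
print(to_text(subs), end="")
print(to_text([mutate(s) for s in subs], suffix=".1"), end="")
```

The strings returned by to_text can be written to a .txt file with an ordinary open(..., "w") call.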
I tried breaking this down into multiple simple steps; hopefully you can learn a few tricks from this:
# Random data
df<-c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
n<-10 # Number of cuts
set.seed(1)
# Pick n random numbers between 1 and the length of string-20
nums<-sample(1:(nchar(df)-20),n,replace=TRUE)
# Make your cuts
cuts<-sapply(nums,function(x) substring(df,x,x+20-1))
# Generate some names
nams<-paste0('NAME',1:n)
# Make it into a matrix, transpose, and then recast into a vector to get alternating names and cuts.
names.and.cuts<-c(t(matrix(c(nams,cuts),ncol=2)))
# Drop a file.
write.table(names.and.cuts,'file.txt',quote=FALSE,row.names=FALSE,col.names = FALSE)
# Pick how many changes are going to be made to each cut.
changes<-sample(1:2,n,replace=TRUE)
# Pick that number of positions to change
pos.changes<-lapply(changes,function(x) sample(1:20,x))
# Find the letter at each position within the corresponding cut (not within df).
letter.at.change.pos<-mapply(function(cut,x) substring(cut,x,x),cuts,pos.changes,SIMPLIFY=FALSE,USE.NAMES=FALSE)
# Make a function that takes any letter, and outputs any other letter from c(A-D)
letter.map<-function(x){
# Make a list of alternate letters.
alternates<-lapply(x,setdiff,x=c('A','B','C','D'))
# Pick one of each
sapply(alternates,sample,size=1)
}
# Find another letter for each
letter.changes<-lapply(letter.at.change.pos,letter.map)
# Make a function to replace character by position
# Inefficient, but who cares.
rep.by.char<-function(str,pos,chars){
for (i in seq_along(pos)) substr(str,pos[i],pos[i])<-chars[i]
str
}
# Change every letter at pos.changes to letter.changes
mod.cuts<-mapply(rep.by.char,cuts,pos.changes,letter.changes,USE.NAMES=FALSE)
# Generate names
nams<-paste0(nams,'.1')
# Use the matrix trick to alternate names. Drop a file.
names.and.mod.cuts<-c(t(matrix(c(nams,mod.cuts),ncol=2)))
write.table(names.and.mod.cuts,'file2.txt',quote=FALSE,row.names=FALSE,col.names = FALSE)
Also, instead of the rep.by.char function, you could just use strsplit and replace like this:
mod.cuts<-mapply(function(x,y,z) paste(replace(x,y,z),collapse=''),
strsplit(cuts,''),pos.changes,letter.changes,USE.NAMES=FALSE)
One way, albeit slowish:
Rgames> foo<-paste(sample(c('a','b','c','d'),20,rep=T),sep='',collapse='')
Rgames> bar<-matrix(unlist(strsplit(foo,'')),ncol=5)
Rgames> bar
[,1] [,2] [,3] [,4] [,5]
[1,] "c" "c" "a" "c" "a"
[2,] "c" "c" "b" "a" "b"
[3,] "b" "b" "a" "c" "d"
[4,] "c" "b" "a" "c" "c"
Now you can select random indices and replace the selected locations with sample(c('a','b','c','d'),1) . For "true" randomness, I wouldn't even force a change - if your newly drawn letter is the same as the original, so be it.
Like this:
ibar<-sample(1:5,4,rep=T) # one random column number for each row
for ( j in 1: 4) bar[j,ibar[j]]<-sample(c('a','b','c','d'),1)
Then, if necessary, recombine each row using paste
For the first part of your question:
df <- c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
nstrchars <- 20
count<- nchar(df)-nstrchars+1 # number of substrings of length nstrchars
length20substrings <- data.frame(length20substrings=sapply(1:count,function(x)substr(df,x,x+nstrchars-1)))
# to save to a text file. I chose not to include row names or a column name in the .txt file
write.table(length20substrings,"length20substrings.txt",row.names=F,col.names=F)
For the second part:
# create a function that will randomly pick one or two spots in a string and replace
# those spots with one of the other characters present in the string:
changefxn<- function(x){
x<-as.character(x)
nc<-nchar(as.character(x))
id<-seq(1,nc)
numchanges<-sample(1:2,1)
ids<-sample(id,numchanges)
chars2repl<-strsplit(x,"")[[1]][ids]
charspresent<-unique(unlist(strsplit(x,"")))
splitstr<-unlist(strsplit(x,""))
# replace each randomly chosen position (ids, not id) with a different character
for (k in seq_along(ids)) {
splitstr[ids[k]] <- sample(setdiff(charspresent,chars2repl[k]),1)
}
newstr<-paste(splitstr,collapse="")
return(newstr)
}
# try it out
changefxn("asbbad")
changefxn("12lkjaf38gs")
# apply changefxn to all the substrings from part 1
length20substrings<-length20substrings[seq_along(length20substrings[,1]),]
newstrings <- lapply(length20substrings, function(ii)changefxn(ii))

How to paste two vectors together and pad at the end?

I would like to paste two character strings together and pad at the end with another character to make the combination a certain length. I was wondering if there was an option to paste that one can pass or another trick that I am missing? I can do this in multiple lines by figuring out the length of each and then calling paste with rep(my_pad_character,N) but I would like to do this in one line.
Ex: paste together "hi" and "hello", and pad with "a" to make the result length 10. The result would be "hihelloaaa".
Here is one option:
s1 <- "hi"
s2 <- "hello"
f <- function(x, y, pad = "a", length = 10) {
out <- paste0(x, y)
nc <- nchar(out)
paste0(out, paste(rep(pad, length - nc), collapse = ""))
}
> f(s1, s2)
[1] "hihelloaaa"
You can use the stringr function str_pad
library(stringr)
str_pad(paste0('hi','hello'), side = 'right', width = 10 , pad = 'a')
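For comparison, the same paste-then-pad step is a one-liner in Python using str.ljust. This is just an illustrative sketch; paste_and_pad is a hypothetical name, and the pad must be a single character.

```python
def paste_and_pad(x, y, pad="a", width=10):
    """Concatenate x and y, then right-pad with pad up to width characters."""
    # str.ljust leaves the string unchanged if it is already width or longer.
    return (x + y).ljust(width, pad)


print(paste_and_pad("hi", "hello"))  # hihelloaaa
```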

R: How can I replace let's say the 5th element within a string?

I would like to convert a string like be33szfuhm100060 into BESZFUHM0060.
In order to replace the small letters with capital letters I've so far used the gsub function.
test1=gsub("be","BE",test)
Is there a way to tell this function to replace the 3rd and 4th string element? If not, I would really appreciate if you could tell me another way to solve this problem. Maybe there is also a more general solution to change a string element at a certain position into a capital letter whatever the element is?
A couple of observations:
Converting a string to uppercase can be done with toupper, e.g.:
> toupper('be33szfuhm100060')
[1] "BE33SZFUHM100060"
You could use substr to extract a substring by character positions and paste to concatenate strings:
> x <- 'be33szfuhm100060'
> paste(substr(x, 1, 2), substr(x, 5, nchar(x)), sep='')
[1] "beszfuhm100060"
As an alternative, if you are going to be doing this a lot:
String <- function(x="") {
x <- as.character(paste(x, collapse=""))
class(x) <- c("String","character")
return(x)
}
"[.String" <- function(x,i,j,...,drop=TRUE) {
unlist(strsplit(x,""))[i]
}
"[<-.String" <- function(x,i,j,...,value) {
tmp <- x[]
tmp[i] <- String(value)
x <- String(tmp)
x
}
print.String <- function(x, ...) cat(x, "\n")
## try it out
> x <- String("be33szfuhm100060")
> x[3:4] <- character(0)
> x
beszfuhm100060
You can use substring to remove the third and fourth elements.
x <- "be33szfuhm100060"
paste(substring(x, 1, 2), substring(x, 5), sep = "")
If you know what portions of the string you want based on their position(s), use substr or substring. As I mentioned in my comment, you can use toupper to coerce characters to uppercase.
paste( toupper(substr(test,1, 2)),
toupper(substr(test,5,10)),
substr(test,12,nchar(test)),sep="")
# [1] "BESZFUHM00060"
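The position-based approach above (drop characters 3, 4, and 11, then uppercase the rest) can also be written as a short Python sketch for comparison. drop_and_upper is a hypothetical helper, and the positions are 1-based to match R's substr.

```python
def drop_and_upper(s, drop):
    """Remove the 1-based positions in drop from s and upper-case what remains."""
    return "".join(ch for i, ch in enumerate(s, start=1) if i not in drop).upper()


print(drop_and_upper("be33szfuhm100060", {3, 4, 11}))  # BESZFUHM00060
```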
