Split string into 100-word parts in R

How do I split a single huge "character" string into smaller ones, each containing exactly 100 words?
For example, this is how I used to split it into single words:
myCharSplitByWords <- strsplit(myCharUnSplit, " ")[[1]]
I think this can probably be done with a regex (maybe by selecting every 100th space or something), but I couldn't write a proper expression.
I'm new to R and I'm totally stuck. Thanks.

Maybe there is a way using regular expressions but after strsplit it would be easier to group the words by "hand":
## example data
set.seed(1)
string <- paste0(sample(c(LETTERS[1:10], " "), 1e5, replace=TRUE), collapse="")
## split if there is at least one space
words <- strsplit(string, "\\s+")[[1]]
## build group index
group <- rep(seq(ceiling(length(words)/100)), each=100)[1:length(words)]
## split by group index
words100 <- split(words, group)

You can get every 100th instances of a run of spaces preceded by a run of non-spaces (if that's your definition of a word) by:
ind <- gregexpr("([^ ]+? +){100}", string)[[1]]
and then substring your original by
hundredWords <- substring(string, ind, c(ind[-1]-1, nchar(string)))
This will leave trailing spaces at the end of each entry, and the final entry will not necessarily have exactly 100 words: it runs from the start of the last full match to the end of the string, so it also picks up whatever words are left over after the batches of 100. If you have another definition of word delimiter (tabs, punctuation, ...) then post that and we can change the regular expression accordingly.

Related

In DrRacket, how do I check whether a string has a certain number of characters, and how do I determine the first character of a string?

Basically I have a problem, here is the information needed to solve the problem.
PigLatin. Pig Latin is a way of rearranging letters in English words for fun. For example, the sentence “pig latin is stupid” becomes “igpay atinlay isway upidstay”.
Vowels (‘a’, ‘e’, ‘i’, ‘o’, and ‘u’) are treated separately from the consonants (any letter that isn’t a vowel).
For simplicity, we will consider ‘y’ to always be a consonant. Although various forms of Pig Latin exist, we will use the following rules:
(1) Words of two letters or less simply have “way” added on the end. So “a” becomes “away”.
(2) In any word that starts with consonants, the consonants are moved to the end, and “ay” is added. If a word begins with more than two consonants, move only the first two letters. So “hello” becomes “ellohay”, and “string” becomes “ringstay”.
(3) Any word which begins with a vowel simply has “way” added on the end. So “explain” becomes “explainway”.
Write a function (pig-latin L) that consumes a non-empty (listof Str) and returns a Str containing the words in L converted to Pig Latin.
Each value in L should contain only lower case letters and have a length of at least 1.
I understand that I need to set three main conditions here, but I'm struggling with Racket and with learning the proper syntax to write out my solutions. First, I need a condition that looks at a string and checks whether its length is 2 or less, to meet condition (1). For (2) I need to look at the first two characters in a string; I'm assuming I have to convert the string into a list of chars (string->list). For (3) I understand I just have to look at the first character in the string, so I basically have to repeat what I did for (2) but look only at the first character.
I don't know how to manipulate a list of chars, though. I also don't know how to check that string-length meets a criterion. Any assistance would be appreciated. I have barely any code for my problem since I am baffled about what to do here.
An example of the problem is:
(pig-latin (list "this" "is" "a" "crazy" "exercise")) =>
"isthay isway away azycray exerciseway"
The best strategy to solve this problem is:
Check in the documentation all the available string procedures. We don't need to transform the input string to a list of chars to operate upon it, and you'll find that there are existing procedures that meet all of our needs.
Write helper procedures. In fact, we only need a procedure that tells us if a string contains a vowel at a given position; the problem states that only a-z characters are used so we can negate this procedure to also find consonants.
It's also important to identify the best order to write the conditions, for example: conditions 1 and 3 can be combined in a single case. This is my proposal:
(define (vowel-at-index? text index)
  (member (string-ref text index)
          '(#\a #\e #\i #\o #\u)))
(define (pigify text)
  ; cases 1 and 3
  (cond ((or (<= (string-length text) 2)
             (vowel-at-index? text 0))
         (string-append text "way"))
        ; case 2.1
        ((and (not (vowel-at-index? text 0))
              (vowel-at-index? text 1))
         (string-append (substring text 1)
                        (substring text 0 1)
                        "ay"))
        ; case 2.2
        (else
         (string-append (substring text 2)
                        (substring text 0 2)
                        "ay"))))
(define (pig-latin lst)
  (string-join (map pigify lst)))
For the final step, we only need to apply the pigify procedure to each element in the input, and that's what map does. It works as expected:
(pig-latin '("this" "is" "a" "crazy" "exercise"))
=> "isthay isway away azycray exerciseway"

How can I replace each letter in the sentence to sentence without breaking it?

Here's my problem.
sentence = "This car is awsome."
and what I want to do is
sentence.replace("a","<emoji:a>")
sentence.replace("b","<emoji:b>")
sentence.replace("c","<emoji:c>")
and so on...
But of course, if I do it that way, the letters in "<emoji:>" will also be replaced as I go along. So how can I do it another way?
As Carlos Gonzalez suggested:
create a mapping dict and apply it to each character in sequence:
sentence = "This car is awsome."
# mapping
up = {"a":"<emoji:a>",
"b":"<emoji:b>",
"c":"<emoji:c>",}
# apply mapping to create a new text (use up[k] if present else default to k)
text = ''.join( (up.get(k,k) for k in sentence) )
print(text)
Output:
This <emoji:c><emoji:a>r is <emoji:a>wsome.
The advantage of the generator expression inside the ''.join( ... generator ...) is that it takes each single character of sentence and either keeps it or replaces it. It only ever touches each char once, so there is no danger of multiple substitutions and it takes only one pass of sentence to convert the whole thing.
Docs: dict.get(key,default) and Why dict.get(key) instead of dict[key]?
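As a possible alternative to the dict.get/join approach above (this is a sketch, not the answerer's code), the built-in str.translate also handles each character exactly once. It assumes every key is a single character, which is what str.maketrans requires when given a dict:

```python
sentence = "This car is awsome."

# str.maketrans accepts a dict that maps single characters to replacement
# strings of any length; characters without an entry are left unchanged.
table = str.maketrans({"a": "<emoji:a>", "b": "<emoji:b>", "c": "<emoji:c>"})

text = sentence.translate(table)
print(text)
```

Because the table is applied per input character, the letters inside the inserted "<emoji:...>" strings are never looked at again.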
If you used
sentence = sentence.replace("a","o")
sentence = sentence.replace("o","k")
you would first turn every a into o and then turn every o (including those that used to be a) into k, and you would have to touch each character twice to make it happen.
Using
up = { "a":"o", "o":"k" }
text = ''.join( (up.get(k,k) for k in sentence) )
avoids this.
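To make the difference concrete, here is a small side-by-side sketch using the same sentence and the same two substitutions (the variable names chained and single_pass are only for illustration):

```python
sentence = "This car is awsome."

# Chained replace: the second call also rewrites every "o" that the
# first call produced from an "a".
chained = sentence.replace("a", "o").replace("o", "k")

# Single pass: each original character is looked up exactly once.
up = {"a": "o", "o": "k"}
single_pass = "".join(up.get(ch, ch) for ch in sentence)

print(chained)      # This ckr is kwskme.
print(single_pass)  # This cor is owskme.
```

In the chained version no "o" survives at all, while the single-pass version keeps the "o" that came from an original "a".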
If you want to replace more than one character at a time, it would be easier to do this with a regex. Inspired by Passing a function to re.sub in Python
import re
sentence = "This car is awsome."
up = {"is":"Yippi",
"ws":"WhatNot",}
# modified it to create the groups using the dicts key
text2 = re.sub( "("+'|'.join(up)+")", lambda x: up[x.group()], sentence)
print(text2)
Output:
ThYippi car Yippi aWhatNotome.
Docs: re.sub(pattern, repl, string, count=0, flags=0)
You would have to take extra care with your keys if they contain characters that have a special meaning in a regex pattern, e.g. .+*?()[]^$
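One way to handle such keys is to escape them with re.escape before building the pattern. The following is a minimal sketch; the sentence and the mapping are made up for illustration, and sorting the keys by length is an extra precaution so that a short key cannot win over a longer key starting at the same position:

```python
import re

sentence = "Is 1+1 = 2? Yes."

# Hypothetical mapping whose keys contain regex metacharacters.
up = {"1+1": "two", "?": "!", ".": ";"}

# Escape every key so it is matched literally, longest keys first.
keys = sorted(up, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

text = pattern.sub(lambda m: up[m.group()], sentence)
print(text)  # Is two = 2! Yes;
```

Without re.escape, the key "1+1" would be read as "one or more 1s followed by 1", and "?" on its own would not even compile as a pattern.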

How can I remove repeated characters in a string with R?

I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste
My function is going to be used with strings written in spanish, so it is not that common (or at least correct) to find words that have more than three successive vowels. No bother about the possible sentiment behind them. Nonetheless, there are words that can have two successive consonants (especially ll and rr), but we could skip this from our function.
So, to sum up, this function should replace the letters that appear at least three times in a row with just that letter. In one of the examples above, aaaaaaaaa is replaced with a.
Could you give me any hints to carry out this task with R?
I did not think very carefully on this, but this is my quick solution using references in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
() captures a letter first, \\1 refers to that letter, and + means to match it once or more; putting all these pieces together, we can match a letter repeated two or more times.
To include other characters besides letters, replace [[:alpha:]] with a regex matching whatever you wish to include.
I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not work with "Good Luck" in the manner you desire:
removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"
Since you want to replace letters that appear AT LEAST 3 times, here is my solution:
gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
As you can see the 4 "a" have been reduced to only 1 a, the 3 r have been reduced to 1 r but the 2 n and the 2 e have not been changed.
As suggested above, you can replace [[:alpha:]] with any character class such as [a-zA-KM-Z]. Note that | is not an "or" operator inside square brackets: a class already matches any one of its characters, so [yQ] restricts the code to repetitions of y and Q, whereas [y|Q] would also match a literal |.
gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# the triple r is not affected and there is no triple e.

Count word occurrences in R

Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.
Let's assume for the moment that you wanted the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3
If you only want the elements that are exactly equal to "corn", you can also do something like the following (which gives 1 for this dataset):
length(dataset[which(dataset=="corn")])
I'd just do it with string division like:
library(roperators)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')
You can use the str_count function from the stringr package to get the number of keywords that match a given character vector.
The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The regular expression syntax is very flexible and allows matching whole words as well as character patterns.
For example the following code will count all occurrences of the string "corn" and will return 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
The "\b" is used to specify a word boundary. When using the str_count function, the default definition of a word boundary includes the apostrophe. So if your dataset contains the string "corn's", it would be matched and included in the result.
This is because the apostrophe is considered a word boundary by default. To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = TRUE. This will cause the regular expression engine to use the Unicode TR 29 definition of word boundaries. See http://unicode.org/reports/tr29/tr29-4.html. This definition does not consider the apostrophe a word boundary.
The following code will give the number of times the word "corn" occurs. Words such as "corn's" will not be included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = TRUE)))

Wrapping strings, but not substrings in quotes, using R

This question is related to my question about Roxygen.
I want to write a new function that does word wrapping of strings, similar to strwrap or stringr::str_wrap, but with the following twist: Any elements (substrings) in the string that are enclosed in quotes must not be allowed to wrap.
So, for example, using the following sample data
test <- "function(x=123456789, y=\"This is a long string argument\")"
cat(test)
function(x=123456789, y="This is a long string argument")
strwrap(test, width=40)
[1] "function(x=123456789, y=\"This is a long"
[2] "string argument\")"
I want the desired output of a newWrapFunction(x, width=40, ...) to be:
desired <- c("function(x=123456789, ", "y=\"This is a long string argument\")")
desired
[1] "function(x=123456789, "
[2] "y=\"This is a long string argument\")"
identical(desired, newWrapFunction(test, width=40))
[1] TRUE
Can you think of a way to do this?
PS. If you can help me solve this, I will propose this code as a patch to roxygen2. I have identified where this patch should be applied and will acknowledge your contribution.
Here's what I did to get strwrap so it would not break single quoted sections on spaces:
A) Pre-process the "even" sections after splitting by the single-quotes by substituting "~|~" for the spaces:
Define new function strwrapqt
....
zz <- strsplit(x, "\'") # will be only working on even numbered sections
for (i in seq_along(zz) ){
for (evens in seq(2, length(zz[[i]]), by=2)) {
zz[[i]][evens] <- gsub("[ ]", "~|~", zz[[i]][evens])}
}
zz <- unlist(zz)
.... insert just before
z <- lapply(strsplit) ...........
Then at the end replace all the "~|~" with spaces. It might be necessary to do a lot more thinking about the other sorts of whitespace "events" to get a fully regular treatment.
....
y <- gsub("~\\|~", " ", y)
....
Edit: Tested @joran's suggestion. Matching single and double quotes would be a difficult task with the methods I am using, but if one were willing to treat any quote as an equally valid separator target, one could just use zz <- strsplit(x, "\'|\"") as the splitting criterion in the code above.
