I have a dataset I am working on in R. One of the columns uses a dot (.) instead of a comma (,) in its values, and I think this might be causing problems when I run the regression. Does anyone know what code I should run to change all the dots to commas?
Thanks in advance.
Assuming you have a dataframe named df
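# needs dplyr and stringr; funs() is deprecated, and str_replace_all replaces every dot, not just the first
df %>% mutate(across(everything(), ~ str_replace_all(.x, "\\.", ",")))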
If it's for one column only
df %>% mutate(col1 = gsub("\\.", ",", col1))
Assuming your data is a character vector inside a dataframe
df <- data.frame(var = c("5.1", "30", "..", "75.234.4423.5"))
With gsub
df$var <- gsub("\\.", ",", df$var)
With stringi and purrr
library(stringi)
library(purrr)
df$var <- modify_if(df$var, stri_detect_fixed(df$var, "."),
                    ~stri_sub_replace_all(., stri_locate_all_fixed(., "."), replacement=","))
Output
df
var
1 5,1
2 30
3 ,,
4 75,234,4423,5
I used purrr::modify_if with the predicate stri_detect_fixed(df$var, ".") so that the values without any dots (in my example, 30) are not converted to NA by stringi::stri_sub_replace_all.
The stringi version is more flexible for other purposes: you can pass functions inside the replacement argument when you want a dynamic replacement value.
I cannot comment on whether it will help your regression analysis. I simply answered by giving ways to change dots to commas in a character vector.
I am trying to sort a list of strings. Each string contains numbers and letters and they are separated by space. I want to sort the list based on the numbers.
Example code:
list=["x 10","y 20"]
You may sort with the help of a lambda:
items = ["x 10", "y 20", "z 15"]  # renamed from "list" to avoid shadowing the built-in
items_sorted = sorted(items, key=lambda x: int(x.split()[1]))
print(items_sorted) # ['x 10', 'z 15', 'y 20']
In the snippet above, the lambda expression splits each list element by space, and then casts the second element to an integer. It is this value which is then used to sort the list.
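The same idea can be written with a named key function, which is a little easier to reuse and test. This is only a sketch: the helper name `trailing_number` is made up for illustration, and it assumes the number is always the last whitespace-separated token of each string.

```python
def trailing_number(item: str) -> int:
    """Return the integer found in the last whitespace-separated field."""
    # rsplit with maxsplit=1 splits only once, starting from the right
    return int(item.rsplit(maxsplit=1)[-1])


items = ["x 10", "y 20", "z 15", "long name 5"]

# sorted() returns a new list and leaves the original untouched
items_sorted = sorted(items, key=trailing_number)
print(items_sorted)  # ['long name 5', 'x 10', 'z 15', 'y 20']

# list.sort() sorts in place; reverse=True puts the largest number first
items.sort(key=trailing_number, reverse=True)
print(items)  # ['y 20', 'z 15', 'x 10', 'long name 5']
```

Splitting from the right means labels that themselves contain spaces still work, which `x.split()[1]` would not handle.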
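# Alternative: a hand-written bubble sort that compares the numeric part of each string
l = ["x 10", "y 20", "z 15"]
n = len(l)
for i in range(n - 1):
    for j in range(n - i - 1):
        if int(l[j].split()[1]) > int(l[j + 1].split()[1]):
            l[j], l[j + 1] = l[j + 1], l[j]
print(f"sorted{l}")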
I have some data stored in a structured numpy array that I would like to write to a file. Because I'm generating some code to match an existing output format that exported a more visually pleasant table I need to output this data with a varying whitespace delimiter.
Using a basic example:
import numpy as np
x = np.zeros((2,), dtype='i4,f4,U10')  # U10 (unicode) so that %s prints Hello, not b'Hello', on Python 3
x[:] = [(1,2.,'Hello'),(2,3.,'World')]
With a desired output of:
1 2.00 Hello
2 3.00 World
Now, I know I could do this:
for row in x:
    print('%i %.2f %s' % (row[0], row[1], row[2]))
Which works fine but it seems overly verbose if you have a lot of columns. Right now there are 8 in my output.
I'm new to Python (coming from MATLAB), is there a more generic statement that I could use to 'roll up' all of the columns? I was thinking something along the lines of print('%i %.2f %s' % val for val in row) but this just prints 2 generators. When I ran into a generator while using re.split I could use list() to get my desired output, but I don't think using something similar with, say, str() for this case would do what I'm looking for.
If you convert row to a tuple, you don't have to list all the elements. The % operator takes a tuple of values. row may display like a tuple, but it is actually a numpy.void, so it needs the explicit tuple() call.
for row in x:
    print('%i %.2f %s' % tuple(row))
Printing or writing rows in a loop like this is perfectly normal Python.
Class __str__ methods are often written like:
astr = ['a header line']
for row in x:
    astr.append('%i %.2f %s' % tuple(row))
astr = '\n'.join(astr)
producing a string of joined lines.
A comprehension equivalent could be:
'\n'.join('%i %.2f %s'%tuple(row) for row in x)
Since you are using Python3, you could also use the .format approach, but I suspect the % style is closer to what you are using in MATLAB.
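For reference, a minimal sketch of what the .format approach could look like. To keep it runnable without numpy, plain tuples stand in for `tuple(row)`; the variable names `rows`, `fmt` and `lines` are my own.

```python
# Plain tuples stand in for tuple(row) taken from the structured array
rows = [(1, 2.0, "Hello"), (2, 3.0, "World")]

# One replacement field per column; *row unpacks the tuple into the fields
fmt = "{:d} {:.2f} {}"
lines = [fmt.format(*row) for row in rows]

print("\n".join(lines))
# 1 2.00 Hello
# 2 3.00 World
```

As with the % version, the format string is the only place that has to change when a column is added.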
First:
Regarding your implicit question with the generators;
you can "stringify" a generator for example with str.join:
>>> gen = (i.upper() for i in "teststring")
>>> print(''.join(gen))
TESTSTRING
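Applied to a single row, the same join trick looks roughly like this. It is a sketch with a plain tuple standing in for the numpy row, and it uses str() for every column, so it does not give per-column control over the format.

```python
row = (1, 2.0, "Hello")  # stand-in for tuple(row) from the array

# A generator expression is lazy: nothing is converted until it is consumed
gen = (str(value) for value in row)

# str.join consumes the generator and builds one string
text = " ".join(gen)
print(text)  # 1 2.0 Hello

# The generator is now exhausted, so a second join yields an empty string
leftover = " ".join(gen)
print(repr(leftover))  # ''
```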
Second:
To the actual question: this is a little more generic than your approach, but it is still not really satisfying.
formats = [
    ("{}", 1),      # format and number of trailing spaces for row[0]
    ("{:.2f}", 4),  # the same for row[1]
    ("{}", 0)       # and row[2]
]
for row in x:
    # each entry of formats is a (format, spaces) pair, so unpack it separately from the value
    lst = [f.format(d) + (" " * s) for (f, s), d in zip(formats, row)]
    print(*lst, sep='')
This just gives a better overview of each format.
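An alternative sketch of the same per-column idea: instead of appending spaces by hand, let the format mini-language pad each field to a fixed width. The column specs, the widths and the helper name `build_row_format` are all hypothetical, and plain tuples again stand in for the numpy rows.

```python
# Hypothetical per-column settings: (format type, field width); width 0 means no padding
columns = [("d", 4), (".2f", 8), ("s", 0)]


def build_row_format(columns):
    """Build one format string with a left-aligned, padded field per column."""
    parts = []
    for spec, width in columns:
        # "<N" left-aligns the value and pads it with spaces to N characters
        align = "<{}".format(width) if width else ""
        parts.append("{:" + align + spec + "}")
    return "".join(parts)


fmt = build_row_format(columns)
print(fmt)  # {:<4d}{:<8.2f}{:s}

rows = [(1, 2.0, "Hello"), (2, 3.0, "World")]
for row in rows:
    print(fmt.format(*row))
```

With eight columns, only the `columns` list grows; the loop stays the same.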
I want to extract the elements of a character array that contains some particular string. For example:
x <- c('aa', 'ab', 'ac', 'bb', 'bc')
I want some function such that, given x and 'a' (in general this can be any string), it returns 'aa', 'ab', 'ac'. I have experimented with combinations of %in%, match, which, etc., but have not been able to make them work. Any ideas?
Just use grep:
grep('a', x, value=TRUE)
[1] "aa" "ab" "ac"
In a table or a list, we can use dplyr::pull (from the dplyr package, part of the tidyverse) to convert the values in a column to a vector first, and then find the particular values in that vector. For instance, in the LEGO example, we can do the following to find any theme starting with "s" or "S":
inventory_parts_themes <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  arrange(desc(quantity)) %>%
  select(-id, -version) %>%
  inner_join(sets, by = "set_num") %>%
  inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
all_theme_names <- dplyr::pull(inventory_parts_themes, name_theme)
all_theme_names[grep("^[sS].*", all_theme_names)]
I would like to insert an extra character (or a new string) at a specific location in a string. For example, I want to insert d at the fourth location in abcefg to get abcdefg.
Now I am using:
old <- "abcefg"
n <- 4
paste(substr(old, 1, n-1), "d", substr(old, n, nchar(old)), sep = "")
I could write a one-line simple function for this task, but I am just curious if there is an existing function for that.
You can do this with regular expressions and gsub.
gsub('^([a-z]{3})([a-z]+)$', '\\1d\\2', old)
# [1] "abcdefg"
If you want to do this dynamically, you can create the expressions using paste:
letter <- 'd'
lhs <- paste0('^([a-z]{', n-1, '})([a-z]+)$')
rhs <- paste0('\\1', letter, '\\2')
gsub(lhs, rhs, old)
# [1] "abcdefg"
As per DWin's comment, you may want this to be more general.
gsub('^(.{3})(.*)$', '\\1d\\2', old)
This way any three characters will match rather than only lower case. DWin also suggests using sub instead of gsub. This way you don't have to worry about the ^ as much since sub will only match the first instance. But I like to be explicit in regular expressions and only move to more general ones as I understand them and find a need for more generality.
As Greg Snow noted, you can use another form of regular expression that looks behind matches:
sub( '(?<=.{3})', 'd', old, perl=TRUE )
You could also build the dynamic gsub pattern above using sprintf rather than paste0:
lhs <- sprintf('^([a-z]{%d})([a-z]+)$', n-1)
or for his sub regular expression:
lhs <- sprintf('(?<=.{%d})',n-1)
The stringi package to the rescue once again! It is the simplest and most elegant of the solutions presented.
stri_sub function allows you to extract parts of the string and substitute parts of it like this:
x <- "abcde"
stri_sub(x, 1, 3) # from first to third character
# [1] "abc"
stri_sub(x, 1, 3) <- 1 # substitute from first to third character
x
# [1] "1de"
But if you do this:
x <- "abcde"
stri_sub(x, 3, 2) # from 3 to 2 so... zero ?
# [1] ""
stri_sub(x, 3, 2) <- 1 # substitute from 3 to 2 ... hmm
x
# [1] "ab1cde"
then no characters are removed, but a new one is inserted. Isn't that cool? :)
@Justin's answer is the way I'd actually approach this because of its flexibility, but this could also be a fun approach.
You can treat the string as "fixed width format" and specify where you want to insert your character:
paste(read.fwf(textConnection(old),
c(4, nchar(old)), as.is = TRUE),
collapse = "d")
Particularly nice is the output when using sapply, since you get to see the original string as the "name".
newold <- c("some", "random", "words", "strung", "together")
sapply(newold, function(x) paste(read.fwf(textConnection(x),
c(4, nchar(x)), as.is = TRUE),
collapse = "-WEE-"))
# some random words strung together
# "some-WEE-NA" "rand-WEE-om" "word-WEE-s" "stru-WEE-ng" "toge-WEE-ther"
Your original way of doing this (i.e. splitting the string at an index and pasting in the inserted text) could be made into a generic function like so:
split_str_by_index <- function(target, index) {
  index <- sort(index)
  substr(rep(target, length(index) + 1),
         start = c(1, index),
         stop = c(index - 1, nchar(target)))
}
# Taken from https://stat.ethz.ch/pipermail/r-help/2006-March/101023.html
interleave <- function(v1, v2) {
  ord1 <- 2 * (1:length(v1)) - 1
  ord2 <- 2 * (1:length(v2))
  c(v1, v2)[order(c(ord1, ord2))]
}
insert_str <- function(target, insert, index) {
  insert <- insert[order(index)]
  index <- sort(index)
  paste(interleave(split_str_by_index(target, index), insert), collapse = "")
}
Example usage:
> insert_str("1234567890", c("a", "b", "c"), c(5, 9, 3))
[1] "12c34a5678b90"
This allows you to insert a vector of characters at the locations given by a vector of indexes. The split_str_by_index and interleave functions are also useful on their own.
Edit:
I revised the code to allow for indexes in any order. Before, indexes needed to be in ascending order.
I've made a custom function called substr1 to deal with extracting, replacing and inserting characters in a string. Run this code at the start of every session. Feel free to try it out and let me know if it needs to be improved.
# extraction
substr1 <- function(x, y) {
  z <- sapply(strsplit(as.character(x), ''), function(w) paste(na.omit(w[y]), collapse = ''))
  dim(z) <- dim(x)
  return(z)
}
# substitution + insertion
`substr1<-` <- function(x, y, value) {
  names(y) <- c(value, rep('', length(y) - length(value)))
  z <- sapply(strsplit(as.character(x), ''), function(w) {
    v <- seq(w)
    names(v) <- w
    paste(names(sort(c(y, v[setdiff(v, y)]))), collapse = '')
  })
  dim(z) <- dim(x)
  return(z)
}
# demonstration
abc <- 'abc'
substr1(abc,1)
# "a"
substr1(abc,c(1,3))
# "ac"
substr1(abc,-1)
# "bc"
substr1(abc,1) <- 'A'
# "Abc"
substr1(abc,1.5) <- 'A'
# "aAbc"
substr1(abc,c(0.5,2,3)) <- c('A','B')
# "AaB"
It took me some time to understand the regular expression, but afterwards I found my way with the numbers I had.
The end result was
old <- "89580000"
gsub('^([0-9]{5})([0-9]+)$', '\\1-\\2', old)
similar to yours!
You don't need to load any package for this: paste0 and substr are both in base R.
Here is the exact code:
paste0(substr(old, 1,3), "d", substr(old,4,6))
In base you can use regmatches to insert a character at a specific location in a string.
old <- "abcefg"
n <- 4
regmatches(old, `attr<-`(n, "match.length", 0)) <- "d"
old
#[1] "abcdefg"
This could also be used with a regex to find the location to insert.
s <- "abcefg"
regmatches(s, regexpr("(?<=c)", s, perl=TRUE)) <- "d"
s
#[1] "abcdefg"
It also works for multiple matches, with an individual replacement for each match.
s <- "abcefg abcefg"
regmatches(s, gregexpr("(?<=c)", s, perl=TRUE)) <- list(1:2)
s
#[1] "abc1efg abc2efg"
I'm trying to use a loop to create histograms where the column name reference changes on each iteration by changing the string of the column name. I want to get four histograms for column1, column2, column3, column4. (In my actual example the columns are not named column1 etc., but I want it to be clear.)
For (i in 1:4){
hist( paste("dataset$column" +i ) , main ="title")
}
When I try using paste I get the error that x must be numeric, but if I try it using just one as a check like
hist( dataset$column1), main = "title")
it works fine, so it's not the data itself.
You can use apply
set.seed(001)
DF <- data.frame(column1=rnorm(100),
column2=rnorm(100),
column3=rnorm(100),
column4=rnorm(100))
apply(DF, 2, hist) # It produces one hist for each column
Using a for loop
for (i in 1:ncol(DF)) {
  hist(DF[, paste('column', i, sep = '')],
       main = paste('Histogram', i))
}
I usually use lapply() in these cases. Here's an example where I've also used gsub() to pretty up the names a little bit.
set.seed(001)
DF <- data.frame(Funky.Name.1 = rnorm(100),
Funky.Name.2 = rnorm(100),
Whoo.Whoo = rnorm(100),
Yee.Haw = rnorm(100))
lapply(names(DF),
function(x) hist(DF[, x], main = gsub("\\.", " ", x), xlab="Value"))
Demo
par(mfrow = c(2, 2)) makes it so we can plot four plots together in a 2x2 grid filled in by row.
# par(mfrow = c(2, 2)) # 2x2 layout of all four Histograms
# lapply(names(DF),
# function(x) hist(DF[, x], main = gsub("\\.", " ", x), xlab="Value"))
# When you're done: dev.off()
The for function is not capitalized. And the "+" operator does not work on character values. And you cannot append number to column name "stems" in the manner you are attempting, but you can calculate arguments to the "[[" operator which is what the "$" operator really is. This might work depending on what the column names of 'dataset' really are:
for (i in 1:4) {
  hist(dataset[[paste0("column", i)]], main = "title")
}
I very much doubt that this:
hist( dataset$column1), main = "title")
works - you have an extra bracket.
Even with a working call such as:
paste0("dataset$column", i)
you will only get the string "dataset$column1", etc., not the column itself. Instead, you need to select the column you want:
for (i in 1:4) {
  hist(dataset[, i], main = "title")
}
to select columns 1, 2, 3, & 4.
Or you could have:
for (i in 1:4) {
  hist(dataset[[paste0("column", i)]], main = "title")
}