I've got a string where I need to upcase only the first letter, while preserving the case of all subsequent letters. At first I thought:
String.capitalize("hyperText")
would do the trick. But in addition to fixing the first letter, it downcases the rest of the letters. What I need to end up with is "HyperText". My initial pass at this is:
<<letter::utf8, rest::binary>> = word
upcased_first_letter =
  List.to_string([letter])
  |> String.upcase()
upcased_first_letter <> rest
This works perfectly, but it seems like a lot of verbosity and a lot of work. I keep feeling like there's a better way; I'm just not seeing it.
You can use with/1 to keep it to a single expression, and you can avoid List.to_string by using the <<>> operator again on the resulting codepoint:
with <<first::utf8, rest::binary>> <- "hyperText", do: String.upcase(<<first::utf8>>) <> rest
Or put it in a function:
def upcaseFirst(<<first::utf8, rest::binary>>), do: String.upcase(<<first::utf8>>) <> rest
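Assuming you put that function in a module (the name StringHelpers here is only for illustration), usage looks like:
iex> StringHelpers.upcaseFirst("hyperText")
"HyperText"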
One method:
iex(10)> Macro.camelize("hyperText")
"HyperText"
This might be more UTF-8 compatible? Not sure how many letters are multiple codepoints, but this seems a little safer than assuming how many bytes a letter is going to be.
iex(6)> with [first | rest] <- String.codepoints("βool") do
...(6)> [String.capitalize(first) | rest] |> Enum.join()
...(6)> end
"Βool"
iex(7)> with [first | rest] <- String.codepoints("😂ool") do
...(7)> [String.capitalize(first) | rest] |> Enum.join()
...(7)> end
"😂ool"
If you're just upcasing the English alphabet, you could use an easy guard clause on your match. Here's an anonymous function example, though a named function or a with expression would work too:
iex> cap_first = fn
...> <<first, rest::binary>> when first in ?a..?z -> <<first - 32, rest::binary>>
...> string -> string
...> end
iex> cap_first.("hyperText")
"HyperText"
I can use occursin function, but its haystack argument cannot be a regular expression, which means I have to pass the entire alphanumeric string to it. Is there a neat way of doing this in Julia?
I'm not sure your assumption about occursin is correct:
julia> occursin(r"[a-zA-Z]", "ABC123")
true
julia> occursin(r"[a-zA-Z]", "123")
false
but its haystack argument cannot be a regular expression, which means I have to pass the entire alphanumeric string to it.
If you mean its needle argument, it can be a Regex, for eg.:
julia> occursin(r"^[[:alnum:]]*$", "adf24asg24y")
true
julia> occursin(r"^[[:alnum:]]*$", "adf24asg2_4y")
false
This checks that the given haystack string is alphanumeric using the Unicode-aware character class
[[:alnum:]], which you can think of as [a-zA-Z\d] extended to non-English characters as well. (As always with Unicode, a "perfect" solution involves more work and complication, but this takes you most of the way.)
If you do mean you want the haystack argument to be a Regex, it's not clear why you'd want that here, and also why "I have to pass the entire alphanumeric string to it" is a bad thing.
As has been noted, you can indeed use regexes with occursin, and it works well. But you can also roll your own version, quite simply:
isalphanumeric(c::AbstractChar) = isletter(c) || ('0' <= c <= '9')
isalphanumeric(str::AbstractString) = all(isalphanumeric, str)
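For instance, a hypothetical REPL session with those two definitions in scope (the Unicode example relies on isletter being Unicode-aware):
julia> isalphanumeric("adf24asg24y")
true
julia> isalphanumeric("adf24asg2_4y")
false
julia> isalphanumeric("Ångström123")
true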
I am collecting tweets from Twitter using Erlang, and I am trying to save only the hashtags to a database. However, when I convert the bitstrings to list-strings, all the non-Latin-letter tweets turn into strange symbols.
Is there any way to check whether a string contains only alphanumeric characters in Erlang?
For Latin characters you can use this function:
is_alpha([Char | Rest]) when Char >= $a, Char =< $z ->
    is_alpha(Rest);
is_alpha([Char | Rest]) when Char >= $A, Char =< $Z ->
    is_alpha(Rest);
is_alpha([Char | Rest]) when Char >= $0, Char =< $9 ->
    is_alpha(Rest);
is_alpha([]) ->
    true;
is_alpha(_) ->
    false.
For other encodings, you can add their character-code ranges as additional clauses.
There are three io_lib functions specifically for this:
io_lib:printable_list/1
io_lib:printable_latin1_list/1
io_lib:printable_unicode_list/1
Here is an example of one in use:
-spec show_message(WxParent, Message) -> ok
    when WxParent :: wx:wx_object(),
         Message :: unicode:chardata() | term().
show_message(WxParent, Message) ->
    Format =
        case io_lib:printable_unicode_list(Message) of
            true  -> "~ts";
            false -> "~tp"
        end,
    Modal = wxMessageDialog:new(WxParent, io_lib:format(Format, [Message])),
    _ = wxMessageDialog:showModal(Modal),
    ok = wxMessageDialog:destroy(Modal).
Check out the io_lib docs: http://www.erlang.org/doc/man/io_lib.html#printable_list-1
Addendum
Because this subject isn't always easy to research in Erlang, a related but slightly broader Q/A might be of interest:
How to check whether input is a string in Erlang?
The easiest way is to use regular expressions.
StringAlphanum = "1234abcZXYM".
StringNotAlphanum = "1ZXYMÄ#kMp&?".
re:run(StringAlphanum, "^[0-9A-Za-z]+$").
>> {match,[{0,11}]}
re:run(StringNotAlphanum, "^[0-9A-Za-z]+$").
>> nomatch
You can easily make a function out of it...
isAlphaNum(String) ->
    case re:run(String, "^[0-9A-Za-z]+$") of
        {match, _} -> true;
        nomatch -> false
    end.
But, in my opinion, the better way would be to solve the underlying problem: the correct interpretation of Unicode binary strings.
If you want to represent Unicode characters correctly, do not use binary_to_list; use the unicode module instead. Unicode binary strings cannot be interpreted naively as plain bytes; the UTF-8 encoding, for example, has constraints that prevent this. For instance, the most significant bits of the first byte determine whether it starts a multi-byte character.
I took the following example from this site; let's define a UTF-8 string:
Utf8String = <<195, 164, 105, 116, 105>>.
Interpreted naively as binary, it yields:
binary_to_list(Utf8String).
"äiti"
Interpreted with Unicode support:
unicode:characters_to_list(Utf8String, utf8).
"äiti"
I want to find out whether two strings are anagrams or not.
I thought I'd sort them and then compare them one element at a time, but are there any algorithms for sorting strings? Or is there another way to do it? (Simple ideas or code, please, because I am a beginner.) Thanks.
Strings are lists of characters in Haskell, so the standard sort simply works.
> import Data.List
> sort "hello"
"ehllo"
Your idea of sorting and then comparing sounds fine for checking anagrams.
I can give you an idea (as I am not that well acquainted with Haskell).
Take an array with 26 slots.
For each character in the first string, increment the corresponding position in the array. If the array starts as A[26] = {0, 0, ..., 0}, then when you find 'a' set A[1] = A[1] + 1, when you find 'b' set A[2] = A[2] + 1, and so on.
For the second string, decrement the value for each character found, in the same array (if you find 'a', decrease A[1]: A[1] = A[1] - 1).
At the end, check whether all the array elements are 0. If they are, the strings are definitely anagrams; otherwise they are not.
Note: You may extend this similarly for capital letters.
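A minimal Haskell sketch of that counting idea, assuming only lowercase ASCII letters (the function name isAnagramCounts is my own choice, not part of the answer above):
import Data.Array (accumArray, elems)
import Data.Char (ord)

-- +1 for every letter of the first word, -1 for every letter of the
-- second; the words are anagrams exactly when every slot ends at 0.
isAnagramCounts :: String -> String -> Bool
isAnagramCounts xs ys = all (== 0) (elems counts)
  where
    counts = accumArray (+) (0 :: Int) (0, 25)
               ([(ord c - ord 'a',  1) | c <- xs] ++
                [(ord c - ord 'a', -1) | c <- ys])
Here isAnagramCounts "cinema" "maneci" gives True; characters outside 'a'..'z' would need the extension mentioned in the note.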
It is not necessary to count the occurrences of each letter.
You can simply sort your strings and then compare the two lists element by element.
For example, you have this
"cinema" and "maneci"
It would be helpful to make your string into a list of characters.
['c','i','n','e','m','a'] and ['m','a','n','e','c','i']
Then you can sort these lists and compare them character by character.
Note that you will have these cases:
example [] [] = True
example [] _ = False
example _ [] = False
example (h1:t1) (h2:t2) = if h1 == h2 then example t1 t2 else False
In the Joy of Haskell "Finding Success and Failure", pp.11-14, the authors offer the following code which works:
import Data.List
isAnagram :: String -> String -> Bool
isAnagram word1 word2 = (sort word1) == (sort word2)
After importing your module (I imported practice.hs into Clash), you can enter two strings which, if they are anagrams, will return true:
*Practice> isAnagram "julie" "eiluj"
True
Is there a way in ML to take in a string and output a list of its pieces, where the separators are a space, newline, or EOF, while also keeping quoted strings inside the string intact?
EX) hello world "my id" is 5555
-> [hello, world, my id, is, 5555]
I am then working on tokenizing these into:
->[word, word, string, word, int]
Sure you can! Here's the idea:
If we take a string like "Hello World, \"my id\" is 5555", we can split it at the quote marks, ignoring the spaces for now. This gives us ["Hello World, ", "my id", " is 5555"]. The important thing to notice here is that the list contains three elements - an odd number. As long as the string only contains pairs of quotes (as it will if it's properly formatted), we'll always get an odd number of elements when we split at the quote marks.
A second important thing is that all the even-numbered elements of the list will be strings that were unquoted (if we start counting from 0), and the odd-numbered ones were quoted. That means that all we need to do is tokenize the ones that were unquoted, and then we're done!
I put some code together - you can continue from there:
fun foo s =
  let
    val quoteSep = String.tokens (fn c => c = #"\"") s
    val spaceSep = String.tokens (fn c => c = #" ") (* change this to include newlines and stuff *)
    fun sepEven [] = []
      | sepEven [x] = (* there were no quotes in the string *)
      | sepEven (x::y::xs) = (* x was unquoted, y was quoted *)
  in
    if length quoteSep mod 2 = 0
    then (* there was an uneven number of quote marks - something is wrong! *)
    else (* call sepEven *)
  end
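For reference, one possible way to fill in those blanks (my own sketch; the error handling via Fail and the inclusion of newlines in spaceSep are additions not present in the outline above):
fun tokenizeQuoted s =
  let
    val quoteSep = String.tokens (fn c => c = #"\"") s
    val spaceSep = String.tokens (fn c => c = #" " orelse c = #"\n")
    (* even-indexed elements were unquoted, so split them at whitespace;
       odd-indexed ones were quoted, so keep them whole *)
    fun sepEven [] = []
      | sepEven [x] = spaceSep x
      | sepEven (x :: y :: xs) = spaceSep x @ (y :: sepEven xs)
  in
    if length quoteSep mod 2 = 0
    then raise Fail "unbalanced quote marks"
    else sepEven quoteSep
  end
With that, tokenizeQuoted "hello world \"my id\" is 5555" gives ["hello", "world", "my id", "is", "5555"]. Note that String.tokens drops empty fields, so a string that begins or ends with a quoted part will be reported as unbalanced; String.fields avoids that if you need it.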
String.tokens brings you halfway there. But if you really want to handle quotes like you are sketching then there is no way around writing an actual lexer. MLlex, which comes with SML/NJ and MLton (but is usable with any SML) could help. Or you just write it by hand, which should be easy enough in this case as well.
I have a lot of strings, and each of which tends to have the following format: Ab_Cd-001234.txt
I want to replace each one with just 001234. How can I achieve this in R?
The stringr package has lots of handy shortcuts for this kind of work:
# input data following #agstudy
data <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
# load library
library(stringr)
# prepare regular expression
regexp <- "[[:digit:]]+"
# process string
str_extract(data, regexp)
Which gives the desired result:
[1] "001234" "001234"
To explain the regexp a little:
[[:digit:]] is any number 0 to 9
+ means the preceding item (in this case, a digit) will be matched one or more times
This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
Using gsub or sub you can do this:
gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
"001234"
You can use gregexpr with regmatches:
m <- gregexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
"001234"
EDIT: the two methods are vectorized and work for a vector of strings.
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
"001234" "001234"
m <- gregexpr('[0-9]+',x)
regmatches(x, m)
[[1]]
[1] "001234"
[[2]]
[1] "001234"
You could use genXtract from the qdap package. This takes a left character string and a right character string and extracts the elements between.
library(qdap)
genXtract("Ab_Cd-001234.txt", "-", ".txt")
Though I much prefer agstudy's answer.
EDIT: Extending the answer to match agstudy's:
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
genXtract(x, "-", ".txt")
# $`- : .txt1`
# [1] "001234"
#
# $`- : .txt2`
# [1] "001234"
gsub: Remove the prefix and suffix:
gsub(".*-|\\.txt$", "", x)
tools package: Use file_path_sans_ext from tools to remove the extension, and then use sub to remove the prefix:
library(tools)
sub(".*-", "", file_path_sans_ext(x))
strapplyc: Extract the digits after "-" and before the dot. See the gsubfn home page for more info:
library(gsubfn)
strapplyc(x, "-(\\d+)\\.", simplify = TRUE)
Note that if it were desired to return a numeric we could use strapply rather than strapplyc like this:
strapply(x, "-(\\d+)\\.", as.numeric, simplify = TRUE)
I'm adding this answer because it works regardless of what non-numeric characters you have in the strings you want to clean up, and because the OP said that the string tends to follow the format "Ab_Cd-001234.txt", which I take to mean that some variation is allowed.
Note that this answer takes all numeric characters from the string and keeps them together, so if the string were "4_Ab_Cd_001234.txt", your result would be "4001234".
If you want to point this solution at a column in a data frame you've got:
df$clean_column <- gsub("[^0-9]", "", df$dirty_column)
This is very similar to the answer here:
https://stackoverflow.com/a/52729957/9731173.
Essentially, my solution replaces any non-numeric character with "", while the answer I've linked to replaces any character that is not a digit, "-", or ".".
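A quick illustration of stripping every non-digit character (the second file name is made up to show the extra-digit caveat mentioned above):
x <- c("Ab_Cd-001234.txt", "4_Ab_Cd_001234.txt")
gsub("[^0-9]", "", x)
# [1] "001234"  "4001234"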