How to convert different levels of strings into numeric responses in R?

First, I have read some similar questions that have already been answered, but a slight difference in my case causes problems for me.
I have a data frame column with five different levels of strings: "10-20%" "100+%" "21-40%" "41-70%" "71-100%". I have tried both as.numeric and as.integer. Both functions do convert the strings into numeric responses. The problem is that I want the conversion to follow the numeric order of the ranges. Currently the levels "10-20%" "100+%" "21-40%" "41-70%" "71-100%" are mapped to 1, 2, 3, 4, 5 respectively.
What I want is for "10-20%" to be 1, "21-40%" to be 2, "41-70%" to be 3, "71-100%" to be 4 and "100+%" to be 5.
Do I have to change the order of the levels of these strings manually to achieve this?
Appendix:
levels(dataset$PercentGrowth)
[1] "" "10-20%" "100+%" "21-40%" "41-70%" "71-100%"
head(as.integer(dataset$PercentGrowth))
[1] 1 4 3 1 3 4
head(as.numeric(dataset$PercentGrowth))
[1] 1 4 3 1 3 4
head((dataset$PercentGrowth))
[1] 21-40% 100+% 100+% 21-40%
Levels: 10-20% 100+% 21-40% 41-70% 71-100%

You should create a factor from your strings and assign the levels in the desired order:
x = c("10-20%", "100+%" ,"21-40%" ,"41-70%", "71-100%")
as.integer(factor(x, levels = c("10-20%", "21-40%", "41-70%", "71-100%", "100+%")))
[1] 1 5 2 3 4

as.numeric(factor(df$string.var,
levels = c("10-20%", "21-40%", "41-70%", "71-100%", "100+%")))
?factor
Sample data would help.
Edited to add levels.

You may try:
x <- c("10-20%", "100+%" ,"21-40%" ,"41-70%", "21-40%", "71-100%", "10-20%")
library(gtools)
match(x,unique(mixedsort(x)))
#[1] 1 5 2 3 2 4 1
##
as.numeric(factor(x, levels=unique(mixedsort(x))))
#[1] 1 5 2 3 2 4 1
For labels without numbers, suppose your vector is as follows (this is not a general solution):
x1 <- c("less than one year", "one year", "more than one year","one year", "less than one year")
Using gsub2() from the question "R: replace characters using gsub, how to create a function?":
gsub2 <- function(pattern, replacement, x, ...) {
  for (i in seq_along(pattern))
    x <- gsub(pattern[i], replacement[i], x, ...)
  x
}
x1[mixedorder(gsub2(c("less","^one","more"), c(0,1,2), x1))]
[1] "less than one year" "less than one year" "one year"
[4] "one year" "more than one year"
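For comparison, the mixedsort idea can be sketched in Python rather than R: order the distinct labels by their leading number, then map each string to its rank. The helper name rank_by_leading_number is made up for this illustration, and the sketch assumes every label starts with digits.

```python
import re


def rank_by_leading_number(labels):
    """Map each label to its 1-based rank, ordering distinct labels by leading integer."""
    def leading_number(label):
        # Assumption: every label begins with at least one digit.
        return int(re.match(r"\d+", label).group())

    ordered = sorted(set(labels), key=leading_number)
    rank = {label: position for position, label in enumerate(ordered, start=1)}
    return [rank[label] for label in labels]


x = ["10-20%", "100+%", "21-40%", "41-70%", "21-40%", "71-100%", "10-20%"]
print(rank_by_leading_number(x))  # [1, 5, 2, 3, 2, 4, 1]
```

The printed ranks agree with the match/mixedsort output shown above for the same vector.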

Related

Hacker rank problem - code optimisation and debugging logical errors required to pass all the test cases for the below python program

This problem is about sets. There is an array arr of integers. There are also two disjoint sets, A and B, each containing integers. You like all the integers in set A and dislike all the integers in set B. Your initial happiness is 0. For each integer i in the array, if i belongs to A, you add 1 to your happiness. If i belongs to B, you add -1 to your happiness. Otherwise, your happiness does not change. Output your final happiness at the end.
Note: A and B are sets, so they have no repeated elements. However, the array might contain duplicate elements.
In the below code, I have tried to take input n,m
k = list(map(str,input().split(' ')))
n,m =k
arr=[]
arr = [int(i) for i in input().split()]
arr1 = list( dict.fromkeys(arr) )
A=set(int(i) for i in input().split())
B=set(int(i) for i in input().split())
a=len(set(arr1).intersection(A))
b=len(set(arr1).intersection(B))
print(a-b)
Input Format
The first line contains integers n and m, separated by a space.
The second line contains n integers, the elements of the array.
The third and fourth lines contain m integers, A, and B, respectively.
Input
Test case 1:
3 2
1 5 3
3 1
5 7
Test case 2:
13 4
1 7 8 5 3 7 9 4 9 8 2 1 4
1 5 3 9
7 4 2 8
Output
Test case 1: 1
Test case 2: 0
The above piece of code works for small input test cases, but it produces a Wrong Answer for the rest.
Follow the link for the actual problem statement
This is the code I used but it was unable to clear most test cases. Need help.
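One possible cause, judging only from the code above: dict.fromkeys(arr) removes duplicates from the array, so each value is counted once, whereas the problem statement adds or subtracts for every element of the array. A minimal sketch that counts every occurrence follows. The function name happiness is just for illustration, and this has not been run against the HackerRank test cases.

```python
def happiness(arr, liked, disliked):
    """Add 1 for every array element in `liked` and subtract 1 for every one in `disliked`."""
    liked, disliked = set(liked), set(disliked)
    # Booleans behave as 0/1, so each element contributes +1, -1 or 0.
    return sum((value in liked) - (value in disliked) for value in arr)


# First sample from the question: 1 and 3 are liked, 5 is disliked.
print(happiness([1, 5, 3], [3, 1], [5, 7]))  # 1

# Duplicates in the array are counted every time they appear.
print(happiness([1, 1, 1], [1], [2]))  # 3
```

Reading the input stays the same as in the question; only the final counting step changes from set intersection to a pass over the full array.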

Use a split function in every row of one column of a data frame

I have a rather big pandas data frame (more than 1 million rows) with columns containing either strings or numbers. Now I would like to split the strings in one column before the expression "is applied".
An example to explain what I mean:
What I have:
a b description
2 4 method A is applied
10 5 titration is applied
3 1 computation is applied
What I am looking for:
a b description
2 4 method A
10 5 titration
3 1 computation
I tried the following,
df.description = df.description.str.split('is applied')[0]
But this didn't bring the desired result.
Any ideas how to do it? :-)
You are close; you need .str[0] to take the first piece of each split row:
df.description = df.description.str.split(' is applied').str[0]
Alternative solution:
df.description = df.description.str.extract(r'(.*)\s+is applied', expand=False)
print (df)
a b description
0 2 4 method A
1 10 5 titration
2 3 1 computation
But for better performance use list comprehension:
df.description = [x.split(' is applied')[0] for x in df.description]
You can also use replace:
df.description = df.description.str.replace(' is applied','')
df
a b description
0 2 4 method A
1 10 5 titration
2 3 1 computation
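To see why the original attempt differs from the fix, it can help to model the column as a plain Python list. This is only an analogy, not pandas itself: splitting every row gives one list of pieces per row, so a bare [0] picks the pieces of the first row, while taking element 0 of each row's pieces gives the result the question asks for.

```python
descriptions = [
    "method A is applied",
    "titration is applied",
    "computation is applied",
]

# Splitting every row gives one list of pieces per row.
pieces = [text.split(" is applied") for text in descriptions]

# A bare [0] selects the first row's pieces, not the first piece of every row.
first_row = pieces[0]

# Taking element 0 of each row's pieces is the per-row operation that is wanted.
first_piece_per_row = [row[0] for row in pieces]

print(first_row)            # ['method A', '']
print(first_piece_per_row)  # ['method A', 'titration', 'computation']
```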

Understanding solution to online test

The question is in the following link:
http://www.spoj.com/problems/AEROLITE/
Input:
1 1 1 1
0 0 6 3
1 1 1 2
[and 7 test cases more]
Output:
6
57
8
[and 7 test cases more]
How does the output come from the input?
Consider the outputs corresponding to the following letters:
a. 1 1 1 1 = 6
b. 0 0 6 3 = 57
c. 1 1 1 2 = 8
Restating the definitions from the problem in a more tactical way, the 4 inputs correspond to the following:
The number of "{}" pairs
The number of "[]" pairs
The number of "()" pairs
The nesting depth that the generated expression must reach
The output is a single number: the count of well-formed bracket expressions that use exactly the given numbers of pairs, have the given depth, and obey the containment rules that "()" cannot contain "{}" or "[]" and "[]" cannot contain "{}".
The walkthrough below shows how to arrive at the outputs, but it doesn't try to break the sub-problems or anything down. Hopefully, it will at least help you connect the numbers and start to find the problems to break down.
Taking those examples explicitly, start with "a" for 1 1 1 1 = 6:
The inputs mean a depth of 1 using 1 pair each of "{}", "[]", "()". With no nesting allowed, this is just the number of orderings of 3 distinct pairs, so 3! = 6.
Actual: {}[](), {}()[], []{}(), [](){}, (){}[], ()[]{}
Then go to "c" for 1 1 1 2 = 8
This uses the same pairs as "a", but with d = 2 instead of 1. The sample output of 8 matches the arrangements whose depth is exactly 2, so the 6 flat arrangements from "a" are not counted again:
** One pair wraps one other pair and the third sits before or after it: {[]}(), (){[]}, {()}[], []{()}, [()]{}, {}[()] (6 cases)
** "{}" wraps the other two side by side: {[]()}, {()[]} (2 cases)
6 + 2 = 8. The fully nested {[()]} has depth 3, so it is not counted for d = 2.
Finally, consider "b", where we use only 6 "()" pairs at d = 3.
Because only parentheses exist here, this is a Catalan number Cn where n = 6, restricted by depth (For more on this: https://en.wikipedia.org/wiki/Catalan_number). C(6) = 132 counts every depth. Of those, 89 have a depth of at most 3 and 32 have a depth of at most 2, so 89 - 32 = 57 have a depth of exactly 3.
Alternatively and much more tediously, you can iterate over all the combinations of parentheses by depth:
** Start with d = 1, so just ()()()()()() (1 case, not counted)
** Then d = 2, so examples like (())()()()(), ()(())()()(), ()()(())()(), ()()()(())(), ()()()()(()), and so on (31 cases, not counted)
** Then d = 3, so examples like ((()))()()(), ()((()))()(), ()()((()))(), ()()()((())), and so on (these are the 57 matches)
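The three sample outputs can be checked with a short counting routine. This is a sketch in Python under the assumption that the fourth input is the exact depth the expression must have, computed as (count with depth at most d) minus (count with depth at most d - 1). The function names at_most and exactly are made up for this illustration, and the sketch ignores any output formatting the judge may require.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def at_most(curly, square, round_, depth):
    """Count valid expressions using the given pair counts with nesting depth <= depth."""
    if curly == square == round_ == 0:
        return 1  # the empty expression
    if depth == 0:
        return 0  # pairs remain but no nesting level is available
    total = 0
    # Split each expression into its first top-level group plus the rest.
    if curly:  # first group is "{...}", which may contain any pair type
        for i in range(curly):
            for j in range(square + 1):
                for k in range(round_ + 1):
                    total += (at_most(i, j, k, depth - 1)
                              * at_most(curly - 1 - i, square - j, round_ - k, depth))
    if square:  # first group is "[...]", which may not contain "{}"
        for j in range(square):
            for k in range(round_ + 1):
                total += (at_most(0, j, k, depth - 1)
                          * at_most(curly, square - 1 - j, round_ - k, depth))
    if round_:  # first group is "(...)", which may contain only "()"
        for k in range(round_):
            total += (at_most(0, 0, k, depth - 1)
                      * at_most(curly, square, round_ - 1 - k, depth))
    return total


def exactly(curly, square, round_, depth):
    """Count valid expressions whose depth is exactly `depth`."""
    shallower = at_most(curly, square, round_, depth - 1) if depth > 0 else 0
    return at_most(curly, square, round_, depth) - shallower


print(exactly(1, 1, 1, 1))  # 6
print(exactly(0, 0, 6, 3))  # 57
print(exactly(1, 1, 1, 2))  # 8
```

Splitting on the first top-level group makes every expression counted once, because that decomposition is unique.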

Modifying one row of an array

I've only just started learning J and there's something I have no idea how to do properly
Suppose I want to print a checkerboard of 2 symbols, for example
baba
abab
baba
To do this, I assumed you could just generate an array
baba
baba
baba
and reverse the second line.
Generating the array is easy: 3 4 $ 'ba'. But reversing the second row is where I struggle.
I can get the reverse of the second row doing |. 1 { 3 4 $ 'ba' but that only gives me the second row, not the entire array. I don't see how using J syntax I can actually keep the top and bottom row and only apply |. to the middle row.
More generally, how would you apply |. only to every other row?
What you asked
To apply |. to one row, try something like:
x =: 3 4 $ 'ba'
(|. 1{x) 1} x
baba
abab
baba
To reverse every other row, I don't know if there's something simpler than this:
,/ 1 (]`(|."1))\ i. 5 4
0 1 2 3
7 6 5 4
8 9 10 11
15 14 13 12
16 17 18 19
This uses a relatively obscure feature of the dyad \ (Infix):
x m\ y applies successive verbs from the gerund m to the infixes of y, extending m cyclically as required.
Here, x is 1, so our "infixes" are just 1×4 matrices; we cycle through a gerund (] ` (|."1)) to alternate between doing nothing (]) and reversing the single row of the submatrix (|."1). Then, we flatten the resulting 5×1×4 array back to a 5×4 matrix with ,/.
What you maybe want instead
A much simpler way to achieve a "checkerboard" is as follows: first, use +/ on two ranges to create an "addition table", like so:
(i.3) +/ (i.4)
0 1 2 3
1 2 3 4
2 3 4 5
Then take all of these values mod 2, to get a checkerboard pattern of 0s and 1s:
2 | (i.3) +/ (i.4)
0 1 0 1
1 0 1 0
0 1 0 1
Then index from a string of choice with {:
(2 | (i.3) +/ (i.4)) { 'ba'
baba
abab
baba
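For readers who do not know J yet, the same index arithmetic can be sketched in Python: the symbol at row r and column c is chosen by (r + c) mod 2. The function name checkerboard is made up for this illustration.

```python
def checkerboard(rows, cols, symbols="ba"):
    """Build a checkerboard by indexing `symbols` with (row + column) mod 2."""
    return [
        "".join(symbols[(r + c) % 2] for c in range(cols))
        for r in range(rows)
    ]


print("\n".join(checkerboard(3, 4)))
# baba
# abab
# baba
```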
Way 1: Amending }
Replace the second line with the changed line:
( 4 $ 'ab') (1 }) m =: 3 4 $ 'ba'
or generally, replace with pattern a =: 4 $ 'ab', at indices i =: +:i.5:
a i } 10 4 $ 'ba'
Way 2: Cycling with gerund and cut ;.
You can cyclically apply verbs by tying them with `. For every other row (rank "1) you want to either do nothing ] or reverse |.:
(]"1)`(|."1) ;.1 m
Way 3: Using a different pattern
You can see your pattern as 4 $ 'ba' followed by its inverse:
3 $ (,:|.) 4 $ 'ba'
Incidentally, having an odd dimension (3) with an even pattern ('ba') allows you the simpler |: 4 3 $ 'ba'.
Well, you already have a lot of answers, but none of them was the first thing that popped into my head, so I'll add this one:
0 1 0 |."(0 1) 3 4$'ba'
This takes advantage of the fact that rotating the middle row by 1 looks the same as flipping it. You can generalize this by computing as long a list of 0 and 1 as you need based on the number of rows in your checkerboard.
The approach that I would try would not require reversing lines of the array, but works by reframing the situation in a J friendly way.
I would add a column to the array so that I have an odd number of columns (5) with an even number of elements ('ba'), then strip off the last item in each row.
4 5$'ba'
babab
ababa
babab
ababa
}:"1 (4 5$'ba')
baba
abab
baba
abab
Here's yet another way of re-framing this specific problem in a "J friendly way", as bob put it. It doesn't "modify one row" but achieves the desired result in a different way.
|: 4 3 $ 'ba' NB. transpose a 4x3 matrix
baba
abab
baba

How to filter a list in J?

I'm currently learning the fascinating J programming language, but one thing I have not been able to figure out is how to filter a list.
Suppose I have the arbitrary list 3 2 2 7 7 2 9 and I want to remove the 2s but leave everything else unchanged, i.e., my result would be 3 7 7 9. How on earth do I do this?
The short answer
2 (~: # ]) 3 2 2 7 7 2 9
3 7 7 9
The long answer
I have the answer for you, but first you should get familiar with some details. Here we go.
Monads, dyads
There are two types of verbs in J: monads and dyads. The former accept only one parameter, the latter accept two parameters.
For example passing a sole argument to a monadic verb #, called tally, counts the number of elements in the list:
# 3 2 2 7 7 2 9
7
A verb #, which accepts two arguments (left and right), is called copy, it is dyadic and is used to copy elements from the right list as many times as specified by the respective elements in the left list (there may be a sole element in the list also):
0 0 0 3 0 0 0 # 3 2 2 7 7 2 9
7 7 7
Fork
There's a notion of fork in J, which is a series of 3 verbs applied to their arguments, dyadically or monadically.
Here's the diagram of a kind of fork I used in the first snippet:
x (F G H) y

      G
    /   \
   F     H
  / \   / \
 x   y x   y
It describes the order in which verbs are applied to their arguments. Thus these applications occur:
2 ~: 3 2 2 7 7 2 9
1 0 0 1 1 0 1
The ~: (not equal) is dyadic in this example and results in a list of boolean values which are true when an argument doesn't equal 2. This was the F application according to diagram.
The next application is H:
2 ] 3 2 2 7 7 2 9
3 2 2 7 7 2 9
] (identity) can be a monad or a dyad, but it always returns the right argument passed to a verb (there's an opposite verb, [, which returns... yes, the left argument! :) ).
So far, so good. F and H after application returned these values accordingly:
1 0 0 1 1 0 1
3 2 2 7 7 2 9
The only step to perform is the G verb application.
As I noted earlier, the verb #, which is dyadic (accepts two arguments), allows us to duplicate the items from the right argument as many times as specified in the respective positions in the left argument. Hence:
1 0 0 1 1 0 1 # 3 2 2 7 7 2 9
3 7 7 9
We've just got the list filtered out of 2s.
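The three steps of the fork can be mirrored in Python as a rough analogy: build the boolean mask (F), keep the right argument (H), then copy by the mask (G). The function names copy_items and remove_value are made up for this illustration and are not J terminology.

```python
def copy_items(counts, items):
    """Repeat each item as many times as its matching count, like dyadic # in J."""
    return [item for count, item in zip(counts, items) for _ in range(count)]


def remove_value(value, items):
    mask = [int(item != value) for item in items]  # F: value ~: items
    unchanged = items                              # H: value ] items
    return copy_items(mask, unchanged)             # G: mask # items


print(copy_items([0, 0, 0, 3, 0, 0, 0], [3, 2, 2, 7, 7, 2, 9]))  # [7, 7, 7]
print(remove_value(2, [3, 2, 2, 7, 7, 2, 9]))                    # [3, 7, 7, 9]
```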
Reference
A slightly different kind of fork, the hook and other primitives (including the above-mentioned ones) are described in these two documents:
A Brief J Reference (175 KiB)
Easy-J. An Introduction to the World's most Remarkable Programming Language (302 KiB)
Other useful sources of information are the Jsoftware site with its wiki and a few mailing list archives on the internet.
Just to be sure it's clear, the direct way - to answer the original question - is this:
3 2 2 7 7 2 9 -. 2
This returns
3 7 7 9
The more elaborate method - generating the boolean and using it to compress the vector - is more APLish.
To answer the other question in the very long post, to return the first element and the number of times it occurs, is simply this:
({. , {. +/ .= ]) 1 4 1 4 2 1 3 5
1 3
This is a fork using "{." to get the first item, "{. +/ . = ]" to add up the number of times the first item equals each element, and "," as the middle verb to concatenate these two parts.
Also:
2 ( -. ~ ]) 3 2 2 7 7 2 9
3 7 7 9
There are a million ways to do this - it bothers me, vaguely, that these things don't evaluate strictly right to left. I'm an old APL programmer and I think of things as right to left even when they ain't.
If it were a thing that I was going to put into a program where I wanted to pull out some number and the number was a constant, I would do the following:
(#~ 2&~:) 1 3 2 4 2 5
1 3 4 5
This is a hook sort of thing, I think. The right half of the expression generates the truth vector regarding which are not 2, and then the octothorpe on the left has its arguments swapped so that the truth vector is the left argument to copy and the vector is the right argument. I am not sure that a hook is faster or slower than a fork with an argument copy.
+/3<+/"1(=2&{"1)/:~S:_1{;/5 6$1+i.6
156
The above program answers the question, "For all possible combinations of Yahtzee dice, how many have 4 or 5 matching numbers in one roll?" It generates all the permutations, in boxes, sorts each box individually, unboxing them as a side effect, and extracts column 2, comparing each box against its own column 2, in the only successful fork or hook I've ever managed to write. The theory is that if a number appears three or more times in a list of 5, then after sorting the list the middle number will be the number that appears with the greatest frequency. I have attempted a number of other hooks and/or forks and every one has failed because there is something I just do not get. Anyway, that truth table is reduced to a vector, and now we know exactly how many times each group of 5 dice matched the median number. Finally, that number is compared to 3, and the number of successful compares (greater than 3, that is, 4 or 5) is counted.
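The figure of 156 can be cross-checked with a brute-force count. This Python sketch does not use the sorted-median trick described above; it simply counts, for each of the 6^5 rolls, how often the most common face appears. The function name count_rolls_with_at_least is made up for this illustration.

```python
from collections import Counter
from itertools import product


def count_rolls_with_at_least(matches, dice=5, faces=6):
    """Count ordered rolls in which some face appears at least `matches` times."""
    return sum(
        1
        for roll in product(range(1, faces + 1), repeat=dice)
        if max(Counter(roll).values()) >= matches
    )


print(count_rolls_with_at_least(4))  # 156
```

The same number follows by hand: 6 rolls with five of a kind, plus 6 * 5 * 5 = 150 rolls with exactly four of a kind.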
This program answers the question, "For all possible 8 digit numbers made from the symbols 1 through 5, with repetition, how many are divisible by 4?"
I know that you need only determine how many within the first 25 are divisible by 4 and multiply, but the program runs more or less instantly. At one point I had a much more complex version of this program that generated the numbers in base 5 so that individual digits were between 0 and 4, added 1 to the numbers thus generated, and then put them into base 10. That was something like 1+(8$5)#:i.5^8
+/0=4|,(8$10)#. >{ ;/ 8 5$1+i.5
78125
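The result 78125 can be confirmed in Python both by brute force and by the shortcut mentioned above (only the last two digits decide divisibility by 4). The function names are made up for this illustration.

```python
from itertools import product


def count_divisible_brute_force(length=8, digits=(1, 2, 3, 4, 5), divisor=4):
    """Build every number from the allowed digits and count those divisible by `divisor`."""
    count = 0
    for combo in product(digits, repeat=length):
        number = int("".join(map(str, combo)))
        if number % divisor == 0:
            count += 1
    return count


def count_divisible_by_4_shortcut(length=8, digits=(1, 2, 3, 4, 5)):
    """Count valid two-digit endings, then multiply by the free choices for the other digits."""
    endings = sum(1 for a, b in product(digits, repeat=2) if (10 * a + b) % 4 == 0)
    return endings * len(digits) ** (length - 2)


print(count_divisible_brute_force())    # 78125
print(count_divisible_by_4_shortcut())  # 78125
```

With digits 1 through 5 the valid endings are 12, 24, 32, 44 and 52, which is 5 of the 25 possible endings.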
As long as I have solely verb trains and selection, I don't have a problem. When I start having to repeat my argument within the verb so that I'm forced to use forks and hooks I start to get lost.
For example, here is something I can't get to work.
((1&{~+/)*./\(=1&{))1 1 1 3 2 4 1
I always get Index Error.
The point is to output two numbers, one that is the same as the first number in the list, the second which is the same as the number of times that number is repeated.
So this much works:
*./\(=1&{)1 1 1 3 2 4 1
1 1 1 0 0 0 0
I compare the first number against the rest of the list. Then I do an insertion of an and compression - and this gives me a 1 so long as I have an unbroken string of 1's, once it breaks the and fails and the zeros come forth.
I thought that I could then add another set of parens, get the lead element from the list again, and somehow record those numbers, the eventual idea would be to have another stage where I apply the inverse of the vector to the original list, and then use $: to get back for a recursive application of the same verb. Sort of like the quicksort example, which I thought I sort of understood, but I guess I don't.
But I can't even get close. I will ask this as a separate question so that people get proper credit for answering.
