I want to match multiple string criteria and then subset the matching rows in R, using grepl to find the matches. I found a nice solution in another post (the code is specific to that post, but you get the idea): subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
I am wondering whether it is possible to search in two columns, instead of just RefSeq_ID as in the example above, either with grepl or by any other method. In other words, I would like to look for the options in l not just in one column, but in two (or however many). Is this possible?
E.g., with 3 columns a, b and c, I would like criteria such that the rows containing T (rows 3 and 4) are selected, despite the format "T I" in (3,b). It should identify both (4,a) and (3,b), hence the link to the previous question. I want it to look in column a AND column b, not just one or the other.
a b c
A A C P L
V V B W E E
W T I P J G
T W P J
Here's some demo data to show how this works:
set.seed(1234)
dat <- data.frame(A = sample(letters[1:3],10,TRUE),
B = sample(letters[1:3],10,TRUE))
Using [ to subset makes this a lot more clear in my opinion - we can use grepl to give a logical vector based on a match, and use | to combine two tests (on multiple columns). If you wanted a subset of all the rows that contained an 'a' in either column:
dat.a <- dat[with(dat, grepl("a", A)|grepl("a", B)),]
A B
1 b a
2 b a
3 a c
5 a a
9 a a
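For comparison, the same idea (one alternation pattern, tested against several columns, combined with OR) can be sketched in plain Python. This is only an illustration: the rows below are a hypothetical reading of the a/b example in the question, and the names rows, patterns and columns are made up for the sketch. Unlike paste(l, collapse="|") in R, the patterns are escaped here so they match literally.

```python
import re

# Hypothetical rows, loosely mirroring columns a and b of the question's example.
rows = [
    {"a": "A", "b": "A C"},
    {"a": "V", "b": "V B"},
    {"a": "W", "b": "T I"},
    {"a": "T", "b": "W"},
]

patterns = ["T"]          # the search terms (the vector l in the R code)
columns = ["a", "b"]      # the columns to search

# Join the terms into one alternation, escaping them so they match literally.
regex = re.compile("|".join(re.escape(p) for p in patterns))

# Keep a row if the pattern is found in ANY of the chosen columns.
matched = [i for i, row in enumerate(rows)
           if any(regex.search(row[col]) for col in columns)]

print(matched)  # zero-based positions of the matching rows
```

Adding another column to the search only means adding its name to columns.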
I am new to pandas. I want to split the length column into two columns, a and b. The values in the length column come in pairs. For the first pair, the smaller value should go in a and the larger in b; then the next pair on the same row is compared the same way, with the smaller value in a and the larger in b.
I have a hundred rows. I don't think I can use str.split, because there are multiple values with the same delimiter. I have no idea how to do it.
The output should look like this.
Any help will be appreciated.
length                                             a                      b
{22.562,"35.012","25.456",37.342,24.541,38.241}    22.562,25.456,24.541   35.012,37.342,38.241
{21.562,"37.012",25.256,36.342}                    21.562,25.256          37.012,36.342
{22.256,36.456,26.245,35.342,25.56,"36.25"}        22.256,26.245,25.56    36.456,35.342,36.25
I have tried
df['a'] = df['length'].str.split(',').str[0::2]
df['b'] = df['length'].str.split(',').str[1::3]
With this code, the column b output is perfect, but column a prints the first full pair, then the second. It does not give only the 0th, 2nd and 4th values.
The problem comes from the fact that your length column is made of sets, not lists.
Here is a way to do what you want by casting your length column as list:
df['length'] = [list(x) for x in df.length] # We cast the sets as lists
df['a'] = [x[0::2] for x in df.length]
df['b'] = [x[1::2] for x in df.length]
Output:
length a \
0 [35.012, 37.342, 38.241, 22.562, 24.541, 25.456] [35.012, 38.241, 24.541]
1 [25.256, 36.342, 21.562, 37.012] [25.256, 21.562]
2 [35.342, 36.456, 36.25, 22.256, 25.56, 26.245] [35.342, 36.25, 25.56]
b
0 [37.342, 22.562, 25.456]
1 [36.342, 37.012]
2 [36.456, 22.256, 26.245]
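If the cells are actually strings shaped like the ones shown in the question (an assumption, since the answer above treats them as sets), the pairwise smaller/larger split the question asks for could be sketched as below. split_pairs is a hypothetical helper name, not a pandas function.

```python
def split_pairs(cell):
    """Split a '{v1,v2,v3,v4,...}' string into (smaller, larger) lists, pair by pair."""
    # Drop the braces, split on commas, strip stray quotes, convert to float.
    values = [float(v.strip().strip('"')) for v in cell.strip("{}").split(",")]
    smaller, larger = [], []
    # Walk consecutive pairs: (0, 1), (2, 3), (4, 5), ...
    for first, second in zip(values[0::2], values[1::2]):
        smaller.append(min(first, second))
        larger.append(max(first, second))
    return smaller, larger


cell = '{22.562,"35.012","25.456",37.342,24.541,38.241}'
a, b = split_pairs(cell)
print(a)
print(b)
```

On a DataFrame, the same helper could be applied row by row, for example with df['length'].apply(split_pairs), and the two lists then assigned to a and b.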
I have data that look like this (going on for many more rows):
What I want to do is:
Match the relationship of C and G to the relationship of I and J.
For example, I:Q1652 matches up with J:Q1662; therefore, C:Q1652 should also match up with G:Q1662.
At the same time, A & B and E & F should maintain their relationships with C and G, respectively
For example, when C:Q1652 and G:Q1662 are being matched, they should carry with them their respective rows/values from columns A & B and E & F.
Please let me know if there's anything more I can clarify! Thanks!
Please see K1:N1 cells in the below graph.
K1: =INDEX(A:A,MATCH($I1,$C:$C,0))
L1: =INDEX(B:B,MATCH($I1,$C:$C,0))
M1: =INDEX(E:E,MATCH($J1,$G:$G,0))
N1: =INDEX(F:F,MATCH($J1,$G:$G,0))
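To show what the INDEX/MATCH pairs above are doing, here is a rough Python sketch of the same lookup. The sample values are hypothetical (only Q1652 and Q1662 come from the question), and the sketch assumes each key appears once in C and once in G.

```python
# Hypothetical data: (A, B, C) rows, (E, F, G) rows, and the I/J pairing.
left = [("a1", "b1", "Q1652"), ("a2", "b2", "Q1700")]
right = [("e1", "f1", "Q1800"), ("e2", "f2", "Q1662")]
pairs = [("Q1652", "Q1662")]

# MATCH($I1,$C:$C,0) finds the row of the key in C; INDEX then reads A or B there.
left_by_key = {c: (a, b) for a, b, c in left}
# Same idea for MATCH($J1,$G:$G,0) with INDEX over E or F.
right_by_key = {g: (e, f) for e, f, g in right}

# One output row (K, L, M, N) per I/J pair.
aligned = [left_by_key[i] + right_by_key[j] for i, j in pairs]
print(aligned)
```

A key missing from C or G raises KeyError here, which plays the role of #N/A in the spreadsheet.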
I've got two data sets: Data-A and Data-B.
Data-A
A B C D Start_Date End_Date
N C P 1 23-05-2015 27-05-2015
N C K 1 30-05-2015 07-06-2015
N C Ke 1 09-06-2015 28-06-2015
N C Ch 1 14-07-2015 25-07-2015
N C Th 1 29-06-2015 13-07-2015
N C Po 2 23-05-2015 27-05-2015
N C Kan 2 30-05-2015 08-06-2015
Data-B
X D Date A B C
444 1 09-07-2015
455 1 20-07-2015
1542 1 28-06-2015
2321 1 21-07-2015
2744 1 01-07-2015
7455 2 25-05-2015
12454 2 02-06-2015
18568 2 24-05-2015
28329 2 03-06-2015
28661 2 31-05-2015
Values in Data-B are missing, and I need to fill them using conditional index matching/VLOOKUP such that column D (Data-B) is matched and Date (Data-B) satisfies Start_Date <= Date <= End_Date.
Desired Output:
X D Date A B C
444 1 09-07-2015 N C Th
455 1 20-07-2015 N C Ch
1542 1 28-06-2015 N C Ke
2321 1 21-07-2015 N C Ch
2744 1 01-07-2015 N C Th
7455 2 25-05-2015 N C Po
12454 2 02-06-2015 N C Kan
18568 2 24-05-2015 N C Po
28329 2 03-06-2015 N C Kan
28661 2 31-05-2015 N C Kan
Proof of Concept
In order to achieve the above I used the AGGREGATE function. It is a normal formula that performs array like calculations. The following formula will return the results from the first row that matches your criteria.
=INDEX(A$2:A$8,AGGREGATE(15,6,ROW($D$2:$D$8)/(($J2=$D$2:$D$8)*($E$2:$E$8<=$K2)*($K2<=$F$2:$F$8)),1)-1)
This assumes your table Data-A starts in A1 and includes 1 row as a header row. The formula can be placed in the first cell under A in Data-B and copied down and to the right as needed.
UPDATE Formula explained
The AGGREGATE function performs array calculations within its brackets for certain subfunctions. There are about 19 different subfunctions. Subfunctions 14 and 15 both perform array calculations. This is a nice feature, since it does array-like calculations while being a regular formula.
Since I wanted the first row that met your criteria, I opted to use the SMALL function, subfunction 15, for the first argument. Basically I am telling AGGREGATE to generate a list and sort it in ascending order.
The second argument has a value of 6, which tells AGGREGATE to ignore any results from the array that generate errors. This comes in very handy if we can make the results we do not want turn into errors.
Now we are getting into the array portion of the formula. You can take this next part of the equation and highlight the appropriate rows in a neighbouring column and enter it as a CONTROL+SHIFT+ENTER (CSE) formula. As long as you do this in the top cell the array formula will propagate to the remainder of the selected cells and show you the results of the array. Also check the formula bar to see if { } appeared around your formula. You cannot add the { } manually.
{=ROW($D$2:$D$8)/(($J2=$D$2:$D$8)*($E$2:$E$8<=$K2)*($K2<=$F$2:$F$8))}
What this will do is determine the current row and then will divide it by the results of our conditions. You can also try each of the following conditions in a separate column as CSE formulas in the same manner described above to see their results.
($J2=$D$2:$D$8)
($E$2:$E$8<=$K2)
($K2<=$F$2:$F$8)
These on their own will provide you with either TRUE or FALSE as each row is checked. Now the interesting thing, and this applies to Excel formulas, is that when you perform a math operation on a Boolean, Excel converts TRUE to 1 and FALSE to 0 (going the other way, it treats 0 as FALSE and any other number as TRUE). You will also note that the logic checks are separated by *. In this case * acts like an AND operator, as you only get an answer of 1 when all results are TRUE. (+ would act like an OR operator.)
Now, if you remember from earlier, the 6 said to ignore all errors. Any row that does not meet our logic check results in a division by 0, since not all of its logic checks came out as TRUE (1). All the rows whose checks wound up FALSE get ignored. After that, only a list of the row numbers that met our criteria is left inside the AGGREGATE array.
After the logic check there is a ,1 for the next argument. In this case we are telling the aggregate to return the 1st number in the list which is the first row number that met our criteria. If we wanted the third number, this would be ,3 instead.
So AGGREGATE is returning the first row number of the results we want. When this is paired with an INDEX function, we can use the result to tell us what row of the INDEX range to look in. In this case we said we wanted to look in the range A$2:A$8. The AGGREGATE function is telling us how many rows to go down in that range. If the range had started in row 1, we would not have to do anything. But since there is a header row, we need to adjust the result from the AGGREGATE function by subtracting 1 for the header row (in reality you need to subtract the number of the row above the start of your data). This is why you see the -1 after the AGGREGATE function.
Now if you pay attention to the lock on the range you will notice I did not lock the A in A$2:A$8. I did this so that I could copy the formula to the right and the column A address would update as I did. This only works because you were keeping the columns in the same order. If the order has changed I would have changed the index from a 1D array to a 2D array and used a MATCH function to line up the column headers.
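The logic described above (keep only the rows where every condition holds, then take the smallest row position) can be sketched outside Excel as well. The following Python sketch uses the Data-A rows from the question; first_match and parse are hypothetical helper names, and returning None for "no match" is a choice made for the sketch.

```python
from datetime import datetime


def parse(text):
    """Parse a dd-mm-yyyy date string as used in the question."""
    return datetime.strptime(text, "%d-%m-%Y")


# Data-A rows as (A, B, C, D, Start_Date, End_Date), copied from the question.
data_a = [
    ("N", "C", "P", 1, "23-05-2015", "27-05-2015"),
    ("N", "C", "K", 1, "30-05-2015", "07-06-2015"),
    ("N", "C", "Ke", 1, "09-06-2015", "28-06-2015"),
    ("N", "C", "Ch", 1, "14-07-2015", "25-07-2015"),
    ("N", "C", "Th", 1, "29-06-2015", "13-07-2015"),
    ("N", "C", "Po", 2, "23-05-2015", "27-05-2015"),
    ("N", "C", "Kan", 2, "30-05-2015", "08-06-2015"),
]


def first_match(d, date):
    """Return (A, B, C) of the first Data-A row where D matches and Start <= date <= End."""
    target = parse(date)
    # Positions of rows passing all three checks (the non-error entries of the array).
    hits = [i for i, row in enumerate(data_a)
            if row[3] == d and parse(row[4]) <= target <= parse(row[5])]
    if not hits:
        return None  # no row met the criteria
    # Smallest position, like asking AGGREGATE/SMALL for the 1st value.
    return data_a[min(hits)][:3]


print(first_match(1, "09-07-2015"))
print(first_match(2, "31-05-2015"))
```

Filtering the list takes the place of the divide-by-zero trick: rows that fail a check simply never enter hits.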
I am attempting to translate my existing Matlab code into Numbers (basically Excel). In Matlab, I have the following code:
clear all; clc;
n = 30
x = 1:(n-1)
T = 295;
D = T./(n-x)
E = T/n
for i=1:(n-2)
C(i) = D(i+1) - D(i)
end
hold on
plot(x(1:end-1), C, 'rx')
plot(x, D, 'bx')
I believe everything has been solved by your formulas. There are parts of them that I don't understand, otherwise I would try to figure the rest out myself. Attached is the result (you might also like to know that the formulas you gave work and are recognized in Numbers). I'm trying to figure out why (x) begins at 2, as I feel it should start at 1.
Also it is clear that the realistic results for the formulas only exist under certain conditions i.e. column E > 0. That being the case, what would be the easiest way to chart data with a filter so that only certain data is charted?
(Using Excel...)
Suppose you put your input values n & T in A1 & A2 respectively.
You could generate x, D & C In columns C,D & E with:
C1: =IF(ROW()<$A$1,ROW(),"")
D1: =IF(LEN(C1)>0,$A$2/($A$1-C1),"")
E1: =IF(LEN(D2)>0,D2-D1,"")
You could then pull all 3 columns down as far as you need to generate the full length of your vectors. If you then want to chart these, simply use these columns as inputs.
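As a cross-check for the spreadsheet columns, the Matlab calculation can be written out in a few lines of Python. This is only a sketch of the same arithmetic (plotting is left out), with the variable names taken from the Matlab code.

```python
n = 30
T = 295

x = list(range(1, n))            # 1 .. n-1, like 1:(n-1)
D = [T / (n - xi) for xi in x]   # T ./ (n - x)
E = T / n

# C(i) = D(i+1) - D(i), for i = 1 .. n-2
C = [D[i + 1] - D[i] for i in range(len(D) - 1)]

print(len(x), len(D), len(C))
print(D[0], D[-1], C[0])
```

The spreadsheet columns for x, D and C should hold the same values as these three lists, with x starting at 1.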
I have a simple-seeming problem, but in practice it seems to be more involved. In python, for example, it seems like it would be much more straightforward. But I would really like to learn how to do this in Stata.
Say that I have a big dataset. I have several string variables, S1, S2, and S3. I get a subset of S1 based on some criteria. Let's say that this gets me (after sorting and only the data of interest are displayed):
S1
1 A
2 B
3 C
4 D
5 E
Based on different criteria, I get, for S2:
S2
1 B
2 B
3 C
4 F
For S3:
S3
1 B
2 Long string
What I am interested in doing is to get a list of all of the distinct values across S1, S2, and S3. One way I have thought about doing this is:
1. Save all desired values of S1 into a macro, M1. I have not figured out how to do this.
2. Save all desired values of S2 into a macro, M2.
3. Check if the values of M2 are in M1. Do not add the values of M2 that are already in M1, but do add the values of M2 that are not already there. It seems like this post is similar to how to do this step. (Why is there a : in front of list?)
4. Repeat step 3, except for S3/M3 instead of S2/M2.
This would produce the macro M1 with values:
A B C D E F Long String
Note that I do not need this to be in a macro. If it could be in a matrix or some other way, that would work as well. The important part is to get the information.
Several ways to do this.
Many assumptions made in this example (many things are not clear in your post):
clear
set more off
input ///
str15(s1 s2 s3)
a "b" "b"
b "b" "long string"
c "c" ""
d "f" ""
e "" ""
end
list
stack s*, into(news) clear
bysort news : keep if _n == 1
drop _stack
list
If you want to work your way through, using macros, then help macrolists and help levelsof can aid:
clear
set more off
input ///
str15(s1 s2 s3)
a "b" "b"
b "b" "long string"
c "c" ""
d "f" ""
e "" ""
end
list
local uvalues
foreach var of varlist _all {
levelsof `var', local(loc`var')
local uvalues : list uvalues | loc`var'
}
display `"`uvalues'"'
Saying more about how your variables are organized (e.g. one or several files), whether you care or not to destroy the original data set, the treatment of missings, etc. can probably get you an ad hoc answer.
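For readers who think more easily in Python, the union built up in the loop above behaves roughly like the sketch below. It uses the same example values; treating empty strings as missing and skipping them is an assumption made for the sketch.

```python
# Same example values as in the input block above; "" stands for a missing string.
columns = {
    "s1": ["a", "b", "c", "d", "e"],
    "s2": ["b", "b", "c", "f", ""],
    "s3": ["b", "long string", "", "", ""],
}

uvalues = []
for values in columns.values():
    # Sorted distinct values of one variable.
    for value in sorted(set(values)):
        # Skip empty strings and anything already collected.
        if value and value not in uvalues:
            uvalues.append(value)

print(uvalues)
```

Each pass only appends values not already in the list, which is the union step asked about in the question.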