Extracting individuals with Plink. Error: Line 1 of --keep file has fewer tokens than expected - genome

I have files with the 2504 individuals of the 1000 genomes project, and I want to filter by population. I did the following for the first population (ACB):
plink --file all1000gen --keep indACB.txt --make-bed --out all1000genACB
but it gives back the following error:
Error: Line 1 of --keep file has fewer tokens than expected.
my indACB.txt file looks like this:
head indACB.txt
HG01879
HG01880
HG01882
HG01883
HG01885
HG01886
HG01889
HG01890
HG01894
HG01896
which I made (for each population, using grep) from the population information file available on the 1000 genomes page, which repeats the individual ID twice (the first two columns) and has the population name in the third, as shown:
head indpop2.txt
HG00096 HG00096 GBR
HG00097 HG00097 GBR
HG00099 HG00099 GBR
HG00100 HG00100 GBR
HG00101 HG00101 GBR
HG00102 HG00102 GBR
HG00103 HG00103 GBR
HG00105 HG00105 GBR
HG00106 HG00106 GBR
HG00107 HG00107 GBR
I think there's a problem with my --keep file, but I'm not sure what structure the txt file is expected to have.
I also tried grepping the ACB individuals from indpop2.txt, so the new indACB2.txt file looks like this:
head indACB2.txt
HG01879 HG01879 ACB
HG01880 HG01880 ACB
HG01882 HG01882 ACB
HG01883 HG01883 ACB
HG01885 HG01885 ACB
HG01886 HG01886 ACB
HG01889 HG01889 ACB
HG01890 HG01890 ACB
HG01894 HG01894 ACB
HG01896 HG01896 ACB
But it yields the following error:
plink --file allconcat39 --keep indACB2.txt --make-bed --out allconcat43ACB
Error: No people remaining after --keep.

The first two columns are the family and individual IDs; the third column is expected to be a numeric value (although the file can have more than three columns), and only individuals with a value of 1 there are included in any subsequent analysis or file-generation procedure.
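Since the error on the one-column file means --keep expects at least the family ID and individual ID columns, one way to rebuild the per-population keep file from indpop2.txt is to keep just those first two columns. A minimal Python sketch, assuming the whitespace-separated layout shown above (the helper name is made up):

```python
def keep_lines(lines, population):
    """Return "FID IID" rows for one population from indpop2.txt-style rows."""
    kept = []
    for line in lines:
        fid, iid, pop = line.split()
        if pop == population:
            kept.append(f"{fid} {iid}")
    return kept

# sample rows shaped like indpop2.txt
sample = ["HG00096 HG00096 GBR", "HG01879 HG01879 ACB", "HG01880 HG01880 ACB"]
acb = keep_lines(sample, "ACB")  # ["HG01879 HG01879", "HG01880 HG01880"]
```

Writing "\n".join(acb) plus a trailing newline to indACB2.txt gives a two-column file to pass to --keep.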

Related

How to pull specific data from table, .txt file, Python 3

I would like to pull data from 2 columns (Input & Surname) from a table that is saved as a .txt file and then generate an output file (by writing a script) with the two columns (Input and Surname). I know how to do this with normal lines but have no idea where to start with a table format.
Example table -
Input  Name     Middle-name  Surname   Gender
123    Sam      Mitchell     Grant     Male
123    Sameuel  n/a          Fineus    Male
123    Sharron  Elizabeth    Graceson  Female
Actual data -
Input Input Type MGI Gene/Marker ID Symbol Name Feature Type
GO:0003723 Gene Ontology (GO) MGI:87879 Aco1 aconitase 1 protein coding gene
GO:0003723 Gene Ontology (GO) MGI:88022 Ang angiogenin, ribonuclease, RNase A family, 5 protein coding gene
GO:0003723 Gene Ontology (GO) MGI:88042 Apex1 apurinic/apyrimidinic endonuclease 1 protein coding gene
The second row of the table starts from GO:0003723 and each new row starts with GO:0003723 as well.
You can use the csv module to parse tab-separated value files.
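For instance, csv.DictReader can pull the two columns and write them back out. This sketch assumes the table is tab-delimited and uses an in-memory copy of the example table in place of the real .txt file:

```python
import csv
import io

# in-memory stand-in for the tab-delimited .txt file
data = (
    "Input\tName\tMiddle-name\tSurname\tGender\n"
    "123\tSam\tMitchell\tGrant\tMale\n"
    "123\tSameuel\tn/a\tFineus\tMale\n"
    "123\tSharron\tElizabeth\tGraceson\tFemale\n"
)

reader = csv.DictReader(io.StringIO(data), delimiter="\t")
pairs = [(row["Input"], row["Surname"]) for row in reader]

# write just the two requested columns to a new tab-delimited file
with open("output.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Input", "Surname"])
    writer.writerows(pairs)
```

For a real file, replace io.StringIO(data) with open("input.txt", newline="").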

How to swap characters around inside a column in EXCEL?

Specifically, I know ahead of time I only need to swap position 1 and 2 with 4 and 5.
2 Examples:
HEART
New output:
RTAHE
12734
New output:
34712
There are probably more than a handful of ways to do this. If you're interested in a formula, here is one way to go about it:
=RIGHT(A3,2)&MID(A3,3,LEN(A3)-4)&LEFT(A3,2)
Seems to be working on some test data I threw together.
A bit more robust, as suggested by @Rafalon:
=MID(A3,4,2)&MID(A3,3,1)&LEFT(A3,2)&MID(A3,6,LEN(A3))
Produces the following results:

Input    Output
1        1
12       12
123      312
1234     4312
12345    45312
123456   453126
1234567  4531267
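Outside Excel, the same swap is easy to sanity-check; here is a Python translation of the robust formula, with slicing mirroring the MID/LEFT pieces:

```python
def swap_ends(s):
    # chars 4-5, then char 3, then chars 1-2, then everything from char 6 on,
    # mirroring =MID(A3,4,2)&MID(A3,3,1)&LEFT(A3,2)&MID(A3,6,LEN(A3))
    return s[3:5] + s[2:3] + s[:2] + s[5:]

swap_ends("HEART")  # "RTAHE"
swap_ends("12734")  # "34712"
```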

Python/Pandas - Preparing Source Data with Weekly Columns to Time Series

I tried to google a question like this: How to transform weekly data for time series analysis in Pandas?
This question is hard to search for without getting results about re-sampling data from daily to weekly or something along those lines.
My question is really about source data that is already weekly numerical data, with no time or date information like a datetime stamp.
Here is the form: (Please use the vertical bars for logical alignment of each row.)
Unique_Entity(string) | WK1(float64) | WK2(float64) | WK3(float64)| ...
UE1 | 123 | 234 | 345 | ...
UE2 | 456 | 567 | 678 | ...
UE3 | 789 | 890 | 901 | ...
... | ... | ... | ... | ...
Also WK1 is a "dynamic" description to indicate the numerical data is last week, WK2 is two weeks ago, WK3 is three weeks ago, and so on. So next week WK1's data will shift to WK2 and new data will be added to WK1. Hope that makes sense from my description.
With this being the source data format, I'd like to analyze this live data using time series tools provided by pandas and other python modules. A lot of them use an explicit date column to get their claws in for the rest of the analysis.
Wrap-Up Question: How do I transform or prepare my source data so that these tools can be easily used? (Apart from my naive solution below.)
Naive Solution: I could tag the date of the Monday (Or Friday) every week going backwards. (A function that uses today's date to then generate the dates of every Monday (Or Friday) going back.) Then I could point those time series tools to use those dates and re-sample as weeks.
This is assuming I've un-pivoted the horizontal headers so that WK1 will join with last Monday's (Or Friday's) date and so forth.
Create a DatetimeIndex ending today with a one-week frequency, reverse it, and assign it to the columns:
import datetime
import pandas as pd

df.columns = pd.date_range(end=datetime.date.today(), periods=len(df.columns),
                           freq='1W-MON')[::-1]
It gives:
2019-06-10 2019-06-03 2019-05-27
UE1 123 234 345
UE2 456 567 678
UE3 789 890 901
Transpose the result if needed.
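Put together, a runnable sketch of the whole idea, including the un-pivot to a long format with an explicit date column (the toy frame and the column/variable names are made up to mirror the question's layout):

```python
import datetime
import pandas as pd

# toy frame mirroring the question's layout (entities as rows, WK1..WK3 columns)
df = pd.DataFrame(
    {"WK1": [123, 456, 789], "WK2": [234, 567, 890], "WK3": [345, 678, 901]},
    index=["UE1", "UE2", "UE3"],
)

# replace the WK labels with real Monday dates, most recent first
df.columns = pd.date_range(
    end=datetime.date.today(), periods=len(df.columns), freq="W-MON"
)[::-1]

# un-pivot to long form with an explicit date column for time-series tools
long_df = (
    df.T.rename_axis("date")
    .reset_index()
    .melt(id_vars="date", var_name="entity", value_name="value")
)
```

From here, long_df can be indexed by "date" and fed to the usual resampling and plotting tools.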

How to resolve duplicate column names in excel file with Alteryx?

I have a wide excel file with price data, looking like this
Product | 2015-08-01 | 2015-09-01 | 2015-09-01 | 2015-10-01
ABC | 13 | 12 | 15 | 14
CDE | 69 | 70 | 71 | 67
FGH | 25 | 25 | 26 | 27
The date 2015-09-01 can be found twice, which in the context is valid but obviously messes up my workflow.
It can be understood that the first value is the minimum price, the second one the maximum price. If there is only one column, min and max are the same.
Is there a way to resolve this issue?
An idea I had was the following:
I also have cells that contain a value like "38 - 42", again indicating min and max. I resolved this by splitting it with a regex expression. What could be a solution is to join two columns that have the same header and afterwards split the values according to my rules. That however would require me to detect dynamically whether the headers are duplicates.
Is that something that is possible in Alteryx or is there an easier solution for this problem?
And of course asking the supplier of the file to change it is, unfortunately, not really an option.
Thanks
EDIT:
Just got another idea:
I transpose the table to have the format
Product | Date | Price Low | Price High
So if I could check for duplicates in that table and somehow merge these records into one, that would do the trick as well.
EDIT2:
Since I seem not to have made that clear: my final result should look like the transposed table in EDIT1. If there is only one value it should go in "Price Low" (and then I will probably copy it to "Price High" anyway). If there are two values they should go in the corresponding columns. @Poornima's suggestion resolves the duplicate issue in a more sophisticated way than appending "_2" to the column name, but doesn't put the value in the required column.
If this format works for you:
Product | Date | Price Low | Price High
Then:
- Transpose with Product as a key field
- Use a select tool to truncate your Name field to 10 characters. This will remove any _2 values that Alteryx has automatically renamed.
- Summarize:
Group by Product
Group by Name
Then apply Min and Max operations to value.
Result is:
Product | Name | Min_Value | Max_Value
ABC | 2015-08-01 | 13 | 13
ABC | 2015-09-01 | 12 | 15
ABC | 2015-10-01 | 14 | 14
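For comparison, the same transpose/truncate/summarize chain can be sketched in pandas; here the duplicated header is assumed to arrive mangled with a suffix, much as Alteryx renames it with "_2":

```python
import pandas as pd

# wide price table; the second "2015-09-01" arrives renamed with a suffix
df = pd.DataFrame({
    "Product": ["ABC", "CDE", "FGH"],
    "2015-08-01": [13, 69, 25],
    "2015-09-01": [12, 70, 25],
    "2015-09-01.1": [15, 71, 26],
    "2015-10-01": [14, 67, 27],
})

# transpose to long form, truncate header names to the first 10 characters
# (stripping the suffix), then take min/max per Product and date
long = df.melt(id_vars="Product", var_name="Date", value_name="Price")
long["Date"] = long["Date"].str[:10]
out = long.groupby(["Product", "Date"], as_index=False)["Price"].agg(
    Price_Low="min", Price_High="max"
)
```

Where a date appears only once, min and max coincide, which matches the "copy Price Low to Price High" fallback from the edit.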
For this problem, you can leverage the native Excel (.xlsx) driver available in Alteryx 9.1. If multiple columns in Excel use the same string, then they are renamed by the native driver with an underscore at the end e.g., 2015-09-01, 2015-09-01_1. By leveraging this, we can reformat the data in three steps:
As you suggested, we start by transposing the data so that we can leverage the column headers.
We can then write a formula with the Formula Tool that evaluates whether the column header for the date is the first or the last one based on the header length.
The final step would be to bring the data back into the same format as before, which can be via the Crosstab Tool.
You can review the configurations for each of these tools here. The end result would be as follows.
Hope this helps.
Regards,
Poornima

Selecting Text from an R string to create a new object

I'm relatively new to R, and I'm currently stuck.
I have observations that are made up of legal articles, e.g.:
BIV:III,XXVIII.1(b);CIV:2.
So I split them, resulting in a list with, for each observation, the legal articles used. This looks like:
ArtAGr list of 400230
chr[1:2] "BIV:III,XXVIII.1(b)" "CIV:2"
chr[1:1] "ILA:2.3(b)"
chr[1:3] "BIV:IB.3(d)" "CIV:7,9" "ILA:VII.1"
The BIV and CIV would need to become my new variables. However, the observations vary, so some observations include both BIV and CIV, while others include other legal articles like ILA:II.3(b)
Now, I would like to create a dataframe from these guys, so I can group all the observations in a column for each major article.
Eventually, the perfect dataframe should look like:
Dispute BIV CIV ILA
1 III, XXVIII.1(b) 2 NA
2 NA NA II.3(b)
3 IV.3(d) 7,9 VII.1
4 II NA NA
So, I will need to create a new object grouping all observations that contain text like BIV, with a 0 or NA for those observations that do not use this legal article. Any thoughts would be greatly appreciated!
Thanks a lot!
Sven
Here's an approach:
# a vector of character strings (not the splitted ones)
vec <- c("BIV:III,XXVIII.1(b);CIV:2",
"ILA:II.3(b)",
"BIV:IB.3(d);CIV:7,9;ILA:VII.1")
# split strings
s <- strsplit(vec, "[;:]")
# target words
tar <- c("BIV", "CIV", "ILA")
# create data frame
setNames(as.data.frame(do.call(rbind, lapply(s, function(x)
  replace(rep(NA_character_, length(tar)),
          match(x[c(TRUE, FALSE)], tar), x[c(FALSE, TRUE)])))), tar)
The result:
BIV CIV ILA
1 III,XXVIII.1(b) 2 <NA>
2 <NA> <NA> II.3(b)
3 IB.3(d) 7,9 VII.1