Subset PLINK binary files (BED, BIM & FAM) - subset

Hi everyone I am working on genotype data, I have bed, bim and fam files with the summary statistics of GWAS. Because the number of individuals are a lot, so I want to sample from my binary files in numbers of 3000 randomly . In other words, I want to subset the binary files. Do you know how can I do that by plink, R or python?

you can achieve that using PLINK. First, create a list of individuals that you want to subset and name it say individuals.txt. Next, run the following command to create a separate binary file for individuals in the individuals.txt
plink --bfile toy --keep individuals.txt --make-bed --out toy_subset
Hope this helps.

Related

Move non-sequential files to new directory

I have no previous programming experience. I know this question has been asked before or the answer is out there but I, for the life of me, cannot find it. I have searched google for hours trying to figure this out. I am working on a Red Hat Linux computer and it is in bash.
I have a directory of files 0-500 in /directory/.
They are named as such,
/directory/filename_001, /directory/filename_002, and so forth.
After running my analysis for my research, I have a listofnumbers.txt (txt file, with each row being a new number) of the numbers that I am interested in. For example,
015
124
187
345
412
A) Run a command from the list of files the files from the list of numbers? Our code looks like this:
g09slurm filename_001.com filename_001.log
Is there a way to write something like:
find value (row1 of listofnumbers.txt) then g09slurm filename_row1value.com filename_row1value.log
find value (row2 of listofnumbers.txt) then g09slurm filename_row2value.com filename_row2value.log
find value (row3 of listofnumbers.txt) then g09slurm filename_row3value.com filename_row2value.log
etc etc
B) Move the selected files from the list to a new directory, so I can rename them sequentially, then run a sequential number command?
Thanks.
First, read the list of files into an array:
readarray myarray < /path/to/filename.txt
Next, we'll get all the filenames based on those numbers, and move them
cd /path/to/directory
mv -t /path/to/new_directory "${myarray[#]/#/filename_}"
After this... honestly, I got bored. Stack Overflow is about helping people who make a good start at a problem, and you've done zero work toward figuring this out (other than writing "I promise I tried google").
I don't even understand what
Run a command from the list of files the files from the list of numbers
means.
To rename them sequentially (once you've moved them), you'll want to do something based on this code:
for i in $(ls); do
*your stuff here*
done
You should be able to research and figure stuff out. You might have to do some bash tutorials, here's a reasonable starting place

How to call a large list of paired files to be executed by a program in BASH?

I have a large directory of files (100+) that I'd like to pass through a program via the terminal.
The files are paired and all follow a naming scheme like such:
TS-8_S53_L001_R1_001.fastq
TS-8_S53_L001_R2_001.fastq
RS-9_S54_L001_R1_001.fastq
RS-9_S54_L001_R2_001.fastq
And the program execution looks like:
Seqprogram -i1 Blah_R1_001.fastq -i2 Blah_R2_001.fastq -o Blah_paired.fastq
All of these files are in one directory.
I'd like to able to run the program on all of the files, using the files paired together in the proper sequence (R1 files are passed through i1, the R1 and R2 files have the same base name) and the output file (-o) is saved under the base name with some identifier attached ("_paired", etc).
I've envisioned on how I'd do this over Python; however, I am trying to get better with BASH.
I'm familiar with how one might call multiple files into a single command; i.e., uncompressing all .gz files in a particular directory
gunzip "*.gz"
But this command has two inputs, and the inputs must be ordered, so the wildcard scheme isn't sufficient.
Thanks
Use a wildcard to get one file of the pair, and then use parameter substitution to get the other corresponding filenames.
for i1 in *_R1_001.fastq; do
i2=${i1/R1_001/R2_001}
paired=${i1/R1_001/paired}
Seqprogram -i1 "$i1" -i2 "$i2" -o "$paired"
done
The easiest way to do this is to match a single one of the three filenames patterned, and to modify it to get the other two.
That is to say:
for r1file in *_R1_*.fastq; do
r2file=${r1file/_R1_/_R2_}
pairfile=${r1file%_R1_*}_paired.fastq
Seqprogram -i1 "$r1file" -i2 "$r2file" -o "$pairfile"
done

PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job.
What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.
More specifically the data I am working with is from grocery store flyers. As it stands now we have to manually enter every deal in the flyer into a database. A sample of a flyer is http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551
What I am hoping to do is have columns for products, price, and predefined options (Loyalty Cards, Coupons, Select Variety... that sort of thing).
Any help would be appreciated, and if I need to be more specific let me know.
After looking at the specific PDF linked to by the OP, I have to say that this is not quite displaying a typical table format.
It contains many images inside the "cells", but the cells are not all strictly vertically or horizontally aligned:
So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...
Having said that, I'll have to add:
Extracting even 'nice' tables from PDFs in general is extremely difficult...
Standard PDFs do not provide any hints about the semantics of what they draw on a page:
the only distinction that the syntax provides is the distinctions between vector elements (lines, fills,...), images and text.
Whether any character is part of a table or part of a line or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)
...but doing so with TabulaPDF works very well!
Having said the above now let me add this:
For an amazing open source family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages) -- contradicting what I said in my introductionary paragraphs! -- check out TabulaPDF. See these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Tabula-Extractor is written in Ruby.
In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs.
To run, Tabula-Extractor requires JRuby-1.7 installed.
Installing Tabula-Extractor
I'm using the 'bleeding-edge' version of Tabula-Extractor directly from its GitHub source code repository.
Getting it to work was extremely easy, since on my system JRuby-1.7.4_0 is already present:
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
Included in this Git clone will already be the required libraries, so no need to install PDFBox.
The command line tool is in the /bin/ subdirectory.
Exploring the command line options:
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs
Usage:
tabula [options] <pdf_file>
where [options] are:
--pages, -p <s>: Comma separated list of ranges, or all. Examples:
--pages 1-3,5-7, --pages 3 or --pages all. Default
is --pages 1 (default: 1)
--area, -a <s>: Portion of the page to analyze
(top,left,bottom,right). Example: --area
269.875,12.75,790.5,561. Default is entire page
--columns, -c <s>: X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
--password, -s <s>: Password to decrypt document. Default is empty
(default: )
--guess, -g: Guess the portion of the page to analyze per page.
--debug, -d: Print detected table areas instead of processing.
--format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
--outfile, -o <s>: Write output to <file> instead of STDOUT (default:
-)
--spreadsheet, -r: Force PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines separating
each cell, as in a PDF of an Excel spreadsheet)
--no-spreadsheet, -n: Force PDF not to be extracted using
spreadsheet-style extraction (if there are ruling
lines separating each cell, as in a PDF of an Excel
spreadsheet)
--silent, -i: Suppress all stderr output.
--use-line-returns, -u: Use embedded line returns in cells. (Only in
spreadsheet mode.)
--version, -v: Print version and exit
--help, -h: Show this message
Extracting the table which the OP wants
I'm not even trying to extract this ugly table from the OP's monster PDF. I'll leave it as an excercise to these readers who are feeling adventurous enough...
Instead, I'll demo how to extract a 'nice' table. I'll take pages 651-653 from the official PDF-1.7 specification, here represented with screenshots:
I used this command:
~/svn-stuff/git.tabula-extractor/bin/tabula \
-p 651,652,653 -g -n -u -f CSV \
~/Downloads/pdfs/PDF32000_2008.pdf
After importing the generated CSV into LibreOffice Calc, the spreadsheet looks like this:
To me this looks like the perfect extraction of a table which did spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
Update
Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:

Matching text files from a list of system numbers

I have ~ 60K bibliographic records, which can be identified by system number. These records also hold full-text (individudal text files named by system number).
I have lists of system numbers in bunches of 5K and I need to find a way to copy only the text files from each 5K list.
All text files are stored in a directory (/fulltext) and are named something along these lines:
014776324.txt.
The 5k lists are plain text stored in separated directories (e.g. /5k_list_1, 5k_list_2, ...), where each system number matches to a .txt file.
For example: bibliographic record 014776324 matches to 014776324.txt.
I am struggling to find a way to copy into the 5k_list_* folders only the corresponding text files.
Any idea?
Thanks indeed,
Let's assume we invoke the following script this way:
./the-script.sh fulltext 5k_list_1 5k_list_2 [...]
Or more succinctly:
./the-script.sh fulltext 5k_list_*
Then try using this (totally untested) script:
#!/usr/bin/env bash
set -eu # enable error checking
src_dir=$1 # first argument is where to copy files from
shift 1
for list_dir; do # implicitly consumes remaining args
while read bibliographic record sys_num rest; do
cp "$src_dir/$sys_num.txt" "$list_dir/"
done < "$list_dir/list.txt"
done

Creating a tag cloud from text:nsp output

I am using Text::NSP which creates n-gram from text files. Is it possible to create tag clouds from an output file of Text-NSP? I have used and liked IBM Word Cloud Generator which only gives a tag cloud output from the frequency of each word within a file. However, I am working with 2-grams and 3-grams. In short, I need a tag cloud generator which will accept an input file with words and their occurrence number. I am running on Debian.
Thanks all.
I started to use R snippets package which is what I was searching for.
The output of text::nsp should be changed with some bash scripts in order to obtain a dataframe acceptable by the R.

Resources