Genome annotation and curation

How do we spot indel errors in an annotated .fna file of Bacillus? Is there a template or example?
I've annotated a nucleotide FASTA sequence in RASTtk and am following a paper on how to further curate it. The paper mentions that insertion/deletion (indel) errors were corrected and adjusted in CLC Genomics Workbench 11, as there was enough depth of coverage. I would like to use CLC Genomics Workbench to check for errors in my sequences, but how does one recognise what is an error, and what does an error-free (or corrected) file look like, so as to tell whether the indel sequences have been modified? Can we simply open the FASTA file and identify the errors?
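For intuition: an indel error isn't visible in the FASTA text on its own; it only becomes apparent when the original sequencing reads are mapped back to the assembly and consistently disagree at a position. This is essentially what CLC's read mapping does internally. A minimal sketch of the idea using pysam (file and contig names are placeholders, and the 50% threshold is an assumption):

import pysam

# Placeholder names; assumes reads were mapped back to the annotated
# assembly and the BAM was sorted and indexed (e.g. bwa + samtools).
bam = pysam.AlignmentFile("reads_vs_assembly.bam", "rb")
for col in bam.pileup("contig_1"):
    reads = list(col.pileups)
    with_indel = sum(1 for r in reads if r.indel != 0)
    # If most reads carry an insertion/deletion relative to the assembly
    # at this position, the assembly itself likely has an indel error.
    if len(reads) >= 10 and with_indel / len(reads) > 0.5:
        print(col.reference_pos, f"{with_indel}/{len(reads)} reads show an indel")
bam.close()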


Opening .hoc files in the NEURON simulator + "not a mechanism" problem

I am trying to run a NEURON simulation via Python. I have all the libraries in order and am able to run some simple simulations, but I'm running into trouble with a more complicated piece of code. If you have any idea how to help, I will appreciate it very much.
Problem number 1:
Neuron doesn't open part of a .hoc file even though it is compiled. I get the error:
NEURON: Can't open import3d/import3d_sec.hoc
in import3d.hoc near line 1
{xopen("import3d/import3d_sec.hoc")}
^
xopen("import3d/i...")
xopen("import3d.hoc")
execute1("{xopen("im...")
load_file("C:/Users/U...")
Problem number 2:
The simulator doesn't recognize a mechanism I am trying to use. Here I am a bit lost and don't know how to describe it further, but this is the error message:
NEURON: Im is not a MECHANISM
in L5PCbiophys5b.hoc near line 26
insert Im
^
xopen("L5PCbiophy...")
execute1("{xopen("L5...")
load_file("C:/Users/U...")
Problem number 3:
Not recognizing as a template:
NEURON: Import3d_Neurolucida3 is not a template
in L5PCtemplate_2.hoc near line 26
nl = new Import3d_Neurolucida3()
^
xopen("L5PCtempla...")
execute1("{xopen("L5...")
load_file("C:/Users/U...")
For problem 1, you can try using an absolute path in the xopen() call.
For problem 2, the name used with insert must match the SUFFIX statement in the corresponding .mod file; also make sure that file was compiled in and that the dll is loaded (there should be a message when you start nrniv).
Problem 3 is perhaps a result of the xopen problem above? If the file contains a template, you should load it with load_file() instead of xopen().
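Putting those together, a minimal sketch of the loading order from Python (paths are placeholders; assumes the import3d hoc files were copied into your model folder and the mechanisms were compiled there):

import os
from neuron import h

# import3d.hoc xopens its parts via relative paths ("import3d/..."),
# so run from the directory that actually contains them (placeholder path).
os.chdir(r"C:/Users/USERNAME/model")

h.load_file("import3d.hoc")        # load_file(), not xopen(), for template files
# Im must come from mechanisms compiled into nrnmech.dll in this folder;
# NEURON prints a message at startup when the dll has been loaded.
h.load_file("L5PCbiophys5b.hoc")
h.load_file("L5PCtemplate_2.hoc")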

How do I resolve "Illuminate\Queue\InvalidPayloadException: Unable to JSON encode payload. Error code: 5"?

Trying out the queue system for a better user upload experience with Laravel-Excel.
The queue driver in .env has been changed from 'sync' to 'database' and the migrations have been run. All the necessary use statements are in place, yet the error above persists.
The exact error happens here:
Illuminate\Queue\Queue.php:97
$payload = json_encode($this->createPayloadArray($job, $queue, $data));
if (JSON_ERROR_NONE !== json_last_error()) {
throw new InvalidPayloadException(
If I drop ShouldQueue, the file imports perfectly in-session (it's a large file, so a long wait for the user).
I've read many Stack Overflow, GitHub, etc. comments on this, but I don't have the technical skills to deep-dive into my particular situation (most of them talk about UTF-8, but I don't know if that's the issue here; I changed the Excel save format to UTF-8 and it didn't fix it).
P.S. While running the migration, I got the error:
SQLSTATE[42000]: Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes (SQL: alter table `jobs` add index `jobs_queue_index`(`queue`))
I bypassed it by dropping the 'add index' part, so my jobs table is not indexed on queue, but I don't feel this is the cause.
One thing you can do when looking into json_encode() errors is call the json_last_error_msg() function, which will give you a slightly more readable error message.
In your case you're getting a 5 back, which is the JSON_ERROR_UTF8 error code. The error message for this is a slightly more informative one:
'Malformed UTF-8 characters, possibly incorrectly encoded'
So we know it's encountering non-UTF-8 characters, even though you're saving the file specifically with UTF-8 encoding. At first glance you might think you need to convert the encoding yourself in code, but in this case I don't think that will help. For Laravel-Excel, this seems to be a limitation of trying to queue-read .xls files; from the Laravel-Excel docs:
You currently cannot queue xls imports. PhpSpreadsheet's Xls reader contains some non-utf8 characters, which makes it impossible to queue.
In this case you might be stuck with the slow, non-queueable option, or need to convert your spreadsheet into a queueable format, e.g. .csv.
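If you want to confirm where the bad bytes are, a quick standalone check can scan the file for byte offsets that fail UTF-8 decoding (a hypothetical helper in Python, independent of Laravel; the file name is a placeholder):

def find_invalid_utf8(path):
    data = open(path, "rb").read()
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                      # everything from here decodes cleanly
        except UnicodeDecodeError as e:
            bad = pos + e.start        # e.start is relative to the slice
            print("invalid byte 0x%02x at offset %d" % (data[bad], bad))
            pos = bad + 1

find_invalid_utf8("upload.xls")        # placeholder file name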
The key-length error when running the migration is unrelated. It has been around for a while and is a side-effect of using an older version of MySQL/MariaDB. Check the Laravel documentation around index lengths: you need to add this to your AppServiceProvider::boot() method:
Schema::defaultStringLength(191);

BASIC PRINT LIN(2)

I was looking through the 1975 Oregon Trail BASIC code and found this line in it:
PRINT LIN(2)
I have searched quite a few places but can't find any reference to it.
Can anyone tell me what this means?
Edit:
Sorry, to be clear: I'm asking what PRINT LIN(2) actually does.
"Oregon Trail" source of 1975 at www.filfre.net/misc/oregon1975.bas was written in BASIC for a HP-2100 system.
This HP-2100 system was a series of minicomputers produced by Hewlett-Packard.
This system run an interpreted BASIC named "HP Time-Shared BASIC".
This is the reference manual of "TimeShared BASIC/2000 Level F".
About:
PRINT LIN(2)
This generates a carriage return and two line feeds.
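For illustration, a rough Python equivalent of the characters PRINT LIN(2) emits (based on the manual's description above, not on HP BASIC itself):

import sys
# A carriage return followed by two line feeds.
sys.stdout.write("\r" + "\n" * 2)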
"Oregon Trail" for year 1978 at www.filfre.net/misc/oregon1978.bas was written using BASIC for "CDC Cyber range of mainframe-class supercomputers of Control Data Corporation (CDC)".
Documentation
http://bitsavers.org/pdf/cdc/cyber/lang/basic/
19983900K_BASIC_Version_3_Reference_Manual_Aug84.pdf
I compared both sources (stripping line numbers that are not referenced by a THEN, GOTO, or GOSUB) at:
Oregon Trail Compare

Labelling text using Notepad++ or any other tool

I have several .dat files containing information about hotel reviews, like the one below:
/*
<Author> simmotours
<Content> review......goes here
<Date>Nov 18, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>4`enter code here`
<Value>4
<Rooms>3
<Location>4
<Cleanliness>4
<Check in / front desk>4
<Service>4
<Business service>-1
*/
I want to classify the reviews into two classes, pos and neg, i.e. have two folders, pos and neg, containing the files: reviews with an overall score above 3 classified as positive and below 3 as negative.
How can I quickly and efficiently automate this process?
You could write a Python script to read the overall score: loop over the lines using readline(), find the "Overall" score with some string parsing, then move the file into the right directory. These are all very simple things to do in Python; just break the task down into steps and search for answers to each step. A sketch of this is below.
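A minimal sketch, assuming the .dat files sit in a reviews/ folder and each contains one <Overall> line (the folder layout and the handling of a score of exactly 3 are assumptions):

import re
import shutil
from pathlib import Path

src = Path("reviews")                  # placeholder input folder
for label in ("pos", "neg"):
    (src / label).mkdir(exist_ok=True)

for dat in list(src.glob("*.dat")):
    m = re.search(r"<Overall>\s*(-?\d+)", dat.read_text(errors="replace"))
    if m is None:
        continue                       # no <Overall> line; leave the file alone
    score = int(m.group(1))
    if score > 3:
        shutil.move(str(dat), str(src / "pos" / dat.name))
    elif score < 3:
        shutil.move(str(dat), str(src / "neg" / dat.name))
    # a score of exactly 3 is left unclassified, per "above 3" / "below 3"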
Notepad++ can do replacements with regular expressions and allows you to define macros; use them to convert the file to XML (check out the help file).
Then you can read it with any scripting language and do what you want.
Alternatively you could change the file to a form where you can load it into Excel and do the analysis there.

Can't Input Tab Delimited file to Stanford Classifier

I'm having a problem inputting tab-delimited files into the Stanford Classifier.
Although I was able to successfully walk through all the included stanford tutorials, including the newsgroup tutorial, when I try to input my own training and test data it doesn't load properly.
At first I thought the problem was that I was saving the data into a tab delimited file using an Excel spreadsheet and it was some kind of encoding issue.
But then I got exactly the same results when I did the following. First I literally typed the demo data below into gedit, making sure to use a tab between the politics/sports class and the ensuing text:
politics Obama today announced a new immigration policy.
sports The NBA all-star game was last weekend.
politics Both parties are eyeing the next midterm elections.
politics Congress votes tomorrow on electoral reforms.
sports The Lakers lost again last night, 102-100.
politics The Supreme Court will rule on gay marriage this spring.
sports The Red Sox report to spring training in two weeks.
sports Messi set a world record for goals in a calendar year in 2012.
politics The Senate will vote on a new budget proposal next week.
politics The President declared on Friday that he will veto any budget that doesn't include revenue increases.
I saved that as myproject/demo-train.txt and a similar file as myproject/demo-test.txt.
I then ran the following:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt
The good news: this actually ran without throwing any errors.
The bad news: since it doesn't extract any features, it can't actually estimate a real model and the probability defaults to 1/n for each item, where n is the number of classes.
So then I ran the same command but with two basic options specified:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
That yielded:
Exception in thread "main" java.lang.RuntimeException: Training dataset could not be processed
at edu.stanford.nlp.classify.ColumnDataClassifier.readDataset(ColumnDataClassifier.java:402)
at edu.stanford.nlp.classify.ColumnDataClassifier.readTrainingExamples(ColumnDataClassifier.java:317)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:1652)
at edu.stanford.nlp.classify.ColumnDataClassifier.main(ColumnDataClassifier.java:1628)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:670)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(ColumnDataClassifier.java:267)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:396)
... 3 more
These are exactly the same results I get when I used the real data I saved from Excel.
What's more, I don't know how to make sense of the ArrayIndexOutOfBoundsException. When I used readline in Python to print out the raw strings for both the demo files I created and the tutorial files that worked, nothing about the formatting seemed different, so I don't know why this exception would be raised with one set of files but not the other.
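(Roughly, that check looked like this; printing each line's repr makes tabs visible as \t while spaces stay literal:)

for path in ("myproject/demo-train.txt", "myproject/demo-test.txt"):
    with open(path) as f:
        for line in f:
            print(repr(line))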
Finally, one other quirk. At one point I thought maybe line breaks were the problem. So I deleted all line breaks from the demo files while preserving tab breaks and ran the same command:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"
Surprisingly, this time no java exceptions are thrown. But again, it's worthless: it treats the entire file as one observation, and can't properly fit a model as a result.
I've spent 8 hours on this now and have exhausted everything I can think of. I'm new to Java but I don't think that should be an issue here -- according to Stanford's API documentation for ColumnDataClassifier, all that's required is a tab delimited file.
Any help would be MUCH appreciated.
One last note: I've run these same commands with the same files on both Windows and Ubuntu, and the results are the same in each.
Use a properties file, as in the included Stanford classifier example:
trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
2.useSplitWords=true
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=\\s+
The number 2 at the start of lines 3, 4 and 5 refers to the (zero-indexed) column in your TSV file that holds the text. So in your case, with the class in column 0 and the text in column 1, you would use:
trainFile=myproject/demo-train.txt
testFile=myproject/demo-test.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
Or, if you want to run with command-line arguments:
java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -1.useSplitWords -1.splitWordsRegexp "\s+"
I've faced the same error as you.
Pay attention to the tabs in the text you are classifying.
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
This means that, at some point, the classifier expects an array of 3 elements after it splits a line on tabs.
I ran a method that counts the number of tabs in each line; if some line does not have exactly two of them, that line is where the error is.
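A minimal sketch of such a check (the file name and expected tab count are assumptions to adapt):

expected_tabs = 2    # highest column index referenced in the properties file
with open("myproject/demo-train.txt") as f:
    for lineno, line in enumerate(f, start=1):
        n = line.rstrip("\n").count("\t")
        if n != expected_tabs:
            print("line %d: %d tab(s) instead of %d" % (lineno, n, expected_tabs))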
