I am writing a simple MIB file parser.
The ANTLR 4 tool generates the following files:
MibParser.java
MibLexer.java
MibListener.java
MibBaseListener.java
Can I force ANTLR 4 to combine pairs of those files to reduce clutter?
Is there a command-line switch for that?
You can use the -no-listener and -no-visitor options to suppress the generation of two of those files if you don't need them. Beyond that, there is currently no option to combine the generated files, and I completely disagree that four files represent "clutter" in this context, so I would argue strongly against adding such an option.
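For example, assuming the grammar file is named Mib.g4 and the ANTLR 4 jar is already on your classpath, suppressing the listener classes might look like:

java org.antlr.v4.Tool -no-listener Mib.g4

Note that visitor generation is off by default, so -no-visitor only matters if you have explicitly enabled it.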
Łukasz Bownik wrote:
Can I force ANTLR 4 to combine pairs of those files to reduce clutter?
No.
I have multiple versions of script files within the same directory. I want Ansible, via the find module, to always choose the highest-version file through some sort of sorting comparison. The problem is that the integer sort I want won't work because the strings in question aren't purely numeric, while lexicographic sorting won't give the expected version order. The filenames follow a naming "convention", but there is no fixed, hard-coded versioning scheme to work with; the versions are arbitrary within each project. If integer ordering could be used, we could determine which file to use for each task.
For example, let's say I have the following three strings:
script_v3.sh
script_v9.sh
script_v10.sh
The normal sorting methods/filters within Ansible (such as the sort filter) perform a lexicographic comparison of these strings. That means the "highest" value is script_v9.sh, when the expected result is script_v10.sh; the task would pick script_v9.sh and the rest of that process would fail. I would love to turn the names into integers to do an integer comparison/sort, but because the strings contain non-numeric characters, every attempt so far has failed. Note that we also occasionally need the lowest version in some tasks, which lexicographic sorting likewise gets wrong.
I would like to know whether this can be accomplished through some convenient comparison method or filter I have overlooked, or whether anyone has a better idea. The only approach I could come up with is to use a regex to extract the integers, compare those by themselves as integers, and then match the result back to the filename containing the highest value so the task can use it. However, I'm terrible at regular expressions, and I'm not even certain that's the most elegant way to approach this. Any help would be greatly appreciated.
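For reference, the regex-plus-integer comparison described above can be sketched in plain Python (outside Ansible), using the example filenames from earlier:

import re

files = ["script_v3.sh", "script_v9.sh", "script_v10.sh"]

def version_key(name):
    # pull the first run of digits out of the filename and compare it numerically
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1

ordered = sorted(files, key=version_key)
print(ordered[-1])  # highest version: script_v10.sh
print(ordered[0])   # lowest version:  script_v3.sh

The same key idea, sorting on the extracted number rather than on the whole string, is what any Ansible/Jinja filter solution would have to express.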
Suppose I would like to use a simple language that is only a subset of Perl 6 as an extension/embeddable language to "script" my own Perl 6 programs.
For example, let this language have only:
variable declaration
expressions
literals
with Perl 6 syntax, and perhaps a very limited subset of built-in functions.
Anything outside of this should cause a compilation error and should not be executed.
Is it possible to (re)use the Rakudo compiler for this, or can it only be done with a hand-written interpreter/compiler?
Let me clarify my motivation for this.
Using a (subset of the) host language (Perl 6 in this case) as a DSL for configuration files for scripts/apps written in the host language. This can be done with EVAL (the Perl 6 equivalent of do(file)), but it's not safe at all, since there is no control over what EVAL can do.
Using a (subset of the) host language as an extension/scripting language for apps written in the host language, much like scripting Blender with Python or WoW with Lua. I guess an app core exposing some API is needed in this case, but how exactly should/can that be done?
But again, why use the host language for configuration/scripting?
In the case of conf files, I don't like using "foreign languages" like YAML or JSON because:
extra code/libraries are needed to convert data from those formats into native Perl 6 in-memory data structures, while we already have everything (language and compiler) needed to express a conf file's contents;
a conf file can use host-language code natively (e.g. callbacks) with compile-time checks;
portability of conf files is not an issue in my case.
In the case of extension/scripting: again, I don't see any reason to use Lua or Python for Perl 6 apps, but I also don't like the idea of inventing my own extension/scripting language and writing an interpreter/compiler for it in Perl 6 when I already have Perl 6/Rakudo.
I know that this isn't the answer you were looking for, but I really think most configuration can be handled well by JSON. JSON is well accepted outside of the JavaScript community, and many languages support it. In fact, JSON::Fast comes with Rakudo Star (as evidenced by its json_fast submodule). You can convert a JSON file to Perl 6 data structures with this one-liner (okay, two-liner including use JSON::Fast):
use JSON::Fast;
my %json = from-json(slurp($filename));
Also, JSON is a pretty decent data structure. It can be simple if you need simple, but you can use it for very complex configurations using nested hashes and arrays in just about any combination.
Perl 6 code can be abstracted into Blocks. You can declare Block-typed variables, and you can also declare subsets of Block using where. If you can express the restrictions on your Perl 6 blocks as Perl 6 expressions, you can easily create subsets of Perl 6 by combining those constraints. Your DSL will then consist of valid objects of the (sub)type you have declared.
I'm using macros in MASM to generate about 2000 functions, and for each of them I define a string, but I only use around 30 of them in any given program.
(There is no way to predict which ones I will use ahead of time; I use them as needed.)
Is there any way to tell the linker to "strip out" the strings that I don't end up using? They blow up the binary size by quite a lot.
Why don't you just put those 2000 functions and strings into a static library? Make the procs public and use externdef for the strings; then, when you link your exe against the lib, the linker will pull in only the procs and strings that are actually used.
My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though the frequently used commands sometimes differ by a few characters because different variable names are used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.
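As a crude first pass at "a list of frequent substrings" (brute-force n-gram counting rather than an actual compression dictionary), a Python sketch along these lines works on a modest codebase; the example formulas are made up:

from collections import Counter

def frequent_substrings(corpus, min_len=10, max_len=40, top=20):
    # count every substring of the chosen lengths across all formulas
    counts = Counter()
    for formula in corpus:
        for n in range(min_len, min(max_len, len(formula)) + 1):
            for i in range(len(formula) - n + 1):
                counts[formula[i:i + n]] += 1
    return counts.most_common(top)

formulas = ["TRIM_LEFT(TRIM_RIGHT(A1))", "TRIM_LEFT(TRIM_RIGHT(B7))"]
for sub, freq in frequent_substrings(formulas, min_len=10, max_len=20):
    print(freq, sub)

This is quadratic per formula, so for a large codebase a suffix-array or similar well-established approach would be the better tool.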
String matching is just the low-hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in a different order. For example, suppose you have:
X+Y
Y+X
Your string-matching approach won't realize that those are effectively the same. If you want to go a bit deeper, I think you need to parse the formulas into an AST and actually compare the ASTs. If you did that, you could see that the trees are actually the same, since the binary operator '+' is commutative.
You could also apply reduction rules to rewrite complex expressions into simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there.
Parse into AST
Reduce and Optimize the functions
Compare the resulting AST to other ASTs
If you find a match then replace them with a call to a shared function.
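To make the commutativity point concrete, here is a rough Python sketch that parses Python-syntax formulas with the standard ast module and normalizes the operand order of '+' and '*' before comparing; a real implementation would parse your own formula language instead:

import ast

def canonical(node):
    # render a node to text, sorting the operands of commutative '+' and '*'
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mult)):
        parts = sorted([canonical(node.left), canonical(node.right)])
        op = "+" if isinstance(node.op, ast.Add) else "*"
        return f"({parts[0]} {op} {parts[1]})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return ast.dump(node)

a = canonical(ast.parse("X + Y", mode="eval").body)
b = canonical(ast.parse("Y + X", mode="eval").body)
print(a == b)  # True: both formulas normalize to the same form

Sums of more than two terms would need flattening before sorting, and the distributive example above needs real reduction rules, so treat this only as the shape of the idea.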
I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.
You would then be able to run queries and see the most-used formulas, which ones appear next to each other, and so on.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter
You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo, which probably won't work for you, since it uses spaces as delimiters.
Task:
to cluster a large pool of short DNA fragments into classes that share common sub-sequence patterns, and to find the consensus sequence of each class.
Pool: ca. 300 sequence fragments
8 - 20 letters per fragment
4 possible letters: a,g,t,c
each fragment is structured in three regions:
5 generic letters
8 or more positions of g's and c's
5 generic letters
(As regex that would be [gcta]{5}[gc]{8,}[gcta]{5})
Plan:
to perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2, along with their consensus sequences.
Questions:
Are my fragments too short, and would it help to increase their size?
Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
Which alternative methods or tools can you suggest for this task?
Best regards,
Simon
Yes, 300 is FAR TOO FEW, considering that this is the human genome and you're essentially just looking for a particular class of 8-mer. There are 65,536 possible 8-mers and about 3,000,000,000 bases in the genome (assuming you're looking at the entire genome and not just genic or coding regions). You'll find G/C-only sequences roughly 3,000,000,000 / 65,536 * 2^8 =~ 12,000,000 times (and probably far more, since the genome is full of CpG islands compared to other things). Why choose only 300?
You don't want to use regexes for this task. Just start at chromosome 1, look for the first CG or GC, and extend until you hit the first base that is neither G nor C. Then take that sequence along with its context and save it (in a DB). Rinse and repeat.
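A small sketch of that scan in Python; the sequence string is a toy placeholder, and the minimum run length of 8 and context of 5 come from the fragment structure described in the question:

def gc_runs(seq, min_len=8, context=5):
    # walk the sequence and yield every maximal g/c run of at least min_len,
    # together with a few flanking bases of context on each side
    i, n = 0, len(seq)
    while i < n:
        if seq[i] in "gc":
            j = i
            while j < n and seq[j] in "gc":
                j += 1
            if j - i >= min_len:
                yield seq[max(0, i - context):i], seq[i:j], seq[j:j + context]
            i = j
        else:
            i += 1

for left, run, right in gc_runs("atatagcgcgccgcgtttaa"):
    print(left, run, right)   # atata gcgcgccgcg tttaa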
For this project, Clustal may be overkill -- but I don't know your objectives so I can't be sure. If you're only interested in the GC region, then you can do some simple clustering like so:
Make a database entry for each G/C 8-mer (2^8 = 256 in all).
Take each GC-region and walk it to see which 8-mers it contains.
Tag each GC-region with the 8-mers it contains.
Now, for each 8-mer, you have thousands of sequences which contain it. I'll leave the analysis of the data up to your own objectives.
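A sketch of that walk-and-tag step in Python, with a plain dict standing in for the database and made-up GC-regions:

from collections import defaultdict

index = defaultdict(set)                 # 8-mer -> set of GC-regions containing it
regions = ["gcgcgccgcg", "ccggccggcc"]   # made-up GC-regions

for region in regions:
    for i in range(len(region) - 8 + 1):
        index[region[i:i + 8]].add(region)

for kmer, hits in sorted(index.items()):
    print(kmer, len(hits))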
Your region two, with only two letters, may end up a bit too similar; increasing its length or variability (e.g. allowing more letters) could help.