How to convert MongoDB data into an ARFF file - Node.js

I collect different data sets from users through different forms. I am using the MEAN stack, mongoose and node-weka to analyze the stored data. Weka works with ARFF files, so I first have to convert the data stored in MongoDB into an ARFF file. Does anyone know how to do this? I am a beginner and I haven't found the right documentation.
Here is the beginning of the code in Node.js:
var data = ... //ARFF json format
var options = {
    //'classifier': 'weka.classifiers.bayes.NaiveBayes',
    'classifier': 'weka.classifiers.functions.SMO',
    'params' : ''
};
var testData = {
    outlook : 'sunny',
    windy : 'TRUE'
};
weka.classify(data, testData, options, function (err, result) {
    console.log(result); //{ predicted: 'yes', prediction: '1' }
});

I don't know of a pure-JavaScript solution, only a command-line one (for Linux, Unix and macOS).
In any case, many of Weka's classifiers do expect an ARFF file as input.
You can export your JSON data to CSV, convert the CSV to ARFF on the fly on the command line, and then pipe it to Weka 3.6 (not 3.7).
You can use a bash script to convert CSV to ARFF via a temp file. The special-purpose script weka-cluster demonstrates this. Adapt it to your needs.
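#!/usr/bin/env bash
# All arguments together name the clusterer, e.g. clusterers.SimpleKMeans
ALGO="$@"
IN=$(mktemp --tmpdir weka-cluster-XXXXXXXX).arff
finish () {
    rm -f "$IN"
}
trap finish EXIT
# Read CSV from stdin and store it as a temporary ARFF file
csv2arff > "$IN"
weka filters.unsupervised.attribute.AddCluster -W "weka.${ALGO}" -i "$IN" -o /dev/stdout | arff2csv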
Call this script as:
cat my.csv | weka-cluster clusterers.SimpleKMeans
You can extend this to MongoDB like this:
mymongoquery.sh | json2csv | (more optional filters.e.g csvcut) | weka-cluster clusterers.SimpleKMeans
These command-line tools (but not mongo) are described in more detail in the book "Data Science at the Command Line" by Jeroen Janssens. Check out the GitHub repo for csv2arff and weka-cluster, and for how to install the other tools (csv2arff, arff2csv, csvcut, json2csv).
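For readers unsure what the ARFF file itself has to contain, here is a minimal sketch of the layout (a relation line, one attribute line per field, then comma-separated data rows). It is written in Python purely for illustration and is not part of node-weka or of the tools above; the function name records_to_arff and the sample documents are made up, and it assumes flat records that all share the same keys and whose values contain no spaces, commas or quotes.

```python
def records_to_arff(records, relation="mydata"):
    """Build ARFF text from a list of flat dicts that share the same keys."""
    names = list(records[0].keys())
    lines = [f"@relation {relation}", ""]
    for name in names:
        values = [rec[name] for rec in records]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            # Anything non-numeric becomes a nominal attribute listing its distinct values.
            distinct = sorted({str(v) for v in values})
            lines.append(f"@attribute {name} {{{','.join(distinct)}}}")
    lines += ["", "@data"]
    for rec in records:
        lines.append(",".join(str(rec[name]) for name in names))
    return "\n".join(lines) + "\n"


# Hypothetical documents, shaped like what a MongoDB query might return.
docs = [
    {"outlook": "sunny", "windy": "TRUE", "temperature": 85, "play": "no"},
    {"outlook": "rainy", "windy": "FALSE", "temperature": 70, "play": "yes"},
]
arff_text = records_to_arff(docs)
print(arff_text)
```

The same logic should port to Node.js with little change; real data would additionally need quoting of values that contain spaces or commas, and handling of missing fields.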

Related

How to pass an array as input to awk in node child_process and output a text file

I'm wondering whether it is possible to pass an array as input to an awk command via Node.js child_process.spawn().
Let me explain with an example.
Let's define an array (in my real case the array has a little over 3 million rows):
const myvar = ['val1', 'val2', 'val1', 'val3', 'another', 'etc', 'etc'];
Now, using child_process, I want to pass this variable as input to the awk command, remove duplicates and save the output to a text file.
I'm currently (wrongly) using a template string, thinking that maybe the command would read from the variable:
const { spawn } = require('child_process');
const awk = spawn(
`awk '!a[$0]++' ${myvar} > outputfile.txt`
);
but this is obviously wrong usage, since I get:
awk exits with code 2 (wrong usage)
or also:
Error: spawn E2BIG
I'm stuck on this issue. I know I can pass a simple string variable, but this is different.
Is something like this doable?
N.B.: the array is taken from a text file loaded into the code at an earlier stage. That's why I don't want to read from the file again. Using the text file as input works okay with awk but I'm trying to avoid reading that multiple times.
If I didn't explain clearly, please let me know. Thanks for help!
['val1', 'val2', 'val1', 'val3', 'another', 'etc', 'etc']
————————————————————————————————————
{m/g}awk
'ORS = __[$-_]++ < /[^\47]/ ? "\n" :_' \
FS = '^$' \
RS = '[[][\47]|[\47]([,][ \t]*[\47]|[]][\n]?)'
or these really fringe syntax style :
'ORS = __[$-_]-->-/[^\47]/ ? "\n" : _'
'ORS = -/[^\47]/~--__[$-_] ? "\n" : _'
————————————————————————————————————
val1
val2
val3
another
etc

Python Error: Cannot extract data from dictionary

I have a json output (y) like this below.
{
"WebACL":{
"Name":"aBlockKnownBadInputs-WebAcl",
"Id":"4312a5d0-9878-4feb-a083-09d7a9cfcfbb",
"ARN":"arn:aws:wafv2:us-east-1:100467320728:regional/webacl/aBlockKnownBadInputs-WebAcl/4312a5d0-9878-4feb-a083-09d7a9cfcfbb",
"DefaultAction":{
"Allow":{
}
},
"Description":"",
"Rules":[
{
"Name":"AWS-AWSManagedRulesKnownBadInputsRuleSet",
"Priority":500,
"Statement":{
"ManagedRuleGroupStatement":{
"VendorName":"AWS",
"Name":"AWSManagedRulesKnownBadInputsRuleSet"
}
},
"OverrideAction":{
"None":{
}
},
"VisibilityConfig":{
"SampledRequestsEnabled":true,
"CloudWatchMetricsEnabled":true,
"MetricName":"AWS-AWSManagedRulesKnownBadInputsRuleSet"
}
}
]
}
}
I want to extract "AWS-AWSManagedRulesKnownBadInputsRuleSet" from the section:-
"Name":"AWS-AWSManagedRulesKnownBadInputsRuleSet",
"Priority":500,
"Statement":{
"ManagedRuleGroupStatement":{
"VendorName":"AWS",
"Name":"AWSManagedRulesKnownBadInputsRuleSet"
At the minute my code is returning a key error:
KeyError: 'Rules[].Statement[].ManagedRuleGroupStatement[].Name'
The format of this line is clearly wrong, but I don't know why.
ruleset = y['Rules[].Statement[].ManagedRuleGroupStatement[].Name']
My code block:
respons = client.get_web_acl(Name=(acl),Scope='REGIONAL',Id=(ids))
for y in response['WebACLs']:
ruleset = y['Rules[].Statement[].ManagedRuleGroupStatement[].Name']
Does anyone know what I'm doing wrong here?
UPDATE :- In case anyone else comes up against this problem, I fixed this by doing it a slightly different way.
import json
#Requesting the info from AWS via get_web_acl request
respons = client.get_web_acl(Name=(acl),Scope='REGIONAL',Id=(ids))
#Converting the dict output to a string to make it searchable
result = json.dumps(respons)
#Defining what I want to search for
fullstring = "AWS-AWSManagedRulesKnownBadInputsRuleSet"
#Searching the output & printing the result: if = true / else = false
if fullstring in result:
    print("Found WAF ruleset: AWS-AWSManagedRulesKnownBadInputsRuleSet!")
else:
    print("WAF ruleset not found!")
Also, as part of my research I found a cool thing called jello.
(https://github.com/kellyjonbrazil/jello).
jello is similar to jq in that it processes JSON and JSON Lines data except it uses standard python dict and list syntax.
So, I copied my json into a file called file.json
Ran cat file.json | jello -s to print a grep-able schema by using the -s option
Found the bit I was interested in - in my case the name of the ManagedRuleGroupStatement and ran the following:
cat file.json | jello -s _.WebACL.Rules[0].Statement.ManagedRuleGroupStatement.Name
This gave me exactly what I wanted!
It doesn't work inside a python script but was something new and interesting to learn!
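For completeness, the original KeyError comes from passing the whole path 'Rules[].Statement[]...' as one dictionary key; a plain Python dict has no path syntax, so each level has to be indexed separately. The sketch below assumes the response has exactly the shape of the JSON shown in the question (trimmed here to the relevant fields); the variable names are illustrative only.

```python
# Sample response, trimmed from the JSON shown in the question.
response = {
    "WebACL": {
        "Name": "aBlockKnownBadInputs-WebAcl",
        "Rules": [
            {
                "Name": "AWS-AWSManagedRulesKnownBadInputsRuleSet",
                "Priority": 500,
                "Statement": {
                    "ManagedRuleGroupStatement": {
                        "VendorName": "AWS",
                        "Name": "AWSManagedRulesKnownBadInputsRuleSet",
                    }
                },
            }
        ],
    }
}

# "Rules" is a list, so iterate over it; every other level is a plain dict key.
rule_names = [rule["Name"] for rule in response["WebACL"]["Rules"]]

# Not every rule has to be a managed rule group, so skip those without that statement.
managed_group_names = [
    rule["Statement"]["ManagedRuleGroupStatement"]["Name"]
    for rule in response["WebACL"]["Rules"]
    if rule.get("Statement", {}).get("ManagedRuleGroupStatement")
]

print(rule_names)
print(managed_group_names)
```

Unlike the string search, this works inside a script and returns the actual values, and the path matches the one found with jello (WebACL, then Rules[0], then Statement, and so on).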

How to forward the value of a variable created in the script in Nextflow to a value output channel?

I have a process that generates a value. I want to forward this value into a value output channel, but I can't seem to get it working in one go: I always have to write a file to the output and then define a new channel from the first:
process calculate{
    input:
    file div from json_ch.collect()
    path "metadata.csv" from meta_ch
    output:
    file "dir/file.txt" into inter_ch
    script:
    """
    echo ${div} > alljsons.txt
    mkdir dir
    python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
    """
}
ch = inter_ch.map{file(it).text}
ch.view()
how do I fix this?
thanks!
best, t.
If your script performs a non-trivial calculation, writing the result to a file like you've done is absolutely fine - there's nothing really wrong with this approach. However, since the 'inter_ch' channel already emits files (or paths), you could simply use:
ch = inter_ch.map { it.text }
It's not entirely clear what the objective is here. If the desire is to reduce the number of channels created, consider instead switching to the new DSL 2. This won't let you avoid writing your calculated result to a file, but it might mean you can avoid an intermediary channel, potentially.
On the other hand, if your Python script actually does something rather trivial and can be refactored away, it might be possible to assign a (global) variable (below the script: keyword) such that it can be referenced in your output declaration, like the line x = ... in the example below:
Valid output values are value literals, input value identifiers, variables accessible in the process scope and value expressions. For example:
process foo {
    input:
    file fasta from 'dummy'
    output:
    val x into var_channel
    val 'BB11' into str_channel
    val "${fasta.baseName}.out" into exp_channel
    script:
    x = fasta.name
    """
    cat $x > file
    """
}
Other than that, your options are limited. You might have considered using the env output qualifier, but this just adds some syntactic-sugar to your shell script at runtime, such that an output file is still created:
Contents of test.nf:
process test {
    output:
    env myval into out_ch
    script:
    '''
    myval=$(calc.py)
    '''
}
out_ch.view()
Contents of bin/calc.py (chmod +x):
#!/usr/bin/env python
print('foobarbaz')
Run with:
$ nextflow run test.nf
N E X T F L O W ~ version 21.04.3
Launching `test.nf` [magical_bassi] - revision: ba61633d9d
executor > local (1)
[bf/48815a] process > test [100%] 1 of 1 ✔
foobarbaz
$ cat work/bf/48815aeefecdac110ef464928f0471/.command.sh
#!/bin/bash -ue
myval=$(calc.py)
# capture process environment
set +u
echo myval=$myval > .command.env

MafftCommandline and io.StringIO

I've been trying to use the Mafft alignment tool from Bio.Align.Applications. Currently, I've had success writing my sequence information out to temporary text files that are then read by MafftCommandline(). However, I'd like to avoid redundant steps as much as possible, so I've been trying to write to a memory file instead using io.StringIO(). This is where I've been having problems. I can't get MafftCommandline() to read internal files made by io.StringIO(). I've confirmed that the internal files are compatible with functions such as AlignIO.read(). The following is my test code:
from Bio.Align.Applications import MafftCommandline
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import io
from Bio import AlignIO
sequences1 = ["AGGGGC",
"AGGGC",
"AGGGGGC",
"AGGAGC",
"AGGGGG"]
longest_length = max(len(s) for s in sequences1)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences1] #padded sequences used to test compatibilty with AlignIO
ioSeq = ''
for items in padded_sequences:
    ioSeq += '>unknown\n'
    ioSeq += items + '\n'
newC = io.StringIO(ioSeq)
cLoc = str(newC).strip()
cLocEdit = cLoc[:len(cLoc)] #create string to remove < and >
test1Handle = AlignIO.read(newC, "fasta")
#test1HandleString = AlignIO.read(cLocEdit, "fasta") #fails to interpret cLocEdit string
records = (SeqRecord(Seq(s)) for s in padded_sequences)
SeqIO.write(records, "msa_example.fasta", "fasta")
test1Handle1 = AlignIO.read("msa_example.fasta", "fasta") #alignIO same for both #demonstrates working AlignIO
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
mafft_cline1 = MafftCommandline(mafft_exe, input=cLocEdit) #fails to read string (same as AlignIO)
mafft_cline2 = MafftCommandline(mafft_exe, input=newC)
stdout, stderr = mafft_cline()
print(stdout) #corresponds to MafftCommandline with input file
stdout1, stderr1 = mafft_cline1()
print(stdout1) #corresponds to MafftCommandline with internal file
I get the following error messages:
ApplicationError: Non-zero return code 2 from '/usr/local/bin/mafft <_io.StringIO object at 0x10f439798>', message "/bin/sh: -c: line 0: syntax error near unexpected token `newline'"
I believe this results due to the arrows ('<' and '>') present in the file path.
ApplicationError: Non-zero return code 1 from '/usr/local/bin/mafft "_io.StringIO object at 0x10f439af8"', message '/usr/local/bin/mafft: Cannot open _io.StringIO object at 0x10f439af8.'
Attempting to remove the arrows by converting the file path to a string and indexing resulted in the above error.
Ultimately my goal is to reduce computation time. I hope to accomplish this by calling internal memory instead of writing out to a separate text file. Any advice or feedback regarding my goal is much appreciated. Thanks in advance.
I can't get MafftCommandline() to read internal files made by io.StringIO().
This is not surprising for a couple of reasons:
As you're aware, Biopython doesn't implement Mafft, it simply provides a convenient interface to set up a call to mafft in /usr/local/bin. The mafft executable runs as a separate process that does not have access to your Python program's internal memory, including your StringIO file.
The mafft program only works with an input file, it doesn't even allow stdin as a data source. (Though it does allow stdout as a data sink.) So ultimately, there must be a file in the file system for mafft to open. Thus the need for your temporary file.
Perhaps tempfile.NamedTemporaryFile() or tempfile.mkstemp() might be a reasonable compromise.
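A minimal sketch of the tempfile.NamedTemporaryFile() route, using only the standard library. The helper name write_temp_fasta is made up for illustration, and the actual mafft call is left out so the snippet runs without mafft or Biopython installed; the resulting path is an ordinary file name and would take the place of in_file in the question's code.

```python
import os
import tempfile


def write_temp_fasta(sequences, label="unknown"):
    """Write sequences to a temporary FASTA file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so that an external program can open it by name.
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as handle:
        for seq in sequences:
            handle.write(f">{label}\n{seq}\n")
        return handle.name


sequences1 = ["AGGGGC", "AGGGC", "AGGGGGC", "AGGAGC", "AGGGGG"]
fasta_path = write_temp_fasta(sequences1)
try:
    # Here the path would be handed to the external tool, e.g. as the
    # input= argument of MafftCommandline in the question's code.
    with open(fasta_path) as handle:
        fasta_text = handle.read()
finally:
    os.remove(fasta_path)  # clean up once the external tool has finished
```

This still writes a file, but the operating system picks a unique name in the temp directory and the cleanup is explicit, which is probably as close to an in-memory file as a separate process allows.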

How to modify a perl script to read excel instead of Html files

My first question is:
Is it possible to do this? Right now I have a Perl script which reads HTML files and extracts data to display in another HTML file.
If the answer to the question above is yes, my second question is:
How do I do this?
Sorry to ask so bluntly, but I'm very new to Perl and I have to take on this task, so I'm here for some useful advice or suggestions to guide me through it. I appreciate your help in advance.
Here's a part of the code, since the whole thing is quite long:
$date=localtime();
($TWDAY, $TMTH, $TD1D, $TSE, $TYY) = split(/\s+/, $date);
$TSE =~ s/\://g;
$STAMP=_."$TD1D$TMTH$TYY";
@ServerInfo=();
#----------------------------------------------------------------------------------------------------------
# Read Directory
#----------------------------------------------------------------------------------------------------------
$myDir=getcwd;
#----------------------------------------------------------------------------------------------------------
# INITIALIZE HTML FORMAT
#----------------------------------------------------------------------------------------------------------
&HTML_FORMAT;
#----------------------------------------------------------------------------------------------------------
# REPORT
#----------------------------------------------------------------------------------------------------------
if (! -d "$myDir/report") { mkdir("$myDir/report");};
$REPORTFILE="$myDir/report/checkpack".".htm";
open OUT,">$REPORTFILE" or die "\nCannot open out file $REPORTFILE\n\n";
print OUT "$Tag_Header";
#----------------------------------------------------------------------------------------------------------
sub numSort {
if ($b < $a) { return -1; }
elsif ($a == $b) { return 0;}
elsif ($b > $a) { return 1; }
}
#@ArrayDir = sort numSort @DirArray;
#while (<@ArrayDir>) {
@OutputDir=grep { -f and -T } glob "$myDir/*.htm $myDir/*.html";
#}
#----------------------------------------------------------------------------------------------------------
@ReadLine3=();
$xyxycnt=0;
foreach $InputFile (@OutputDir) { #---- MAIN
$filename=(split /\//, $InputFile) [-1]; print "-"x80 ; print "\nFilename\t:$filename\n";
open IN, "<$InputFile" or die "Cannot open Input file $InputFile\n";
@MyData=();
$DataCnt=0;
@MyLine=();
$MyLineCnt=0;
while (<IN>) {
$LINE=$_;
chomp($LINE);
$LINE=~s/\<br\>/XYXY/ig;
$LINE=~s/\<\/td\>/ \nXYZXYZ\n/ig;
$LINE=~s/\<dirname\>/xxxdirnameyyy/ig;
$LINE=linetrim3($LINE);
$LINE=linetrim($LINE);
$LINE=~s/XYXY/\<br\>/ig;
$LINE=~s/xxxdirnameyyy/&lt dirname &gt/ig;
$LINE=~s/^\s+//ig;
print OUT2 "$LINE\n";
if (defined($LINE)) { $MyData[$DataCnt]="$LINE"; $DataCnt++ ; }
}
close IN;
foreach $ReadFile (@MyData) { #--- Mydata
$MyLineCnt++;
$MyLine[$MyLineCnt]="";
#### FILENAME
$ServerInfo[0]="$filename";
#### IP ADDRESS
if ($ReadFile =~ /Host\/Device Name\:/) {
#print "$ReadFile\n"
($Hostname)=(split /\:|\s+/, $ReadFile)[3]; print "$Hostname\n";
&myServerInfo("$Hostname","1");
}
if ($ReadFile =~ /IP Address\(es\)/) {@ListIP=(); $SwIP=1; $CntIP=0 ; };
#### OPERATING SYSTEM & VERSION
if ($ReadFile =~ /Operating System\:/) {
$SwIP=0;
$OS= (split /\:|\s+/, $ReadFile)[3]; &myServerInfo("$OS","3") ; print "$OS\n";
$OSVer= (split /\:|\s+/, $ReadFile)[-2]; &myServerInfo("$OSVer","4") ; print "$OSVer\n";
};
#### GET IP VALUE
if ($SwIP==1) {
$ReadFile=(split /\:/,$ReadFile) [2];
$ReadFile=~s/[a-z|A-Z]|\(|\)|\// /ig; print "$ReadFile\n";
if ($CntIP==0) {
#$ListIP[$CntIP]=(split /\s+/,$ReadFile) [1];
#ListIP="$ReadFile";
} elsif ($CntIP==1) { print "\n\t\t $ReadFile\n" ; $ListIP[$CntIP]="\n$ReadFile";
} else { print "\t\t $ReadFile\n" ; $ListIP[$CntIP]="\n$ReadFile"; };
$CntIP++;
}
I'm afraid if you don't understand what is going on in this program and you also don't understand how to approach a task like this at all, Stack Overflow might not be the right place to get help.
Let me try to show you the approach I would take with this. I'm assuming there is more code.
First, write down a list of everything you know:
What is the input format of the existing file
Where does the existing file come from now
What is the output format of the existing file
Where does the generated output file go afterwards
What does the new file look like
Where does the new file come from
Use perltidy to indent the inherited code so you can read it better. The default options should be enough.
Read the code, take notes about what pieces do what, add comments
Write a unit test for the desired output format. You can use Test::More. Another useful testing module here is Test::File.
Refactor the part that generated the output format to work with a certain data structure. Use your tests to make sure you don't break it.
Write code to parse the new file into the data structure from the point above. Now you can plug that in and get the expected output.
Refactor the part that takes the old input file from the existing file location to be a function, so you can later switch it for the new one.
Write code to get the new file from the new file location.
Document what you did so the next guy is not in the same situation. Remember that could be you in half a year.
Also add use strict and use warnings while you refactor to catch errors more easily. If stuff breaks because of that, make it work before you continue. Those pragmas tell you what's wrong. The most common one you will encounter is Global symbol "$foo" requires explicit package name. That means you need to put my in front of the first assignment, or declare the variable before.
If you have specific questions, ask them as a new question with a short example. Read how to ask to make sure you will get help on those.
Good luck!
After seeing your comment, I think you want a different input and a different output. In that case, disregard this, throw away the old code and start from scratch. If you don't know enough Perl, get a book like Curtis Poe's Beginning Perl if you already know programming. If not, check out Learning Perl by Randal L. Schwartz.

Resources