Problem when using for loop with scikit-learn decision tree - python-3.x

I'm trying to use the scikit-learn tree module to draw decision trees by generating .dot files with the tree.export_graphviz() function. I want to convert these .dot files into .pdf files using the dot command from a bash script.
My python code:
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn import tree
for i in range(3, 10):
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=i)
    clf = clf.fit(iris.data, iris.target)
    file_name = 'tpsk1-' + str(i) + '.dot'
    tree.export_graphviz(clf, out_file=file_name)
Here I loop i over range(3, 10) to export 7 .dot files. But when I ran my bash script to convert them into PDF files, something weird happened.
My bash script:
for i in 3 4 5 6 7 8 9
do
    dot_file="tpsk1-$i.dot"
    pdf_file="tpsk1-$i.dot"
    dot -Tpdf $dot_file -o $pdf_file
done
The result:
Error: tpsk1-3.dot: syntax error in line 12 near '�S'
Warning: syntax ambiguity - badly delimited number '.0S' in line 12 of tpsk1-3.dot splits into two tokens
Warning: syntax ambiguity - badly delimited number '3r' in line 49 of tpsk1-3.dot splits into two tokens
Error: tpsk1-4.dot: syntax error in line 16 near 'X'
Warning: syntax ambiguity - badly delimited number '3r' in line 56 of tpsk1-4.dot splits into two tokens
Error: tpsk1-5.dot: syntax error in line 20 near 'ػ0'
Error: tpsk1-6.dot: syntax error in line 24 near '`'
Error: tpsk1-7.dot: syntax error in line 28 near '��'
Warning: syntax ambiguity - badly delimited number '1�' in line 31 of tpsk1-7.dot splits into two tokens
Warning: syntax ambiguity - badly delimited number '3r' in line 68 of tpsk1-7.dot splits into two tokens
Error: tpsk1-8.dot: syntax error in line 32 near '��'
Warning: syntax ambiguity - badly delimited number '0�' in line 32 of tpsk1-8.dot splits into two tokens
Warning: syntax ambiguity - badly delimited number '8z' in line 32 of tpsk1-8.dot splits into two tokens
Error: tpsk1-9.dot: syntax error in line 36 near '�Cb'
I then removed the for loop and wrote a single .dot file, and it worked just fine.
My new python script:
from sklearn.datasets import load_iris
iris=load_iris()
from sklearn import tree
clf=tree.DecisionTreeClassifier(max_leaf_nodes=3)
clf=clf.fit(iris.data,iris.target)
file_name = 'tpsk1-3.dot'
tree.export_graphviz(clf,out_file=file_name)
My dot bash command:
dot -Tpdf tpsk1-3.dot -o tpsk1-3.pdf
Can somebody please explain what happened? I think I'm missing something about how for loops work in Python. Thank you very much.

You have the wrong extension in your bash script:
for i in 3 4 5 6 7 8 9
do
    dot_file="tpsk1-$i.dot"
    pdf_file="tpsk1-$i.dot"
    dot -Tpdf $dot_file -o $pdf_file
done
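It should be pdf_file="tpsk1-$i.pdf". With the .dot extension, the output path is the same as the input path, so dot writes its PDF output over the file it is reading, which would explain the binary garbage in the error messages. The Python for loop is fine; your single-file test worked because that command wrote to tpsk1-3.pdf.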
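As a rough sketch of the same fix driven from Python instead of bash, the output name can be derived from the input name so the two can never coincide. The helper names dot_command and convert_all are hypothetical, and the sketch assumes the dot executable is on PATH and the tpsk1-N.dot files sit in the current directory.

```python
import shutil
import subprocess
from pathlib import Path


def dot_command(dot_file):
    """Build the dot invocation for one .dot file (hypothetical helper)."""
    # Derive the output name by swapping the extension for .pdf.
    pdf_file = str(Path(dot_file).with_suffix(".pdf"))
    # Refuse to run if the output would overwrite the input.
    if pdf_file == dot_file:
        raise ValueError(f"output would overwrite input: {dot_file}")
    return ["dot", "-Tpdf", dot_file, "-o", pdf_file]


def convert_all(numbers=range(3, 10)):
    """Convert tpsk1-3.dot .. tpsk1-9.dot, skipping files that are missing."""
    for i in numbers:
        cmd = dot_command(f"tpsk1-{i}.dot")
        if shutil.which("dot") and Path(cmd[2]).exists():
            subprocess.run(cmd, check=True)
        else:
            print("skipping", cmd[2])
```

Calling convert_all() after the export loop would then produce one PDF per .dot file, assuming Graphviz is installed.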


Compile python file that references xlsx files without compiling the xlsx files

I am trying to compile my .py file but end up with an error.
The script reads from 2 Excel files and writes back to 1.
When compiling the .py file I get the error FileNotFoundError: [Errno 2] No such file or directory: 'file.xlsx', even though the file is there and is found when I execute the .py file directly. I can't seem to fix this.
When I change the path from relative to full, this error pops up:
workbook = load_workbook(filename="C:\Users\userxdx\Desktop\Excellsupport\file.xlsx")
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
To compile I use py2exe (for Windows).
What am I missing here?
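This is not working because \ is an escape character in ordinary string literals. For example, "\n" creates a new line in a string. In your path, \U (from \Users) starts an eight-hex-digit Unicode escape of the form \UXXXXXXXX, and "sers\use" is not valid hex, hence the "truncated \UXXXXXXXX escape" error. To make Python ignore escape sequences, place an r at the beginning of the string like so: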
filename=r"C:\Users\userxdx\Desktop\Excellsupport\file.xlsx"
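A small sketch of the equivalent spellings may help. The path is the one from the question; PureWindowsPath is used only so the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept as-is.
raw_path = r"C:\Users\userxdx\Desktop\Excellsupport\file.xlsx"

# Ordinary string: every backslash has to be doubled.
escaped_path = "C:\\Users\\userxdx\\Desktop\\Excellsupport\\file.xlsx"

# Forward slashes also avoid escape sequences entirely.
forward_path = PureWindowsPath("C:/Users/userxdx/Desktop/Excellsupport/file.xlsx")

print(raw_path == escaped_path)
print(str(forward_path) == raw_path)
```

All three spell the same Windows path, so either comparison should print True.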

importing script in python terminal

Trying to import a python script file into terminal (invalid syntax)
Hi, I am trying to import a Python script file in my Python terminal; however, it gives me an error when the script name starts with a number or includes certain characters like _ (I am still a beginner).
This works fine:
>>> import a00
Bright Green
However, this gives me invalid syntax:
>>> import 00a
File "<stdin>", line 1
import 00a
^
SyntaxError: invalid syntax
Or this:
>>> import 00_a
File "<stdin>", line 1
import 00_a
^
SyntaxError: invalid token
Python module (meaning a .py file) names follow the same naming rules as variables, so they can only start with an underscore or a letter. After the first character, you can also use digits.
(Dashes are allowed in file names, but they should be avoided: a module whose name contains a dash cannot be loaded with a plain import statement and needs something like importlib.import_module() instead.)
The preferred convention for module names is all lowercase characters, with underscores if needed.
You can read more from the Python PEP 8 Style Guide.
Yes, names cannot begin with a digit, so names like 0abcd or 1nfi are invalid, while a00 is fine. The underscore in 00_a is not the problem; the leading digit is.
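The naming rule can be checked from Python itself. The sketch below uses str.isidentifier() and the keyword module; the helper name is_valid_module_name is hypothetical. It also shows, as an aside, that importlib.import_module() takes the name as a string and can therefore load a file such as 00a.py, although renaming the file is the better fix. The temporary 00a.py is created only for the demonstration.

```python
import importlib
import keyword
import sys
import tempfile
from pathlib import Path


def is_valid_module_name(name):
    """True if `name` can be used in a plain `import name` statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["a00", "_a00", "00a", "00_a", "my-module"]:
    print(candidate, is_valid_module_name(candidate))

# Workaround for a name the import statement rejects (demo file only).
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "00a.py").write_text('COLOR = "Bright Green"\n')
    sys.path.insert(0, folder)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module("00a")
        loaded_color = module.COLOR
    finally:
        sys.path.remove(folder)

print(loaded_color)
```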

yaml syntax error for ansible base64 multi line variable

Ansible Version: 2.1.2.0
So I have a yaml file with a multi line variable that's from a binary file converted to base 64.
My variable file: self-cert.yml
selfcert: |
MIIKCAIBAzCCCcIGCSqGSIb3DQEHAaCCCbMEggmvMIIJqzCCBWgGCSqGSIb3DQEHAaCCBVkEggVV
MIIFUTCCBU0GCyqGSIb3DQEMCgECoIIE+jCCBPYwKAYKKoZIhvcNAQwBAzAaBBQFa98IY7UgblDK
qGwMjTIQCK+3DwICBAAEggTIvA/VFm3j3oSN6cknp5qFyUxXAI5TxURnyx8UVRm8UfMcA0LHlh+z
06ztcwApIrxMSV26ezu0p1FrHInpbABNuO0rlk4XlQwTkLynUyg58iBwK7IyV5SqT2UC8djaOiMN
b9ViC3yn7SrRdS3MmCQznu6dScRIHbhG46yZNJrzrJh038X2KAPpS/LfC9DJBjaEzkZY8BwyARYe
When I try to run my playbook that includes this variable, I get:
ERROR! Syntax Error while loading YAML.
The error appears to have been in '/home/ansible/projects/install-cert/self-cert.yml': line 3, column 1, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
MIIKCAIBAzCCCcIGCSqGSIb3DQEHAaCCCbMEggmvMIIJqzCCBWgGCSqGSIb3DQEHAaCCBVkEggVV
MIIFUTCCBU0GCyqGSIb3DQEMCgECoIIE+jCCBPYwKAYKKoZIhvcNAQwBAzAaBBQFa98IY7UgblDK
^ here
Any idea what's wrong? I've tried changing | to >, which didn't work, and I've also tried indenting the whole base64 output.
So it turns out you do need to indent the lines of the multi-line variable. My original indentation was an actual tab instead of spaces (stupid Sublime), and YAML does not allow tabs for indentation, so the syntax check failed. Using actual spaces made everything work.
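Tab indentation is hard to spot by eye, so a quick check can help before handing the file to Ansible. This is a minimal sketch using only the standard library; the function name tab_indented_lines is hypothetical, and the base64 text is shortened to the first characters of the value from the question.

```python
def tab_indented_lines(text):
    """Return the 1-based numbers of lines whose indentation contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


# The block scalar body indented with a tab (rejected by YAML parsers).
broken = "selfcert: |\n\tMIIKCAIBAzCC\n\tMIIFUTCCBU0G\n"

# The same content indented with two spaces.
fixed = "selfcert: |\n  MIIKCAIBAzCC\n  MIIFUTCCBU0G\n"

print(tab_indented_lines(broken))
print(tab_indented_lines(fixed))
```

For a real file, the text could be read with open("self-cert.yml").read() and passed to the same function.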

bash: Execute a string as a command

See my previous question on assembling a specific string here.
I was given an answer to that question, but unfortunately the information didn't actually help me accomplish what I was trying to achieve.
Using the info from that post, I have been able to assemble the following set of strings: gnuplot -e "filename='output_N.csv'" 'plot.p' where N is replaced by the string representation of an integer.
The following loop will explain: (Actually, there is probably a better way of doing this loop, which you may want to point out - hopefully the following code won't upset too many people...)
#!/bin/bash
n=0
for f in output_*.csv
do
    FILE="\"filename='output_"$n".csv'\""
    SCRIPT="'plot.p'"
    COMMAND="gnuplot -e $FILE $SCRIPT"
    $COMMAND
    n=$(($n+1))
done
Unfortunately this didn't work... gnuplot does run, but gives the following error message:
"filename='output_0.csv'"
^
line 0: invalid command
"filename='output_1.csv'"
^
line 0: invalid command
"filename='output_2.csv'"
^
line 0: invalid command
"filename='output_3.csv'"
^
line 0: invalid command
...
So, as I said before, I'm no expert in bash. My guess is that something isn't being interpreted correctly - either something is being interpreted as a string where it shouldn't or it is not being interpreted as a string where it should? (Just a guess?)
How can I fix this problem?
The first few (relevant) line of my gnuplot script are the following:
(Note the use of the variable filename which was entered as a command line argument. See this link.)
fit f(x) filename using 1:4:9 via b,c,e

plot filename every N_STEPS using 1:4:9 with yerrorbars title "RK45 Data", f(x) title "Landau Model"
Easy fix: I made a mistake with the quotation marks ("").
When the command gnuplot -e "filename='output_0.csv'" 'plot.p' is typed directly into the terminal, bash uses the double quotes to group the text filename='output_0.csv' into a single argument and then removes them before gnuplot ever sees it. When the command is assembled in a variable and run as $COMMAND, bash does not interpret quotes inside the expanded value again, so the escaped double quotes reach gnuplot as literal characters, and gnuplot rejects "filename='output_0.csv'" as an invalid command. So the quotation marks are required when entering the command by hand, but must NOT be included when assembling the string beforehand.
So the corrected version of the above code is:
#!/bin/bash
n=0
for f in output_*.csv
do
    FILE="filename='output_"$n".csv'"
    SCRIPT='plot.p'
    COMMAND="gnuplot -e $FILE $SCRIPT"
    $COMMAND
    n=$(($n+1))
done
That is now corrected and working. Note the removal of the escaped double quotes.
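The difference between the two cases can be illustrated in Python, as a rough model rather than a description of bash internals. shlex.split() parses a command line the way a POSIX shell parses typed input (quotes group words and are then removed), while a plain str.split() approximates what happens to an unquoted variable expansion (splitting on whitespace only, quotes left in place). The helper gnuplot_args is hypothetical and builds the argument list directly, which sidesteps quoting altogether.

```python
import shlex

typed = """gnuplot -e "filename='output_0.csv'" 'plot.p'"""

# Typed at the prompt: quotes group the words and are then removed.
as_typed = shlex.split(typed)

# Expanded from a variable: only whitespace splitting, quotes stay literal.
as_expanded = typed.split()

print(as_typed)
print(as_expanded)


def gnuplot_args(n, script="plot.p"):
    """Argument list for one gnuplot run (hypothetical helper)."""
    return ["gnuplot", "-e", f"filename='output_{n}.csv'", script]


print(gnuplot_args(0) == as_typed)
```

If the loop were driven from Python, each list could be passed to subprocess.run(gnuplot_args(n), check=True), assuming gnuplot is installed.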

Extracting text information using rapidminer

I have a list of text data from which I want to extract certain portions. I am currently using a regular expression to extract the data I want, but it's starting to get very complicated because each record is slightly different. Is there a way to use Rapidminer to "learn" a regular expression based on some typical examples?
For example, for each of the following records I want to extract the text 24 and 18 into two new attributes:
word 24 on line 18
Wrd 24 of Ln 18
Line 18, Word 24
Word 24 comes after word 22 on line 18 (not line 19)
I have watched all the text processing videos, but none of them show how to do this sort of thing, and I don't really know where to start. Can anyone suggest a way of doing this other than manually creating regular expressions?
The TXR language has a straightforward way to express pattern matching variants without cryptic regular expressions:
Here is your data file:
$ cat 13249396.dat
word 24 on line 18
Wrd 24 of Ln 18
Line 18, Word 24
Word 24 comes after word 22 on line 18 (not line 19)
Here is the txr script:
@(collect)
@  (some)
word @wd on line @ln
@  (or)
Wrd @wd of Ln @ln
@  (or)
Line @ln, Word @wd
@  (or)
Word @wd comes after word @nil on line @ln (@(skip)
@  (end)
@(end)
@(output)
@  (repeat)
@wd:@ln
@  (end)
@(end)
Test run:
$ txr 13249396.txr 13249396.dat
24:18
24:18
24:18
24:18
The script was developed by taking the cases from the sample file and replacing a few things by bits of special syntax.
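For comparison, the same "one template per variant" idea can be sketched in Python with a list of regular expressions, one per sample record, each with named groups for the two values. This uses neither RapidMiner nor TXR and does not learn anything automatically; the function name extract and the pattern list are assumptions made for illustration. Each pattern must match the whole line, so the fourth record is not mistaken for the first.

```python
import re

# One pattern per record shape, written by copying a sample line and
# replacing the variable parts with named groups.
PATTERNS = [
    re.compile(r"word (?P<wd>\d+) on line (?P<ln>\d+)"),
    re.compile(r"Wrd (?P<wd>\d+) of Ln (?P<ln>\d+)"),
    re.compile(r"Line (?P<ln>\d+), Word (?P<wd>\d+)"),
    re.compile(r"Word (?P<wd>\d+) comes after word \d+ on line (?P<ln>\d+) \(.*"),
]


def extract(lines):
    """Return (word, line) pairs for every record that matches a pattern."""
    results = []
    for text in lines:
        for pattern in PATTERNS:
            match = pattern.fullmatch(text)
            if match:
                results.append((match.group("wd"), match.group("ln")))
                break
    return results


records = [
    "word 24 on line 18",
    "Wrd 24 of Ln 18",
    "Line 18, Word 24",
    "Word 24 comes after word 22 on line 18 (not line 19)",
]

for wd, ln in extract(records):
    print(f"{wd}:{ln}")
```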
