Plotting a column of precision floating point values - j

I have a sequence of data that I have modified to the following:
load 'tables/csv'
load 'graphics/plot'
x =: readcsv 'table_ctl.csv'
dat =: 4 {::|:x
dat
Here dat takes the column at index 4 (the fifth column) of the transposed table. Below is a sample of the first five values of that column.
13.5598 13.6815 14.027 14.132 14.0104
However upon running:
plot dat
I get the following error:
|option not found: 13.5598: signal
| signal'option not found: ',j
Is this error due to the precision of the floating point values?
Thank you.

You're getting this error because you're passing a list of boxes to plot. plot expects some of those boxes to contain the data to plot and others to contain control options, and 13.5598 is not a valid plot option.
fread 'table_ctl.csv'
a,b,0,1,13.5598
a,b,0,1,13.6815
a,b,0,1,14.027
a,b,0,1,14.132
a,b,0,1,14.0104
4 {::|: readcsv 'table_ctl.csv'
┌───────┬───────┬──────┬──────┬───────┐
│13.5598│13.6815│14.027│14.132│14.0104│
└───────┴───────┴──────┴──────┴───────┘
You were probably thinking that {:: automatically unboxes, but it does so only when the path you give it designates a single box. See the text at the top of Fetch. The other problem is that the contents of these boxes are strings, not floats:
$ > 4 {::|: readcsv 'table_ctl.csv'
5 7
|."1 > 4 {::|: readcsv 'table_ctl.csv'
8955.31
5186.31
720.41
231.41
4010.41
So, to plot your numbers: plot > makenum 4 {::|: readcsv 'table_ctl.csv' which starts with the list of boxes, then turns each box into a box of a float, then unboxes the list and plots it. makenum comes with readcsv and is like a smart ". each in this case, as it would leave non-numeric boxes alone.
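The same two-step idea (take the column, then convert text to numbers) can be sketched outside J. The following is a minimal Python illustration, not part of the original answer, using the five sample rows from the fread listing; the variable names are made up for the example.

```python
import csv
import io

# Sample rows copied from the table_ctl.csv listing in the answer.
raw = (
    "a,b,0,1,13.5598\n"
    "a,b,0,1,13.6815\n"
    "a,b,0,1,14.027\n"
    "a,b,0,1,14.132\n"
    "a,b,0,1,14.0104\n"
)

rows = list(csv.reader(io.StringIO(raw)))

# Index 4 selects the fifth field of each row; a CSV reader returns it as text.
column = [row[4] for row in rows]

# Only after an explicit conversion are the values numeric and ready to plot.
values = [float(text) for text in column]

print(column)
print(values)
```

As with makenum in J, the conversion is a separate step from selecting the column.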
There's a bit more to set up, but jd might also work for this:
fread 'table_ctl.cdefs'
1 label byte 1
2 name varbyte
3 enabled boolean
4 weight int
5 score float
options , LF NO \ 0 iso8601-char
load 'data/jd'
!!! Jd key: non-commercial use only!
jdwelcome_jd_ NB. run this sentence for important information
jdadminnew'temp'
CSVFOLDER=:'/path/to/csv/directory'
jd'csvrd table_ctl.csv data'
jd'info schema'
┌─────┬───────┬───────┬─────┐
│table│column │type │shape│
├─────┼───────┼───────┼─────┤
│data │label │byte │1 │
│data │name │varbyte│_ │
│data │enabled│boolean│_ │
│data │weight │int │_ │
│data │score │float │_ │
└─────┴───────┴───────┴─────┘
jd'get data score'
13.5598 13.6815 14.027 14.132 14.0104

Related

Remove repeating values from X axis label in Altair

I am having trouble with Altair repeating X axis label values.
Data:
rule_abbreaviation flagged_claim bill_month
0 CONCIDPROC 1 Apr2022
1 CONTUSMAT1 1 Apr2022
2 COVID05 1 Jun2021
3 FILTROTUB2 1 Sep2021
4 MEPIARTRO1 1 Mar2022
# Code to generate Altair bar chart
bar = alt.Chart(Data).mark_bar().encode(
    x=alt.X('flagged_claim:Q', axis=alt.Axis(title='Flagged Claims', format=',.0f'), stack='zero'),
    y=alt.Y('rule_abbreaviation:N', axis=alt.Axis(title='Component Abbreviation'), sort=alt.SortField(field=measure, order='descending')),
    tooltip=[alt.Tooltip('max(ClaimRuleName):N', title='Claim Component'), alt.Tooltip('flagged_claim:Q', title='Flagged Claims', format=',.0f')],
    color=alt.Color('bill_month', legend=None)
).properties(
    width=485,
    title=alt.TitleParams(text='Bottom Components', font='Arial', fontSize=16, color='#000080')
).interactive()
The X axis labels generated by this chart contain repeated 0s and 1s.
Image of Visualization: https://i.stack.imgur.com/0XdWB.png
This happens because you have format=',.0f', which tells Altair to show 0 decimals in the axis labels, so ticks at fractional positions are all rounded to 0 or 1. Remove it or change it to ',.1f' to see decimals in the labels. In general, a good way to troubleshoot problems like this is to remove one part of your code at a time to identify which part is causing the unexpected behavior.
To reduce the number of ticks you can use alt.Axis(title='Flagged Claims', format='d', tickCount=1) or alt.Axis(title='Flagged Claims', format='d', values=[0, 1]). See also Changing Number of y-axis Ticks in Altair
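To see why a zero-decimal format yields repeated labels, here is a small Python sketch. It assumes hypothetical tick positions at steps of 0.2 between 0 and 1 (the real positions are chosen by the plotting library) and uses Python's own format mini-language, which should round these particular values the same way.

```python
# Hypothetical tick positions between 0 and 1; the actual ones are picked by the plotting library.
ticks = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

# With zero decimals, distinct ticks collapse onto the same label.
labels_0f = [format(tick, ",.0f") for tick in ticks]

# With one decimal, every tick keeps a distinct label.
labels_1f = [format(tick, ",.1f") for tick in ticks]

print(labels_0f)
print(labels_1f)
```

The first list contains only the labels 0 and 1, each repeated, which matches the symptom described in the question.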

How to draw a horizontal line at the large y-axis integer?

For the following data.dat file:
08:01:59 451206975237005878
08:04:07 451207335040839108
08:05:56 451207643872368805
08:07:49 451207961547842270
08:09:56 451208317883903787
08:10:12 451208364811411904
08:14:09 451209030026853864
08:16:19 451209395116787156
08:17:14 451209552481002386
08:20:22 451210080432357203
08:25:36 451210963309583903
08:30:23 451211772783766177
08:34:04 451212394854723707
08:35:53 451212702239472024
08:48:46 451214876715294857
08:49:56 451215072475511660
08:51:24 451215321890488523
08:52:39 451215533925588479
08:52:42 451215542324801784
08:54:30 451215845971562410
08:55:08 451215951262906948
08:58:30 451216519498960432
I'd like to draw a horizontal line at the specific level (e.g. 451211772783766177). However, the number is too large to process.
Here is my attempt (based on this post):
$ gnuplot -p -e 'set arrow from 451211772783766177 to 451211772783766177; plot "data.dat" using 2 with lines'
which gives the following error:
line 0: warning: integer overflow; changing to floating point
How should I proceed?
I would treat your large number as a constant function and plot it right after your data. Writing it in exponential notation, X.XE+YY = X.X times 10 to the power +YY (more info on format specifiers here), also takes care of the error:
plot "data.dat" using 2 with lines, 4.51211772783766177E17 with lines
Let me know if this works!
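As a rough check on how much is lost when the level is written as a floating-point literal, the following Python sketch (an illustration, not part of the original answer) compares the integer with its nearest double. It also shows that the value is far outside the signed 32-bit range; whether a particular gnuplot build uses 32-bit or 64-bit integers is an assumption here.

```python
level = 451211772783766177

# The value does not fit in a signed 32-bit integer, but it does fit in 64 bits.
fits_int32 = -2**31 <= level <= 2**31 - 1
fits_int64 = -2**63 <= level <= 2**63 - 1

# Converting to a double rounds to the nearest representable value.
as_float = float(level)
error = abs(int(as_float) - level)

# The exponential literal used in the plot command parses to the same double.
from_literal = float("4.51211772783766177E17")

print(fits_int32, fits_int64)
print(error)  # tiny compared with the roughly 1e13 spread of the data
print(from_literal == as_float)
```

Doubles near 4.5e17 are spaced 64 apart, so the rounding error is at most 32, which is invisible at the scale of this plot.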

How do I read an SEC txt file into a pandas DataFrame?

I am trying to use SEC (U.S. Securities and Exchange Commission) data. The SEC provides useful data in txt format. I am using the Financial Statement Data Sets for the second quarter of 2017. You can find the data I use here.
I tried to read the txt files into a pandas DataFrame in the following ways:
sub = pd.read_fwf('sub.txt')
sub_1 = pd.read_csv('sub.txt')
I get no error with using Pandas' read_fwf function - but the output is utter rubbish. Here is the head of the dataframe:
adsh cik name sic countryba stprba cityba zipba bas1 bas2 baph countryma stprma cityma zipma mas1 mas2 countryinc stprinc ein former changed afs wksi fye form period fy fp filed accepted prevrpt detail instance nciks aciks Unnamed: 1
0 0000002178-17-000038\t2178\tADAMS RESOURCES & ... NaN
1 0000002488-17-000107\t2488\tADVANCED MICRO DEV... NaN
I do get an error when using read_csv: Error tokenizing data. C error: Expected 2 fields in line 7, saw 3
Any ideas on how to read the data into a pandas DataFrame?
It looks like the files are tab separated - that's why you're seeing \t in the results. pandas read_csv defaults to comma separated values, so you have to change the separator. This is controlled by the sep parameter. In addition, you will need to provide the proper encoding (errors are thrown when trying to read the num, pre, and tag files). Generally ISO-8859-1 is a good choice.
#import pandas
import pandas as pd
#read in the .txt file and choose a separator and encoding standard
df = pd.read_csv('sub.txt', sep='\t', encoding='ISO-8859-1')
#output the results
print(df)
adsh cik name \
0 0000002178-17-000038 2178 ADAMS RESOURCES & ENERGY, INC.
1 0000002488-17-000107 2488 ADVANCED MICRO DEVICES INC
2 0000002969-17-000019 2969 AIR PRODUCTS & CHEMICALS INC /DE/
3 0000002969-17-000024 2969 AIR PRODUCTS & CHEMICALS INC /DE/
4 0000003499-17-000010 3499 ALEXANDERS INC
5 0000003545-17-000043 3545 ALICO INC
6 0000003570-17-000073 3570 CHENIERE ENERGY INC
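To illustrate why the choice of separator matters, here is a minimal sketch using Python's standard csv module rather than pandas, so it runs without extra dependencies. The three sample lines are modeled on the first columns of sub.txt as shown in the output above; the helper name field_counts is made up for this example.

```python
import csv
import io

# Three tab-separated lines modeled on sub.txt (only the first three columns are used here).
sample = (
    "adsh\tcik\tname\n"
    "0000002178-17-000038\t2178\tADAMS RESOURCES & ENERGY, INC.\n"
    "0000002488-17-000107\t2488\tADVANCED MICRO DEVICES INC\n"
)

def field_counts(text, delimiter):
    """Return the number of fields found on each line for the given delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]

comma_counts = field_counts(sample, ",")
tab_counts = field_counts(sample, "\t")

print(comma_counts)  # uneven: commas only appear inside some company names
print(tab_counts)    # even: every line has the same number of fields
```

Splitting on commas gives a different field count from line to line, which is the kind of inconsistency behind a tokenizing error such as "Expected 2 fields in line 7, saw 3". Splitting on tabs gives the same count on every line.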

svm-train output file has fewer lines than the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 No. features and 2 No. classes [0,1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), this has 12 fewer lines than that of the input file. The line count here shows 450, although there are obviously also 9 lines at the beginning showing the various parameters generated
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to svm and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Updated.........
I now believe that in generating the model, it has removed lines whereby the labels and all the parameters are exactly the same.
To explain............... My input is a set of miRNAs which have been classified as 1 and 0 depending on their involvement in a particular process or not (i.e 1=Yes & 0=No). The input file looks something like.......
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Whereby, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why the output model would do this and how I can get around this (whilst using the same features)?
Whilst some of the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file has no feature for the miRNA name, which would clearly distinguish each line. In terms of the features used (i.e. nucleotide percentage content), however, some of the miRNAs have exactly the same percentage content of A, U, G and C. They are therefore treated as duplicates and removed from the output model even though they are not duplicates, hence the output model has fewer lines.
the format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like the following (see the first two lines, which appear identical even though each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
My first observation is that your input file should be formatted properly. The libsvm code doesn't look for exactly 4 features; it identifies them by the string literals you have provided to separate the features from the labels. I suggest manually converting your input file to create the desired input argument.
Try the following code in python to run
Requirements - h5py, if your input is from matlab. (.mat file)
pip install h5py
import h5py

# Read the training labels (h5py can open MATLAB v7.3 .mat files, which are HDF5)
f = h5py.File('traininglabel.mat', 'r')  # give label.mat file for training
for name, data in f.items():
    labels = data[()][0]  # first row of the dataset (data.value[0] in older h5py)
f.close()

# Convert labels such as 0.0 / 1.0 to the strings '0' / '1'
trainlabels = []
for value in labels:
    label = str(value)
    if label == '0.0':
        label = '0'
    if label == '1.0':
        label = '1'
    trainlabels.append(label)
    print(label)

# Read the feature matrix
f = h5py.File('training_features.mat', 'r')  # give features here
for name, data in f.items():
    features = data[()]
f.close()

# Write one line per sample: the label followed by its feature values
out = open('traindata.txt', 'w')
for i in range(0, 1000):  # no of training samples in file features.mat
    out.write(trainlabels[i])
    out.write(' ')
    for j in range(0, 49):  # feature values of sample i
        out.write(str(features[j][i]))
        out.write(' ')
    out.write('\n')
out.close()
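Note that the script above writes bare feature values, whereas the input shown in the question uses index:value pairs. A minimal sketch of producing lines in that shape follows; the helper name to_libsvm_line is hypothetical, and the two samples are copied from the question's listing.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices."""
    pairs = ["{}:{}".format(index, value)
             for index, value in enumerate(features, start=1)]
    return "{} {}".format(label, " ".join(pairs))

# Two samples taken from the question: (label, [A, U, G, C percentage content]).
samples = [
    (1, [23, 36, 23, 18]),
    (0, [36, 32, 5, 27]),
]

lines = [to_libsvm_line(label, features) for label, features in samples]
print("\n".join(lines))
```

Each returned string can be written to the training file followed by a newline.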

Gnuplot 5.0 patchlevel 4 - passing column numbers in a macro

I have a data file (see below) with a dozen columns, and I am only interested in plotting two of them (say, 5 and 10) when the values in column 1 lie within a given interval. To do so, I have defined:
inter(min,max,var,colx)=(min<=column(var)&&column(var)<=max?column(colx):NaN)
Everything works as expected using plot 'data.dat' u (inter(0.25,0.5,1,5)):10 which plots columns 5 and 10 over the [0.25:0.5] interval of values in column 1.
As I need to plot various pairs of columns over various intervals, I have created a file, PlotInterval.p, containing
inter(min,max,var,colx)=(min<=column(var)&&column(var)<=max?column(colx):NaN)
plot ARG1 u (inter(ARG2,ARG3,ARG4,ARG5)):ARG6
and when I call it with call 'PlotInterval.p' 'data.dat' 0.25 0.5 1 5 10 I get the error message:
gnuplot> call 'PlotInterval.p' 'data.dat' 0.25 0.5 1 5 10
"PlotInterval.p", line 3: warning: no column with header "1"
"PlotInterval.p", line 3: warning: partial match against column 6 header "1.451433e-005"
gnuplot> plot ARG1 u (inter(ARG2,ARG3,ARG4,ARG5)):ARG6
^
"PlotInterval.p", line 3: x range is invalid
It appears the column numbers are not passed properly (the min and max values of the interval are passed properly).
Here are the first lines of data.dat:
0.000000e+000 -1.577475e+000 -7.175042e+000 2.764545e-005 -5.966045e+000 1.451433e-005 -4.665347e+000 -1.412159e-005 6.154827e+000 0.000000e+000 0.000000e+000 3.100275e+002 0.000000e+000
2.500000e-003 4.346526e+000 -1.305610e+001 3.170804e-005 -5.790276e+000 1.632860e-005 -4.574010e+000 -1.459951e-005 6.069773e+000 -1.521847e+000 -1.521847e+000 3.009973e+002 0.000000e+000
5.000000e-003 1.055312e+001 -1.861278e+001 3.085889e-005 -5.604992e+000 1.797386e-005 -4.472427e+000 -1.651171e-005 5.977640e+000 -7.909049e+000 -7.909049e+000 3.029022e+002 0.000000e+000
7.500000e-003 1.676089e+001 -2.476250e+001 3.417608e-005 -5.412398e+000 2.195262e-005 -4.354189e+000 -1.823193e-005 5.874751e+000 -4.333744e+000 -4.333744e+000 2.982168e+002 0.000000e+000
1.000000e-002 2.276874e+001 -3.064776e+001 3.607515e-005 -5.204357e+000 2.585798e-005 -4.212604e+000 -1.948774e-005 5.763049e+000 -9.444781e+000 -9.444781e+000 2.864735e+002 0.000000e+000
1.250000e-002 2.901897e+001 -3.670245e+001 3.681956e-005 -4.988488e+000 2.942617e-005 -4.048886e+000 -2.254946e-005 5.638561e+000 -1.512790e+001 -1.512790e+001 2.852074e+002 0.000000e+000
1.500000e-002 3.479634e+001 -4.301166e+001 4.146322e-005 -4.756663e+000 3.338716e-005 -3.862872e+000 -2.427187e-005 5.499905e+000 -1.618025e+001 -1.618025e+001 2.797585e+002 0.000000e+000
1.750000e-002 4.052957e+001 -4.899462e+001 4.416380e-005 -4.503088e+000 3.794105e-005 -3.651641e+000 -2.608256e-005 5.350786e+000 -2.219509e+001 -2.219509e+001 2.736614e+002 0.000000e+000
2.000000e-002 4.657926e+001 -5.503798e+001 4.764674e-005 -4.231202e+000 4.255615e-005 -3.413258e+000 -2.911828e-005 5.187315e+000 -2.519971e+001 -2.519971e+001 2.689015e+002 0.000000e+000
Am I missing something? How can I get the column numbers to be passed? Is there a workaround? Thanks a lot.
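The variables ARG1, ARG2, etc. are string variables, and column behaves differently for string and integer arguments: given a string, it looks for a column with that header, which is what the warnings above report. So you must explicitly cast the values passed to column to integers: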
inter(min,max,var,colx)=(min<=column(int(var))&&column(int(var))<=max?column(int(colx)):NaN)
plot ARG1 u (inter(ARG2,ARG3,ARG4,ARG5)):ARG6
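For readers who want to check the filter logic outside gnuplot, here is a rough Python equivalent of inter. It is only an illustration under assumptions: the arguments arrive as strings, as they would from a command line, the interval bounds are chosen to match the sample rows, and the rows are the first five columns of the first three lines of data.dat.

```python
import math

def inter(row, lo, hi, var, colx):
    """Return column colx of row if column var lies in [lo, hi], else NaN (1-based columns)."""
    return row[colx - 1] if lo <= row[var - 1] <= hi else math.nan

# Arguments as they would arrive from a command line: all strings.
args = ["0.0025", "0.005", "1", "5"]
lo, hi = float(args[0]), float(args[1])
var, colx = int(args[2]), int(args[3])  # explicit cast, like int() in the gnuplot fix

# First five columns of the first three rows of data.dat.
rows = [
    [0.0, -1.577475, -7.175042, 2.764545e-05, -5.966045],
    [0.0025, 4.346526, -13.05610, 3.170804e-05, -5.790276],
    [0.005, 10.55312, -18.61278, 3.085889e-05, -5.604992],
]

selected = [inter(row, lo, hi, var, colx) for row in rows]
print(selected)
```

Rows whose first column falls outside the interval yield NaN, which is how the gnuplot function drops those points from the plot.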
