bar chart with 2 hierarchies in x-axis - python-3.x

I have a dataframe that looks like this
SP1-SP2 P1-P2 gmm Fowlkes Mallows kmeans Fowlkes Mallows
0 AFR-AMR ACB-CLM 0.883981 0.973784
1 AFR-AMR CLM-LWK 0.630063 0.649272
2 AFR-AMR ESN-MXL 0.944129 0.974126
3 AFR-EAS KHV-MSL 0.916021 0.960642
4 AFR-EUR CEU-ACB 0.892367 0.911122
5 AFR-EUR FIN-LWK 0.875518 0.886502
6 AFR-EUR GWD-GBR 0.934915 0.963250
7 AFR-EUR LWK-IBS 0.943654 0.974227
8 AFR-SAS YRI-PJL 0.646557 0.517052
9 AMR-AMR PEL-CLM 0.993127 0.996963
10 AMR-EAS CHS-MXL 0.886552 0.924213
And I wanted to create a chart with two x-axes, and I didn't find a way to do so.
Here is an example from google sheets
SP1-SP2 column is the main x-axes, P1-P2 is on each bar.
I don't care if it's with matplotlib or seaborn etc...
Thanks!

Related

Plotting a column of precision floating point values

I have a sequence of data that I have modified to the following:
load 'tables/csv'
load 'graphics/plot'
x =: readcsv 'table_ctl.csv'
dat =: 4 {::|:x
dat
The data in question is pulling the fourth column, that has been transposed of the following sequence of the array. Below is a sample of the first five values for the column.
13.5598 13.6815 14.027 14.132 14.0104
However upon running:
plot dat
I get the following error:
|option not found: 13.5598: signal
| signal'option not found: ',j
Is this error due to the precision of the floating point values?
Thank you.
You're getting this error as you're passing a list of boxes to plot, and plot is expecting some of these boxes to contain the data to plot, and some other boxes to contain control data. 13.5598 is not a valid option for a plot.
fread 'table_ctl.csv'
a,b,0,1,13.5598
a,b,0,1,13.6815
a,b,0,1,14.027
a,b,0,1,14.132
a,b,0,1,14.0104
4 {::|: readcsv 'table_ctl.csv'
┌───────┬───────┬──────┬──────┬───────┐
│13.5598│13.6815│14.027│14.132│14.0104│
└───────┴───────┴──────┴──────┴───────┘
Probably you were thinking that {:: automatically unboxes, but it only does this if the path you give it designates a single box. See the top text at Fetch. The other problem to have is that the contents of these boxes are strings, not floats:
$ > 4 {::|: readcsv 'table_ctl.csv'
5 7
|."1 > 4 {::|: readcsv 'table_ctl.csv'
8955.31
5186.31
720.41
231.41
4010.41
So, to plot your numbers: plot > makenum 4 {::|: readcsv 'table_ctl.csv' which starts with the list of boxes, then turns each box into a box of a float, then unboxes the list and plots it. makenum comes with readcsv and is like a smart ". each in this case, as it would leave non-numeric boxes alone.
There's a bit more to set up, but jd might also work for this:
fread 'table_ctl.cdefs'
1 label byte 1
2 name varbyte
3 enabled boolean
4 weight int
5 score float
options , LF NO \ 0 iso8601-char
load 'data/jd'
!!! Jd key: non-commercial use only!
jdwelcome_jd_ NB. run this sentence for important information
jdadminnew'temp'
CSVFOLDER=:'/path/to/csv/directory'
jd'csvrd table_ctl.csv data'
jd'info schema'
┌─────┬───────┬───────┬─────┐
│table│column │type │shape│
├─────┼───────┼───────┼─────┤
│data │label │byte │1 │
│data │name │varbyte│_ │
│data │enabled│boolean│_ │
│data │weight │int │_ │
│data │score │float │_ │
└─────┴───────┴───────┴─────┘
jd'get data score'
13.5598 13.6815 14.027 14.132 14.0104

How to create a dataframe with different random numbers on each column?

I'm trying to make different random numbers but it keeps being the same on every column, how to fix it, using 1 line?
CODE:
yuju= pd.DataFrame()
column_price_x = [random.uniform(65.5,140.5) for i in range(20)]
for i in range(1990,2020):
yuju[i] = column_price_x
yuju
RESULT
EXPECTED:
Different numbers value for each column
How can I deal with it?
Its much easier than you think
In [12]: import numpy as np
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.463645 0.818606 0.520964 0.016413 0.286529
1 0.701693 0.556813 0.352911 0.738017 0.148805
2 0.899378 0.626350 0.821576 0.917648 0.404706
3 0.985617 0.336138 0.443910 0.690457 0.627859
4 0.121281 0.784853 0.799065 0.102332 0.156317
np.random.rand samples from standard uniform distribution (over [0,1])
Edit
if you want uniform distribution over given numbers, use np.random.uniform
In [16]: pd.DataFrame(np.random.uniform(low=65.5,high=140.5,size=(5,5))
...: )
Out[16]:
0 1 2 3 4
0 124.356069 96.718934 100.587485 136.670313 124.134073
1 68.109675 105.677037 86.084935 109.284336 108.393333
2 120.445978 125.036895 92.557137 105.864824 95.297450
3 91.027931 140.040051 94.362951 80.870850 70.106912
4 107.404708 92.472469 84.748544 82.116756 129.313166
here the solution
each iteration you should random again to assign new value for each column
yuju= pd.DataFrame()
for i in range(1990,2020):
yuju[i]= [random.uniform(65.5,140.5) for i in range(20)]
yuju
output
1990 1991 1992 1993 1994 1995 1996 1997 ...
0 73.117785 104.158470 76.704672 136.295814 106.008801 88.129275 96.843800 118.172649 ... 106.08
1 77.146977 131.584449 112.781430 113.071448 118.806880 140.301281 132.196554 136.222878 ... 74.85
2 67.976294 90.571586 137.313729 126.388545 134.941530 119.544528 119.692859 124.883332 ... 82.48
3 76.577618 102.765745 137.014399 84.696234 70.087628 86.180974 121.070030 87.991356 ... 71.67
4 104.675987 134.869611 120.221701 69.652423 105.650834 107.308007 122.372708 80.037225 ... 90.58
5 107.093326 124.649323 138.961846 84.312784 98.964176 87.691698 120.426266 79.888018 ... 97.46
6 97.375159 97.607740 119.027947 77.545403 81.365235 119.204719 75.426836 132.545121 ... 120.15
7 81.099338 94.315767 123.389789 85.734648 134.746295 99.196135 65.963834 72.895016 ... 135.63
8 129.577824 118.482358 137.838454 83.338883 68.603851 138.657750 85.155046 73.311065 ... 91.12
9 129.321333 134.598491 138.810883 119.487502 75.794849 125.314185 118.499014 126.969947 ... 74.86
10 122.704160 118.282868 114.196318 69.668442 112.237553 68.953530 115.395672 114.560736 ... 88.21
11 112.653109 109.635751 78.470715 81.973892 111.413094 76.918852 76.318205 129.423737 ... 103.06
12 80.984595 136.170595 83.258407 112.248942 96.730922 84.922575 104.984614 127.646325 ... 103.24
13 82.658896 97.066191 95.096705 107.757428 93.767250 93.958438 115.113325 98.931509 ... 105.32
14 85.173060 77.257117 72.668875 87.061919 130.088992 80.001858 104.526423 85.237558 ... 87.86
15 68.428850 79.948204 107.060400 92.962859 133.393354 93.806838 99.258857 138.314982 ... 86.80
16 115.105281 110.567551 119.868457 139.482290 103.235046 128.805920 140.131489 107.568099 ... 98.16
17 71.318147 119.965667 97.135972 90.174975 125.738171 115.655945 86.333461 114.574965 ... 134.80
18 134.000260 121.417473 104.832999 129.277671 139.932955 122.623911 92.369881 109.523118 ... 137.47
19 104.444951 111.712214 130.602922 119.446700 88.256841 110.316280 74.611164 88.364896 ... 115.32

Keras Prediction result (getting score,use of argmax)

I am trying to use the elmo model for text classification for my own dataset. The training is completed and the number of classes is 4(used keras model and elmo embedding).In the prediction, I got a numpy array. I am attaching the sample code and the result below...
import tensorflow as tf
import keras.backend as K
new_text_pr = np.array(data, dtype=object)[:, np.newaxis]
with tf.Session() as session:
K.set_session(session)
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
model_elmo = build_model(classes)
model_elmo.load_weights(model+"/"+elmo_model)
import time
t = time.time()
predicted = model_elmo.predict(new_text_pr)
print("time: ", time.time() - t)
print(predicted)
# print(predicted[0][0])
print("result:",np.argmax(predicted[0]))
return np.argmax(predicted[0])
when I print the predicts variable I got this.
time: 1.561854362487793
[[0.17483692 0.21439584 0.24001297 0.3707543 ]
[0.15607062 0.24448264 0.4398888 0.15955798]
[0.06494818 0.3439018 0.42254424 0.16860574]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.0365712 0.4194748 0.3321385 0.21181548]
[0.05350104 0.18225929 0.56712115 0.19711846]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.09541835 0.19085276 0.41069734 0.30303153]
[0.03930932 0.40526104 0.45785302 0.09757669]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.09784496 0.2292052 0.44426462 0.22868524]
[0.06089798 0.31685832 0.47317514 0.14906852]
[0.03956613 0.46605557 0.3502095 0.14416872]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.15165758 0.22900137 0.50939053 0.10995051]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.11404029 0.21311268 0.46880838 0.2040386 ]
[0.07556026 0.20502563 0.52019936 0.19921473]
[0.11096822 0.23295449 0.36192006 0.29415724]
[0.05018891 0.16656907 0.60114646 0.18209551]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.09596984 0.18282187 0.5053091 0.2158991 ]
[0.09428936 0.13995855 0.62395805 0.14179407]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.08244281 0.15743142 0.5462735 0.21385226]
[0.07199708 0.2446867 0.44568574 0.23763043]
[0.1339082 0.27288827 0.43478844 0.15841508]
[0.07354636 0.24499843 0.44873005 0.23272514]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.08924995 0.36547357 0.40014726 0.14512917]
[0.05132649 0.28190497 0.5224545 0.14431408]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04849219 0.36724472 0.39698333 0.1872797 ]
[0.07206573 0.31368822 0.4667826 0.14746341]
[0.05948553 0.28048623 0.41831577 0.2417125 ]
[0.07582933 0.18771031 0.54879296 0.18766735]
[0.03858965 0.20433436 0.5596278 0.19744818]
[0.07443814 0.20681688 0.3933627 0.32538226]
[0.0639974 0.23687115 0.5357675 0.16336392]
[0.11005415 0.22901568 0.4279426 0.23298755]
[0.12625505 0.22987585 0.31619486 0.32767424]
[0.08893713 0.14554602 0.45740074 0.30811617]
[0.07906891 0.18683094 0.5214609 0.21263924]
[0.06316617 0.30398315 0.4475617 0.185289 ]
[0.07060979 0.17987429 0.4829593 0.26655656]
[0.0720717 0.27058697 0.41439256 0.24294883]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04745338 0.25831962 0.46751252 0.22671448]
[0.06624557 0.20708969 0.54820716 0.17845756]]
result:3
Anyone have any idea about what is the use of taking the 0th index value only. Considering this as a list of lists 0th index means first list and the argmax returns index the maximum value from the list. Then what is the use of other values in the lists?. Why isn't it considered?. Also is it possible to get the score from this? I hope the question is clear. Is it the correct way or is it wrong?
I have found the issue. just posting it others who met the same problem.
Answer: When predicting with Elmo model, it expects a list of strings. In code, the prediction data were split and the model predicted for each word. That's why I got this huge array. I have used a temporary fix. The data is appended to a list then an empty string is also appended with the list. The model will predict the both list values but I took only the first predicted data. This is not the correct way but I have done this as a quick fix and hoping to find a fix in the future
To find the predicted class for each test example, you need to use axis=1. So, in your case the predicted classes will be:
>>> predicted_classes = predicted.argmax(axis=1)
>>> predicted_classes
[3 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
2 2 2 2 2 2 3 2 2 2 2 2 1 2 2]
Which means that the first test example belongs to the third class, and the second test example belongs to the second class and so on.
The previous part answers your question (I think), now let's see what the np.argmax(predicted) does. Using np.argmax() alone without specifying the axis will flatten your predicted matrix and get the argument of the maximum number.
Let's see this simple example to know what I mean:
>>> x = np.matrix(np.arange(12).reshape((3,4)))
>>> x
matrix([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> x.argmax()
11
11 is the index of the 11 which is the biggest number in the whole matrix.

Generate a random number from normal distribution in J

In J programming: I know how to get a linear random number.
? 5#10
1 3 3 4 7
But how to get a random number from normal distribution, e.g. N(0,1)? Thanks!
I think you're looking for normalrand in the stats package:
load 'stats'
normalrand 5
_0.514477 1.23645 _0.353373 _0.522193 1.23505
See also the stats/distribs addon
load 'stats/distrib'
rnorm 4
_0.486091 _0.339021 1.50653 0.19308
10 2 rnorm 4 NB. from distribution N(10,2)
10.0588 11.0472 13.6208 8.78888

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 No. features and 2 No. classes [0,1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), this has 12 fewer lines than that of the input file. The line count here shows 450, although there are obviously also 9 lines at the beginning showing the various parameters generated
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to svm and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Updated.........
I now believe that in generating the model, it has removed lines whereby the labels and all the parameters are exactly the same.
To explain............... My input is a set of miRNAs which have been classified as 1 and 0 depending on their involvement in a particular process or not (i.e 1=Yes & 0=No). The input file looks something like.......
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Whereby, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why the output model would do this and how I can get around this (whilst using the same features)?
Whilst both SOME OF the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The Input file does not have a feature for miRNA name (and this would clearly show the differences in each line) however, in terms of the features used (i.e Nucleotide Percentage Content), some of the miRNAs do have exactly the same percentage content of A,U,G & C and as a result are viewed as duplicates and then removed from the output model as it obviously views them as duplicates even though they are not (hence there are less lines in the output model).
the format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like (See the very first two lines below), as they appear identical, however each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
Align your input file properly, is my first observation. The code for libsvm doesnt look for exactly 4 features. I identifies by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input argument.
Try the following code in python to run
Requirements - h5py, if your input is from matlab. (.mat file)
pip install h5py
import h5py
f = h5py.File('traininglabel.mat', 'r')# give label.mat file for training
variables = f.items()
labels = []
c = []
import numpy as np
for var in variables:
data = var[1]
lables = (data.value[0])
trainlabels= []
for i in lables:
trainlabels.append(str(i))
finaltrain = []
trainlabels = np.array(trainlabels)
for i in range(0,len(trainlabels)):
if trainlabels[i] == '0.0':
trainlabels[i] = '0'
if trainlabels[i] == '1.0':
trainlabels[i] = '1'
print trainlabels[i]
f = h5py.File('training_features.mat', 'r') #give features here
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
data = var[1]
lables = data.value
for i in range(0,1000): #no of training samples in file features.mat
file.write(str(trainlabels[i]))
file.write(' ')
for j in range(0,49):
file.write(str(lables[j][i]))
file.write(' ')
file.write('\n')

Resources