Parsing error when reading a specific Pajek (NET) file with Networkx into Jupyter - python-3.x

I am trying to read this Pajek file in Google Colab's version of Jupyter, and I get an error when executing the following very simple code:
J = nx.MultiDiGraph()
J=nx.read_pajek("/content/data/graphdatasets/jazz.net")
print(nx.info(J))
The error is the following:
/usr/local/lib/python3.6/dist-packages/networkx/readwrite/pajek.py in parse_pajek(lines)
211 except AttributeError:
212 splitline = shlex.split(str(l))
--> 213 id, label = splitline[0:2]
214 labels.append(label)
215 G.add_node(label)
ValueError: not enough values to unpack (expected 2, got 1)
With pip show networkx, I see that I'm running Networkx version: 2.3. Am I doing something wrong in the code?
Update: Pasting below the file's first few lines:
*Vertices 198
*Arcs
*Edges
1 8 1
1 24 1
1 35 1
1 42 1
1 46 1
1 60 1
1 74 1
1 78 1

According to the Pajek format definition, the first lines of your file do not follow the standard. After *Vertices n, n lines with details about the vertices are expected, but your file has none, so the parser tries to read the *Arcs line as a vertex line and fails with the ValueError shown above. In addition, declaring both *Arcs and *Edges is redundant here, since the *Arcs section is empty. For an edge list that starts with *Arcs, NetworkX builds a MultiDiGraph, and for one that starts with *Edges, a MultiGraph (see the current code). To resolve your problem, you only need to delete the first two lines of your .net file.
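A quick way to spot this kind of header problem before handing a file to NetworkX is to compare the vertex count declared on the *Vertices line with the number of vertex lines that actually follow it. The sketch below uses only the standard library; the function name count_vertex_lines is made up for illustration and is not part of NetworkX.

```python
def count_vertex_lines(lines):
    """Return (declared, found) vertex counts for a Pajek file given as lines."""
    declared = None  # number announced on the "*Vertices n" line
    found = 0        # vertex lines actually present after it
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith("*"):
            if declared is not None:
                break  # the next section (*Arcs, *Edges, ...) has started
            if text.lower().startswith("*vertices"):
                declared = int(text.split()[1])
            continue
        if declared is not None:
            found += 1
    return declared, found


# First lines of the file from the question
header = ["*Vertices 198", "*Arcs", "*Edges", "1 8 1", "1 24 1"]
declared, found = count_vertex_lines(header)
print(declared, found)
```

If the two numbers differ, the vertex section is incomplete and the file needs fixing before it is parsed.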

type error in functions to run point in polygon query on RAPIDS

I want to create a point-in-polygon query for 14 million NYC taxi trips and find out which of the 263 taxi zones each trip was located in.
I want to run the code on RAPIDS cuspatial. I read a few forums and posts, and came across a cuspatial limitation: users can only query against 32 polygons in each run. So I did the following to split my polygons into batches.
This is my taxi zone polygon file
cusptaxizone
(0 0
1 1
2 34
3 35
4 36
...
258 348
259 349
260 350
261 351
262 353
Name: f_pos, Length: 263, dtype: int32,
0 0
1 232
2 1113
3 1121
4 1137
...
349 97690
350 97962
351 98032
352 98114
353 98144
Name: r_pos, Length: 354, dtype: int32,
x y
0 933100.918353 192536.085697
1 932771.395560 191317.004138
2 932693.871591 191245.031174
3 932566.381345 191150.211914
4 932326.317026 190934.311748
... ... ...
98187 996215.756543 221620.885314
98188 996078.332519 221372.066989
98189 996698.728091 221027.461362
98190 997355.264443 220664.404123
98191 997493.322715 220912.386162
[98192 rows x 2 columns])
There are 263 polygons (taxi zones) in total. I want to run the queries in 11 batches of up to 24 polygons each.
def create_iterations(start, end, batches):
    iterations = list(np.arange(start, end, batches))
    iterations.append(end)
    return iterations

pip_iterations = create_iterations(0, 264, 24)

#loop to do point in polygon query in a table
def perform_pip(cuda_df, cuspatial_data, polygon_name, iter_batch):
    cuda_df['borough'] = " "
    for i in range(len(iter_batch)-1):
        start = pip_iterations[i]
        end = pip_iterations[i+1]
        pip = cuspatial.point_in_polygon(cuda_df['pickup_longitude'], cuda_df['pickup_latitude'],
                                         cuspatial_data[0][start:end], #poly_offsets
                                         cuspatial_data[1], #poly_ring_offsets
                                         cuspatial_data[2]['x'], #poly_points_x
                                         cuspatial_data[2]['y'] #poly_points_y
                                         )
        for i in pip.columns:
            cuda_df['borough'].loc[pip[i]] = polygon_name[i]
    return cuda_df
When I ran the function I received a type error. I wonder what might cause the issue?
pip_pickup = perform_pip(cutaxi, cusptaxizone, pip_iterations)
TypeError: perform_pip() missing 1 required positional argument: 'iter_batch'
It seems you are passing cutaxi as cuda_df, cusptaxizone as cuspatial_data, and pip_iterations as polygon_name. Nothing is passed for the fourth parameter, iter_batch, which perform_pip requires:
def perform_pip(cuda_df, cuspatial_data, polygon_name, iter_batch):
Hence you get the error stating that iter_batch is missing.
If you edit your call to pass all four arguments to perform_pip (the polygon names third and pip_iterations fourth), the error
TypeError: perform_pip() missing 1 required positional argument: 'iter_batch'
will be resolved.
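The effect can be reproduced without a GPU. The sketch below uses a stub with the same signature as the question's perform_pip (the body is omitted, and zone_names is a hypothetical list of polygon names that does not appear in the question). Calling it with three arguments raises the same TypeError; passing the last two arguments by keyword makes it harder to mix them up.

```python
def perform_pip(cuda_df, cuspatial_data, polygon_name, iter_batch):
    """Stub with the question's signature; the real cuspatial work is omitted."""
    return len(iter_batch) - 1  # number of batches the real loop would run


# Same boundaries as create_iterations(0, 264, 24) in the question
pip_iterations = list(range(0, 264, 24)) + [264]
zone_names = ["zone %d" % i for i in range(263)]  # hypothetical polygon names

try:
    # Three arguments, as in the question: pip_iterations lands in polygon_name
    perform_pip("cutaxi", "cusptaxizone", pip_iterations)
except TypeError as exc:
    message = str(exc)
    print(message)

# Four arguments, with the last two named explicitly
batches = perform_pip("cutaxi", "cusptaxizone",
                      polygon_name=zone_names, iter_batch=pip_iterations)
print(batches)
```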

Console Screen Buffer Info shows incorrect X position

I recently found a great short piece of code (in the question "Why the irrelevant code made a difference?") for obtaining console screen buffer info, which I include below. It replaces the huge code accompanying the standard 'CONSOLE_SCREEN_BUFFER_INFO()' method (which I won't include here!).
import ctypes
import struct
print("xxx",end="") # I added this to show what the problem is
hstd = ctypes.windll.kernel32.GetStdHandle(-11) # STD_OUTPUT_HANDLE = -11
csbi = ctypes.create_string_buffer(22)
res = ctypes.windll.kernel32.GetConsoleScreenBufferInfo(hstd, csbi)
width, height, curx, cury, wattr, left, top, right, bottom, maxx, maxy = struct.unpack("hhhhHhhhhhh", csbi.raw)
# The following two lines are also added
print() # To bring the cursor to next line for displaying info
print(width, height, curx, cury, wattr, left, top, right, bottom, maxx, maxy) # Display what we got
Output:
80 250 0 7 7 0 0 79 24 80 43
This output is for Windows 10 MSDOS, with the screen cleared before running the code. However, 'curx' = 0 although it should be 3 (after printing "xxx"). The same phenomenon also happens with the 'CONSOLE_SCREEN_BUFFER_INFO()' method. Any idea what the problem is?
Also, any suggestion for a method of obtaining current cursor position -- besides 'curses' library -- will be welcome!
You need to flush the print buffer if you don't output a linefeed:
print("xxx",end="",flush=True)
Then I get the correct curx=3 with your code:
xxx
130 9999 3 0 14 0 0 129 75 130 76
BTW the original answer in the posted question is the "great" code. The "bitness" of HANDLE can break your code, and not defining .argtypes as a "shortcut" is usually the cause of most ctypes problems.
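For reference, here is a minimal sketch of what defining the structures and .argtypes explicitly might look like. The field layout follows the documented Win32 CONSOLE_SCREEN_BUFFER_INFO structure; the helper name get_cursor_position is made up for illustration, and the Windows-only call is guarded so the definitions can be imported on any platform.

```python
import ctypes
import sys


class COORD(ctypes.Structure):
    _fields_ = [("X", ctypes.c_short), ("Y", ctypes.c_short)]


class SMALL_RECT(ctypes.Structure):
    _fields_ = [("Left", ctypes.c_short), ("Top", ctypes.c_short),
                ("Right", ctypes.c_short), ("Bottom", ctypes.c_short)]


class CONSOLE_SCREEN_BUFFER_INFO(ctypes.Structure):
    _fields_ = [("dwSize", COORD),
                ("dwCursorPosition", COORD),
                ("wAttributes", ctypes.c_ushort),
                ("srWindow", SMALL_RECT),
                ("dwMaximumWindowSize", COORD)]


STD_OUTPUT_HANDLE = -11 & 0xFFFFFFFF  # (DWORD)-11


def get_cursor_position():
    """Return (x, y) of the console cursor. Windows only."""
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [ctypes.c_ulong]
    kernel32.GetStdHandle.restype = ctypes.c_void_p  # HANDLE is pointer-sized
    kernel32.GetConsoleScreenBufferInfo.argtypes = [
        ctypes.c_void_p, ctypes.POINTER(CONSOLE_SCREEN_BUFFER_INFO)]
    kernel32.GetConsoleScreenBufferInfo.restype = ctypes.c_int  # BOOL
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    csbi = CONSOLE_SCREEN_BUFFER_INFO()
    if not kernel32.GetConsoleScreenBufferInfo(handle, ctypes.byref(csbi)):
        raise ctypes.WinError(ctypes.get_last_error())
    return csbi.dwCursorPosition.X, csbi.dwCursorPosition.Y


if sys.platform == "win32":
    print("xxx", end="", flush=True)  # flush so the cursor has really moved
    x, y = get_cursor_position()
    print()
    print(x, y)
```

Declaring restype as c_void_p keeps the handle from being truncated to 32 bits on 64-bit Python, which is the "bitness" issue mentioned above.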

Python3: How to increment a string value within a "for" loop

I have a tabular text file (named "xfile"). An example of its contents is attached below.
Scaffold2_1 WP_017805071.1 26.71 161 97
Scaffold2_1 WP_006995572.1 26.36 129 83
Scaffold2_1 WP_005723576.1 26.92 130 81
Scaffold3_1 WP_009894856.1 25.77 245 43
Scaffold8_1 WP_017805071.1 38.31 248 145
Scaffold8_1 WP_006995572.1 38.55 249 140
Scaffold8_1 WP_005723576.1 34.88 258 139
Scaffold9_1 WP_005645255.1 42.54 446 144
Note how each line begins with Scaffold(y)_1, with y being a number. I have written the following code to print each line beginning with the following terms, Scaffold2 and Scaffold8.
with open("xfile", 'r') as data:
    for line in data.readlines():
        if "Scaffold2" in line:
            a = line
            print(a)
        elif "Scaffold8" in line:
            b = line
            print(b)
I was wondering, is there a way you would recommend incrementing the (y) portion of Scaffold() in the if and elif statements?
The idea would be to allow the script to search for each line containing "Scaffold(y)" and storing each line with a specific number (y) in its own variable to be then printed. This would obviously be much faster than entering in each number manually.
You can try this; it is easier than using a regex. If this isn't what you expect, let me know and I'll change the code.
for line in data.readlines():
    if line[0:8] == "Scaffold" and line[8].isdigit():
        print(line)
I'm just checking the 9th position in your line (i.e. index 8). If it's a digit, I print the line. Like you said, I'm printing if your "y" is a digit. I'm not incrementing it; the work of incrementing has already been done by your for loop.
Ok it seems like you want to get something in format like:
entries = {y1: ['Scaffold(y1)_...', 'Scaffold(y1)_...'], y2: ['Scaffold(y2)_...', 'Scaffold(y2)_...'], ...}
Then you can do something like this (I assume all of your lines start in the same manner as you have shown, so the y value is always a single digit at index 8 of the string):
entries = dict()
with open("xfile") as data:
    for line in data:
        line = line.rstrip("\n")  # drop the trailing newline
        key = line[8]  # the single digit right after "Scaffold"
        if key not in entries:
            entries[key] = [line]
        else:
            entries[key].append(line)
print(entries)
This way you will have a dictionary in the format I have shown you above - output:
{'2': ['Scaffold2_1 WP_017805071.1 26.71 161 97', 'Scaffold2_1 WP_006995572.1 26.36 129 83', 'Scaffold2_1 WP_005723576.1 26.92 130 81'], '3': ['Scaffold3_1 WP_009894856.1 25.77 245 43'], '8': ['Scaffold8_1 WP_017805071.1 38.31 248 145', 'Scaffold8_1 WP_006995572.1 38.55 249 140', 'Scaffold8_1 WP_005723576.1 34.88 258 139'], '9': ['Scaffold9_1 WP_005645255.1 42.54 446 144']}
EDIT: tbh I still don't fully understand why would you need that tho.
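Both answers above assume the scaffold number is a single digit. If the file may also contain numbers such as Scaffold12, a regular expression can pull out the whole number. The following is only a sketch: the function name group_by_scaffold and the two-digit sample line are made up for illustration.

```python
import re
from collections import defaultdict

# Capture the number between "Scaffold" and the underscore
SCAFFOLD_RE = re.compile(r"^Scaffold(\d+)_")


def group_by_scaffold(lines):
    """Map each scaffold number to the list of lines that start with it."""
    groups = defaultdict(list)
    for line in lines:
        match = SCAFFOLD_RE.match(line)
        if match:
            groups[int(match.group(1))].append(line.rstrip("\n"))
    return dict(groups)


sample = [
    "Scaffold2_1 WP_017805071.1 26.71 161 97\n",
    "Scaffold8_1 WP_017805071.1 38.31 248 145\n",
    "Scaffold2_1 WP_006995572.1 26.36 129 83\n",
    "Scaffold12_1 WP_000000000.1 30.00 100 50\n",  # hypothetical two-digit scaffold
]
groups = group_by_scaffold(sample)
for number in sorted(groups):
    print(number, groups[number])
```

To use it on the real file, pass the open file object instead of the sample list, e.g. group_by_scaffold(open("xfile")).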

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features and 2 classes [0,1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has 12 fewer data lines than the input file. The line count shows 450, but 9 of those lines are the header at the beginning showing the various parameters generated, which leaves 441 data lines
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to svm and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Updated.........
I now believe that in generating the model, it has removed lines whereby the labels and all the parameters are exactly the same.
To explain............... My input is a set of miRNAs which have been classified as 1 and 0 depending on their involvement in a particular process or not (i.e 1=Yes & 0=No). The input file looks something like.......
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Whereby, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why the output model would do this and how I can get around this (whilst using the same features)?
Whilst some of the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the difference between lines). In terms of the features used (i.e. nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G and C. As a result they seem to be viewed as duplicates and removed from the output model even though they are not, hence there are fewer lines in the output model.
The format of the input file is as follows, where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like the following. The first two lines appear identical, but each line represents a different miRNA:
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
My first observation: align your input file properly. The libsvm code doesn't look for exactly 4 features. It identifies them by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input argument.
Try running the following Python code.
Requirements: h5py, if your input is from MATLAB (.mat file)
pip install h5py
import h5py
f = h5py.File('traininglabel.mat', 'r')# give label.mat file for training
variables = f.items()
labels = []
c = []
import numpy as np
for var in variables:
    data = var[1]
    lables = (data.value[0])
trainlabels= []
for i in lables:
    trainlabels.append(str(i))
finaltrain = []
trainlabels = np.array(trainlabels)
for i in range(0,len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print trainlabels[i]
f = h5py.File('training_features.mat', 'r') #give features here
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data.value
for i in range(0,1000): #no of training samples in file features.mat
    file.write(str(trainlabels[i]))
    file.write(' ')
    for j in range(0,49):
        file.write(str(lables[j][i]))
        file.write(' ')
    file.write('\n')
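If the data is already available in Python, the "label index:value" layout shown in the question can also be produced directly. This is only a sketch of that text format; the helper name to_libsvm_line is made up for illustration, and the two rows are copied from the question's sample.

```python
def to_libsvm_line(label, features):
    """Format one sample as 'label 1:v1 2:v2 ...' with 1-based feature indices."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        parts.append("%d:%s" % (index, value))
    return " ".join(parts)


# First two rows of the input file shown in the question
rows = [(0, [15.0, 40.0, 30.0, 15.0]),
        (1, [22.73, 40.91, 36.36, 0.0])]
lines = [to_libsvm_line(label, feats) for label, feats in rows]
print("\n".join(lines))
```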

Export a matrix to Excel

I made a matrix and I want to export it to Excel. The matrix looks like this:
1 2 3 4 5 6 7
2 0.4069264
3 0.5142857 0.2948718
4 0.3939394 0.4098639 0.3772894
5 0.3476190 0.3717949 0.3194444 0.5824176
6 0.2809524 0.3974359 0.2222222 0.3388278 0.3974359
7 0.2809524 0.5987654 0.3933333 0.4188713 0.4711538 0.3429487
8 0.4675325 0.4855072 0.4523810 0.4917184 0.3409091 0.4318182 0.4128788
9 0.3896104 0.5189594 0.4404762 0.2667549 0.5471429 0.3604762 0.3081502
10 0.4242424 0.4068878 0.3484432 0.2708333 0.4766484 0.3740842 0.4528219
11 0.3476190 0.3942308 0.2881944 0.3228022 0.4711538 0.2147436 0.3653846
12 0.6060606 0.3949830 0.2971612 0.3541667 0.5022894 0.3484432 0.4466490
13 0.4675325 0.5972222 0.6060606 0.3670635 0.4393939 0.3939394 0.3695652
14 0.4978355 0.4951499 0.4480952 0.4713404 0.3814286 0.3147619 0.4629121
15 0.4632035 0.4033883 0.4508929 0.3081502 0.4728571 0.3528571 0.4828571
16 0.3766234 0.5173993 0.4771825 0.4734432 0.5114286 0.3514286 0.4214286
17 0.3939394 0.5289116 0.3260073 0.3333333 0.5663919 0.2330586 0.3015873
18 0.3939394 0.3708791 0.2837302 0.4102564 0.3392857 0.2559524 0.4123810
19 0.3160173 0.5727041 0.4885531 0.3056973 0.4725275 0.3827839 0.3346561
20 0.3333333 0.5793651 0.4257143 0.4876543 0.4390476 0.2390476 0.3131868
21 0.5281385 0.3762755 0.4052198 0.2997449 0.4180403 0.2898352 0.4951499
22 0.3593074 0.3784014 0.4075092 0.2423469 0.4908425 0.3113553 0.3430335
23 0.5281385 0.5875850 0.4404762 0.4634354 0.6071429 0.3763736 0.3747795
24 0.3549784 0.6252381 0.5957341 0.4328571 0.4429563 0.4429563 0.3422619
25 0.4242424 0.4931973 0.5054945 0.2142857 0.4670330 0.4285714 0.4312169
26 0.3852814 0.5671769 0.4954212 0.4073129 0.3736264 0.4890110 0.4523810
27 0.5238095 0.3269558 0.5187729 0.4051871 0.5412088 0.5155678 0.5859788
28 0.3160173 0.1904762 0.3205128 0.3384354 0.3429487 0.3173077 0.5123457
29 0.2380952 0.4468537 0.5196886 0.4536565 0.4491758 0.4491758 0.4634039
30 0.4545455 0.4295635 0.4080087 0.4791667 0.3474026 0.3019481 0.4627329
31 0.2857143 0.3988095 0.3397436 0.3443878 0.4294872 0.2756410 0.3456790
32 0.3636364 0.3027211 0.3772894 0.3452381 0.4413919 0.3388278 0.3818342
33 0.3333333 0.4482402 0.4080087 0.4275362 0.2888199 0.4047619 0.4301242
34 0.5411255 0.4825680 0.4043040 0.4417517 0.4748168 0.3850733 0.3708113
35 0.3160173 0.5476190 0.4230769 0.3979592 0.3653846 0.3397436 0.2283951
36 0.4603175 0.4653209 0.4778912 0.5170807 0.3928571 0.4508282 0.4254658
37 0.3939394 0.1955782 0.2490842 0.4047619 0.2490842 0.3516484 0.4559083
38 0.3463203 0.4660494 0.4300000 0.4157848 0.3833333 0.2233333 0.2788462
39 0.5844156 0.4668367 0.3809524 0.3843537 0.4803114 0.3008242 0.5026455
40 0.5454545 0.4902211 0.3740842 0.2946429 0.5279304 0.2971612 0.3293651
41 0.5800866 0.3758503 0.5073260 0.5136054 0.3598901 0.5393773 0.4823633
42 0.4458874 0.3937390 0.3785714 0.4686949 0.3768315 0.3127289 0.4954212
43 0.6536797 0.5740741 0.5533333 0.4453263 0.4866667 0.5400000 0.4358974
44 0.5887446 0.5548469 0.4308608 0.3949830 0.5462454 0.3411172 0.5136684
45 0.4069264 0.4357993 0.4308608 0.3830782 0.4308608 0.3795788 0.4025573
46 0.5974026 0.3826531 0.3672161 0.3954082 0.4441392 0.3159341 0.5141093
47 0.2554113 0.4196429 0.4262821 0.4961735 0.2788462 0.3301282 0.3055556
I tried the command:
WriteXLS("my matrix after i converted it to data.frame", "test.xls")
but I got this error:
The Perl script 'WriteXLS.pl' failed to run successfully.
I googled it but I couldn't find a solution.
Thanks in advance.
Any reason why you can't just use write.csv?
write.csv(mymatrix, "test.csv")
Import it in Excel and you're set!
PS: I assume you're not putting quotes around your variable name in the WriteXLS call, right?
One other option on Windows (which seems a reasonable assumption given that you are using Excel):
You can write a matrix (or data frame) to the clipboard using a command like:
write.table(mymat, 'clipboard', sep='\t')
Then just go into Excel, click in the cell that you want to be the top left cell, then do a paste and your matrix is there (the sep='\t' is important for Excel to interpret it correctly).
This is similar to other answers, but you don't need an intermediate file on disk.
You could also check xlsx if you do not mind the Excel 2007 format, as xlsx does not depend on Perl (though depends on rJava).
After loading the package via library(xlsx) just try the following:
write.xlsx(USArrests, "/usarrests.xlsx")
It's hard to see what is going on here exactly. Might be several things.
I think the easiest way to write a matrix to Excel is by using write.table() and importing the data in Excel. It takes an extra step but it also keeps your data in a nice format.
If foo is your matrix:
write.table(foo,"foo.txt")
If you get an error, maybe try coercing the object to a matrix:
write.table(as.matrix(foo),"foo.txt")
Does the matrix contain values in the upper triangle as well? Perhaps making a full matrix works:
foo<-foo+t(foo)
write.table(as.matrix(foo),"foo.txt")
But these are all just random shots in the dark since I don't have a matrix to work with.
EDIT: In response to the other answer, you can remove the column and rownames with col.names=FALSE and row.names=FALSE in both write.table() and write.csv() (which are the same function with different default values).
I ran into the same problem after reinstalling Strawberry Perl. After debugging the WriteXLS function in R, I found out that the Perl module Text::CSV_XS was missing from my fresh install. I installed this module from the DOS command line:
perl -MCPAN -e shell
install Text::CSV_XS
After this, WriteXLS was working fine.
upper # matrix name
write.xlsx2(upper,file = "File.xlsx", sheetName="Sheetname",col.names=TRUE, row.names=TRUE, append=TRUE, showNA=TRUE)
