Parsing an HTTP response that is in bytes format - python-3.x

requests.get() returns a response whose body is of type bytes. It looks like:
b'{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\\/Date(1583530800000)\\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}\r\n{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\\/Date(1583531100000)\\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}\r\n{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\\/Date(1583531400000)\\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}\r\n{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\\/Date(1583531700000)\\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}\r\n{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\\/Date(1583532000000)\\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}\r\nEND'
Please note that while the actual string is much longer, it is always a long string of shorter strings separated by '\r\n', with a final word "END" that can be ignored. You can see how similarly structured these short strings are:
for i in response.text.split('\r\n')[:-1]: print(i, '\n\n')
{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\/Date(1583530800000)\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}
{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\/Date(1583531100000)\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}
{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\/Date(1583531400000)\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}
{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\/Date(1583531700000)\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}
{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\/Date(1583532000000)\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}
Goal: parse a few of the fields and save them in a dataframe, with the "TimeStamp" field as the dataframe's index.
What I have done:
import ast
import datetime
import pandas as pd

response_text = response.text
df = pd.DataFrame(columns=['o', 'h', 'l', 'c', 'v'])
for i in response_text.split('\r\n')[:-1]:
    i_dict = ast.literal_eval(i)
    epoch_in_milliseconds = int(i_dict['TimeStamp'].split('(')[1].split(')')[0])
    time_stamp = datetime.datetime.fromtimestamp(epoch_in_milliseconds / 1000.)
    o = i_dict['Open']
    h = i_dict['High']
    l = i_dict['Low']
    c = i_dict['Close']
    v = i_dict['TotalVolume']
    temp_df = pd.DataFrame({'o': o, 'h': h, 'l': l, 'c': c, 'v': v}, index=[time_stamp])
    df = df.append(temp_df)
which gets me:
In [546]: df
Out[546]:
o h l c v
2020-03-06 16:40:00 8496.75000 8508.25000 8495.00000 8506.25000 469
2020-03-06 16:45:00 8506.00000 8509.50000 8502.00000 8503.00000 345
2020-03-06 16:50:00 8503.25000 8505.75000 8492.75000 8494.00000 346
2020-03-06 16:55:00 8493.75000 8500.25000 8492.25000 8499.00000 402
2020-03-06 17:00:00 8498.50000 8508.25000 8495.75000 8501.75000 510
which is exactly what I need.
Issue: this method feels clumsy to me, like patchwork, and prone to breaking due to possible slight differences in the response text.
Is there any more robust and faster way of extracting this information from the original bytes? (When the server response is in JSON format, I have none of this headache.)

This is somewhat cleaner, I believe:
ts = """
b'{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\\/Date(1583530800000)\\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}\r\n{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\\/Date(1583531100000)\\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}\r\n{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\\/Date(1583531400000)\\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}\r\n{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\\/Date(1583531700000)\\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}\r\n{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\\/Date(1583532000000)\\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}\r\nEND'
"""
import pandas as pd
from datetime import datetime
import json
data = []
tss = ts.replace("b'","").replace("\r\nEND'","")
tss2 = tss.strip().split("\r\n")
for t in tss2:
    item = json.loads(t)
    epo = int(item['TimeStamp'].split('(')[1].split(')')[0])
    eims = datetime.fromtimestamp(epo / 1000)
    item.update(TimeStamp=eims)
    data.append(item)
pd.DataFrame(data)
Output:
Close DownTicks DownVolume High Low Open Status TimeStamp TotalTicks TotalVolume UnchangedTicks UnchangedVolume UpTicks UpVolume OpenInterest
0 8506.25 164 207 8508.25 8495.00 8496.75 13 2020-03-06 16:40:00 371 469 0 0 207 262 0
1 8503.00 152 203 8509.50 8502.00 8506.00 13 2020-03-06 16:45:00 282 345 0 0 130 142 0
etc. You can drop unwanted columns, change column names and so on.
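For instance, to keep only the fields from the question and index by timestamp, a short follow-up along these lines should work (a sketch; the column selection and short names are just one possible choice):
df = pd.DataFrame(data).set_index('TimeStamp')
# keep only the OHLCV fields and give them short names
df = df[['Open', 'High', 'Low', 'Close', 'TotalVolume']]
df.columns = ['o', 'h', 'l', 'c', 'v']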

That's almost JSON format. More precisely, it's a series of lines, each of which contains a JSON-formatted object (except for the last one). So that suggests that the optimal solution uses the json module in some way.
json.load doesn't handle files consisting of a series of lines, nor does it directly handle individual strings (much less bytes). However, you're not limited to json.load. You can construct a JSONDecoder object, which does include methods to parse from strings (but not from bytes), and you can use the decode method of the bytes object to construct a string from the input. (You need to know the encoding to do that, but I strongly suspect that in this case all the characters are ascii, so either 'ascii' or the default UTF-8 encoding will work.)
Unless your input is gigabytes, you can just use the strategy in your question: split the input into lines, discard the END line, and pass the rest into a JSONDecoder:
import json

decoder = json.JSONDecoder()
# Using splitlines seemed more robust than counting on a specific line end
for line in response_text.decode().splitlines():
    # Alternative: use a try/except around the call to decoder.decode
    if line == 'END':
        break
    line_dict = decoder.decode(line)
    # Handle the TimeStamp member and create the dataframe item
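Filling in that placeholder, one possible completion (reusing the timestamp handling from the question, and assuming response_text holds the raw bytes, e.g. response.content) might look like:
import datetime
import json
import pandas as pd

decoder = json.JSONDecoder()
rows = {}
for line in response_text.decode().splitlines():
    if line == 'END':
        break
    d = decoder.decode(line)
    # extract the epoch milliseconds from the "\/Date(...)\/" wrapper
    epoch_ms = int(d['TimeStamp'].split('(')[1].split(')')[0])
    ts = datetime.datetime.fromtimestamp(epoch_ms / 1000)
    rows[ts] = {'o': d['Open'], 'h': d['High'], 'l': d['Low'],
                'c': d['Close'], 'v': d['TotalVolume']}
df = pd.DataFrame.from_dict(rows, orient='index')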

Related

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and rolled into a single list I called 'keyLabelRun', so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that include and follow "OS="; anything else, whether it be "SV=" or "PE=" etc., I want to skip until I get to the next "OS=".
The number of elements before the next "OS=" is arbitrary, so that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
    if keyLabelrun[i].count('OS='):
        OSarr.append(keyLabelrun[i])
        if keyLabelrun[i+1].count('=') != 1:
            continue
But the elements where "OS=" is not included are what's tripping me up, I think.
Also, at the end I'm going to join them all back together into their own elements, but I feel like I will be able to handle that after this.
In my attempt, I am trying to append all the elements I'm looking for, in order, to a new list 'OSarr'.
If anyone can lend a hand, it would be much appreciated.
Thank you.
These list of strings came from a dataset that is a text file in the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
    OSarr.append(keyLabelrun[G])
    G += 1
    if keyLabelrun[G].count('='):
        while keyLabelrun[G].count('OS=') != 1:
            G += 1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
    species_parts = []
    is_os = False
    for word in description.split():
        if word[:3] == 'OS=':
            is_os = True
            species_parts.append(word[3:])
        elif '=' in word:
            is_os = False
        elif is_os:
            species_parts.append(word)
    return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO

for record in SeqIO.parse('input.fa', 'fasta'):
    species = extract_species(record.description)
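As a quick check (using the first header from the sample data above), the function pulls out just the species name:
header = ('tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) '
          'OS=Dengue virus 3 PE=4 SV=1 Split=0')
print(extract_species(header))  # prints: Dengue virus 3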

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b
using pyspark.
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    return str(bytes)

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just the product_id is selected from the RDD using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The output for product_id is
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on s3.
Each row in the file has the first 10 bytes for product_id and the next 4096 bytes as image_features.
I'm able to extract all 4096 image features, but I'm facing an issue when reading the first 10 bytes and converting them into a properly readable format.
EDIT:
Finally, the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10. Changing it to:
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
should work.
Actually you can find this in the code provided on the web site you downloaded the binary file from:
for i in range(4096):
    feature.append(struct.unpack('f', f.read(4)))  # <-- so 4096 * 4
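For a quick local sanity check of that layout, a minimal sketch along these lines (assuming each record really is a 10-byte ASCII product id followed by 4096 native-endian float32 values) can parse a single record:
import struct

RECORD_LEN = 10 + 4096 * 4  # product_id bytes + 4096 float32 features

def parse_record(rec):
    # first 10 bytes: product id; the remaining 16384 bytes: packed floats
    product_id = rec[:10].decode('ascii', 'ignore')
    features = struct.unpack('4096f', rec[10:])
    return product_id, features

with open('image_features.b', 'rb') as f:
    pid, feats = parse_record(f.read(RECORD_LEN))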
Old answer:
I think the issue comes from your byte_mapper function.
That's not the correct way to convert bytes to a string. You should be using decode:
bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"
print(bytes.decode("utf-8"))
# output: '1582480311'
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
That means the product_id string contains non-UTF-8 characters. If you don't know the input encoding, it's difficult to convert it into a string.
However, you may want to ignore those characters by passing the ignore option to the decode function:
bytes.decode("utf-8", "ignore")
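Putting that together, the byte_mapper from the question could plausibly be rewritten like this (a sketch; it assumes the id bytes are mostly ASCII):
def byte_mapper(b):
    # decode instead of str(), dropping any undecodable bytes
    return b.decode("utf-8", "ignore")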

Python3: How to increment a string value within a "for" loop

I have a tabular text file (named "xfile"). An example of its contents is attached below.
Scaffold2_1 WP_017805071.1 26.71 161 97
Scaffold2_1 WP_006995572.1 26.36 129 83
Scaffold2_1 WP_005723576.1 26.92 130 81
Scaffold3_1 WP_009894856.1 25.77 245 43
Scaffold8_1 WP_017805071.1 38.31 248 145
Scaffold8_1 WP_006995572.1 38.55 249 140
Scaffold8_1 WP_005723576.1 34.88 258 139
Scaffold9_1 WP_005645255.1 42.54 446 144
Note how each line begins with Scaffold(y)_1, with y being a number. I have written the following code to print each line beginning with Scaffold2 or Scaffold8.
with open("xfile", 'r') as data:
for line in data.readlines():
if "Scaffold2" in line:
a = line
print(a)
elif "Scaffold8" in line:
b = line
print(b)
I was wondering, is there a way you would recommend incrementing the (y) portion of Scaffold() in the if and elif statements?
The idea would be to allow the script to search for each line containing "Scaffold(y)" and storing each line with a specific number (y) in its own variable to be then printed. This would obviously be much faster than entering in each number manually.
You can try this; it is much easier than using a regex. If this isn't what you expect, let me know and I'll change the code.
for line in data.readlines():
    if line[0:8] == "Scaffold" and line[8].isdigit():
        print(line)
I'm just checking the 9th position in your line (index 8). If it's a digit, I print the line. Like you said, I print when your "y" is a digit; I'm not incrementing it. The work of incrementing is already done by your for loop.
OK, it seems like you want to get something in a format like:
entries = {y1: ['Scaffold(y1)_...', 'Scaffold(y1)_...'], y2: ['Scaffold(y2)_...', 'Scaffold(y2)_...'], ...}
Then you can do something like this (I assume all of your lines start in the same manner as you have shown, so the single-digit y value is always at index 8 in the string):
entries = dict()
for line in data.readlines():
    # line[8] is the digit y; this assumes y is a single digit
    if not line[8] in entries.keys():
        entries.update({line[8]: [line]})
    else:
        entries[line[8]].append(line)
print(entries)
This way you will have a dictionary in the format I have shown you above - output:
{'2': ['Scaffold2_1 WP_017805071.1 26.71 161 97', 'Scaffold2_1 WP_006995572.1 26.36 129 83', 'Scaffold2_1 WP_005723576.1 26.92 130 81'], '3': ['Scaffold3_1 WP_009894856.1 25.77 245 43'], '8': ['Scaffold8_1 WP_017805071.1 38.31 248 145', 'Scaffold8_1 WP_006995572.1 38.55 249 140', 'Scaffold8_1 WP_005723576.1 34.88 258 139'], '9': ['Scaffold9_1 WP_005645255.1 42.54 446 144']}
EDIT: tbh I still don't fully understand why you would need that, though.
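If y can run past a single digit (Scaffold10 and up), a regex-based variant of the same grouping (a sketch, not taken from the answers above) avoids the fixed-index assumption:
import re

entries = {}
with open("xfile") as data:
    for line in data:
        m = re.match(r"Scaffold(\d+)_", line)
        if m:
            # group lines by the full number y, not just its first digit
            entries.setdefault(m.group(1), []).append(line.strip())
print(entries)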

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features, and 2 classes [0, 1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has 12 fewer data lines than the input file: the line count shows 450, but 9 of those lines at the beginning are the generated parameters,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to svm and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Update:
I now believe that in generating the model, it has removed lines where the labels and all the parameters are exactly the same.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on their involvement in a particular process (i.e. 1=Yes & 0=No). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why the output model would do this and how I can get around it (whilst using the same features)?
Whilst SOME OF the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the differences between lines). However, in terms of the features used (i.e. nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G & C, so they are treated as duplicates and removed from the output model even though they are not duplicates (hence the output model has fewer lines).
The format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (see the very first two lines below: they appear identical, but each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
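One quick way to test that duplicate-removal hypothesis is to count repeated lines in the input directly; this is a sketch, assuming the svm_input.txt file name from the question:
from collections import Counter

with open('svm_input.txt') as f:
    counts = Counter(line.strip() for line in f if line.strip())

# each extra copy of a repeated line is a candidate "missing" vector
duplicates = sum(n - 1 for n in counts.values() if n > 1)
print(duplicates, 'duplicate lines')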
In terms of software, I am using libsvm-3.22 and python 2.7.5
Align your input file properly, is my first observation. The code for libsvm doesn't look for exactly 4 features; it identifies them by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input argument.
Try the following code in Python.
Requirement: h5py, if your input is from MATLAB (a .mat file):
pip install h5py
import h5py
import numpy as np

f = h5py.File('traininglabel.mat', 'r')  # give the label .mat file for training
variables = f.items()
for var in variables:
    data = var[1]
    labels = data.value[0]

trainlabels = []
for i in labels:
    trainlabels.append(str(i))
trainlabels = np.array(trainlabels)
for i in range(len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print(trainlabels[i])

f = h5py.File('training_features.mat', 'r')  # give the features here
variables = f.items()
out = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    features = data.value
    for i in range(1000):  # no. of training samples in features.mat
        out.write(str(trainlabels[i]))
        out.write(' ')
        for j in range(49):
            out.write(str(features[j][i]))
            out.write(' ')
        out.write('\n')

python3: Counting repeated occurrence in a list

Each line contains a special timestamp, the caller number, the receiver number, the duration of the call in seconds, and the rate per minute in cents at which this call was charged, all separated by ";". The file contains thousands of calls and looks like this. I created a list instead of a dictionary to access the elements, but I'm not sure how to count the number of calls originating from a given phone number.
timestamp;caller;receiver;duration;rate per minute
1419121426;7808907654;7807890123;184;0.34
1419122593;7803214567;7801236789;46;0.37
1419122890;7808907654;7809876543;225;0.31
1419122967;7801234567;7808907654;419;0.34
1419123462;7804922860;7809876543;782;0.29
1419123914;7804321098;7801234567;919;0.34
1419125766;7807890123;7808907654;176;0.41
1419127316;7809876543;7804321098;471;0.31
The desired output looks like this:
Phone number || # |Duration | Due |
+--------------+-----------------------
|(780) 123 4567||384|55h07m53s|$ 876.97|
|(780) 123 6789||132|17h53m19s|$ 288.81|
|(780) 321 4567||363|49h52m12s|$ 827.48|
|(780) 432 1098||112|16h05m09s|$ 259.66|
|(780) 492 2860||502|69h27m48s|$1160.52|
|(780) 789 0123||259|35h56m10s|$ 596.94|
|(780) 876 5432||129|17h22m32s|$ 288.56|
|(780) 890 7654||245|33h48m46s|$ 539.41|
|(780) 987 6543||374|52h50m11s|$ 883.72|
list = [i.strip().split(";") for i in open("calls.txt", "r")]
print(list)
I have a very simple solution for your issue:
First of all, use with when opening a file: it's a handy shortcut, and it provides the same functionality as wrapping the call in try...finally. Consider this:
lines = []
with open("test.txt", "r") as f:
    for line in f.readlines():
        lines.append(line.strip().split(";"))
print(lines)

counters = {}
# you browse through the lists and then through the numbers inside them
for line in lines:
    for number in line:
        # very basic way to count occurrences
        if number not in counters:
            counters[number] = 1
        else:
            counters[number] += 1
# in this condition you can tell what length of digits you accept
counters = {elem: counters[elem] for elem in counters.keys() if len(elem) > 5}
print(counters)
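For the specific question (counting calls originating from one phone), a collections.Counter over the caller column is a compact alternative; this is a sketch assuming the calls.txt layout shown above, header line included:
from collections import Counter

with open("calls.txt") as f:
    next(f)  # skip the header line
    callers = Counter(line.split(";")[1] for line in f if line.strip())

# number of calls originating from this phone
print(callers["7808907654"])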
This should get you started:
import csv
import collections

Call = collections.namedtuple("Call", "duration rate time")

calls = {}
with open('path/to/file') as infile:
    reader = csv.reader(infile, delimiter=';')
    next(reader)  # skip the header line
    for time, nofrom, noto, dur, rate in reader:
        # setdefault (unlike get with a default) stores the nested
        # containers, so the appended Call is kept
        calls.setdefault(nofrom, {}).setdefault(noto, []).append(Call(dur, rate, time))
for nofrom, logs in calls.items():
    for noto, callist in logs.items():
        print(nofrom, "called", noto, len(callist), "times")
