requests.get() returns a response whose raw content is of type bytes. It looks like:
b'{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\\/Date(1583530800000)\\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}\r\n{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\\/Date(1583531100000)\\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}\r\n{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\\/Date(1583531400000)\\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}\r\n{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\\/Date(1583531700000)\\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}\r\n{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\\/Date(1583532000000)\\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}\r\nEND'
Please note that while the actual string is much longer, it is always a long string made up of shorter strings separated by '\r\n', with a final word "END" that should be ignored. You can see how similarly structured these short strings are:
for i in response.text.split('\r\n')[:-1]: print(i, '\n\n')
{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\/Date(1583530800000)\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}
{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\/Date(1583531100000)\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}
{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\/Date(1583531400000)\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}
{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\/Date(1583531700000)\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}
{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\/Date(1583532000000)\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}
Goal: parse a few of the fields and save them in a dataframe with the field "TimeStamp" as the dataframe's index.
What I have done:
import ast
import datetime
import pandas as pd

response_text = response.text
df = pd.DataFrame(columns=['o', 'h', 'l', 'c', 'v'])
for i in response_text.split('\r\n')[:-1]:
    i_dict = ast.literal_eval(i)
    epoch_in_milliseconds = int(i_dict['TimeStamp'].split('(')[1].split(')')[0])
    time_stamp = datetime.datetime.fromtimestamp(float(epoch_in_milliseconds) / 1000.)
    o = i_dict['Open']
    h = i_dict['High']
    l = i_dict['Low']
    c = i_dict['Close']
    v = i_dict['TotalVolume']
    temp_df = pd.DataFrame({'o': o, 'h': h, 'l': l, 'c': c, 'v': v}, index=[time_stamp])
    df = df.append(temp_df)
which gets me:
In [546]df
Out[546]:
o h l c v
2020-03-06 16:40:00 8496.75000 8508.25000 8495.00000 8506.25000 469
2020-03-06 16:45:00 8506.00000 8509.50000 8502.00000 8503.00000 345
2020-03-06 16:50:00 8503.25000 8505.75000 8492.75000 8494.00000 346
2020-03-06 16:55:00 8493.75000 8500.25000 8492.25000 8499.00000 402
2020-03-06 17:00:00 8498.50000 8508.25000 8495.75000 8501.75000 510
which is exactly what I need.
Issue: this method feels clumsy to me, like patch-work, and prone to breaking due to possible slight differences in the response text.
Is there any more robust and faster way of extracting this information from the original bytes? (When the server response is in JSON format, I have none of this headache)
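For what it's worth, the payload is effectively JSON Lines (newline-delimited JSON) plus an END sentinel, so one robust option is to strip the sentinel and hand the rest to pd.read_json with lines=True. A minimal sketch on a shortened two-record payload (the field subset here is illustrative, not the full record):

```python
import io
import pandas as pd

raw = (b'{"Close":8506.25,"Open":8496.75,"TotalVolume":469,'
       b'"TimeStamp":"\\/Date(1583530800000)\\/"}\r\n'
       b'{"Close":8503.00,"Open":8506.00,"TotalVolume":345,'
       b'"TimeStamp":"\\/Date(1583531100000)\\/"}\r\nEND')

# Normalize line endings and drop the trailing END sentinel.
text = raw.decode().replace('\r\n', '\n')
body = text.rsplit('\n', 1)[0]

# Parse the whole JSON-Lines body in one call.
df = pd.read_json(io.StringIO(body), lines=True, convert_dates=False)

# Turn the "/Date(ms)/" wrapper into a real datetime index (UTC).
ms = df['TimeStamp'].str.extract(r'(\d+)')[0].astype('int64')
df.index = pd.to_datetime(ms, unit='ms')
```

This avoids ast.literal_eval entirely and keeps the parsing in one pass; the remaining string surgery is confined to the non-standard /Date(...)/ timestamp wrapper.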
This is somewhat cleaner, I believe:
ts = """
b'{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\\/Date(1583530800000)\\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}\r\n{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\\/Date(1583531100000)\\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}\r\n{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\\/Date(1583531400000)\\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}\r\n{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\\/Date(1583531700000)\\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}\r\n{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\\/Date(1583532000000)\\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}\r\nEND'
"""
import pandas as pd
from datetime import datetime
import json
data = []
tss = ts.replace("b'","").replace("\r\nEND'","")
tss2 = tss.strip().split("\r\n")
for t in tss2:
    item = json.loads(t)
    epo = int(item['TimeStamp'].split('(')[1].split(')')[0])
    eims = datetime.fromtimestamp(epo / 1000)
    item.update(TimeStamp=eims)
    data.append(item)
pd.DataFrame(data)
Output:
Close DownTicks DownVolume High Low Open Status TimeStamp TotalTicks TotalVolume UnchangedTicks UnchangedVolume UpTicks UpVolume OpenInterest
0 8506.25 164 207 8508.25 8495.00 8496.75 13 2020-03-06 16:40:00 371 469 0 0 207 262 0
1 8503.00 152 203 8509.50 8502.00 8506.00 13 2020-03-06 16:45:00 282 345 0 0 130 142 0
etc. You can drop unwanted columns, change column names and so on.
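For instance, dropping and renaming can look like this (a sketch on a one-row subset of the data; the target names o/h/l/c/v are the ones from the question):

```python
import pandas as pd

data = [{'Close': 8506.25, 'Open': 8496.75, 'High': 8508.25,
         'Low': 8495.00, 'TotalVolume': 469, 'Status': 13}]

# Drop the unwanted column, then map the remaining ones to short names.
df = (pd.DataFrame(data)
        .drop(columns=['Status'])
        .rename(columns={'Open': 'o', 'High': 'h', 'Low': 'l',
                         'Close': 'c', 'TotalVolume': 'v'}))
df = df[['o', 'h', 'l', 'c', 'v']]  # enforce column order
```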
That's almost JSON format. Or more precisely, it's a series of lines, each of which contains a JSON-formatted object. (Except for the last one.) So that suggests that the optimal solution in some way uses the json module.
json.load doesn't handle files consisting of a series of lines, nor does it directly handle individual strings (much less bytes). However, you're not limited to json.load. You can construct a JSONDecoder object, which does include methods to parse from strings (but not from bytes), and you can use the decode method of the bytes object to construct a string from the input. (You need to know the encoding to do that, but I strongly suspect that in this case all the characters are ascii, so either 'ascii' or the default UTF-8 encoding will work.)
Unless your input is gigabytes, you can just use the strategy in your question: split the input into lines, discard the END line, and pass the rest into a JSONDecoder:
import json

decoder = json.JSONDecoder()
# Using splitlines seemed more robust than counting on a specific line-end
for line in response.content.decode().splitlines():
    # Alternative: use a try/except around the call to decoder.decode
    if line == 'END':
        break
    line_dict = decoder.decode(line)
    # Handle the TimeStamp member and create the dataframe item
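One way to finish that sketch: collect plain dicts and build the frame once at the end, which is usually faster than appending row by row. The column names o/h/l/c/v follow the question; the two-line input here is illustrative:

```python
import json
from datetime import datetime, timezone

import pandas as pd

lines = [
    '{"Open":8496.75,"High":8508.25,"Low":8495.00,"Close":8506.25,'
    '"TotalVolume":469,"TimeStamp":"\\/Date(1583530800000)\\/"}',
    'END',
]

records = []
for line in lines:
    if line == 'END':
        break
    d = json.loads(line)
    # Pull the epoch milliseconds out of the "/Date(ms)/" wrapper.
    ms = int(d['TimeStamp'].split('(')[1].split(')')[0])
    records.append({
        'ts': datetime.fromtimestamp(ms / 1000, tz=timezone.utc),
        'o': d['Open'], 'h': d['High'], 'l': d['Low'],
        'c': d['Close'], 'v': d['TotalVolume'],
    })

# Build the frame in one go and index it by timestamp.
df = pd.DataFrame(records).set_index('ts')
```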
I am quite new to Python and I am now struggling with printing my list in columns. It prints my lists in one column only, but I want them printed under 4 different titles. I know I am missing something but can't seem to figure it out. Any advice would be really appreciated!
def createMyList():
    myAgegroup = ['20 - 39', '40 - 59', '60 - 79']
    mygroupTitle = ['Age', 'Underweight', 'Healthy', 'Overweight']
    myStatistics = [['Less than 21%', '21 - 33', 'Greater than 33%'],
                    ['Less than 23%', '23 - 35', 'Greater than 35%'],
                    ['Less than 25%', '25 - 38', 'Greater than 38%']]
    printmyLists(myAgegroup, mygroupTitle, myStatistics)
    return

def printmyLists(myAgegroup, mygroupTitle, myStatistics):
    print(': Age : Underweight : Healthy : Overweight :')
    for count in range(0, len(myAgegroup)):
        print(myAgegroup[count])
    for count in range(0, len(mygroupTitle)):
        print(mygroupTitle[count])
    for count in range(0, len(myStatistics)):
        print(myStatistics[0][count])
    return

createMyList()
To print data in nice columns, it helps to know the Format Specification Mini-Language (doc). Also, to group data together, look at the zip() built-in function (doc).
Example:
def createMyList():
    myAgegroup = ['20 - 39', '40 - 59', '60 - 79']
    mygroupTitle = ['Age', 'Underweight', 'Healthy', 'Overweight']
    myStatistics = [['Less than 21%', '21 - 33', 'Greater than 33%'],
                    ['Less than 23%', '23 - 35', 'Greater than 35%'],
                    ['Less than 25%', '25 - 38', 'Greater than 38%']]
    printmyLists(myAgegroup, mygroupTitle, myStatistics)

def printmyLists(myAgegroup, mygroupTitle, myStatistics):
    # print the header:
    for title in mygroupTitle:
        print('{:^20}'.format(title), end='')
    print()
    # print the columns:
    for age, stats in zip(myAgegroup, myStatistics):
        print('{:^20}'.format(age), end='')
        for stat in stats:
            print('{:^20}'.format(stat), end='')
        print()

createMyList()
Prints:
Age Underweight Healthy Overweight
20 - 39 Less than 21% 21 - 33 Greater than 33%
40 - 59 Less than 23% 23 - 35 Greater than 35%
60 - 79 Less than 25% 25 - 38 Greater than 38%
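If the cell contents change, the hard-coded width of 20 can instead be derived from the data (a small sketch reusing the same strings):

```python
rows = [
    ['Age', 'Underweight', 'Healthy', 'Overweight'],
    ['20 - 39', 'Less than 21%', '21 - 33', 'Greater than 33%'],
    ['40 - 59', 'Less than 23%', '23 - 35', 'Greater than 35%'],
    ['60 - 79', 'Less than 25%', '25 - 38', 'Greater than 38%'],
]

# Column width = longest cell anywhere, plus a little padding.
width = max(len(cell) for row in rows for cell in row) + 4
for row in rows:
    print(''.join('{:^{w}}'.format(cell, w=width) for cell in row))
```

With these particular strings the computed width comes out the same as the literal 20 above, but it now adapts automatically if a longer label is added.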
I have a tabular.text file (Named "xfile"). An example of its contents is attached below.
Scaffold2_1 WP_017805071.1 26.71 161 97
Scaffold2_1 WP_006995572.1 26.36 129 83
Scaffold2_1 WP_005723576.1 26.92 130 81
Scaffold3_1 WP_009894856.1 25.77 245 43
Scaffold8_1 WP_017805071.1 38.31 248 145
Scaffold8_1 WP_006995572.1 38.55 249 140
Scaffold8_1 WP_005723576.1 34.88 258 139
Scaffold9_1 WP_005645255.1 42.54 446 144
Note how each line begins with Scaffold(y)_1, with y being a number. I have written the following code to print each line beginning with the following terms, Scaffold2 and Scaffold8.
with open("xfile", 'r') as data:
    for line in data.readlines():
        if "Scaffold2" in line:
            a = line
            print(a)
        elif "Scaffold8" in line:
            b = line
            print(b)
I was wondering, is there a way you would recommend incrementing the (y) portion of Scaffold(y) in the if and elif statements?
The idea would be to let the script search for each line containing "Scaffold(y)" and store each line with a specific number (y) in its own variable, to then be printed. This would obviously be much faster than entering each number manually.
You can try this; it is quite a bit easier than using a regex. If this isn't what you expect, let me know and I'll change the code.
for line in data.readlines():
    if line[0:8] == "Scaffold" and line[8].isdigit():
        print(line)
I'm just checking the 9th position in your line (i.e. the 8th index). If it's a digit, I print the line. In other words, I print the line if your "y" is a digit; I'm not incrementing it. The work of incrementing is already done by your for loop.
Ok it seems like you want to get something in format like:
entries = {y1: ['Scaffold(y1)_...', 'Scaffold(y1)_...'], y2: ['Scaffold(y2)_...', 'Scaffold(y2)_...'], ...}
Then you can do something like this (I assume all of your lines start in the same manner as you have shown, so the y value is always at index 8 in the string):
entries = dict()
for line in data.readlines():
    if line[8] not in entries:
        entries.update({line[8]: [line]})
    else:
        entries[line[8]].append(line)
print(entries)
This way you will have a dictionary in the format I have shown you above - output:
{'2': ['Scaffold2_1 WP_017805071.1 26.71 161 97', 'Scaffold2_1 WP_006995572.1 26.36 129 83', 'Scaffold2_1 WP_005723576.1 26.92 130 81'], '3': ['Scaffold3_1 WP_009894856.1 25.77 245 43'], '8': ['Scaffold8_1 WP_017805071.1 38.31 248 145', 'Scaffold8_1 WP_006995572.1 38.55 249 140', 'Scaffold8_1 WP_005723576.1 34.88 258 139'], '9': ['Scaffold9_1 WP_005645255.1 42.54 446 144']}
EDIT: to be honest, I still don't fully understand why you would need that, though.
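One caveat with indexing a single character at position 8: it breaks once y reaches 10 (a hypothetical Scaffold10 line would be filed under '1'). A regex captures the whole number instead; a sketch on two sample lines, one of them with the made-up two-digit scaffold:

```python
import re
from collections import defaultdict

lines = [
    'Scaffold2_1 WP_017805071.1 26.71 161 97',
    'Scaffold10_1 WP_005645255.1 42.54 446 144',  # hypothetical two-digit y
]

# Group lines by the full scaffold number, however many digits it has.
entries = defaultdict(list)
for line in lines:
    m = re.match(r'Scaffold(\d+)_', line)
    if m:
        entries[m.group(1)].append(line)
```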
I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features, and 2 classes [0, 1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has 12 fewer lines than the input file. The line count here shows 450, and 9 of those lines are the header at the beginning showing the various parameters generated,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to SVMs and was hoping that someone could shed some light on why this might have happened.
Thanks in advance
Updated.........
I now believe that in generating the model, it has removed lines whereby the labels and all the parameters are exactly the same.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on their involvement in a particular process (i.e. 1 = Yes & 0 = No). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why the output model does this and how I can get around it (whilst using the same features)?
Whilst some of the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: the input file does not have a feature for the miRNA name (which would clearly show the differences between lines). In terms of the features used (i.e. nucleotide percentage content), some of the miRNAs do have exactly the same percentage content of A, U, G & C, and as a result they are viewed as duplicates and removed from the output model, even though they are not actually duplicates (hence the output model has fewer lines).
the format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (see the very first two lines below: they appear identical, yet each represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
My first observation: align your input file properly. The libsvm code doesn't look for exactly 4 features; it identifies features by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input format.
Try running the following Python code.
Requirement: h5py, if your input comes from MATLAB (a .mat file):
pip install h5py
import h5py
import numpy as np

# Read the training labels from a MATLAB file.
f = h5py.File('traininglabel.mat', 'r')  # give the label .mat file for training
variables = f.items()
labels = []
for var in variables:
    data = var[1]
    labels = data.value[0]

# Convert the labels to the '0'/'1' strings libsvm expects.
trainlabels = []
for i in labels:
    trainlabels.append(str(i))
trainlabels = np.array(trainlabels)
for i in range(0, len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print(trainlabels[i])

# Read the feature matrix and write one line per training sample.
f = h5py.File('training_features.mat', 'r')  # give features here
variables = f.items()
features = []
out = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    features = data.value
for i in range(0, 1000):  # no. of training samples in features.mat
    out.write(trainlabels[i])
    out.write(' ')
    for j in range(0, 49):
        out.write(str(features[j][i]))
        out.write(' ')
    out.write('\n')
out.close()
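For reference, the libsvm input format shown in the question is "label index:value ...", with feature indices starting at 1. A hypothetical helper sketching that conversion for plain rows of feature values (the function name, output path, and sample numbers here are made up for illustration):

```python
def write_libsvm(path, labels, features):
    # One output line per sample: "label 1:v1 2:v2 ..."
    with open(path, 'w') as out:
        for label, row in zip(labels, features):
            fields = ['%d:%s' % (j + 1, v) for j, v in enumerate(row)]
            out.write('%s %s\n' % (label, ' '.join(fields)))

# Two samples with two features each, mirroring the question's format.
write_libsvm('traindata.txt', [0, 1], [[15.0, 40.0], [22.73, 40.91]])
```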