Trouble with parsing JSON msg - python-3.x

I am calling a Quandl API for data and getting a JSON message back, which I am having trouble parsing before sending it to a database. My parsing code is clearly not reading the JSON correctly.
With the code below, I am getting the following JSON (truncated for simplicity):
{"datatable":{"data":[["AAPL","MRY","2018-09-29",265595000000],["AAPL","MRY","2017-09-30",229234000000],["AAPL","MRY","2016-09-24",215639000000],["AAPL","MRY","2015-09-26",233715000000],["AAPL","MRY","2014-09-27",182795000000],["AAPL","MRY","2013-09-28",170910000000],["AAPL","MRT","2018-09-29",265595000000],["AAPL","MRT","2018-06-30",255274000000],["AAPL","MRT","2018-03-31",247417000000],["AAPL","MRT","2017-12-30",239176000000],["AAPL","MRT","2017-09-30",229234000000],["AAPL","MRT","2017-07-01",223507000000],["AAPL","MRT","2017-04-01",220457000000],["AAPL","MRT","2016-12-31",218118000000],["AAPL","MRT","2016-09-24",215639000000],["AAPL","MRT","2016-06-25",220288000000],["AAPL","MRT","2016-03-26",227535000000],["AAPL","MRT","2015-12-26",234988000000],["AAPL","MRT","2015-09-26",233715000000],["AAPL","MRT","2015-06-27",224337000000],["AAPL","MRT","2015-03-28",212164000000],["AAPL","MRT","2014-12-27",199800000000],["AAPL","MRT","2014-09-27",182795000000],["AAPL","MRT","2014-06-28",178144000000],["AAPL","MRT","2014-03-29",176035000000],"columns":[{"name":"ticker","type":"String"},{"name":"dimension","type":"String"},{"name":"datekey","type":"Date"},{"name":"revenue","type":"Integer"}]},"meta":{"next_cursor_id":null}}
import quandl, requests
from flask import request
from cs50 import SQL
db = SQL("sqlite:///formula.db")
data = requests.get(f"https://www.quandl.com/api/v3/datatables/SHARADAR/SF1.json?ticker=AAPL&qopts.columns=ticker,dimension,datekey,revenue&api_key=YOURAPIKEY")
responses = data.json()
print(responses)
for response in responses:
    ticker=str(response["ticker"])
    dimension=str(response["dimension"])
    datekey=str(response["datekey"])
    revenue=int(response["revenue"])
    db.execute("INSERT INTO new(ticker, dimension, datekey, revenue) VALUES(:ticker, :dimension, :keydate, :revenue)", ticker=ticker, dimension=dimension, datekey=datekey, revenue=revenue)
I'm getting the following error message (which I have seen and fixed before), so I strongly believe I am not reading the JSON correctly:
File "new2.py", line 12, in <module>
ticker=str(response["ticker"])
TypeError: string indices must be integers
I want to be able to loop through the json and be able to isolate specific data to then populate a database.

For your response structure, you have a nested dict object:
datatable
    data
        list of lists of data
So you can drill down like this:
responses = data.json()
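datatable = responses['datatable'] # will get you the information mapped to the 'datatable' key
datatable_data = datatable['data'] # will get you the list mapped to the 'data' key
Now, datatable_data is a list of lists, and lists can only be indexed by integer position, not by strings. Iterating over responses directly yields the dict's keys (the strings 'datatable' and 'meta'), which is why response["ticker"] raises TypeError: string indices must be integers.
So, let's say you want the first record.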
first_response = datatable_data[0]
that will result in
first_response = ["AAPL","MRY","2018-09-29",265595000000]
which you can now access by index:
for idx, val in enumerate(first_response):
    print(f'{idx}\t{val}')
which will print out
0 AAPL
1 MRY
2 2018-09-29
3 265595000000
So, with all this information, you need to alter your program so that it accesses the 'data' key in the response and then iterates over the list of lists.
Something like this:
data = responses['datatable']['data']
for record in data:
    ticker, dimension, datekey, revenue = record # unpack list into named variables
    db.execute(...)
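As a rough end-to-end sketch of the loop above, the following uses the standard-library sqlite3 module instead of cs50's SQL wrapper, an in-memory database, and a two-row copy of the payload from the question. The table name revenue_history is hypothetical and only there to make the example self-contained.

```python
import json
import sqlite3

# Trimmed copy of the payload from the question (two rows only).
payload = json.loads("""
{"datatable": {"data": [["AAPL", "MRY", "2018-09-29", 265595000000],
                        ["AAPL", "MRY", "2017-09-30", 229234000000]],
               "columns": [{"name": "ticker", "type": "String"},
                           {"name": "dimension", "type": "String"},
                           {"name": "datekey", "type": "Date"},
                           {"name": "revenue", "type": "Integer"}]},
 "meta": {"next_cursor_id": null}}
""")

# Hypothetical table, created in memory so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_history "
    "(ticker TEXT, dimension TEXT, datekey TEXT, revenue INTEGER)"
)

# Each record is a plain list, so unpack it by position.
records = payload["datatable"]["data"]
for ticker, dimension, datekey, revenue in records:
    conn.execute(
        "INSERT INTO revenue_history (ticker, dimension, datekey, revenue) "
        "VALUES (?, ?, ?, ?)",
        (ticker, dimension, datekey, revenue),
    )
conn.commit()

inserted = conn.execute(
    "SELECT ticker, datekey, revenue FROM revenue_history ORDER BY datekey"
).fetchall()
print(inserted)
```

With named placeholders, as in the question's db.execute call, every placeholder name has to match a keyword argument, so :keydate would need to be :datekey.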

Related

How to output data to Azure ML Batch Endpoint correctly using python?

When invoking Azure ML Batch Endpoints (creating jobs for inferencing), the run() method should return a pandas DataFrame or an array, as explained in the documentation.
However, the example shown there doesn't produce output with CSV headers, which are often needed.
The first thing I tried was to return the data as a pandas DataFrame, and the result was just a simple CSV with a single column and no headers.
When I tried to pass values with several columns and their corresponding headers, to be saved later as CSV, I got awkward square brackets (representing Python lists) and apostrophes (representing strings) in the output.
I haven't been able to find documentation elsewhere on how to fix this.
This is the way I found to create clean CSV output using Python from a batch endpoint invocation in Azure ML:
def run(mini_batch):
    batch = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        # Do any data quality verification here:
        if 'id' not in df.columns:
            logger.error("ERROR: CSV file uploaded without id column")
            return None
        else:
            df['id'] = df['id'].astype(str)
        # Now we need to create the predictions, with previously loaded model in init():
        df['prediction'] = model.predict(df)
        # or alternative, df[MULTILABEL_LIST] = model.predict(df)
        batch.append(df)
    batch_df = pd.concat(batch)
    # After joining all data, we create the columns headers as a string,
    # here we remove the square brackets and apostrophes:
    azureml_columns = str(batch_df.columns.tolist())[1:-1].replace('\'','')
    result = []
    result.append(azureml_columns)
    # Now we have to parse all values as strings, row by row,
    # adding a comma between each value
    for row in batch_df.iterrows():
        azureml_row = str(row[1].values).replace(' ', ',')[1:-1].replace('\'','').replace('\n','')
        result.append(azureml_row)
    logger.info("Finished Run")
    return result
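A minimal sketch of the same header-plus-rows output, built with the standard csv module rather than string slicing so that values containing commas or quotes are escaped. The function name rows_to_csv_lines and the sample rows are hypothetical; with a DataFrame, the columns could come from batch_df.columns and the rows from batch_df.values.tolist().

```python
import csv
import io


def rows_to_csv_lines(columns, rows):
    """Render a header and data rows as CSV text, one string per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)   # header line
    writer.writerows(rows)     # one line per record, quoted where needed
    return buffer.getvalue().splitlines()


# Hypothetical sample data; the second id contains a comma on purpose.
lines = rows_to_csv_lines(["id", "prediction"], [["a1", 0.5], ["b, 2", 1]])
for line in lines:
    print(line)
```

The string-replace approach turns every space into a comma, so a value that itself contains a space or comma would be split into extra columns. Letting the csv writer handle quoting avoids that.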

Can I parse through a single entry dictionary to create a data frame?

I called an API using requests.request in Python and am getting back a single-item dictionary. Is there an efficient way to parse the single item into a DataFrame? I'd like to eventually export it to CSV.
r = requests.request("GET", url, headers=headers, params=querystring)
x = r.json()
print(type(x))
shows that x is of type dict.
When I print x I get:
{"chart":{"result":[{"meta":{"currency":"USD","symbol":"AAPL","exchangeName":"NMS","instrumentType":"EQUITY","firstTradeDate":345479400,"regularMarketTime":1612451820,"gmtoffset":-18000,"timezone":"EST","exchangeTimezoneName":"America/New_York","regularMarketPrice":135.54,"chartPreviousClose":133.94,"previousClose":133.94,"scale":3,"priceHint":2,"currentTradingPeriod":{"pre":{"timezone":"EST","start":1612429200,"end":1612449000,"gmtoffset":-18000},"regular":{"timezone":"EST","start":1612449000,"end":1612472400,"gmtoffset":-18000},"post":{"timezone":"EST","start":1612472400,"end":1612486800,"gmtoffset":-18000}},"tradingPeriods":[[{"timezone":"EST","start":1612449000,"end":1612472400,"gmtoffset":-18000}]],"dataGranularity":"1m","range":"1d","validRanges":["3mo","5y","6mo","2y","ytd","1y","1d","max","5d","10y","1mo"]},"timestamp":[1612449000,1612449060,1612449120,1612449180,1612449240,1612449300,1612449360,1612449420,1612449480,1612449540,1612449600,1612449660,1612449720,1612449780,1612449840,1612449900,1612449960,1612450020,1612450080,1612450140,1612450200,1612450260,1612450320,1612450380,1612450440,1612450500,1612450560,1612450620,1612450680,1612450740,1612450800,1612450860,1612450920,1612450980,1612451040,1612451100,1612451160,1612451220,1612451280,1612451340,1612451400,1612451460,1612451520,1612451580,1612451640,1612451700,1612451760],"comparisons":[{"symbol":"MSFT","previousClose":243.0,"gmtoffset":-18000,"high":[243.0,243.2,243.06,241.7141,241.5323,241.49,241.89,242.34,242.5507,243.2399,242.72,242.659
The JSON you included is large and invalid (it is cut off partway through). Once it is repaired and loaded into a dict, called js here, standard techniques turn it into a DataFrame:
df = pd.json_normalize(js["chart"]["result"]).explode("comparisons")
df.join(df.comparisons.apply(pd.Series)).explode("high")
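For comparison, here is a pandas-free sketch of the same flattening idea, using a small hand-repaired excerpt with the keys and first few values visible in the question. It assumes each entry in a comparison's "high" list lines up one-to-one with the "timestamp" list, which the truncated payload does not confirm.

```python
import csv
import io

# Small excerpt mirroring the keys visible in the question's output.
js = {
    "chart": {
        "result": [
            {
                "meta": {"symbol": "AAPL", "currency": "USD"},
                "timestamp": [1612449000, 1612449060, 1612449120],
                "comparisons": [
                    {
                        "symbol": "MSFT",
                        "previousClose": 243.0,
                        "high": [243.0, 243.2, 243.06],
                    }
                ],
            }
        ]
    }
}

# One output row per (timestamp, comparison symbol) pair.
rows = []
for result in js["chart"]["result"]:
    for comparison in result["comparisons"]:
        # Assumption: highs are aligned with timestamps by position.
        for timestamp, high in zip(result["timestamp"], comparison["high"]):
            rows.append(
                {"timestamp": timestamp, "symbol": comparison["symbol"], "high": high}
            )

# Write the rows as CSV text; swap StringIO for an open file to export.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["timestamp", "symbol", "high"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```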

Update Sqlite w/ Python: InterfaceError: Error binding parameter 0 and None Type is not subscriptable

I've scraped some websites and stored the html info in a sqlite database. Now, I want to extract and store the email addresses. I'm able to successfully extract and print the id and emails. But, I keep getting TypeError: "'NoneType' object is not subscriptable" and "sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type" when I try to update the database with these new email addresses.
I've verified that the data types I'm using in the update statement are the same as in my database (id is class int and email is str). I've googled a bunch of different examples and mucked around with the syntax a lot.
I also tried removing the WHERE clause in the update statement but got the same errors.
import sqlite3
import re
conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
x = cur.execute('SELECT id, html FROM Pages WHERE html is NOT NULL and email is NULL ORDER BY RANDOM()').fetchone()
#print(x)#for testing purposes
for row in x:
    row = cur.fetchone()
    id = row[0]
    html = row[1]
    email = re.findall(b'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', html)
    #print(email)#testing purposes
    if not email:
        email = 'no email found'
    print(id, email)
    cur.execute('''UPDATE pages SET email = ? WHERE id = ? ''', (email, id))
conn.commit
I want the update statement to update the database with the extracted email addresses for the appropriate row.
There are a few things going on here.
First off, you don't want to do this:
for row in x:
    row = cur.fetchone()
If you want to iterate over the results returned by the query, you should consider something like this:
for row in cur.fetchall():
    id = row[0]
    html = row[1]
    # ...
To understand the rest of the errors you are seeing, let's take a look at them step by step.
TypeError: "'NoneType' object is not subscriptable":
This is likely generated here:
row = cur.fetchone()
id = row[0]
Cursor.fetchone returns None if the executed query doesn't match any rows or if there are no rows left in the result set. The next line, then, is trying to do None[0] which would raise the error in question.
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type:
re.findall returns a list of non-overlapping matches, not an individual match. There's no support for binding a Python list to a sqlite3 text column type. To fix this, you'll need to get the first element from the matched list (if it exists) and then pass that as your email parameter in the UPDATE.
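Putting both fixes together, a minimal sketch might look like the following. It uses an in-memory database with two hypothetical sample pages stored as text, so a str pattern is used rather than the bytes pattern from the question; adjust that if your html column holds bytes.

```python
import re
import sqlite3

# Raw string so the regex backslashes are passed through untouched.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+")

# Hypothetical stand-in for spider.sqlite, with two sample pages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pages (id INTEGER PRIMARY KEY, html TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO Pages (id, html) VALUES (?, ?)",
    [(1, "<p>contact: alice@example.com</p>"), (2, "<p>no address here</p>")],
)

# Fetch every pending row up front, then update inside the loop.
rows = conn.execute(
    "SELECT id, html FROM Pages WHERE html IS NOT NULL AND email IS NULL"
).fetchall()
for page_id, html in rows:
    matches = EMAIL_RE.findall(html)
    # Bind a single string, never the list that findall returns.
    email = matches[0] if matches else "no email found"
    conn.execute("UPDATE Pages SET email = ? WHERE id = ?", (email, page_id))
conn.commit()  # note the parentheses: a bare conn.commit does nothing

stored = dict(conn.execute("SELECT id, email FROM Pages"))
print(stored)
```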
re.findall() returns a list.
You want to iterate over that list:
for email in re.findall(..., str(html)):
    print(id, email)
    cur.execute(...)
The b'[a-z...' pattern is a bytes literal, so it only matches bytes input.
I recommend a raw string instead: r'[a-z...'.
It handles regex backslashes nicely.

Concatenating FOR loop output

I am very new to Python (first week of active use). I have some bash scripting experience but have decided to learn Python.
I have a variable of multiple strings which I am using to build a URL in FOR loop. The output of URL is JSON and I would like to concatenate complete output into one file.
I will put random URL for privacy reasons.
The code looks like this:
==================
numbers = ['24246', '83367', '37643', '24245', '24241', '77968', '63157', '76004', '71665']
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
    result = restAPI.json
==================
The problem is that if I do print(result) I only get the output of the last iteration, i.e. www.google.com/test/71665&test2
Creating a list by adding text = [] worked (the content was concatenated), but I would like to keep the original format.
text = []
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
Does anyone have an idea how to do this?
When the for loop ends, a variable assigned inside it keeps only the last value, because restAPI is overwritten on every iteration.
If you want to keep each response, create a list before the loop and append to it on every iteration, e.g.
url_list = []
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
    url_list.append(restAPI.json())
Or if you just wanted to print...
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
    print(restAPI.json())
If you add them to a list, you can run separate functions over the collected results afterwards.
If you think there might be duplicates, you can use a set() instead, which discards duplicates as values are added. Parsed JSON dicts are not hashable, so add the raw text instead: set_name.add(restAPI.text)
Better still, you could use a dict with the id as the key and the parsed JSON as the value:
dict_obj = dict()
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
    dict_obj[id] = restAPI.json()
That way you can query the dictionary later in the script.
Note that if you're querying many URLs, keeping all the JSON responses in memory might be demanding, depending on your hardware.
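To get everything into one output, the collected dict can be serialized in a single step. The sketch below replaces the network call with a hypothetical fetch_json helper so it runs on its own; in the real script that call would be s.get(...).json().

```python
import json

numbers = ['24246', '83367', '37643']


def fetch_json(item_id):
    """Hypothetical stand-in for s.get(...).json(); returns a parsed dict."""
    return {"id": item_id, "url": f"/test/{item_id}&test2"}


results = {}
for item_id in numbers:
    # Store each response under its id instead of overwriting one variable.
    results[item_id] = fetch_json(item_id)

# One JSON document holding every response, keyed by id.
combined = json.dumps(results, indent=2)
print(combined)
```

Writing combined to a file opened in "w" mode then gives a single file that is itself valid JSON, which simply concatenating the individual responses would not be.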

Proper Syntax for List Comprehension Involving an Integer and a Float?

I have a List of Lists that looks like this (Python3):
myLOL = ["['1466279297', '703.0']", "['1466279287', '702.0']", "['1466279278', '702.0']", "['1466279268', '706.0']", "['1466279258', '713.0']"]
I'm trying to use a list comprehension to convert the first item of each inner list to an int and the second item to a float so that I end up with this:
newLOL = [[1466279297, 703.0], [1466279287, 702.0], [1466279278, 702.0], [1466279268, 706.0], [1466279258, 713.0]]
I'm learning list comprehensions, can somebody please help me with this syntax?
Thank you!
[edit - to explain why I asked this question]
This question is a means to an end - the syntax requested is needed for testing. I'm collecting sensor data on a ZigBee network, and I'm using an Arduino to format the sensor messages in JSON. These messages are published to an MQTT broker (Mosquitto) running on a Raspberry Pi. A Redis server (also running on the Pi) serves as an in-memory message store. I'm writing a service (python-MQTT client) to parse the JSON and send a LoL (a sample of the data you see in my question) to Redis. Finally, I have a dashboard running on Apache on the Pi. The dashboard utilizes Highcharts to plot the sensor data dynamically (via a web socket connection between the MQTT broker and the browser). Upon loading the page, I pull historical chart data from my Redis LoL to "very quickly" populate the charts on my dashboard (before any realtime data is added dynamically). I realize I can probably format the sensor data the way I want in the Redis store, but that is a problem I haven't worked out yet. Right now, I'm trying to get my historical data to plot correctly in Highcharts. With the data properly formatted, I can get this piece working.
Well, you could use ast.literal_eval:
from ast import literal_eval
myLOL = ["['1466279297', '703.0']", "['1466279287', '702.0']", "['1466279278', '702.0']", "['1466279268', '706.0']", "['1466279258', '713.0']"]
items = [[int(literal_eval(i)[0]), float(literal_eval(i)[1])] for i in myLOL]
Try:
import json
newLOL = [[int(a[0]), float(a[1])] for a in (json.loads(s.replace("'", '"')) for s in myLOL)]
Here I'm considering each element of the list as a JSON, but since it's using ' instead of " for the strings, I have to replace it first (it only works because you said there will be only numbers).
This may work? I wish I was more clever.
newLOL = []
for listObj in myLOL:
    listObj = listObj.replace('[', '').replace(']', '').replace("'", '').split(',')
    newListObj = [int(listObj[0]), float(listObj[1])]
    newLOL.append(newListObj)
This iterates through your current list and peels each string apart into a list by removing the unwanted characters and splitting on the comma. It then builds a new list from the pieces, converting them to an int and a float respectively, and appends that newListObj to newLOL. This gives you actual lists within your list; your original input actually contains strings that merely look like lists.
This is a very strange format and the best solution is likely to change the code which generates that.
That being said, you can use ast.literal_eval to safely evaluate the elements of the list as Python literals:
>>> import ast
>>> lit = ast.literal_eval
>>> [[lit(str_val) for str_val in lit(str_list)] for str_list in myLOL]
[[1466279297, 703.0], [1466279287, 702.0], [1466279278, 702.0], [1466279268, 706.0], [1466279258, 713.0]]
We need to do it twice - once to turn the string into a list containing two strings, and then once per resulting string to convert it into a number.
Note that this will succeed even if the strings contain other valid tokens. If you want to validate the format too, you'd want to do something like:
>>> def process_str_list(str_list):
...     l = ast.literal_eval(str_list)
...     if not isinstance(l, list):
...         raise TypeError("Expected list")
...     str_int, str_float = l
...     return [int(str_int), float(str_float)]
...
>>> [process_str_list(str_list) for str_list in myLOL]
[[1466279297, 703.0], [1466279287, 702.0], [1466279278, 702.0], [1466279268, 706.0], [1466279258, 713.0]]
Your input consists of a list of strings, where each string is the string representation of a list. The first task is to convert the strings back into lists:
import ast
lol2 = map(ast.literal_eval, myLOL) # [['1466279297', '703.0'], ...]
Now, you can simply get int and float values from lol2:
newLOL = [[int(a[0]), float(a[1])] for a in lol2]
