My goal is to make multiple GET calls from the results of the first call, then concatenate the client information into a dataframe. Preferably a faster way, because I have a million client IDs.
--------------------------------------
| id | name | country | city | phone |
--------------------------------------
| 1 | Leo | France | Paris | 212...|
| . | .. | .. | .. | .. |
| 100| Bale | UK | London| 514...|
The basic request / results (all clients):
import requests
from requests.auth import HTTPBasicAuth
# the initial request which returns all clients
res0 = requests.get('https://x.y.z/api/v1/data/clients', auth=HTTPBasicAuth('me', 'blabla'))
# results
<?xml version="1.0" ?>
<queryResponse>
<entityId type="clients" url="https://x.y.z/api/v1/data/clients/1">1</entityId>
...
...
<entityId type="clients" url="https://x.y.z/api/v1/data/clients/100">100</entityId>
</queryResponse>
The detailed request / results (client infos)
# this request allows to get client informations
res1 = requests.get('https://x.y.z/api/v1/data/clients/1', auth=HTTPBasicAuth('me', 'blabla'))
# results
<queryResponse>
<entity type="client_infos" url="https://x.y.z/api/v1/data/clients/1">
<client_infos displayName="1" id="1">
<name>Leo Massina</name>
<country>France</country>
<city>1607695021057</city>
<phone>+212-61-88-65-123</phone>
</client_infos >
</entity>
You can use lxml to parse the responses, make the new calls, collect the tags and text into a dictionary, and create the dataframe.
(I used for loops for clarity; you can optimize the code if needed.)
Also, I did not retrieve the ids; if needed, they can be read from the id attribute of the client_infos tags.
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth
from lxml import etree

# parse the bytes of the first response (res0 from the question)
root = etree.fromstring(res0.content)
reqentity = []
data = {"name": [], "country": [], "city": [], "phone": []}
# one follow-up GET per entityId returned by the first call
for entity in root.findall('./entityId'):
    reqentity.append(requests.get(entity.attrib['url'], auth=HTTPBasicAuth('me', 'blabla')))
# parse each detailed response and collect the child tags of client_infos
for entity in reqentity:
    tree = etree.fromstring(entity.content)
    for item in tree.findall(".//client_infos/*"):
        data[item.tag].append(item.text)
df = pd.DataFrame(data)
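For a million IDs, the sequential loop above will dominate the runtime. A possible speed-up is to reuse one requests.Session and fetch the detail pages concurrently with a thread pool. This is only a sketch: the endpoint and credentials are the placeholders from the question, and the worker count is an arbitrary starting value, not a verified setting.
import concurrent.futures
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth
from lxml import etree

auth = HTTPBasicAuth('me', 'blabla')
session = requests.Session()  # reuse TCP connections across calls

def fetch_client(url):
    # one detailed call, parsed into a {tag: text} dict
    resp = session.get(url, auth=auth)
    tree = etree.fromstring(resp.content)
    return {item.tag: item.text for item in tree.findall(".//client_infos/*")}

res0 = session.get('https://x.y.z/api/v1/data/clients', auth=auth)
urls = [e.attrib['url'] for e in etree.fromstring(res0.content).findall('./entityId')]

# 20 workers is an arbitrary starting point; tune it to what the API tolerates
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    rows = list(pool.map(fetch_client, urls))

df = pd.DataFrame(rows)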
python3 newbie here. I am trying to create a variable that I can use to make a GPS timestamp from an Adafruit GPS sensor. I eventually want to store this in a database. The MySQL database has a timestamp feature when data is inserted into a table, so I want to store that as well as the UTC time and date that come from the GPS device.
It seems I have something wrong and cannot figure it out. The code is failing on this:
def gpstsRead():
    gps_ts = '{}/{}/{} {:02}:{:02}:{:02}'.format(
        gps.timestamp_utc.tm_mon,
        gps.timestamp_utc.tm_mday,
        gps.timestamp_utc.tm_year,
        gps.timestamp_utc.tm_hour,
        gps.timestamp_utc.tm_min,
        gps.timestamp_utc.tm_sec,
    )
    return gps_ts
I am trying to put all of these into a timestamp-like format. The error is this:
Traceback (most recent call last):
File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 57, in <module>
gps_ts = gpstsRead()
File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 20, in gpstsRead
gps.timestamp_utc.tm_mon,
AttributeError: 'NoneType' object has no attribute 'tm_mon'
I have made sure I use spaces instead of tabs, as that has caused me grief in the past. Beyond that I really don't know. I have been putzing with this for hours to no avail. Any ideas? Thanks for any suggestions.
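The traceback means gps.timestamp_utc is still None at the moment the format string runs; the driver only fills it in once it has parsed a sentence containing a time, so the call needs a guard (or has to wait for a fix, as the complete code further down does). A minimal sketch of such a guard, reusing the names from the question; returning None for "no timestamp yet" is just one possible convention:
def gpstsRead():
    ts = gps.timestamp_utc
    if ts is None:
        # no time parsed yet, e.g. right after start-up or before a fix
        return None
    return '{}/{}/{} {:02}:{:02}:{:02}'.format(
        ts.tm_mon, ts.tm_mday, ts.tm_year,
        ts.tm_hour, ts.tm_min, ts.tm_sec,
    )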
Thanks all for the input. After reading these I decided to try it a little differently. Instead of wrapping the formatting in a function with def, I decided to eliminate the def and just create the variable directly.
Like this:
# define values
gps_ts = '{}-{}-{} {:02}:{:02}:{:02}'.format(
    gps.timestamp_utc.tm_year,
    gps.timestamp_utc.tm_mon,
    gps.timestamp_utc.tm_mday,
    gps.timestamp_utc.tm_hour,
    gps.timestamp_utc.tm_min,
    gps.timestamp_utc.tm_sec,
)
And that seemed to work. For anyone doing something similar, I will include the complete code. I also understand that, like most languages, there is always more than one way to get the job done, some better than others, some not. I am still learning. If anyone cares to point out how I could accomplish the same task differently or more efficiently, please feel free to give me the opportunity to learn. Thanks again!
#!/usr/bin/python3
import pymysql
import time
import board
from busio import I2C
import adafruit_gps

i2c = I2C(board.SCL, board.SDA)
gps = adafruit_gps.GPS_GtopI2C(i2c)  # Use I2C interface
gps.send_command(b"PMTK314,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0")
gps.send_command(b"PMTK220,1000")
last_print = time.monotonic()

# Open database connection
db = pymysql.connect("localhost", "database", "password", "table")
# prepare a cursor object using cursor() method
cursor = db.cursor()

while True:
    gps.update()
    current = time.monotonic()
    if current - last_print >= 1.0:  # update rate
        last_print = current
        if not gps.has_fix:
            # Try again if we don't have a fix yet.
            print("Waiting for a satellite fix...")
            continue
        # define values
        gps_ts = '{}-{}-{} {:02}:{:02}:{:02}'.format(
            gps.timestamp_utc.tm_year,
            gps.timestamp_utc.tm_mon,
            gps.timestamp_utc.tm_mday,
            gps.timestamp_utc.tm_hour,
            gps.timestamp_utc.tm_min,
            gps.timestamp_utc.tm_sec,
        )
        gps_lat = '{}'.format(gps.latitude)
        gps_long = '{}'.format(gps.longitude)
        gps_fix = '{}'.format(gps.fix_quality)
        gps_sat = '{}'.format(gps.satellites)
        gps_alt = '{}'.format(gps.altitude_m)
        gps_speed = '{}'.format(gps.speed_knots)
        gps_track = '{}'.format(gps.track_angle_deg)
        sql = "INSERT into ek9_gps(gps_timestamp_utc, latitude, \
            longitude, fix_quality, number_satellites, gps_altitude, \
            gps_speed, gps_track_angle) \
            values (%s,%s,%s,%s,%s,%s,%s,%s)"
        arg = (gps_ts, gps_lat, gps_long, gps_fix, gps_sat,
               gps_alt, gps_speed, gps_track)
        try:
            # Execute the SQL command
            cursor.execute(sql, arg)
            # Commit your changes in the database
            db.commit()
        except:
            print('There was an error on input into the database')
            # Rollback in case there is any error
            db.rollback()

# disconnect from server
cursor.close()
db.close()
And this is what MariaDB shows:
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| id | datetime | gps_timestamp_utc | latitude | longitude | fix_quality | number_satellites | gps_altitude | gps_speed | gps_track_angle |
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| 11 | 2020-12-30 14:14:42 | 2020-12-30 20:14:42 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 10 | 2020-12-30 14:14:41 | 2020-12-30 20:14:41 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 9 | 2020-12-30 14:14:39 | 2020-12-30 20:14:39 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 8 | 2020-12-30 14:14:38 | 2020-12-30 20:14:38 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
Success!!! Thanks again!
I'm trying to understand Apache Beam. I was following the programming guide, and in one example it says: "The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection."
I was quite surprised, because I didn't see a ParDo operation at any point, so I started wondering whether the | was actually the ParDo. The code looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

emails_list = [
    ('amy', 'amy@example.com'),
    ('carl', 'carl@example.com'),
    ('julia', 'julia@example.com'),
    ('carl', 'carl@email.com'),
]
phones_list = [
    ('amy', '111-222-3333'),
    ('james', '222-333-4444'),
    ('amy', '333-444-5555'),
    ('carl', '444-555-6666'),
]

pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    emails = p | 'CreateEmails' >> beam.Create(emails_list)
    phones = p | 'CreatePhones' >> beam.Create(phones_list)

    results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())

    def join_info(name_info):
        (name, info) = name_info
        return '%s; %s; %s' % \
            (name, sorted(info['emails']), sorted(info['phones']))

    contact_lines = results | beam.Map(join_info)
I do notice that emails and phones are read at the start of the pipeline, so I guess both of them are different PCollections, right? But where is the ParDo executed? What do the | and >> actually mean? And how can I see the actual output of this? Does it matter if the join_info function, emails_list and phones_list are defined outside the DAG?
The | represents a separation between steps; that is (using p as the PBegin): p | ReadFromText(..) | ParDo(..) | GroupByKey().
You can also reference other PCollections before |:
read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()
That's equivalent to the previous pipeline: p | ReadFromText(..) | ParDo(..) | GroupByKey()
The >> is used between | and the PTransform to name the step: p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()
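As for the other questions: beam.Map (like FlatMap and Filter) is a convenience wrapper that executes as a ParDo, which is why the guide describes the join_info step as a ParDo even though the word never appears in the snippet, and it does not matter that join_info or the input lists are defined outside the pipeline. A minimal sketch that names that step explicitly and prints the result on the local DirectRunner (using shortened copies of the question's data, so the values here are illustrative):
import apache_beam as beam

emails_list = [('amy', 'amy@example.com'), ('carl', 'carl@example.com')]
phones_list = [('amy', '111-222-3333'), ('carl', '444-555-6666')]

def join_info(name_info):
    (name, info) = name_info
    return '%s; %s; %s' % (name, sorted(info['emails']), sorted(info['phones']))

with beam.Pipeline() as p:
    emails = p | 'CreateEmails' >> beam.Create(emails_list)
    phones = p | 'CreatePhones' >> beam.Create(phones_list)
    results = ({'emails': emails, 'phones': phones}
               | 'CoGroupByKey' >> beam.CoGroupByKey())
    (results
     | 'FormatContacts' >> beam.Map(join_info)  # Map runs as a ParDo under the hood
     | 'PrintContacts' >> beam.Map(print))      # quick way to see the output locally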
How can I format the data below into tabular form using Python?
Is there any way to print/write the data in the expected format?
[{"itemcode":null,"productname":"PKS543452","value_2018":null},
{"itemcode":null,"productname":"JHBG6%&9","value_2018":null},
{"itemcode":null,"productname":"VATER3456","value_2018":null},
{"itemcode":null,"productname":"ACDFER3434","value_2018":null}]
Expected output:
|itemcode | Productname | Value_2018 |
|null |PKS543452|null|
|null |JHBG6%&9|null|
|null |VATER3456|null|
|null |ACDFER3434|null|
You can use pandas to generate a dataframe from the list of dictionaries:
import pandas as pd
null = "null"
lst = [{"itemcode":null,"productname":"PKS543452","value_2018":null},
{"itemcode":null,"productname":"JHBG6%&9","value_2018":null},
{"itemcode":null,"productname":"VATER3456","value_2018":null},
{"itemcode":null,"productname":"ACDFER3434","value_2018":null}]
df = pd.DataFrame.from_dict(lst)
print(df)
Output:
itemcode productname value_2018
0 null PKS543452 null
1 null JHBG6%&9 null
2 null VATER3456 null
3 null ACDFER3434 null
This makes it easy to manipulate data in the table later on. Otherwise, you can print your desired output using built-in string methods:
col_names = '|' + ' | '.join(lst[0].keys()) + '|'
print(col_names)
for dic in lst:
    row = '|' + ' | '.join(dic.values()) + '|'
    print(row)
Output:
|itemcode | productname | value_2018|
|null | PKS543452 | null|
|null | JHBG6%&9 | null|
|null | VATER3456 | null|
|null | ACDFER3434 | null|
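If you would rather stay with pandas but still print a pipe-delimited table close to the expected output, df.to_markdown is one more option; this is a sketch assuming the optional tabulate package (which to_markdown depends on) is installed:
# requires: pip install tabulate
print(df.to_markdown(index=False))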
You can try it like this as well (without using pandas). I have commented every line in the code itself, so don't forget to read the comments.
Note: the list/array that you have pasted is either the result of json.dumps() (in Python, text) or a copied API response (JSON).
null comes from JavaScript, and the pasted list/array is not a valid Python list, but it can be treated as text and converted back to a Python list using json.loads(). In that case, null is converted to None.
That is why, to produce the wanted output, we need an extra check like "null" if d[key] is None else d[key].
import json

# `null` is used in JavaScript (JSON is JavaScript), so the pasted data is treated as a string
json_text = """[{"itemcode":null,"productname":"PKS543452","value_2018":null},
{"itemcode":null,"productname":"JHBG6%&9","value_2018":null},
{"itemcode":null,"productname":"VATER3456","value_2018":null},
{"itemcode":null,"productname":"ACDFER3434","value_2018":null}]"""

# Will contain the rows (text)
texts = []

# Convert back to a Python list object; `null` (JavaScript) becomes `None` (Python)
l = json.loads(json_text)

# Obtain the keys once so every dictionary in the list is iterated in the same order;
# columns could otherwise end up in different positions, though each value would still match its column.
# If you wish you can hard-code the `keys`; here I get them from `l[0].keys()`
keys = l[0].keys()

# Form the header and add it to the `texts` list
header = '|' + ' | '.join(keys) + " |"
texts.append(header)

# Form the body (rows) and append them to the `texts` list
rows = ['| ' + "|".join(["null" if d[key] is None else d[key] for key in keys]) + "|" for d in l]
texts.extend(rows)

# Print all rows (including the header) separated by newline '\n'
answer = '\n'.join(texts)
print(answer)
Output
|itemcode | productname | value_2018 |
| null|PKS543452|null|
| null|JHBG6%&9|null|
| null|VATER3456|null|
| null|ACDFER3434|null|
I have a DataFrame (df) in PySpark, created by reading from a Hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'.
How do I create a dictionary like the following, which can be iterated over later on?
{'person1':'google','msn','yahoo'}
{'person2':'fb.com','airbnb','wired.com'}
{'person3':'fb.com','google.com'}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, and you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
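If you then want those dictionaries back on the driver (for example, to iterate over them locally), you would still finish with a collect(); a small sketch, with the keys assumed to match the column names in the table above:
dicts = df.rdd.map(lambda row: row.asDict()).collect()
for d in dicts:
    print(d['Name'], d['URL visited'])  # keys are the DataFrame's column names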
How about using the pyspark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
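If a single dictionary keyed by name is more convenient than a list of one-entry dicts, a small variation on the same idea (column names again assumed to be Name and URL visited):
name_to_urls = {row['Name']: row['URL visited'] for row in df.collect()}
# name_to_urls['person1'] would then be ['google', 'msn', 'yahoo']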
If you wanted your results in a python dictionary, you could use collect() [1] to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
[1] Be advised that for large data sets, this operation can be slow and may fail with an out-of-memory error. Consider first whether this is really what you want to do, as you will lose the parallelization benefits of Spark by bringing the data into local memory.
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
.rdd \
.map(lambda row: {row[0]: row[1]}) \
.collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please let me know if that works for you :)
I need to join the requests and customMetrics tables by parsed URL. The output should have the common parsed URL, the average duration of requests, and the average value from customMetrics.
This code doesn't work :(
let parseUrlOwn = (stringUrl:string) {
let halfparsed = substring(stringUrl,157);
substring(halfparsed,0 , indexof(halfparsed, "?"))
};
customMetrics
| where name == "Api.GetData"
| extend urlURI = tostring(customDimensions.RequestedUri)
| extend urlcustomMeticsParsed = parseUrlOwn(urlURI)
| extend unionColumnUrl = urlcustomMeticsParsed
| summarize summaryCustom = avg(value) by unionColumnUrl
| project summaryCustom, unionColumnUrl
| join (
requests
| where isnotempty(cloud_RoleName)
| extend urlRequestsParsed = parseUrlOwn(url)
| extend unionColumnUrl = urlRequestsParsed
| summarize summaryRequests =sum(itemCount), avg(duration)
| project summaryRequests, unionColumnUrl
) on unionColumnUrl
Instead of inventing your own URL parsing, how about using parse_url (https://docs.loganalytics.io/docs/Language-Reference/Scalar-functions/parse_url())?
It also appears that the summarize line inside the requests join isn't summarizing by URL, so I'm not sure how that works.
Shouldn't this line:
| summarize summaryRequests =sum(itemCount), avg(duration)
be
| summarize summaryRequests =sum(itemCount), avg(duration) by unionColumnUrl
like it is in the metrics part of the query? Also, why are you calculating the average in that summarize? You're just throwing it away by not projecting it on the next line.