want to create a timestamp using GPS data with python3

Python 3 newbie here. I am trying to create a variable that I can use to make a GPS timestamp from an Adafruit GPS sensor. I eventually want to store this in a database. The MySQL database already has a timestamp column that is filled when data is inserted into a table, so I want to store both that and the UTC date and time that come from the GPS device.
It seems I have something wrong and cannot figure it out. The code is hanging on this:
def gpstsRead():
    gps_ts = '{}/{}/{} {:02}:{:02}:{:02}'.format(
        gps.timestamp_utc.tm_mon,
        gps.timestamp_utc.tm_mday,
        gps.timestamp_utc.tm_year,
        gps.timestamp_utc.tm_hour,
        gps.timestamp_utc.tm_min,
        gps.timestamp_utc.tm_sec,
    )
    return gps_ts
I am trying to put all of these into a timestamp-like format. The error is this:
Traceback (most recent call last):
  File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 57, in <module>
    gps_ts = gpstsRead()
  File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 20, in gpstsRead
    gps.timestamp_utc.tm_mon,
AttributeError: 'NoneType' object has no attribute 'tm_mon'
I have made sure I use spaces instead of tabs, as that has caused me grief in the past. Beyond that I really don't know; I have been putzing with this for hours to no avail. Any ideas? Thanks for any suggestions.
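For anyone hitting the same AttributeError: gps.timestamp_utc stays None until the module has parsed at least one sentence containing a valid UTC time, so the formatting call fails if it runs before that happens. A minimal sketch of a guard, assuming the same gps object as in the code below:
def gpstsRead():
    # timestamp_utc is None until the GPS has reported a valid UTC time
    if gps.timestamp_utc is None:
        return None
    return '{}/{}/{} {:02}:{:02}:{:02}'.format(
        gps.timestamp_utc.tm_mon,
        gps.timestamp_utc.tm_mday,
        gps.timestamp_utc.tm_year,
        gps.timestamp_utc.tm_hour,
        gps.timestamp_utc.tm_min,
        gps.timestamp_utc.tm_sec,
    )
The caller then needs to skip the database insert (or retry) whenever the function returns None.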

Thanks all for the input. After reading these I decided to try it a little differently. Instead of wrapping the formatting in a function with "def", I eliminated the "def" and just created the variable itself.
Like this:
# define values
gps_ts = ('{}-{}-{} {:02}:{:02}:{:02}'.format(
    gps.timestamp_utc.tm_year,
    gps.timestamp_utc.tm_mon,
    gps.timestamp_utc.tm_mday,
    gps.timestamp_utc.tm_hour,
    gps.timestamp_utc.tm_min,
    gps.timestamp_utc.tm_sec,)
)
And that seemed to work. For anyone doing something similar, I will include the complete code. I also understand that, as in most languages, there is always more than one way to get the job done, some better than others. I am still learning, so if anyone cares to point out how I could accomplish the same task differently or more efficiently, please feel free to give me the opportunity to learn. Thanks again!
#!/usr/bin/python3
import pymysql
import time
import board
from busio import I2C
import adafruit_gps

i2c = I2C(board.SCL, board.SDA)
gps = adafruit_gps.GPS_GtopI2C(i2c)  # Use I2C interface
gps.send_command(b"PMTK314,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0")
gps.send_command(b"PMTK220,1000")
last_print = time.monotonic()

# Open database connection
db = pymysql.connect("localhost", "database", "password", "table")
# prepare a cursor object using cursor() method
cursor = db.cursor()

while True:
    gps.update()
    current = time.monotonic()
    if current - last_print >= 1.0:  # update rate
        last_print = current
        if not gps.has_fix:
            # Try again if we don't have a fix yet.
            print("Waiting for a satellite fix...")
            continue
        # define values
        gps_ts = ('{}-{}-{} {:02}:{:02}:{:02}'.format(
            gps.timestamp_utc.tm_year,
            gps.timestamp_utc.tm_mon,
            gps.timestamp_utc.tm_mday,
            gps.timestamp_utc.tm_hour,
            gps.timestamp_utc.tm_min,
            gps.timestamp_utc.tm_sec,)
        )
        gps_lat = '{}'.format(gps.latitude)
        gps_long = '{}'.format(gps.longitude)
        gps_fix = '{}'.format(gps.fix_quality)
        gps_sat = '{}'.format(gps.satellites)
        gps_alt = '{}'.format(gps.altitude_m)
        gps_speed = '{}'.format(gps.speed_knots)
        gps_track = '{}'.format(gps.track_angle_deg)
        sql = "INSERT into ek9_gps(gps_timestamp_utc, latitude, \
            longitude, fix_quality, number_satellites, gps_altitude, \
            gps_speed, gps_track_angle) \
            values (%s,%s,%s,%s,%s,%s,%s,%s)"
        arg = (gps_ts, gps_lat, gps_long, gps_fix, gps_sat,
               gps_alt, gps_speed, gps_track)
        try:
            # Execute the SQL command
            cursor.execute(sql, arg)
            # Commit your changes in the database
            db.commit()
        except:
            print('There was an error on input into the database')
            # Rollback in case there is any error
            db.rollback()

# disconnect from server
cursor.close()
db.close()
And this is what MariaDB shows:
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| id | datetime            | gps_timestamp_utc   | latitude | longitude | fix_quality | number_satellites | gps_altitude | gps_speed | gps_track_angle |
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| 11 | 2020-12-30 14:14:42 | 2020-12-30 20:14:42 | xx.xxxx  | -xx.xxxx  |           1 |                10 |          232 |         0 |             350 |
| 10 | 2020-12-30 14:14:41 | 2020-12-30 20:14:41 | xx.xxxx  | -xx.xxxx  |           1 |                10 |          232 |         0 |             350 |
|  9 | 2020-12-30 14:14:39 | 2020-12-30 20:14:39 | xx.xxxx  | -xx.xxxx  |           1 |                10 |          232 |         0 |             350 |
|  8 | 2020-12-30 14:14:38 | 2020-12-30 20:14:38 | xx.xxxx  | -xx.xxxx  |           1 |                10 |          232 |         0 |             350 |
Success!!! Thanks again!
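One small refinement worth mentioning (a suggestion, not part of the original post): gps.timestamp_utc is a time.struct_time, so the six-field format call can be replaced with time.strftime, which produces the same 'YYYY-MM-DD HH:MM:SS' string that MySQL expects for DATETIME columns:
if gps.timestamp_utc is not None:
    # struct_time -> 'YYYY-MM-DD HH:MM:SS'
    gps_ts = time.strftime('%Y-%m-%d %H:%M:%S', gps.timestamp_utc)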

Related

Line Fitting using LinearRegression in Pyspark gives wildly different coefficients

I have a dataframe like so:
+---------+------------------+
|rownumber| Moving_Ratio|
+---------+------------------+
| 1000|105.67198820168865|
| 1001|105.65729748456914|
| 1002| 105.6426671752822|
| 1003|105.62808965618223|
| 1004|105.59623035662119|
| 1005|105.52385366516299|
| 1006|105.44762361744378|
| 1007|105.35977134665733|
| 1008|105.25685407339793|
| 1009|105.16307473993363|
| 1010|105.06600545864703|
| 1011|104.96056753478364|
| 1012|104.84525664217107|
| 1013| 104.7401615868953|
| 1014| 104.6283459710509|
| 1015|104.53484736833259|
| 1017|104.43492576734955|
| 1019|104.33599903547659|
| 1020|104.24640223269283|
| 1021|104.15275303890549|
+---------+------------------+
There are 10k rows, I've just truncated it for the sample view.
The data is by no means linear. However, I'm not worried about a perfect fit for each and every data point; I'm basically looking to fit a line that captures the direction of the curve and find its slope, like the green trend line that a statistics package drew in the original plot.
The feature column I'm trying to fit in a line is Moving_Ratio
The min and max values of Moving_Ratio are:
+-----------------+------------------+
|min(Moving_Ratio)| max(Moving_Ratio)|
+-----------------+------------------+
|26.73629202745194|121.84100616620908|
+-----------------+------------------+
I tried creating a simple linear model with the following code:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

vect_assm = VectorAssembler(inputCols=['Moving_Ratio'], outputCol='features')
df_vect = vect_assm.transform(df)
lir = LinearRegression(featuresCol='features', labelCol='rownumber', maxIter=50,
                       regParam=0.3, elasticNetParam=0.8)
model = lir.fit(df_vect)
Predictions = model.transform(df_vect)
coeff = model.coefficients
When I look at the predictions, I seem to be getting values nowhere near the original data corresponding to those rownumbers.
Predictions.show()
+---------+------------------+--------------------+-----------------+
|rownumber| Moving_Ratio| features| prediction|
+---------+------------------+--------------------+-----------------+
| 1000|105.67198820168865|[105.67198820168865]|8935.419272488462|
| 1001|105.65729748456914|[105.65729748456914]| 8934.20373303444|
| 1002| 105.6426671752822| [105.6426671752822]|8932.993191845864|
| 1003|105.62808965618223|[105.62808965618223]|8931.787018623438|
| 1004|105.59623035662119|[105.59623035662119]|8929.150916159619|
| 1005|105.52385366516299|[105.52385366516299]| 8923.1623232745|
| 1006|105.44762361744378|[105.44762361744378]|8916.854895949407|
| 1007|105.35977134665733|[105.35977134665733]| 8909.58582253401|
| 1008|105.25685407339793|[105.25685407339793]|8901.070240542358|
| 1009|105.16307473993363|[105.16307473993363]|8893.310750051145|
| 1010|105.06600545864703|[105.06600545864703]|8885.279042666287|
| 1011|104.96056753478364|[104.96056753478364]| 8876.55489697866|
| 1012|104.84525664217107|[104.84525664217107]|8867.013842017961|
| 1013| 104.7401615868953| [104.7401615868953]|8858.318065966234|
| 1014| 104.6283459710509| [104.6283459710509]|8849.066217228752|
| 1015|104.53484736833259|[104.53484736833259]|8841.329954963563|
| 1017|104.43492576734955|[104.43492576734955]|8833.062240915566|
| 1019|104.33599903547659|[104.33599903547659]|8824.876844336828|
| 1020|104.24640223269283|[104.24640223269283]|8817.463424838508|
| 1021|104.15275303890549|[104.15275303890549]| 8809.71470236567|
+---------+------------------+--------------------+-----------------+
Predictions.select(min('prediction'),max('prediction')).show()
+-----------------+------------------+
| min(prediction)| max(prediction)|
+-----------------+------------------+
|2404.121157489531|10273.276308929268|
+-----------------+------------------+
coeff[0]
82.74200940195973
The min and max of the predictions are completely outside the input data.
What am I doing wrong?
Any help will be greatly appreciated
When you initialize the LinearRegression object, featuresCol should point at the assembled features (the independent variables) and labelCol at the label (the dependent variable). Since you are predicting Moving_Ratio, the roles are swapped in your code: assemble rownumber into the features vector and set labelCol='Moving_Ratio' to specify the LinearRegression correctly.
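A minimal sketch of that corrected setup, reusing the dataframe and parameter values from the question:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# rownumber is the independent variable, so it goes into the features vector
vect_assm = VectorAssembler(inputCols=['rownumber'], outputCol='features')
df_vect = vect_assm.transform(df)

lir = LinearRegression(featuresCol='features', labelCol='Moving_Ratio',
                       maxIter=50, regParam=0.3, elasticNetParam=0.8)
model = lir.fit(df_vect)

# slope of the fitted trend line
slope = model.coefficients[0]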

How to manipulate nested GET calls

My goal is to make multiple GET calls using the results of the first call, then concatenate the clients' information into a dataframe. Preferably in a fast way, because I have a million client ids:
------------------------------------------
| id  | name | country | city   | phone  |
------------------------------------------
| 1   | Leo  | France  | Paris  | 212... |
| ... | ...  | ...     | ...    | ...    |
| 100 | Bale | UK      | London | 514... |
The basic request / results (all clients):
import requests
from requests.auth import HTTPBasicAuth
# the initial request which returns all clients
res0 = requests.get('https://x.y.z/api/v1/data/clients', auth=HTTPBasicAuth('me', 'blabla'))
# results
<?xml version="1.0" ?>
<queryResponse>
<entityId type="clients" url="https://x.y.z/api/v1/data/clients/1">1</entityId>
...
...
<entityId type="clients" url="https://x.y.z/api/v1/data/clients/100">100</entityId>
</queryResponse>
The detailed request / results (client infos)
# this request allows to get client informations
res1 = requests.get('https://x.y.z/api/v1/data/clients/1', auth=HTTPBasicAuth('me', 'blabla'))
# results
<queryResponse>
<entity type="client_infos" url="https://x.y.z/api/v1/data/clients/1">
<client_infos displayName="1" id="1">
<name>Leo Massina</name>
<country>France</country>
<city>1607695021057</city>
<phone>+212-61-88-65-123</phone>
</client_infos >
</entity>
You can use lxml to parse the response and make the new calls, collect the tags and text into a dictionary, and create the dataframe.
(I used for loops for clarity; you can optimize the code if needed.)
Also, I did not retrieve the ids; if needed, they can be read from the attributes of the client_infos tags.
import pandas as pd
import requests
from lxml import etree
from requests.auth import HTTPBasicAuth

# parse the list of clients, then fetch and parse each client's details
root = etree.fromstring(res0.content)
reqentity = []
data = {"name": [], "country": [], "city": [], "phone": []}
for entity in root.findall('./entityId'):
    reqentity.append(requests.get(entity.attrib['url'], auth=HTTPBasicAuth('me', 'blabla')))
for entity in reqentity:
    entity = etree.fromstring(entity.content)
    for item in entity.findall(".//client_infos//"):
        data[item.tag].append(item.text)
df = pd.DataFrame(data)
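Since the question mentions a million client ids, the sequential loop above will be the bottleneck; the per-client requests can be issued concurrently. A sketch using concurrent.futures (the max_workers value and the fetch_client helper are illustrative, not part of the original answer):
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests
from lxml import etree
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('me', 'blabla')

def fetch_client(url):
    # fetch one client and return its child tags (name, country, city, phone) as a dict
    resp = requests.get(url, auth=auth)
    info = etree.fromstring(resp.content).find(".//client_infos")
    return {child.tag: child.text for child in info}

urls = [e.attrib['url'] for e in etree.fromstring(res0.content).findall('./entityId')]
with ThreadPoolExecutor(max_workers=20) as pool:
    rows = list(pool.map(fetch_client, urls))
df = pd.DataFrame(rows)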

How to Convert a Normal Dataframe into MultiIndex'd based on certain condition

After a long while away from SO's pandas section, I came across a question that wasn't nicely framed, so I thought I'd put it here in an explicit way, as I'm in a similar kind of situation myself :-)
Below is the data frame construct:
>>> df
measure Pend Job Run Job Time
cls
ABC [inter, batch] [101, 93] [302, 1327] [56, 131]
DEF [inter, batch] [24279, 421] [4935, 5452] [75, 300]
The desired output would be as sketched below. I tried hard but didn't get a solution, so I have drawn roughly what I would like to achieve:
    |              | Pend Job      | Run Job       | Time          |
cls | measure      | inter | batch | inter | batch | inter | batch |
----|--------------|-------|-------|-------|-------|-------|-------|
ABC | inter, batch | 101   | 93    | 302   | 1327  | 56    | 131   |
DEF | inter, batch | 24279 | 421   | 4935  | 5452  | 75    | 300   |
In other words, I want my DataFrame turned into a MultiIndex DataFrame with Pend Job, Run Job, and Time on the top level of the columns, as above.
Edit:
cls is not in the columns
This is my approach; you can modify it to your needs:
import numpy as np

s = (df.drop('measure', axis=1)              # remove the measure column
       .set_index(df['measure'].apply(', '.join),
                  append=True)               # make `measure` the second index level
       .stack().explode().to_frame()         # concatenate all the values
    )

# assign an `inter` or `batch` label to each new cell
new_lvl = np.array(['inter', 'batch'])[s.groupby(level=(0, 1, 2)).cumcount()]
# or
# new_lvl = np.tile(['inter', 'batch'], len(s)//2)

(s.set_index(new_lvl, append=True)[0]
  .unstack(level=(-2, -1))
  .reset_index()
)
Output:
cls measure Pend Job
inter batch
0 ABC inter, batch 101 93
1 DEF inter, batch 24279 421
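An alternative sketch (not from the original answer) that builds the wide frame directly by expanding each list column into labelled sub-columns; the hard-coded ['inter', 'batch'] labels are an assumption read off the measure column:
import pandas as pd

value_cols = ['Pend Job', 'Run Job', 'Time']
parts = {
    col: pd.DataFrame(df[col].tolist(), index=df.index, columns=['inter', 'batch'])
    for col in value_cols
}
wide = pd.concat(parts, axis=1)              # columns become a (name, label) MultiIndex
wide[('measure', '')] = df['measure'].apply(', '.join)
wide = wide[[('measure', '')] + [(c, l) for c in value_cols for l in ('inter', 'batch')]]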

How to count distinct element over multiple columns and a rolling window in PySpark [duplicate]

This question already has answers here:
pyspark: count distinct over a window
(2 answers)
Closed 1 year ago.
Let's imagine we have the following dataframe:
port | flag | timestamp
---------------------------------------
20   | S    | 2009-04-24T17:13:14+00:00
30   | R    | 2009-04-24T17:14:14+00:00
32   | S    | 2009-04-24T17:15:14+00:00
21   | R    | 2009-04-24T17:16:14+00:00
54   | R    | 2009-04-24T17:17:14+00:00
24   | R    | 2009-04-24T17:18:14+00:00
I would like to calculate the number of distinct (port, flag) pairs over the last 3 hours in PySpark.
The result will be something like:
port | flag | timestamp                  | distinct_port_flag_overs_3h
----------------------------------------------------------------------
20   | S    | 2009-04-24T17:13:14+00:00  | 1
30   | R    | 2009-04-24T17:14:14+00:00  | 1
32   | S    | 2009-04-24T17:15:14+00:00  | 2
21   | R    | 2009-04-24T17:16:14+00:00  | 2
54   | R    | 2009-04-24T17:17:14+00:00  | 2
24   | R    | 2009-04-24T17:18:14+00:00  | 3
The SQL request looks like :
SELECT
    COUNT(DISTINCT port) OVER my_window AS distinct_port_flag_overs_3h
FROM my_table
WINDOW my_window AS (
    PARTITION BY flag
    ORDER BY CAST(timestamp AS timestamp)
    RANGE BETWEEN INTERVAL 3 HOUR PRECEDING AND CURRENT ROW
)
I found this topic that solves the problem, but only if we want to count distinct elements over a single field.
Does anyone have an idea of how to achieve that in:
python 3.7
pyspark 2.4.4
Just collect a set of structs (port, flag) and get its size. Something like this:
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_set, size, struct, to_timestamp

w = Window.partitionBy("flag").orderBy("timestamp").rangeBetween(-10800, Window.currentRow)
df.withColumn("timestamp", to_timestamp("timestamp").cast("long"))\
    .withColumn("distinct_port_flag_overs_3h", size(collect_set(struct("port", "flag")).over(w)))\
    .orderBy(col("timestamp"))\
    .show()
I've just coded something like that, which works too:
import re
import traceback

import pyspark.sql.functions as F
from pyspark.sql import Window

def hive_time(time: str) -> int:
    """
    Convert a string duration to a number of seconds.
    time : str : must be in the format <number><unit>,
    for example 1hour, 4day, 3month
    """
    match = re.match(r"([0-9]+)([a-z]+)", time, re.I)
    if match:
        items = match.groups()
        nb, kind = items[0], items[1]
        try:
            nb = int(nb)
        except ValueError as e:
            print(e, traceback.format_exc())
            print("The format of {}, which is your time aggregation, is not recognized. Please read the doc".format(time))
        if kind == "second":
            return nb
        if kind == "minute":
            return 60*nb
        if kind == "hour":
            return 3600*nb
        if kind == "day":
            return 24*3600*nb
    assert False, "The format of {} which is your time aggregation is not recognized. \
        Please read the doc".format(time)

# Rolling window in spark
def distinct_count_over(data, window_size: str, out_column: str, *input_columns, time_column: str = 'timestamp'):
    """
    data : pyspark dataframe
    window_size : size of the rolling window, check the doc for format information
    out_column : name of the column where you want to store the results
    input_columns : the columns where you want to count distinct values
    time_column : the name of the column where the time field is stored (must be ISO 8601)
    return : a new dataframe with the stored result
    """
    concatenated_columns = F.concat(*input_columns)
    w = (Window.orderBy(F.col("timestampGMT").cast('long'))
               .rangeBetween(-hive_time(window_size), 0))
    return data \
        .withColumn('timestampGMT', data[time_column].cast('timestamp')) \
        .withColumn(out_column, F.size(F.collect_set(concatenated_columns).over(w)))
Works well; I haven't checked the performance yet.
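A quick usage sketch against the example dataframe from the question (the call itself is illustrative; the column names are the ones shown above):
df_out = distinct_count_over(df, "3hour", "distinct_port_flag_overs_3h", "port", "flag")
df_out.show()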

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in pyspark, read from a Hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++
| Name    | URL visited               |
+++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo]        |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com]       |
+++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is of 'pyspark.sql.column.Column'
How do I create a dictionary like the following, which can be iterated over later?
{'person1':'google','msn','yahoo'}
{'person2':'fb.com','airbnb','wired.com'}
{'person3':'fb.com','google.com'}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, and you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
How about using the pyspark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and does not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
If you wanted your results in a python dictionary, you could use collect()1 to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
1 Be advised that for large data sets this operation can be slow and can potentially fail with an out-of-memory error. Consider first whether this is what you really want to do, as you lose the parallelization benefits of Spark by bringing the data into local memory.
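If a single dictionary keyed by name is what's needed (rather than a list of one-entry dicts), a comprehension over the collected rows works; a small sketch, with the same caveat about collecting to the driver:
name_to_urls = {row['Name']: row['URL visited'] for row in df.collect()}
# {'person1': ['google', 'msn', 'yahoo'],
#  'person2': ['fb.com', 'airbnb', 'wired.com'],
#  'person3': ['fb.com', 'google.com']}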
Given:
+++++++++++++++++++++++++++++++++++++++
| Name    | URL visited               |
+++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo]        |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com]       |
+++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
    .rdd \
    .map(lambda row: {row[0]: row[1]}) \
    .collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please let me know if that works for you :)
