Line Fitting using LinearRegression in PySpark gives wildly different coefficients - apache-spark

I have a dataframe like so:
+---------+------------------+
|rownumber| Moving_Ratio|
+---------+------------------+
| 1000|105.67198820168865|
| 1001|105.65729748456914|
| 1002| 105.6426671752822|
| 1003|105.62808965618223|
| 1004|105.59623035662119|
| 1005|105.52385366516299|
| 1006|105.44762361744378|
| 1007|105.35977134665733|
| 1008|105.25685407339793|
| 1009|105.16307473993363|
| 1010|105.06600545864703|
| 1011|104.96056753478364|
| 1012|104.84525664217107|
| 1013| 104.7401615868953|
| 1014| 104.6283459710509|
| 1015|104.53484736833259|
| 1017|104.43492576734955|
| 1019|104.33599903547659|
| 1020|104.24640223269283|
| 1021|104.15275303890549|
+---------+------------------+
There are 10k rows; I've just truncated it for the sample view.
The data is by no means linear (the plot is omitted here).
However, I'm not worried about a perfect fit for each and every data point. I'm basically looking to fit a line that captures the overall direction of the curve and find its slope, as shown by the green trend line a statistics package drew over the same data.
The column I'm trying to fit a line through is Moving_Ratio.
The min and max values of Moving_Ratio are:
+-----------------+------------------+
|min(Moving_Ratio)| max(Moving_Ratio)|
+-----------------+------------------+
|26.73629202745194|121.84100616620908|
+-----------------+------------------+
I tried creating a simple linear model with the following code:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

vect_assm = VectorAssembler(inputCols=['Moving_Ratio'], outputCol='features')
df_vect = vect_assm.transform(df)
lir = LinearRegression(featuresCol='features', labelCol='rownumber', maxIter=50,
                       regParam=0.3, elasticNetParam=0.8)
model = lir.fit(df_vect)
Predictions = model.transform(df_vect)
coeff = model.coefficients
When I look at the predictions, I seem to be getting values nowhere near the original data corresponding to those rownumbers.
Predictions.show()
+---------+------------------+--------------------+-----------------+
|rownumber| Moving_Ratio| features| prediction|
+---------+------------------+--------------------+-----------------+
| 1000|105.67198820168865|[105.67198820168865]|8935.419272488462|
| 1001|105.65729748456914|[105.65729748456914]| 8934.20373303444|
| 1002| 105.6426671752822| [105.6426671752822]|8932.993191845864|
| 1003|105.62808965618223|[105.62808965618223]|8931.787018623438|
| 1004|105.59623035662119|[105.59623035662119]|8929.150916159619|
| 1005|105.52385366516299|[105.52385366516299]| 8923.1623232745|
| 1006|105.44762361744378|[105.44762361744378]|8916.854895949407|
| 1007|105.35977134665733|[105.35977134665733]| 8909.58582253401|
| 1008|105.25685407339793|[105.25685407339793]|8901.070240542358|
| 1009|105.16307473993363|[105.16307473993363]|8893.310750051145|
| 1010|105.06600545864703|[105.06600545864703]|8885.279042666287|
| 1011|104.96056753478364|[104.96056753478364]| 8876.55489697866|
| 1012|104.84525664217107|[104.84525664217107]|8867.013842017961|
| 1013| 104.7401615868953| [104.7401615868953]|8858.318065966234|
| 1014| 104.6283459710509| [104.6283459710509]|8849.066217228752|
| 1015|104.53484736833259|[104.53484736833259]|8841.329954963563|
| 1017|104.43492576734955|[104.43492576734955]|8833.062240915566|
| 1019|104.33599903547659|[104.33599903547659]|8824.876844336828|
| 1020|104.24640223269283|[104.24640223269283]|8817.463424838508|
| 1021|104.15275303890549|[104.15275303890549]| 8809.71470236567|
+---------+------------------+--------------------+-----------------+
Predictions.select(min('prediction'),max('prediction')).show()
+-----------------+------------------+
| min(prediction)| max(prediction)|
+-----------------+------------------+
|2404.121157489531|10273.276308929268|
+-----------------+------------------+
coeff[0]
82.74200940195973
The min and max of the predictions are completely outside the input data.
What am I doing wrong?
Any help will be greatly appreciated

When you initialize the LinearRegression object, featuresCol should point at the assembled vector of features (the independent variables) and labelCol at the label (the dependent variable). Your code has them swapped: it assembles Moving_Ratio into features and uses rownumber as the label, so the model predicts row numbers from the ratio, which is why the predictions land in the thousands. Since you are predicting Moving_Ratio, assemble rownumber into the features vector and set labelCol='Moving_Ratio'.
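A minimal sketch of the corrected setup, assuming the same df and keeping the regularization parameters from the question (you may prefer regParam=0.0 for a plain least-squares trend line):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assemble the independent variable (rownumber) into the features vector
vect_assm = VectorAssembler(inputCols=['rownumber'], outputCol='features')
df_vect = vect_assm.transform(df)

# Predict Moving_Ratio as a function of rownumber
lir = LinearRegression(featuresCol='features', labelCol='Moving_Ratio',
                       maxIter=50, regParam=0.3, elasticNetParam=0.8)
model = lir.fit(df_vect)

# coefficients[0] is the slope of the fitted trend line, intercept its offset
print(model.coefficients[0], model.intercept)

The predictions from model.transform(df_vect) should then fall within the range of Moving_Ratio rather than in the thousands.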

Related

want to create a timestamp using GPS data with python3

Python3 newbie here. I am trying to create a variable that I can use to build a GPS timestamp from an Adafruit GPS sensor, and eventually store it in a database. The MySQL table already records its own timestamp when data is inserted, so I want to store the UTC date and time that comes from the GPS device alongside it.
It seems I have something wrong and cannot figure it out. The code is failing on this:
def gpstsRead():
    gps_ts = '{}/{}/{} {:02}:{:02}:{:02}'.format(
        gps.timestamp_utc.tm_mon,
        gps.timestamp_utc.tm_mday,
        gps.timestamp_utc.tm_year,
        gps.timestamp_utc.tm_hour,
        gps.timestamp_utc.tm_min,
        gps.timestamp_utc.tm_sec,
    )
    return gps_ts
I am trying to put all of these into a timestamp like format. The error is this:
Traceback (most recent call last):
  File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 57, in <module>
    gps_ts = gpstsRead()
  File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 20, in gpstsRead
    gps.timestamp_utc.tm_mon,
AttributeError: 'NoneType' object has no attribute 'tm_mon'
I have made sure I use spaces instead of tabs, as that has caused me grief in the past. Beyond that I really don't know; I have been putzing with this for hours to no avail. Any ideas? Thanks for any suggestions.
Thanks all for the input. After reading these I decided to try it a little differently: instead of wrapping the formatting in a function with def, I eliminated the def and just built the variable inline.
Like this:
# define values
gps_ts = ('{}-{}-{} {:02}:{:02}:{:02}'.format(
    gps.timestamp_utc.tm_year,
    gps.timestamp_utc.tm_mon,
    gps.timestamp_utc.tm_mday,
    gps.timestamp_utc.tm_hour,
    gps.timestamp_utc.tm_min,
    gps.timestamp_utc.tm_sec,)
)
And that seemed to work. For anyone doing something similar, I will include the complete code below. I also understand that, as with most languages, there is always more than one way to get the job done, some better than others. I am still learning, so if anyone cares to point out how I could accomplish the same task differently or more efficiently, please feel free to give me the opportunity to learn. Thanks again!
#!/usr/bin/python3
import pymysql
import time
import board
from busio import I2C
import adafruit_gps

i2c = I2C(board.SCL, board.SDA)
gps = adafruit_gps.GPS_GtopI2C(i2c)  # Use I2C interface
gps.send_command(b"PMTK314,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0")
gps.send_command(b"PMTK220,1000")
last_print = time.monotonic()

# Open database connection
db = pymysql.connect("localhost", "database", "password", "table")
# prepare a cursor object using cursor() method
cursor = db.cursor()

while True:
    gps.update()
    current = time.monotonic()
    if current - last_print >= 1.0:  # update rate
        last_print = current
        if not gps.has_fix:
            # Try again if we don't have a fix yet.
            print("Waiting for a satellite fix...")
            continue
        # define values
        gps_ts = ('{}-{}-{} {:02}:{:02}:{:02}'.format(
            gps.timestamp_utc.tm_year,
            gps.timestamp_utc.tm_mon,
            gps.timestamp_utc.tm_mday,
            gps.timestamp_utc.tm_hour,
            gps.timestamp_utc.tm_min,
            gps.timestamp_utc.tm_sec,)
        )
        gps_lat = '{}'.format(gps.latitude)
        gps_long = '{}'.format(gps.longitude)
        gps_fix = '{}'.format(gps.fix_quality)
        gps_sat = '{}'.format(gps.satellites)
        gps_alt = '{}'.format(gps.altitude_m)
        gps_speed = '{}'.format(gps.speed_knots)
        gps_track = '{}'.format(gps.track_angle_deg)
        sql = "INSERT into ek9_gps(gps_timestamp_utc, latitude, \
               longitude, fix_quality, number_satellites, gps_altitude, \
               gps_speed, gps_track_angle) \
               values (%s,%s,%s,%s,%s,%s,%s,%s)"
        arg = (gps_ts, gps_lat, gps_long, gps_fix, gps_sat,
               gps_alt, gps_speed, gps_track)
        try:
            # Execute the SQL command
            cursor.execute(sql, arg)
            # Commit your changes in the database
            db.commit()
        except:
            print('There was an error on input into the database')
            # Rollback in case there is any error
            db.rollback()

# disconnect from server
cursor.close()
db.close()
And this is what MariaDB shows:
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| id | datetime | gps_timestamp_utc | latitude | longitude | fix_quality | number_satellites | gps_altitude | gps_speed | gps_track_angle |
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| 11 | 2020-12-30 14:14:42 | 2020-12-30 20:14:42 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 10 | 2020-12-30 14:14:41 | 2020-12-30 20:14:41 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 9 | 2020-12-30 14:14:39 | 2020-12-30 20:14:39 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 8 | 2020-12-30 14:14:38 | 2020-12-30 20:14:38 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
Success!!! Thanks again!
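Since the post asks for different or more efficient ways to do the same thing, here is one small, hypothetical refinement (my own sketch, not part of the original answer): keep the formatting in a helper, but return None while gps.timestamp_utc is still unset, so the original AttributeError cannot occur even if the helper is called before a fix is acquired. The function name is illustrative only.

def format_gps_timestamp(gps):
    # Return 'YYYY-MM-DD HH:MM:SS', or None if the GPS has no UTC time yet
    ts = gps.timestamp_utc
    if ts is None:
        return None
    return '{}-{:02}-{:02} {:02}:{:02}:{:02}'.format(
        ts.tm_year, ts.tm_mon, ts.tm_mday,
        ts.tm_hour, ts.tm_min, ts.tm_sec)

# Inside the loop, skip the INSERT until a timestamp is available:
# gps_ts = format_gps_timestamp(gps)
# if gps_ts is None:
#     continue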

Looking for a better performance on PySpark

I'm trying to build a model in PySpark and I'm obviously doing many wrong things.
What I have to do:
I have a list of products, and I have to compare each product to a corpus that will return the most similar ones, and then I have to filter by the product's type.
Here is an example with one product:
1st step:
product_id = 'HZM-1914'
type = get_type(product_id)  # takes 8s. I'll show this function below
similar_list = [(product_id, 1.0)] + model.wv.most_similar(positive=product_id, topn=5)  # takes 0.04s
# the similar list holds (product_id, similarity) pairs and looks like this:
[('HZM-1914', 1.0), ('COL-8430', 0.9951900243759155), ('D23-2178', 0.9946870803833008), ('J96-0611', 0.9943861365318298), ('COL-7719', 0.9930003881454468), ('HZM-1912', 0.9926838874816895)]
2nd step:
# I want to filter by type, so I turn the list into a dataframe, and this is what takes the longest to run (and is probably what is wrong)
rdd = sc.parallelize([(id, get_type(id), similarity) for (id, similarity) in similar_list]) #takes 55s
products = rdd.map(lambda x: Row(name=str(x[0]), type=str(x[1]), similarity=float(x[2]))) #takes 0.02s
df_recs = sqlContext.createDataFrame(products) #takes 0.02s
df_recs.show() #takes 0.43s
+--------+----------------+------------------+
| name| type| similarity |
+--------+----------------+------------------+
|HZM-1914| Chuteiras| 1.0|
|COL-8430| Chuteiras|0.9951900243759155|
|D23-2178| Bolas|0.9946870803833008|
|J96-0611|Luvas de Goleiro|0.9943861365318298|
|COL-7719| Bolas|0.9930003881454468|
|HZM-1912| Chuteiras|0.9926838874816895|
+--------+----------------+------------------+
3rd step:
# Now comes the filter:
df_recs = df_recs.filter(df_recs.type == type)  # takes 0.09s
df_recs.show() #takes 0.5s
+--------+---------+------------------+
| name| type| similarity |
+--------+---------+------------------+
|HZM-1914|Chuteiras| 1.0|
|COL-8430|Chuteiras|0.9951900243759155|
|HZM-1912|Chuteiras|0.9926838874816895|
+--------+---------+------------------+
The get_type() function is:
def get_type(product_id):
    return df.filter(col("ID") == product_id).select("TYPE").collect()[0]["TYPE"]
And the DataFrame in which get_type() looks up ID and TYPE is:
+----------+--------------------+--------------------+
|ID | NAME | TYPE |
+----------+--------------------+--------------------+
| 7983 |SNEAKERS 01 | Sneakers|
| 7034 |SHIRT 13 | Shirt|
| 3360 |SHORTS 15 | Short|
The get_type() function and creating the dataframe are the main issues, so if you have any idea how to make this work better, it would be really helpful. I come from Python and I'm struggling a lot with PySpark. Thank you very much in advance.
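No answer is recorded for this question here, but as a rough sketch of the usual approach (assuming the df, model, sqlContext and column names shown in the question; untested): each get_type() call runs a filter plus collect over the whole lookup DataFrame, so instead you can build the similarity list as a small DataFrame once and join it against df, picking up every candidate's type in a single pass and then filtering on the query product's type.

from pyspark.sql import functions as F

product_id = 'HZM-1914'
similar_list = [(product_id, 1.0)] + model.wv.most_similar(positive=product_id, topn=5)

# Build a small DataFrame of (name, similarity) pairs once
recs = sqlContext.createDataFrame(similar_list, ['name', 'similarity'])

# One join brings in the TYPE column for every candidate at the same time
types = df.select(F.col('ID').alias('name'), F.col('TYPE').alias('type'))
recs_typed = recs.join(types, on='name', how='left')

# Keep only candidates whose type matches the query product's type
query_type = recs_typed.filter(F.col('name') == product_id).select('type').first()['type']
df_recs = recs_typed.filter(F.col('type') == query_type)
df_recs.show()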

How to Convert a Normal Dataframe into MultiIndex'd based on certain condition

After a long while I visited SO's pandas section and came across a question that was not framed very clearly, so I thought I would put it here explicitly, since I'm in a similar kind of situation :-)
Below is the data frame construct:
>>> df
measure Pend Job Run Job Time
cls
ABC [inter, batch] [101, 93] [302, 1327] [56, 131]
DEF [inter, batch] [24279, 421] [4935, 5452] [75, 300]
The desired output would be as sketched below.
I tried hard but didn't find a solution, so I've drawn roughly what I would like to achieve:
----------------------------------------------------------------------------------
| |Pend Job | Run Job | Time |
cls | measure |-----------------------------------------------------------
| |inter | batch| |inter | batch| |inter | batch |
----|-----------------|------|------|-------|------|------|-----|------|----------
ABC |inter, batch |101 |93 | |302 |1327 | |56 |131 |
----|-----------------|-------------|-------|------|------|-----|------|---------|
DEF |inter, batch |24279 |421 | |4935 |5452 | |75 |300 |
----------------------------------------------------------------------------------
In other words, I want to turn my DataFrame into a MultiIndex DataFrame where Pend Job, Run Job, and Time sit on the top level as shown above.
Edit:
cls is not in the columns
This is my approach; you can modify it to your needs:
import numpy as np

s = (df.drop('measure', axis=1)                # remove the measure column
       .set_index(df['measure'].apply(', '.join),
                  append=True)                 # make `measure` the second index level
       .stack().explode().to_frame()           # concatenate all the values
    )

# assign an `inter` / `batch` label to each new cell
new_lvl = np.array(['inter', 'batch'])[s.groupby(level=(0, 1, 2)).cumcount()]
# or
# new_lvl = np.tile(['inter', 'batch'], len(s)//2)

(s.set_index(new_lvl, append=True)[0]
  .unstack(level=[-2, -1])
  .reset_index()
)
Output:
cls measure Pend Job
inter batch
0 ABC inter, batch 101 93
1 DEF inter, batch 24279 421
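A brief usage note of my own (assuming the pipeline above is assigned to a variable, here called result): with the column MultiIndex in place, individual columns are addressed by (top, sub) tuples.

result = (s.set_index(new_lvl, append=True)[0]
            .unstack(level=[-2, -1])
            .reset_index())

print(result[('Pend Job', 'inter')])   # the inter values under Pend Job
print(result[('Pend Job', 'batch')])   # the batch values under Pend Job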

how to get k-largest element and index in pyspark dataframe array

I have the following dataframe in pyspark:
+------------------------------------------------------------+
|probability |
+------------------------------------------------------------+
|[0.27047928569511825,0.5312608102025099,0.19825990410237174]|
|[0.06711381377029987,0.8775456658890036,0.05534052034069637]|
|[0.10847074295048188,0.04602848157663474,0.8455007754728833]|
+------------------------------------------------------------+
and I want to get the largest, 2-largest value and their index:
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
|probability                                                  |largest_1         |index_1|largest_2          |index_2|
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
|[0.27047928569511825,0.5312608102025099,0.19825990410237174] |0.5312608102025099|1      |0.27047928569511825|0      |
|[0.06711381377029987,0.8775456658890036,0.05534052034069637] |0.8775456658890036|1      |0.06711381377029987|0      |
|[0.10847074295048188,0.04602848157663474,0.8455007754728833] |0.8455007754728833|2      |0.10847074295048188|0      |
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
Here is one way using transform (requires Spark 2.4+) to convert the array of doubles into an array of structs containing the value and index of each item, sort_array in descending order, and then take the first N:
from pyspark.sql.functions import expr
df.withColumn('d', expr('sort_array(transform(probability, (x,i) -> (x as val, i as idx)), False)')) \
.selectExpr(
'probability',
'd[0].val as largest_1',
'd[0].idx as index_1',
'd[1].val as largest_2',
'd[1].idx as index_2'
).show(truncate=False)
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
|probability |largest_1 |index_1|largest_2 |index_2|
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
|[0.27047928569511825, 0.5312608102025099, 0.19825990410237174]|0.5312608102025099|1 |0.27047928569511825|0 |
|[0.06711381377029987, 0.8775456658890036, 0.05534052034069637]|0.8775456658890036|1 |0.06711381377029987|0 |
|[0.10847074295048188, 0.04602848157663474, 0.8455007754728833]|0.8455007754728833|2 |0.10847074295048188|0 |
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
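If you need the top k in general rather than just the two largest, one possible extension of the same idea (my own sketch, not from the answer) is to slice the sorted struct array and keep the top k entries as a single array column:

from pyspark.sql.functions import expr

k = 2  # hypothetical: how many of the largest values to keep
df.withColumn('d', expr('sort_array(transform(probability, (x,i) -> (x as val, i as idx)), False)')) \
  .withColumn('top_k', expr(f'slice(d, 1, {k})')) \
  .select('probability', 'top_k') \
  .show(truncate=False)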
From Spark 2.4+:
You can use the array_sort and array_position built-in functions for this case.
Example:
df = spark.sql("select array(0.27047928569511825,0.5312608102025099,0.19825990410237174) probability union select array(0.06711381377029987,0.8775456658890036,0.05534052034069637) probability union select array(0.10847074295048188,0.04602848157663474,0.8455007754728833) probability")
#DataFrame[probability: array<decimal(17,17)>]
#sample data
df.show(10,False)
#+---------------------------------------------------------------+
#|probability |
#+---------------------------------------------------------------+
#|[0.06711381377029987, 0.87754566588900360, 0.05534052034069637]|
#|[0.27047928569511825, 0.53126081020250990, 0.19825990410237174]|
#|[0.10847074295048188, 0.04602848157663474, 0.84550077547288330]|
#+---------------------------------------------------------------+
from pyspark.sql.functions import array_sort, col, element_at

df.withColumn("sort_arr", array_sort(col("probability"))).\
    withColumn("largest_1", element_at(col("sort_arr"), -1)).\
    withColumn("largest_2", element_at(col("sort_arr"), -2)).\
    selectExpr("*", "array_position(probability,largest_1) -1 index_1", "array_position(probability,largest_2) -1 index_2").\
    drop("sort_arr").\
    show(10, False)
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+
#|probability |largest_1 |largest_2 |index_1|index_2|
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+
#|[0.06711381377029987, 0.87754566588900360, 0.05534052034069637]|0.87754566588900360|0.06711381377029987|1 |0 |
#|[0.27047928569511825, 0.53126081020250990, 0.19825990410237174]|0.53126081020250990|0.27047928569511825|1 |0 |
#|[0.10847074295048188, 0.04602848157663474, 0.84550077547288330]|0.84550077547288330|0.10847074295048188|2 |0 |
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in PySpark, created by reading from a Hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'], df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'.
How do I create a dictionary like the following, which can be iterated over later on?
{'person1':'google','msn','yahoo'}
{'person2':'fb.com','airbnb','wired.com'}
{'person3':'fb.com','google.com'}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, so you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
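A short usage sketch of my own, in case you do eventually want the results back on the driver as a plain Python list:

list_of_dicts = df.rdd.map(lambda row: row.asDict()).collect()
# e.g. [{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']}, ...]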
How about using the PySpark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and does not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
If you wanted your results in a Python dictionary, you could use collect() [1] to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
[1] Be advised that for large data sets this operation can be slow and may fail with an out-of-memory error. Consider first whether this is really what you want to do, as you will lose the parallelization benefits of Spark by bringing the data into local memory.
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
.rdd \
.map(lambda row: {row[0]: row[1]}) \
.collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please, let me know if that works for you :)
