Databricks geospatial (latitude/longitude) plot - apache-spark

Is it possible to visualize (latitude, longitude) data on a world map in an Azure Databricks Notebook? Scatter plot is the closest I got. The documentation tells about the existence of Map (Markers), but I didn't find such. Other possibilities are also welcome.
Here is the example of loading geospatial data from the Databricks doc:
%sql
CREATE TABLE IF NOT EXISTS default.sf_fire_incidents
USING com.databricks.spark.csv
OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-incidents.csv',
HEADER "true",
INFERSCHEMA "true");
SELECT location,
`Estimated Property Loss`,
coordinates[0] AS LAT,
coordinates[1] AS LONG
FROM (SELECT location,
`Estimated Property Loss`,
split(btrim(location, '() '), ', ') AS coordinates
FROM default.sf_fire_incidents
ORDER BY `Estimated Property Loss` DESC,
location)
LIMIT 2000;
Scatter plot (LAT, LONG) is the closest I have, but I am unable to show the positions on a world map.

I think your best bet would be using the python library folium.
link to docs : https://python-visualization.github.io/folium/quickstart.html#Markers
You will start by initializing your map object.
import folium
# location refers to where the map will be centered at the begining
m = folium.Map(location=[45.372, -121.6972], zoom_start=12, tiles="Stamen Terrain")
Then convert your query result to pandas dataframe, you can do for example :
YOUR_QUERY = """
SELECT location,
`Estimated Property Loss`,
coordinates[0] AS LAT,
coordinates[1] AS LONG
FROM (SELECT location,
`Estimated Property Loss`,
split(btrim(location, '() '), ', ') AS coordinates
FROM default.sf_fire_incidents
ORDER BY `Estimated Property Loss` DESC,
location)
LIMIT 2000;
"""
pdf = spark.sql(YOUR_QUERY).toPandas()
And finally iterate over your rows and add your markers to the map object like such :
for index, lines in pdf.iterrows():
folium.Marker([lines[LAT], lines[LONG]], popup=f"{index}").add_to(m)
Don't forget to show the map at end by just calling the object :
m

Related

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
df = spark.sql('select * from Merchant_Segments_View limit 5')
return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: percentile_approx("x", 0.5) for x in df.columns if x is not exclustion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)
try createGlobalTempView. It worked for me.
eg:
df.createGlobalTempView("people")
(Don't know the root cause why localTempView dose not work )
I managed to avoid using dynamic sql for calculating median across columns using the following code:
df_result = df.groupBy(group_list).agg(
*[ F.expr('percentile_approx(nullif('+col+',0), 0.5)').alias(col) for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.

Inserting Timestamp Into Snowflake Using Python 3.8

I have an empty table defined in snowflake as;
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP
);
And it creates the correct table, which has been checked using desc command in sql. Then using a snowflake python connector we are trying to execute following query;
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
Just before this query the variables are defined, The main challenge is getting the current time stamp written into snowflake. Here the value of ct is defined as;
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query we get the following errr message;
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Can I kindly get some help on ow to format the date time value here? Help is appreciated.
In addition to the answer #Lukasz provided you could also think about defining the current_timestamp() as default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
It will automatically assign the insert time to TIME_PREDICTED
Educated guess. When performing insert with:
insert_query = f'INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES ({accountId}, {risk_score},{ct});'
It is a string interpolation. The ct is provided as string representation of datetime, which does not match a timestamp data type, thus error.
I would suggest using proper variable binding instead:
ctx.cursor().execute("INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
"(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
"VALUES(:1, :2, :3)",
(accountId,
risk_score,
("TIMESTAMP_LTZ", ct)
)
);
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
"INSERT INTO testtable(col1, col2) "
"VALUES({col1}, '{col2}')".format(
col1=789,
col2='test string3')
)
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
You forgot to place the quotes before and after the {ct}. The code should be :
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)

Trying to plot a pandas dataframe groupby with Bokeh

New here but I've been searching for hours now and can't seem to find the solution for this. What I'm trying to do is display an aggregate of a dataframe in a Bokeh chart. I tried using a groupby object but I get an error when passing the groupby object to the ColumnDataSource (as mentioned in the post below).
how use bokeh vbar chart parameter with groupby object?
Here's some sample code I'm using:
import pandas
from bokeh.models import ColumnDataSource
df = pandas.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
group = df.groupby("A")
source = ColumnDataSource(group)
Getting this error:
ValueError: expected a dict or pandas.DataFrame, got <pandas.core.groupby.DataFrameGroupBy object at 0x103f7bfd0>
Any ideas as to plot the groupby object in a chart with Bokeh?
Thanks in advance!
I haven't used Bokeh, however from what I see you are passing a pandas.core.groupby.DataFrameGroupBy and ColumnDataSource is expecting a pd.DataFrame. That said the problem is that when using groupby you create a data structure that resembles key, value storage. So each group in the groups object will have a key and value, that value is the DataFrame that your are looking for. Running your code as shown below will help you understand the resulting data structure from applying groupby() to a DataFrame:
groups = df.group('A')
for group in groups:
# get group key
key = group[0]
# Get group df
group_df = group[1]
Notice that I replaced group = df.group('A') with groups = df.group('A')

mimic ' group by' and window function logic in spark

I have a large csv file with the columns id,time,location. I made it an RDD, and want to compute some aggregated metrics of the trips, when a trip is defined as a time-contiguous set of records of the same id, separated by at least 1 hour on either side. I am new to spark. (related)
To do that, I think to create an RDD with elements of the form (trip_id,(time, location)) and use reduceByKey to calculate all the needed metrics.
To calculate the trip_id, i try to implement the SQL-approach of the linked question, to make an indicator field of whether the record is a start of a trip, and make a cumulative sum of this indicator field. This does not sound like a distributed approach: is there a better one?
Furthermore, how can I add this indicator field? it should be 1 if the time-difference to the previous record of the same id is above an hour, and 0 otherwise. I thought of at first doing groupBy id and then sort in each of the values, but they will be inside an Array and thus not amenable to sortByKey, and there is no lead function as in SQL to get the previous value.
Example of the suggested aforementioned approach: for the RDD
(1,9:30,50)
(1,9:37,70)
(1,9:50,80)
(2,19:30,10)
(1,20:50,20)
We want to turn it first into the RDD with the time differences,
(1,9:30,50,inf)
(1,9:37,70,00:07:00)
(1,9:50,80,00:13:00)
(2,19:30,10,inf)
(2,20:50,20,01:20:00)
(The value of the earliest record is, say, scala's PositiveInfinity constant)
and turn this last field into an indicator field of whether it is above 1, which indicates whether we start a trip,
(1,9:30,50,1)
(1,9:37,70,0)
(1,9:50,80,0)
(2,19:30,10,1)
(2,20:50,20,1)
and then turn it into a trip_id
(1,9:30,50,1)
(1,9:37,70,1)
(1,9:50,80,1)
(2,19:30,10,2)
(2,20:50,20,3)
and then use this trip_id as the key to aggregations.
The preprocessing was simply to load the file and delete the header,
val rawdata=sc.textFile("some_path")
def isHeader(line:String)=line.contains("id")
val data=rawdata.filter(!isHeader(_))
Edit
While trying to implement with spark SQL, I ran into an error regarding the time difference:
val lags=sqlContext.sql("
select time - lag(time) over (partition by id order by time) as diff_time from data
");
since spark doesn't know how to take the difference between two timestamps. I'm trying to check whether this difference is above 1 hour.
It Also doesn't recognize the function getTime, that I found online as an answer, the following returns an error too (Couldn't find window function time.getTime):
val lags=sqlContext.sql("
select time.getTime() - (lag(time)).getTime() over (partition by id order by time)
from data
");
Even though making a similar lag difference for a numeric attribute works:
val lag_numeric=sqlContext.sql("
select longitude - lag(longitude) over (partition by id order by time)
from data"); //works
Spark didn't recognize the function Hours.hoursBetween either. I'm using spark 1.4.0.
I also tried to define an appropriate user-defined-function, but UDFS are oddly not recognized inside queries:
val timestamp_diff: ((Timestamp,Timestamp) => Double) =
(d1: Timestamp,d2: Timestamp) => d1.getTime()-d2.getTime()
val lags=sqlContext.sql("select timestamp_diff(time,lag(time))
over (partition by id order by time) from data");
So, how can spark test whether the difference between timestamps is above an hour?
Full code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import sqlContext._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.hive.HiveContext//For window functions
import java.util.Date
import java.sql.Timestamp
case class Record(id: Int, time:Timestamp, longitude: Double, latitude: Double)
val raw_data=sc.textFile("file:///home/sygale/merged_table.csv")
val data_records=
raw_data.map(line=>
Record( line.split(',')(0).toInt,
Timestamp.valueOf(line.split(',')(1)),
line.split(',')(2).toDouble,
line.split(',')(3).toDouble
))
val data=data_records.toDF()
data.registerTempTable("data")
val lags=sqlContext.sql("
select time - lag(time) over (partition by id order by time) as diff_time from data
");

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, in fact the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group an call the result: sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
in order to calculate the final SUM I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way for structuring the query in order to make the export work?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale down the query to this, hopefully this should run ok:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
you can click on Create a PivotTable report... . If you define the PivotTable properly, Excel will display only the outer sum.

Resources