How do you project geometries from one EPSG to another with Spark/Geomesa?

I am "translating" some PostGIS code to GeoMesa, and I have PostGIS code like this:
select ST_Transform(ST_SetSRID(ST_Point(longitude, latitude), 4326), 27700)
which converts a point geometry from EPSG:4326 to EPSG:27700, for example.
In the GeoMesa Spark SQL functions documentation https://www.geomesa.org/documentation/user/spark/sparksql_functions.html I can see ST_Point, but I cannot find any equivalent of ST_Transform. Any ideas?

I have used the Sedona library for geoprocessing; it has an st_transform
function that I have used and that works fine, so you can use it if you want. The official documentation is here: https://sedona.apache.org/api/sql/GeoSparkSQL-Function/#st_transform
GeoMesa now also supports the function:
https://www.geomesa.org/documentation/3.1.2/user/spark/sparksql_functions.html#st-transform
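To make that concrete, here is a rough, hypothetical sketch of the original PostGIS query expressed through Spark SQL. It assumes the spatial SQL functions (Sedona, or GeoMesa 3.x or later) are registered on the session and that a view named points with longitude and latitude columns exists; check the exact argument order against the documentation for your version:

// Hypothetical sketch: reproject a lon/lat point to EPSG:27700 via Spark SQL
val reprojected = spark.sql("""
  SELECT ST_Transform(ST_Point(longitude, latitude), 'EPSG:4326', 'EPSG:27700') AS geom_27700
  FROM points
""")
reprojected.show()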

For GeoMesa 1.x, 2.x, and the upcoming 3.0 release, there is no ST_Transform at present. One could write their own UDF using GeoTools (or another library) to do the transformation.
Admittedly, this would require some work.
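As an illustration only (not an official GeoMesa API), here is a minimal sketch of such a UDF using GeoTools' referencing classes. It assumes the JTS types from geomesa-spark-jts are registered on the session (spark.withJTS) so the UDF can accept and return a Geometry:

import org.apache.spark.sql.functions.udf
import org.geotools.geometry.jts.JTS
import org.geotools.referencing.CRS
import org.locationtech.jts.geom.Geometry

// Hand-rolled replacement for ST_Transform, hard-coded to the 4326 -> 27700 case.
// The CRS lookups are done inside the UDF to keep the closure serializable;
// in real code you would cache the MathTransform.
val st_transform_27700 = udf { geom: Geometry =>
  val source = CRS.decode("EPSG:4326", true) // 'true' forces longitude/latitude axis order
  val target = CRS.decode("EPSG:27700")
  JTS.transform(geom, CRS.findMathTransform(source, target, true))
}
spark.udf.register("st_transform_27700", st_transform_27700)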

I recently ran into the same issue on Azure Databricks. I was able to do it by manually installing the JAR library from here,
and then running the following Scala code.
%scala
import org.locationtech.jts.geom._
import org.locationtech.geomesa.spark.jts._
import org.locationtech.geomesa.spark.geotools._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._
spark.withJTS
// assumes an existing DataFrame named data_points with LONGITUDE and LATITUDE columns
val transformed = data_points
  .withColumn("geom", st_makePoint(col("LONGITUDE"), col("LATITUDE")))
  .withColumn("geom_5347", st_transform(col("geom"), lit("EPSG:4326"), lit("EPSG:5347")))

display(transformed)
Good luck.

Related

backtesting.py plotting function not working

I'm trying to learn backtesting.py. When I run the following sample code, it pops up these errors; could anyone help? I tried to uninstall the Bokeh package and reinstall an older version, but it doesn't work.
BokehDeprecationWarning: Passing lists of formats for DatetimeTickFormatter scales was deprecated in Bokeh 3.0. Configure a single string format for each scale
C:\Users\paul_\AppData\Local\Programs\Python\Python310\lib\site-packages\bokeh\models\formatters.py:399: UserWarning: DatetimeFormatter scales now only accept a single format. Using the first prodvided: '%d %b'
warnings.warn(f"DatetimeFormatter scales now only accept a single format. Using the first prodvided: {fmt[0]!r} ")
BokehDeprecationWarning: Passing lists of formats for DatetimeTickFormatter scales was deprecated in Bokeh 3.0. Configure a single string format for each scale
C:\Users\paul_\AppData\Local\Programs\Python\Python310\lib\site-packages\bokeh\models\formatters.py:399: UserWarning: DatetimeFormatter scales now only accept a single format. Using the first prodvided: '%m/%Y'
warnings.warn(f"DatetimeFormatter scales now only accept a single format. Using the first prodvided: {fmt[0]!r} ")
GridPlot(id='p11925', ...)
import bokeh
import datetime
import pandas_ta as ta
import pandas as pd
from backtesting import Backtest
from backtesting import Strategy
from backtesting.lib import crossover
from backtesting.test import GOOG


class RsiOscillator(Strategy):
    upper_bound = 70
    lower_bound = 30
    rsi_window = 14

    # Do as much initial computation as possible
    def init(self):
        self.rsi = self.I(ta.rsi, pd.Series(self.data.Close), self.rsi_window)

    # Step through bars one by one
    # Note that multiple buys are a thing here
    def next(self):
        if crossover(self.rsi, self.upper_bound):
            self.position.close()
        elif crossover(self.lower_bound, self.rsi):
            self.buy()


bt = Backtest(GOOG, RsiOscillator, cash=10_000, commission=.002)
stats = bt.run()
bt.plot()
An issue was opened for this in the GitHub repo:
https://github.com/kernc/backtesting.py/issues/803
A comment in the issue suggests downgrading bokeh to 2.4.3:
python3 -m pip install bokeh==2.4.3
This worked for me.
I had a similar issue using the Spyder IDE.
I found out I need to call the line below for the plot to show in Spyder.
backtesting.set_bokeh_output(notebook=False)
I updated Python to version 3.11 and downgraded bokeh to 2.4.3.
This worked for me.
Downgrading Bokeh didn't work for me.
But after importing backtesting in Jupyter, I needed to call:
backtesting.set_bokeh_output(notebook=False)
The expected plot was then generated in a new interactive browser tab.

module 'statsmodels.tsa.api' has no attribute 'arima_model'

I'm trying to use "statsmodels.api" to work with time series data and to fit a simple ARIMA model using
sm.tsa.arima_model.ARIMA(dta,(4,1,1)).fit()
but I got the following error:
module 'statsmodels.tsa.api' has no attribute 'arima_model'
I'm using 'statsmodels' version 0.9.0 with 'spyder' version 3.2.8. I'd be pleased to get your help, thanks.
The correct path is:
import statsmodels.api as sm
sm.tsa.ARIMA()
You can discover it using a shell that allows autocompletion, like IPython.
It is also shown in the examples provided by statsmodels, such as this one.
More information about the package structure may be found here.

What is imported with spark.implicits._?

What is imported with import spark.implicits._? Does "implicits" refer to some package? If so, why could I not find it in the Scala API documentation at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package?
Scala allows you to import things "dynamically" into scope. You can also do something like this:
final case class Greeting(hi: String)
def greet(greeting: Greeting): Unit = {
import greeting._ // everything in greeting is now available in scope
println(hi)
}
The SparkSession instance carries along some implicits that you import into your scope with that import statement. The most important things you get are the Encoders necessary for a lot of operations on DataFrames and Datasets. It also brings into scope the StringContext you need in order to use the $"column_name" notation.
The implicits member is an instance of SQLImplicits, whose source code (for version 2.3.1) you can view here.
Importing through an object is a Scala feature, so the API documentation does not describe it. In the Apache Spark source code, implicits is an object defined inside the SparkSession class. The implicits object extends SQLImplicits like this:
object implicits extends org.apache.spark.sql.SQLImplicits with scala.Serializable
SQLImplicits provides some more functionality, such as:
Converting a Scala object to a Dataset (via toDS).
Converting a Scala object to a DataFrame (via toDF).
Converting "$name" into a Column.
By importing implicits through import spark.implicits._, where spark is a SparkSession object, these functionalities are imported implicitly.
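To make those three points concrete, here is a small assumed example in a Spark shell or notebook where spark is the active SparkSession:

import spark.implicits._

case class Person(name: String, age: Int)

// The Encoders pulled in by the import make toDS/toDF available on local collections
val ds = Seq(Person("ann", 32), Person("bob", 41)).toDS()
val df = Seq(("ann", 32), ("bob", 41)).toDF("name", "age")

// The StringContext extension enables the $"column_name" syntax
df.select($"name").show()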

How to implement rdd.bulkSaveToCassandra in datastax

I am using a DataStax cluster with 5.0.5.
[cqlsh 5.0.1 | Cassandra 3.0.11.1485 | DSE 5.0.5 | CQL spec 3.4.0 | Native proto
using spark-cassandra-connector 1.6.8
I tried to implement the code below, but the import is not working.
val rdd: RDD[SomeType] = ... // create some RDD to save
import com.datastax.bdp.spark.writer.BulkTableWriter._
rdd.bulkSaveToCassandra(keyspace, table)
Can someone suggest how to implement this code? Are there any dependencies required for this?
The Cassandra Spark Connector has a saveToCassandra method that can be used like this (taken from the documentation):
val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
There is also saveAsCassandraTableEx, which allows you to control schema creation and other things; it's also described in the documentation referenced above.
To use them you need to import com.datastax.spark.connector._, as described in the "Connecting to Cassandra" document.
You also need to add the corresponding dependency, but this depends on which build system you use.
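For example, with sbt the dependency would look roughly like the line below; the version is the one mentioned in the question, so adjust it to your Spark and Cassandra versions:

// build.sbt (sketch): the open-source Spark Cassandra connector
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.8"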
The bulkSaveToCassandra method is available only when you're using DSE's connector. You need to add the corresponding dependencies; see the documentation for more details. But even the primary developer of the Spark connector says that it's better to use saveToCassandra instead.

The import org.apache.jena.query cannot be resolved

I want to calculate the distance between sensors deployed in a geographical area, using longitude and latitude, in a SPARQL query issued in Apache Jena 2.11 (sensor descriptions and observations are stored as RDF triples in sensor.n3, Eclipse is the IDE, the OS is Fedora 19, and TDB is the triple store).
I found that "Spatial searches with SPARQL" should help in this regard. But when I import the package given at http://jena.apache.org/documentation/query/spatial-query.html, import org.apache.jena.query.spatial.EntityDefinition, in Eclipse I get the error "The import org.apache.jena.query cannot be resolved". When I browsed the ../apache-jena-2.11.1/javadoc-arq/org/apache/jena directory, it contains only
(atlas, common, web, riot); there is no query folder, which is why the import is highlighted in red.
I have one more doubt: does Apache Solr need to be installed (I have downloaded Solr 4.10.1), or can I just use the build path to import an external jar?
You need to download jena-spatial separately. (Use Maven to manage your dependencies.) You can use Lucene instead of Solr. Again, Maven will load the dependencies. AndyS
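If you happen to manage dependencies with sbt rather than Maven, the coordinates would look roughly like the line below; the version shown is only a guess for Jena 2.11.x, so check the matching release on Maven Central:

// build.sbt (sketch): jena-spatial; pick the version that matches your Jena release
libraryDependencies += "org.apache.jena" % "jena-spatial" % "1.0.1"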

Resources