How can I read a shapefile programmatically via GeoMesa/Spark? - apache-spark

I am going through the documentation https://www.geomesa.org/documentation/user/convert/shp.html but I cannot find a way to read shapefiles (in my case stored on S3) using GeoMesa/Spark. Any idea?

There are three broad options.
1) GeoMesa loads data into Spark via 'RDD Providers'. The converters you linked to can be used in Spark via the Converter RDD Provider (https://www.geomesa.org/documentation/user/spark/providers.html#converter-rdd-provider). This may just work.
2) There is also a GeoTools DataStore RDD Provider implementation (https://www.geomesa.org/documentation/user/spark/providers.html#geotools-rdd-provider). That could be used with the GeoTools ShapefileDataStore (https://docs.geotools.org/stable/userguide/library/data/shape.html). The work here is to line up the correct jars and parameters.
3) If you are fine with using the GeoTools Shapefile DataStore, you could use it directly in Spark to load features into memory and then work out how to turn them into an RDD/DataFrame. (This skips the RDD Provider machinery entirely.)
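As a rough illustration of the GeoTools RDD Provider option, here is a minimal PySpark sketch. It assumes the GeoMesa Spark SQL data source (format "geomesa") is on the classpath together with the GeoTools shapefile jars, that the GeoTools RDD Provider is selected with a "geotools" = "true" parameter, and that the ShapefileDataStore is addressed through its "url" parameter. The helper names and the example URL are hypothetical, and whether an S3 URL resolves depends on the URL handlers available in your environment, so treat this as a starting point rather than a tested recipe.

```python
def shapefile_params(url):
    """Build data store parameters for the GeoTools RDD Provider (assumed keys)."""
    return {
        "geotools": "true",  # assumption: flag that selects the GeoTools RDD Provider
        "url": url,          # assumption: ShapefileDataStore 'url' parameter
    }


def read_shapefile(spark, url, feature_name):
    """Return a DataFrame of the shapefile's features via GeoMesa Spark SQL.

    `spark` is an existing SparkSession. `feature_name` is the feature type
    name, which for a shapefile is normally the file name without '.shp'.
    """
    return (
        spark.read.format("geomesa")
        .options(**shapefile_params(url))
        .option("geomesa.feature", feature_name)
        .load()
    )
```

A call might look like read_shapefile(spark, "file:///tmp/roads.shp", "roads"), where the path and name are placeholders.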

Related

Databricks spark.readstream format differences

I am confused about the difference between the following two snippets in Databricks:
spark.readStream.format('json')
vs
spark.readStream.format('cloudfiles').option('cloudFiles.format', 'json')
I know that using cloudFiles as the format means Databricks Auto Loader. In terms of performance and functionality, which one is better? Does anyone have experience with this?
Thanks
There are multiple differences between the two. With Auto Loader you get at least the following (there is more; see the docs for all details):
Better performance, scalability and cost efficiency when discovering new files. You can use either file notification mode (where you are notified about new files through cloud-native integration) or optimized file listing mode, which uses native cloud APIs to list files and directories. Spark's file streaming relies on the Hadoop APIs, which are much slower, especially if you have many nested directories and many files.
Support for schema inference and evolution. With Auto Loader you can detect changes in the schema for JSON/CSV/Avro, and adjust it to process new fields.
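To make the comparison concrete, here is a minimal PySpark sketch of the two readers side by side. It assumes it runs on Databricks, since the cloudFiles source is not part of open-source Spark. The function names, the paths passed in, and the choice to set cloudFiles.schemaLocation are illustrative assumptions, not something taken from the question.

```python
def plain_json_stream(spark, input_path, schema):
    """Spark's built-in file stream: normally needs an explicit schema and
    discovers files by listing directories through the Hadoop APIs."""
    return spark.readStream.format("json").schema(schema).load(input_path)


def autoloader_json_stream(spark, input_path, schema_location,
                           use_notifications=False):
    """Auto Loader stream: infers the schema and tracks it under
    `schema_location` so that new fields can be picked up later."""
    reader = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_location)
    )
    if use_notifications:
        # Switch from directory listing to file notification mode.
        reader = reader.option("cloudFiles.useNotifications", "true")
    return reader.load(input_path)
```

Both functions return a streaming DataFrame, so the rest of the pipeline (writeStream and so on) stays the same whichever reader you pick.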

Data source on GCP BigQuery

I looked for an existing intake component (a driver or plugin) that supports GCP BigQuery but could not find one. If there is none, please advise on how to implement a subclass of intake.source.base.DataSource.
Pandas can read from BigQuery with the function read_gbq. If you are only interested in reading whole results in a single shot, then this is all you need. You would need to do something like the sql source, which calls pandas to load the data in its _get_schema method.
There is currently no GBQ reader for dask, so you cannot load out-of-core or in parallel, but see the discussion in this thread.

How to create an hbase sequencefile key in Spark for load to Bigtable?

I want to be able to easily create test data files that I can save and re-load into a dev Bigtable instance at will, and pass to other members of my team so they can do the same. The suggested way of using Dataflow to Bigtable seems ridiculously heavy-weight (anyone loading a new type of data--not for production purposes, even just playing around with Bigtable for the first time--needs to know Apache Beam, Dataflow, Java, and Maven??--that's potentially going to limit Bigtable adoption for my team) and my data isn't already in HBase so I can't just export a sequencefile.
However, per this document, it seems like the sequencefile key for HBase should be constructible in regular Java/Scala/Python code:
The HBase Key consists of: the row key, column family, column qualifier, timestamp and a type.
It just doesn't go into enough detail for me to actually do it. What delimiters exist between the different parts of the key? (This is my main question).
From there, Spark at least has a method to write a sequencefile so I think I should be able to create the files I want as long as I can construct the keys.
I'm aware that there's an alternative (described in this answer, whose example link is broken) that would involve writing a script to spin up a Dataproc cluster, push a TSV file there, and use HBase ImportTsv to push the data to Bigtable. This also seems overly heavy-weight to me but maybe I'm just not used to the cloud world yet.
The sequence file solution is meant for situations where large sets of data need to be imported into and/or exported from Cloud Bigtable. If your file is small enough, write a script that creates a table, reads from the file, and uses a BufferedMutator (or batch writes in your favorite language) to write to Cloud Bigtable.
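The small-file approach can be sketched as follows. This is a minimal, client-agnostic outline: it assumes a tab-separated file with four columns (row key, column family, qualifier, value) and hands each batch to a write_batch callable that you supply, for example a wrapper around your Bigtable client's batch mutation call. The file layout, function names, and default batch size are all hypothetical.

```python
import csv
from itertools import islice


def read_cells(lines):
    """Yield (row_key, family, qualifier, value) tuples from TSV lines."""
    for record in csv.reader(lines, delimiter="\t"):
        if not record:
            continue  # skip blank lines
        row_key, family, qualifier, value = record
        yield row_key, family, qualifier, value


def in_batches(cells, batch_size):
    """Group an iterable of cells into lists of at most `batch_size` items."""
    iterator = iter(cells)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_file(lines, write_batch, batch_size=500):
    """Send every cell to `write_batch` in batches and return the cell count."""
    total = 0
    for batch in in_batches(read_cells(lines), batch_size):
        write_batch(batch)  # hypothetical hook: call your Bigtable client here
        total += len(batch)
    return total
```

Because the test data is just a TSV file, it can be checked into a repository and shared with teammates, and the same script can reload it into a dev instance at will.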

How to migrate data between two tables in Cassandra properly

I have to change the schema of one of my tables in Cassandra. It cannot be done by simply using the ALTER TABLE command, because the primary key changes.
So the question is: how can such a migration best be done?
Using the COPY command in CQL is not an option here because the dump file can be really huge.
Can I solve this problem without writing a custom application?
As Guillaume suggested in the comments, you can't do this directly in Cassandra; schema-altering operations are very limited. You have to perform the migration manually using one of the tools suggested there or, if you have very large tables, you can leverage Spark.
Spark can efficiently read data from your nodes, transform it locally, and save it back to the database. Remember that such a migration requires reading the whole database content, so it might take a while. It may be the most performant solution, but it needs more preparation: a Spark cluster has to be set up.
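A minimal PySpark sketch of such a migration follows. It assumes the DataStax Spark Cassandra Connector is on the classpath (it provides the org.apache.spark.sql.cassandra format) and that the new table has already been created with the new primary key. The function name and the keyspace and table arguments are placeholders.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def migrate_table(spark, keyspace, old_table, new_table, transform=None):
    """Copy every row of `old_table` into `new_table`, optionally reshaping it."""
    rows = (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=old_table)
        .load()
    )
    if transform is not None:
        # e.g. rename or derive the columns that form the new primary key
        rows = transform(rows)
    (
        rows.write.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=new_table)
        .mode("append")
        .save()
    )
```

The optional transform argument is a plain function from DataFrame to DataFrame, which keeps the primary-key reshaping separate from the read and write plumbing.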

What is a good Bulk data loading tool for Cassandra

I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If all the data to be loaded is already in place, you can try the sstableloader utility (Cassandra 0.8.x onwards only) to bulk load it. For more details see: cassandra bulk loader
2) Cassandra introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job; it is available from cassandra-1.1.x onwards.
For more details see: Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic mapping from CSV-> Cassandra may not be appropriate since secondary indexes and denormalisation are commonly needed.
For Cassandra 1.1.3 and higher, there is the CQL COPY command, which can import data into (or export data from) a table. According to the documentation, this is a good option if you are importing fewer than roughly 2 million rows. It is much easier to use than the sstableloader and less error prone. The sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
For larger data sets, you should use the sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.
