How to implement rdd.bulkSaveToCassandra in datastax - apache-spark

I am using a DataStax cluster with DSE 5.0.5.
[cqlsh 5.0.1 | Cassandra 3.0.11.1485 | DSE 5.0.5 | CQL spec 3.4.0 | Native proto
I am using spark-cassandra-connector 1.6.8.
I tried to implement the code below, but the import is not working:
val rdd: RDD[SomeType] = ... // create some RDD to save
import com.datastax.bdp.spark.writer.BulkTableWriter._
rdd.bulkSaveToCassandra(keyspace, table)
Can someone suggest how to implement this code? Are there any dependencies required for this?

The Spark Cassandra Connector has a saveToCassandra method that can be used like this (taken from the documentation):
val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
There is also saveAsCassandraTableEx, which lets you control schema creation among other things; it is also described in the documentation referenced above.
To use these methods you need to import com.datastax.spark.connector._, as described in the "Connecting to Cassandra" document.
You also need to add the corresponding dependency, but the exact coordinates depend on which build system you use.
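For example, with sbt the open-source connector dependency would look roughly like this (a minimal sketch; the version mirrors the 1.6.8 mentioned in the question, so adjust it and the Scala version to your own build):

// build.sbt (sketch): open-source Spark Cassandra Connector, 1.6.x line
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.8"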
The bulkSaveToCassandra method is available only when you're using DSE's connector. You need to add the corresponding dependencies; see the documentation for more details. But even the primary developer of the Spark connector says that it's better to use saveToCassandra instead.

Related

How to fix 'Failed to convert the JSON string 'varchar(2)' to a data type.'

We want to move from Spark 3.0.1 to 3.1.2. According to the migration guide, varchar data types are now supported in table schemas. Unfortunately, data onboarded with the new version can't be queried by old Spark versions, which treated varchar as a string in the table schema. According to the migration guide, setting spark.sql.legacy.charVarcharAsString to true in the Spark session configuration should do the trick, but we still get the varchar data type instead of string in the Hive table schema.
As is: the Hive table schema contains varchar(2).
To be: the Hive table schema should contain string.
What are we missing here?
You should upgrade your Spark version; see https://issues.apache.org/jira/browse/SPARK-37452. There is a bug that affects versions 3.1.2 and 3.2.0, and it was fixed in 3.1.3, 3.2.1, and 3.3.0.
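Once you are on a fixed version, the legacy flag from the migration guide can be set when building the session. A minimal Scala sketch (the application name is a placeholder):

import org.apache.spark.sql.SparkSession

// Treat CHAR/VARCHAR in table schemas as plain STRING, as in pre-3.1 behaviour
val spark = SparkSession.builder()
  .appName("varchar-as-string") // placeholder name
  .config("spark.sql.legacy.charVarcharAsString", "true")
  .getOrCreate()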

How to get the Hadoop path with Java/Scala API in Code Repositories

My need is to read other formats (JSON, binary, XML) and infer the schema dynamically within a transform in Code Repositories, using the Spark datasource API.
Example:
val df = spark.read.json(<hadoop_path>)
For that, I need an accessor to the Foundry file system path, which is something like:
foundry://...#url:port/datasets/ri.foundry.main.dataset.../views/ri.foundry.main.transaction.../startTransactionRid/ri.foundry.main.transaction...
This is possible with PySpark API (Python):
filesystem = input_transform.filesystem()
hadoop_path = filesystem.hadoop_path
However, for Java/Scala I didn't find a way to do it properly.
A getter for the Hadoop path was recently added to the Foundry Java API. After upgrading the version of the Java transform (transformsJavaVersion >= 1.188.0), you can get it:
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
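A minimal sketch of how that path could then be fed to the Spark datasource API for schema inference; the *.json glob and the myInput/spark handles are assumptions about the transform context, not part of the Foundry API shown above:

// Resolve the dataset's Hadoop path and let Spark infer the JSON schema
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
val df = spark.read.json(s"$hadoopPath/*.json") // glob pattern is an assumption
df.printSchema()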

How do you project geometries from one EPSG to another with Spark/Geomesa?

I am "translating" some Postgis code to Geomesa and I have some Postgis code like this:
select ST_Transform(ST_SetSRID(ST_Point(longitude, latitude), 4326), 27700)
which converts a point geometry from 4326 to 27700, for example.
In the GeoMesa Spark SQL documentation https://www.geomesa.org/documentation/user/spark/sparksql_functions.html I can see ST_Point, but I cannot find any equivalent ST_Transform function. Any idea?
I have used the Sedona library for geoprocessing; it has an st_transform function that I have used and that works fine, so you can use it if you want. The official documentation is here: https://sedona.apache.org/api/sql/GeoSparkSQL-Function/#st_transform
GeoMesa now supports the function as well:
https://www.geomesa.org/documentation/3.1.2/user/spark/sparksql_functions.html#st-transform
For GeoMesa 1.x, 2.x, and the upcoming 3.0 release, there is no ST_Transform at present. You could write your own UDF using GeoTools (or another library) to do the transformation, as in the sketch below.
Admittedly, this would require some work.
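A minimal sketch of such a UDF built on GeoTools, assuming the JTS geometry types are registered with Spark (e.g. via spark.withJTS) and a GeoTools EPSG factory such as gt-epsg-hsqldb is on the classpath; the EPSG codes mirror the question and the column name is a placeholder:

import org.apache.spark.sql.functions.{col, udf}
import org.locationtech.jts.geom.Geometry
import org.geotools.geometry.jts.JTS
import org.geotools.referencing.CRS

// Reproject a geometry from EPSG:4326 (lon/lat) to EPSG:27700 (British National Grid)
val transformTo27700 = udf { (geom: Geometry) =>
  val source = CRS.decode("EPSG:4326", true) // true = longitude-first axis order
  val target = CRS.decode("EPSG:27700")
  val mathTransform = CRS.findMathTransform(source, target, true) // lenient datum shift
  JTS.transform(geom, mathTransform)
}

// Usage: df.withColumn("geom_27700", transformTo27700(col("geom")))

Building the CRS objects inside the UDF body keeps non-serializable GeoTools objects out of the driver-side closure, at the cost of re-creating them per call.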
I recently ran into the same issue on Azure Databricks. I was able to do it by manually installing the JAR library from here,
and then running the following Scala code:
%scala
import org.locationtech.jts.geom._
import org.locationtech.geomesa.spark.jts._
import org.locationtech.geomesa.spark.geotools._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

spark.withJTS

// Build a point column and reproject it from EPSG:4326 to EPSG:5347
val dataPointsTransformed = data_points
  .withColumn("geom", st_makePoint(col("LONGITUDE"), col("LATITUDE")))
  .withColumn("geom_5347", st_transform(col("geom"), lit("EPSG:4326"), lit("EPSG:5347")))

display(dataPointsTransformed)
Good luck.

How can we convert an external table to managed table in SPARK 2.2.0?

The below command was successfully converting external tables to managed tables in Spark 2.0.0:
ALTER TABLE {table_name} SET TBLPROPERTIES(EXTERNAL=FALSE);
However the above command is failing in Spark 2.2.0 with the below error:
Error in query: Cannot set or change the preserved property key:
'EXTERNAL';
As @AndyBrown pointed out in a comment, you have the option of dropping to the console and invoking the Hive statement there. In Scala this worked for me:
import sys.process._
val exitCode = Seq("hive", "-e", "ALTER TABLE {table_name} SET TBLPROPERTIES(\"EXTERNAL\"=\"FALSE\")").!
I faced this problem using Spark 2.1.1, where @Joha's answer does not work because spark.sessionState is not accessible due to being declared lazy.
In Spark 2.2.0 you can do the following:
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.catalog.CatalogTableType
val identifier = TableIdentifier("table", Some("database"))
val oldTable = spark.sessionState.catalog.getTableMetadata(identifier)
val newTableType = CatalogTableType.MANAGED
val alteredTable = oldTable.copy(tableType = newTableType)
spark.sessionState.catalog.alterTable(alteredTable)
The issue is case sensitivity on Spark 2.1 and above.
Please try setting TBLPROPERTIES in lower case:
ALTER TABLE <TABLE NAME> SET TBLPROPERTIES('external'='false')
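If you are issuing the statement through the Spark SQL API rather than a SQL shell, a minimal sketch (the database and table names are placeholders):

// Lower-case property key and value, as suggested above
spark.sql("ALTER TABLE my_db.my_table SET TBLPROPERTIES('external'='false')")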
I had the same issue while using a Hive external table. I solved the problem by directly setting the property external to false in the Hive metastore, using a Hive metastore client:
Table table = hiveMetaStoreClient.getTable("db", "table");
table.putToParameters("EXTERNAL","FALSE");
hiveMetaStoreClient.alter_table("db", "table", table,true);
I tried the above option from a Scala Databricks notebook, and the external table was converted to a MANAGED table. The good part is that the desc formatted option from Spark on the new table still shows the location to be on my ADLS. This was one limitation Spark had: we cannot specify the location for a managed table.
As of now I am able to do a truncate table on it. Hopefully there will be a more direct option for creating a managed table with a specified location from Spark SQL.

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
What I did was upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering approach now works. Although I still can't explain why 1.3.0 has problems with it.
