Not able to persist Spark data set to orientdb - apache-spark

I am trying to fetch data from a SQL Server database and have created a Spark dataset from it. When I try to persist the dataset to OrientDB, it fails with the error below:
Exception in thread "main" java.lang.RuntimeException: Connection Exception Occurred: Error on opening database 'jdbc:orient:REMOTE:localhost/test'
Here is my code:
// Read the source table from SQL Server over JDBC
Map<String, String> options = new HashMap<>();
options.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver");
options.put("url", "jdbc:sqlserver://localhost:1433;databaseName=sample");
options.put("user", "username");
options.put("password", "password");
DataFrameReader jdbcDF = spark.read().format("jdbc").options(options);
Dataset<Row> tableDataSet = jdbcDF.option("dbtable", "Table1").load();
tableDataSet.createOrReplaceTempView("TEMP_V");
Dataset<Row> tableDataset1 = spark.sql("SELECT ID AS DEPT_ID, NAME AS DEPT_NAME FROM TEMP_V");

// Write the result to OrientDB as DEPARTMENT vertices
tableDataset1.write().format("org.apache.spark.orientdb.graphs")
    .option("dburl", "jdbc:orient:remote:localhost/test")
    .option("user", "root")
    .option("password", "root")
    .option("spark", "true")
    .option("vertextype", "DEPARTMENT")
    .mode(SaveMode.Overwrite)
    .save();

At the time of writing, OrientDB's JDBC driver is not able to persist a Spark dataset; it would need to be patched to improve Spark compatibility. It is, however, able to read from OrientDB and load a dataset.
Please open an issue.
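For the read path, a minimal sketch could look like the following, assuming the OrientDB JDBC driver (com.orientechnologies.orient.jdbc.OrientJdbcDriver) is on the classpath and that a DEPARTMENT class exists in the target database; the class name and connection details here are illustrative only:
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrientDbReadExample {
    // Reads the (hypothetical) DEPARTMENT class from OrientDB into a Spark dataset over JDBC
    public static Dataset<Row> readDepartments(SparkSession spark) {
        Map<String, String> options = new HashMap<>();
        options.put("driver", "com.orientechnologies.orient.jdbc.OrientJdbcDriver");
        options.put("url", "jdbc:orient:remote:localhost/test");
        options.put("user", "root");
        options.put("password", "root");
        options.put("dbtable", "DEPARTMENT");

        return spark.read().format("jdbc").options(options).load();
    }
}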

Related

spark stream stream join error on restart: Provided schema doesn't match to the schema for existing state

I have two stream sources and am trying to do a stream-stream inner join; it works as expected while the Spark session is running.
After the session ends, it restarts smoothly if no new file has been added to either read-stream location, but if a file is added while the Spark session is restarting, it throws the following error inside Spark.
SPARK UI ERROR:
org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
Provided key schema: StructType(StructField(field0,LongType,true), StructField(index,LongType,true))
Existing key schema: StructType(StructField(field0,LongType,true), StructField(index,LongType,true))
Provided value schema: StructType(StructField(first_name,StringType,true), StructField(id,LongType,true), StructField(last_name,StringType,true), StructField(status,StringType,true), StructField(matched,BooleanType,true))
Existing value schema: StructType(StructField(first_name,StringType,true), StructField(id,LongType,false), StructField(last_name,StringType,true), StructField(status,StringType,true), StructField(matched,BooleanType,true))
The notebook console shows this message:
StreamingQueryException: Job aborted. Streaming Query === Identifier: [id = ab5d802a-wxyz, runId = 4496d07c-abcd] Current Committed Offsets: {FileStreamSource[/src2]: {"logOffset":1},FileStreamSource[/src1]: {"logOffset":2}} Current Available Offsets: {FileStreamSource[/src2]: {"logOffset":2},FileStreamSource[/src1]: {"logOffset":2}}
Current State: ACTIVE Thread State: RUNNABLE
Logical Plan: Project [id#67L, first_name#66, last_name#68, status#69, addr1#80, addr2#81] Join Inner, (id#67L = id#82L) StreamingExecutionRelation FileStreamSource[/src1], [first_name#66, id#67L, last_name#68, status#69] StreamingExecutionRelation FileStreamSource[/src2], [addr1#80, addr2#81, id#82L]
src1 location JSON file schema: id (unique key), first_name, last_name, status
src2 location JSON file schema: id (unique key), addr1, addr2
PySpark join code:
joined_table = (
    src1_read.join(src2_read, ["id"])
    .writeStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("checkpointLocation", f"{change_path}/_checkpoint")
    .start(change_path)
)
joined_table.awaitTermination()
I want to create a Spark stream-stream join that is fault tolerant and can restart smoothly even if data is added to the src1 or src2 location after the session stops.

Unable to cache Dataset in Apache Spark Java using structured streaming

Design
1. Using Spark structured streaming, read data from a Kafka topic.
Dataset<Row> df = spark.readStream().format("kafka").option(...).load();
2. Use a Spark read to load data from a Postgres table (approx. 6M records).
Dataset<Row> tableDf = spark.read().format("jdbc").option(...).load();
3. Perform a join on the #1 and #2 datasets and push the result to another Kafka topic (a sketch of this step follows the list).
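A minimal sketch of step #3, assuming the Kafka dataset has already been parsed into columns that include an id join key; the join column, broker address, topic and checkpoint path are placeholders:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

public class JoinAndPublish {
    // Joins the streaming Kafka dataset with the static JDBC dataset and writes to another topic
    public static StreamingQuery start(Dataset<Row> kafkaDf, Dataset<Row> lookupDf) throws Exception {
        Dataset<Row> joined = kafkaDf.join(lookupDf, "id");  // stream-static join on a hypothetical "id" column

        return joined
            .select(to_json(struct(col("*"))).alias("value"))  // Kafka sink expects a 'value' column
            .writeStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("topic", "joined-output")
            .option("checkpointLocation", "/tmp/checkpoints/joined-output")
            .start();
    }
}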
Problem Statement
The data fetched in #2 needs to be refreshed periodically at some interval 'X', as the source data gets modified.
So we need a mechanism to cache the dataset created at #2 until the refresh happens.
As I am new to Spark, I have already tried using persist()/cache() together with a refresh stream, which works fine and refreshes the data at the set interval:
public void loadTables() {
    this.loadLookUpTable();
    // Rate source emitting one row per second, used only as a periodic refresh trigger
    Dataset<Row> staticRefreshStream = spark.readStream().format("rate")
        .option("rowsPerSecond", 1)
        .option("numPartitions", 1)
        .load()
        .selectExpr("CAST(value AS LONG) AS trigger");
    staticRefreshStream.writeStream()
        .outputMode("append")
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (df2, batchId) -> this.refreshLookUpTable())
        .queryName("RefreshStream")
        .trigger(Trigger.ProcessingTime(1, TimeUnit.HOURS))
        .start();
}

public void loadLookUpTable() {
    this.tableDataset = this.fetchLookUpTable("table name");
    this.tableDataset.persist();
}

public Dataset<Row> fetchLookUpTable(String tableName) {
    return spark.read().format("jdbc").option(..url..uname..pwd..driver).load();
}

public void refreshLookUpTable() {
    this.tableDataset.unpersist();
    this.loadLookUpTable();
}
But the problem is that every time a new batch comes from the Kafka input at #1, it somehow gets the refreshed data from the database, without the refresh stream created above even being triggered.
I want to prevent the database from being hit until the refresh-stream trigger fires.

How to add session properties of presto in spark

Is there any way to set Presto session parameters in Spark while building a DataFrame from it?
public Dataset<Row> readPrestoTbl() {
    return sparksession
        .read()
        .jdbc(dcrIdentity.getProperty(env + "." + "presto_url")
                + "?SSL="
                + dcrIdentity.getProperty(env + "." + "presto_client_SSL"),
            demoLckQuery, getDBProperties());
}

private Properties getDBProperties() {
    Properties dbProperties = new Properties();
    dbProperties.put("user", prestoCredentials.getUsername());
    dbProperties.put("password", prestoCredentials.getPassword());
    dbProperties.put("Driver", "io.prestosql.jdbc.PrestoDriver");
    dbProperties.put("task.max-worker-threads", "10");
    return dbProperties;
}
In the same way that I have set the task.max-worker-threads property, is there any option to set session properties like required_workers_count or query_max_run_time, etc.?
I also tried the options below, but every time it says Unrecognized connection property 'sessionProperties'.
While adding it in the properties:
dbProperties.put("sessionProperties","task.max-worker-threads:10");
While loading in Spark:
.option("sessionProperties", "task.max-worker-threads:10")
The Trino (formerly PrestoSQL) JDBC driver supports the sessionProperties property:
https://trino.io/docs/current/installation/jdbc.html?highlight=sessionproperties#parameter-reference
There is also a blog post about the rebranding:
https://trino.io/blog/2020/12/27/announcing-trino.html
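If you can switch to the Trino driver, a minimal sketch might look like the following; the host, catalog, schema, credentials and query are placeholders, query_max_run_time is just one example of a session property, and the exact sessionProperties syntax is described in the parameter reference linked above:
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TrinoSessionPropsExample {
    public static Dataset<Row> readWithSessionProps(SparkSession spark) {
        Properties dbProperties = new Properties();
        dbProperties.put("user", "trino_user");          // placeholder credentials
        dbProperties.put("password", "trino_password");
        dbProperties.put("driver", "io.trino.jdbc.TrinoDriver");
        // Session properties are passed as key:value pairs via the sessionProperties parameter
        dbProperties.put("sessionProperties", "query_max_run_time:2h");

        return spark.read()
            .jdbc("jdbc:trino://trino-host:443/hive/default?SSL=true",
                "(SELECT * FROM demo_table) t",
                dbProperties);
    }
}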

Latency in BigQuery Data Availability

I am using streaming inserts to load a single row at a time into a BigQuery table.
This is the code:
def insertBigQueryTable(tableName: String, datasetName: String, rowContent: java.util.Map[String, Object]): Unit = {
  val bigquery = BigQueryOptions.getDefaultInstance.getService
  try {
    val tableId = TableId.of(datasetName, tableName)
    val response = bigquery.insertAll(InsertAllRequest.newBuilder(tableId).addRow(rowContent).build())
    if (response.hasErrors()) {
      // Iterate the insert errors once; calling iterator() on every loop pass would never advance
      val errors = response.getInsertErrors.entrySet().iterator()
      while (errors.hasNext) {
        val error = errors.next()
        println(s"error while loading in bigquery ${error.getValue}")
      }
    }
  } catch {
    case e: BigQueryException => e.printStackTrace()
  }
}
I am able to query the data instantly via the query console in BigQuery.
Then I load the table via a Spark job (a different job) running on a Dataproc cluster, but the data is not immediately available in the Spark dataframe.
This is what I am doing:
def biqQueryToDFDefault(tabName: String, spark: SparkSession): DataFrame =
  spark.read.format("bigquery").option("table", tabName).load()
I am trying to understand whether this is expected, or whether there is a different way I should be handling it (like trying to load the single row via a Spark job)?
Data streamed into a BigQuery table takes a while to become fully available. You can SELECT it immediately, but it sits inside a streaming buffer for a while until it is ready to be used by operations like UPDATE, DELETE, and so on.
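To see whether rows are still sitting in the streaming buffer, a small check with the google-cloud-bigquery Java client (the same library the insert code uses from Scala) could look like this; the dataset and table names are placeholders:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class StreamingBufferCheck {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Placeholder dataset/table names
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));
        StandardTableDefinition definition = table.getDefinition();
        StandardTableDefinition.StreamingBuffer buffer = definition.getStreamingBuffer();
        if (buffer != null) {
            System.out.println("Estimated rows still in the streaming buffer: " + buffer.getEstimatedRows());
        } else {
            System.out.println("No active streaming buffer; streamed rows have been flushed to managed storage.");
        }
    }
}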

Cassandra Trigger Exception: InvalidQueryException: table of additional mutation does not match primary update table

I am using a Cassandra trigger on a table. I am following the example and loading the trigger jar with 'nodetool reloadtriggers'. Then I am using the
'CREATE TRIGGER mytrigger ON ..'
command from cqlsh to create the trigger on my table.
When I add an entry into that table, my audit table is populated.
But when I call a method from within my Java application, which persists an entry into my table using
'session.execute(BoundStatement)', I get this exception:
InvalidQueryException: table of additional mutation does not match primary update table
Why do the insertion into the table and the audit work when done directly with cqlsh, and why does it fail when doing pretty much exactly the same thing from the Java application?
I am using this as my AuditTrigger, very simplified (all operations other than row insertion are left out):
public class AuditTrigger implements ITrigger {
    private Properties properties = loadProperties();

    public Collection<Mutation> augment(Partition update) {
        String auditKeyspace = properties.getProperty("keyspace");
        String auditTable = properties.getProperty("table");
        CFMetaData metadata = Schema.instance.getCFMetaData(auditKeyspace, auditTable);
        PartitionUpdate.SimpleBuilder audit =
                PartitionUpdate.simpleBuilder(metadata, UUIDGen.getTimeUUID());
        // 'row' comes from iterating the rows of 'update' in the full version (left out in this simplified code)
        if (row.primaryKeyLivenessInfo().timestamp() != Long.MIN_VALUE) {
            // Row insertion
            JSONObject obj = new JSONObject();
            obj.put("message_id", update.metadata().getKeyValidator()
                    .getString(update.partitionKey().getKey()));
            audit.row().add("operation", "ROW INSERTION");
        }
        audit.row().add("keyspace_name", update.metadata().ksName)
                .add("table_name", update.metadata().cfName)
                .add("primary_key", update.metadata().getKeyValidator()
                        .getString(update.partitionKey().getKey()));
        return Collections.singletonList(audit.buildAsMutation());
    }
}
It seems that when using a BoundStatement the trigger fails:
session.execute(boundStatement);
Using a regular CQL query string works, though:
session.execute(query)
We are using BoundStatement everywhere within our application, though, and cannot change that.
Any help would be appreciated.
Thanks
