Unable to cache Dataset in Apache Spark Java using structured streaming - apache-spark

Design
Using Spark Structured Streaming (a minimal sketch of the whole flow follows the list):

1. Read data from a Kafka topic:
   Dataset<Row> df = spark.readStream().format("kafka").option(...).load();
2. Read data from a Postgres table (approx. 6M records) with a batch read:
   Dataset<Row> tableDf = spark.read().format("jdbc").option(...).load();
3. Join the datasets from #1 and #2 and push the result to another Kafka topic.
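For reference, a hedged Java sketch of such a stream-static join; the servers, topics, table name, credentials and the join column are illustrative placeholders, not taken from the original setup:

// #1: streaming source, a Kafka topic (all options are placeholders).
Dataset<Row> df = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:9092")
        .option("subscribe", "input-topic")
        .load()
        .selectExpr("CAST(key AS STRING) AS join_key", "CAST(value AS STRING) AS payload");

// #2: static source, a Postgres table read over JDBC (all options are placeholders).
Dataset<Row> tableDf = spark.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "lookup_table")
        .option("user", "user")
        .option("password", "password")
        .load();

// #3: stream-static join (assumes lookup_table also exposes a join_key column),
// with the result pushed to another Kafka topic.
df.join(tableDf, "join_key")
        .selectExpr("join_key AS key", "to_json(struct(*)) AS value")
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:9092")
        .option("topic", "output-topic")
        .option("checkpointLocation", "/tmp/checkpoints/stream-static-join")
        .start();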
Problem Statement
The data fetched in #2 needs to be refreshed periodically, at some interval 'X', because the source data gets modified.
So we need a mechanism to cache the dataset created in #2 until the refresh happens.
I am new to Spark. I have already tried persist()/cache() together with a refresh stream, which works and reloads the data at the configured interval:
public void loadTables()
{
    // Load and cache the lookup table once up front.
    this.loadLookUpTable();

    // A "rate" stream used purely as a periodic trigger for the refresh.
    Dataset<Row> staticRefreshStream = spark.readStream().format("rate")
            .option("rowsPerSecond", 1)
            .option("numPartitions", 1)
            .load()
            .selectExpr("CAST(value AS LONG) AS trigger");

    staticRefreshStream.writeStream()
            .outputMode("append")
            .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (df2, batchId) -> this.refreshLookupTable())
            .queryName("RefreshStream")
            .trigger(Trigger.ProcessingTime(1, TimeUnit.HOURS))
            .start();
}

public void loadLookUpTable()
{
    this.tableDataset = this.fetchLookUpTable("table name");
    this.tableDataset.persist();
}

public Dataset<Row> fetchLookUpTable(String tableName)
{
    return spark.read().format("jdbc").option(..url..uname..pwd..driver).load();
}

public void refreshLookupTable()
{
    this.tableDataset.unpersist();
    this.loadLookUpTable();
}
But the problem is that every time a new batch arrives from the Kafka input in #1, it somehow sees the refreshed data from the database, without the RefreshStream created above ever being triggered.
I want to prevent the database from being queried until the RefreshStream trigger fires.
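One general Spark point that may be relevant here (a hedged aside, not a confirmed fix for this exact symptom): persist() is lazy, so the lookup Dataset is only materialized in the cache once an action runs against it. Forcing materialization right after persisting makes the caching explicit; the sketch below reuses the method and placeholder table name from the code above.

public void loadLookUpTable()
{
    this.tableDataset = this.fetchLookUpTable("table name");
    // persist() alone does not read anything; trigger an action so the
    // JDBC table is fetched once and kept in the cache from this point on.
    this.tableDataset.persist();
    this.tableDataset.count();
}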

Related

Pagination of the result of a JdbcPagingItemReader limited to the first page

I'm missing some details on how to execute the pagination of a SQL SELECT (of almost 100,000 records) in Spring Batch.
My batch has no parallelism, no partitioning and no remote chunking.
It only executes one query, processes every record and writes the result to a CSV file.
There is no custom ItemReader or InputStream class.
In my BatchConfig class I have my input bean that prepares the JdbcPagingItemReader:
@StepScope
@Bean(name = "myinput")
public JdbcPagingItemReader<MyDTO> input(DataSource dataSource, PagingQueryProvider queryProvider /*, other job params */) {...}
Inside, I call a method on another object that configures the JdbcPagingItemReader to return:
public JdbcPagingItemReader<MyDTO> myMethod(/** various params: dataSource, size of the pagination, queryProvider **/) {
    JdbcPagingItemReader<MyDTO> databaseReader = new JdbcPagingItemReader<MyDTO>();
    databaseReader.setDataSource(dataSource);
    databaseReader.setPageSize(Integer.parseInt(size));
    Map<String, Object> params = new HashMap<String, Object>();
    // my job params are put into the params map
    databaseReader.setParameterValues(params);
    databaseReader.setRowMapper(new MyMapper());
    databaseReader.setQueryProvider(queryProvider);
    return databaseReader;
}
Another class declares the queryProvider
public SqlPagingQueryProviderFactoryBean queryProvider(DataSource dataSource) {
SqlPagingQueryProviderFactoryBean queryProvider = new SqlPagingQueryProviderFactoryBean();
queryProvider.setDataSource(dataSource);
queryProvider.setSelectClause(select().toString());
queryProvider.setFromClause(from().toString());
queryProvider.setWhereClause(where().toString());
queryProvider.setSortKeys(this.sortBy());// I declare only 1 field in descending order
return queryProvider;
}
At this point I have two questions:

1. I verified that, keeping the same pageSize but changing the sorting field, the number of records in the final CSV file changes. I read that the sorting field has to be a primary key, but my SELECT is against a view, not a physical table: is a primary key in sortBy() mandatory in this case?
2. I verified that databaseReader.setPageSize() limits the number of records read by my SELECT, but I expected pagination that reads all of the data. Right now the batch reads only the first page of results and doesn't move forward.

My idea is to use partitioning, but that seems a bit over-engineered and I suspect I'm overlooking something in my code: do you have any suggestions, please?
I read this question (Spring Batch: JdbcPagingItemReader pagination) and the solution from @Mahmoud Ben Hassine, but unfortunately I can't test it in my environment because I lack a critical mass of data in the DB.
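As a hedged sketch of the sort-key side (MY_UNIQUE_ID and my_view are placeholders): Spring Batch's paging readers use the sort key to delimit pages, so it must uniquely identify each row of the result set; for a view with no primary key, any column (or combination of columns) that is unique per row will do. A non-unique key can make pages overlap or skip rows, which matches record counts changing with the sort field.

import java.util.LinkedHashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean;

public SqlPagingQueryProviderFactoryBean queryProvider(DataSource dataSource) {
    SqlPagingQueryProviderFactoryBean queryProvider = new SqlPagingQueryProviderFactoryBean();
    queryProvider.setDataSource(dataSource);
    queryProvider.setSelectClause("SELECT ...");    // same clauses as in the original config
    queryProvider.setFromClause("FROM my_view");    // placeholder view name
    queryProvider.setWhereClause("WHERE ...");
    // The sort key must be unique per row for the paging to be correct,
    // even when the FROM clause is a view rather than a table.
    Map<String, Order> sortKeys = new LinkedHashMap<>();
    sortKeys.put("MY_UNIQUE_ID", Order.DESCENDING); // placeholder unique column
    queryProvider.setSortKeys(sortKeys);
    return queryProvider;
}

Note that JdbcPagingItemReader fetches the next page on its own as the chunk-oriented step keeps calling read(), so no extra paging code should be needed beyond a correct, unique sort key.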

Latency in BigQuery Data Availability

I am using streaming insert to load a single row at a time into a BigQuery table.
This is the code:
def insertBigQueryTable(tableName: String, datasetName: String, rowContent: java.util.Map[String, Object]): Unit = {
  val bigquery = BigQueryOptions.getDefaultInstance.getService
  try {
    val tableId = TableId.of(datasetName, tableName)
    val response = bigquery.insertAll(InsertAllRequest.newBuilder(tableId).addRow(rowContent).build())
    if (response.hasErrors()) {
      // Iterate the insert errors once; creating a fresh iterator on every
      // loop check would never advance past the first entry.
      val errors = response.getInsertErrors.entrySet().iterator()
      while (errors.hasNext) {
        val error = errors.next()
        println(s"error while loading in bigquery ${error.getValue}")
      }
    }
  }
  catch {
    case e: BigQueryException => e.printStackTrace()
  }
}
I am able to query the data instantly via the query console in BigQuery.
Then I load the table via a Spark job (a different job) running on a Dataproc cluster, but the data is not immediately available in the Spark DataFrame.
This is what I am doing:
def biqQueryToDFDefault(tabName: String, spark: SparkSession):DataFrame =
spark.read.format("bigquery").option("table",tabName).load()
I am trying to understand whether this is expected, or whether there is a different way I should be handling it (like loading the single row via a Spark job as well)?
Data streamed into a BigQuery table takes a while to become fully available. You can SELECT it immediately, but it sits inside a streaming buffer for a while until it is ready to be used by operations like UPDATE, DELETE, ...
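To check whether rows are still sitting in that buffer, here is a hedged Java sketch using the same google-cloud-bigquery client the Scala snippet above relies on; the dataset and table names are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class StreamingBufferCheck {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table")); // placeholders
        StandardTableDefinition definition = table.getDefinition();
        StandardTableDefinition.StreamingBuffer buffer = definition.getStreamingBuffer();
        if (buffer != null) {
            // Rows still in the buffer are queryable via SELECT but have not yet
            // been committed to the table's managed storage.
            System.out.println("Estimated rows in streaming buffer: " + buffer.getEstimatedRows());
        } else {
            System.out.println("No streaming buffer; all streamed rows have been flushed.");
        }
    }
}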

Partial Data Being Ingested To Azure Data Explorer From Event Hub

I currently have an Azure Data Explorer setup that ingests data from Event Hub. For some reason unknown to me, my ingestion table is only seeing about 45% of events. I am testing this by sending 100 events to Event Hub, one at a time. I know my Event Hub is receiving these events because I set up a SQL table to also ingest them (under a separate consumer group), and that table receives 100% of them. My assumption is that I have set up my Azure Data Explorer table incorrectly.
I have a very basic object I am sending
public class TestDocument
{
    [JsonProperty("DocumentId")]
    public string DocumentId { get; set; }

    [JsonProperty("Title")]
    public string Title { get; set; }
}
I have enabled streaming ingestion in Azure
Azure Data Explorer > Configurations > Streaming ingestion (ON)
I have enabled streaming ingestion in my table
.alter table TestTable policy streamingingestion enable
My Table mapping is as follows
.alter table TestTable ingestion json mapping "TestTable_mapping" '[{"column":"DocumentId","datatype":"string","Path":"$[\'DocumentId\']"},{"column":"Title","datatype":"string","Path":"$[\'Title\']"}]'
My data connection settings
Consumer group: Its own group
Event system properties: 0
Table name: TestTable
Data format: JSON
Mapping name: TestTable_mapping
Is there something I am missing here? Consistently, out of 100 events sent, I only see about 45-48 get ingested in my table.
EDIT:
Json payload of TestDocument
{"DocumentId":"10","Title":"TEST"}
Found out what is happening: I was adding a BOM to my serialized object, and it looks like ADX has issues with it. When I serialized my object without a BOM, all data flowed from Event Hub to ADX.
Here's a sample of how I am doing it:
private static readonly JsonSerializer Serializer;
static SerializationHelper()
{
Serializer = JsonSerializer.Create(SerializationSettings);
}
public static void Serialize(Stream stream, object toSerialize)
{
    // Encoding.UTF8 writes a UTF-8 byte-order mark (BOM) at the start of the stream.
    using var streamWriter = new StreamWriter(stream, Encoding.UTF8, DefaultStreamBufferSize, true);
    using var jsonWriter = new JsonTextWriter(streamWriter);
    Serializer.Serialize(jsonWriter, toSerialize);
}
What fixed it:
public static void Serialize(Stream stream, object toSerialize)
{
    // new UTF8Encoding(false) does not emit a BOM, so ADX receives clean JSON.
    using var streamWriter = new StreamWriter(stream, new UTF8Encoding(false), DefaultStreamBufferSize, true);
    using var jsonWriter = new JsonTextWriter(streamWriter);
    Serializer.Serialize(jsonWriter, toSerialize);
}

Cassandra Trigger Exception: InvalidQueryException: table of additional mutation does not match primary update table

I am using a Cassandra trigger on a table. I am following the example and loading the trigger jar with 'nodetool reloadtriggers'. Then I use the
'CREATE TRIGGER mytrigger ON ..'
command from cqlsh to create the trigger on my table.
When I add an entry into that table directly, my audit table is populated.
But when I call a method from within my Java application, which persists an entry into my table using
'session.execute(BoundStatement)', I get this exception:
InvalidQueryException: table of additional mutation does not match primary update table
Why does the insertion into the table and the audit work when done directly with cqlsh, and why does it fail when doing pretty much exactly the same from the Java application?
I am using this as the AuditTrigger, very simplified (everything other than the row-insertion handling has been left out):
public class AuditTrigger implements ITrigger {
    private Properties properties = loadProperties();

    public Collection<Mutation> augment(Partition update) {
        String auditKeyspace = properties.getProperty("keyspace");
        String auditTable = properties.getProperty("table");
        CFMetaData metadata = Schema.instance.getCFMetaData(auditKeyspace, auditTable);
        PartitionUpdate.SimpleBuilder audit =
                PartitionUpdate.simpleBuilder(metadata, UUIDGen.getTimeUUID());

        // 'row' is obtained by iterating the partition's rows (omitted here for brevity).
        if (row.primaryKeyLivenessInfo().timestamp() != Long.MIN_VALUE) {
            // Row insertion
            JSONObject obj = new JSONObject();
            obj.put("message_id", update.metadata().getKeyValidator()
                    .getString(update.partitionKey().getKey()));
            audit.row().add("operation", "ROW INSERTION");
        }

        audit.row().add("keyspace_name", update.metadata().ksName)
                .add("table_name", update.metadata().cfName)
                .add("primary_key", update.metadata().getKeyValidator()
                        .getString(update.partitionKey().getKey()));

        return Collections.singletonList(audit.buildAsMutation());
    }
}
It seems that when using a BoundStatement the trigger fails:
session.execute(boundStatement);
while using a regular CQL query string works:
session.execute(query)
We are using BoundStatement everywhere in our application, though, and cannot change that.
Any help would be appreciated.
Thanks

Spark 2 Job Monitoring with Multiple Simultaneous Jobs on a Spark Context (JobProgressListener)

On Spark 2.0.x, I have been using a JobProgressListener implementation to retrieve Job/Stage/Task progress information in real-time from our cluster. I understand how the event flow works, and successfully receive updates on the work.
My problem is that we have several different submissions running at the same time on the same Spark Context, and it is seemingly impossible to tell which Job/Stage/Task belongs to which submission. Each Job/Stage/Task receives a unique id, which is great. However, I'm looking for a way to provide a submission "id" or "name" that would be returned along with the received JobProgressListener event objects.
I realize that the "Job Group" can be set on the Spark Context, but if multiple jobs are simultaneously running on the same context, they will become scrambled.
Is there a way I can sneak in custom properties that would be returned with the listener events for a single SQLContext? In so doing, I should be able to link up subsequent Stage and Task events and get what I need.
Please note: I am not using spark-submit for these jobs. They are being executed using Java references to a SparkSession/SQLContext.
Thanks for any solutions or ideas.
I'm using a local property; it can be read from the listener in the onStageSubmitted event. After that I use the corresponding stageId to identify the tasks executed during that stage.
Future({
  // Local properties are per thread, so set the tag on the thread that submits the work.
  sc.setLocalProperty("job-context", "second")
  val listener = new MetricListener("second")
  sc.addSparkListener(listener)
  // do some spark actions
  val df = spark.read.load("...")
  val countResult = df.filter(....).count()
  println(listener.rows)
  sc.removeSparkListener(listener)
})

class MetricListener(name: String) extends SparkListener {
  var rows: Long = 0L
  var stageId = -1

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    // Only remember stages whose submitting thread set our local property.
    if (stageSubmitted.properties.getProperty("job-context") == name) {
      stageId = stageSubmitted.stageInfo.stageId
    }
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.stageId == stageId)
      rows = rows + taskEnd.taskMetrics.inputMetrics.recordsRead
  }
}
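Since the question mentions driving the jobs from Java references to a SparkSession, here is a hedged Java adaptation of the same local-property approach; the property name "job-context", the class name and the metric being summed are illustrative, while the callbacks and SparkContext methods are the standard org.apache.spark.scheduler and SparkContext APIs:

import org.apache.spark.SparkContext;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerStageSubmitted;
import org.apache.spark.scheduler.SparkListenerTaskEnd;

public class JobContextListener extends SparkListener {
    private final String name;
    private long rows = 0L;
    private int stageId = -1;

    public JobContextListener(String name) {
        this.name = name;
    }

    @Override
    public void onStageSubmitted(SparkListenerStageSubmitted stageSubmitted) {
        // Only remember stages whose submitting thread tagged itself with our name.
        java.util.Properties props = stageSubmitted.properties();
        if (props != null && name.equals(props.getProperty("job-context"))) {
            stageId = stageSubmitted.stageInfo().stageId();
        }
    }

    @Override
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        if (taskEnd.stageId() == stageId) {
            rows += taskEnd.taskMetrics().inputMetrics().recordsRead();
        }
    }

    public long getRows() {
        return rows;
    }
}

Usage from the thread that submits one logical submission (spark being the existing SparkSession):

SparkContext sc = spark.sparkContext();
sc.setLocalProperty("job-context", "second");          // tag this submission's thread
JobContextListener listener = new JobContextListener("second");
sc.addSparkListener(listener);
// ... run the Spark actions for this submission ...
System.out.println(listener.getRows());
sc.removeSparkListener(listener);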
