Latency in BigQuery Data Availability - apache-spark

I am using streaming inserts to load a single row at a time into a BigQuery table.
This is the code:
import com.google.cloud.bigquery.{BigQueryException, BigQueryOptions, InsertAllRequest, TableId}

def insertBigQueryTable(tableName: String, datasetName: String, rowContent: java.util.Map[String, Object]): Unit = {
  val bigquery = BigQueryOptions.getDefaultInstance.getService
  try {
    val tableId = TableId.of(datasetName, tableName)
    val response = bigquery.insertAll(InsertAllRequest.newBuilder(tableId).addRow(rowContent).build())
    if (response.hasErrors) {
      // Reuse a single iterator; calling errors.iterator() on every loop check never advances it
      val errorIterator = response.getInsertErrors.entrySet().iterator()
      while (errorIterator.hasNext) {
        val error = errorIterator.next()
        println(s"error while loading in bigquery ${error.getValue}")
      }
    }
  } catch {
    case e: BigQueryException => e.printStackTrace()
  }
}
I am able to query the data instantly via the query console in BigQuery.
Then I am loading the table via a Spark job (a different job) running on a Dataproc cluster, but the data is not available in the Spark DataFrame immediately.
This is what I am doing:
def biqQueryToDFDefault(tabName: String, spark: SparkSession): DataFrame =
  spark.read.format("bigquery").option("table", tabName).load()
I am trying to understand whether this is expected, or if there is a different way I should be handling it (like trying to load the single row via a Spark job)?

Data streamed into a BigQuery table takes a while to become fully available. You can SELECT it immediately, but it sits inside the streaming buffer for a while, and until it is flushed it is not visible to operations such as UPDATE, DELETE, copies, or exports. That is also likely why the Spark connector, which reads the table's underlying storage rather than running an interactive query, does not see the new row right away.
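If it helps to confirm what is happening, the table metadata exposes a streaming-buffer section. Below is a minimal sketch using the same google-cloud-bigquery client as in the question; the helper name is mine, and it assumes StandardTableDefinition.getStreamingBuffer() is available in your client version:

import com.google.cloud.bigquery.{BigQueryOptions, StandardTableDefinition, TableId}

// Hypothetical helper: a non-null streaming buffer means recently streamed rows
// may not yet be visible to the Spark connector's storage-level read.
def hasBufferedRows(datasetName: String, tableName: String): Boolean = {
  val bigquery = BigQueryOptions.getDefaultInstance.getService
  val table = bigquery.getTable(TableId.of(datasetName, tableName))
  val definition = table.getDefinition[StandardTableDefinition]
  definition.getStreamingBuffer != null
}

You could poll something like this before launching the Spark job, or simply delay the read until the buffer has drained.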

Related

Unable to cache Dataset in Apache Spark Java using structured streaming

Design
Using Spark Structured Streaming, reading data from a Kafka topic:
Dataset<Row> df = spark.readStream().format("kafka").option(...).load();
Spark read to fetch data from a Postgres table (approx. 6M records):
Dataset<Row> tableDf = spark.read().format("jdbc").option(...).load();
Perform joins on the #1 and #2 datasets and push the result to another Kafka topic.
Problem Statement
The data fetched in #2 needs to be refreshed periodically at some interval 'X', as the source data gets modified.
So we need a mechanism to cache the dataset created at #2 until the refresh happens.
As I am new to Spark, I have already tried using persist()/cache() together with a refresh stream, which works fine and refreshes the data at a set interval:
public void loadTables() {
    this.loadLookUpTable();
    Dataset<Row> staticRefreshStream = spark.readStream().format("rate")
            .option("rowsPerSecond", 1)
            .option("numPartitions", 1)
            .load()
            .selectExpr("CAST(value AS LONG) as trigger");
    staticRefreshStream.writeStream()
            .outputMode("append")
            .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (df2, batchId) -> this.refreshLookUpTable())
            .queryName("RefreshStream")
            .trigger(Trigger.ProcessingTime(1, TimeUnit.HOURS))
            .start();
}

public void loadLookUpTable() {
    this.tableDataset = this.fetchLookUpTable("table name");
    this.tableDataset.persist();
}

public Dataset<Row> fetchLookUpTable(String tableName) {
    return spark.read().format("jdbc").option(/* ..url..uname..pwd..driver.. */).load();
}

public void refreshLookUpTable() {
    this.tableDataset.unpersist();
    this.loadLookUpTable();
}
But the problem is that every time a new batch arrives from the Kafka input at #1, it somehow gets the refreshed data from the database without the RefreshStream created above even being called.
I want to restrict this so that the database is not hit until the RefreshStream trigger fires.

Spark Stateful Structured Streaming: State getting too big in mapGroupsWithState

I am trying to use the mapGroupsWithState method for stateful structured streaming on my incoming stream of data. The problem I face is that the key I am choosing for groupByKey makes my state grow too big too fast. The obvious way out would be to change the key, but the business logic I wish to apply in the update method requires the key to be exactly the same as I have it right now, OR, if it is possible, to access the GroupState for all keys.
For example, I have a stream of data coming in from various Organizations and typically an organization contains userId, personId etc. Please see the code below:
val stream: Dataset[User] = dataFrame.as[User]
val noTimeout = GroupStateTimeout.NoTimeout
val statisticStream = stream
  .groupByKey(key => key.orgId)
  .mapGroupsWithState(noTimeout)(updateUserStatistic)
val df = statisticStream.toDF()
val query = df
  .writeStream
  .outputMode(Update())
  .option("checkpointLocation", s"$checkpointLocation/$name")
  .foreach(new UserCountWriter(spark.sparkContext.getConf))
  .queryName(name)
  .trigger(Trigger.ProcessingTime(Duration.apply("10 seconds")))
  .start()
case classes:
case class User(
  orgId: Long,
  profileId: Long,
  userId: Long)

case class UserStatistic(
  orgId: Long,
  known: Long,
  unknown: Long,
  userSeq: Seq[User])
update method:
def updateUserStatistic(
    orgId: Long,
    newEvents: Iterator[User],
    oldState: GroupState[UserStatistic]): UserStatistic = {
  var state: UserStatistic = if (oldState.exists) oldState.get else UserStatistic(orgId, 0L, 0L, Seq.empty)
  for (event <- newEvents) {
    // business logic: check whether the userId in this organization is of a certain type
    // and update the known/unknown attributes for that particular user accordingly
  }
  oldState.update(state)
  state
}
The problem gets worse when I have to execute this on the driver-executor model, as I am expecting 1-10 million users in every organization, which could mean that many entries held in a single key's state on one executor (correct me if I am wrong in understanding this).
Possible solutions that failed:
Grouping by userId as the key - because then I am unable to get all userIds for a given orgId, since GroupState is kept per aggregation key, which here would be the userId; so for every new userId a new state is created, even if it belongs to the same organization.
Any help or suggestions are appreciated.
Your state keeps increasing in size because in the current implementation no key/state pair will ever be removed from the GroupState.
To mitigate exactly the problem you are facing (an infinitely increasing state), the mapGroupsWithState method allows you to use a timeout. You can choose between two types of timeouts:
Processing-time timeouts using GroupStateTimeout.ProcessingTimeTimeout with GroupState.setTimeoutDuration(), or
Event-time timeouts using GroupStateTimeout.EventTimeTimeout with GroupState.setTimeoutTimestamp().
Note that the difference between them is a duration-based timeout versus the more flexible timestamp-based timeout.
In the ScalaDocs of the trait GroupState you will find a nice template on how to use timeouts in your mapping function:
def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {
  if (state.hasTimedOut) {               // If called when timing out, remove the state
    state.remove()
  } else if (state.exists) {             // If state exists, use it for processing
    val existingState = state.get        // Get the existing state
    val shouldRemove = ...               // Decide whether to remove the state
    if (shouldRemove) {
      state.remove()                     // Remove the state
    } else {
      val newState = ...
      state.update(newState)             // Set the new state
      state.setTimeoutDuration("1 hour") // Set the timeout
    }
  } else {
    val initialState = ...
    state.update(initialState)           // Set the initial state
    state.setTimeoutDuration("1 hour")   // Set the timeout
  }
  ...
  // return something
}
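Applied to the code in the question, the wiring could look roughly like this. It is a minimal sketch assuming a processing-time timeout of one hour is acceptable for your use case; the business logic and the output side stay as in the question:

val statisticStream = stream
  .groupByKey(_.orgId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateUserStatistic)

def updateUserStatistic(
    orgId: Long,
    newEvents: Iterator[User],
    state: GroupState[UserStatistic]): UserStatistic = {
  if (state.hasTimedOut) {
    val last = state.get
    state.remove()                       // drop the idle key so the state store can shrink
    last
  } else {
    var current = if (state.exists) state.get else UserStatistic(orgId, 0L, 0L, Seq.empty)
    // ... apply the business logic to newEvents and update `current` here ...
    state.update(current)
    state.setTimeoutDuration("1 hour")   // expire keys that receive no events for an hour
    current
  }
}

Independently of timeouts, keeping only the aggregated counters in UserStatistic instead of the full userSeq would stop each key's state from growing with the number of users.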

Any idea on how to use GCP Stackdriver to collect logs when the query result of BigQuery is empty (using Python)?

I have a Cloud Function which executes a query and stores the job result into a new BigQuery table. I would like to write a log entry to Stackdriver whenever the query job result is empty, meaning no records were found for that particular query execution. Can anyone suggest how to achieve this?
cloud function code:
from google.cloud import bigquery

def main(request):
    query = "select * from `myproject.mydataset.mytable`"
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig()
    # destination_dataset, destination_project and destination_table are defined elsewhere
    dest_dataset = client.dataset(destination_dataset, destination_project)
    dest_table = dest_dataset.table(destination_table)
    job_config.destination = dest_table
    job_config.create_disposition = 'CREATE_IF_NEEDED'
    job_config.write_disposition = 'WRITE_APPEND'
    job = client.query(query, location='US', job_config=job_config)
    job.result()
Go to the Stackdriver Logging section in the GCP console.
Activate the advanced filter (click on the arrow on the right of the filter field).
Fill the filter field with this:
resource.type="bigquery_resource"
"device_states"
protoPayload.methodName="jobservice.getqueryresults"
NOT protoPayload.serviceData.jobGetQueryResultsResponse.totalResults>"0"
This selects only the entries where the total result count equals 0. The trick is that totalResults is missing when there are 0 results; with this syntax, the filter still works.
Then you can create an export to BigQuery, Storage or Pub/Sub, or create a metric. If you create a metric, you can use it in Stackdriver Monitoring and alerting. It all depends on what you want to do with it.
You have to create a sink for the logs, either in code or using the GCP Logging UI:
https://cloud.google.com/logging/docs/export/configure_export_v2
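For the in-code route, the Cloud Logging client library can create the sink. Here is a rough sketch in Scala on top of the Java client; the sink name, project id and dataset name are placeholder assumptions, and the filter reuses the one from the previous answer:

import com.google.cloud.logging.{LoggingOptions, SinkInfo}
import com.google.cloud.logging.SinkInfo.Destination.DatasetDestination

// Sketch: create a sink that exports matching log entries to a BigQuery dataset.
// "empty-query-results-sink", "my-project" and "my_logs_dataset" are placeholders.
val filter = Seq(
  "resource.type=\"bigquery_resource\"",
  "protoPayload.methodName=\"jobservice.getqueryresults\"",
  "NOT protoPayload.serviceData.jobGetQueryResultsResponse.totalResults>\"0\""
).mkString("\n")

val logging = LoggingOptions.getDefaultInstance.getService
val sinkInfo = SinkInfo
  .newBuilder("empty-query-results-sink", DatasetDestination.of("my-project", "my_logs_dataset"))
  .setFilter(filter)
  .build()
logging.create(sinkInfo)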
Or write the log entries in code and let the sink export them to BigQuery:
private static MonitoredResource monitoredResource =
        MonitoredResource.newBuilder("global")
                .addLabel("project_id", logging.getOptions().getProjectId())
                .build();

public static void writeLog(Severity severity, String logName, Map<String, String> jsonMap) {
    // `logging` and `limitMap` are assumed to be defined elsewhere in the class
    List<Map<String, String>> maps = limitMap(jsonMap);
    for (Map<String, String> map : maps) {
        LogEntry logEntry = LogEntry.newBuilder(Payload.JsonPayload.of(map))
                .setSeverity(severity)
                .setLogName(logName)
                .setResource(monitoredResource)
                .build();
        logging.write(Collections.singleton(logEntry));
    }
}
https://cloud.google.com/bigquery/docs/writing-results

Cassandra Trigger Exception: InvalidQueryException: table of additional mutation does not match primary update table

I am using a Cassandra trigger on a table. I am following the example and loading the trigger jar with 'nodetool reloadtriggers'. Then I am using the
'CREATE TRIGGER mytrigger ON ..'
command from cqlsh to create a trigger on my table.
When I add an entry to that table, my audit table is populated.
But when I call a method from within my Java application, which persists an entry into my table using
'session.execute(BoundStatement)', I get this exception:
InvalidQueryException: table of additional mutation does not match primary update table
Why do the insertion into the table and the audit work when doing it directly with cqlsh, and why does it fail when doing pretty much exactly the same thing from the Java application?
I am using this as the AuditTrigger, very simplified (I left out all operations other than row insertion):
public class AuditTrigger implements ITrigger {
    private Properties properties = loadProperties();

    public Collection<Mutation> augment(Partition update) {
        String auditKeyspace = properties.getProperty("keyspace");
        String auditTable = properties.getProperty("table");
        CFMetaData metadata = Schema.instance.getCFMetaData(auditKeyspace, auditTable);
        PartitionUpdate.SimpleBuilder audit =
                PartitionUpdate.simpleBuilder(metadata, UUIDGen.getTimeUUID());
        // `row` comes from iterating over the rows of `update` (omitted here for brevity)
        if (row.primaryKeyLivenessInfo().timestamp() != Long.MIN_VALUE) {
            // Row insertion
            JSONObject obj = new JSONObject();
            obj.put("message_id", update.metadata().getKeyValidator()
                    .getString(update.partitionKey().getKey()));
            audit.row().add("operation", "ROW INSERTION");
        }
        audit.row().add("keyspace_name", update.metadata().ksName)
                .add("table_name", update.metadata().cfName)
                .add("primary_key", update.metadata().getKeyValidator()
                        .getString(update.partitionKey().getKey()));
        return Collections.singletonList(audit.buildAsMutation());
    }
}
It seems that when using a BoundStatement, the trigger fails:
session.execute(boundStatement);
Using a regular CQL query string works, though:
session.execute(query)
We are using BoundStatement everywhere within our application and cannot change that.
Any help would be appreciated.
Thanks

Spark 2 Job Monitoring with Multiple Simultaneous Jobs on a Spark Context (JobProgressListener)

On Spark 2.0.x, I have been using a JobProgressListener implementation to retrieve Job/Stage/Task progress information in real-time from our cluster. I understand how the event flow works, and successfully receive updates on the work.
My problem is that we have several different submissions running at the same time on the same Spark Context, and it is seemingly impossible to differentiate between which Job/Stage/Task belongs to each submittal. Each Job/Stage/Task receives a unique id, which is great. However, I'm looking for a way to provide a submission "id" or "name" that would be returned along with the received JobProgressListener event objects.
I realize that the "Job Group" can be set on the Spark Context, but if multiple jobs are simultaneously running on the same context, they will become scrambled.
Is there a way I can sneak in custom properties that would be returned with the listener events for a single SQLContext? In so doing, I should be able to link up subsequent Stage and Task events and get what I need.
Please note: I am not using spark-submit for these jobs. They are being executed using Java references to a SparkSession/SQLContext.
Thanks for any solutions or ideas.
I'm using a local property - this can be accessed from the listener during the onStageSubmitted event. After that I use the corresponding stageId in order to identify the tasks executed during that stage.
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

Future {
  sc.setLocalProperty("job-context", "second")
  val listener = new MetricListener("second")
  sc.addSparkListener(listener)
  // do some spark actions
  val df = spark.read.load("...")
  val countResult = df.filter(....).count()
  println(listener.rows)
  sc.removeSparkListener(listener)
}

class MetricListener(name: String) extends SparkListener {
  var rows: Long = 0L
  var stageId = -1

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    if (stageSubmitted.properties.getProperty("job-context") == name) {
      stageId = stageSubmitted.stageInfo.stageId
    }
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.stageId == stageId)
      rows = rows + taskEnd.taskMetrics.inputMetrics.recordsRead
  }
}
