How to implement timeWindow() in Apache Flink's StreamTableEnvironment?

Hi everyone,
I want to use a Flink time window in StreamTableEnvironment.
I have previously used the timeWindow(Time.seconds()) function with a DataStream that comes from a Kafka topic.
For external reasons I am converting this DataStream to a Table and applying a SQL query with sqlQuery().
I want to do X-second time window aggregations with SQL and then send the result to another Kafka topic.
Data source:
val stream = senv
.addSource(new FlinkKafkaConsumer[String]("flink", new SimpleStringSchema(), properties))
Example of the previous aggregation:
val windowCounts = stream.keyBy("x").timeWindow(Time.seconds(5), Time.seconds(5))
Current Table:
val tableA = tableEnv.fromDataStream(parsed, 'user, 'product, 'amount)
In this part there should be a query that performs an aggregation every X seconds:
val result = tableEnv.sqlQuery(
s"SELECT * FROM $tableA WHERE amount > 2".stripMargin)
More or less, my aggregation will be count(y) OVER (PARTITION BY x).
Thank you!

Ververica's training for Flink SQL will help you with this. It includes some exercises/examples that cover just this kind of query in the section on Querying Dynamic Tables with SQL.
You'll have to establish the source of timing information for each event, which can be either processing time or event time, after which the query corresponding to stream.keyBy("x").timeWindow(Time.seconds(5), Time.seconds(5)) will be something like this:
SELECT
x,
TUMBLE_END(timestamp, INTERVAL '5' SECOND) AS t,
COUNT(*) AS cnt
FROM Events
GROUP BY
x, TUMBLE(timestamp, INTERVAL '5' SECOND);
For details on how to work with time attributes, see the Introduction to Time Attributes.
And for more detailed documentation on windowing with Flink SQL see the docs on Group Windows.
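Putting this together with the table from the question, a rough, untested sketch (assuming a processing-time attribute and the older Scala Table API that goes with timeWindow) could look like this:
// Expose a processing-time attribute when converting the stream, so the SQL
// TUMBLE window has a time attribute to group on ('proctime is a new field here).
val tableA = tableEnv.fromDataStream(parsed, 'user, 'product, 'amount, 'proctime.proctime)

val result = tableEnv.sqlQuery(
  s"""SELECT `user`,
     |       TUMBLE_END(proctime, INTERVAL '5' SECOND) AS w_end,
     |       COUNT(product) AS cnt
     |FROM $tableA
     |GROUP BY `user`, TUMBLE(proctime, INTERVAL '5' SECOND)""".stripMargin)
From there, result can be converted back to a DataStream (e.g. with toAppendStream, which is fine for an append-only tumbling-window aggregate) and written to the output topic with a FlinkKafkaProducer.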

Related

Stream writes having multiple identical keys to delta lake

I am writing streams to Delta Lake through Spark Structured Streaming. Each streaming batch contains key-value pairs (and also contains a timestamp as one column). Delta Lake doesn't support an update when the source (the streaming batch) has multiple rows with the same key, so I want to update Delta Lake using only the record with the latest timestamp per key. How can I do this?
This is code snippet I am trying:
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {
  println(s"Executing batch $batchId ...")
  microBatchOutputDF.show()
  deltaTable.as("t")
    .merge(
      microBatchOutputDF.as("s"),
      "s.key = t.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}
Thanks in advance.
You can eliminate records having an older timestamp from your microBatchOutputDF DataFrame and keep only the record with the latest timestamp for a given key.
You can use Spark's reduceByKey operation and implement a custom reduce function as below.
def getLatestEvents(input: DataFrame): RDD[Row] =
  input.rdd.map(x => (x.getAs[String]("key"), x)).reduceByKey(reduceFun).map(_._2)

def reduceFun(x: Row, y: Row): Row =
  if (x.getAs[Timestamp]("timestamp").getTime > y.getAs[Timestamp]("timestamp").getTime) x else y
This assumes key is of type String and timestamp is of type Timestamp. Call getLatestEvents on your streaming batch microBatchOutputDF; it drops events with older timestamps and keeps only the latest one per key.
val latestRecordsDF = spark.createDataFrame(getLatestEvents(microBatchOutputDF), <schema of DF>)
Then call the Delta Lake merge operation on top of latestRecordsDF.
In streaming, a micro-batch can contain more than one record for a given key. In order to update the target table, you have to figure out the latest record for each key in the micro-batch. In your case you can use the max of the timestamp column (together with the value column) to find the latest record and use that one for the merge operation.
You can refer to this link for more details on finding the latest record for a given key.
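If you'd rather stay in the DataFrame API, a hedged alternative to the reduceByKey approach is to rank the rows of the micro-batch per key with a window function and keep only the newest one before the merge. A minimal sketch (the column names key and timestamp are taken from the question):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep only the newest row per key in the micro-batch before calling merge.
def latestPerKey(microBatchOutputDF: DataFrame): DataFrame = {
  val byKeyNewestFirst = Window.partitionBy("key").orderBy(col("timestamp").desc)
  microBatchOutputDF
    .withColumn("rn", row_number().over(byKeyNewestFirst))
    .where(col("rn") === 1)
    .drop("rn")
}
The de-duplicated batch can then be merged exactly as in the snippet above: deltaTable.as("t").merge(latestPerKey(microBatchOutputDF).as("s"), "s.key = t.key") and so on.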

JOIN in Azure Stream Analytics

I have a requirement to validate the values of one column against master data in Stream Analytics.
I have written queries to fetch some data from a blob location, and one of the column values should be validated against master data available in another blob location.
Below is the SAQL I tried. signals1 is the master data in blob storage and signals2 is the data being processed and to be validated:
WITH MASTER AS (
SELECT [signals1].VAL as VAL
FROM [signals1]
)
SELECT
ID,
VAL,
SIG
INTO [output]
FROM signals2
I have to validate the VAL from signals2 against the VAL in signals1.
If the VAL in signals2 is present in signals1, then we should write to output.
If the VAL in signals2 is not present in signals1, then that document should be ignored (not written to output).
I tried with JOIN and WHERE clauses, but it's not working as expected.
Any leads on how to achieve this using JOIN or WHERE?
In case your Signal1 data is the reference input, and Signal2 is the streaming input, you can use something like the following query:
with signals as (select * from Signal2 I join Signal1 R ON I.Val = R.Val)
select * into output from signals
I tested this query locally, and I assumed that your reference data (Signal1) is in the format:
[
{
"Val":"123",
"Data":"temp"
},
{
"Val":"321",
"Data":"humidity"
}
]
And, for example, your Signal2 (the streaming input) is:
{
"Val":"123",
"SIG":"k8s23kk",
"ID":"1234589"
}
Have a look at this query and the data samples to see if they can guide you towards the solution.
Side note: you cannot use this join if Signal1 is also streaming data. Joins between two streams have to be bounded by a time window; without that, the join is not possible.

Apache Spark Query with HiveContext doesn't work

I use Spark 1.6.1. In my Spark Java program I connect to a Postgres database and register every table as a temporary table via JDBC. For example:
Map<String, String> optionsTable = new HashMap<String, String>();
optionsTable.put("url", "jdbc:postgresql://localhost/database?user=postgres&password=passwd");
optionsTable.put("dbtable", "table");
optionsTable.put("driver", "org.postgresql.Driver");
DataFrame table = sqlContext.read().format("jdbc").options(optionsTable).load();
table.registerTempTable("table");
This works without problems:
hiveContext.sql("select * from table").show();
Also this works:
DataFrame tmp = hiveContext.sql("select * from table where value=key");
tmp.registerTempTable("table");
And then I can see the contents of the table with:
hiveContext.sql("select * from table").show();
But now I have a problem. When I execute this:
hiveContext.sql("SELECT distinct id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left and tble.timestamp <= w.right").show();
Spark does nothing, but on the original Postgres database the query works very well. So I decided to modify the query a little bit:
hiveContext.sql("SELECT id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left").show();
This query works and gives me results, but the other query does not. What is the difference, and why does the first query not work while the second one does?
And the database is not very big; for testing it has a size of 4 MB.
Since you're trying to select a distinct ID, you need to select the timestamp as part of an aggregate function and then group by ID. Otherwise, it doesn't know which timestamp to pair with the ID.
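A hedged sketch of that rewrite (using MAX as the aggregate and an explicit m alias in place of the undefined tble from the question; adjust to whichever timestamp semantics you actually need):
// Aggregate the timestamp and group by id, so every distinct id is paired with
// exactly one timestamp (here the maximum within the matching ranges).
hiveContext.sql(
  "SELECT m.id, MAX(m.timestamp) AS max_timestamp " +
  "FROM measure m, measure_range w " +
  "WHERE m.timestamp >= w.left AND m.timestamp <= w.right " +
  "GROUP BY m.id").show()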

Mimic 'group by' and window function logic in Spark

I have a large CSV file with the columns id, time, location. I made it an RDD and want to compute some aggregated metrics of the trips, where a trip is defined as a time-contiguous set of records of the same id, separated by at least 1 hour on either side. I am new to Spark. (related)
To do that, I plan to create an RDD with elements of the form (trip_id, (time, location)) and use reduceByKey to calculate all the needed metrics.
To calculate the trip_id, I try to implement the SQL approach of the linked question: make an indicator field of whether the record is the start of a trip, and take a cumulative sum of this indicator field. This does not sound like a distributed approach: is there a better one?
Furthermore, how can I add this indicator field? It should be 1 if the time difference to the previous record of the same id is above an hour, and 0 otherwise. I thought of first doing groupBy on id and then sorting within each of the values, but they will be inside an Array and thus not amenable to sortByKey, and there is no lag function as in SQL to get the previous value.
Example of the suggested aforementioned approach: for the RDD
(1,9:30,50)
(1,9:37,70)
(1,9:50,80)
(2,19:30,10)
(2,20:50,20)
We want to turn it first into the RDD with the time differences,
(1,9:30,50,inf)
(1,9:37,70,00:07:00)
(1,9:50,80,00:13:00)
(2,19:30,10,inf)
(2,20:50,20,01:20:00)
(The value for the earliest record of each id is, say, Scala's PositiveInfinity constant.)
and turn this last field into an indicator of whether it is above 1 hour, which indicates whether we start a trip,
(1,9:30,50,1)
(1,9:37,70,0)
(1,9:50,80,0)
(2,19:30,10,1)
(2,20:50,20,1)
and then turn it into a trip_id
(1,9:30,50,1)
(1,9:37,70,1)
(1,9:50,80,1)
(2,19:30,10,2)
(2,20:50,20,3)
and then use this trip_id as the key to aggregations.
The preprocessing was simply to load the file and delete the header,
val rawdata=sc.textFile("some_path")
def isHeader(line:String)=line.contains("id")
val data=rawdata.filter(!isHeader(_))
Edit
While trying to implement this with Spark SQL, I ran into an error regarding the time difference:
val lags = sqlContext.sql("""
  select time - lag(time) over (partition by id order by time) as diff_time from data
""")
since Spark doesn't know how to take the difference between two timestamps. I'm trying to check whether this difference is above 1 hour.
It also doesn't recognize the function getTime, which I found online as an answer; the following returns an error too (Couldn't find window function time.getTime):
val lags = sqlContext.sql("""
  select time.getTime() - (lag(time)).getTime() over (partition by id order by time)
  from data
""")
Even though making a similar lag difference for a numeric attribute works:
val lag_numeric = sqlContext.sql("""
  select longitude - lag(longitude) over (partition by id order by time)
  from data
""") // works
Spark didn't recognize the function Hours.hoursBetween either. I'm using Spark 1.4.0.
I also tried to define an appropriate user-defined function, but UDFs are oddly not recognized inside queries:
val timestamp_diff: ((Timestamp, Timestamp) => Double) =
  (d1: Timestamp, d2: Timestamp) => d1.getTime() - d2.getTime()
val lags = sqlContext.sql("""
  select timestamp_diff(time, lag(time)) over (partition by id order by time) from data
""")
So, how can Spark test whether the difference between timestamps is above an hour?
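One thing I plan to try (not yet tested) is registering the function with the SQL context, since timestamp_diff above is only a Scala value and the parser probably cannot see it, and computing the lag in a subquery so the window function is not nested inside the UDF call (ts_diff_seconds is just an illustrative name):
// Register the Scala function as a SQL UDF so it is visible inside sqlContext.sql(...).
sqlContext.udf.register("ts_diff_seconds",
  (d1: java.sql.Timestamp, d2: java.sql.Timestamp) => (d1.getTime - d2.getTime) / 1000.0)

// Compute lag(time) in an inner query, then apply the UDF in the outer query and
// flag rows whose gap to the previous record of the same id exceeds one hour.
// prev_time is NULL for the first record of each id; the IS NULL check covers that.
val flagged = sqlContext.sql("""
  SELECT id, time,
         CASE WHEN prev_time IS NULL
                OR ts_diff_seconds(time, prev_time) > 3600 THEN 1 ELSE 0 END AS is_start
  FROM (SELECT id, time,
               lag(time) OVER (PARTITION BY id ORDER BY time) AS prev_time
        FROM data) t""")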
Full code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext // for window functions
import java.util.Date
import java.sql.Timestamp

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
import sqlContext._
case class Record(id: Int, time:Timestamp, longitude: Double, latitude: Double)
val raw_data = sc.textFile("file:///home/sygale/merged_table.csv")
val data_records =
  raw_data.map(line =>
    Record(line.split(',')(0).toInt,
           Timestamp.valueOf(line.split(',')(1)),
           line.split(',')(2).toDouble,
           line.split(',')(3).toDouble))
val data = data_records.toDF()
data.registerTempTable("data")
val lags = sqlContext.sql("""
  select time - lag(time) over (partition by id order by time) as diff_time from data
""")

How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, and it is set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM).
I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact.
I am sure there is something I am missing. Below is my final try, which took about 11 minutes to get this simple count, versus only 40 seconds using a JDBC connection through R.
bin/pyspark --driver-memory 40g --executor-memory 40g
df = sqlContext.read.jdbc("jdbc:teradata://......)
df.count()
When I tried with a big table (5B records), no results were returned upon completion of the query.
All of the aggregation operations are performed after the whole dataset is retrieved into memory as a DataFrame. So doing the count in Spark will never be as efficient as doing it directly in Teradata. Sometimes it's worth pushing some computation into the database by creating views and then mapping those views using the JDBC API.
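For instance, the table argument of read.jdbc accepts a parenthesized subquery (or a view created on the DB side), so an aggregate such as the count can be evaluated by Teradata and only a single row crosses the connection. A small, untested sketch:
// Map a subquery (or a DB-side view) instead of the raw table; the aggregation
// runs inside the database and only the single result row is transferred.
val countDF = sqlctx.read.jdbc(
  "<URL>",
  "(SELECT COUNT(*) AS cnt FROM <TABLE>) t",
  new java.util.Properties())
countDF.show()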
Every time you use the JDBC driver to access a large table, you should specify a partitioning strategy; otherwise you will create a DataFrame/RDD with a single partition and you will overload the single JDBC connection.
Instead you want to try the following API (available since Spark 1.4.0):
sqlctx.read.jdbc(
url = "<URL>",
table = "<TABLE>",
columnName = "<INTEGRAL_COLUMN_TO_PARTITION>",
lowerBound = minValue,
upperBound = maxValue,
numPartitions = 20,
connectionProperties = new java.util.Properties()
)
There is also an option to push down some filtering.
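For example, simple column comparisons on the JDBC-backed DataFrame are translated into the WHERE clause of the query Spark issues, so only matching rows are transferred. A sketch (jdbcDF stands for the DataFrame returned by the read.jdbc call above; STATUS is a made-up column):
// The equality filter is pushed down to the database rather than applied in Spark.
val filtered = jdbcDF.filter(jdbcDF("STATUS") === "ACTIVE")
filtered.count()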
If you don't have a uniformly distributed integral column, you can create custom partitions by specifying custom predicates (WHERE statements). For example, let's suppose you have a timestamp column and want to partition by date ranges:
val predicates =
Array(
"2015-06-20" -> "2015-06-30",
"2015-07-01" -> "2015-07-10",
"2015-07-11" -> "2015-07-20",
"2015-07-21" -> "2015-07-31"
)
.map {
case (start, end) =>
s"cast(DAT_TME as date) >= date '$start' AND cast(DAT_TME as date) <= date '$end'"
}
predicates.foreach(println)
// Below is the result of how the predicates were formed:
// cast(DAT_TME as date) >= date '2015-06-20' AND cast(DAT_TME as date) <= date '2015-06-30'
// cast(DAT_TME as date) >= date '2015-07-01' AND cast(DAT_TME as date) <= date '2015-07-10'
// cast(DAT_TME as date) >= date '2015-07-11' AND cast(DAT_TME as date) <= date '2015-07-20'
// cast(DAT_TME as date) >= date '2015-07-21' AND cast(DAT_TME as date) <= date '2015-07-31'
sqlctx.read.jdbc(
url = "<URL>",
table = "<TABLE>",
predicates = predicates,
connectionProperties = new java.util.Properties()
)
It will generate a DataFrame where each partition contains the records of the subquery associated with the corresponding predicate.
Check the source code in DataFrameReader.scala.
Does the deserialized table fit into 40 GB? If it starts swapping to disk, performance will decrease dramatically.
Anyway, when you use standard JDBC with ANSI SQL syntax you leverage the DB engine, so if Teradata (I don't know Teradata) holds statistics about your table, a classic "select count(*) from table" will be very fast.
Spark, instead, loads your 100 million rows into memory with something like "select * from table" and then performs a count on the RDD rows. It's a pretty different workload.
One solution that differs from the others is to save the data from the source table to Avro files (partitioned into many files) stored on Hadoop.
This way, reading those Avro files with Spark would be a piece of cake, since you won't call the DB anymore.
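A hedged sketch of reading such an export back (assuming the external spark-avro package, com.databricks.spark.avro, is on the classpath; the path is a made-up example):
// Read the exported Avro files directly from HDFS; no JDBC connection is involved.
val exported = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///exports/big_table_avro/")
exported.count()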
