How to control a Spark stream with a stream of dynamic queries? - apache-spark

I have a stream of data coming from Kafka; call it SourceStream.
I have another stream whose individual values are Spark SQL queries, each paired with a window size.
I want those queries to be applied to the SourceStream data and the results passed to the sink.
E.g.
Source Stream
Id type timestamp user amount
------- ------ ---------- ---------- --------
uuid1 A 342342 ME 10.0
uuid2 B 234231 YOU 120.10
uuid3 A 234234 SOMEBODY 23.12
uuid4 A 234233 WHO 243.1
uuid5 C 124555 IT 35.12
...
....
Query Stream
Id window query
------- ------ ------
uuid13 1 hour select 'uuid13' as u, max(amount) as output from df where type = 'A' group by ..
uuid21 5 minute select 'uuid21' as u, count(1) as output from df where amount > 100 group by ..
uuid321 1 day select 'uuid321' as u, sum(amount) as output from df where amount > 100 group by ..
...
....
Each query in the query stream would be applied to the source stream's incoming data over the window specified alongside the query, and the output would be sent to the sink.
What are some ways to implement this with Spark?
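One possible direction (a rough sketch only, not a tested solution): treat the query stream as control messages, keep the currently active queries in driver-side state, and run them against each micro-batch of the source with foreachBatch. The snippet below uses PySpark Structured Streaming; the topic names, sink path, and the assumption that every SQL statement selects from a view called df are illustrative, and the per-query window size still has to be applied inside each SQL statement (or via separate streaming aggregations).

# Rough sketch only (PySpark Structured Streaming, Spark 2.4+ assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-queries").getOrCreate()

source = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
          .option("subscribe", "source-topic")               # placeholder
          .load())
# ...deserialize `value` into Id/type/timestamp/user/amount columns here...

active_queries = {}  # query id -> (window size, SQL text), driver-side state

def refresh_queries(batch_df, _epoch_id):
    # Each micro-batch of the query stream updates the active-query registry.
    for row in batch_df.collect():  # assumes Id/window/query columns
        active_queries[row["Id"]] = (row["window"], row["query"])

def apply_queries(batch_df, _epoch_id):
    # Expose the source micro-batch as the `df` view the queries expect,
    # then run every registered query and append its result to a sink.
    batch_df.createOrReplaceTempView("df")
    for qid, (win, sql_text) in active_queries.items():
        # NOTE: `win` (the per-query window) still needs to be honoured,
        # e.g. by rewriting the SQL to group on window(timestamp, win).
        spark.sql(sql_text).write.mode("append").parquet("/sink/%s" % qid)

# query_stream would be a second readStream over the query topic (not shown):
# query_stream.writeStream.foreachBatch(refresh_queries).start()
# source.writeStream.foreachBatch(apply_queries).start()
# spark.streams.awaitAnyTermination()

An alternative is to start a dedicated windowed streaming aggregation per query and restart it when the query stream changes, which maps more directly to the per-query window but is heavier operationally.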

Related

Splitting / Grouping DataFrame based on a date / time difference

I have the following rows in a dataframe:
sender receiver bytes timestamp
------ -------- ----- ----------
A      B        50    2147483647
C      D        100   2147483648
A      B        150   2147483657
C      D        200   2147483658
A      B        550   2147487657
Each record/row in that dataframe contains the amount of data that has been sent between a sender and receiver within a 10s time window. The timestamp marks when that individual time window started.
Now, I want to compute the amount of data between every pair of sender and receiver within a "flow".
By a flow, I mean that data is continuously transferred between sender and receiver.
If for a longer period of time (say 1 hour) no data is transferred, I want the flows to be split. In the example above, I would like to get:
flow_AB_1 = 200 bytes
flow_CD_1 = 300 bytes
flow_AB_2 = 550 bytes
flow_AB_2 would be a separate flow, since 2147487657 - 2147483657 = 4000, which is greater than 3600.
Is there a way to achieve this with pyspark/Apache Spark?
To solve your issue you can:
create a new column flow using the sessionization algorithm based on Spark's window described in this blog post
group by flow, sender and receiver to sum bytes
optionally, build your flows' names by concatenating sender, receiver and flow column
The complete code would be as follows:
from pyspark.sql import functions as F
from pyspark.sql import Window

window = Window.partitionBy('sender', 'receiver').orderBy('timestamp')

result = dataframe \
    .withColumn('flow_split', F.when(F.col('timestamp') - F.lag('timestamp').over(window) > 3600, F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('flow', F.sum('flow_split').over(window)) \
    .groupby('sender', 'receiver', 'flow') \
    .agg(F.sum('bytes').alias('bytes')) \
    .select(
        F.concat(F.lit('flow_'), F.col('sender'), F.col('receiver'), F.lit('_'), F.col('flow') + 1).alias('flow'),
        F.col('bytes')
    )
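For reference, a quick way to build the input dataframe from the question in order to try the snippet (assuming an active SparkSession named spark; column types are inferred):

# Hypothetical test data copied from the question.
dataframe = spark.createDataFrame(
    [("A", "B", 50, 2147483647), ("C", "D", 100, 2147483648),
     ("A", "B", 150, 2147483657), ("C", "D", 200, 2147483658),
     ("A", "B", 550, 2147487657)],
    ["sender", "receiver", "bytes", "timestamp"],
)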
With the input dataframe in your question, you will get the following result:
+---------+-----+
|flow |bytes|
+---------+-----+
|flow_AB_1|200 |
|flow_AB_2|550 |
|flow_CD_1|300 |
+---------+-----+

Parquet data to AWS Redshift slow

I want to insert data from S3 parquet files into Redshift.
The parquet files come from a process that reads JSON files, flattens them out, and stores them as parquet. To do this we use pandas dataframes.
To do so, I tried two different things. The first one:
COPY schema.table
FROM 's3://parquet/provider/A/2020/11/10/11/'
IAM_ROLE 'arn:aws:iam::XXXX'
FORMAT AS PARQUET;
It returned:
Invalid operation: Spectrum Scan Error
error: Spectrum Scan Error
code: 15001
context: Unmatched number of columns between table and file. Table columns: 54, Data columns: 41
I understand the error, but I don't have an easy way to fix it.
If we have to reload data from 2 months ago, the file may only have, for example, 40 columns, because at that point we only needed that data, while the table has since grown to 50 columns.
So we need something automatic, or at least a way to specify the columns.
Then I tried another option, which is to do a SELECT with AWS Redshift Spectrum. We know how many columns the table has from the system tables, and we know the structure of the file by loading it again into a pandas dataframe. Then I can combine both to build the identical structure and do the INSERT.
It works fine, but it is slow.
The select looks like:
SELECT fields
FROM schema.table
WHERE partition_0 = 'A'
AND partition_1 = '2020'
AND partition_2 = '11'
AND partition_3 = '10'
AND partition_4 = '11';
The partitions are already added as I checked using:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename = 'table'
and schemaname = 'schema'
and values = '["A","2020","11","10","11"]'
limit 1;
I have around 170 files per hour, in both the JSON and the parquet paths. The process lists all files in the S3 JSON path, processes them, and stores the result in the S3 parquet path.
I don't know how to improve the execution time, as the INSERT from parquet takes 2 minutes per partition_0 value. I tried the SELECT alone to make sure it's not an INSERT issue, and it takes 1:50 minutes, so the problem is reading the data from S3.
If I select a non-existent value for partition_0, it again takes around 2 minutes, so there is some kind of problem accessing the data. I also don't know whether the partition_0 naming (and the others) is recognized as Hive partitioning format.
Edit:
AWS Glue Crawler table specification
Edit: Add SVL_S3QUERY_SUMMARY results
step:1
starttime: 2020-12-13 07:13:16.267437
endtime: 2020-12-13 07:13:19.644975
elapsed: 3377538
aborted: 0
external_table_name: S3 Scan schema_table
file_format: Parquet
is_partitioned: t
is_rrscan: f
is_nested: f
s3_scanned_rows: 1132
s3_scanned_bytes: 4131968
s3query_returned_rows: 1132
s3query_returned_bytes: 346923
files: 169
files_max: 34
files_avg: 28
splits: 169
splits_max: 34
splits_avg: 28
total_split_size: 3181587
max_split_size: 30811
avg_split_size: 18825
total_retries: 0
max_retries: 0
max_request_duration: 360496
avg_request_duration: 172371
max_request_parallelism: 10
avg_request_parallelism: 8.4
total_slowdown_count: 0
max_slowdown_count: 0
Add query checks
Query: 37005074 (SELECT in localhost using pycharm)
Query: 37005081 (INSERT in AIRFLOW AWS ECS service)
STL_QUERY shows that both queries take around 2 minutes
select * from STL_QUERY where query=37005081 OR query=37005074 order by query asc;
Query: 37005074 2020-12-14 07:44:57.164336,2020-12-14 07:46:36.094645,0,0,24
Query: 37005081 2020-12-14 07:45:04.551428,2020-12-14 07:46:44.834257,0,0,3
STL_WLM_QUERY shows no queue time; it is all exec time
select * from STL_WLM_QUERY where query=37005081 OR query=37005074;
Query: 37005074 Queue time 0 Exec time: 98924036 est_peak_mem:0
Query: 37005081 Queue time 0 Exec time: 100279214 est_peak_mem:2097152
SVL_S3QUERY_SUMMARY shows that the query takes 3-4 seconds in S3
select * from SVL_S3QUERY_SUMMARY where query=37005081 OR query=37005074 order by endtime desc;
Query: 37005074 2020-12-14 07:46:33.179352,2020-12-14 07:46:36.091295
Query: 37005081 2020-12-14 07:46:41.869487,2020-12-14 07:46:44.807106
stl_return: comparing the minimum start to the maximum end for each query also gives 3-4 seconds, as SVL_S3QUERY_SUMMARY says
select * from stl_return where query=37005081 OR query=37005074 order by query asc;
Query:37005074 2020-12-14 07:46:33.175320 2020-12-14 07:46:36.091295
Query:37005081 2020-12-14 07:46:44.817680 2020-12-14 07:46:44.832649
I don't understand why SVL_S3QUERY_SUMMARY shows just 3-4 seconds to run the query in Spectrum, while STL_WLM_QUERY says the execution time is around 2 minutes, which matches what I see in my localhost and production environments... Nor do I know how to improve it, because stl_return shows that the query returns little data.
EXPLAIN
XN Partition Loop (cost=0.00..400000022.50 rows=10000000000 width=19608)
-> XN Seq Scan PartitionInfo of parquet.table (cost=0.00..22.50 rows=1 width=0)
Filter: (((partition_0)::text = 'A'::text) AND ((partition_1)::text = '2020'::text) AND ((partition_2)::text = '12'::text) AND ((partition_3)::text = '10'::text) AND ((partition_4)::text = '12'::text))
-> XN S3 Query Scan parquet (cost=0.00..200000000.00 rows=10000000000 width=19608)
" -> S3 Seq Scan parquet.table location:""s3://parquet"" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=19608)"
svl_query_report
select * from svl_query_report where query=37005074 order by segment, step, elapsed_time, rows;
Just like in your other question you need to change your keypaths on your objects. It is not enough to just have "A" in the keypath - it needs to be "partition_0=A". This is how Spectrum knows that the object is or isn't in the partition.
Also, you need to make sure that your objects are of reasonable size, or it will be slow if you need to scan many of them. It takes time to open each object, and if you have many small objects the time to open them can be longer than the time to scan them. This is only an issue if you need to scan a very large number of files.
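For illustration only, using paths adapted from the question (file names are made up): the current layout stores just the partition values, while a Hive-style layout that Spectrum can recognise puts key=value in every partition segment.
Current: s3://parquet/provider/A/2020/11/10/11/part-0000.parquet
Hive-style: s3://parquet/provider/partition_0=A/partition_1=2020/partition_2=11/partition_3=10/partition_4=11/part-0000.parquet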

JOIN in Azure Stream Analytics

I have a requirement to validate the values of one column against master data in Stream Analytics.
I have written queries to fetch some data from a blob location, and one of the column values should be validated against master data available in another blob location.
Below is the SAQL I tried. signals1 is the master data in blob and signals2 is the data being processed and validated:
WITH MASTER AS (
SELECT [signals1].VAL as VAL
FROM [signals1]
)
SELECT
ID,
VAL,
SIG
INTO [output]
FROM signals2
I have to validate the VAL from signals2 against the VAL in signals1.
If the VAL in signals2 is present in signals1, we should write to output.
If the VAL in signals2 is not in signals1, that document should be ignored (not written to output).
I tried with a JOIN and a WHERE clause, but it is not working as expected.
Any leads on how to achieve this using JOIN or WHERE?
In case your Signal1 data is the reference input, and Signal2 is the streaming input, you can use something like the following query:
with signals as (select * from Signal2 I join Signal1 R ON I.Val = R.Val)
select * into output from signals
I tested this query locally, and I assumed that your reference data (Signal1) is in the format:
[
  {
    "Val": "123",
    "Data": "temp"
  },
  {
    "Val": "321",
    "Data": "humidity"
  }
]
And for example, your Signal2 - the streaming input is:
{
  "Val": "123",
  "SIG": "k8s23kk",
  "ID": "1234589"
}
Have a look at this query and data samples to see if it can guide you towards the solution.
Side note: you cannot use this join if Signal1 is also streaming data. Joins between two streams have to use time-windowing; without it, the join is not possible.
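For completeness, a rough sketch of what the stream-to-stream variant would look like; the 5-minute matching window and the aliases are arbitrary examples, not taken from the question:

-- Sketch only: both Signal1 and Signal2 as streaming inputs.
SELECT S2.ID, S2.VAL, S2.SIG
INTO [output]
FROM Signal2 S2
JOIN Signal1 S1
  ON S2.VAL = S1.VAL
  AND DATEDIFF(minute, S2, S1) BETWEEN 0 AND 5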

What is the key to improving speed when we need action operations in big data processing with Spark?

I have recently been using Spark 1.5.1 to process Hadoop data. However, my experience with Spark has not been great because of the slowness of action operations (e.g., .count(), .collect()). My task can be described as follows:
I have a dataframe like this:
----------------------------
trans item_code item_qty
----------------------------
001 A 2
001 B 3
002 A 4
002 B 6
002 C 10
003 D 1
----------------------------
I need to find association rules between two items, e.g. one unit of A results in one and a half units of B with a confidence of 0.8. The desired result dataframe looks like this:
----------------------------
item1 item2 conf coef
----------------------------
A B 0.8 1.5
B A 1.0 0.67
A C 0.7 2.5
----------------------------
My method is to use FP-growth to generate frequent itemsets first, and then filter out the itemsets with one item and the itemsets with two items. After that I can calculate the confidence of one item resulting in another. For example, given (itemset=[A], support=0.4), (itemset=[B], support=0.2), (itemset=[A,B], support=0.2), I can generate the association rules (rule=(A->B), confidence=0.5) and (rule=(B->A), confidence=1.0).
However, when I broadcast the one-item frequent itemsets as a dictionary, the .collectAsMap action is really slow. I tried to use .join and it is even slower. I even need to wait hours to see rdd.count(). I know we should avoid action operations in Spark where possible, but sometimes they are unavoidable. So I am curious what the key is to improving speed when we face action operations.
My code is here:
#!/usr/bin/python
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.mllib.fpm import FPGrowth
import time

#read raw data from database
def read_data():
    sql="""select t.orderno_nosplit,
           t.prod_code,
           t.item_code,
           sum(t.item_qty) as item_qty
           from ioc_fdm.fdm_dwr_ioc_fcs_pk_spu_item_f_chain t
           group by t.prod_code, t.orderno_nosplit, t.item_code """
    data=sql_context.sql(sql)
    return data.cache()

#calculate quantity coefficient of two items
def qty_coef(item1,item2):
    sql=""" select t1.item_code, t1.item_qty from spu_table t1
            where t1.trans in
            (select t2.trans from spu_table t2 where t2.item_code = '%s')
            and t1.trans in
            (select t3.trans from spu_table t3 where t3.item_code = '%s') """ % (item1,item2)
    df=sql_context.sql(sql)
    qty_item1=df.filter(df.item_code==item1).agg({"item_qty":"sum"}).first()[0]
    qty_item2=df.filter(df.item_code==item2).agg({"item_qty":"sum"}).first()[0]
    coef=float(qty_item2)/qty_item1
    return coef

def train(prod):
    spu=total_spu.filter(total_spu.prod_code == prod)
    print 'data length',spu.count(),time.strftime("%H:%M:%S")
    supp=0.1
    conf=0.7
    sql_context.registerDataFrameAsTable(spu,'spu_table')
    sql_context.cacheTable('spu_table')
    print 'table register over', time.strftime("%H:%M:%S")
    trans_sets=spu.rdd.repartition(32).map(lambda x:(x[0],x[2])).groupByKey().mapValues(list).values().cache()
    print 'trans group over',time.strftime("%H:%M:%S")
    model=FPGrowth.train(trans_sets,supp,10)
    print 'model train over',time.strftime("%H:%M:%S")
    model_f1=model.freqItemsets().filter(lambda x: len(x[0])==1)
    model_f2=model.freqItemsets().filter(lambda x: len(x[0])==2)
    #register model_f1 as dictionary
    model_f1_tuple=model_f1.map(lambda (U,V):(tuple(U)[0],V))
    model_f1Map=model_f1_tuple.collectAsMap()
    #convert model_f1Map to broadcast
    bc_model=sc.broadcast(model_f1Map)
    #generate association rules
    model_f2_conf=model_f2.map(lambda x:(x[0][0],x[0][1],float(x[1])/bc_model.value[x[0][0]],float(x[1])/bc_model.value[x[0][1]]))
    print 'conf calculation over',time.strftime("%H:%M:%S")
    model_f2_conf_flt=model_f2_conf.flatMap(lambda x: [(x[0],x[1],x[2]),(x[1],x[0],x[3])])
    #filter the association rules by confidence threshold
    model_f2_conf_flt_ftr=model_f2_conf_flt.filter(lambda x: x[2]>=conf)
    #calculate the quantity coefficient for the filtered association rules
    #since we cannot use nested sql operations in rdd, I have to collect the rules to list first
    asso_list=model_f2_conf_flt_ftr.map(lambda x: list(x)).collect()
    print 'coef calculation over',time.strftime("%H:%M:%S")
    for row in asso_list:
        row.append(qty_coef(row[0],row[1]))
    #rewrite the list to dataframe
    asso_df=sql_context.createDataFrame(asso_list,['item1','item2','conf','coef'])
    sql_context.clearCache()
    path = "hdfs:/user/hive/wilber/%s"%(prod)
    asso_df.write.mode('overwrite').parquet(path)

if __name__ == '__main__':
    sc = SparkContext()
    sql_context=HiveContext(sc)
    prod_list=sc.textFile('hdfs:/user/hive/wilber/prod_list').collect()
    total_spu=read_data()
    print 'spu read over',time.strftime("%H:%M:%S")
    for prod in list(prod_list):
        print 'prod',prod
        train(prod)

Polybase - maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed

[Question from customer]
I have following data in a text file. Delimited by |
A | null , ZZ
C | D
When I run this query using HDInsight:
CREATE EXTERNAL TABLE myfiledata(
col1 string,
col2 string
)
row format delimited fields terminated by '|' STORED AS TEXTFILE LOCATION 'wasb://.....';
I get the following result as expected:
A null , ZZ
C D
But when I run the same query using SQL DW Polybase, it throws error:
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
How do I fix this?
Here's my script in SQL DW:
-- Creating external data source (Azure Blob Storage)
CREATE EXTERNAL DATA SOURCE azure_storage1
WITH
(
TYPE = HADOOP
, LOCATION ='wasbs://....blob.core.windows.net'
, CREDENTIAL = ASBSecret
)
;
-- Creating external file format (delimited text file)
CREATE EXTERNAL FILE FORMAT text_file_format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS (
FIELD_TERMINATOR ='|'
, USE_TYPE_DEFAULT = TRUE
)
)
;
-- Creating external table pointing to file stored in Azure Storage
CREATE EXTERNAL TABLE [Myfile]
(
Col1 varchar(5),
Col2 varchar(5)
)
WITH
(
LOCATION = '/myfile.txt'
, DATA_SOURCE = azure_storage1
, FILE_FORMAT = text_file_format
)
;
We’re currently working on a way to bubble up the reason for reject to the user.
In the meantime, here's what's happening:
The default number of rows allowed to fail schema matching is 0. This means the query is aborted if even one of the rows you're loading from /myfile.txt doesn't match the schema. In Hive, strings can accommodate an arbitrary number of characters, but varchars cannot. In this case it's failing on the varchar(5) for "null , ZZ", because that value is more than 5 characters.
If you'd like to change the REJECT_VALUE in the CREATE EXTERNAL TABLE call, that will let the other row through; more info can be found here: https://msdn.microsoft.com/library/dn935021(v=sql.130).aspx
It's due to a dirty record for the respective file format. For example, in the case of parquet, if a column contains '' (an empty string) then it won't work and will throw "Query aborted-- the maximum reject threshold".
[AZURE.NOTE] A query on an external table can fail with the error "Query aborted-- the maximum reject threshold was reached while reading from an external source". This indicates that your external data contains dirty records. A data record is considered 'dirty' if the actual data types/number of columns do not match the column definitions of the external table or if the data doesn't conform to the specified external file format. To fix this, ensure that your external table and external file format definitions are correct and your external data conform to these definitions. In case a subset of external data records is dirty, you can choose to reject these records for your queries by using the reject options in CREATE EXTERNAL TABLE DDL.
