Problem with explode in Hive or Spark query - apache-spark

There is Hive table with ~ 500,000 rows.
It has the single column which keeps the JSON string.
JSON stores the measurements from 15 devices organized like this:
company_id=…
device_1:
array of measurements
every single measurements has 2 attributes:
value=
date=
device_2:
…
device_3
…
device_15
...
There are 15 devices in json where every device has the nested array of measurements inside. The size of measurements array is not fixed.
The goal is to get from the measurements only the one with max(date) per device.
The output of SELECT should have the following columns:
company_id
device_1_value
device_1_date
...
device_15_value
device_15_date
I tried to use the LATERAL VIEW to explode the measurements array:
SELECT get_json_object(json_string,'$.company_id),
d1.value, d1.date, ... d15.value, d15.date
FROM T
LATERAL VIEW explode(device_1.measurements) as d1
LATERAL VIEW explode(device_2.measurements) as d2
…
LATERAL VIEW explode(device_15.measurements) as d15
I can use the result of this SQL as an input for another SQL which will extract the records with max(date) per device.
My approach does not scale well: with 15 devices and 2 measurements per device the single row in input table will generate
2^15 = 32,768 rows using my SQL above.
There are 500,000 rows in input table.

You are actually in a great position, to make a cheaper table/join. Bundling (your JSON string) is a optimization trick use to take horribly ugly joins/tables and optimizing them.
The downside is that you should likely be using a hive user defined function or a spark function to pair down the data. SQL is amazing but likely this isn't the right tool for this job. You likely want to use a programming language to help ingest this data into a format that works for SQL.

To avoid the cartesian product generated by multiple lateral views I split the original SQL into 15 independent SQLs (one per device) where the single SQL has just 1 lateral view.
Then I join all 15 SQLs.

Related

How to partition Delta tables efficiently?

Looking for efficient partitioning strategies for my dataframe when storing my dataframe in the delta table.
My current dataframe 1.5000.000 rowa it takes 3.5h to move data from dataframe to delta table.
Looking for a more efficient way to do this writing I decided to try different columns of my table as partitioning columns.I searched for the cardinality of my columns and selected the following ones.
column1 = have 3 distinct_values
column2 = have 7 distinct values
column3 = have 26 disctinc values
column4 = have 73 distinc values
column5 = have 143 distinc values
column6 = have 246 distinct values
column7 = have 543 disctinc values
cluster: 64GB, 8 cores
using the folloging code in my notebook
df.write.partitionBy("column_1").format("delta").mode("overwrite").save(partition_1)
..
df.write.partitionBy("column_7").format("delta").mode("overwrite").save(partition7)
Thus, I wanted to see which partitioning strategy would bring better results: a column with high cardinality, one with low cardinality or one in between.
To my surprise this has not had any effect as it has taken practically the same time in all of them with differences of a few minutes but all of them with + 3h.
why have I failed ? is there no advantage to partitioning ?
When you use Delta (either Databricks or OSS Delta 1.2.x, better 2.0) then often you may not need to use partitioning at all for following reasons (that aren't applicable for Parquet or other file formats):
Delta supports data skipping that allows to read only necessary files, especially effective when you use it in combination with OPTIMIZE ZORDER BY that will put related data closer to each other.
Bloom filters allow to skip files even more granularly.
The rules of thumb of using partitioning with Delta lake tables are following:
use it when it will benefit queries, especially when you perform MERGE into the table, because it allows to avoid conflicts between parallel transactions
when it helps to delete old data (for example partitioning by date)
when it really benefits your queries. For example, you have data per country, and most of queries will use country as a part of condition. Or for example, when you partition by date, and querying data based on the time...
In all cases, don't use partitioning for high cardinality columns (hundreds of values) and having too many partition columns because in most cases it lead to creation of small files that are less efficient to read (each file is accessed separately), plus it leads to increased load to the driver as it needs to keep metadata for each of the file.

Computing the size of a derived table in Spark SQL query

Is it possible to approximate the size of a derived table (in kb/mb/gb etc) in a Spark SQL query ? I don't need the exact size but an approximate value will do, which would allow me to plan my queries better by determining if a table could be broadcast in a join, or if using a filtered subquery in a Join will be better than using the entire table etc.
For e.g. in the following query, is it possible to approximate the size (in MB) of the derived table named b ? This will help me figure out if it will be better to use the derived table in the Join vs using the entire table with the filter outside -
select
a.id, b.name, b.cust
from a
left join (select id, name, cust
from tbl
where size > 100
) b
on a.id = b.id
We use Spark SQL 2.4. Any comments appreciated.
I have had to something similar before (to work out how many partitions to split to when writing).
What we ended up doing was working out an average row size and doing a count on the DataFrame then multiplying it by the row count.

Spark SQL window function causes skew in data distribution

The performance of this Spark SQL query is bad due to skew data distribution:
select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 1 PRECEDING), 0L) as totalRevenue
from records c
I see in SparkUI that single task stack and the cluster fail if I increase the scanned range.
I am using Yarn at AWS EMR, with Spark 2.2.0
How can I overcome this issue?
Thanks
I can only recommend several approaches to alleviate your condition for investigation. I would actually try two approaches that don’t treat the skew first:
Try increasing the executor memory per the message. On YARN you may additionally need to increase the maximum container memory as well. The default on Spark IIRC is 2gb and its not uncommon to need to increase it.
Try switching to memory_and_disk or disk_only persistence levels. I believe this should work for your query although it can be hard to eyeball the full query plan
The reason for this is that at least to my eye your data is fundamentally skewed. You’re setting yourself up for maintenance difficulties if you start reshaping the data to address the skew in specific ways to the current shape of the data because the shape of the data may change over time. In my opinion at least you want to preserve the most straightforward implementation of your query for as long as you can, and only optimize skew issues programmatically if you hit problems with SLA violations, etc.
If those don’t work then you can try to address the skew directly. A simple approach for this is to create a third column that is populated by a random number for the column values that are known to be problematic. Do one pass of your summing operation with this in place, using it as a key, then a second pass with the extra random column removed. Alternatively you can do two queries and concatenate them: one with the random number for skewed data (which must still be handled in two passes) and another unaltered query for the non problematic data.
Edit - compute partial sums through two frames
The fundamentally useful observation here is that addition is commutative and associative. My original proposal based on random numbers won't work but this will. Basically, you want to compute the partial sum of the frame you want in several parts. The easiest way to do this is probably as a set of ranges (two used here for simplicity):
create temporary table partial_revenue_1 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 118 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table partial_revenue_2 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 117 PRECEDING and 1 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table combined_partials as select * from
partial_reveneue_1 union all select * from partial_revenue_2
select sum(partialTotalRevenue), first(c.some_col) ... from
combined_partials c group by cid, pid, code
Notice you need to use the first aggregate function to cull the duplicate fields that you will have from the earlier select * operations on the records table. Don't worry, this will be fine since both values came from the same table.

Timeseries with Spark/Cassandra - How to find timestamps when values satisfy a condition?

I have timeseries stored in a Cassandra table, coming from several sensors. Here is the schema I use for storing data :
CREATE TABLE data_sensors (
sensor_id int,
time timestamp,
value float,
PRIMARY KEY ((sensor_id), time)
);
Values can be temperature or pressure for instance, depending on the sensor from which it is coming from.
My objective is to be able to find basic statistics (min, max, avg, std) on pressure, but only when temperature is higher than a certain value.
Here is a schema of the whole process I'd like to get.
I think it could be better if I changed the Cassandra model, at least for temperature data, to be able to filter on value. Is there another way, after importing data into a Spark RDD, to avoid altering the Cassandra table?
Then, once filtering on temperature is done, how to get the sequence of timestamps I have to use to filter pressure data? Please note that I don't have necessarily the same timestamps for temperature and pressure, that is why I think I need to have periods of time instead of a list of precise timestamps.
Thanks for your help!
It's not really a Cassandra-specific answer, but maybe you want to look at time series databases that provide SQL layer on top of NoSQL stores with support for JOINs and aggregations.
Here's an example of an ATSD SQL syntax that supports period aggregations and joins.
SELECT t1.entity, t1.datetime, min(t1.value), max(t1.value), avg(t2.value)
FROM mpstat.cpu_busy t1
JOIN meminfo.memfree t2
WHERE t1.datetime >= '2016-09-20T15:00:00Z' AND t1.datetime < '2016-09-20T15:15:00Z'
GROUP BY entity, t1.PERIOD(1 MINUTE)
HAVING max(t1.value) > 30
The query joins two metrics, filters out 1-minute rows where first metric was below the threshold and then returns a bunch of statistics for the second series.
If the two series are unevenly spaced, you can regularize the array using linear interpolation.
Disclosure: I work for Axibase that develops ATSD.

Cassandra: secondary index inside a single partition (per partition indexing)?

This question is I hope not answered in the usual "secondary index v. clustering key" questions.
Here is a simple model I have:
CREATE TABLE ks.table1 (
name text,
timestamp int,
device text,
value int,
PRIMARY KEY (md_name, timestamp, device)
)
Basically I view my data as datasets with name name, each dataset is a kind of sparse 2D matrix (rows = timestamps, columns = device) containing value.
As the problem and the queries can be pretty symmetric (ie. is my "matrix" the best representation, or should I use the transposed "matrix") I couldn't decide easily what clustering key I should put first. It makes a bit more sense the way I did: for each timestamp I have a set of data (values for each devices present at that timestamp).
The usual query is then
select * from cycles where md_name = 'xyz';
It targets a single partition, that will be super fast, easy enough. If there's a large amount of data my users could do something like this instead:
select * from cycles where md_name = 'xyz' and timestamp < n;
However I'd like to be able to "transpose" the problem and do this:
select * from cycles where md_name = 'xyz' and device='uvw';
That means I have to create a secondary index on device.
But (and that's where the question starts"), this index is a bit different from usual indexes, as it is used for queries inside a single partition. Create the index allows to do the same on multiple partitions:
select * from cycles where device='uvw'
Which is not necessary in my case.
Can I improve my model to support such queries without too much duplication?
Is there something like a "per-partition index"?
The index would allow you to do queries like this:
select * from cycles where md_name='xyz' and device='uvw'
But that would return all timestamps for that device in the xyz partition.
So it sounds like maybe you want two views of the data. Once based on name and time range, and one based on name, device, and time range.
If that's what you're asking, then you probably need two tables. If you're using C* 3.0, then you could use the materialized views feature to create the second view. If you're on an earlier version, then you'd have to create the two tables and do a write to each table in your application.

Resources