I am using Data Factory's expression builder to build data flows (Aggregate function) to 1. group movies by year, 2. find the max rating of the movies, and 3. return the movie title for that max.
I have already grouped by year so I'm trying to return something like
max(toInteger(Rating)) or greatest(toInteger(Rating))
and also get the 'title' of the movie that has that max. Can this be done in the expression builder?
The Aggregate transformation defines aggregations of columns in your data streams. Using the Expression Builder, you can define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns.
I tried to repro the issue with sample data, and I can observe that getting the movie title isn't possible in the Aggregate function in a mapping data flow.
In the data preview we can see that we only get the group-by column and the aggregate column. There is no option to include the movie name column here.
There is a Hive table with ~500,000 rows.
It has a single column which keeps a JSON string.
The JSON stores the measurements from 15 devices, organized like this:
company_id=…
device_1:
array of measurements
every single measurement has 2 attributes:
value=
date=
device_2:
…
device_3
…
device_15
...
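For concreteness, a single document of that shape might look roughly like the following (the exact nesting is only my illustration of the structure above, and all values are placeholders):

{
  "company_id": 123,
  "device_1": {
    "measurements": [
      { "value": 10.5, "date": "2016-09-20T15:00:00Z" },
      { "value": 11.2, "date": "2016-09-21T15:00:00Z" }
    ]
  },
  "device_2": { "measurements": [ ... ] },
  ...
  "device_15": { "measurements": [ ... ] }
}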
There are 15 devices in the JSON, and every device has a nested array of measurements inside. The size of the measurements array is not fixed.
The goal is to get, from the measurements, only the one with max(date) per device.
The output of SELECT should have the following columns:
company_id
device_1_value
device_1_date
...
device_15_value
device_15_date
I tried to use LATERAL VIEW to explode the measurements arrays:
SELECT get_json_object(json_string, '$.company_id'),
       d1.value, d1.date, ... d15.value, d15.date
FROM T
LATERAL VIEW explode(device_1.measurements) t1 AS d1
LATERAL VIEW explode(device_2.measurements) t2 AS d2
…
LATERAL VIEW explode(device_15.measurements) t15 AS d15
I can use the result of this SQL as an input for another SQL which will extract the records with max(date) per device.
My approach does not scale well: with 15 devices and 2 measurements per device, a single row in the input table will generate
2^15 = 32,768 rows with my SQL above.
There are 500,000 rows in the input table.
You are actually in a great position to make a cheaper table/join. Bundling (your JSON string) is an optimization trick used to take horribly ugly joins/tables and optimize them.
The downside is that you should likely be using a Hive user-defined function or a Spark function to pare down the data. SQL is amazing, but it likely isn't the right tool for this job. You likely want to use a programming language to help ingest this data into a format that works for SQL.
To avoid the Cartesian product generated by multiple lateral views, I split the original SQL into 15 independent SQLs (one per device), each with just 1 lateral view.
Then I join all 15 SQLs.
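For reference, here is a rough sketch of one of the per-device queries and of the final join. It keeps the same assumption as my original query, namely that the measurement arrays are reachable as struct columns (e.g. through a JSON SerDe); the aliases and view names are just placeholders.

-- one of the 15 per-device queries: keep only the measurement with the latest date for device_1
SELECT company_id,
       m.value AS device_1_value,
       m.date  AS device_1_date
FROM (
    SELECT get_json_object(json_string, '$.company_id') AS company_id,
           d1 AS m,
           ROW_NUMBER() OVER (
               PARTITION BY get_json_object(json_string, '$.company_id')
               ORDER BY d1.date DESC) AS rn
    FROM T
    LATERAL VIEW explode(device_1.measurements) t1 AS d1
) q
WHERE rn = 1;

-- the 15 per-device results (saved as views) are then joined on company_id
SELECT a.company_id,
       a.device_1_value, a.device_1_date,
       b.device_2_value, b.device_2_date
       -- ... up to device_15
FROM device_1_latest a
JOIN device_2_latest b ON a.company_id = b.company_id
-- JOIN device_3_latest ... up to device_15_latest on company_id as well
;

Each per-device query explodes only one array, so the Cartesian blow-up of the original query is avoided.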
I have some time-series data, and let's say that the model looks like this:
data {
date: Date,
greenPoint: Double,
redPoint: Double,
anomalyDetected: Boolean
}
Let's say that if the red point is under a specific threshold, I want to mark it as an anomaly.
I can easily load this data set (I'm using Spark SQL with Elastic, so to be more specific I have a DataFrame here) and mark some data as anomalies, but how can I group them after that, or at least count how many anomalies were detected?
If I could somehow add a unique ID to each anomaly (generated whenever the anomaly flag changes), then I could easily group them even in an Elastic query.
It would also be great to have the anomaly start date and end date in these grouped data.
Basically, my goal in the example below is to detect five different anomalies (marked in red) and either mark the original data so they are easy to group and display on a UI, or produce a list containing five elements (the five anomalies, with some additional fields like start and end date, etc.).
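One common way to get such groups is the "gaps and islands" pattern with window functions. Below is a minimal sketch in Spark SQL, assuming the DataFrame is registered as a temporary view named data and that redPoint values below 0.5 count as anomalies (the view name and the threshold are placeholders, not part of the original model):

WITH flagged AS (
    SELECT date, redPoint,
           CASE WHEN redPoint < 0.5 THEN 1 ELSE 0 END AS isAnomaly   -- placeholder threshold
    FROM data
),
with_prev AS (
    SELECT *,
           LAG(isAnomaly, 1, 0) OVER (ORDER BY date) AS prevAnomaly
    FROM flagged
),
numbered AS (
    SELECT *,
           -- a running count of 0 -> 1 transitions gives every anomaly run its own id
           SUM(CASE WHEN isAnomaly = 1 AND prevAnomaly = 0 THEN 1 ELSE 0 END)
               OVER (ORDER BY date) AS anomalyId
    FROM with_prev
)
SELECT anomalyId,
       MIN(date) AS startDate,
       MAX(date) AS endDate,
       COUNT(*)  AS pointCount
FROM numbered
WHERE isAnomaly = 1
GROUP BY anomalyId

The same anomalyId column could instead be joined back to the original rows if the goal is to tag each point rather than to aggregate.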
I have time series stored in a Cassandra table, coming from several sensors. Here is the schema I use for storing the data:
CREATE TABLE data_sensors (
sensor_id int,
time timestamp,
value float,
PRIMARY KEY ((sensor_id), time)
);
Values can be temperature or pressure, for instance, depending on the sensor they come from.
My objective is to be able to find basic statistics (min, max, avg, std) on pressure, but only when temperature is higher than a certain value.
Here is a diagram of the whole process I'd like to achieve.
I think it could be better if I changed the Cassandra model, at least for temperature data, to be able to filter on value. Is there another way, after importing data into a Spark RDD, to avoid altering the Cassandra table?
Then, once filtering on temperature is done, how do I get the sequence of timestamps to use to filter the pressure data? Please note that I don't necessarily have the same timestamps for temperature and pressure, which is why I think I need periods of time instead of a list of precise timestamps.
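To make the "periods" idea concrete, this is roughly the shape of the query I am hoping to end up with in Spark SQL once the table is exposed as a view (the sensor ids, the 1-minute bucket and the temperature threshold below are only examples):

WITH temperature AS (
    SELECT date_trunc('minute', time) AS period, value
    FROM data_sensors
    WHERE sensor_id = 1                          -- temperature sensor (example id)
),
pressure AS (
    SELECT date_trunc('minute', time) AS period, value
    FROM data_sensors
    WHERE sensor_id = 2                          -- pressure sensor (example id)
),
hot_periods AS (
    SELECT period
    FROM temperature
    GROUP BY period
    HAVING MAX(value) > 50                       -- example threshold
)
SELECT MIN(p.value)    AS min_pressure,
       MAX(p.value)    AS max_pressure,
       AVG(p.value)    AS avg_pressure,
       STDDEV(p.value) AS std_pressure
FROM pressure p
JOIN hot_periods h ON p.period = h.period

Here date_trunc only buckets the timestamps into 1-minute periods, so the temperature and pressure readings don't need to share exact timestamps.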
Thanks for your help!
It's not really a Cassandra-specific answer, but maybe you want to look at time series databases that provide a SQL layer on top of NoSQL stores, with support for JOINs and aggregations.
Here's an example of ATSD SQL syntax that supports period aggregations and joins.
SELECT t1.entity, t1.datetime, min(t1.value), max(t1.value), avg(t2.value)
FROM mpstat.cpu_busy t1
JOIN meminfo.memfree t2
WHERE t1.datetime >= '2016-09-20T15:00:00Z' AND t1.datetime < '2016-09-20T15:15:00Z'
GROUP BY entity, t1.PERIOD(1 MINUTE)
HAVING max(t1.value) > 30
The query joins two metrics, filters out 1-minute rows where the first metric was below the threshold, and then returns a set of statistics for the second series.
If the two series are unevenly spaced, you can regularize the array using linear interpolation.
Disclosure: I work for Axibase that develops ATSD.
I am interested in creating a gvNIX/Roo application which shows the location of health facilities in Tanzania on a map. I am trying the tutorial available here. However, my data is in the format shown below, where my location data is in two columns (southings and eastings). The tutorial shows how to create three geo field types:
field geo --fieldName location --type POINT --class ~.domain.Owner
field geo --fieldName distance --type LINESTRING --class ~.domain.Owner
field geo --fieldName area --type POLYGON --class ~.domain.Owner
I'm assuming I need the POINT data type to hold the location of a health facility, but I'm not sure how to get the two columns below (southings and eastings) into a single POINT variable. I'm pretty new to GIS as well. The data is as below (CSV format):
outlet_name,Status ,southings,eastings,streetward,name_of_outlet
REHEMA MEDICS,02,2.49993,32.89512,K/POLISI,REVINA
KIRUMBA MEDICS,02,2.50023,32.89503,K/POLISI,GEDION
KIRUMBA PHARMACY,02,2.50152,32.89742,K/POLISI,MAURETH
TULI MEDICS,02,2.48737,32.89686,KITANGIRI,TULI
JULLY MEDICS,02,2.53275,32.93855,BUZURUGA,JULLY
MAGOMA MEDICS,02,2.53181,32.94211,BUZURUGA,MAGOMA
MECO PHARMACY,02,2.52923,32.94730,MECCO,DORCAS
UPENDO MEDICS,02,2.52923,32.94786,MECCO,UPENDO
DORIS MEDICS,02,2.49961,32.89191,KABUHORO,DORIS
SOPHIA MEDICS,02,2.49975,32.89120,KABUHORO,ESTER
MWALONI PHAMCY,02,2.56351,32.89416,MWALONI,ESTER
SILVER PHAMACY,02,2.51728,32.90614,K/KILOMERO,WANDWATA
KIBO PHARMACY,02,2.51688,32.90710,MISSION,MARIAM
Thanks
You need to transform your coordinates to WKT (Well-Known Text) format in order to insert them into a column in your database (a PostgreSQL database with PostGIS support). To achieve this you need to follow these steps:
Find the SRID of your coordinate reference system (CRS), that is, the identifier which defines your coordinate system. Otherwise, your points won't match the real coordinates. You'll need the SRID in the last step.
Transform your data to WKT. The data needed for inserting the points is in the southings and eastings columns (I suppose they correspond to latitude and longitude, which are the most commonly used), so you'll need to transform these two columns into a single column in WKT format, e.g. for your first row of data: Point(32.89512 2.49993). Note the space between the numbers and that their order is switched (longitude first).
Proceed with the inserts in SQL syntax, but using PostGIS functions. An example for your first row would be: INSERT INTO health_facilities (outlet_name, status, streetward, location) VALUES ('REHEMA MEDICS', 02, 'K/POLISI', ST_GeomFromText('Point(32.89512 2.49993)', 4326));, where 4326 is the SRID you have to find (supposing it is the most common one, EPSG:4326).
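If you prefer to load the whole CSV in one go instead of writing each insert by hand, one possible way is to copy it into a staging table and build the points from the eastings/southings columns (the staging table, its column names and the file path below are only placeholders based on your CSV header):

CREATE TABLE staging_outlets (
    outlet_name    text,
    status         text,
    southings      double precision,
    eastings       double precision,
    streetward     text,
    name_of_outlet text
);

COPY staging_outlets FROM '/path/to/your_file.csv' WITH (FORMAT csv, HEADER true);

INSERT INTO health_facilities (outlet_name, status, streetward, location)
SELECT outlet_name,
       status,
       streetward,
       -- eastings (longitude) first, southings (latitude) second, SRID 4326 as above
       ST_SetSRID(ST_MakePoint(eastings, southings), 4326)
FROM staging_outlets;

ST_SetSRID(ST_MakePoint(...), 4326) builds the same geometry as the ST_GeomFromText call above.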
You can find more info here and here. There are also several pages where you can check coordinates and transform them between different CRSs, like this and this.