Using an expression in a PARTITIONED BY definition in Delta Table

Using an expression in a PARTITIONED BY definition in Delta Table - databricks

Attempting to load data into Databricks using COPY INTO, I have data in storage (as CSV files) that has the following schema:
event_time TIMESTAMP,
aws_region STRING,
event_id STRING,
event_name STRING
I wish for the target table to be partitioned by DAY, which should be extracted from the event_time column.
However, attempting to use an expression in the PARTITIONED BY column yields the following error:
CREATE TABLE IF NOT EXISTS MY_TABLE (
event_time TIMESTAMP,
aws_region STRING,
event_id STRING,
event_name STRING
)
PARTITIONED BY(TO_DATE(event_time))
java.sql.SQLException: [Databricks]DatabricksJDBCDriver ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: [DELTA_OPERATION_NOT_ALLOWED] com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Operation not allowed: Partitioning by expressions is not supported for Delta tables
My limitation is that I cannot alter the existing schema, for example by creating a new field:
event_date DATE
And use that as the partition by column.
Is there any other way to overcome this?

You need to use GENERATED column like this (doc):
CREATE TABLE IF NOT EXISTS MY_TABLE (
event_time TIMESTAMP,
event_date date GENERATED ALWAYS AS TO_DATE(event_time),
aws_region STRING,
event_id STRING,
event_name STRING
)
PARTITIONED BY(event_date)
You can read more about generated columns in the linked documentation or here.

Related

org.apache.spark.sql.AnalysisException: while saving Spark Dataframe

I have 1 table in 2 tables in my database.I am tring to save data from first table to second table using insertInto.
CREATE TABLE if not exists dbname.tablename_csv ( id STRING, location STRING, city STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE ;
CREATE TABLE if not exists dbname.tablename_orc ( id String,location STRING, country String PARTITIONED BY (city string) CLUSTERED BY (country) into 4 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORCFILE tblproperties("orc.compress"="SNAPPY");
var query=spark.sql("id,location,city,country from dbname.tablename_csv")
query.write.insertInto("dbname.tablename_orc")
but its giving issue."
"org.apache.spark.sql.AnalysisException: `dbname`.`tablename_orc` requires that the data to be inserted have the same number of columns as the target table: target table has 3 column(s) but the inserted data has 4 column(s), including 0 partition column(s) having constant value(s).;"
Plese someone give me a hint what else need to add.I tried by adding partitionBy also but got same error and was showing partitionBy not Required.
query.write.partitionBy("city").insertInto("dbname.tablename_orc")

saveAsTable(...) with mode = "append"

HIVE rendered timestamp column data as NULL

I am trying to create an external table using Hive. Below is the Hive query I ran:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/user/taxi_trips/';
When I looked at the output from the 'trips_raw' table created by the query above, I saw that both the 'tpep_pickup_date_time' and 'tpep_dropoff_datetime' columns are 'NULL' in all rows. I have seen other threads talked about the reason being that the '1/1/2018 11:13:00 AM' timestamp format is not accepted by Hive, but problem is that's the timestamp format I have in my csv source data (as you can see from screenshot here).
I could specify those 2 timestamp columns as 'string' and Hive will be able to render them correctly, but I still would want those 2 columns to be 'timestamp' type so specifying those 2 columns as 'string' type is not a viable option here.
I had also tried the following technique using recommendation from this site (https://community.hortonworks.com/questions/55266/hive-date-time-problem.html) but had no success:
Create the 'trips_raw' table using 'string' as type for the 2 timestamp columns. This allows the resulting table to render the timestamps correctly, albeit in 'string' type. The Hive command I used is shown below:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime string,
tpep_dropoff_datetime string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location
'/user/taxi_trips/';
When I look at the resulting table, the dates are shown as string as you can see from this screenshot below.
But as I had mentioned earlier, I want the time columns to be in timestamp type and not string type. Therefore in the next 2 steps I tried to create a blank table and then insert the data from the table created from Step 1 but converting the string to timestamp this time.
Create an external blank table called 'trips_not_raw' using the following Hive commands:
create external table trips_not_raw
(VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
);
Insert data from 'trips_raw' table (which was mentioned earlier in this question), using the Hive commands below:
insert into table trips_not_raw select vendorid,
from_unixtime(unix_timestamp(tpep_pickup_datetime, 'MM/dd/yyyy HH:mm:ss
aa')) as tpep_pickup_datetime,
from_unixtime(unix_timestamp(tpep_dropoff_datetime, 'MM/dd/yyyy HH:mm:ss
aa')) as tpep_dropoff_datetime
from trips_raw;
Doing this inserts the rows into the blank table 'trips_not_raw', but the results from the 2 timestamp columns still showed as 'Null' as you can see from the screenshot below:
Is there a simple way to store the 2 time columns as 'timestamp' type and not 'string', but still be able to render them correctly in the output without seeing 'Null/None'?

I'm afraid you need to parse timestamp column and then cast string as timestamp. For example,
select cast(regexp_replace('1/1/2018 11:13:00 AM', '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp)
You can create and use a macro function for convenience, e.g.,
create temporary macro parse_date (ts string)
cast(regexp_replace(ts, '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp);
then use it as follows
select parse_date('1/1/2018 11:13:00 AM');

Normalization in Cassandra for the specified use case?

I have a device table (say 'device' table) which has the static fields with current statistics and I have another table (say 'devicestat' table) which has the statistics of that device collected for every one minute and sorted by timestamp like below.
Example :
CREATE TABLE device(
"partitionId" text,
"deviceId" text,
"name" text,
"totalMemoryInMB" bigint,
"totalCpu" int,
"currentUsedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"currentUsedCpu" int,
"ipAddress" text,
primary key ("partitionId","deviceId"));
CREATE TABLE devicestat(
"deviceId" text,
"timestamp" timestamp,
"totalMemoryInMB" bigint,
"totalCpu" int,
"usedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"usedCpu" int
primary key ("deviceId","timestamp"));
where,
currentUsedMemoryInMB & currentUsedCpu => Hold the most recent statistics
usedMemoryInMB & usedCpu => Hold the most and also old statistics based on time stamp.
Could somebody suggest me the correct approach for the following concept?
So whenever I need static data with the most recent statistics I read from device table, Whenever I need history of device staistical data I read from the devicestat table
This looks fine for me, But only problem is I need to write the statitics in both table, In case of devicestat table It will be a new entry based on timestamp but In case of device table, we will just update the statistics. What is your thought on this, Does this need to be maintained in only the single stat table or Is it fine to update the most recent stat in device table too.

in Cassandra the common approach is to have a table(ColumnFamily) per query. And denormalization is also a good practice in Cassandra. So it's ok to keep 2 column families in this case.
Another way to get the latest stat from devicestat table is make data be DESC sorted by timestamp:
CREATE TABLE devicestat(
"deviceId" text,
"timestamp" timestamp,
"totalMemoryInMB" bigint,
"totalCpu" int,
"usedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"usedCpu" int
primary key ("deviceId","timestamp"))
WITH CLUSTERING ORDER BY (timestamp DESC);
so you can query with limit 1 when you know deviceId
select * from devicestat where deviceId = 'someId' limit 1;
But if you want to list last stat of devices by partitionId then your approach with updating device table with latest stat is correct

Cassandra aggregation to second table

i have a cassandra table like:
CREATE TABLE sensor_data (
sensor VARCHAR,
timestamp timestamp,
value float,
PRIMARY KEY ((sensor), timestamp)
)
And and aggregation table.
CREATE TABLE sensor_data_aggregated (
sensor VARCHAR,
aggregation VARCHAR /* hour or day */
timestamp timestamp,
aggragation
min_timestamp timestamp,
min_value float,
max_timestamp timestamp,
max_value float,
avg_value float,
PRIMARY KEY ((sensor, aggregation), timestamp)
)
Is there a possibility of any trigger, to fill the "sensor_data_aggregated" table automaticly on insert, update, delete or "sensor_data" table?
My current solution whould be to write an custom trigger, with second commit log.
And an application that read and truncate this log peridicly to generate the aggregated data.
But i also found information that the datastax ops center can do this but no instruction how to do that?
What will be the best solution how to to this?

You can implement your own C* trigger for that, which will execute additional queries for your aggregation table after each row insert into sensor_data.
Also, for maintaining min/max values you can use CAS and C* lightweight transactions like
update sensor_data_aggregated
set min_value=123
where
sensor='foo'
and aggregation='bar'
and ts='2015-01-01 00:00:00'
if min_value>123;
using a bit updated schema ('timestamp' is a reserved keyword in cql3, you cannot use it unescaped):
CREATE TABLE sensor_data_aggregated (
sensor text,
aggregation text,
ts timestamp,
min_timestamp timestamp,
min_value float,
max_timestamp timestamp,
max_value float,
avg_value float,
PRIMARY KEY ((sensor, aggregation), ts)
)

IN operator in Cassandra doesn't work for table having a column with type-collection(Map or List)

I'm working on Cassandra, trying to get to know how it works. Encountered something strange while using IN operator. Example:
Table:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
"timestamp" timestamp,
PRIMARY KEY ((name, age), time)
)
I have inserted few dummy data. Used IN operator as follows:
SELECT * from test_time
where name="9" and age=81
and time IN (c7c88000-190e-11e4-8000-000000000000, c7c88000-190e-11e4-7000-000000000000);
It worked properly.
Then, added a column of type Map. Table will look like:
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
name_age map<text, int>,
"timestamp" timestamp,
PRIMARY KEY ((name, age), time)
)
On executing same query, I got following error:
Bad Request: Cannot restrict PRIMARY KEY part time by IN relation as a collection is selected by the query
From the above examples, we can say, IN operator doesn't work if there are any column of type collection(Map or List) in the table.
I don't understand why it behaves like this. Please let me know If I'm missing anything here. Thanks in advance.

Yup...that is a limitation. You can do the following:
select * from ...where name='9' and age=81 and time > x and time < y
select [columns except collection] from ...where name='9' and age=81 and time in (...)
You can then filter client side, or do another query.

You can either include your column as a part of partitioning expression in the primary key
CREATE TABLE test_time (
name text,
age int,
time timeuuid,
"timestamp" timestamp,
PRIMARY KEY ((name, time), age)
);
or create a separate Materialized View to satisfy your query requirements:
CREATE MATERIALIZED VIEW test_time_mv AS
SELECT * FROM test_time
WHERE name IS NOT NULL AND time IS NOT NULL AND age IS NOT NULL
PRIMARY KEY ((name, time), age);
Now use the Materialized View in your query instead of the base table:
SELECT * from test_time_mv
where name='9'
and age=81
and time IN (c7c88000-190e-11e4-8000-000000000000,
c7c88000-190e-11e4-7000-000000000000);

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Using an expression in a PARTITIONED BY definition in Delta Table - databricks

Related

org.apache.spark.sql.AnalysisException: while saving Spark Dataframe

HIVE rendered timestamp column data as NULL

Normalization in Cassandra for the specified use case?

Cassandra aggregation to second table

IN operator in Cassandra doesn't work for table having a column with type-collection(Map or List)

Categories

Resources