Calculating the median for every hourly time slice in a temporal database in PostgreSQL

I am working on a temporal database and need to aggregate time-series data.
For each full hour (the nearest reading may fall at Hour:00, Hour:01 or Hour-1:59), I have to extract the observations that lie within 5 minutes before and after that hour, and then take the median of those (roughly five) candidate values.
Sample data 1:
Date&Time (timestamp)  Surface_Temperature
2012-11-02 00:45:09+02 -1.770
2012-11-02 00:47:09+02 -1.780
2012-11-02 00:49:09+02 -1.500
2012-11-02 00:51:09+02 -1.460
2012-11-02 00:53:09+02 -1.720
2012-11-02 00:55:09+02 -1.670
2012-11-02 00:57:09+02 -1.560
2012-11-02 00:59:09+02 -1.690
2012-11-02 01:01:09+02 -1.970
2012-11-02 01:03:09+02 -1.790
2012-11-02 01:05:09+02 -1.790
2012-11-02 01:07:09+02 -1.840
2012-11-02 01:09:09+02 -1.910
2012-11-02 01:11:09+02 -1.870
Sample data 2:
Date&Time (timestamp)  Surface_Temperature
2007-09-28 23:46:14+02 -1.320
2007-09-28 23:48:14+02 -1.460
2007-09-28 23:50:14+02 -1.620
2007-09-28 23:52:14+02 -1.670
2007-09-28 23:54:14+02 -1.640
2007-09-28 23:56:14+02 -1.700
2007-09-28 23:58:14+02 -1.810
2012-11-03 00:00:14+02 -1.890
2012-11-03 00:02:14+02 -1.790
2012-11-03 00:04:14+02 -1.780
2012-11-03 00:06:14+02 -1.660
2012-11-03 00:08:14+02 -1.680
2012-11-03 00:10:14+02 -1.900
2012-11-03 00:12:14+02 -1.820
2012-11-03 00:14:14+02 -1.780
2012-11-03 00:16:14+02 -1.940
2012-11-03 00:18:14+02 -1.900
Is there anyone who can help me?

Something like this should do:
SELECT
  EXTRACT(HOUR FROM mytimestamp) AS hour,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) AS median
FROM mytable
GROUP BY EXTRACT(HOUR FROM mytimestamp);
-- note: this takes the median over every reading in each hour; restricting it to the
-- +/- 5 minute window around the full hour needs an additional WHERE filter on the minutes

Related

How to use PySpark to do some calculations on a CSV file?

I am working on the CSV file below using PySpark (on Databricks), but I am not sure how to get the total duration of the scan events. Assume one scan at a time.
   timestamp                    event    value
1  2020-11-17_19:15:33.438102   scan     start
2  2020-11-17_19:18:33.433002   scan     end
3  2020-11-17_20:05:21.538125   scan     start
4  2020-11-17_20:13:08.528102   scan     end
5  2020-11-17_21:23:19.635104   pending  start
6  2020-11-17_21:33:26.572123   pending  end
7  2020-11-17_22:05:29.738105   pending  start
.........
Below are some of my thoughts:
first get scan start time
scan_start = df[(df['event'] == 'scan') & (df['value'] == 'start')]
scan_start_time = scan_start['timestamp']
get scan end time
scan_end = df[(df['event'] == 'scan') & (df['value'] == 'end')]
scan_end_time = scan_end['timestamp']
the duration of each scan
each_duration = scan_end_time.values - scan_start_time.values
total duration
total_duration_ns = each_duration.sum()
But, I am not sure how to do the calculation in PySpark.
First, do we need to create a schema to pre-define the 'timestamp' column as a timestamp type? (Assume all the columns (timestamp, event, value) are read as strings.)
Also, assuming we have many (1000+) similar CSV files stored in Databricks, how can we create reusable code for all of them and eventually build one table that stores the total scan_duration for each file?
Can someone please share with me some code in PySpark?
Thank you so much
This code will compute for each row the difference between the current timestamp and the timestamp in the previous row.
I'm creating a dataframe for reproducibility.
from pyspark.sql import SparkSession, Window
from pyspark.sql.types import *
from pyspark.sql.functions import regexp_replace, col, lag
import pandas as pd
spark = SparkSession.builder.appName("DataFarme").getOrCreate()
data = pd.DataFrame(
{
"timestamp": ["2020-11-17_19:15:33.438102","2020-11-17_19:18:33.433002","2020-11-17_20:05:21.538125","2020-11-17_20:13:08.528102"],
"event": ["scan","scan","scan","scan"],
"value": ["start","end","start","end"]
}
)
df=spark.createDataFrame(data)
df.show()
# +--------------------+-----+-----+
# | timestamp|event|value|
# +--------------------+-----+-----+
# |2020-11-17_19:15:...| scan|start|
# |2020-11-17_19:18:...| scan| end|
# |2020-11-17_20:05:...| scan|start|
# |2020-11-17_20:13:...| scan| end|
# +--------------------+-----+-----+
Convert "timestamp" column to TimestampType() to be able to compute differences:
df=df.withColumn("timestamp",
regexp_replace(col("timestamp"),"_"," "))
df.show(truncate=False)
# +--------------------------+-----+-----+
# |timestamp                 |event|value|
# +--------------------------+-----+-----+
# |2020-11-17 19:15:33.438102|scan |start|
# |2020-11-17 19:18:33.433002|scan |end  |
# |2020-11-17 20:05:21.538125|scan |start|
# |2020-11-17 20:13:08.528102|scan |end  |
# +--------------------------+-----+-----+
df = df.withColumn("timestamp",
regexp_replace(col("timestamp"),"_"," ").cast(TimestampType()))
df.dtypes
# [('timestamp', 'timestamp'), ('event', 'string'), ('value', 'string')]
Use the pyspark.sql.functions.lag function, which returns the value of the previous row (offset=1 by default).
See also How to calculate the difference between rows in PySpark? or Applying a Window function to calculate differences in pySpark
df.withColumn("lag_previous", col("timestamp").cast("long") - lag('timestamp').over(
Window.orderBy('timestamp')).cast("long")).show(truncate=False)
# WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Using a Window without a partition gives a warning.
It is better to partition the dataframe for the window operation; here I partitioned by the type of event:
df.withColumn("lag_previous", col("timestamp").cast("long") - lag('timestamp').over(
Window.partitionBy("event").orderBy('timestamp')).cast("long")).show(truncate=False)
# +--------------------------+-----+-----+------------+
# |timestamp                 |event|value|lag_previous|
# +--------------------------+-----+-----+------------+
# |2020-11-17 19:15:33.438102|scan |start|null        |
# |2020-11-17 19:18:33.433002|scan |end  |180         |
# |2020-11-17 20:05:21.538125|scan |start|2808        |
# |2020-11-17 20:13:08.528102|scan |end  |467         |
# +--------------------------+-----+-----+------------+
From this table you can keep only the rows whose value is "end" and sum their lag_previous column to get the total scan duration.
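A minimal, untested sketch of that last step (it recomputes the lag_previous column and reuses the df built above, so treat the column and variable names as assumptions about your data):

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, sum as spark_sum

# duration of each scan = timestamp of the 'end' row minus the previous row's timestamp (seconds)
with_durations = df.withColumn(
    "lag_previous",
    col("timestamp").cast("long")
    - lag("timestamp").over(Window.partitionBy("event").orderBy("timestamp")).cast("long"))

# keep only the 'end' rows of 'scan' events and add up their durations
total_scan_duration = (with_durations
    .filter((col("event") == "scan") & (col("value") == "end"))
    .agg(spark_sum("lag_previous").alias("total_scan_duration_s")))
total_scan_duration.show()
# with the four sample rows above this should give 180 + 467 = 647 seconds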

How to insert multiple rows of a pandas dataframe into Azure Synapse SQL DW using pyodbc?

I am using pyodbc to establish a connection with Azure Synapse SQL DW. The connection is successfully established. However, when it comes to inserting a pandas dataframe into the database, I get an error when I try to insert multiple rows as values, although it works if I insert the rows one by one. Inserting multiple rows together as values used to work fine with AWS Redshift and MS SQL, but fails with Azure Synapse SQL DW. I think Azure Synapse SQL is T-SQL rather than MS-SQL. Nonetheless, I am unable to find any relevant documentation.
I have a pandas df named 'df' that looks like this:
student_id admission_date
1 2019-12-12
2 2018-12-08
3 2018-06-30
4 2017-05-30
5 2020-03-11
The code below works fine:
import pandas as pd
import pyodbc
#conn object below is the pyodbc 'connect' object
batch_size = 1
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
As you can see, it's inserting just 1 row of the 'df'. So, yes, I could loop through and insert the rows one by one, but that takes a very long time for larger dataframes.
The code below doesn't work when I try to insert all the rows together:
import pandas as pd
import pyodbc
batch_size = 5
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
The error I get is the one below:
ProgrammingError: ('42000', "[42000]
[Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Parse error at
line: 1, column: 74: Incorrect syntax near ','. (103010)
(SQLExecDirectW)")
This is the sample SQL query for 2 rows which fails:
insert INTO myTable values (1, '2009-12-12'),(2, '2018-12-12')
That's because Azure Synapse SQL does not support multi-row insert via the values constructor.
One workaround is to chain "select (value list) union all" statements. Your pseudo-SQL should look like this:
insert INTO {table}
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)} union all
...
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)}
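As a rough, untested sketch, the UNION ALL statement can be assembled in Python with the same string-building pattern as the question's code (myTable, chunk and conn are taken from there):

# one SELECT per row, chained with UNION ALL instead of a multi-row VALUES list
rows = zip(chunk.student_id.values, chunk.admission_date.values.astype(str))
selects = ["SELECT {}, '{}'".format(sid, adm) for sid, adm in rows]
sql = 'insert INTO {} {}'.format('myTable', ' UNION ALL '.join(selects))
print(sql)
# insert INTO myTable SELECT 1, '2019-12-12' UNION ALL SELECT 2, '2018-12-08' ...
cursor = conn.cursor()
cursor.execute(sql)
# note: plain string formatting is fine for a quick test, but it is not safe against SQL injection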
The COPY statement in Azure Synapse Analytics is a better way to load your data into the Synapse SQL pool.
COPY INTO test_parquet
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet'
WITH (
FILE_FORMAT = myFileFormat,
CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='<Your_SAS_Token>')
)
You can save your pandas dataframe into blob storage and then trigger the COPY command using the cursor's execute method, along the lines of the sketch below.
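Here is a hedged, untested sketch of that idea; the storage account, container, blob name, connection string and the FILE_TYPE = 'CSV' option are assumptions that mirror the COPY example above, and df and conn are the dataframe and pyodbc connection from the question:

from azure.storage.blob import BlobServiceClient

# upload the dataframe as a headerless CSV blob (all names below are placeholders)
blob_service = BlobServiceClient.from_connection_string("<your_storage_connection_string>")
blob = blob_service.get_blob_client(container="myblobcontainer", blob="folder1/students.csv")
blob.upload_blob(df.to_csv(index=False, header=False), overwrite=True)

# then let Synapse pull the file in with COPY INTO
copy_sql = """
COPY INTO myTable
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/students.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
)
"""
cursor = conn.cursor()
cursor.execute(copy_sql)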

Cumulative sum of all columns except the Date column in Python with cumsum()

I have a stock data set like this:
Date Open High ... Close Adj Close Volume
0 2014-09-17 465.864014 468.174011 ... 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 ... 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 ... 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 ... 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 ... 398.821014 398.821014 26580100
I need the cumulative sum of the columns Open, High, Close, Adj Close and Volume.
I tried df.cumsum(), but it shows an error because of the timestamp (Date) column.
I think for processing trade data it is best to create a DatetimeIndex:
#if necessary
#df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
And then, if necessary, take the cumulative sum of all columns:
df = df.cumsum()
If you want the cumulative sum only for some columns:
cols = ['Open','High','Close','Adj Close','Volume']
df[cols] = df[cols].cumsum()
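A small self-contained sketch with the first three rows from the question (only the Open and Volume columns, to keep it short) shows the effect:

import pandas as pd

df = pd.DataFrame({
    "Date": ["2014-09-17", "2014-09-18", "2014-09-19"],
    "Open": [465.864014, 456.859985, 424.102997],
    "Volume": [21056800, 34483200, 37919700],
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

cols = ['Open', 'Volume']
df[cols] = df[cols].cumsum()
print(df)
#                    Open    Volume
# Date
# 2014-09-17   465.864014  21056800
# 2014-09-18   922.723999  55540000
# 2014-09-19  1346.826996  93459700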

How to homogenize data in a PySpark spark.sql dataframe

I downloaded a 1.9 GB CSV file containing AirBnB data. Although all the columns have a data type of "string", a few of them are not "homogeneous": for example, in the "Amenities" column some entries hold a count of the amenities at that particular property while others hold a list of amenities, all in string format.
So, here's what I have so far:
from pyspark import SparkContext, SparkConf
import pandas as pd
import numpy as np
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
from pyspark.sql import SQLContext
SQLCtx = SQLContext(sc)
air =SQLCtx.read.load('/home/john/Downloads/airbnb-listings.csv',
format = "com.databricks.spark.csv",
header = "true",
sep = ";",
inferSchema = "true")
#check for missing values
from pyspark.sql.functions import col,sum
air.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in air.columns)).show()
So after dropping a few columns and then removing missing values, I have this:
Keep = ['Price', 'Bathrooms', 'Bedrooms', 'Beds', 'Bed Type', 'Amenities',
'Security Deposit', 'Cleaning Fee', 'Guests Included', 'Extra People',
'Review Scores Rating', 'Cancellation Policy','Host Response Rate',
'Country Code', 'Zipcode']
data = air.select(*Keep)
reduced2 = data.na.drop()
#final shape after dropping missing values.
print((reduced2.count(), len(reduced2.columns)))
I can convert a few rows into a pandas dataframe:
df3 = pd.DataFrame(reduced2.take(50), columns = reduced2.columns)
A small bit of the "Amenities" list:
Wireless Internet,Air conditioning,Kitchen,Fre...
2 10
3 Internet,Wireless Internet,Air conditioning,Ki...
4 TV,Cable TV,Internet,Wireless Internet,Air con...
5 TV,Wireless Internet,Air conditioning,Pool,Kit...
6 TV,Wireless Internet,Air conditioning,Pool,Kit...
7 Internet,Wireless Internet,Kitchen,Free parkin...
8 TV,Wireless Internet,Air conditioning,Pool,Kit...
9 Wireless Internet,Air conditioning,Kitchen,Fre...
10 TV,Cable TV,Internet,Wireless Internet,Air con...
14 10
16 10
17 TV,Internet,Wireless Internet,Air conditioning...
18 TV,Cable TV,Internet,Wireless Internet,Air con...
19 TV,Internet,Wireless Internet,Air conditioning...
20 TV,Wireless Internet,Air conditioning,Pool,Kit...
23 TV,Cable TV,Internet,Wireless Internet,Air con...
28 9
33 10
34 Internet,Wireless Internet,Kitchen,Elevator in...
37 10
As you can see, I will have trouble dealing with this as it is.
I can do something in regular pandas easily enough to fix it, like this:
for i in range(len(df3['Amenities'])):
    if len(df3['Amenities'][i]) > 2:
        df3['Amenities'][i] = str(len(df3['Amenities'][i].split(',')))
Now I realize it may not be the nicest way to do it, but it turns everything that's a list into a number.
What I need is a way to do something like this to a column in a pyspark SQL dataframe, if it's at all possible.
Thanks!
If I understand you correctly, you want to calculate the number of items delimited by ',', but keep the rows which are already numbers. If so, you might try the following:
from pyspark.sql import functions as F

df.withColumn('Amenities',
    F.when(df.Amenities.rlike(r'^\d+$'), df.Amenities)
     .otherwise(F.size(F.split('Amenities', ',')))
     .astype('string')
).show()
So if the column Amenities is an integer (df.Amenities.rlike(r'^\d+$')), we keep it as is (df.Amenities); otherwise, we use F.size() and F.split() to calculate the number of items, and then convert the result to a "string".
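A quick usage sketch on a made-up toy dataframe (the rows are invented, just to illustrate the behaviour of the expression above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("TV,Internet,Kitchen",), ("10",), ("Wireless Internet,Pool",)],
    ["Amenities"])

toy.withColumn('Amenities',
    F.when(toy.Amenities.rlike(r'^\d+$'), toy.Amenities)
     .otherwise(F.size(F.split('Amenities', ',')))
     .astype('string')
).show()
# the list rows become their item counts ("3" and "2"), while the "10" row is kept as is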
I am not familiar with PySpark SQL dataframes, only vanilla pandas.
Not sure what your task is, but maybe consider turning that column into two columns, e.g. (assuming this is possible in PySpark):
df['Amenities_count'] = pd.to_numeric(df['Amenities'], errors='coerce')
mask_entries_with_list = df['Amenities_count'].isna()
mask_entries_with_number = ~mask_entries_with_list
df.loc[mask_entries_with_number, 'Amenities'] = ''
df.loc[mask_entries_with_list, 'Amenities_count'] = df.loc[mask_entries_with_list, 'Amenities'].str.split(',').str.len()
(untested)

Need some guidance on creating an entry in Cassandra

I am new to Cassandra and find it a bit difficult to understand how to create a simple keyspace with the structure I have in mind. I created a keyspace called "acquisition" using the Cassandra CLI.
Using the Cassandra CLI, how can I create the following in the "acquisition" keyspace?
TagNo // This is the super column
{
ID // This is the column family
{
// here we shall have lots of entries. (Rows)
user1: {rate, distance, capacity}
user2: {rate, distance, capacity}
}
}
Rate, distance and capacity can be stored as either strings or doubles, but that does not really matter at the moment.
I am not sure how to do this using the CLI, so please help.
Create the keyspace:
create keyspace acquisition with placement_strategy =
'org.apache.cassandra.locator.SimpleStrategy' and strategy_options =
{replication_factor:1};
Create the super column family:
create column family TagNo with column_type = 'Super' and comparator = 'UTF8Type' and subcomparator = 'UTF8Type' and default_validation_class = 'UTF8Type' and column_metadata = [{ column_name : rate, validation_class : AsciiType}, { column_name : 'distance', validation_class : AsciiType}, {column_name : 'capacity', validation_class : AsciiType}];
Set a few example values in the TagNo super column family:
[default#acquisition] set TagNo[utf8('ID')]['user1']['rate'] = '10';
Value inserted.
Elapsed time: 2 msec(s).
[default#acquisition] set TagNo[utf8('ID')]['user1']['distance'] = '100';
Value inserted.
Elapsed time: 2 msec(s).
[default#acquisition] set TagNo[utf8('ID')]['user1']['capacity'] = '50';
Value inserted.
Elapsed time: 2 msec(s).
[default#acquisition] set TagNo[utf8('ID')]['user2']['capacity'] = '50';
Value inserted.
Elapsed time: 2 msec(s).
[default#acquisition] set TagNo[utf8('ID')]['user2']['rate'] = '20';
Value inserted.
Elapsed time: 1 msec(s).
[default#acquisition] set TagNo[utf8('ID')]['user2']['distance'] = '100';
Value inserted.
Elapsed time: 2 msec(s).
Show the values:
[default#acquisition] get TagNo[utf8('ID')];
=> (super_column=user1,
(column=capacity, value=50, timestamp=1331605812776000)
(column=distance, value=100, timestamp=1331605805912000)
(column=rate, value=10, timestamp=1331605780216000))
=> (super_column=user2,
(column=capacity, value=50, timestamp=1331605816568000)
(column=distance, value=100, timestamp=1331605846008000)
(column=rate, value=20, timestamp=1331605821608000))
Returned 2 results.
Elapsed time: 3 msec(s).
