Execution of the DolphinDB pivot by statement

The sample code is as follows:
symbol = take(`APPL, 6) join take(`FB, 5)
time = 2019.02.27T09:45:01 2019.02.27T09:45:01 2019.02.27T09:45:04 2019.02.27T09:45:04 2019.02.27T09:45:05 2019.02.27T09:45:06 2019.02.27T09:45:01 2019.02.27T09:45:02 2019.02.27T09:45:03 2019.02.27T09:45:04 2019.02.27T09:45:05
price = (170 + rand(5, 6)) join (64 + rand(2, 5))
quotes = table(symbol, time, price)
weights = dict(`APPL`FB, 0.6 0.4)
ETF = select symbol, time, price * weights[symbol] as price from quotes
t = select rowSum(ffill(last(price))) from ETF pivot by time, symbol
What is the specific execution process of the above select rowSum(ffill(last(price))) from ETF pivot by time, symbol?
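No answer is recorded in the thread, but the statement can be read roughly as three nested steps: the pivot by time, symbol clause reshapes the table into a time-by-symbol matrix and fills each cell with last(price), ffill then forward-fills empty cells down each symbol column, and rowSum finally sums across the symbol columns for each time. A small pandas sketch of that reading (an illustration only, not DolphinDB itself; the toy ETF frame below is hypothetical):
import pandas as pd

# Toy stand-in for the weighted ETF table built above (hypothetical values).
ETF = pd.DataFrame({
    "symbol": ["APPL", "APPL", "FB", "FB"],
    "time": pd.to_datetime(["2019-02-27 09:45:01", "2019-02-27 09:45:04",
                            "2019-02-27 09:45:01", "2019-02-27 09:45:02"]),
    "price": [102.0, 103.2, 25.6, 25.8],
})

# Step 1: pivot by time, symbol with last(price) -> one row per time, one column per symbol
wide = ETF.pivot_table(index="time", columns="symbol", values="price", aggfunc="last")
# Step 2: ffill() -> forward-fill missing cells down each symbol column
wide = wide.ffill()
# Step 3: rowSum() -> sum across the symbol columns for each time
row_sum = wide.sum(axis=1)
print(row_sum)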

Related

Improve performance of the table pivoting in Clickhouse

I have a table containing the market data of 5,000 unique stocks. Each stock has 24 records a day and each record has 1,000 fields (factors). I want to pivot the table for cross-sectional analysis. You can find my script below.
I have two questions: (1) The current script is a bit complex. Is there a simpler implementation? (2) The execution takes 521 seconds. Any way to make it faster?
1. Create table
CREATE TABLE tb
(
tradeTime DateTime,
symbol String,
factor String,
value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(tradeTime)
ORDER BY (symbol, tradeTime)
SETTINGS index_granularity = 8192
2. Insert test data
INSERT INTO tb SELECT
tradetime,
symbol,
untuple(factor)
FROM
(
SELECT
tradetime,
symbol
FROM
(
WITH toDateTime('2022-01-01 00:00:00') AS start
SELECT arrayJoin(timeSlots(start, toUInt32((22 * 23) * 3600), 3600)) AS tradetime
)
ARRAY JOIN arrayMap(x -> concat('symbol', toString(x)), range(0, 5000)) AS symbol
)
ARRAY JOIN arrayMap(x -> (concat('f', toString(x)), toFloat64(x) + toFloat64(0.1)), range(0, 1000)) AS factor
3. Finally, send the query
SELECT
tradeTime,
sumIf(value, factor = 'factor1') AS factor1,
sumIf(value, factor = 'factor2') AS factor2,
sumIf(value, factor = 'factor3') AS factor3,
sumIf(value, factor = 'factor4') AS factor4,
...// so many factors to list out
sumIf(value, factor = 'factor1000') AS factor1000
FROM tb
GROUP BY tradeTime,symbol
ORDER BY tradeTime,symbol ASC
Have you considered building a materialized view to solve this, with the inserts going into a SummingMergeTree?
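On question (1), the repetitive sumIf list does not have to be written by hand. A small Python sketch (not from the thread; the 'factor1' to 'factor1000' naming follows the query above and should be adjusted to whatever factor values are actually stored in tb):
# Sketch: generate the long sumIf(...) select list programmatically.
factors = ["factor{}".format(i) for i in range(1, 1001)]
select_list = ",\n    ".join(
    "sumIf(value, factor = '{0}') AS {0}".format(f) for f in factors
)
query = (
    "SELECT\n    tradeTime,\n    symbol,\n    " + select_list + "\n"
    "FROM tb\n"
    "GROUP BY tradeTime, symbol\n"
    "ORDER BY tradeTime, symbol ASC"
)
print(query)  # paste into clickhouse-client or send through any ClickHouse driver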

Pyspark - Applying custom function on structured streaming

I have 4 columns: ['clientTimestamp', 'sensor_id', 'activity', 'incidents']. I consume data from a Kafka stream, preprocess it, and aggregate it in a window.
If I do a groupby with .count(), the stream works very well, writing each window with its count to the console.
This works:
df = df.withWatermark("clientTimestamp", "1 minutes")\
.groupby(window(df.clientTimestamp, "1 minutes", "1 minutes"), col('sensor_type')).count()
query = df.writeStream.outputMode("append").format('console').start()
query.awaitTermination()
But the real goal is to find the total time for which critical activity was live.
That is, for each sensor_type I group the data by window, get the list of critical activities, and find the total time that all the critical activity lasted (the code is below). But I am not sure I am using the UDF the right way, because the method below does not work. Can anyone provide an example of applying a custom function to each window group and writing the output to the console?
This does not work:
@f.pandas_udf(schemahh, f.PandasUDFType.GROUPED_MAP)
def calculate_time(pdf):
    pdf = pdf.reset_index(drop=True)
    total_time = 0
    index_list = pdf.index[pdf['activity'] == 'critical'].to_list()
    for ind in index_list:
        start = pdf.loc[ind]['clientTimestamp']
        end = pdf.loc[ind + 1]['clientTimestamp']
        diff = start - end
        time_n_mins = round(diff.seconds / 60, 2)
        total_time = total_time + time_n_mins
    largest_session_time = total_time
    new_pdf = pd.DataFrame(columns=['sensor_type', 'largest_session_time'])
    new_pdf.loc[0] = [pdf.loc[0]['sensor_type'], largest_session_time]
    return new_pdf
df = df.withWatermark("clientTimestamp", "1 minutes")\
.groupby(window(df.clientTimestamp, "1 minutes", "1 minutes"), col('sensor_type'), col('activity')).apply(calculate_time)
query = df.writeStream.outputMode("append").format('console').start()
query.awaitTermination()
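No working answer was recorded in the thread. As a hedged sketch of one way to do this (not the poster's code, and a different mechanism than the GROUPED_MAP decorator above): run the grouped pandas function on each micro-batch via foreachBatch plus applyInPandas (Spark 3). The schema string and column names are assumptions taken from the question; df is the streaming DataFrame from above.
import pandas as pd
from pyspark.sql import functions as F

# Assumed output schema for the grouped result (column names taken from the question).
out_schema = "sensor_type string, largest_session_time double"

def calculate_time(pdf):
    # pdf holds one (window, sensor_type) group as a pandas DataFrame.
    pdf = pdf.sort_values("clientTimestamp").reset_index(drop=True)
    total_time = 0.0
    for ind in pdf.index[pdf["activity"] == "critical"]:
        if ind + 1 in pdf.index:
            diff = pdf.loc[ind + 1, "clientTimestamp"] - pdf.loc[ind, "clientTimestamp"]
            total_time += round(diff.total_seconds() / 60, 2)
    return pd.DataFrame({"sensor_type": [pdf.loc[0, "sensor_type"]],
                         "largest_session_time": [total_time]})

def process_batch(batch_df, epoch_id):
    # Apply the custom function per (window, sensor_type) group of this micro-batch.
    result = (batch_df
              .groupBy(F.window("clientTimestamp", "1 minutes"), "sensor_type")
              .applyInPandas(calculate_time, schema=out_schema))
    result.show(truncate=False)

query = (df.withWatermark("clientTimestamp", "1 minutes")
           .writeStream
           .foreachBatch(process_batch)
           .start())
query.awaitTermination()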

Need to fetch n percentage of rows in u-sql query

I need help writing a U-SQL query that fetches the top n percent of rows. I have one dataset from which I need to take the total row count and then take the top 3% of rows based on col1. The code I have written is:
@count = SELECT Convert.ToInt32(COUNT(*)) AS cnt FROM @telData;
@count1 = SELECT cnt/100 AS cnt1 FROM @count;
DECLARE @cnt int = SELECT Convert.ToInt32(cnt1*3) FROM @count1;
@EngineFailureData =
    SELECT vin, accelerator_pedal_position, enginefailure=1
    FROM @telData
    ORDER BY accelerator_pedal_position DESC
    FETCH @cnt ROWS;
@telData is my base dataset. Thanks for the help.
Some comments first:
FETCH currently only takes literals as arguments (https://msdn.microsoft.com/en-us/library/azure/mt621321.aspx)
@var = SELECT ... will assign the name @var to the rowset expression that starts with the SELECT. U-SQL (currently) does not provide you with stateful scalar variable assignment from query results. Instead you would use a CROSS JOIN or another JOIN to join the scalar value in.
Now to the solution:
To get the percentage, take a look at the ROW_NUMBER() and PERCENT_RANK() functions. For example, the following shows you how to use either to answer your question. Given the simpler code for PERCENT_RANK() (no need for the MAX() and CROSS JOIN), I would suggest that solution.
DECLARE @percentage double = 0.25; // 25%
@data = SELECT *
        FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)
        ) AS T(pos);
@data =
    SELECT PERCENT_RANK() OVER(ORDER BY pos) AS p_rank,
           ROW_NUMBER() OVER(ORDER BY pos) AS r_no,
           pos
    FROM @data;
@cut_off =
    SELECT ((double) MAX(r_no)) * (1.0 - @percentage) AS max_r
    FROM @data;
@r1 =
    SELECT *
    FROM @data CROSS JOIN @cut_off
    WHERE ((double) r_no) > max_r;
@r2 =
    SELECT *
    FROM @data
    WHERE p_rank >= 1.0 - @percentage;
OUTPUT @r1
TO "/output/top_perc1.csv"
ORDER BY p_rank DESC
USING Outputters.Csv();
OUTPUT @r2
TO "/output/top_perc2.csv"
ORDER BY p_rank DESC
USING Outputters.Csv();

Create Python string placeholder (%s) n times

I am looking to automatically generate the following string in Python 2.7 using a loop based on the number of columns in a Pandas DataFrame:
INSERT INTO table_name (firstname, lastname) VALUES (534737, 100.115)
This assumes that the DataFrame has 2 columns.
Here is what I have:
import numpy as np
import pandas as pd

# Generate test numbers for the table:
df = pd.DataFrame(np.random.rand(5,2), columns=['firstname','lastname'])
# Create a list of tuples from the numbers in each row of the DataFrame:
list_of_tuples = [tuple(x) for x in df.values]
Now, I create the string:
Manually - this works:
add_SQL = "INSERT INTO table_name (firstname, lastname) VALUES %s" % (list_of_tuples[4],)
In this example, I only used 2 column names - 'firstname' and 'lastname'. But I must do this with a loop since I have 156 column names - I cannot do this manually.
What I need:
I need to automatically generate the placeholder %s the same number of times as the number of columns in the Pandas DataFrame. Here, the DataFrame has 2 columns, so I need an automatic way to generate %s twice. Then I need to create a tuple with 2 entries, without the ''.
My attempt:
sss = ['%s' for x in range(0,len(list(df)))]
add_SQL = "INSERT INTO table_name (" + sss + ") VALUES %s" % (len(df), list_of_tuples[4])
But this is not working.
Is there a way for me to generate this string automatically?
Here is what I came up with - it is based on dwanderson's approach in the 2nd comment of the original post (question):
table_name = name_a #name of table
# Loop through all columns of dataframe and generate one string per column:
cols_n = df.columns.tolist()
placeholder = ",".join(["%s"]*df.shape[1]) #df.shape[1] gives # of columns
column_names = ",".join(cols_n)
insrt = "INSERT INTO %s " % table_name
for qrt in range(0, df.shape[0]):
    add_SQL_a_1 = insrt + "(" + column_names + ") VALUES (" + placeholder + ")"  # part 1/2
    add_SQL_a_2 = add_SQL_a_1 % list_of_tuples[qrt]  # part 2/2
This way, the final string is in part 2/2.
For some reason, it would not let me do this all in one line and I can't figure out why.
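On the one-line point: one likely culprit (an assumption, since the failing one-liner is not shown) is operator precedence, because % binds more tightly than +, so the formatting has to be wrapped in parentheses. A small sketch reusing the names defined above:
# Inside the same loop over qrt: build the template, then apply the row values in one expression.
add_SQL = ("INSERT INTO " + table_name + " (" + column_names
           + ") VALUES (" + placeholder + ")") % list_of_tuples[qrt]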

how to implement spark sql pagination query

Does anyone know how to do pagination in a Spark SQL query?
I need to use Spark SQL but don't know how to do pagination.
Tried:
select * from person limit 10, 10
It has been 6 years; I don't know if it was possible back then.
I would add a sequential id to the dataframe and filter for the records between offset and offset + limit.
In a pure Spark SQL query it would be something like this, for offset 10 and limit 10:
WITH count_person AS (
    SELECT *, monotonically_increasing_id() AS count FROM person)
SELECT * FROM count_person WHERE count >= 10 AND count < 20
In PySpark it would be very similar:
import pyspark.sql.functions as F
offset = 10
limit = 10
df = df.withColumn('_id', F.monotonically_increasing_id())
df = df.where(F.col('_id').between(offset, offset + limit - 1))  # between() is inclusive on both ends
It's flexible and fast enough even for a big volume of data.
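One caveat with this approach: monotonically_increasing_id() guarantees increasing, unique ids but not consecutive ones, so the values can jump across partitions and the offset arithmetic can skip rows. A hedged alternative sketch using row_number() over an explicit ordering; the sort key column id is an assumption:
from pyspark.sql import Window
import pyspark.sql.functions as F

offset, limit = 10, 10
# row_number() over an explicit ordering yields consecutive 1-based positions.
# Note: a Window without partitionBy pulls all rows into a single partition.
w = Window.orderBy("id")  # "id" is an assumed sortable key column
page = (df.withColumn("_rn", F.row_number().over(w))
          .where((F.col("_rn") > offset) & (F.col("_rn") <= offset + limit))
          .drop("_rn"))
page.show()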
karthik's answer will fail if there are duplicate rows in the dataframe, since except will remove all rows in df1 which are in df2.
val filteredRdd = df.rdd.zipWithIndex().collect { case (r, i) if i >= 10 && i <= 20 => r }
val newDf = sqlContext.createDataFrame(filteredRdd, df.schema)
There is no support for offset in Spark SQL as of now. One of the alternatives you can use for paging is DataFrames with the except method.
Example: You want to iterate with a paging limit of 10, you can do the following:
DataFrame df1;
long count = df.count();
int limit = 10;
while (count > 0) {
    df1 = df.limit(limit);
    df1.show(); // will print 10, next 10, etc. rows
    df = df.except(df1);
    count = count - limit;
}
If you want to do say, LIMIT 50, 100 in the first go, you can do the following:
df1 = df.limit(50);
df2 = df.except(df1);
df2 = df2.limit(100); // required result
Hope this helps!
Please find below a useful PySpark (Python 3 and Spark 3) class named SparkPaging, which abstracts the pagination mechanism:
https://gitlab.com/enahwe/public/lib/spark/sparkpaging
Here's the usage:
SparkPaging
Class for paging dataframes and datasets
Example
- Init example 1:
Approach by specifying a limit.
sp = SparkPaging(initData=df, limit=753)
- Init example 2:
Approach by specifying a number of pages (if there is a remainder, the number of pages will be incremented).
sp = SparkPaging(initData=df, pages=6)
- Init example 3:
Approach by specifying a limit.
sp = SparkPaging()
sp.init(initData=df, limit=753)
- Init example 4:
Approach by specifying a number of pages (if there is a remainder, the number of pages will be incremented).
sp = SparkPaging()
sp.init(initData=df, pages=6)
- Reset:
sp.reset()
- Iterate example:
print("- Total number of rows = " + str(sp.initDataCount))
print("- Limit = " + str(sp.limit))
print("- Number of pages = " + str(sp.pages))
print("- Number of rows in the last page = " + str(sp.numberOfRowsInLastPage))
while (sp.page < sp.pages - 1):
    df_page = sp.next()
    nbrRows = df_page.count()
    print("   Page " + str(sp.page) + '/' + str(sp.pages) + ": Number of rows = " + str(nbrRows))
- Output:
- Total number of rows = 4521
- Limit = 753
- Number of pages = 7
- Number of rows in the last page = 3
Page 0/7: Number of rows = 753
Page 1/7: Number of rows = 753
Page 2/7: Number of rows = 753
Page 3/7: Number of rows = 753
Page 4/7: Number of rows = 753
Page 5/7: Number of rows = 753
Page 6/7: Number of rows = 3
