The sample code is as follows:
symbol = take(`APPL, 6) join take(`FB, 5)
time = 2019.02.27T09:45:01 2019.02.27T09:45:01 2019.02.27T09:45:04 2019.02.27T09:45:04 2019.02.27T09:45:05 2019.02.27T09:45:06 2019.02.27T09:45:01 2019.02.27T09:45:02 2019.02.27T09:45:03 2019.02.27T09:45:04 2019.02.27T09:45:05
price = (170 + rand(5, 6)) join (64 + rand(2, 5))
quotes = table(symbol, time, price)
weights = dict(`APPL`FB, 0.6 0.4)
ETF = select symbol, time, price * weights[symbol] as price from quotes
t = select rowSum(ffill(last(price))) from ETF pivot by time, symbol
What is the specific execution process of the above select rowSum(ffill(last(price))) from ETF pivot by time, symbol?
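One way to picture what that statement computes (a rough pandas sketch, not DolphinDB's actual execution plan; etf_df stands for a hypothetical pandas copy of the ETF table above):
import pandas as pd

# Illustrative only: etf_df is assumed to hold the symbol, time, price columns of ETF.
# pivot by time, symbol with last(price): build a time x symbol matrix of last prices
wide = etf_df.groupby(["time", "symbol"])["price"].last().unstack("symbol")
# ffill(...): forward-fill each symbol column so the latest known price carries forward
wide = wide.ffill()
# rowSum(...): sum the weighted prices across the symbol columns for every timestamp
row_sum = wide.sum(axis=1)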
I have a table containing the market data of 5,000 unique stocks. Each stock has 24 records a day and each record has 1,000 fields (factors). I want to pivot the table for cross-sectional analysis. You can find my script below.
I have two questions: (1) The current script is a bit complex. Is there a simpler implementation? (2) The execution takes 521 seconds. Any way to make it faster?
1. Create table
CREATE TABLE tb
(
tradeTime DateTime,
symbol String,
factor String,
value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(tradeTime)
ORDER BY (symbol, tradeTime)
SETTINGS index_granularity = 8192
2. Insert test data
INSERT INTO tb SELECT
tradetime,
symbol,
untuple(factor)
FROM
(
SELECT
tradetime,
symbol
FROM
(
WITH toDateTime('2022-01-01 00:00:00') AS start
SELECT arrayJoin(timeSlots(start, toUInt32((22 * 23) * 3600), 3600)) AS tradetime
)
ARRAY JOIN arrayMap(x -> concat('symbol', toString(x)), range(0, 5000)) AS symbol
)
ARRAY JOIN arrayMap(x -> (concat('factor', toString(x)), toFloat64(x) + toFloat64(0.1)), range(1, 1001)) AS factor
3. Finally, send the query
SELECT
tradeTime,
sumIf(value, factor = 'factor1') AS factor1,
sumIf(value, factor = 'factor2') AS factor2,
sumIf(value, factor = 'factor3') AS factor3,
sumIf(value, factor = 'factor4') AS factor4,
... -- so many factors to list out
sumIf(value, factor = 'factor1000') AS factor1000
FROM tb
GROUP BY tradeTime, symbol
ORDER BY tradeTime, symbol ASC
Have you considered building a materialized view to solve this, with the inserts going into a SummingMergeTree?
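Whichever engine-side approach you choose, the long sumIf list itself does not have to be typed by hand; here is a minimal Python sketch that builds the query string (the commented-out client.execute call assumes the clickhouse-driver package):
# Sketch: generate the 1000 sumIf aliases instead of listing them manually.
factor_exprs = ",\n    ".join(
    "sumIf(value, factor = 'factor{0}') AS factor{0}".format(i) for i in range(1, 1001)
)
query = (
    "SELECT\n    tradeTime,\n    " + factor_exprs +
    "\nFROM tb\nGROUP BY tradeTime, symbol\nORDER BY tradeTime, symbol ASC"
)
# result = client.execute(query)  # hypothetical clickhouse-driver Client instance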
I have 4 columns: ['clientTimestamp', 'sensor_id', 'activity', 'incidents']. I consume data from a Kafka stream, preprocess it, and aggregate it in a window.
If I do a groupby with .count(), the stream works very well, writing each window with its count to the console.
This works:
df = df.withWatermark("clientTimestamp", "1 minutes")\
.groupby(window(df.clientTimestamp, "1 minutes", "1 minutes"), col('sensor_type')).count()
query = df.writeStream.outputMode("append").format('console').start()
query.awaitTermination()
But the real goal is to find the total time for which critical activity was live.
That is, for each sensor_type, I group the data by window, get the list of critical activities, and find the total time that all the critical activity lasted (the code is below). But I am not sure whether I am using the UDF in the right way, because the method below does not work. Can anyone provide an example of applying a custom function to each window group and writing the output to the console?
This does not work:
@f.pandas_udf(schemahh, f.PandasUDFType.GROUPED_MAP)
def calculate_time(pdf):
pdf = pdf.reset_index(drop=True)
total_time = 0
index_list = pdf.index[pdf['activity'] == 'critical'].to_list()
for ind in index_list:
start = pdf.loc[ind]['clientTimestamp']
end = pdf.loc[ind + 1]['clientTimestamp']
diff = start - end
time_n_mins = round(diff.seconds / 60, 2)
total_time = total_time + time_n_mins
largest_session_time = total_time
new_pdf = pd.DataFrame(columns=['sensor_type', 'largest_session_time'])
new_pdf.loc[0] = [pdf.loc[0]['sensor_type'], largest_session_time]
return new_pdf
df = df.withWatermark("clientTimestamp", "1 minutes")\
.groupby(window(df.clientTimestamp, "1 minutes", "1 minutes"), col('sensor_type'), col('activity')).apply(calculate_time)
query = df.writeStream.outputMode("append").format('console').start()
query.awaitTermination()
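One pattern that is often suggested for this kind of per-group custom aggregation in a streaming job (a sketch under assumptions, not a verified fix for the code above) is foreachBatch: inside it, each micro-batch is a plain DataFrame, so a grouped pandas function can be applied with applyInPandas and the result printed to the console:
import pandas as pd
import pyspark.sql.functions as f

# Hypothetical result schema for the grouped computation.
result_schema = "sensor_type string, largest_session_time double"

def calculate_critical_time(pdf):
    # Sum the gaps between each 'critical' record and the following record.
    pdf = pdf.sort_values("clientTimestamp").reset_index(drop=True)
    total_time = 0.0
    for ind in pdf.index[pdf["activity"] == "critical"]:
        if ind + 1 < len(pdf):
            diff = pdf.loc[ind + 1, "clientTimestamp"] - pdf.loc[ind, "clientTimestamp"]
            total_time += round(diff.total_seconds() / 60, 2)
    return pd.DataFrame({"sensor_type": [pdf.loc[0, "sensor_type"]],
                         "largest_session_time": [total_time]})

def process_batch(batch_df, batch_id):
    # The micro-batch is not a streaming DataFrame, so applyInPandas works here.
    # date_trunc approximates the 1-minute tumbling window used above.
    out = (batch_df
           .withColumn("minute", f.date_trunc("minute", "clientTimestamp"))
           .groupBy("minute", "sensor_type")
           .applyInPandas(calculate_critical_time, schema=result_schema))
    out.show(truncate=False)

query = (df.withWatermark("clientTimestamp", "1 minutes")
           .writeStream
           .foreachBatch(process_batch)
           .start())
query.awaitTermination()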
I need help writing a U-SQL query that fetches the top n percent of rows. I have one dataset from which I need to take the total count of rows and then take the top 3% of rows based on col1. The code I have written is:
#count = SELECT Convert.ToInt32(COUNT(*)) AS cnt FROM #telData;
#count1=SELECT cnt/100 AS cnt1 FROM #count;
DECLARE #cnt int=SELECT Convert.ToInt32(cnt1*3) FROM #count1;
#EngineFailureData=
SELECT vin,accelerator_pedal_position,enginefailure=1
FROM #telData
ORDER BY accelerator_pedal_position DESC
FETCH #cnt ROWS;
#telData is my basic dataset. Thanks for the help.
Some comments first:
FETCH currently only takes literals as arguments (https://msdn.microsoft.com/en-us/library/azure/mt621321.aspx)
#var = SELECT ... will assign the name #var to the rowset expression that starts with the SELECT. U-SQL (currently) does not provide you with stateful scalar variable assignment from query results. Instead you would use a CROSS JOIN or other JOIN to join the scalar value in.
Now to the solution:
To get the percentage, take a look at the ROW_NUMBER() and PERCENT_RANK() functions. For example, the following shows you how to use either to answer your question. Given the simpler code for PERCENT_RANK() (no need for the MAX() and CROSS JOIN), I would suggest that solution.
DECLARE #percentage double = 0.25; // 25%
#data = SELECT *
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)
) AS T(pos);
#data =
SELECT PERCENT_RANK() OVER(ORDER BY pos) AS p_rank,
ROW_NUMBER() OVER(ORDER BY pos) AS r_no,
pos
FROM #data;
#cut_off =
SELECT ((double) MAX(r_no)) * (1.0 - #percentage) AS max_r
FROM #data;
#r1 =
SELECT *
FROM #data CROSS JOIN #cut_off
WHERE ((double) r_no) > max_r;
#r2 =
SELECT *
FROM #data
WHERE p_rank >= 1.0 - #percentage;
OUTPUT #r1
TO "/output/top_perc1.csv"
ORDER BY p_rank DESC
USING Outputters.Csv();
OUTPUT #r2
TO "/output/top_perc2.csv"
ORDER BY p_rank DESC
USING Outputters.Csv();
I am looking to automatically generate the following string in Python 2.7 using a loop based on the number of columns in a Pandas DataFrame:
INSERT INTO table_name (firstname, lastname) VALUES (534737, 100.115)
This assumes that the DataFrame has 2 columns.
Here is what I have:
import numpy as np
import pandas as pd
# Generate test numbers for the table:
df = pd.DataFrame(np.random.rand(5,2), columns=['firstname','lastname'])
# Create list of tuples from numbers in each row of DataFrame:
list_of_tuples = [tuple(x) for x in df.values]
Now, I create the string:
Manually - this works:
add_SQL = "INSERT INTO table_name (firstname, lastname) VALUES %s" % (list_of_tuples[4])
In this example, I only used 2 column names - 'firstname' and 'lastname'. But I must do this with a loop since I have 156 column names - I cannot do this manually.
What I need:
I need to automatically generate the placeholder %s the same number of times as the number of columns in the Pandas DataFrame. Here, the DataFrame has 2 columns, so I need an automatic way to generate %s twice.
Then I need to create a tuple with 2 entries, without the ''.
My attempt:
sss = ['%s' for x in range(0,len(list(df)))]
add_SQL = "INSERT INTO table_name (" + sss + ") VALUES %s" % (len(df), list_of_tuples[4])
But this is not working.
Is there a way for me to generate this string automatically?
Here is what I came up with - it is based on dwanderson's approach in the 2nd comment of the original post (question):
table_name = 'name_a'  # name of the table
# Loop through all columns of dataframe and generate one string per column:
cols_n = df.columns.tolist()
placeholder = ",".join(["%s"]*df.shape[1]) #df.shape[1] gives # of columns
column_names = ",".join(cols_n)
insrt = "INSERT INTO %s " % table_name
for qrt in range(0,df.shape[0]):
add_SQL_a_1 = insrt + "(" + column_names + ") VALUES (" + placeholder + ")" #part 1/2
add_SQL_a_2 = add_SQL_a_1 % list_of_tuples[qrt] #part 2/2
This way, the final string is in part 2/2.
For some reason, it would not let me do this all in one line and I can't figure out why.
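If the end goal is actually to run these INSERTs against a database, one option worth noting (not part of the original answer) is to let a DB-API driver do the value substitution itself; a minimal sketch, assuming an already-open psycopg2- or MySQLdb-style connection named conn:
# Sketch only: conn is a hypothetical DB-API connection whose driver uses %s placeholders.
placeholder = ",".join(["%s"] * df.shape[1])        # one %s per column
column_names = ",".join(df.columns.tolist())
sql = "INSERT INTO table_name (" + column_names + ") VALUES (" + placeholder + ")"
cur = conn.cursor()
cur.executemany(sql, list_of_tuples)                # the driver quotes/escapes every value
conn.commit()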
Does anyone know how to do pagination in a Spark SQL query?
I need to use Spark SQL but I don't know how to do pagination.
Tried:
select * from person limit 10, 10
It has been 6 years; I don't know if it was possible back then.
I would add a sequential id to the rows and filter for records between offset and offset + limit.
In a pure Spark SQL query it would be something like this, for offset 10 and limit 10:
WITH count_person AS (
SELECT *, monotonically_increasing_id() AS count FROM person)
SELECT * FROM count_person WHERE count >= 10 AND count < 20
In PySpark it would be very similar:
import pyspark.sql.functions as F
offset = 10
limit = 10
df = df.withColumn('_id', F.monotonically_increasing_id())
df = df.where((F.col('_id') >= offset) & (F.col('_id') < offset + limit))
It's flexible and fast enough even for a big volume of data.
karthik's answer will fail if there are duplicate rows in the dataframe, since except removes all rows of df1 that are in df2.
val filteredRdd = df.rdd.zipWithIndex().collect { case (r, i) if i >= 10 && i <= 20 => r }
val newDf = sqlContext.createDataFrame(filteredRdd, df.schema)
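For reference, a rough PySpark equivalent of the zipWithIndex snippet above (a sketch; spark is the SparkSession and 10..20 is the desired index range):
# Sketch: pair every row with its index, keep the requested range, rebuild a DataFrame.
paged_rdd = (df.rdd.zipWithIndex()
               .filter(lambda pair: 10 <= pair[1] <= 20)
               .map(lambda pair: pair[0]))
paged_df = spark.createDataFrame(paged_rdd, df.schema)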
There is no support for offset in Spark SQL as of now. One alternative you can use for paging is the DataFrame except method.
Example: if you want to iterate with a paging limit of 10, you can do the following:
DataFrame df1;
long count = df.count();
int limit = 10;
while(count > 0){
df1 = df.limit(limit);
df1.show(); //will print 10, next 10, etc rows
df = df.except(df1);
count = count - limit;
}
If you want to do, say, LIMIT 50, 100 in the first go, you can do the following:
df1 = df.limit(50);
df2 = df.except(df1);
df2.limit(100); //required result
Hope this helps!
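Another pattern that is often used for deterministic pages (a sketch, not from this answer; it assumes the table has a column, here called id, that provides a stable ordering):
from pyspark.sql import Window
import pyspark.sql.functions as F

page, page_size = 1, 10          # 0-based page index
w = Window.orderBy("id")         # "id" is a hypothetical stable sort key
# Note: a global window without partitionBy pulls all rows onto a single partition,
# so this is best suited to result sets that are not huge.
paged = (df.withColumn("rn", F.row_number().over(w))
           .where((F.col("rn") > page * page_size) &
                  (F.col("rn") <= (page + 1) * page_size))
           .drop("rn"))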
Please find below a useful PySpark (Python 3 and Spark 3) class named SparkPaging which abstracts the pagination mechanism:
https://gitlab.com/enahwe/public/lib/spark/sparkpaging
Here's the usage:
SparkPaging
Class for paging dataframes and datasets
Example
- Init example 1:
Approach by specifying a limit.
sp = SparkPaging(initData=df, limit=753)
- Init example 2:
Approach by specifying a number of pages (if there's a remainder, the number of pages will be incremented).
sp = SparkPaging(initData=df, pages=6)
- Init example 3:
Approach by specifying a limit.
sp = SparkPaging()
sp.init(initData=df, limit=753)
- Init example 4:
Approach by specifying a number of pages (if there's a remainder, the number of pages will be incremented).
sp = SparkPaging()
sp.init(initData=df, pages=6)
- Reset:
sp.reset()
- Iterate example:
print("- Total number of rows = " + str(sp.initDataCount))
print("- Limit = " + str(sp.limit))
print("- Number of pages = " + str(sp.pages))
print("- Number of rows in the last page = " + str(sp.numberOfRowsInLastPage))
while (sp.page < sp.pages-1):
df_page = sp.next()
nbrRows = df_page.count()
print(" Page " + str(sp.page) + '/' + str(sp.pages) + ": Number of rows = " + str(nbrRows))
- Output:
- Total number of rows = 4521
- Limit = 753
- Number of pages = 7
- Number of rows in the last page = 3
Page 0/7: Number of rows = 753
Page 1/7: Number of rows = 753
Page 2/7: Number of rows = 753
Page 3/7: Number of rows = 753
Page 4/7: Number of rows = 753
Page 5/7: Number of rows = 753
Page 6/7: Number of rows = 3