Can someone help me with how to use NULLS LAST in MemSQL? In other RDBMSs we have an option for NULLS LAST, but MemSQL does not seem to support it.
SingleStore (formerly MemSQL) supports it:
singlestore> create table t(a int);
Query OK, 0 rows affected (0.02 sec)
singlestore> insert t values(1),(2),(null),(4);
singlestore> select a from t order by a;
+------+
| a |
+------+
| NULL |
| 1 |
| 2 |
| 4 |
+------+
4 rows in set (0.03 sec)
singlestore> select a from t order by a NULLS LAST;
+------+
| a |
+------+
| 1 |
| 2 |
| 4 |
| NULL |
+------+
In the Spark documentation there is an example:
df = ... # streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: DateType }
# Select the devices which have signal more than 10
df.select("device").where("signal > 10")
What does the select("device") part do?
If it is a selection by the signal field's value, then why mention the device field at all?
Why not just write
df.where("signal > 10")
or
df.select("time").where("signal > 10")
?
select("device")
this only select the Column "device"
df.show
+------+---+---+---+---+---+
|device|  B|  C|  D|  E|  F|
+------+---+---+---+---+---+
|    10|  4|  1|  0|  3|  1|
|    15|  6|  4|  3|  2|  0|
+------+---+---+---+---+---+
df.select("device").show
+------+
|device|
+------+
|    10|
|    15|
+------+
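To make the difference between select and where concrete, here is a minimal PySpark sketch using the schema from the question (the device names and signal values are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy batch DataFrame standing in for the streaming one in the docs
df = spark.createDataFrame(
    [("dev1", "sensor", 5.0), ("dev2", "sensor", 12.0), ("dev3", "camera", 20.0)],
    ["device", "deviceType", "signal"],
)

# where() filters rows, select() picks columns
df.where("signal > 10").show()                   # every column, only rows with signal > 10
df.select("device").where("signal > 10").show()  # only the device column for those rows
So df.where("signal > 10") is also valid; the documentation adds select("device") only because that example is interested in which devices reported a strong signal, not in the full rows.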
I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example I have the following table
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" is present in Col1. So the result would look like:
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zipping (zipWithIndex) but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide a built-in function to achieve this kind of functionality directly.
I would most probably do it this way:
// inputDF contains Col1 | Col2
import spark.implicits._  // needed for toDF on an RDD of tuples
val indexDF = inputDF.select("Col1").distinct.rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }.toDF("Col1", "Col3")
val finalDF = inputDF.join(indexDF, Seq("Col1"), "left")
but the problem I can see here is the join, which will result in a shuffle.
You can also check other auto-increment APIs here.
Use a window and take a running sum of the value 1 whenever Col1 = 'A'.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+
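Since the question stresses that the initial order must be kept, and a window with no explicit ordering is not deterministic on a distributed DataFrame, a variant that pins the order first might look like this (a sketch; the helper column name order_id is made up, and it assumes the DataFrame's current row order reflects the ingestion order):
import pyspark.sql.functions as f
from pyspark.sql import Window

# Capture the current row order before applying the window
# (monotonically_increasing_id is increasing but not consecutive)
df_ordered = df.withColumn('order_id', f.monotonically_increasing_id())

w = Window.orderBy('order_id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
result = (df_ordered
          .withColumn('Col3', f.sum(f.when(f.col('Col1') == 'A', 1).otherwise(0)).over(w))
          .drop('order_id'))
result.show()
As in the answer above, an unpartitioned window still pulls all rows into a single partition, so for 100+ GB tables any natural grouping column should also go into partitionBy.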
I am trying to solve a problem with PySpark.
I have a dataset such as:
Condition | Date
0 | 2019/01/10
1 | 2019/01/11
0 | 2019/01/15
1 | 2019/01/16
1 | 2019/01/19
0 | 2019/01/23
0 | 2019/01/25
1 | 2019/01/29
1 | 2019/01/30
I would like to get, for each row, the most recent Date value from the preceding rows where Condition == 1.
The desired output would be something like:
Condition | Date | Lag
0 | 2019/01/10 | NaN
1 | 2019/01/11 | NaN
0 | 2019/01/15 | 2019/01/11
1 | 2019/01/16 | 2019/01/11
1 | 2019/01/19 | 2019/01/16
0 | 2019/01/23 | 2019/01/19
0 | 2019/01/25 | 2019/01/19
1 | 2019/01/29 | 2019/01/19
1 | 2019/01/30 | 2019/01/29
How can I perform that?
Please keep in mind it is a very large dataset, which I will have to partition and group by a UUID, so the solution has to be reasonably performant.
Thank you,
Here is a solution with PySpark. The logic is the same as in @GordonLinoff's SQL solution.
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, max, when
w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df.withColumn("Lag", max(when(col("Condition") == lit(1), col("Date"))).over(w)).show()
Gives:
+---------+----------+----------+
|Condition| Date| Lag|
+---------+----------+----------+
| 0|2019/01/10| null|
| 1|2019/01/11| null|
| 0|2019/01/15|2019/01/11|
| 1|2019/01/16|2019/01/11|
| 1|2019/01/19|2019/01/16|
| 0|2019/01/23|2019/01/19|
| 0|2019/01/25|2019/01/19|
| 1|2019/01/29|2019/01/19|
| 1|2019/01/30|2019/01/29|
+---------+----------+----------+
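Since the question mentions that the data will also be partitioned and grouped by a UUID, the same running max can be computed per entity by partitioning the window; a sketch, where uuid is just a placeholder for that column name:
from pyspark.sql import Window
from pyspark.sql.functions import col, max, when

# "uuid" is a placeholder for the grouping column mentioned in the question
w = (Window.partitionBy("uuid")
           .orderBy("Date")
           .rowsBetween(Window.unboundedPreceding, -1))
df.withColumn("Lag", max(when(col("Condition") == 1, col("Date"))).over(w)).show()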
In SQL, you can use a conditional running max():
select t.*,
       max(case when condition = 1 then date end) over (
           order by date
           rows between unbounded preceding and 1 preceding
       ) as prev_condition_1_date
from t;
I like to use a SQL expression to solve that:
from pyspark.sql.functions import expr

# display() is Databricks-specific; use .show() elsewhere
display(
    df.withColumn(
        'lag',
        expr('max(case when Condition == 1 then Date end) over (order by Date rows between unbounded preceding and 1 preceding)')
    )
)
I have the following rows in Hive (on HDFS) and am using Presto as the query engine.
1,#markbutcher72 #charlottegloyn Not what Belinda Carlisle thought. And yes, she was singing about Edgbaston.
2,#tomkingham #markbutcher72 #charlottegloyn It's true the garden of Eden is currently very green...
3,#MrRhysBenjamin #gasuperspark1 #markbutcher72 Actually it's Springfield Park, the (occasional) home of the might
The requirement is to get the following output through a Presto query. How can we do this?
1,markbutcher72
1,charlottegloyn
2,tomkingham
2,markbutcher72
2,charlottegloyn
3,MrRhysBenjamin
3,gasuperspark1
3,markbutcher72
select t.id
,u.token
from mytable as t
cross join unnest (regexp_extract_all(text,'(?<=#)\S+')) as u(token)
;
+----+----------------+
| id | token |
+----+----------------+
| 1 | markbutcher72 |
| 1 | charlottegloyn |
| 2 | tomkingham |
| 2 | markbutcher72 |
| 2 | charlottegloyn |
| 3 | MrRhysBenjamin |
| 3 | gasuperspark1 |
| 3 | markbutcher72 |
+----+----------------+
I have a Spark data frame that looks like this (simplifying timestamp and id column values for clarity):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 4 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 8 | 1 | pending |
| 9 | 1 | in-progress |
It's a time series of status events. What I'd like to end up with is only the rows representing a status change. In that sense, the problem can be seen as one of removing redundant rows - e.g. entries at times 4 and 8 - both for id = 1 - should be dropped as they do not represent a change of status for a given id.
For the above set of rows, this would give (order being unimportant):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 9 | 1 | in-progress |
Original plan was to partition by id and status, order by timestamp, and pick the first row for each partition - however this would give
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 7 | 4 | closed |
i.e. it loses repeated status changes.
Any pointers appreciated, I'm new to data frames and may be missing a trick or two.
Using the lag window function should do the trick
case class Event(timestamp: Int, id: Int, status: String)
val events = sqlContext.createDataFrame(sc.parallelize(
Event(1, 1, "pending") :: Event(2, 2, "pending") ::
Event(3, 1, "in-progress") :: Event(4, 1, "in-progress") ::
Event(5, 3, "in-progress") :: Event(6, 1, "pending") ::
Event(7, 4, "closed") :: Event(8, 1, "pending") ::
Event(9, 1, "in-progress") :: Nil
))
events.registerTempTable("events")
val query = """SELECT timestamp, id, status FROM (
SELECT timestamp, id, status, lag(status) OVER (
PARTITION BY id ORDER BY timestamp
) AS prev_status FROM events) tmp
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id"""
sqlContext.sql(query).show
Inner query
SELECT timestamp, id, status, lag(status) OVER (
PARTITION BY id ORDER BY timestamp
) AS prev_status FROM events
produces the table below, where prev_status is the previous value of status for the same id, ordered by timestamp.
+---------+--+-----------+-----------+
|timestamp|id| status|prev_status|
+---------+--+-----------+-----------+
| 1| 1| pending| null|
| 3| 1|in-progress| pending|
| 4| 1|in-progress|in-progress|
| 6| 1| pending|in-progress|
| 8| 1| pending| pending|
| 9| 1|in-progress| pending|
| 2| 2| pending| null|
| 5| 3|in-progress| null|
| 7| 4| closed| null|
+---------+--+-----------+-----------+
Outer query
SELECT timestamp, id, status FROM (...)
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id
simply keeps the rows where prev_status is NULL (the first row for a given id) or where prev_status differs from status (i.e. there was a status change between consecutive events for that id). The ORDER BY is added just to make visual inspection easier.
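For reference, a sketch of the same lag approach written against the DataFrame API in PySpark instead of a SQL string (it assumes a DataFrame named events with columns timestamp, id and status):
from pyspark.sql import Window
from pyspark.sql.functions import col, lag

w = Window.partitionBy("id").orderBy("timestamp")

# Keep only the rows whose status differs from the previous status for the same id
changes = (events
           .withColumn("prev_status", lag("status").over(w))
           .where(col("prev_status").isNull() | (col("prev_status") != col("status")))
           .drop("prev_status"))
changes.orderBy("timestamp", "id").show()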