I have a question about Cassandra. At present, sorting entities_by_time by column1 works fine while the ids are 18 digits long, but as soon as the ids grow to 19 digits the ascending order goes wrong. Please help me.
cqlsh:minds> select * from entities_by_time where key='activity:user:990192934408163330' order by column1 desc limit 10;
key | column1 | value
----------------------------------+--------------------+--------------------
activity:user:990192934408163330 | 999979571363188746 | 999979571363188746
activity:user:990192934408163330 | 999979567064027139 | 999979567064027139
activity:user:990192934408163330 | 999979562764865555 | 999979562764865555
activity:user:990192934408163330 | 999979558465703953 | 999979558465703953
activity:user:990192934408163330 | 999979554170736649 | 999979554170736649
activity:user:990192934408163330 | 999979549871575047 | 999979549871575047
activity:user:990192934408163330 | 999979545576607752 | 999979545576607752
activity:user:990192934408163330 | 999979541290029073 | 999979541290029073
activity:user:990192934408163330 | 999979536990867461 | 999979536990867461
activity:user:990192934408163330 | 999979532700094475 | 999979532700094475
cqlsh:minds> select * from entities_by_time where key='activity:user:990192934408163330' order by column1 asc limit 10;
key | column1 | value
----------------------------------+---------------------+---------------------
activity:user:990192934408163330 | 1000054880351555598 | 1000054880351555598
activity:user:990192934408163330 | 1000054884671688706 | 1000054884671688706
activity:user:990192934408163330 | 1000054888966656017 | 1000054888966656017
activity:user:990192934408163330 | 1000054893257429005 | 1000054893257429005
activity:user:990192934408163330 | 1000054897552396308 | 1000054897552396308
activity:user:990192934408163330 | 1000054901843169290 | 1000054901843169290
activity:user:990192934408163330 | 1000054906138136577 | 1000054906138136577
activity:user:990192934408163330 | 1000054910433103883 | 1000054910433103883
activity:user:990192934408163330 | 1000054914723876869 | 1000054914723876869
activity:user:990192934408163330 | 1000054919010455568 | 1000054919010455568
CREATE TABLE minds.entities_by_time (
key text,
column1 text,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'enabled': 'false'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.1
AND speculative_retry = '99PERCENTILE';
Digging into it, I found that in Cassandra 1007227353832624141 sorts as less than 963426376394739730. Why?
Good call Chris! The table definition tells it all! I recreated your table and ran queries sorting in both directions:
flynn#cqlsh:stackoverflow> SELECT * FROM entities_by_time
WHERE key='activity:user:990192934408163330' ORDER BY column1 DESC;
key | column1 | value
----------------------------------+---------------------+---------------------
activity:user:990192934408163330 | 999979571363188746 | 999979571363188746
activity:user:990192934408163330 | 999979567064027139 | 999979567064027139
activity:user:990192934408163330 | 963426376394739730 | 963426376394739730
activity:user:990192934408163330 | 1007227353832624141 | 1007227353832624141
activity:user:990192934408163330 | 1000054884671688706 | 1000054884671688706
activity:user:990192934408163330 | 1000054880351555598 | 1000054880351555598
(6 rows)
flynn#cqlsh:stackoverflow> SELECT * FROM entities_by_time
WHERE key='activity:user:990192934408163330' ORDER BY column1 ASC;
key | column1 | value
----------------------------------+---------------------+---------------------
activity:user:990192934408163330 | 1000054880351555598 | 1000054880351555598
activity:user:990192934408163330 | 1000054884671688706 | 1000054884671688706
activity:user:990192934408163330 | 1007227353832624141 | 1007227353832624141
activity:user:990192934408163330 | 963426376394739730 | 963426376394739730
activity:user:990192934408163330 | 999979567064027139 | 999979567064027139
activity:user:990192934408163330 | 999979571363188746 | 999979571363188746
(6 rows)
So to your question...
in Cassandra, 1007227353832624141 is less than 963426376394739730. Why?
Simply put, because 9 > 1, that's why.
Your table definition clusters on column1, which is a TEXT/UTF8 string and not a numeric. Essentially, Cassandra is sorting strings the only way it knows how - in ASCII-betical order, which is not alpha-numeric order.
Store your numerics as numerics, and sorting will behave in ways that are more predictable.
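To see the same effect outside of Cassandra, here is a minimal Scala sketch (nothing Cassandra-specific, just plain string vs. numeric comparison on the two ids from the question):
// Lexicographic order is what sorting a TEXT clustering column gives you:
// "963..." sorts after "1007..." because '9' > '1'.
val ids = Seq("1007227353832624141", "963426376394739730")
println(ids.sorted)               // List(1007227353832624141, 963426376394739730)
// Numeric order is what a numeric clustering column (e.g. bigint) would give you.
println(ids.map(_.toLong).sorted) // List(963426376394739730, 1007227353832624141)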
Related
Can someone help me with how to use NULLS LAST in MemSQL? In other RDBMSs we have an option for NULLS LAST, but MemSQL does not seem to support it.
SingleStore supports it:
singlestore> create table t(a int);
Query OK, 0 rows affected (0.02 sec)
singlestore> insert t values(1),(2),(null),(4);
singlestore> select a from t order by a;
+------+
| a |
+------+
| NULL |
| 1 |
| 2 |
| 4 |
+------+
4 rows in set (0.03 sec)
singlestore> select a from t order by a NULLS LAST;
+------+
| a |
+------+
| 1 |
| 2 |
| 4 |
| NULL |
+------+
I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1 | 09:00 | ABC | apple |
| 1 | 08:00 | NULL | NULL |
| 1 | 18:00 | XYZ | apple |
| 2 | 07:00 | NULL | banana |
| 5 | 23:00 | ABC | cherry |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1 | 18:00 | XYZ | apple | 3 |
| 2 | 07:00 | NULL | banana | 1 |
| 5 | 23:00 | ABC | cherry | 1 |
+----+-----------+-----------+-----------+-------+
I have tried:
window = Window.orderBy([asc('ID'), desc('timestamp')])
window_count = Window.orderBy([asc('ID'), desc('timestamp')]).rowsBetween(-sys.maxsize, sys.maxsize)
columns_metadata = ['metadata1', 'metadata2']
df = df.select(
    *(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
    count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
Without using pyspark.sql.Window.partitionBy, however, this does not give the desired output.
I only read that you wanted to avoid partitioning by ID after I posted this; this approach is the only one I could think of.
Your dataframe:
df = sqlContext.createDataFrame(
[
('1', '09:00', 'ABC', 'apple')
,('1', '08:00', '', '')
,('1', '18:00', 'XYZ', 'apple')
,('2', '07:00', '', 'banana')
,('5', '23:00', 'ABC', 'cherry')
]
,['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank, partitioning by ID and ordering by timestamp descending, together with a count over the same partition:
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w1 = Window.partitionBy(df['ID']).orderBy(F.desc('timestamp'))
w2 = Window.partitionBy(df['ID'])
df\
.withColumn("rank", F.rank().over(w1))\
.withColumn("count", F.count('ID').over(w2))\
.filter(F.col('rank') == 1)\
.select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
.show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
| 1| 18:00| XYZ| apple| 3|
| 2| 07:00| | banana| 1|
| 5| 23:00| ABC| cherry| 1|
+---+---------+---------+---------+-----+
This question already has answers here:
In spark iterate through each column and find the max length
I have a Spark dataframe like this:
+-----------------+---------------+----------+-----------+
| column1 | column2 | column3 | column4 |
+-----------------+---------------+----------+-----------+
| a | bbbbb | cc | >dddddddd |
| >aaaaaaaaaaaaaa | bb | c | dddd |
| aa | >bbbbbbbbbbbb | >ccccccc | ddddd |
| aaaaa | bbbb | ccc | d |
+-----------------+---------------+----------+-----------+
I would like to find the length of the longest element in each column, to obtain something like this:
+---------+-----------+
| column | maxLength |
+---------+-----------+
| column1 | 14 |
| column2 | 12 |
| column3 | 7 |
| column4 | 8 |
+---------+-----------+
I know how to do it column by column, but I don't know how to tell Spark to do it for all columns.
I am using Scala Spark.
You can use agg with the max and length functions to achieve it:
import org.apache.spark.sql.functions._

val x = df.columns.map(colName => {
  (colName, df.agg(max(length(col(colName)))).head().getAs[Integer](0))
}).toSeq.toDF("column", "maxLength")
Output:
+-------+---------+
|column |maxLength|
+-------+---------+
|column1|14 |
|column2|13 |
|column3|8 |
|column4|9 |
+-------+---------+
Another way is:
df.select(df.columns.map(c => max(length(col(c))).as(s"max_${c}")): _*)
Output:
+-----------+-----------+-----------+-----------+
|max_column1|max_column2|max_column3|max_column4|
+-----------+-----------+-----------+-----------+
|14 |13 |8 |9 |
+-----------+-----------+-----------+-----------+
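If you want that single wide row back in the (column, maxLength) layout from the question, one possible follow-up is a small sketch like the one below (it assumes a spark-shell style session where spark.implicits._ is available for toDF):
import org.apache.spark.sql.functions._

// Single pass over the data: compute every maximum at once, then reshape the
// collected row into (column, maxLength) pairs on the driver.
val maxRow = df.select(df.columns.map(c => max(length(col(c))).as(c)): _*).head()
val maxLengths = df.columns.map(c => (c, maxRow.getAs[Int](c))).toSeq.toDF("column", "maxLength")
maxLengths.show(false)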
I have data in the below format:
+---------------------+----+----+---------+----------+
| date_time | id | cm | p_count | bcm |
+---------------------+----+----+---------+----------+
| 2018-02-01 04:38:00 | v1 | c1 | 1 | null |
| 2018-02-01 05:37:07 | v1 | c1 | 1 | null |
| 2018-02-01 11:19:38 | v1 | c1 | 1 | null |
| 2018-02-01 12:09:19 | v1 | c1 | 1 | c1 |
| 2018-02-01 14:05:10 | v2 | c2 | 1 | c2 |
+---------------------+----+----+---------+----------+
I need to find the rolling sum of the p_count column between two date_time values, partitioned by id.
The logic for start_date_time and end_date_time of the rolling-sum window is below:
start_date_time = min(date_time) group by (id, cm)
end_date_time = bcm == cm ? date_time : null
In this case, start_date_time = 2018-02-01 04:38:00 and end_date_time = 2018-02-01 12:09:19.
Output should look like :
+---------------------+----+----+---------+----------+-------------+
| date_time | id | cm | p_count | bcm | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 | 1 | null |1 |
| 2018-02-01 05:37:07 | v1 | c1 | 1 | null |2 |
| 2018-02-01 11:19:38 | v1 | c1 | 1 | null |3 |
| 2018-02-01 12:09:19 | v1 | c1 | 1 | c1 |4 |
| 2018-02-01 14:05:10 | v2 | c2 | 1 | c2 |1 |
+---------------------+----+----+---------+----------+-------------+
var input = sqlContext.createDataFrame(Seq(
("2018-02-01 04:38:00", "v1", "c1",1,null),
("2018-02-01 05:37:07", "v1", "c1",1,null),
("2018-02-01 11:19:38", "v1", "c1",1,null),
("2018-02-01 12:09:19", "v1", "c1",1,"c1"),
("2018-02-01 14:05:10", "v2", "c2",1,"c2")
)).toDF("date_time","id","cm","p_count" ,"bcm")
input.show()
Results:
+---------------------+----+----+---------+------+
| date_time           | id | cm | p_count | bcm  |
+---------------------+----+----+---------+------+
| 2018-02-01 04:38:00 | v1 | c1 | 1       | null |
| 2018-02-01 05:37:07 | v1 | c1 | 1       | null |
| 2018-02-01 11:19:38 | v1 | c1 | 1       | null |
| 2018-02-01 12:09:19 | v1 | c1 | 1       | c1   |
| 2018-02-01 14:05:10 | v2 | c2 | 1       | c2   |
+---------------------+----+----+---------+------+
Next Code:
input.createOrReplaceTempView("input_Table");
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//val results = spark.sqlContext.sql("SELECT sum(p_count) from input_Table tbl GROUP BY tbl.cm")
val results = sqlContext.sql("select *, " +
"SUM(p_count) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
"from input_Table ").show
Results:
+-------------------+---+---+-------+----+--------------+
| date_time| id| cm|p_count| bcm|cumulative_Sum|
+-------------------+---+---+-------+----+--------------+
|2018-02-01 04:38:00| v1| c1| 1|null| 1|
|2018-02-01 05:37:07| v1| c1| 1|null| 2|
|2018-02-01 11:19:38| v1| c1| 1|null| 3|
|2018-02-01 12:09:19| v1| c1| 1| c1| 4|
|2018-02-01 14:05:10| v2| c2| 1| c2| 5|
+-------------------+---+---+-------+----+--------------+
You need to group by while windowing and add your logic to get the expected results; see the sketch after the explanation below.
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Logically a Windowed Aggregate Function is newly calculated for each row within the PARTITION based on all ROWS between a starting row and an ending row.
Starting and ending rows might be fixed or relative to the current row based on the following keywords:
UNBOUNDED PRECEDING, all rows before the current row -> fixed
UNBOUNDED FOLLOWING, all rows after the current row -> fixed
x PRECEDING, x rows before the current row -> relative
y FOLLOWING, y rows after the current row -> relative
Possible kinds of calculation include:
Both starting and ending row are fixed, the window consists of all rows of a partition, e.g. a Group Sum, i.e. aggregate plus detail rows
One end is fixed, the other relative to current row, the number of rows increases or decreases, e.g. a Running Total, Remaining Sum
Starting and ending row are relative to current row, the number of rows within a window is fixed, e.g. a Moving Average over n rows
So SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) results in a Cumulative Sum or Running Total
11 -> 11
2 -> 11 + 2 = 13
3 -> 13 + 3 (or 11+2+3) = 16
44 -> 16 + 44 (or 11+2+3+44) = 60
What is ROWS UNBOUNDED PRECEDING used for in Teradata?
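Concretely, a minimal sketch of the "group by while windowing" idea applied to the query above (assuming, based on the start_date_time logic in the question, that partitioning by id and cm is what should reset the running total):
val summed = sqlContext.sql(
  """select *,
    |       sum(p_count) over (partition by id, cm
    |                          order by date_time
    |                          rows between unbounded preceding and current row) as p_sum_count
    |from input_Table""".stripMargin)
summed.show()
With the partition clause in place, the running total restarts at 1 for the v2/c2 row, matching the expected p_sum_count.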
I have a Dataset which has the below columns.
df.show();
+--------+---------+---------+---------+---------+
| Col1 | Col2 | Expend1 | Expend2 | Expend3 |
+--------+---------+---------+---------+---------+
| Value1 | Cvalue1 | 123 | 2254 | 22 |
| Value1 | Cvalue2 | 124 | 2255 | 23 |
+--------+---------+---------+---------+---------+
I want that to be changed to the below format using some joins or cube or any other operations.
1.
+--------+---------+------+
| Value1 | Cvalue1 | 123 |
| Value1 | Cvalue1 | 2254 |
| Value1 | Cvalue1 | 22 |
| Value1 | Cvalue1 | 124 |
| Value1 | Cvalue1 | 2255 |
| Value1 | Cvalue1 | 23 |
+--------+---------+------+
or, better, this format:
2.
+--------+---------+---------+------+
| Value1 | Cvalue1 | Expend1 | 123 |
| Value1 | Cvalue1 | Expend2 | 2254 |
| Value1 | Cvalue1 | Expend3 | 22 |
| Value1 | Cvalue1 | Expend1 | 124 |
| Value1 | Cvalue1 | Expend2 | 2255 |
| Value1 | Cvalue1 | Expend3 | 23 |
+--------+---------+---------+------+
Can I achieve either of the two formats above? In the case of #1, can I also get the column name of the last value, i.e. whether it is Expend1, Expend2, or Expend3?
Functions map and then explode can be used:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val data = List(
  ("Value1", "Cvalue1", 123, 2254, 22),
  ("Value1", "Cvalue2", 124, 2255, 23)
)
val df = data.toDF("Col1", "Col2", "Expend1", "Expend2", "Expend3")

// Build an alternating (name, value) sequence of columns for the map function.
val unpivotedColumns = List("Expend1", "Expend2", "Expend3")
val columnMapping = unpivotedColumns.foldLeft(new ArrayBuffer[Column]())((acc, current) => {
  acc += lit(current)
  acc += col(current)
})
val mapped = df.select($"Col1", $"Col2", map(columnMapping: _*).alias("result"))
val result = mapped.select($"Col1", $"Col2", explode($"result"))
result.show(false)
Result is:
+------+-------+-------+-----+
|Col1 |Col2 |key |value|
+------+-------+-------+-----+
|Value1|Cvalue1|Expend1|123 |
|Value1|Cvalue1|Expend2|2254 |
|Value1|Cvalue1|Expend3|22 |
|Value1|Cvalue2|Expend1|124 |
|Value1|Cvalue2|Expend2|2255 |
|Value1|Cvalue2|Expend3|23 |
+------+-------+-------+-----+
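Note that the key column already answers the follow-up in the question: every exploded value carries the name of the column it came from. If you prefer the exact layout of format #2, a small optional rename of the result above (a sketch, nothing more) would be:
result.select($"Col1", $"Col2", $"key".as("Expend"), $"value".as("Value")).show(false)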
You can do this with the Hive function stack:
df.selectExpr("col1",
"col2",
"stack(3 , 'Expend1' , Expend1,
'Expend2' , Expend2,
'Expend3' , Expend3)
as (Expend, Value) "
).show(false)
+------+-------+-------+-----+
|col1 |col2 |Expend |Value|
+------+-------+-------+-----+
|Value1|Cvalue1|Expend1|123 |
|Value1|Cvalue1|Expend2|2254 |
|Value1|Cvalue1|Expend3|22 |
|Value1|Cvalue2|Expend1|124 |
|Value1|Cvalue2|Expend2|2255 |
|Value1|Cvalue2|Expend3|23 |
+------+-------+-------+-----+
You can convert the three columns to an array and explode it:
import org.apache.spark.sql.functions._

df.withColumn("Expend", explode(array("Expend1", "Expend2", "Expend3")))
  .drop("Expend1", "Expend2", "Expend3")
To keep track of which column each value came from, you can do as below:
df.withColumn("Expend1", concat_ws(":", lit("Expend1"), $"Expend1"))
  .withColumn("Expend2", concat_ws(":", lit("Expend2"), $"Expend2"))
  .withColumn("Expend3", concat_ws(":", lit("Expend3"), $"Expend3"))
  .withColumn("Expend", explode(array("Expend1", "Expend2", "Expend3")))
  .drop("Expend1", "Expend2", "Expend3")
  .withColumn("ExpendColumn", split($"Expend", ":")(0))
  .withColumn("Expend", split($"Expend", ":")(1))
  .show(false)
I hope this was helpful
Using a udf function, you can achieve your 2nd required dataframe.
val columns = df.select("Expend1","Expend2","Expend3").columns
import org.apache.spark.sql.functions._
def arrayStructUdf = udf((columnNames: collection.mutable.WrappedArray[String], columnValues: collection.mutable.WrappedArray[String]) => columnNames.zip(columnValues).map(x => (x._1, x._2)).toArray)
Then just call the udf function, drop the three original columns, explode the newly formed column, and finally select the desired columns:
df.withColumn("new", arrayStructUdf(array(columns.map(x => lit(x)): _*), array(columns.map(col): _*)))
.drop("Expend1","Expend2","Expend3")
.withColumn("new", explode(col("new")))
.select("Col1","Col2", "new.*")
You should get the 2nd required dataframe:
+------+-------+-------+----+
|Col1 |Col2 |_1 |_2 |
+------+-------+-------+----+
|Value1|Cvalue1|Expend1|123 |
|Value1|Cvalue1|Expend2|2254|
|Value1|Cvalue1|Expend3|22 |
|Value1|Cvalue2|Expend1|124 |
|Value1|Cvalue2|Expend2|2255|
|Value1|Cvalue2|Expend3|23 |
+------+-------+-------+----+