I have a Spark dataframe that looks something like this:
columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"),
("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)
init_df.show(truncate = False)
+------------------+-----------------------------------------------------------+
|object_type |object_name |
+------------------+-----------------------------------------------------------+
|galaxy |andromeda,milky way,condor,andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |
+------------------+-----------------------------------------------------------+
I need to create a new column with the most frequent words from the object_name column using PySpark.
Conditions:
if there is one dominant word in the row (mode = 1), then choose this word as the most frequent (like "andromeda" in the first row)
if there are two dominant words in the row that occur an equal number of times (mode = 2), then select both of these words (like "mars" and "venus" in the second row - each occurs 3 times, while the rest of the words are less common)
if there are three dominant words in the row that occur an equal number of times, then pick all three of these words (like "mira", "sun" and "sirius", which each occur 2 times, while the rest of the words occur only once)
if there are four or more dominant words in the row that occur an equal number of times (like in the fourth row), then set the "many objects" flag.
Expected output:
+-----------------+-----------------------------------------------------------+---------------+
|object_type |object_name |most_frequent |
+-----------------+-----------------------------------------------------------+---------------+
|galaxy |andromeda,milky way,condor,andromeda |andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|mars,venus |
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |mira,sun,sirius|
|natural satellite|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |many objects |
+-----------------+-----------------------------------------------------------+---------------+
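For clarity, in plain Python the rule I want would look something like this (just illustrating the logic, not a PySpark solution):
def most_frequent(names):
    words = names.split(",")
    counts = {w: words.count(w) for w in set(words)}
    top = max(counts.values())
    # keep the modes in first-occurrence order and drop duplicates
    modes = list(dict.fromkeys(w for w in words if counts[w] == top))
    return "many objects" if len(modes) >= 4 else ",".join(modes)

most_frequent("mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth")  # 'mars,venus'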
I'll be very grateful for any advice!
You can try this,
from pyspark.sql import functions as F

# for each row, split the names into a list and keep every word whose count equals the maximum count
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
.withColumn("most_frequent", F.udf(lambda x: ', '.join(w for w in set(x) if x.count(w) == max(x.count(i) for i in set(x))))(F.col("list_obj"))) \
.drop("list_obj")
res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+---------------------+
|object_type |object_name |most_frequent |
+------------------+-----------------------------------------------------------+---------------------+
|galaxy |andromeda,milky way,condor,andromeda |andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars |
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |sirius, mira, sun |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |moon, kale, titan, io|
+------------------+-----------------------------------------------------------+---------------------+
EDIT:
Following the OP's suggestion, we can achieve the desired output by doing something like this:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
.withColumn("most_frequent", F.udf(lambda x: [w for w in set(x) if x.count(w) == max(x.count(i) for i in set(x))], ArrayType(StringType()))(F.col("list_obj"))) \
.withColumn("most_frequent", F.when(F.size(F.col("most_frequent")) >= 4, F.lit("many objects")).otherwise(F.concat_ws(", ", F.col("most_frequent")))) \
.drop("list_obj")
res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+-----------------+
|object_type |object_name |most_frequent |
+------------------+-----------------------------------------------------------+-----------------+
|galaxy |andromeda,milky way,condor,andromeda |andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars |
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |sirius, mira, sun|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |many objects |
+------------------+-----------------------------------------------------------+-----------------+
Try this:
from pyspark.sql import functions as psf
from pyspark.sql.window import Window
columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"),
("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)
# unpivot the object name and count
df_exp = init_df.withColumn('object_name_exp', psf.explode(psf.split('object_name',',')))
df_counts = df_exp.groupBy('object_type', 'object_name_exp').count()
window_spec = Window.partitionBy('object_type').orderBy(psf.col('count').desc())
df_ranked = df_counts.withColumn('rank', psf.dense_rank().over(window_spec))
# rank the counts, keeping the top ranked object names
df_top_ranked = df_ranked.filter(psf.col('rank')==psf.lit(1)).drop('count')
# count the number of top ranked object names
df_top_counts = df_top_ranked.groupBy('object_type', 'rank').count()
# join these back to the original object names
df_with_counts = df_top_ranked.join(df_top_counts, on='object_type', how='inner')
# implement the rules whether to retain the reference to the object name or state 'many objects'
df_most_freq = df_with_counts.withColumn('most_frequent'
, psf.when(psf.col('count')<=psf.lit(3), psf.col('object_name_exp')).otherwise(psf.lit('many objects'))
)
# collect the retained object names back into an array and de-duplicate them
df_results = df_most_freq.groupBy('object_type').agg(psf.array_distinct(psf.collect_list('most_frequent')).alias('most_frequent'))
# show output
df_results.show()
+------------------+-------------------+
| object_type| most_frequent|
+------------------+-------------------+
| galaxy| [andromeda]|
|natural satellites| [many objects]|
| planet| [mars, venus]|
| star|[sirius, mira, sun]|
+------------------+-------------------+
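If most_frequent is needed as a plain comma-separated string, as in the OP's expected output, a final concat_ws can flatten the array (a small assumed addition on top of the answer, not part of the original code):
# flatten the array column into a comma-separated string
df_results = df_results.withColumn('most_frequent', psf.concat_ws(',', 'most_frequent'))
df_results.show(truncate=False)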
Let's imagine we have the following dataframe :
port | flag | timestamp
---------------------------------------
20 | S | 2009-04-24T17:13:14+00:00
30 | R | 2009-04-24T17:14:14+00:00
32 | S | 2009-04-24T17:15:14+00:00
21 | R | 2009-04-24T17:16:14+00:00
54 | R | 2009-04-24T17:17:14+00:00
24 | R | 2009-04-24T17:18:14+00:00
I would like to calculate the number of distinct (port, flag) combinations over the preceding 3 hours in PySpark.
The result will be something like:
port | flag | timestamp | distinct_port_flag_overs_3h
---------------------------------------
20 | S | 2009-04-24T17:13:14+00:00 | 1
30 | R | 2009-04-24T17:14:14+00:00 | 1
32 | S | 2009-04-24T17:15:14+00:00 | 2
21 | R | 2009-04-24T17:16:14+00:00 | 2
54 | R | 2009-04-24T17:17:14+00:00 | 2
24 | R | 2009-04-24T17:18:14+00:00 | 3
The equivalent SQL query looks like:
SELECT
COUNT(DISTINCT port) OVER my_window AS distinct_port_flag_overs_3h
FROM my_table
WINDOW my_window AS (
PARTITION BY flag
ORDER BY CAST(timestamp AS timestamp)
RANGE BETWEEN INTERVAL 3 HOUR PRECEDING AND CURRENT ROW
)
I found this topic that solves the problem, but only for counting distinct elements over a single field.
Does someone have any idea of how to achieve that in:
python 3.7
pyspark 2.4.4
Just collect a set of structs (port, flag) and get its size. Something like this:
from pyspark.sql.functions import col, to_timestamp, size, collect_set, struct
from pyspark.sql.window import Window

# 3 hours = 10,800 seconds
w = Window.partitionBy("flag").orderBy("timestamp").rangeBetween(-10800, Window.currentRow)

df.withColumn("timestamp", to_timestamp("timestamp").cast("long"))\
  .withColumn("distinct_port_flag_overs_3h", size(collect_set(struct("port", "flag")).over(w)))\
  .orderBy(col("timestamp"))\
  .show()
I've just coded something like that, and it works too:
import re
import traceback

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def hive_time(time: str) -> int:
    """
    Convert a string duration to a number of seconds.
    time : str : must be in the format <number><unit>
    For example 1hour, 4day, 3month
    """
    match = re.match(r"([0-9]+)([a-z]+)", time, re.I)
    if match:
        nb, kind = match.groups()
        try:
            nb = int(nb)
        except ValueError as e:
            print(e, traceback.format_exc())
            print("The format of {}, which is your time aggregation, is not recognized. Please read the doc.".format(time))
        if kind == "second":
            return nb
        if kind == "minute":
            return 60 * nb
        if kind == "hour":
            return 3600 * nb
        if kind == "day":
            return 24 * 3600 * nb
    assert False, "The format of {}, which is your time aggregation, is not recognized. Please read the doc.".format(time)

# Rolling window in Spark
def distinct_count_over(data, window_size: str, out_column: str, *input_columns, time_column: str = 'timestamp'):
    """
    data : pyspark dataframe
    window_size : size of the rolling window, check the doc for format information
    out_column : name of the column where you want to store the results
    input_columns : the columns over which you want to count distinct values
    time_column : the name of the column where the time field is stored (must be in ISO 8601)
    return : a new dataframe with the stored result
    """
    concatenated_columns = F.concat(*input_columns)
    # order by the epoch seconds of the time column, looking back window_size seconds
    w = Window.orderBy(F.col(time_column).cast('long')).rangeBetween(-hive_time(window_size), 0)
    return data \
        .withColumn(time_column, F.col(time_column).cast('timestamp')) \
        .withColumn(out_column, F.size(F.collect_set(concatenated_columns).over(w)))
It works well; I haven't checked the performance yet.
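For example, with the sample dataframe from the question (assuming it is named df and its time column is called timestamp), the call could look like this:
# count distinct (port, flag) pairs over a rolling 3-hour window (illustrative call)
result = distinct_count_over(df, "3hour", "distinct_port_flag_overs_3h", "port", "flag", time_column="timestamp")
result.show()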
I am trying to run a simple subquery in Azure Application Insights, using Kusto, so that I can get some information from two tables displayed as one.
The query I'm trying is
table1
| extend progressLog = toscalar(
table2
| where common_Id == table1.common_Id // errors saying Ensure that expression: table1.common_Id is indeed a simple name
| summarize makelist(stringColumn)
)
I have attempted to alias this id, and even join the two tables, as such:
requests
| extend aliased_id = common_Id
| join traces on operation_Id, $left.operation_Id == $right.operation_Id
| extend test_id = operation_Id
| extend progressLog = toscalar(
traces
| where operation_Id == aliased_id // Failed to resolve column or scalar expression named 'aliased_id'
| summarize makelist(message)
)
Failed to resolve column or scalar expression named 'aliased_id'.
I am simply trying to do the equivalent of the T-SQL query:
SELECT
... ,
STRING_AGG(table2.stringColumn, ',')
FROM
table1
INNER JOIN
table2
ON table1.common_Id = table2.common_Id
GROUP BY
table.<props>
My main question is: how do I reference "common_Id" in the Kusto language inside a subquery?
Please see if the next query provides what you're looking for. If not, please share sample input using datatable, as I did below, and expected output:
let requests = datatable(common_Id:string, operation_Id:string)
[
"A", "X",
"B", "Y",
"C", "Z"
];
let traces = datatable(operation_Id:string, message:string)
[
"X", "m1",
"X", "m2",
"Y", "m3"
];
let messagesByOperationId = traces | summarize makelist(message) by operation_Id;
requests
| join kind=leftouter messagesByOperationId on operation_Id
| project common_Id, operation_Id, progressLog = list_message
Multiple LEFT JOINs are not working as expected in Azure Stream Analytics.
I am using LEFT JOINs in Azure Stream Analytics and getting values for the first two JOINs but null values for the rest of the LEFT JOINs.
Below is the JSON input I have used.
[
{"ID":"006XXXXX",
"ABC":
[{"E":1557302231320,"V":54.799999237060547}],
"XYZ":
[{"E":1557302191899,"V":31.0},{"E":1557302231320,"V":55}],
"PQR":
[{"E":1557302191899,"V":33},{"E":1557302231320,"V":15}],
"IJK":
[{"E":1557302191899,"V":65},{"E":1557302231320,"V":09}],
{"ID":"007XXXXX",
"ABC":
[{"E":1557302195483,"V":805.375},{"E":1557302219803,"V":0}],
"XYZ":
[{"E":1557302219803,"V":-179.0},{"E":1557302195483,"V":88}],
"PQR":
[{"E":1557302219803,"V":9.0},{"E":1557302195483,"V":98}],
"IJK":
[{"E":1557302219803,"V":1.0},{"E":1557302195483,"V":9}]
]
Below is the query I used.
WITH
ABCINNERQUERY AS (
SELECT
event.ID as ID,
event.TYPE as TYPE,
ABCArrayElement.ArrayValue.E as TIME,
ABCArrayElement.ArrayValue.V as ABC
FROM
[YourInputAlias] as event
CROSS APPLY GetArrayElements(event.ABC) AS ABCArrayElement
),
XYZINNERQUERY AS (
SELECT
event.ID as ID,
event.TYPE as TYPE,
XYZArrayElement.ArrayValue.E as TIME,
XYZArrayElement.ArrayValue.V as XYZ
FROM
[YourInputAlias] as event
CROSS APPLY GetArrayElements(event.XYZ) AS XYZArrayElement
),
PQRINNERQUERY AS (
SELECT
event.ID as ID,
event.TYPE as TYPE,
PQRArrayElement.ArrayValue.E as TIME,
PQRArrayElement.ArrayValue.V as PQR
FROM
[YourInputAlias] as event
CROSS APPLY GetArrayElements(event.PQR) AS PQRArrayElement
),
IJKINNERQUERY AS (
SELECT
event.ID as ID,
event.TYPE as TYPE,
IJKArrayElement.ArrayValue.E as TIME,
IJKArrayElement.ArrayValue.V as IJK
FROM
[YourInputAlias] as event
CROSS APPLY GetArrayElements(event.IJK) AS IJKArrayElement
),
KEYS AS
(
SELECT
ABCINNERQUERY.ID AS ID,
ABCINNERQUERY.TIME as TIME
FROM ABCINNERQUERY
UNION
SELECT
XYZINNERQUERY.ID AS ID,
XYZINNERQUERY.TIME as TIME
FROM XYZINNERQUERY
UNION
SELECT
PQRINNERQUERY.ID AS ID,
PQRINNERQUERY.TIME as TIME
FROM PQRINNERQUERY
UNION
SELECT
IJKINNERQUERY.ID AS ID,
IJKINNERQUERY.TIME as TIME
FROM IJKINNERQUERY
)
SELECT
KEYS.ID as ID,
KEYS.TIME as TIME,
ABCINNERQUERY.ABC AS ABC,
XYZINNERQUERY.XYZ AS XYZ,
PQRINNERQUERY.PQR AS PQR,
IJKINNERQUERY.IJK AS IJK
INTO [YourOutputAlias]
FROM KEYS
LEFT JOIN ABCINNERQUERY
ON DATEDIFF(minute, KEYS, ABCINNERQUERY) BETWEEN 0 AND 10
AND KEYS.ID = ABCINNERQUERY.ID
AND KEYS.TIME = ABCINNERQUERY.TIME
LEFT JOIN XYZINNERQUERY
ON DATEDIFF(minute, KEYS, XYZINNERQUERY) BETWEEN 0 AND 10
AND KEYS.ID = XYZINNERQUERY.ID
AND KEYS.TIME = XYZINNERQUERY.TIME
LEFT JOIN PQRINNERQUERY ---From here onwards JOIN will not work. Only first two joins are working as expected.
ON DATEDIFF(minute, KEYS, PQRINNERQUERY) BETWEEN 0 AND 10
AND KEYS.ID = PQRINNERQUERY.ID
AND KEYS.TIME = PQRINNERQUERY.TIME
LEFT JOIN IJKINNERQUERY ---Once we shift this join to 1st or 2nd then it will work.
ON DATEDIFF(minute, KEYS, IJKINNERQUERY) BETWEEN 0 AND 10
AND KEYS.ID = IJKINNERQUERY.ID
AND KEYS.TIME = IJKINNERQUERY.TIME
Actual result is as below.
ID STIME ABC XYZ PQR IJK
006XXXXX 1557302231320.00 54.79999924 31 null null
006XXXXX 1557302191899.00 null 31 null null
007XXXXX 1557302195483.00 805.375 88 null null
I expected values for PQR and IJK at the corresponding times.
I followed your sample data and got 0 rows. Notice that the value you provided in the sample data is not valid:
09 can't be serialized as a number type. I fixed that, and then the SQL worked for me.
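For illustration, a quick check with Python's standard json module shows the same problem (the snippet uses just one element from the sample data):
import json

# "V":09 is rejected by strict JSON parsers: numbers may not have a leading zero
try:
    json.loads('{"E":1557302231320,"V":09}')
except json.JSONDecodeError as e:
    print("invalid:", e)

# after changing 09 to 9 it parses fine
print(json.loads('{"E":1557302231320,"V":9}'))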
I'm trying to implement some kind of pagination feature for my app that uses Cassandra in the backend.
CREATE TABLE sample (
sample_pk int,
some_id int,
name1 text,
name2 text,
value text,
PRIMARY KEY (sample_pk, some_id, name1, name2)
)
WITH CLUSTERING ORDER BY (some_id DESC);
I want to query 100 records, then store the last record's keys in memory to use them later.
+---------+---------+-------+-------+-------+
| sample_pk| some_id | name1 | name2 | value |
+---------+---------+-------+-------+-------+
| 1 | 125 | x | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 124 | a | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 124 | b | '' | '' |
+---------+---------+-------+-------+-------+
| 1 | 123 | y | '' | '' |
+---------+---------+-------+-------+-------+
(For simplicity, I left some columns empty. The partition key (sample_pk) is not important.)
Let's assume my page size is 2.
select * from sample where sample_pk=1 limit 2;
This returns the first 2 rows. Now I store the last record of the query result and run the query again to get the next 2 rows.
This is the query that does not work, because of the restriction to a single non-EQ relation:
select * from sample where sample_pk=1 and some_id <= 124 and name1>='a' and name2>='' limit 2;
And this one returns wrong results, because some_id is in descending order while the name columns are in ascending order:
select * from sample where sample_pk=1 and (some_id, name1, name2) <= (124, 'a', '') limit 2;
So I'm stuck. How can I implement pagination?
You can run your second query like this:
select * from sample where sample_pk = 1 and some_id <= 124 limit x;
Now, after fetching the records, ignore the record(s) you have already read (you can do this because you stored the last record from the previous select query).
If, after ignoring those records, you end up with an empty list of rows, you have iterated over all the records; otherwise, continue doing this for your pagination task.
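For reference, here is a rough Python sketch of that approach using the Python cassandra-driver (the contact point, keyspace name, helper name and page size are assumptions for illustration, not from the original post):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed contact point and keyspace

def next_page(last_row=None, page_size=2):
    """Fetch the next page, skipping rows already returned on the previous page."""
    if last_row is None:
        return list(session.execute(
            "SELECT * FROM sample WHERE sample_pk = 1 LIMIT %s", (page_size,)))
    # Re-query from the last seen some_id; rows with that some_id can come back
    # again, so over-fetch a little and skip until we pass the stored last row.
    rows = session.execute(
        "SELECT * FROM sample WHERE sample_pk = 1 AND some_id <= %s LIMIT %s",
        (last_row.some_id, page_size * 2))
    page, seen_last = [], False
    for r in rows:
        if seen_last and len(page) < page_size:
            page.append(r)
        elif (r.some_id, r.name1, r.name2) == (last_row.some_id, last_row.name1, last_row.name2):
            seen_last = True
    return page  # an empty list means every record has been read
This is only a sketch of the idea above; if the over-fetched page comes back short, you would re-query with a larger limit.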
You don't have to store any keys in memory, and you don't need to use LIMIT in your CQL query. Just use the paging capabilities of the DataStax driver in your application code, as in the following:
public Response getFromCassandra(Integer itemsPerPage, String pageIndex) {
Response response = new Response();
String query = "select * from sample where sample_pk=1";
Statement statement = new SimpleStatement(query).setFetchSize(itemsPerPage); // set the number of items we want per page (fetch size)
// imagine page '0' indicates the first page, so if pageIndex = '0' then there is no paging state
if (!pageIndex.equals("0")) {
statement.setPagingState(PagingState.fromString(pageIndex));
}
ResultSet rows = session.execute(statement); // execute the query
Integer numberOfRows = rows.getAvailableWithoutFetching(); // this should get only number of rows = fetchSize (itemsPerPage)
Iterator<Row> iterator = rows.iterator();
while (numberOfRows-- != 0) {
response.getRows().add(iterator.next());
}
PagingState pagingState = rows.getExecutionInfo().getPagingState();
if(pagingState != null) { // there is still remaining pages
response.setNextPageIndex(pagingState.toString());
}
return response;
}
Note that if you write the while loop like the following:
while (iterator.hasNext()) {
    response.getRows().add(iterator.next());
}
it will first fetch a number of rows equal to the fetch size we set; then, as long as the query still matches rows in Cassandra, it will keep fetching until it has retrieved all rows matching the query, which may not be what you intend if you want to implement a pagination feature.
source: https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
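The same paging-state pattern is available in the Python cassandra-driver as well; a minimal sketch (connection details and names are assumptions for illustration):
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed contact point and keyspace

def get_page(items_per_page, paging_state=None):
    # paging_state is the opaque token returned by the previous call (None for the first page)
    statement = SimpleStatement("SELECT * FROM sample WHERE sample_pk = 1",
                                fetch_size=items_per_page)
    result = session.execute(statement, paging_state=paging_state)
    rows = result.current_rows            # only the rows of the current page
    return rows, result.paging_state      # None when there are no more pages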