Extract Year, Month, Day from a Unix timestamp column in a Rust DataFusion DataFrame?

I have created a DataFusion DataFrame:
+------------+------+----------+----------------+-----------------+
| asin       | vote | verified | unixReviewTime | reviewText      |
+------------+------+----------+----------------+-----------------+
| 0486427706 | 3    | true     | 1381017600     | good            |
| 0486427707 |      | false    | 1376006400     | excellent       |
| 0486427707 | 1    | true     | 1459814400     | Did not like it |
| 0486427708 | 4    | false    | 1376006400     |                 |
+------------+------+----------+----------------+-----------------+
I tried to work out the following from the API documentation, but could not figure it out:
Convert the unixReviewTime column into a native timestamp
Extract the year, month and day from the newly created column into separate columns
Here is what the JSON data file looks like:
{"asin": "0486427706", "vote": 3, "verified": true, "unixReviewTime": 1381017600, "reviewText": "good", "overall": 5.0}
{"asin": "0486427707", "vote": null, "verified": false, "unixReviewTime": 1376006400, "reviewText": "excellent", "overall": 5.0}
{"asin": "0486427707", "vote": 1, "verified": true, "unixReviewTime": 1459814400, "reviewText": "Did not like it", "overall": 2.0}
{"asin": "0486427708", "vote": 4, "verified": false, "unixReviewTime": 1376006400, "reviewText": null, "overall": 4.0}
It is very easy to do in PySpark as follows:
from pyspark.sql import functions as fn
from pyspark.sql.functions import col

main_df = (
    main_df
    .withColumn(
        'reviewed_at',
        fn.from_unixtime(col('unixReviewTime'))
    )
)
main_df = main_df.withColumn("reviewed_year", fn.year(col("reviewed_at")))
main_df = main_df.withColumn("reviewed_month", fn.month(col("reviewed_at")))

Here is one way to do it with the DataFusion DataFrame API:

use datafusion::prelude::*;
use datafusion::error::Result;
use datafusion::arrow::datatypes::{DataType, TimeUnit};

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = SessionContext::new();
    let df = ctx
        .read_json("/tmp/data.json", NdJsonReadOptions::default())
        .await?
        // Cast the integer column to an Arrow timestamp; the raw values are
        // interpreted in the target unit (here: milliseconds).
        .with_column(
            "unixReviewTimestamp",
            cast(
                col("unixReviewTime"),
                DataType::Timestamp(TimeUnit::Millisecond, None),
            ),
        )?
        // date_part extracts a named field ("year", "month", ...) from a temporal column.
        .with_column(
            "reviewed_year",
            date_part(lit("year"), col("unixReviewTimestamp")),
        )?
        .with_column(
            "reviewed_month",
            date_part(lit("month"), col("unixReviewTimestamp")),
        )?;
    df.show().await?;
    Ok(())
}

// Small helper that wraps an expression in a Cast expression.
fn cast(expr: Expr, data_type: DataType) -> Expr {
    Expr::Cast {
        expr: Box::new(expr),
        data_type,
    }
}
Produces:
+------------+------+----------+----------------+-----------------+---------+-------------------------+---------------+----------------+
| asin | vote | verified | unixReviewTime | reviewText | overall | unixReviewTimestamp | reviewed_year | reviewed_month |
+------------+------+----------+----------------+-----------------+---------+-------------------------+---------------+----------------+
| 0486427706 | 3 | true | 1381017600 | good | 5 | 1970-01-16 23:36:57.600 | 1970 | 1 |
| 0486427707 | | false | 1376006400 | excellent | 5 | 1970-01-16 22:13:26.400 | 1970 | 1 |
| 0486427707 | 1 | true | 1459814400 | Did not like it | 2 | 1970-01-17 21:30:14.400 | 1970 | 1 |
| 0486427708 | 4 | false | 1376006400 | | 4 | 1970-01-16 22:13:26.400 | 1970 | 1 |
+------------+------+----------+----------------+-----------------+---------+-------------------------+---------------+----------------+
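Note that unixReviewTime holds seconds since the Unix epoch, while the cast above interprets the raw integers as milliseconds, which is why every date collapses into January 1970. A minimal sketch of one fix, assuming the same cast helper and column names as above, is to cast to a second-resolution timestamp instead; a reviewed_day column can then be added in the same way with date_part(lit("day"), ...):

// Sketch: interpret unixReviewTime as seconds rather than milliseconds.
// This replaces the first .with_column(...) call in the code above.
.with_column(
    "unixReviewTimestamp",
    cast(
        col("unixReviewTime"),
        DataType::Timestamp(TimeUnit::Second, None),
    ),
)?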

Related

How to remove rows with duplicate values in some columns, while keeping values that are different?

I have a pandas dataframe in the following format.
| id | name | last_name | address | x | y | x_list | y_list|
| -- | ------- | --------- | ------- | ------- | - | --------- | ----- |
| 1 | 'John' | 'Smith' | 'add_1' | 'one' | 1 | ['one'] | [1] |
| 2 | 'Tom' | 'Davis' | 'add_2' | 'two' | 2 | ['two'] | [2] |
| 3 | 'John' | 'Smith' | 'add_1' | 'three' | 3 | ['three'] | [3] |
| 4 | 'Tom' | 'Davis' | 'add_2' | 'four' | 4 | ['four'] | [4] |
| 5 | 'Susan' | 'Jones' | 'add_1' | 'one' | 1 | ['one'] | [1] |
I have no idea how to approach this problem. I need this output:
| id | name | last_name | address | x_list | y_list |
| -- | ------- | ---------- | ------- | ---------------- | ------ |
| 1 | 'John' | 'Smith' | 'add_1' | ['one', 'three'] | [1, 3] |
| 2 | 'Tom' | 'Davis' | 'add_2' | ['two', 'four'] | [2, 4] |
| 3 | 'Susan' | 'Jones' | 'add_1' | ['one'] | [1] |
Basically, I need to return a new DataFrame, or modify the existing one, so that rows with the same name, last_name, and address have their x_list and y_list merged. Can anyone help me do this in pandas? This needs to be done on a dataframe of about 58,000 rows.
Since x_list and y_list already hold single-element lists, summing them within each group concatenates the lists. Use the following code:
df.groupby(['name', 'last_name', 'address'])[['x_list', 'y_list']].sum().reset_index()
output:
    name last_name address            x_list  y_list
0   John     Smith   add_1  ['one', 'three']  [1, 3]
1  Susan     Jones   add_1           ['one']     [1]
2    Tom     Davis   add_2   ['two', 'four']  [2, 4]
From what I can see, your x_list and y_list columns are redundant: they just contain the x and y values in list form. If that observation is right, there is no problem with dropping those two columns and using a groupby with list aggregation.
Assuming the id variable is your index, it will go something like this:
df.groupby(['name', 'last_name', 'address'], as_index=False).agg(list)
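For reference, a small self-contained sketch of that approach on the question's sample data (with the redundant x_list and y_list columns dropped, as suggested above):

import pandas as pd

# Sample data from the question, without the redundant list columns.
df = pd.DataFrame({
    "name":      ["John", "Tom", "John", "Tom", "Susan"],
    "last_name": ["Smith", "Davis", "Smith", "Davis", "Jones"],
    "address":   ["add_1", "add_2", "add_1", "add_2", "add_1"],
    "x":         ["one", "two", "three", "four", "one"],
    "y":         [1, 2, 3, 4, 1],
})

# Collect x and y into lists per (name, last_name, address) group.
out = df.groupby(["name", "last_name", "address"], as_index=False).agg(list)
print(out)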

How to get max value group by another column from Pandas dataframe

I have the following dataframe. I would like to get, for each pipeline_name, the row where run_end_dt is the maximum.
Here is the dataframe:
+----+-----------------+--------------------------------------+----------------------------------+
| | pipeline_name | runid | run_end_dt |
|----+-----------------+--------------------------------------+----------------------------------|
| 0 | test_pipeline | test_pipeline_run_101 | 2021-03-10 20:01:26.704265+00:00 |
| 1 | test_pipeline | test_pipeline_run_102 | 2021-03-13 20:08:31.929038+00:00 |
| 2 | test_pipeline2 | test_pipeline2_run_101 | 2021-03-10 20:13:53.083525+00:00 |
| 3 | test_pipeline2 | test_pipeline2_run_102 | 2021-03-12 20:14:51.757058+00:00 |
| 4 | test_pipeline2 | test_pipeline2_run_103 | 2021-03-13 20:17:00.285573+00:00 |
Here is the result I want to achieve:
+----+-----------------+--------------------------------------+----------------------------------+
| | pipeline_name | runid | run_end_dt |
|----+-----------------+--------------------------------------+----------------------------------|
| 0 | test_pipeline | test_pipeline_run_102 | 2021-03-13 20:08:31.929038+00:00 |
| 1 | test_pipeline2 | test_pipeline2_run_103 | 2021-03-13 20:17:00.285573+00:00 |
In the expected result, we keep only the runid with the max run_end_dt for each pipeline_name.
Thanks
Suppose your dataframe is stored in a variable named df. Just use the groupby() method:
df.groupby('pipeline_name', as_index=False)[['runid', 'run_end_dt']].max()
Use groupby followed by a transform: build a boolean mask of the rows whose run_end_dt equals the per-group maximum, then use it to index the dataframe.
idx = (df.groupby(['pipeline_name'], sort=False)['run_end_dt'].transform('max') == df['run_end_dt'])
df = df.loc[idx]
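A small, self-contained sketch of the transform approach on the question's data (timestamps truncated to whole seconds for brevity):

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "pipeline_name": ["test_pipeline", "test_pipeline",
                      "test_pipeline2", "test_pipeline2", "test_pipeline2"],
    "runid": ["test_pipeline_run_101", "test_pipeline_run_102",
              "test_pipeline2_run_101", "test_pipeline2_run_102",
              "test_pipeline2_run_103"],
    "run_end_dt": pd.to_datetime([
        "2021-03-10 20:01:26", "2021-03-13 20:08:31",
        "2021-03-10 20:13:53", "2021-03-12 20:14:51", "2021-03-13 20:17:00",
    ]),
})

# Keep only the rows whose run_end_dt equals the per-pipeline maximum.
idx = df.groupby("pipeline_name")["run_end_dt"].transform("max") == df["run_end_dt"]
print(df.loc[idx])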

PySpark: How to get the average time it took for one column to change value?

I have a table which specifies whether a chatroom is connected or not:
+--------+-----------+-------------+------+
| roomId | timeStamp | isConnected | col3 |
+--------+-----------+-------------+------+
| 1 | 10000 | true | ... |
| 2 | 9000 | true | ... |
| 1 | 8000 | true | ... |
| 3 | 7000 | true | ... |
| 2 | 6000 | false | ... |
| 3 | 5000 | false | ... |
| 1 | 4000 | false | ... |
| 1 | 3000 | false | ... |
| 3 | 2000 | true | ... |
| 3 | 1000 | false | ... |
+--------+-----------+-------------+------+
For each roomId, I want to know the average time it took for the isConnected value to turn from each first occurrence of false back to true. In other words, I want to know the average time it took each chatroom to reconnect whenever its connection went down.
+--------+------------+
| roomId | avgConTime |
+--------+------------+
| 1 | 5000 |
| 2 | 3000 |
| 3 | 1500 |
+--------+------------+
For example, roomId = 1 is not connected at timestamp = 3000. It managed to connect again at timestamp = 8000, so the time it took to reconnect is 5000.
It is a bit long, but there are actually a lot of intermediate steps to achieve this simple result:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("roomId").orderBy("timeStamp")

# Assign a group id that increments every time isConnected changes value,
# so each run of consecutive identical states gets its own _id.
df = df.withColumn(
    "_id",
    F.sum(
        F.when(F.col("isConnected") == F.lag("isConnected").over(w), 0).otherwise(1)
    ).over(w),
)

# Collapse each run to its starting timestamp and its connection state.
df_agg = df.groupBy("roomId", "_id").agg(
    F.min("timeStamp").alias("timeStamp"), F.first("isConnected").alias("isConnected")
)

# Pair each disconnected run with the start of the following run (the reconnect),
# then compute the duration of the outage.
df_agg = (
    df_agg.withColumnRenamed("timeStamp", "timeStamp_start")
    .withColumn(
        "timeStamp_end",
        F.lead("timeStamp_start").over(Window.partitionBy("roomId").orderBy("_id")),
    )
    .where("timeStamp_end is not null")
    .where("not isConnected")
    .withColumn("duration", F.col("timeStamp_end") - F.col("timeStamp_start"))
)

# Average the outage durations per room.
df_agg.groupBy("roomId").agg(F.avg("duration")).show()
+------+-------------+
|roomId|avg(duration)|
+------+-------------+
| 1| 5000.0|
| 3| 1500.0|
| 2| 3000.0|
+------+-------------+

Getting a concatenated column from a reference table and primary ids from a Dataset

I'm trying to get concatenated data as a single column using the datasets below.
Sample DS:
val df = sc.parallelize(Seq(
  ("a", 1, 2, 3),
  ("b", 4, 6, 5)
)).toDF("value", "id1", "id2", "id3")
+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a | 1 | 2 | 3 |
| b | 4 | 6 | 5 |
+-------+-----+-----+-----+
from the Reference Dataset
+----+----------+--------+
| id | descr | parent|
+----+----------+--------+
| 1 | apple | fruit |
| 2 | banana | fruit |
| 3 | cat | animal |
| 4 | dog | animal |
| 5 | elephant | animal |
| 6 | Flight | object |
+----+----------+--------+
val ref = sc.parallelize(Seq(
  (1, "apple", "fruit"),
  (2, "banana", "fruit"),
  (3, "cat", "animal"),
  (4, "dog", "animal"),
  (5, "elephant", "animal"),
  (6, "Flight", "object")
)).toDF("id", "descr", "parent")
I am trying to get the desired output below:
+-----------------------+--------------------------+
| desc | parent |
+-----------------------+--------------------------+
| apple+banana+cat/M | fruit+fruit+animal/M |
| dog+Flight+elephant/M | animal+object+animal/M |
+-----------------------+--------------------------+
Also, I need to concatenate id2 and id3 only if they are not null; otherwise only id1.
I am breaking my head over the solution.
Exploding the first dataframe df, joining it to ref, and then grouping by value should work as you expect:
// Unpivot the id columns so each id gets its own row, keeping the value key.
val dfNew = df.withColumn("id", explode(array("id1", "id2", "id3")))
  .select("id", "value")

// Join to the reference table, then rebuild one concatenated row per value.
ref.join(dfNew, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+", collect_list("descr")) as "desc",
    concat_ws("+", collect_list("parent")) as "parent"
  )
  .drop("value")
  .show()
Output:
+-------------------+--------------------+
|desc |parent |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana |fruit+animal+fruit |
+-------------------+--------------------+
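The question also asks to use id2 and id3 only when they are not null. The inner join on id already drops null ids, but one way to make that explicit (a sketch, assuming null ids should simply be skipped) is to filter them out right after the explode:

// Sketch: discard null ids after exploding, so rows with a null id2 or id3
// contribute only their non-null ids to the concatenation.
val dfNew = df
  .withColumn("id", explode(array("id1", "id2", "id3")))
  .filter(col("id").isNotNull)
  .select("id", "value")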

Reshaping table Excel PowerQuery

I have a large table in Excel, which is output of a data-gathering tool, that looks more or less like this:
DateA | ValueA | DateB | ValueB | ... | DateZ | ValueZ
---------------------------------------------------------------------------
2019-01-01 | 3 | 2019-01-01 | 6 | ... | 2019-01-04 | 7
2019-01-02 | 1 | 2019-01-04 | 2 | ... | 2019-01-05 | 3
And I'd like to process it so it would look like this:
Date | Value | Type
-----------------------------
2019-01-01 | 3 | A
2019-01-02 | 1 | A
2019-01-01 | 6 | B
2019-01-04 | 2 | B
...
2019-01-04 | 7 | Z
2019-01-05 | 3 | Z
This is the format that is used in our SQL database.
What is the least tedious way to do this, preferably using Power Query? I'd like to avoid brute-force copying and pasting with a VBA loop.
The number of columns is fixed (though it would be nice to have the option to add another one later on), while the number of rows varies day-to-day around some value (like 20, 21, 20, 22, 19, 20).
Columns are harder to work with, so I'd first transform each column into a new row as a list.
ColumnsToRows =
    Table.FromColumns(
        {
            Table.ToColumns(Source),
            Table.ColumnNames(Source)
        },
        {"ColumnValues", "ColumnName"}
    )
This should give you a table as follows, where each list consists of the values in the corresponding column. For example, the top list is {1/1/2019, 1/2/2019}. (The Table.ColumnNames part is what adds the ColumnName column.)
| ColumnValues | ColumnName |
|--------------|------------|
| [List] | DateA |
| [List] | ValueA |
| [List] | DateB |
| [List] | ValueB |
| [List] | DateZ |
| [List] | ValueZ |
We can then filter this based on the data type in each list. To get the date rows you can write:
DateRows =
    Table.SelectRows(
        ColumnsToRows,
        each Value.Type(List.First([ColumnValues])) = type date
    )
Which gets you the following filtered table:
| ColumnValues | ColumnName |
|--------------|------------|
| [List] | DateA |
| [List] | DateB |
| [List] | DateZ |
If you expand the first column with Table.ExpandListColumn(DateRows, "ColumnValues"), then you get
| ColumnValues | ColumnName |
|--------------|------------|
| 1/1/2019 | DateA |
| 1/2/2019 | DateA |
| 1/1/2019 | DateB |
| 1/4/2019 | DateB |
| 1/4/2019 | DateZ |
| 1/5/2019 | DateZ |
The logic to filter and expand the value rows is analogous:
ValueRows =
    Table.ExpandListColumn(
        Table.SelectRows(
            ColumnsToRows,
            each Value.Type(List.First([ColumnValues])) = type number
        ),
        "ColumnValues"
    )
Which gets you a similar looking table:
| ColumnValues | ColumnName |
|--------------|------------|
| 3 | ValueA |
| 1 | ValueA |
| 6 | ValueB |
| 2 | ValueB |
| 7 | ValueZ |
| 3 | ValueZ |
Now we just need to combine the columns we want into a single table:
CombineColumns =
    Table.FromColumns(
        {
            DateRows[ColumnValues],
            ValueRows[ColumnValues],
            ValueRows[ColumnName]
        },
        {"Date", "Value", "Type"}
    )
and then extract the text following "Value" in the column names.
ExtractType =
    Table.TransformColumns(
        CombineColumns,
        {{"Type", each Text.AfterDelimiter(_, "Value"), type text}}
    )
The final table should be just as specified:
| Date | Value | Type |
|----------|-------|------|
| 1/1/2019 | 3 | A |
| 1/2/2019 | 1 | A |
| 1/1/2019 | 6 | B |
| 1/4/2019 | 2 | B |
| 1/4/2019 | 7 | Z |
| 1/5/2019 | 3 | Z |
All in a single query, the M code looks like this:
let
    Source = <Source Goes Here>,
    ColumnsToRows = Table.FromColumns({Table.ToColumns(Source), Table.ColumnNames(Source)}, {"ColumnValues", "ColumnName"}),
    DateRows = Table.ExpandListColumn(Table.SelectRows(ColumnsToRows, each Value.Type(List.First([ColumnValues])) = type date), "ColumnValues"),
    ValueRows = Table.ExpandListColumn(Table.SelectRows(ColumnsToRows, each Value.Type(List.First([ColumnValues])) = type number), "ColumnValues"),
    CombineColumns = Table.FromColumns({DateRows[ColumnValues], ValueRows[ColumnValues], ValueRows[ColumnName]}, {"Date", "Value", "Type"}),
    ExtractType = Table.TransformColumns(CombineColumns, {{"Type", each Text.AfterDelimiter(_, "Value"), type text}})
in
    ExtractType
