Kotlin - Get difference between datetimes in seconds - android-studio

Is there any way to get the difference between two datetimes in seconds?
For example:
First datetime: 2022-04-25 12:09:10
Second datetime: 2022-05-24 02:46:21

There is a dedicated class for that: Duration (the same class in the Android docs).
A time-based amount of time, such as '34.5 seconds'.
This class models a quantity or amount of time in terms of seconds and nanoseconds. It can be accessed using other duration-based units, such as minutes and hours. In addition, the DAYS unit can be used and is treated as exactly equal to 24 hours, thus ignoring daylight savings effects. See Period for the date-based equivalent to this class.
Here is example usage:
import java.time.Duration
import java.time.LocalDateTime

val date1 = LocalDateTime.now()
val date2 = LocalDateTime.now()
val duration = Duration.between(date1, date2)
val asSeconds: Long = duration.toSeconds()
val asMinutes: Long = duration.toMinutes()

If your date types are from the java.time package, in other words if they implement Temporal, you can use the ChronoUnit class:
val diffSeconds = ChronoUnit.SECONDS.between(date1, date2)
Note that this can result in a negative value, so do take its absolute value (abs) if necessary.

Related

Spark: problem with crossJoin (takes a tremendous amount of time)

First of all, I have to say that I've already tried everything I know or found on google (Including this Spark: How to use crossJoin which is exactly my problem).
I have to calculate the Cartesian product of two DataFrames, countries and units, as follows:
A.cache().count()

val units = A.groupBy("country")
  .agg(sum("grade").as("grade"),
       sum("point").as("point"))
  .withColumn("AVR", $"grade" / $"point" * 1000)
  .drop("point", "grade")

val countries = D.select("country").distinct()
val C = countries.crossJoin(units)
countries contains country names and its size is bounded by 150 rows. units is a DataFrame with 3 rows, the aggregated result of another DataFrame. I have checked the result a hundred times and those really are the sizes, yet the job takes 5 hours to complete.
I know I missed something. I've tried caching, repartitioning, etc.
I would love to get some other ideas.
I have two suggestions for you:
Look at the explain plan and the Spark properties; for the amount of data you have mentioned, 5 hours is a really long time. My expectation is that you have far too many shuffle partitions, so look at properties such as spark.sql.shuffle.partitions.
Instead of doing a cross join, you could collect one side and explore broadcasts (https://sparkbyexamples.com/spark/spark-broadcast-variables/), but do this only on small amounts of data, since that data is brought back to the driver.
What is the action you are doing afterwards with C?
Also, if these datasets are so small, consider collecting them to the driver and doing the manipulation there; you can always call spark.createDataFrame later again.
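A rough sketch of both suggestions (spark stands for your SparkSession; the partition count of 10 is only an example):
import org.apache.spark.sql.functions.broadcast

// 1. Inspect the physical plan and reduce the number of shuffle partitions
C.explain(true)
spark.conf.set("spark.sql.shuffle.partitions", "10")

// 2. units has only a handful of rows, so broadcast it; the cross join then
//    ships the small side to every executor instead of shuffling either side
val crossed = countries.crossJoin(broadcast(units))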
Update #1:
final case class Unit(country: String, AVR: Double)

val collectedUnits: Seq[Unit] = units.as[Unit].collect
val collectedCountries: Seq[String] = countries.as[String].collect

val pairs: Seq[(String, Unit)] = for {
  unit <- collectedUnits
  country <- collectedCountries
} yield (country, unit)
I finally understood the problem: Spark was using an excessive number of shuffle partitions, so the shuffle took a very long time.
The way to solve it is to change the default number:
sparkSession.conf.set("spark.sql.shuffle.partitions", 10)
And it works like magic.

Kotlin string date formatter

I'm trying to parse a string date:
"AuthDate": "2021-08-19T23:40:52+04:00",
Here is my code for parsing and displaying it:
var date = item?.authDate.toString()
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssz")
val parsedDate = formatter.parse(date)
val displayFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd, HH:MM:SS")
text = displayFormatter.format(parsedDate).toString()
This works fine except for one thing: seconds are always displayed as "00".
For example, if authDate is 2021-08-19T23:40:52+04:00,
the displayed authDate is 2021-08-19, 23:40:00,
not 23:40:52 as I want.
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssz")
...
val displayFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd, HH:MM:SS")
Notice how the first of these uses mm and ss, while the second uses MM and SS. The former says to parse hours, minutes, then seconds. The latter says to display hours, the month, and then the fraction of a second. See the documentation for a full list of the specifiers, but you're probably looking for
val displayFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd, HH:mm:ss")

How to expire state of dropDuplicates in structured streaming to avoid OOM?

I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code:
.dropDuplicates("uuid")
On the next day, the state maintained for today should be dropped so that I can get the right count of unique accesses for the next day and avoid OOM. The Spark documentation suggests using dropDuplicates with a watermark, for example:
.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")
but the watermark column must also be specified in dropDuplicates. In that case uuid and timestamp are used as a combined key, so only elements with the same uuid and the same timestamp are deduplicated, which is not what I expected.
So is there a perfect solution?
After a few days' effort I finally found the way myself.
While studying the source code of watermark and dropDuplicates, I discovered that besides an eventTime column, watermark also supports a window column, so we can use the following code:
.select(
  window($"timestamp", "1 day"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
Since all events in the same day have the same window, this produces the same result as deduplicating on uuid alone. Hope this helps someone.
Below is a modification of the procedure proposed in the Spark documentation. The trick is to manipulate the event time, i.e. put the event time into buckets. The assumption is that the event time is provided in milliseconds.
// removes all duplicates that are within a 15-minute tumbling window;
// it does NOT remove duplicates that fall into different 15-minute windows!
public static Dataset<Row> removeDuplicates(Dataset<Row> df) {
    // convert the time into 15-minute buckets:
    // timestamp - (timestamp % (15 * 60))
    Column bucketCol = functions.to_timestamp(
        col("event_time").divide(1000).minus((col("event_time").divide(1000)).mod(15 * 60)));
    df = df.withColumn("bucket", bucketCol);

    String windowDuration = "15 minutes";
    df = df.withWatermark("bucket", windowDuration)
           .dropDuplicates("uuid", "bucket");

    return df.drop("bucket");
}
I found that the window function didn't work for me, so I chose to use window.start (or window.end) instead, aliased back to window so the later calls can refer to it:
.select(
  window($"timestamp", "1 day").getField("start").as("window"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")

Dataframe sample in Apache spark | Scala

I'm trying to take samples from two DataFrames while maintaining the ratio of their counts, e.g.
df1.count() = 10
df2.count() = 1000
noOfSamples = 10
I want to sample the data in such a way that I get 10 samples of size 101 each (1 from df1 and 100 from df2).
Now, while doing so, I have:
var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())
What does the fraction here imply? Can it be greater than 1? I checked this and this but wasn't able to comprehend it fully.
Also, is there any way we can specify the number of rows to be sampled?
The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:
val newSample = df1.sample(true, 1D*noOfSamples/df1.count)
However, you may notice that newSample.count returns a different number each time you run it, because the fraction is used as a threshold against a randomly generated value per row (as you can see here), so the resulting dataset size can vary. A workaround can be:
val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit(noOfSamples)
Some scalability observations
You may note that doing a df1.count might be expensive as it evaluates the whole DataFrame, and you'll lose one of the benefits of sampling in the first place.
Therefore, depending on the context of your application, you may want to use a total count that is already known, or an approximation:
val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)
Or, if your DataFrame is huge, I would still use a fraction and use limit to force the number of samples:
val guessedFraction = 0.1
val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)
As for your questions:
can it be greater than 1?
Not when sampling without replacement: the fraction is then between 0 and 1, and setting it to 1 brings back 100% of the rows, so a larger value wouldn't make sense. (With replacement a fraction above 1 is allowed, as another answer below points out.)
Also is there anyway we can specify the number of rows to be sampled?
You can specify a larger fraction than the number of rows you want and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
To answer your question "is there any way we can specify the number of rows to be sampled?":
I recently needed to sample a certain number of rows from a Spark DataFrame and followed the process below.
Convert the Spark DataFrame to an RDD.
Example: df_test.rdd
The RDD has a method called takeSample which allows you to give the number of samples you need along with a seed number.
Example: df_test.rdd.takeSample(withReplacement, numberOfSamples, seed)
Convert the RDD back to a Spark DataFrame using sqlContext.createDataFrame().
The above process combined into a single step:
The DataFrame (or population) I needed to sample from has around 8,000 records:
df_grp_1
test1 = sqlContext.createDataFrame(df_grp_1.rdd.takeSample(False,125,seed=115))
test1 data frame will have 125 sampled records.
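For a Scala code base, a roughly equivalent sketch (df_test, the sample size of 125 and the seed 115 are taken from the example above; spark stands for your SparkSession):
val sampledRows = df_test.rdd.takeSample(withReplacement = false, num = 125, seed = 115L)
val sampledDF = spark.createDataFrame(
  spark.sparkContext.parallelize(sampledRows.toSeq),  // Array[Row] back to an RDD[Row]
  df_test.schema
)
// sampledDF now holds exactly 125 sampled records (or fewer if df_test is smaller)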
To answer whether the fraction can be greater than 1: yes, it can be, if withReplacement is true. If a value greater than 1 is provided with withReplacement set to false, the following exception occurs:
java.lang.IllegalArgumentException: requirement failed: Upper bound (2.0) must be <= 1.0.
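A small sketch of that behaviour (df is any DataFrame):
// With replacement a fraction above 1 is allowed: rows can be picked more than once,
// so this returns roughly twice as many rows as df contains
val oversampled = df.sample(withReplacement = true, fraction = 2.0)

// Without replacement the same fraction throws the IllegalArgumentException quoted above:
// df.sample(withReplacement = false, fraction = 2.0)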
I too find the lack of sample-by-count functionality disturbing. If you are not picky about creating a temp view, I find the code below useful (df is your DataFrame, count is the sample size):
val tableName = s"table_to_sample_${System.currentTimeMillis}"
df.createOrReplaceTempView(tableName)
val sampled = sqlContext.sql(s"select *, rand() as random from ${tableName} order by random limit ${count}")
sqlContext.dropTempTable(tableName)
sampled.drop("random")
It returns an exact count as long as your current row count is as large as your sample size.
The code below works if you want a random 70% / 30% split of a DataFrame df:
val Array(trainingDF, testDF) = df.randomSplit(Array(0.7, 0.3), seed = 12345)
I use this function for random sampling when an exact number of records is desired:
def row_count_sample(df, row_count, with_replacement=False, random_seed=113170):
    # over-sample a little, as dataframe.sample() is not guaranteed to return the exact
    # record count; it can return more or fewer rows than requested
    ratio = 1.08 * float(row_count) / df.count()
    if ratio > 1.0:
        ratio = 1.0
    result_df = (df
                 .sample(with_replacement, ratio, random_seed)
                 .limit(row_count)  # since we oversampled, enforce the exact row count here
                 )
    return result_df
Maybe you want to try the code below:
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

Spark 1.5.2: Grouping DataFrame Rows over a Time Range

I have a df with the following schema:
ts: TimestampType
key: int
val: int
The df is sorted in ascending order of ts. Starting from row(0), I would like to group the DataFrame into certain time intervals.
For example, if I say df.filter(row(0).ts + expr(INTERVAL 24 HOUR)).collect(), it should return all the rows within the 24 hr time window of row(0).
Is there a way to achieve the above within Spark DF context?
Generally speaking this is a relatively simple task. All you need is basic arithmetic on UNIX timestamps. First, let's cast all timestamps to numeric values:
val dfNum = df.withColumn("ts", $"ts".cast("long"))
Next, let's find the minimum timestamp over all rows:
val offset = dfNum.agg(min($"ts")).first.getLong(0)
and use it to compute groups:
val aDay = lit(60 * 60 * 24)
val group = (($"ts" - lit(offset)) / aDay).cast("long")
val dfWithGroups = dfNum.withColumn("group", group)
Finally you can use it as a grouping column:
dfWithGroups.groupBy($"group").agg(min($"val"))
If you want meaningful intervals (interpretable as timestamps), just multiply group by aDay and add the offset back.
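A short sketch of that, continuing with the names defined above:
val withIntervalStart = dfWithGroups
  .withColumn("interval_start", ($"group" * aDay + lit(offset)).cast("timestamp"))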
Obviously this won't handle complex cases like daylight saving time or leap seconds, but it should be good enough most of the time. If you need to handle any of this properly, you can apply similar logic with Joda-Time (or java.time) in a UDF.
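A minimal sketch of that idea using java.time instead of Joda-Time (the time zone is an assumption, and null timestamps are not handled here):
import java.time.ZoneId
import org.apache.spark.sql.functions.udf

val zone = ZoneId.of("Europe/Berlin")  // assumed zone, adjust to your data
// assign every row the local calendar day it falls on, which respects DST transitions
val localDay = udf((ts: java.sql.Timestamp) => ts.toInstant.atZone(zone).toLocalDate.toEpochDay)

val dfByLocalDay = df.withColumn("group", localDay($"ts"))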
