When I submit a job on a set of machines located in the London time zone, the Spark Master dashboard shows the correct time, but the History Server dashboard shows a time that is 1 hour ahead, which is GMT. Is there a way to fix this in Apache Spark?
Most likely your log timestamps don't have the "wrong" time zone; rather, either your Spark cluster is located in GMT, or the configuration was set as:
spark.conf.set("spark.sql.session.timeZone", "GMT")
Change this setting explicitly to the London time zone (Europe/London).
Or use the from_utc_timestamp function, which lets you specify a time zone while converting the timestamp.
Also check whether your timestamps are in milliseconds, and consider setting -Duser.timezone via the JVM options in spark.executor.extraJavaOptions.
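A minimal sketch of both approaches, assuming a running SparkSession named spark and a DataFrame df with a UTC timestamp column ts (both names are placeholders, not from the question):

```scala
import org.apache.spark.sql.functions.from_utc_timestamp

// Option 1: render all session timestamps in London time instead of GMT
spark.conf.set("spark.sql.session.timeZone", "Europe/London")

// Option 2: convert one UTC timestamp column explicitly, leaving the
// session time zone alone ("ts" is an assumed column name)
import spark.implicits._
val adjusted = df.withColumn("ts_london", from_utc_timestamp($"ts", "Europe/London"))
```

Note that the session setting affects how timestamps are displayed and parsed for the whole session, while from_utc_timestamp converts individual values.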
We run Spark jobs which access BigQuery. During the read phase, data is pulled from a temporary table with the naming convention _sbc_*. By default, the table expiration is 24 hours, but for our use case a retention period of 1 hour is more than enough. I was wondering if there is any way we can bring the temporary table expiration down from 24 hours to 1 hour.
Below is how we instantiate the Spark config:
val sparkConf = new SparkConf
sparkConf.setAppName("test-app")
sparkConf.setMaster("local[*]")
sparkConf.set("viewsEnabled", "true")
sparkConf.set("parentProject", "<parentProject>")
sparkConf.set("materializationProject", "<materializationProject>")
sparkConf.set("materializationDataset", "<materializationDataset>")
sparkConf.set("credentials", "<credentials>")
Note: Temporary table is getting created in project passed for materializationProject parameter.
Spark version: 2.3.1
The spark-bigquery-connector doesn't provide an option to set the expiration time of the materialized views it creates during reads.
However, if you're using a specific materializationDataset for these jobs, you can directly define the default expiration time for that dataset in BigQuery. It will be applied to all tables and views created under the dataset.
bq update --default_table_expiration 3600 materializationProject:materializationDataset
As of 2023-01-13, there now appears to be an option called materializationExpirationTimeInMinutes that defines the temporary table expiration time. If not set, it defaults to 24 hours.
See here.
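With a connector version that supports this option, it can be passed at read time; a sketch, where the table identifier is a placeholder:

```scala
// Assumes a running SparkSession named `spark` and a connector version
// that supports materializationExpirationTimeInMinutes (2023+ releases)
val df = spark.read
  .format("bigquery")
  .option("materializationExpirationTimeInMinutes", "60") // 1 hour instead of the 24h default
  .load("<project>.<dataset>.<table>")
```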
I need to set up a Spark Streaming application. Jobs of the application need to make some decisions based on the whole application running time.
For example, assume the Spark Streaming application was submitted at 08:00. The jobs run between 08:00 and 10:00 should do a plus operation, while the jobs run after 10:00 should do a minus operation.
How can I record the first job's (or the application's) start time and determine the interval between each job and the first job? Or is there any other good solution?
SparkContext's startTime method returns the time (epoch milliseconds) at which the context started; subtracting it from the current time gives the application's running time.
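The decision logic from the question can be sketched as a pure function; decideOp is a hypothetical helper, and the two-hour threshold comes from the 08:00/10:00 example. In a real job, elapsedMillis would be System.currentTimeMillis() - sc.startTime:

```scala
// Decide which operation a job should perform, given the time elapsed
// since the application (SparkContext) started.
def decideOp(elapsedMillis: Long): String = {
  val twoHoursMillis = 2L * 60 * 60 * 1000
  if (elapsedMillis < twoHoursMillis) "plus" else "minus"
}

// e.g. a job running 90 minutes after startup:
println(decideOp(90L * 60 * 1000)) // plus
```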
We have two different Cassandra clusters in two different time zones.
Cluster1: 2.1.8 version, with IST TZ
Cluster2: 2.1.9 version, with UTC TZ
On cluster1, a SELECT query on a timestamp column works without mentioning the TZ value (+0530), whereas on cluster2 I must provide the TZ value in the SELECT query to fetch the row. Is this related to the Cassandra version?
I use cqlsh for the queries. I tried the cqlshrc file option, but it only changes the format of the output.
cluster1:
select * from test.check where row_timestamp = '1970-01-01 00:00:00';
cluster2:
select * from test.check where row_timestamp = '1970-01-01 00:00:00+0000';
If no TZ is mentioned, I get 0 rows.
I don't want to have to specify the TZ on cluster2; please advise how to do that.
It is a bit strange, I must admit, but there may have been some changes in time zone handling between 2.1.8 and 2.1.9. This is from the changelog:
(cqlsh) Fix timestamps before 1970 on Windows, always use UTC for timestamp display (CASSANDRA-10000)
On the other hand, the documentation is quite clear on this issue:
If no time zone is specified, the time zone of the Cassandra
coordinator node handing the write request is used. For accuracy,
DataStax recommends specifying the time zone rather than relying on
the time zone configured on the Cassandra nodes.
So my sincere recommendation is to always specify the time zone, and always the same one, presumably GMT or UTC; save yourself the headache. (Mind that GMT is not exactly equal to UTC; there is a slight difference in meaning.) That way you can ignore the time zone settings on the clusters. The timestamp is ultimately stored as a number of milliseconds since a fixed epoch; the time zone is purely a "rendering" concern. The number of elapsed milliseconds is the same for, for example, 2015/03/05 14:00:00+0100 and 2015/03/05 16:00:00+0300.
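The point that a time zone is only a rendering of the same underlying instant can be checked directly; a minimal sketch using java.time on the JVM, with no Cassandra involved:

```scala
import java.time.OffsetDateTime

// Two renderings of the same instant from the example above:
// the underlying epoch milliseconds are identical.
val a = OffsetDateTime.parse("2015-03-05T14:00:00+01:00")
val b = OffsetDateTime.parse("2015-03-05T16:00:00+03:00")
println(a.toInstant.toEpochMilli == b.toInstant.toEpochMilli) // true
```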
If you specify nothing and get 0 results, while you do get results with +0000, then make sure the data you are expecting was originally written with the expected time zone. Perhaps there really is no data in that span because of this, or the coordinating node's timestamp differs.
I need to run a Cassandra instance on Windows... Don't ask why...
Anyway, the issue is that I have timestamp columns that show datetimes in the PST time zone, but I would like to see GMT. My machine runs in the BST time zone (British Summer Time).
Is there a way for me to change the default time zone to GMT?
Timestamp values are stored independently of the time zone they were converted from. Any rendering of a time zone is done by cqlsh, which depends on Python to convert the TZ-agnostic timestamp value into the cqlsh output. Python, in turn, uses the OS default TZ. On Linux, you can change the TZ by setting the following environment variable in the same shell used to start cqlsh: export TZ='GMT'. I suppose this should work on Windows as well, using something like set TZ='GMT'.
Update 11/Feb/2016: the described behaviour no longer works in 2.1+. See this answer for details. (The linked answer explains this has since been fixed.)
I've searched through several examples of analyzing IIS logs with Log Parser, taking time into account. For example, this query shows the number of hits per hour:
SELECT
QUANTIZE(TO_LOCALTIME(TO_TIMESTAMP(date, time)), 3600) AS Hour,
COUNT(*) AS Hits
FROM D:\Logs\*.log
GROUP BY Hour
However, I cannot understand why TO_LOCALTIME is used. Also, if there is a difference in results depending on whether TO_LOCALTIME is used or not, why is that? Thank you!
IIS uses UTC for all times in its logs, regardless of the server's time zone, so to get your local time you can use TO_LOCALTIME.
I guess if you are fine with UTC, you don't need to use TO_LOCALTIME.