I am trying to upload data into AWS Neptune, but I am getting an error because of the date format.
Sample format of the CSV:
~id, ~from, ~to, ~label, date:Date
e1, v1, v2, created, 2019-11-04
Can someone help me with this?
I did a test today using a CSV as follows
~id,~from,~to,~label,date:Date
e1,3,4,testedge,2019-11-19
and it worked fine. This is after the load:
gremlin> g.E('e1')
==>e[e1][3-testedge->4]
gremlin> g.E('e1').valueMap()
==>{date=Tue Nov 19 00:00:00 UTC 2019}
Perhaps curl the loader endpoint for your cluster, adding ?errors=true&details=true to the URL, to see more detail about what failed.
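For example, a rough sketch of the same check from Python (the endpoint and load id below are placeholders; 8182 is the default Neptune port):

import requests

# Hedged sketch: ask the Neptune bulk loader status API for error details.
# "your-neptune-endpoint" and "your-load-id" are placeholders.
endpoint = "https://your-neptune-endpoint:8182"
load_id = "your-load-id"

resp = requests.get(
    f"{endpoint}/loader/{load_id}",
    params={"details": "true", "errors": "true"},
)
print(resp.json())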
Cheers,
Kelvin
Related
In my PostgreSQL database the datetime is stored as 2022-05-10 10:44:19+08, and when I fetch it with Sequelize it comes back in the format 2022-05-10T02:44:19.000Z.
So my question is: how do I convert it to 2022-05-10 10:44:19?
Thanks in advance.
The result depends directly on your server's time zone, so depending on what you want to get, you can use different options.
Here is a dbfiddle with examples.
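To make the underlying conversion concrete (the question uses Sequelize/JavaScript, so this Python snippet is only an illustration of the idea): the value Sequelize returns is the same instant expressed in UTC, and getting 2022-05-10 10:44:19 back is just a matter of formatting it in the +08 zone.

from datetime import datetime
from zoneinfo import ZoneInfo

# The UTC value returned by Sequelize...
utc_value = datetime.fromisoformat("2022-05-10T02:44:19.000+00:00")

# ...formatted in a +08 zone (Asia/Singapore is only an example zone) gives
# back the wall-clock time that was stored.
local = utc_value.astimezone(ZoneInfo("Asia/Singapore"))
print(local.strftime("%Y-%m-%d %H:%M:%S"))   # 2022-05-10 10:44:19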
I would like to drop Databricks SQL DB tables if the table was created more than 30 days ago. How do I get the table creation datetime from Databricks?
Thanks
Given a tableName, the easiest way to get the creation time is as follows:
import org.apache.spark.sql.catalyst.TableIdentifier

val createdAtMillis = spark.sessionState.catalog
  .getTempViewOrPermanentTableMetadata(new TableIdentifier(tableName))
  .createTime
getTempViewOrPermanentTableMetadata() returns a CatalogTable that contains information such as:
CatalogTable(
Database: default
Table: dimension_npi
Owner: root
Created Time: Fri Jan 10 23:37:18 UTC 2020
Last Access: Thu Jan 01 00:00:00 UTC 1970
Created By: Spark 2.4.4
Type: MANAGED
Provider: parquet
Num Buckets: 8
Bucket Columns: [`npi`]
Sort Columns: [`npi`]
Table Properties: [transient_lastDdlTime=1578699438]
Location: dbfs:/user/hive/warehouse/dimension_npi
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Schema: root
|-- npi: integer (nullable = true)
...
)
You can list all tables in a database using sessionCatalog.listTables(database).
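For example, a rough PySpark sketch of the whole 30-day check, using the public spark.catalog.listTables() and reaching the same catalog API through the JVM gateway (an internal API that can change between Spark versions; the current database and the immediate DROP are assumptions you may want to replace with a dry run):

from datetime import datetime, timedelta, timezone

# createTime is epoch milliseconds, so compare it against a 30-day cutoff.
cutoff_ms = (datetime.now(timezone.utc) - timedelta(days=30)).timestamp() * 1000

jvm = spark.sparkContext._jvm
catalog = spark._jsparkSession.sessionState().catalog()
for t in spark.catalog.listTables():   # tables in the current database
    if t.isTemporary:                  # skip temp views
        continue
    ident = jvm.org.apache.spark.sql.catalyst.TableIdentifier(t.name)
    created_ms = catalog.getTempViewOrPermanentTableMetadata(ident).createTime()
    if created_ms < cutoff_ms:
        spark.sql(f"DROP TABLE {t.name}")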
There are alternative ways of accomplishing the same but with a lot more effort and risking errors due to Spark behavior changes: poking about table metadata using SQL and/or traversing the locations where tables are stored and looking at file timestamps. That's why it's best to go via the catalog APIs.
Hope this helps.
Assuming your DB table is Delta:
You can use DESCRIBE HISTORY <database>.<table> to retrieve all transactions made to that table, including timestamps. According to the Databricks documentation, history is only retained for 30 days. Depending on how you plan to implement your solution, that may just work.
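For example, a short PySpark sketch that pulls the oldest retained history entry (default.my_table is a placeholder name):

# Rough sketch: read the Delta history and look at the oldest retained entry.
hist = spark.sql("DESCRIBE HISTORY default.my_table")
oldest = hist.orderBy("version").first()
print(oldest["version"], oldest["timestamp"])
# If version 0 is still retained, its timestamp is the creation time; if the
# oldest retained version is greater than 0, the creation event has already
# aged out of the history.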
We are sending the datetime as a string in the format 2018-03-20 10:50:037.996, and we have written the Stream Analytics query below.
SELECT
    powerscout.Device_Id AS PowerScout,
    powerscout.[kW System],
    CAST(powerscout.[TimeStamp] AS datetime) AS [TimeStamp]
INTO
    [PowergridView]
FROM
    [IoTHubIn]
When we send data through Stream Analytics, the job fails.
Any suggestions, please?
Thanks in advance.
ASA can parse DATETIME fields represented in one of the formats described in ISO 8601. The format you are sending is not supported. You can try using a custom JavaScript function to parse it.
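A different, hedged option (not the JavaScript-UDF route suggested above): if you control the sender, you can emit the timestamp in ISO 8601 to begin with, which CAST(... AS datetime) understands. A minimal Python sketch:

from datetime import datetime, timezone

# Format the current UTC time as ISO 8601 with millisecond precision,
# e.g. 2018-03-20T10:50:37.996
iso_ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3]
print(iso_ts)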
I have a datetime column in an Azure Data Lake Analytics table.
All my incoming data is UTC +0000. When using the code below, all the CSV outputs convert the dates to -0800:
OUTPUT @data
TO @"/data.csv"
USING Outputters.Text(quoting : false, delimiter : '|');
An example datetime in the output:
2018-01-15T12:20:13.0000000-08:00
Are there any options for controlling the output format of the dates? I don't really understand why everything is suddenly in -0800 when the incoming data isn't.
Currently, ADLA does not store TimeZone information in DateTime, meaning it will always default to the local time of the cluster machine when reading (-8:00 in your case). Therefore, you can either normalize your DateTime to this local time by running
DateTime.SpecifyKind(myDate, DateTimeKind.Local)
or use
DateTime.ConvertToUtc()
to output in Utc form (but note that next time you ingest that same data, ADLA will still default to reading it in offset -0800). Examples below:
@getDates =
    EXTRACT
        id int,
        date DateTime
    FROM "/test/DateTestUtc.csv"
    USING Extractors.Csv();

@formatDates =
    SELECT
        id,
        DateTime.SpecifyKind(date, DateTimeKind.Local) AS localDate,
        date.ConvertToUtc() AS utcDate
    FROM @getDates;

OUTPUT @formatDates
TO "/test/dateTestUtcKind_AllUTC.csv"
USING Outputters.Csv();
You can file a feature request for DateTime with offset on our ADL feedback site. Let me know if you have other questions!
I am trying to automate my Spark code in Scala or Python, and here is what I am trying to do.
The files in the S3 bucket are named like filename_2016_02_01.csv.gz.
From the S3 bucket, the Spark code should pick up the file name and create a DataFrame, for example:
Dataframe=sqlContext.read.format("com.databricks.spark.csv").options(header="true").options(delimiter=",").options(inferSchema="true").load("s3://bucketname/filename_2016-01-29.csv.gz")
So every day when I run the job, it should pick up that particular day's file and create a DataFrame, instead of me specifying the file name.
Any idea how to write code for this condition?
Thanks in advance.
If I understood you correctly, you want the file name to change automatically based on that day's date.
If that's the case, here is a Scala solution.
I'm using joda-time to generate that date.
import org.joda.time.format.DateTimeFormat
import org.joda.time.{DateTimeZone, DateTime}
...
val today = DateTime.now(DateTimeZone.UTC).toString(DateTimeFormat.forPattern("yyyy_MM_dd"))
val fileName = "filename_" + today + ".csv.gz"
...
Python solution:
from datetime import datetime
today = datetime.utcnow().strftime('%Y_%m_%d')
file_name = 'filename_' + today + '.csv.gz'
load("s3://bucketname/{}".format(file_name))
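Putting it together with the reader from the question (the bucket name and CSV options are the ones shown there):

# Rough sketch: build the path from today's date and read it the same way
# the question does.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .options(header="true", delimiter=",", inferSchema="true")
      .load("s3://bucketname/" + file_name))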