Error while writing data from python to redshift - Invalid date format - length must be 10 or more - python-3.x

I have a dataframe in Python where the date columns are of datetime64[ns] data type. Now I am trying to write this dataframe to Redshift, and I am getting the following stl_load_errors:
Invalid date format - length must be 10 or more
All my dates are in the 2016-10-21 format and thus have a length of 10. Moreover, I have ensured that no row has a messed-up format like 2016-1-8, which would have only 8 characters. So the error is not making sense.
Has anyone faced a similar error while writing data to Redshift? Any explanation?
Note:
Here's some context. I am running the Python script from EC2. The script writes the data in JSON format to an S3 bucket, and then this JSON is loaded into an empty Redshift table. The Redshift table defines the date columns as the 'date' type. I know there's another way that uses boto3/COPY, but for now I am stuck with this method.
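For reference, a minimal sketch of the write path described above, assuming pandas and boto3 (the bucket, key, and column names are placeholders, not from the question). One thing worth checking is what the date column actually looks like in the JSON file, since pandas does not serialize datetime64[ns] columns as plain YYYY-MM-DD strings by default:

```python
import boto3
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2016-10-21", "2016-10-22"])})

# By default to_json() renders datetime64[ns] values as epoch milliseconds,
# not as 'YYYY-MM-DD' strings, so force an explicit date string first.
df["order_date"] = df["order_date"].dt.strftime("%Y-%m-%d")

body = df.to_json(orient="records", lines=True)

# Upload the JSON-lines file to the staging location used by the COPY.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="staging/data.json", Body=body)
```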

Related

Write each row of a dataframe to a separate json file in s3 with pyspark

In one of my projects, I need to write each row of a dataframe into a separate S3 file in JSON format. In the actual implementation, map/foreach's input is a Row, but I don't seem to find any member function on Row that could transform a row into JSON format.
I'm using a Spark dataframe and don't want to convert it to pandas (as that involves sending everything to the driver?), hence I cannot use the to_json function. Is there any other way to do it? I can definitely write my own JSON converter based on my specific dataframe schema, but I'm just wondering if there is a readily available module.
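For what it's worth, a rough sketch of one way to do this without a dedicated module, using Row.asDict() plus the standard json module inside foreachPartition. It assumes boto3 is available on the executors and that each row has an 'id' column usable as a file name; both are assumptions, not part of the question:

```python
import json
import boto3

def write_partition(rows):
    # One S3 client per partition rather than per row.
    s3 = boto3.client("s3")
    for row in rows:
        # Row.asDict() gives a plain dict; default=str covers datetime/Decimal values.
        payload = json.dumps(row.asDict(), default=str)
        s3.put_object(Bucket="my-bucket",
                      Key=f"rows/{row['id']}.json",
                      Body=payload)

df.foreachPartition(write_partition)
```

df.toJSON() is another readily available route: it returns an RDD of JSON strings (one per row) without going through pandas, leaving only the per-file upload to write yourself.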

Prevent pyspark/spark from transforming timestamp when creating a dataframe from a parquet file

I am reading a parquet file into a dataframe. My goal is to verify that my time data (column type in parquet: timestamp) is in ISO 8601 format.
The dates in the time column look like this: 2021-03-13T05:34:27.100Z or 2021-03-13T05:34:27.100+0000
But when I read my dataframe, pyspark transforms 2021-03-13T05:34:27.100Z into 2021-03-13 05:34:27.100
I want to keep the original format, but I can't figure out how to stop pyspark from doing this. I tried to use a custom schema with string for the dates, but I get this error: Parquet column cannot be converted in file file.snappy.parquet. Column: [time], Expected: string, Found: INT96
I also tried using conf parameters, but it didn't work for me.
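Not a confirmed fix, but a sketch of one way to get an ISO 8601 representation back out: read the INT96 column as a timestamp (as Spark insists), pin the session time zone, and re-render it as a string with date_format. The UTC assumption and the column name are mine; file.snappy.parquet is from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format

spark = (SparkSession.builder
         .config("spark.sql.session.timeZone", "UTC")  # assumes the stored values are UTC
         .getOrCreate())

df = spark.read.parquet("file.snappy.parquet")

# Re-render the timestamp as an ISO 8601 string for the verification step.
df = df.withColumn("time_iso",
                   date_format(col("time"), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
```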

A DATETIME column in Synapse tables is loading date values that are a few hours into the past compared to the incoming value

I have a datetime column in Synapse called "load_day" which is being loaded through a pyspark dataframe (parquet). During runtime, the code adds a new column to the dataframe with an incoming date ('timestamp') in the format yyyy-mm-dd hh:mm:ss:
df = df.select(lit(incoming_date).alias("load_day"), "*")
Later we write this dataframe into a Synapse table using a df.write command.
What's strange is that every date value going into this load_day column is being written as a value that is a few hours in the past. This is happening with all the Synapse tables in my database for all the new loads I'm doing. To my knowledge, nothing in the code has changed from before.
E.g.: if my incoming date is "2022-02-19 00:00:00", it's being written as 2022-02-18 22:00:00.000 instead of 2022-02-19 00:00:00.000. The hours part of the date is also not stable; sometimes it writes 22:00:00.000 and sometimes 23:00:00.000.
I debugged the code, and the output of the variable looks totally fine. It shows the value as 2022-02-19 00:00:00 as expected, but the moment the data is ingested into the Synapse table, it goes back a couple of hours.
I don't understand why this might be happening or what to look for during debugging.
Has any of you faced something like this before? Any ideas on how I can approach this to find out what's causing this erroneous date?
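I can't say this is the cause, but a shift of one or two hours that moves around usually points at a session time-zone conversion happening somewhere between the literal and the write. A small debugging sketch; the choice of UTC is my assumption, and incoming_date is the variable from the question:

```python
from pyspark.sql.functions import lit, to_timestamp

# See which time zone the session is currently applying to timestamp values.
print(spark.conf.get("spark.sql.session.timeZone"))

# Pin the session time zone and cast the literal explicitly instead of
# relying on string-to-timestamp coercion at write time.
spark.conf.set("spark.sql.session.timeZone", "UTC")
df = df.select(
    to_timestamp(lit(incoming_date), "yyyy-MM-dd HH:mm:ss").alias("load_day"),
    "*",
)
```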

How to handle dates in cx_oracle using python?

I'm trying to access an Oracle table using the cx_Oracle module and convert it to a dataframe. Everything is fine except that a couple of date columns have a date format like "01-JAN-01". Python considers it as datetime.datetime(1, 1, 1, 0, 0), and after creating the dataframe it shows as 0001-01-01 00:00:00. I am expecting output as 2001-01-01 00:00:00. Please help me with this. Thanks in advance.
You have a couple of choices. You could
* Retrieve it from the Oracle database with [read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html), specifying the date in a format (TO_CHAR) more appropriate for the default date format of pandas (sketched below), or
* Retrieve it from the database as a string (as above) and then convert it into a date in the pandas framework.
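A minimal sketch of the first option, assuming a working cx_Oracle connection; the connection string, table, and column names are placeholders:

```python
import cx_Oracle
import pandas as pd

conn = cx_Oracle.connect("user/password@host/service")

# Let Oracle render the date with an explicit four-digit year, then parse it
# back into a pandas datetime.
sql = "SELECT TO_CHAR(hire_date, 'YYYY-MM-DD HH24:MI:SS') AS hire_date FROM employees"
df = pd.read_sql(sql, conn)
df["hire_date"] = pd.to_datetime(df["hire_date"], format="%Y-%m-%d %H:%M:%S")
```

If the stored year really is 1 rather than 2001, the correction would then have to happen on the pandas side, per the second option.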

Unable to read timestamp value in pyspark from Hive (spark 1.6.1)

I am trying to read a Hive table that has a date column with the datatype timestamp, length=9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all other columns (datatype = varchar).
It either reads null or gives None in the date column.
I have printed df.dtypes and found that the dataframe schema was inferred correctly and the date columns have the timestamp datatype.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone guide me on what could be wrong or what limitation renders this error, and how I could rectify it?
I have seen this problem in Spark, where it displays null when the datatype is timestamp; it's a bug. There is a way to get around it: read that date column as a string, using something like to_char(Column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to timestamp. If you can tell me the source type and the tool you used to pull the data (like Sqoop), or whether you are getting the data in some form of files, I can help you better.
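Building on that suggestion, a rough sketch of the workaround in the question's Spark 1.6 / HiveContext setup. I've used a plain cast(... AS string) in HiveQL, which is my substitution for the to_char call mentioned above, and the column names are placeholders:

```python
# Pull the timestamp column out as a string so Spark does not null it,
# then cast it back to timestamp on the DataFrame side.
df = hc.sql("""
    SELECT cast(event_ts AS string) AS event_ts_str, other_col
    FROM schema.table
""")

df = (df.withColumn("event_ts", df["event_ts_str"].cast("timestamp"))
        .drop("event_ts_str"))
```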
