BigQuery failed to parse input string as TIMESTAMP - python-3.x
I'm trying to load a CSV from Google Cloud Storage into BigQuery using schema autodetect.
However, I'm stumped by a parsing error on one of my columns, and I'm perplexed as to why BigQuery can't parse the field. According to the documentation, it should be able to parse fields that look like YYYY-MM-DD HH:MM:SS.SSSSSS, which is exactly the format of my BQInsertTimeUTC column.
Here's my code:
from google.cloud import bigquery
from google.oauth2 import service_account
project_id = "<my_project_id>"
table_name = "<my_table_name>"
gs_link = "gs://<my_bucket_id>/my_file.csv"
creds = service_account.Credentials.from_service_account_info(gcs_creds)
bq_client = bigquery.Client(project=project_id, credentials=creds)
dataset_ref = bq_client.dataset(<my_dataset_id>)
# create job_config object
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    skip_leading_rows=1,
    source_format="CSV",
    write_disposition="WRITE_TRUNCATE",
)
# prepare the load_job
load_job = bq_client.load_table_from_uri(
    gs_link,
    dataset_ref.table(table_name),
    job_config=job_config,
)
# execute the load_job
result = load_job.result()
Error Message:
Could not parse '2021-07-07 23:10:47.989155' as TIMESTAMP for field BQInsertTimeUTC (position 4) starting at location 64 with message 'Failed to parse input string "2021-07-07 23:10:47.989155"'
And here's the csv file that is living in GCS:
first_name, last_name, date, number_col, BQInsertTimeUTC, ModifiedBy
lisa, simpson, 1/2/2020T12:00:00, 2, 2021-07-07 23:10:47.989155, tim
bart, simpson, 1/2/2020T12:00:00, 3, 2021-07-07 23:10:47.989155, tim
maggie, simpson, 1/2/2020T12:00:00, 4, 2021-07-07 23:10:47.989155, tim
marge, simpson, 1/2/2020T12:00:00, 5, 2021-07-07 23:10:47.989155, tim
homer, simpson, 1/3/2020T12:00:00, 6, 2021-07-07 23:10:47.989155, tim
Loading CSV files into BigQuery assumes that all the timestamp fields follow the same format. In your CSV file, the first timestamp value is "1/2/2020T12:00:00", so BigQuery concludes that the timestamp format used by the file is [M]M/[D]D/YYYYT[H]H:[M]M:[S]S[.F][time zone].
Therefore, it complains that the value "2021-07-07 23:10:47.989155" could not be parsed. If you change "2021-07-07 23:10:47.989155" to "7/7/2021T23:10:47.989155", it will work.
To fix this, you can either:
Create a table where the date and BQInsertTimeUTC columns are typed as STRING, load the CSV into it, and then expose a view that converts date and BQInsertTimeUTC to the expected TIMESTAMP types, using SQL to transform the data from the base table (a sketch of this follows below).
Open the CSV file and transform either the "date" values or the "BQInsertTimeUTC" values so that both columns use a consistent format.
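A minimal Python sketch of the first option, reusing bq_client, dataset_ref, gs_link and table_name from the question's code; the column list is inferred from the sample CSV and is an assumption about the real file, and the typed view on top of the base table is shown further down this thread:
from google.cloud import bigquery

# Explicit schema: keep the two problematic columns as STRING so the load cannot fail on parsing.
schema = [
    bigquery.SchemaField("first_name", "STRING"),
    bigquery.SchemaField("last_name", "STRING"),
    bigquery.SchemaField("date", "STRING"),             # loaded as text on purpose
    bigquery.SchemaField("number_col", "INTEGER"),
    bigquery.SchemaField("BQInsertTimeUTC", "STRING"),  # loaded as text on purpose
    bigquery.SchemaField("ModifiedBy", "STRING"),
]

job_config = bigquery.LoadJobConfig(
    schema=schema,
    skip_leading_rows=1,
    source_format="CSV",
    write_disposition="WRITE_TRUNCATE",
)

load_job = bq_client.load_table_from_uri(
    gs_link,
    dataset_ref.table(table_name),
    job_config=job_config,
)
load_job.result()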
By the way, the CSV sample you pasted here has an extra space after each "," delimiter.
Working version:
first_name,last_name,date,number_col,BQInsertTimeUTC,ModifiedBy
lisa,simpson,1/2/2020T12:00:00,2,7/7/2021T23:10:47.989155,tim
bart,simpson,1/2/2020T12:00:00,3,7/7/2021T23:10:47.989155,tim
maggie,simpson,1/2/2020T12:00:00,4,7/7/2021T23:10:47.989155,tim
marge,simpson,1/2/2020T12:00:00,5,7/7/2021T23:10:47.989155,tim
homer,simpson,1/3/2020T12:00:00,6,7/7/2021T23:10:47.989155,tim
As per the limitations mentioned here,
When you load JSON or CSV data, values in TIMESTAMP columns must use a dash - separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon : separator.
So you could try passing BQInsertTimeUTC as 2021-07-07 23:10:47, without the fractional seconds, instead of 2021-07-07 23:10:47.989155.
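If you want to try that, here is a small pandas sketch (my own addition, not part of the original answer) that rewrites a local copy of the file with the fractional seconds dropped before re-uploading it; the file names are placeholders:
import pandas as pd

# skipinitialspace handles the stray spaces after the commas in the sample CSV.
df = pd.read_csv("my_file.csv", skipinitialspace=True)

# Truncate BQInsertTimeUTC to whole seconds, e.g. 2021-07-07 23:10:47.989155 -> 2021-07-07 23:10:47.
df["BQInsertTimeUTC"] = (
    pd.to_datetime(df["BQInsertTimeUTC"]).dt.floor("s").dt.strftime("%Y-%m-%d %H:%M:%S")
)
df.to_csv("my_file_fixed.csv", index=False)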
If you still want to use different date formats, you can do the following:
Load the CSV file as-is into BigQuery (i.e. your schema should be modified so that BQInsertTimeUTC is STRING).
Create a BigQuery view that transforms the loaded string field into a recognized timestamp format.
Use PARSE_TIMESTAMP (or PARSE_DATE if you only need the date part) on BQInsertTimeUTC and use that view for your analysis.
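A sketch of steps 2 and 3 using the Python client from the question; the view name, the <my_dataset_id> placeholder and the format string are assumptions on my part, and SAFE.PARSE_TIMESTAMP is used so that malformed rows become NULL instead of failing the query:
# Build a view that re-exposes the STRING column as a real TIMESTAMP.
view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.<my_dataset_id>.{table_name}_typed` AS
SELECT
  * EXCEPT (BQInsertTimeUTC),
  SAFE.PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', BQInsertTimeUTC) AS BQInsertTimeUTC
FROM `{project_id}.<my_dataset_id>.{table_name}`
"""
bq_client.query(view_sql).result()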
Related
How do I convert my response with byte characters to readable CSV - PYTHON
I am building an API to save CSVs from the SharePoint REST API using Python 3. I am using a public dataset as an example. The original csv has 3 columns, Group, Team, FIFA Ranking, with corresponding data in the rows. For reference, the original csv in the SharePoint UI looks like a normal table (screenshot omitted). After using data = response.content, the output of data is:
b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\nA,Netherlands,8\r\nB,England,5\r\nB,Iran,20\r\nB,United States,16\r\nB,Wales,19\r\nC,Argentina,3\r\nC,Saudi Arabia,51\r\nC,Mexico,13\r\nC,Poland,26\r\nD,France,4\r\nD,Australia,38\r\nD,Denmark,10\r\nD,Tunisia,30\r\nE,Spain,7\r\nE,Costa Rica,31\r\nE,Germany,11\r\nE,Japan,24\r\nF,Belgium,2\r\nF,Canada,41\r\nF,Morocco,22\r\nF,Croatia,12\r\nG,Brazil,1\r\nG,Serbia,21\r\nG,Switzerland,15\r\nG,Cameroon,43\r\nH,Portugal,9\r\nH,Ghana,61\r\nH,Uruguay,14\r\nH,South Korea,28\r\n'
How do I convert the above to a csv that pandas can manipulate, with the columns being Group, Team, FIFA Ranking and the corresponding data, dynamically so this method works for any csv?
I tried data = response.content.decode('utf-8', 'ignore').split(','); however, when I convert the data variable to a dataframe and then export the csv, the csv just returns all the values in one column.
I tried data = response.content.decode('utf-8') and data = response.content.decode('utf-8', 'ignore') without the split; however, pandas does not take this in as a valid df and returns "invalid use of dataframe constructor".
I tried data = json.loads(response.content); however, the content is not valid JSON, and you get the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
Given:
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n'  # ...
If you just want a CSV version of your data you can simply do:
with open("foo.csv", "wt", encoding="utf-8", newline="") as file_out:
    file_out.writelines(data.decode())
If your objective is to load this data into a pandas dataframe and the CSV is not actually important, you can:
import io
import pandas

foo = pandas.read_csv(io.StringIO(data.decode()))
print(foo)
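Tying that back to the question, where response stands for the (hypothetical) requests response object from the SharePoint call:
import io
import pandas as pd

# Decode the raw bytes and let pandas parse the CSV directly; no intermediate file is needed.
df = pd.read_csv(io.StringIO(response.content.decode("utf-8")))
print(df.head())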
Problem in converting csv to json in python
I am trying to convert a csv file to json in Python and I have an issue where, in one column, the data contains a comma but is enclosed in double quotes. When reading it as a csv file, the data loads properly without any issues, but while converting to json it fails, saying "Too few arguments passed".
Sample data:
col1,col2,col3
apple,Fruit,good for health
banana,Fruit,"good for weight gain , good for calcium"
Brinjal,Vegetable,good for skin
While converting the above file to json, it failed because the 2nd row was considered to have 4 columns.
Error statement:
pandas.errors.ParserError: Too many columns specified: expected 3 and found 4
import json
import pandas as pd

data = pd.read_csv("sampledata.csv", header=None)
data_json = json.loads(data.to_json(orient='records'))
with open("filename.json", 'w', encoding='utf-8') as jsonf:
    jsonf.write(json.dumps(data_json, indent=4))
This works:
df = pd.read_csv("test.csv")
df_json = df.to_json()
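A slightly fuller sketch of the same idea covering the question's end goal of writing a JSON file (file names are placeholders, and the file is assumed to have a header row): read_csv honours the double quotes, so the embedded comma stays inside col3.
import json
import pandas as pd

# The quoted field "good for weight gain , good for calcium" is kept as a single value.
df = pd.read_csv("sampledata.csv")

# orient='records' produces one JSON object per CSV row.
records = json.loads(df.to_json(orient="records"))
with open("filename.json", "w", encoding="utf-8") as jsonf:
    json.dump(records, jsonf, indent=4)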
How to convert dataframe to a text file in spark?
I unloaded a Snowflake table and created a data frame. This table has data of various datatypes. I tried to save it as a text file but got an error: Text data source does not support Decimal(10,0). So to resolve the error, I cast my select query and converted all columns to string datatype. Then I got the below error: Text data source supports only single column, and you have 5 columns. My requirement is to create a text file as follows: "column1value column2value column3value and so on"
You can use a CSV output with a space delimiter:
import pyspark.sql.functions as F

df.select([F.col(c).cast('string') for c in df.columns]).write.csv('output', sep=' ')
If you want only 1 output file, you can add .coalesce(1) before .write.
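And the single-output-file variant mentioned above, as a sketch (the output path is a placeholder):
import pyspark.sql.functions as F

# coalesce(1) collapses the data to one partition, so Spark writes a single part-file.
df.select([F.col(c).cast('string') for c in df.columns]) \
    .coalesce(1) \
    .write.csv('output_single', sep=' ')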
You need to have one column if you want to write using spark.write.text. You can use csv instead, as suggested in #mck's answer, or you can concatenate all columns into one before you write:
df.select(
  concat_ws(" ", df.columns.map(c => col(c).cast("string")): _*).as("value")
).write
  .text("output")
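For completeness, a PySpark version of the same concatenation approach (my own translation of the Scala snippet above, not part of the original answer):
from pyspark.sql import functions as F

# Cast every column to string and join them with spaces into a single "value" column,
# which is the one-column shape that DataFrameWriter.text requires.
df.select(
    F.concat_ws(" ", *[F.col(c).cast("string") for c in df.columns]).alias("value")
).write.text("output")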
DateTime in USQL automatically convert to Unix Timestamp in parquet file
I have a problem with DateTime values generated by U-SQL. I wrote some U-SQL to save data into a parquet file, but all the DateTime columns were automatically converted to Int64 (Unix timestamps). I tried to investigate and found some information at https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2018/2018_Spring/USQL_Release_Notes_2018_Spring.md#data-driven-output-partitioning-with-output-fileset-is-in-private-preview. All DateTime values will be converted to int64. Why does MS need to do that, and how can I keep the original DateTime value generated by U-SQL in the parquet file? I put some simple code below:
SET @@FeaturePreviews = "EnableParquetUdos:on";

@I =
    SELECT DateTime.Parse("2015-05-10T12:15:35.1230000Z") AS c_timestamp
    FROM (VALUES(1)) AS T(x);

OUTPUT @I
TO @"/data/Test_TimeStamp.parquet"
USING Outputters.Parquet();
The result in the parquet file is: c_timestamp - Int64 datatype - 1431260135123000
But I expected the parquet file to contain: c_timestamp - DateTime datatype - 2015-05-10T12:15:35.1230000Z
Round pandas timestamp series to seconds - then save to csv without ms/ns resolution
I have a dataframe, df, with index pd.DatetimeIndex. The individual timestamps are changed from 2017-12-04 08:42:12.173645000 to 2017-12-04 08:42:12 using the excellent pandas rounding command:
df.index = df.index.round("S")
When stored to csv, this format is kept (which is exactly what I want). I also need a date-only column, and this is now easily created:
df = df.assign(DateTimeDay = df.index.round("D"))
When stored to a csv file using df.to_csv(), this does write out the entire timestamp (2017-12-04 00:00:00), except when it is the ONLY column to be saved. So, I add the following command before saving:
df["DateTimeDay"] = df["DateTimeDay"].dt.date
...and the csv file looks nice again (2017-12-04).
Problem description
Now over to the question: I have two other columns with timestamps in the same format as above (but different - AND - with some very few NaNs). I want to also round these to seconds (keeping NaNs as NaNs of course), then make sure that when written to csv, they are not padded with zeros "below the second resolution". Whatever I try, I am simply not able to do this.
Additional information:
print(df.dtypes)
print(df.index.dtype)
...all result in datetime64[ns]. If I convert them to an index:
df["TimeCol2"] = pd.DatetimeIndex(df["TimeCol2"]).round("s")
df["TimeCol3"] = pd.DatetimeIndex(df["TimeCol3"]).round("s")
...it works, but the csv file still pads them with unwanted and unnecessary zeros.
Optimal solution: No conversion of the columns (like above) or use of element-wise apply unless they are quick (100+ million rows). My dream command would be like this:
df["TimeCol2"] = df["TimeCol2"].round("s")  # Raises TypeError: an integer is required (got type str)
You can specify the date format for datetime dtypes when calling to_csv:
In [170]: df = pd.DataFrame({'date': [pd.to_datetime('2017-12-04 07:05:06.767')]})
          df
Out[170]:
                     date
0 2017-12-04 07:05:06.767

In [171]: df.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[171]: ',date\n0,2017-12-04 07:05:06\n'
If you want to round the values, you need to round prior to writing to csv:
In [173]: df1 = df['date'].dt.round('s')
          df1.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[173]: '0,2017-12-04 07:05:07\n'
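For the question's case of several timestamp columns with a few NaNs, a sketch combining the two (TimeCol2 and TimeCol3 come from the question; the output file name is a placeholder): the .dt accessor avoids the TypeError mentioned at the end of the question, Series.dt.round('s') leaves NaT values untouched, and date_format keeps sub-second zeros out of the csv.
import pandas as pd

# Round each timestamp column to whole seconds; NaT entries stay NaT.
for col in ["TimeCol2", "TimeCol3"]:
    df[col] = df[col].dt.round("s")

# date_format applies to every datetime column (and the index) on output,
# so nothing is padded below second resolution.
df.to_csv("rounded.csv", date_format="%Y-%m-%d %H:%M:%S")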