BigQuery is not able to convert String to Timestamp - apache-spark

I have a BigQuery table where one of the columns (publishTs) is a TIMESTAMP. I am trying to upload a Parquet file into the same table using the GCP UI BQ upload option; the file has the same column name (publishTs) but with a String datatype (e.g. "2021-08-24T16:06:21.122Z"), and BQ is complaining with the following error:
I am generating the Parquet file using Apache Spark. I tried searching on the internet but could not find an answer.

Try to generate this column as INT64 - link
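For reference, a minimal PySpark sketch of that idea (the single-row DataFrame and the output path are assumptions for illustration): cast publishTs to a real timestamp and ask Spark to write timestamps as Parquet INT64 (timestamp-micros), which BigQuery loads as TIMESTAMP rather than STRING.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Write TimestampType as Parquet INT64 (timestamp-micros) instead of legacy INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

# Assumed input: publishTs arrives as an ISO-8601 string, as in the question.
df = spark.createDataFrame([("2021-08-24T16:06:21.122Z",)], ["publishTs"])

# Cast the string to a real timestamp before writing the Parquet file.
df = df.withColumn("publishTs", F.col("publishTs").cast("timestamp"))

df.write.mode("overwrite").parquet("gs://my-bucket/publish_ts_fixed/")  # hypothetical path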

Related

Athena query displays data differently than in S3

Athena Query is changing a few of the data points to 0.
During a data sanity check I found that a particular column showed a huge difference between the dashboard and the S3 files: the data displayed on the dashboard was around 40k, while reading it after downloading the file from S3 gave around 80k.
Since I am querying the data directly from S3 using Athena, the data source remains the same for Athena and for the file download. I am wondering why this is happening; any help would be appreciated.
E.g. the Athena query results vs. the data in S3 (screenshots).
Queried the data through a simple select query:
SELECT "orderid","orderdate","total tax"
FROM gbc_owss"
The datatype in Athena for the total tax column was double.
EDIT: Solved the above issue. It was indeed a delimiter issue that was pushing the values into the next column, making it look like Athena was changing values, but that wasn't the case.
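As a toy illustration of that kind of delimiter mismatch (the file content, delimiter, and column names here are hypothetical, not the original data), reading with the wrong separator stops the values from lining up with the declared columns:
import pandas as pd
from io import StringIO

# The file is really semicolon-delimited, but the reader/table assumes commas.
raw = "orderid;orderdate;total tax\n1001;2023-01-05;42.5\n1002;2023-01-06;0.0\n"

right = pd.read_csv(StringIO(raw), sep=";")  # three columns, values line up
wrong = pd.read_csv(StringIO(raw), sep=",")  # one mangled column per row
print(right.shape, wrong.shape)              # (2, 3) (2, 1)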

BigQuery load parquet file

I am trying to load a Parquet file into a BigQuery table.
However, the DATE column, which is in yyyy-mm-dd format, is recognized as a String, and the following error occurs.
ERROR - Failed to execute task: 400 Provided Schema does not match Table my_prj.my_dataset.my_table. Field _P1 has changed type from DATE to STRING.
field_P1: 2022-10-05
Is there any way to solve it?
The first solution that comes to my mind is to load the data using the Pandas library in Python. This way, you can convert the string to the appropriate date format and load the data directly into BigQuery.
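A rough sketch of that idea with pandas plus the google-cloud-bigquery client (the file name is a placeholder, and the column is assumed to be _P1 as in the error message):
import pandas as pd
from google.cloud import bigquery

# Read the Parquet file that currently stores the date as a string.
df = pd.read_parquet("my_file.parquet")  # hypothetical file name

# Convert the "yyyy-mm-dd" string into real date objects so it matches the DATE column.
df["_P1"] = pd.to_datetime(df["_P1"], format="%Y-%m-%d").dt.date

# Append the DataFrame to the existing table; the DATE schema is now respected.
client = bigquery.Client()
client.load_table_from_dataframe(df, "my_prj.my_dataset.my_table").result()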

Column Type mismatch in Spark Dataframe and source

I am trying to read data from Elasticsearch. I can see the column is present as an array of strings in Elastic, but when I read it with Spark as a DataFrame I see it as a String. How can I handle this data in Spark?
Note: I am reading with sqlContext.read.format("org.elasticsearch.spark.sql") because I need to write it out as a CSV file in the future.
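One workaround to sketch here (using the sqlContext from the question; the endpoint, index, and field names are hypothetical): the elasticsearch-hadoop connector has an es.read.field.as.array.include option that tells it to read a field back as an array, and the array can then be flattened with concat_ws before the CSV write, since CSV has no array type.
from pyspark.sql import functions as F

# Read from Elasticsearch, declaring which field should come back as an array.
df = (sqlContext.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")              # assumed ES endpoint
      .option("es.read.field.as.array.include", "tags")  # hypothetical array field
      .load("my_index"))

# CSV cannot hold arrays, so join the elements into a single delimited string.
df_out = df.withColumn("tags", F.concat_ws("|", F.col("tags")))
df_out.write.mode("overwrite").option("header", True).csv("/tmp/es_export")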

Databricks: Incompatible format detected (temp view)

I am trying to create a temp view from a number of parquet files, but so far it does not work. As a first step, I am trying to create a dataframe by reading the parquets from a path. I want to load all parquet files into the df, but so far I don't even manage to load a single one, as you can see in the screenshot below. Can anyone help me out here? Thanks
Info: batch_source_path is the string in column "path", row 1
Your data is in Delta format, and this is how you must read it:
data = spark.read.load('your_path_here', format='delta')
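And since the original goal was a temp view, a short follow-up to the snippet above (the view name is a placeholder):
# Register the Delta-backed DataFrame as a temporary view for SQL access.
data.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT COUNT(*) FROM my_temp_view").show()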

Load array file to BigQuery

I'm looking for an effective way to upload the following array to a BigQuery table in this format:
BigQuery columns (example):
event_type: video_screen
event_label: click_on_screen
is_ready:false
time:202011231958
long:1
high:43
lenght:0
Array object:
[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"]]
I thought of several options, but none of them seems to be the best practice.
Option A: Upload the file to Google Cloud Storage and then create a table pointing at the bucket - this did not work because of the file format; BigQuery can't parse the array from the bucket.
Option B: Use my backend (Node.js) to change the file structure to CSV and upload it directly to BigQuery - this failed because of latency (the real array is much longer than my example).
Option C: Use Google Apps Script to get the array object and insert it into BigQuery - I didn't find simple code for this, and Google Cloud Storage has no API connected to Apps Script.
Has anyone dealt with such a case and can share their solution? What is the best practice here? If you have code for this, that would be great.
Load the file from GCS to BigQuery into a table with a single STRING column, so you get 100K rows and one column.
Essentially you will have a table that holds the JSON in a string.
Use JSON_EXTRACT_ARRAY to split the JSON array into elements, then extract each position into its corresponding variable/column and write it to a table.
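For the first step, a sketch of the one-column load using the google-cloud-bigquery Python client (bucket, file, and table names are placeholders; the field delimiter is just something assumed not to appear in the data):
from google.cloud import bigquery

client = bigquery.Client()

# Treat the file as CSV with a delimiter that never occurs in the payload and no
# quote character, so every line lands untouched in a single STRING column.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",
    quote_character="",
    schema=[bigquery.SchemaField("s", "STRING")],
)

client.load_table_from_uri(
    "gs://my-bucket/events_array.json",
    "my_project.my_dataset.raw_events",
    job_config=job_config,
).result()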
Here is a demo of the SQL extraction side:
with t as (
  select '[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"]]' as s
),
elements as (
  select e from t, unnest(JSON_EXTRACT_ARRAY(t.s)) e
)
select
  json_extract_scalar(e, '$[0]') as event_type,
  json_extract_scalar(e, '$[1]') as event_label
from elements
The output is three rows, one per array element, with event_type and event_label extracted (e.g. video_screen / click_on_screen).
