Athena query is changing a few data points to 0.
During a data sanity check I found that a particular column showed a huge difference between the dashboard and the S3 files: the value displayed on the dashboard was around 40k, while reading the same column after downloading the file from S3 gave around 80k.
Since I am querying the data directly from S3 using Athena, the data source is the same for Athena and the file download. I am wondering why this is happening; any help would be appreciated.
E.g.: Athena query results vs. the data in S3 (screenshots).
I queried the data with a simple SELECT query:
SELECT "orderid","orderdate","total tax"
FROM gbc_owss"
The data type in Athena for the "total tax" column was double.
EDIT: Solved the above issue. It was indeed a delimiter issue that was pushing values into the next column, which made it look like Athena was changing values, but that wasn't the case.
I am trying to load a Parquet file into a BigQuery table.
However, the date column, which is in yyyy-mm-dd format, is recognized as a STRING, and the following error occurs:
ERROR - Failed to execute task: 400 Provided Schema does not match Table my_prj.my_dataset.my_table. Field _P1 has changed type from DATE to STRING.
field_P1: 2022-10-05
Is there any way to solve it?
The first solution that comes to mind is to load the data using the Pandas library in Python. That way you can convert the string to the proper date format and load the data directly into BigQuery.
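A rough sketch of that approach with pandas and the google-cloud-bigquery client; the table ID and the _P1 field come from the error message, while the Parquet file name is a placeholder:

import pandas as pd
from google.cloud import bigquery

# Read the Parquet file (placeholder file name).
df = pd.read_parquet("my_file.parquet")

# Convert the yyyy-mm-dd strings to real dates so the column maps to DATE.
df["_P1"] = pd.to_datetime(df["_P1"]).dt.date

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("_P1", "DATE")],  # keep the table's existing DATE type
    write_disposition="WRITE_APPEND",
)
client.load_table_from_dataframe(
    df, "my_prj.my_dataset.my_table", job_config=job_config
).result()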
I am trying to read data from Elasticsearch. I can see the column is present as an array of strings in Elasticsearch, but when I read it with Spark as a DataFrame it shows up as a string. How can I handle this data in Spark?
Note: I am reading with sqlContext.read.format("org.elasticsearch.spark.sql") because I need to write it out as a CSV file later.
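For what it's worth, Elasticsearch mappings do not distinguish arrays, so the elasticsearch-hadoop connector usually needs the array fields named explicitly via es.read.field.as.array.include. A minimal PySpark sketch, assuming a SparkSession called spark, a local ES node, an index my_index, and an array column tags (all placeholders):

from pyspark.sql import functions as F

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")                    # placeholder host
      .option("es.read.field.as.array.include", "tags")   # read "tags" as array<string>
      .load("my_index"))                                   # placeholder index

# CSV cannot hold array columns directly, so join the array into one delimited string first.
df.withColumn("tags", F.concat_ws("|", "tags")) \
  .write.mode("overwrite").option("header", "true").csv("/tmp/out")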
I am trying to create a temp view from a number of Parquet files, but it has not worked so far. As a first step, I am trying to create a DataFrame by reading the Parquet files from a path. I want to load all Parquet files into the DataFrame, but so far I don't even manage to load a single one, as you can see in the screenshot below. Can anyone help me out here? Thanks.
Info: batch_source_path is the string in column "path", row 1
Your data is in Delta format, and this is how you should read it:
data = spark.read.load('your_path_here', format='delta')
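Since the original goal was a temp view, you can then register the resulting DataFrame directly; the view name below is a placeholder:

data.createOrReplaceTempView('my_temp_view')   # placeholder view name
spark.sql('SELECT * FROM my_temp_view').show()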
I'm looking for an effective way to upload the following array to a BigQuery table in this format:
BigQuery columns (example):
event_type: video_screen
event_label: click_on_screen
is_ready:false
time:202011231958
long:1
high:43
lenght:0
Array object:
[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"]]
I thought of several options, but none of them seems to be best practice.
Option A: Upload the file to Google Cloud Storage and then create a table on top of that bucket - this did not work because of the file format; BigQuery can't parse the array from the bucket.
Option B: Use my backend (Node.js) to change the file structure to CSV and upload it directly to BigQuery - this failed because of latency (the real array is much longer than my example).
Option C: Use Google Apps Script to get the array object and insert it into BigQuery - I didn't find simple code for this, and Google Cloud Storage has no API connected to Apps Script.
Has anyone dealt with such a case and can share their solution? What is the best practice here? If you have code for this, that would be great.
Load the file from GCS into a BigQuery table with one single STRING column, so you end up with 100K rows and a single column.
Essentially you will have a table that holds the JSON as a string.
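One way to do that load step with the Python BigQuery client is to pretend the file is CSV with a field delimiter that never appears in the data, so each line lands in one STRING column; the bucket, file, table, and column names below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\u00fe",      # a character assumed never to appear in the data
    quote_character="",            # disable quoting so the JSON string stays intact
    schema=[bigquery.SchemaField("s", "STRING")],
)

client.load_table_from_uri(
    "gs://my_bucket/events.json",        # placeholder GCS path
    "my_project.my_dataset.raw_events",  # placeholder table
    job_config=job_config,
).result()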
Use JSON_EXTRACT_ARRAY to split the JSON array into elements, then extract each position into its corresponding column and write it to a table.
Here is a demo:
WITH t AS (
  SELECT '[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"]]' AS s
),
elements AS (
  SELECT e FROM t, UNNEST(JSON_EXTRACT_ARRAY(t.s)) AS e
)
SELECT
  JSON_EXTRACT_SCALAR(e, '$[0]') AS event_type,
  JSON_EXTRACT_SCALAR(e, '$[1]') AS event_label
FROM elements
The output is:

event_type   | event_label
video_screen | click_on_screen
buy          | error
sign_in      | enter