I'm looking for an effective way to upload the following array to a BigQuery table in this format:
BigQuery columns (example):
event_type: video_screen
event_label: click_on_screen
is_ready: false
time: 202011231958
long: 1
high: 43
length: 0
Array object:
[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"]]
I thought of several options, but none of them seems to be the best practice.
Option A: Upload the file to Google Cloud Storage and then create a table on top of that bucket. This didn't work because of the file format: BigQuery can't parse the array from the bucket.
Option B: Use my backend (Node.js) to convert the file to CSV and upload it directly to BigQuery. This failed because of latency (the real array is much longer than my example).
Option C: Use Google Apps Script to read the array object and insert it into BigQuery. I couldn't find simple code for this, and Google Cloud Storage has no API accessible from Apps Script.
Has anyone dealt with such a case and can share their solution? What is the best practice here? Code samples would be great.
Load the file from GCS into a BigQuery table with one single STRING column. So you get 100K rows and a single column.
Essentially you will have a table that holds the JSON in a string.
Use JSON_EXTRACT_ARRAY to split the JSON array into elements, then extract each position into its corresponding variable/column and write the result to a table.
here is a demo:
with t as (
  select '[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"]]' as s
),
elements as (
  select e from t, unnest(json_extract_array(t.s)) as e
)
select
  json_extract_scalar(e, '$[0]') as event_type,
  json_extract_scalar(e, '$[1]') as event_label
from elements
the output is:
event_type   | event_label
video_screen | click_on_screen
buy          | error
sign_in      | enter
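If preprocessing in a backend turns out to be acceptable after all, the array can also be converted to newline-delimited JSON, which BigQuery loads natively (e.g. via `bq load --source_format=NEWLINE_DELIMITED_JSON`). A minimal Python sketch; the column names are taken from the question's example schema and are assumptions:

```python
import json

# Two rows from the question's array object.
raw = '[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"]]'
columns = ["event_type", "event_label", "is_ready", "time", "long", "high", "length"]

# One JSON object per line: the format BigQuery's NEWLINE_DELIMITED_JSON loader expects.
ndjson = "\n".join(json.dumps(dict(zip(columns, row))) for row in json.loads(raw))
print(ndjson)
```

The resulting file can then be loaded with an autodetected or explicit schema, without any intermediate single-column table.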
Athena query is changing a few data points to 0.
During a data sanity check I found that a particular column showed a huge difference between the dashboard and the S3 files: the value displayed on the dashboard was around 40k, while reading it after downloading the file from S3 gave around 80k.
Since I am querying the data directly from S3 using Athena, the datasource is the same for Athena and the file download. I am wondering why this is happening; any help would be appreciated.
E.g., the Athena query results and the raw data in S3 differed (screenshots omitted).
Queried the data through a simple select query:
SELECT "orderid", "orderdate", "total tax"
FROM "gbc_owss"
The datatype in Athena for the total tax column was double.
EDIT: Solved the above issue. It was indeed a delimiter issue that was pushing values into the next column, making it look like Athena was changing the values, but that wasn't the case.
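To see how a stray, unquoted delimiter shifts values into the next column (the root cause noted in the EDIT), here is a small illustration using Python's csv module; the sample row values are made up:

```python
import csv
import io

# The same "total tax" value, once with an unescaped comma and once properly quoted.
bad = '1001,2023-01-05,1,234.56'        # comma inside the number, not quoted
good = '1001,2023-01-05,"1,234.56"'     # same value, quoted

bad_fields = next(csv.reader(io.StringIO(bad)))
good_fields = next(csv.reader(io.StringIO(good)))

print(bad_fields)   # ['1001', '2023-01-05', '1', '234.56'] -> 4 columns, tax truncated to 1
print(good_fields)  # ['1001', '2023-01-05', '1,234.56']    -> 3 columns as intended
```

The unquoted row parses into four fields, so the numeric column ends up holding only the part before the comma, which is exactly the kind of silent value change described above.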
I'm facing a pretty interesting task to convert an arbitrary CSV file to a JSON structure following this schema:
{
"Data": [
["value_1", "value_2"],
["value_3", "value_4"]
]
}
In this case, the input file will look like this:
value_1,value_2
value_3,value_4
The requirement is to use Azure Data Factory and I won't be able to delegate this task to Azure Functions or other services.
I'm thinking about using 'Copy data' activity but can't get my mind around the configuration. TabularTranslator seems to only work with a definite number of columns but the CSV that I can receive can contain any number of columns.
Maybe DataFlows can help me but their setup doesn't look to be an easy one either. Plus, if I get it correctly, DataFlows take more time to start up.
So, basically, I just need to take the CSV content and put it into "Data" 2d array.
Any ideas on how to accomplish this?
To achieve this requirement, using Copy data or TabularTranslator is complicated. This can be achieved using dataflows in the following way.
First create a source dataset using the following configurations. This allows us to read the entire row as a single column value (string):
Import the projection and name the column data. This is how the data preview looks:
Now split these column values using the split function in a derived column transformation. I am replacing the same column using split(data,',').
Then I added a key column with a constant value 'x' so that I can group all rows and convert the grouped data into an array of arrays.
The data looks like this after the above step:
Use aggregate transformation to group by the above created column and use collect aggregate function to create array of arrays (collect(data)).
Use select transformation to select only the above created column Data.
Finally, in the sink, select your destination and create a sink JSON dataset. Choose output to single file in settings and give a file name.
Create dataflow pipeline activity and run the above dataflow. The file will be created, and it looks like the following:
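For reference, the transformation the dataflow performs (split each row on ',', collect everything into one array of arrays under "Data") can be sketched in plain Python like this:

```python
import json

csv_text = "value_1,value_2\nvalue_3,value_4\n"

# Same logic as the dataflow: split(data, ',') per row, then collect() into one array.
rows = [row.split(",") for row in csv_text.strip().splitlines()]
result = json.dumps({"Data": rows})
print(result)  # {"Data": [["value_1", "value_2"], ["value_3", "value_4"]]}
```

Because the split happens per row, this handles any number of columns, which is what made TabularTranslator a poor fit.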
I have some csv files in my data lake which are being quite frequently updated through another process. Ideally I would like to be able to query these files through spark-sql, without having to run an equally frequent batch process to load all the new files into a spark table.
Looking at the documentation, I'm unsure as all the examples show views that query existing tables or other views, rather than loose files stored in a data lake.
You can do something like this if your csv is in S3 under the location s3://bucket/folder:
spark.sql(
"""
CREATE TABLE test2
(a string, b string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION 's3://bucket/folder'
"""
)
You will have to adapt the field names, types, and the field separator, though.
To test it, you can first run:
Seq(("1","a"), ("2","b"), ("3","a"), ("4","b")).toDF("num", "char").repartition(1).write.mode("overwrite").csv("s3://bucket/folder")
I have a BigQuery table where one of the columns (publishTs) is a TIMESTAMP. I am trying to upload a parquet file into the same table using the GCP UI BQ upload option. The file has the same column name (publishTs) with a String datatype (e.g. "2021-08-24T16:06:21.122Z"), but BigQuery rejects the load with an error.
I am generating the parquet file using Apache Spark. I tried searching on the internet but could not find the answer.
Try to generate this column as INT64 - link
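BigQuery TIMESTAMP columns accept INT64 epoch-microsecond values on load, so the ISO string can be converted before the parquet file is written. A plain-Python sketch of the conversion (the exact change on the Spark side depends on your job):

```python
from datetime import datetime, timedelta, timezone

ts = "2021-08-24T16:06:21.122Z"
# fromisoformat() in older Python versions rejects the trailing 'Z'.
dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))

# Exact integer microseconds since the Unix epoch, suitable for an INT64 parquet column.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
publish_ts_int64 = (dt - epoch) // timedelta(microseconds=1)
print(publish_ts_int64)
```

Integer timedelta division is used instead of `dt.timestamp() * 1_000_000` to avoid floating-point rounding on large microsecond values.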
I have a case where I need to read an Excel/csv/text file containing two columns (say colA and colB) of values (around 1000 rows). I need to query the database using values in colA. The query will return an XMLType into which the respective colB value needs to be inserted. I have the XML query and the insert working but I am stuck on what approach I should take to read the data, query and update it on the fly.
I have tried using external tables but realized that I don't have access to the server root to host the data file. I have also considered creating a temporary table, loading the data into it with SQL*Loader or something similar, and running the query/update against the tables, but that would require some formal overhead to arrange. I would appreciate suggestions on the approach; examples would be greatly helpful.
e.g.
text or Excel file:
ColA,ColB
abc,123
def,456
ghi,789
XMLTypeVal e.g.
<node1><node2><node3><colA></colA><colB></colB></node3></node2></node1>
UPDATE TableA
SET XMLTypeVal =
    INSERTCHILDXML(XMLTypeVal,
        '/node1/node2/node3', 'colBval',
        XMLType('<colBval>123</colBval>'))
WHERE EXTRACTVALUE(TableA.XMLTypeVal, '/node1/node2/node3/ColA') = 'colAval';