I am using Airflow's HiveServer2Hook to get results from a Hive table and load them into a CSV file. The hook's to_csv function has a parameter 'output_header'. If set to True, it writes the column names in the form tablename.columnname, along with the data, to the CSV file. In the CSV header I need only the column names, without the tablename prefix. Can I override the parameter somehow to get just the column names? Is there any other way to retrieve only the column names using HiveServer2Hook?
I have connected to Hive using HiveServer2Hook and have also executed the hook's to_csv function. I just need to change the format of the column names the function returns. Here is the link to the hook. You can find the to_csv, get_records and get_results functions under HiveServer2Hook.
https://airflow.apache.org/_modules/airflow/hooks/hive_hooks.html
I also tried running 'describe tablename' and 'show columns from tablename' as HQL, but the hook's get_records and get_results functions break on the header because the result returned by these statements is not in the expected format.
I tried the following:
1) describe tablename;
2) show columns from tablename;
Both get_records and get_results fail on the following line when I use the above HQL statements:
header = next(results_iter)
Is there any other way to get column names, write to CSV and pull data using HiveServer2Hook and Python?
I ran into the same problem, and here is what I found to be an easier way to do it.
Pass the hive_conf parameter below to the to_csv(..) method:
hive_conf={"hive.resultset.use.unique.column.names": "false"}
This suppresses the table name in front of each column name.
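A rough sketch of how that call might look. The helper name export_without_table_prefix, the query, and the file path are made up for illustration, and the keyword names follow the Airflow 1.10 to_csv signature, so check them against your version:

```python
def export_without_table_prefix(hook, hql, csv_filepath):
    """Write query results to CSV with plain column names in the header.

    `hook` is assumed to be an airflow.hooks.hive_hooks.HiveServer2Hook
    (any object exposing a compatible to_csv method works).
    """
    hook.to_csv(
        hql=hql,
        csv_filepath=csv_filepath,
        output_header=True,
        # Ask Hive to return "col" instead of "table.col" in the result set.
        hive_conf={"hive.resultset.use.unique.column.names": "false"},
    )


# Hypothetical usage (connection id, query and path are placeholders):
# from airflow.hooks.hive_hooks import HiveServer2Hook
# hook = HiveServer2Hook(hiveserver2_conn_id="my-hiveserver2-conn-id")
# export_without_table_prefix(hook, "SELECT * FROM my_table", "/tmp/my_table.csv")
```

Passing the hook in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real cluster.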
Alternatively, use HiveMetastoreHook's get_table(..) function to get the exact column names, as follows:
# imports
from airflow.hooks.hive_hooks import HiveMetastoreHook
from hmsclient.genthrift.hive_metastore import ttypes
from typing import List
# create hook
hive_metastore_hook: HiveMetastoreHook = HiveMetastoreHook(metastore_conn_id="my-hive-metastore-conn-id")
# fetch table object
table: ttypes.Table = hive_metastore_hook.get_table(table_name="my_table_name", db="my_db_name")
# determine column names
column_names: List[str] = [field_schema.name for field_schema in table.sd.cols]
..
After this you must subclass HiveServer2Hook to modify the to_csv(..) method. In particular, changing the header value to the column_names extracted above should suffice.
Alternatively, if you do not wish to subclass HiveServer2Hook, you can implement its to_csv(..) separately (for example in a hive_utils.py file) and achieve the same behaviour.
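A minimal sketch of what such a standalone helper could look like. The function names are hypothetical, and it assumes the header is a list of cursor-description tuples with the column name first, which is how the hook's get_results appears to return it in Airflow 1.10:

```python
import csv


def plain_column_names(header):
    """Return column names with any 'table.' prefix removed.

    `header` is assumed to be a list of cursor-description tuples
    whose first element is the column name.
    """
    return [col[0].split(".")[-1] for col in header]


def write_csv(csv_filepath, column_names, rows, delimiter=","):
    """Write one header row followed by the data rows."""
    with open(csv_filepath, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow(column_names)
        writer.writerows(rows)


# Hypothetical usage with a HiveServer2Hook instance named `hook`:
# results = hook.get_results("SELECT * FROM my_table")  # assumed keys: "header", "data"
# write_csv("/tmp/my_table.csv", plain_column_names(results["header"]), results["data"])
```

The column_names list taken from the metastore could be passed to write_csv in place of plain_column_names(...).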
For most queries that throw this error, the HiveMetastoreHook object has a dedicated function that returns the correct result. Most of these queries concern table and partition metadata.
Related
I have metadata in my Azure SQL db /csv file as below which has old column name and datatypes and new column names.
I want to rename and change the data type of oldfieldname based on those metadata in ADF.
The idea is to store the metadata file in cache and use this in lookup but I am not able to do it in data flow expression builder. Any idea which transform or how I should do it?
I reproduced the above and was able to change the column names and data types as shown below.
This is the sample CSV file I took from blob storage, which holds the metadata of the table.
In your case, take care with the new data types: if a type is not compatible with the data already in the table, the change will raise an error.
Create a dataset and give it to a Lookup activity, leaving the Lookup's 'First row only' option unchecked so that it returns all rows.
This is my sample SQL table:
Give the lookup output array to ForEach.
Inside ForEach use script activity to execute the script for changing column name and Datatype.
Script:
EXEC SP_RENAME 'mytable2.@{item().OldName}', '@{item().NewName}', 'COLUMN';
ALTER TABLE mytable2
ALTER COLUMN @{item().NewName} @{item().Newtype};
Execute this and below is my SQL table with changes.
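To preview what the ForEach loop would send to the database, the script template can be expanded offline. This is only an illustrative sketch: the helper name and the sample metadata row are invented, and it does no quoting or validation, so it is suitable for trusted metadata only.

```python
def render_statements(table, rows):
    """Build the rename/alter statements the Script activity would run per row."""
    statements = []
    for row in rows:
        statements.append(
            f"EXEC SP_RENAME '{table}.{row['OldName']}', '{row['NewName']}', 'COLUMN';"
        )
        statements.append(
            f"ALTER TABLE {table} ALTER COLUMN {row['NewName']} {row['Newtype']};"
        )
    return statements


# Hypothetical metadata row, shaped like one item of the Lookup output.
metadata = [{"OldName": "id", "NewName": "customer_id", "Newtype": "INT"}]
for statement in render_statements("mytable2", metadata):
    print(statement)
```

Reviewing the generated statements first makes it easier to spot a type that would not fit the existing data.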
I'm facing a pretty interesting task to convert an arbitrary CSV file to a JSON structure following this schema:
{
"Data": [
["value_1", "value_2"],
["value_3", "value_4"]
]
}
In this case, the input file will look like this:
value_1,value_2
value_3,value_4
The requirement is to use Azure Data Factory and I won't be able to delegate this task to Azure Functions or other services.
I'm thinking about using 'Copy data' activity but can't get my mind around the configuration. TabularTranslator seems to only work with a definite number of columns but the CSV that I can receive can contain any number of columns.
Maybe DataFlows can help me but their setup doesn't look to be an easy one either. Plus, if I get it correctly, DataFlows take more time to start up.
So, basically, I just need to take the CSV content and put it into "Data" 2d array.
Any ideas on how to accomplish this?
To achieve this requirement, using Copy data or TabularTranslator is complicated. This can be achieved using dataflows in the following way.
First create a source dataset using the following configurations. This allows us to read entire row as a single column value (string):
Import the projection and name the column data. This is what the data preview looks like:
Now split these column values using the split function in a derived column transformation. I am replacing the same column using split(data,',').
Then I added a key column with the constant value 'x' so that I can group all rows and convert the grouped data into an array of arrays.
The data would look like this after the above step:
Use aggregate transformation to group by the above created column and use collect aggregate function to create array of arrays (collect(data)).
Use select transformation to select only the above created column Data.
Finally, in the sink, select your destination and create a sink JSON dataset. Choose output to single file in settings and give a file name.
Create dataflow pipeline activity and run the above dataflow. The file will be created, and it looks like the following:
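The requirement is to do this in Data Factory, so the following is not a replacement for the data flow. It is only a small sketch of the same logic (split each row on commas, then collect all rows into one array), which can help when checking the expected output. The function name is made up.

```python
import json


def csv_to_data_json(csv_text):
    """Split every non-empty line on ',' and collect the rows under "Data"."""
    rows = [line.split(",") for line in csv_text.splitlines() if line]
    return json.dumps({"Data": rows})


print(csv_to_data_json("value_1,value_2\nvalue_3,value_4\n"))
# {"Data": [["value_1", "value_2"], ["value_3", "value_4"]]}
```

Like split(data,',') in the data flow, this plain split does not handle quoted values that contain commas.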
I'm querying an API using Azure Data Factory and the data I receive from the API looks like this.
{
"96":"29/09/2022",
"95":"31/08/2022",
"93":"31/07/2022"
}
When I come to write this data to a table, ADF assumes the column names are the numbers and the dates are stored as rows like this
96          95          93
29/09/2022  31/08/2022  31/07/2022
when I would like it to look like this

Date        ID
29/09/2022  96
31/08/2022  95
31/07/2022  93

Does anyone have any suggestions on how to handle this? Ideally I want to avoid using USPs and dynamic SQL. I really only need the ID for the month before the current one.
PS - API doesn't support any filtering on this object
Updates
I'm querying the API using a web activity, and if I try to store the data in an Array variable the activity fails because the output is an object.
When I use a copy data activity with the sink set to automatically create the table, the mapping looks like this:
[image: mapping]
Thanks
Instead of trying to copy the JSON response directly to the SQL table, convert the response to a string, extract the required values, and insert them into the SQL table.
Look at the following demonstration. I have taken the sample response provided as a parameter (object type). I used set variable activity for extracting the values.
My parameter:
{"93":"31/07/2022","95":"31/08/2022","96":"29/09/2022"}
Dynamic content used in set variable activity:
@split(replace(replace(replace(replace(string(pipeline().parameters.api_op),' ',''),'"',''),'{',''),'}',''),',')
The output for set variable activity will be:
Now inside For each activity (pass the previous variable value as items value in for each), I used copy data to copy each row separately to my sink table (Auto create option enabled). I have taken a sample json file as my source (We are going to ignore all the columns anyway.)
Create the required 2 additional columns called id and date with the following dynamic content:
id:
@split(item(),':')[0]
date:
@split(item(),':')[1]
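For reference, here is roughly what the Set variable expression and the two additional-column expressions compute, written as a small Python sketch. The function name is made up, and it assumes the pipeline's string(...) conversion yields compact JSON text.

```python
import json


def to_id_date_rows(api_response):
    """Mimic the Set variable expression, then the per-item split on ':'."""
    # Assumption: string(<object>) in the pipeline produces compact JSON text.
    text = json.dumps(api_response, separators=(",", ":"))
    # Same four replace(...) calls as in the dynamic content.
    for ch in (" ", '"', "{", "}"):
        text = text.replace(ch, "")
    items = text.split(",")  # each item looks like "93:31/07/2022"
    return [
        {"id": item.split(":")[0], "date": item.split(":")[1]}
        for item in items
    ]


rows = to_id_date_rows({"93": "31/07/2022", "95": "31/08/2022", "96": "29/09/2022"})
for row in rows:
    print(row["id"], row["date"])
```

The approach relies on the values containing no commas or colons, which holds for dates in this dd/MM/yyyy format.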
Configure the sink. Select the database, create dataset, give a name for table (I have given dbo.opTable) and select Auto create table under sink settings.
The following is an image of mapping. Delete the column mappings which are not required and only use additional columns created above.
When I debug the pipeline, it will run successfully, and the required values are inserted into the table. The following is output sink table for reference.
I have a pyspark dataframe currently from which I initially created a delta table using below code -
df.write.format("delta").saveAsTable("events")
Since this dataframe is populated with new data on a daily basis, I used the syntax below to append the new records to the delta table -
df.write.format("delta").mode("append").saveAsTable("events")
I did all of this in Databricks, on my own cluster. I want to know how I can write generic pyspark code in Python that will create the delta table if it does not exist and append records if it does. I want this because if I give my Python package to someone else, they will not have the same delta table in their environment, so it should be created dynamically from code.
If you don't have the Delta table yet, it will be created when you use the append mode. So you don't need to write any special code to handle the cases where the table doesn't exist yet and where it already exists.
P.S. You'll need such code only if you're performing a merge into the table, not an append. In that case the code will look like this:
if table_exists:
    do_merge
else:
    df.write....
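One way this pattern could be fleshed out is sketched below. It is not taken from the answer: the function name, the aliases, and the merge condition are illustrative, and it assumes PySpark 3.3+ (for spark.catalog.tableExists) and the delta-spark package.

```python
def upsert_or_create(spark, df, table_name, merge_condition):
    """Merge df into table_name if the table exists, otherwise create it from df."""
    if spark.catalog.tableExists(table_name):
        # Imported here so the create path does not need the delta Python API.
        from delta.tables import DeltaTable

        (
            DeltaTable.forName(spark, table_name)
            .alias("target")
            .merge(df.alias("source"), merge_condition)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    else:
        # First run: no table yet, so create it from the DataFrame.
        df.write.format("delta").saveAsTable(table_name)


# Hypothetical usage; event_id is a placeholder key column:
# upsert_or_create(spark, df, "events", "target.event_id = source.event_id")
```

Taking the merge condition as a parameter keeps the helper independent of any particular table's key columns.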
P.S. here is a generic implementation of that pattern
There are essentially two operations available in Spark:
saveAsTable: creates the table if it does not exist. If the table already exists, the result depends on the save mode: the default mode raises an error, 'append' adds the rows, and 'overwrite' replaces the table with the current DataFrame.
insertInto: succeeds only if the table is already present in the database, and then performs the operation according to the mode ('overwrite' or 'append').
With mode('overwrite'), .saveAsTable("events") rewrites the table every time you call it, so whether or not a table was present earlier, it ends up holding only the current DataFrame. To be on the safe side, you can instead do the following:
Step 1: Create the table only if it does not already exist (with the DataFrame's schema and no rows), then append the new DataFrame records.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS events using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
This way, each run first checks whether the table is available and creates it if it is not, then appends the data to the table.
I have a case where I need to read an Excel/csv/text file containing two columns (say colA and colB) of values (around 1000 rows). I need to query the database using values in colA. The query will return an XMLType into which the respective colB value needs to be inserted. I have the XML query and the insert working but I am stuck on what approach I should take to read the data, query and update it on the fly.
I have tried using external tables but realized that I don't have access to the server root to host the data file. I have also considered creating a temporary table to load the data to using SQL Loader or something similar and run the query/update within the tables. But that would need some formal overhead to go through. I would appreciate suggestions on the approach. Examples would be greatly helpful.
e.g.
text or Excel file:
ColA,ColB
abc,123
def,456
ghi,789
XMLTypeVal e.g.
<node1><node2><node3><colA></colA><colB></colB></node3></node2></node1>
UPDATE TableA SET XMLTypeVal =
INSERTCHILDXML(XMLTypeVal,
'/node1/node2/node3', 'colBval',
XMLType('<colBval>123</colBval>'))
WHERE EXTRACTVALUE(TableA.XMLTypeVal, 'node1/node2/node3/ColA') = ('colAval');