Create different dataframes based on a field value in Spark/Scala - apache-spark

I have a dataframe in the format below with two fields. One field contains a code and the other contains XML.
EventCd|XML_VALUE
1.3.6.10|<nt:SNMP>
<nt:var id="1.3.0" type="STRING"> MESSAGE </nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:SNMP>
1.3.6.11|<nt:SNMP>
<nt:var id="1.3.1" type="STRING"> CALL </nt:var>
<nt:var id="1.3.2" type="STRING">XX-AC-EF</nt:var>
</nt:SNMP>
Based on the value in the code field, I want to create a different dataframe conditionally and write the data to the corresponding HDFS folder.
If the code is 1.3.6.10, it should create a message dataframe and place the files under the ../message/ HDFS folder; if the code is 1.3.6.11, it should create a call dataframe and write the data into the call HDFS folder, like ../call/
I am able to create the dataframes using multiple filter operations, but is there a way to use only one dataframe and a single corresponding HDFS write command?
Can someone suggest how I can do this in Spark/Scala, please?
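To make the routing rule concrete, here is a minimal plain-Python sketch of it (not Spark code). The CODE_TO_FOLDER lookup and the route_rows helper are hypothetical names, and the XML values are placeholders. In Spark, the same idea is usually expressed by deriving a column from the code and passing it to DataFrameWriter.partitionBy, which writes one subfolder per value (named column=value) in a single write.

```python
from collections import defaultdict

# Hypothetical lookup from event code to folder name, taken from the two cases in the question.
CODE_TO_FOLDER = {"1.3.6.10": "message", "1.3.6.11": "call"}


def route_rows(rows, base_path=".."):
    """Group (event_cd, xml_value) rows by the folder they should be written to."""
    routed = defaultdict(list)
    for event_cd, xml_value in rows:
        folder = CODE_TO_FOLDER.get(event_cd)
        if folder is None:
            continue  # unknown codes are skipped in this sketch
        routed[f"{base_path}/{folder}/"].append(xml_value)
    return dict(routed)


# Placeholder XML payloads standing in for the XML_VALUE column.
rows = [
    ("1.3.6.10", "<nt:SNMP>message payload</nt:SNMP>"),
    ("1.3.6.11", "<nt:SNMP>call payload</nt:SNMP>"),
]
routed = route_rows(rows)
```

A single lookup keeps the code-to-folder rule in one place, so adding a new event code means adding one dictionary entry rather than another filter.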

Related

How to rename column names from lookup in ADF?

I have metadata in my Azure SQL DB / CSV file, as below, which has the old column names and data types and the new column names.
I want to rename the old fields and change their data types based on that metadata in ADF.
The idea is to store the metadata file in cache and use it in a lookup, but I am not able to do it in the data flow expression builder. Any idea which transformation to use or how I should do it?
I have reproduced the above and was able to change the column names and data types as follows.
This is the sample CSV file I took from blob storage, which holds the table's metadata.
In your case, take care with the new data types: if they are not compatible with the data already in the table, the script will generate an error.
Create a dataset for it, give it to a Lookup activity, and leave the first row option unchecked so that all rows are returned.
This is my sample SQL table:
Pass the Lookup output array to a ForEach activity.
Inside the ForEach, use a Script activity to execute the script that changes the column name and data type.
Script:
EXEC SP_RENAME 'mytable2.@{item().OldName}', '@{item().NewName}', 'COLUMN';
ALTER TABLE mytable2
ALTER COLUMN @{item().NewName} @{item().Newtype};
After executing this, my SQL table reflects the changes.
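To see what the Script activity ends up running for each ForEach item, here is a minimal Python sketch that renders the same two statements per metadata row. The render_rename_script helper and the sample rows are hypothetical; the real values come from the Lookup output, and the field names OldName, NewName and Newtype are the ones used in the script above.

```python
def render_rename_script(table, item):
    """Render the rename and type-change statements for one lookup row."""
    return (
        f"EXEC SP_RENAME '{table}.{item['OldName']}', '{item['NewName']}', 'COLUMN';\n"
        f"ALTER TABLE {table}\n"
        f"ALTER COLUMN {item['NewName']} {item['Newtype']};"
    )


# Hypothetical metadata rows standing in for the Lookup output array.
lookup_rows = [
    {"OldName": "id_old", "NewName": "id", "Newtype": "int"},
    {"OldName": "nm", "NewName": "name", "Newtype": "varchar(50)"},
]
scripts = [render_rename_script("mytable2", row) for row in lookup_rows]
```

Note that the rename runs first, so the ALTER COLUMN statement must refer to the new name.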

How to convert CSV to a nested JSON array using Azure Data Factory?

I'm facing a pretty interesting task to convert an arbitrary CSV file to a JSON structure following this schema:
{
"Data": [
["value_1", "value_2"],
["value_3", "value_4"]
]
}
In this case, the input file will look like this:
value_1,value_2
value_3,value_4
The requirement is to use Azure Data Factory and I won't be able to delegate this task to Azure Functions or other services.
I'm thinking about using 'Copy data' activity but can't get my mind around the configuration. TabularTranslator seems to only work with a definite number of columns but the CSV that I can receive can contain any number of columns.
Maybe DataFlows can help me but their setup doesn't look to be an easy one either. Plus, if I get it correctly, DataFlows take more time to start up.
So, basically, I just need to take the CSV content and put it into "Data" 2d array.
Any ideas on how to accomplish this?
Achieving this with the Copy data activity or TabularTranslator is complicated. It can be done with data flows in the following way.
First, create a source dataset using the following configuration. This allows us to read an entire row as a single column value (string):
Import the projection and name the column data. The following is how the data preview looks:
Now split these column values using the split function in a derived column transformation. I am replacing the same column using split(data,',').
Then I added a key column with the constant value 'x' so that I can group all rows and convert the grouped data into an array of arrays.
The data would look like this after the above step:
Use an aggregate transformation to group by the column created above, and use the collect aggregate function to create the array of arrays (collect(data)).
Use a select transformation to select only the column Data created above.
Finally, in the sink, select your destination and create a sink JSON dataset. Choose output to single file in the settings and give a file name.
Create a data flow activity in a pipeline and run the above data flow. The output file will be created with the rows collected into the "Data" array of arrays.
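For reference, the transformation the data flow performs can be sketched in a few lines of plain Python. This is only an illustration of the logic (split every row, then collect all rows under "Data"), not something that runs inside ADF; the csv_to_nested_json name is hypothetical. Like split(data,','), it does not handle quoted values that contain commas.

```python
import json


def csv_to_nested_json(csv_text):
    """Split each non-empty row on commas and collect all rows under "Data"."""
    rows = [line.split(",") for line in csv_text.splitlines() if line]
    return json.dumps({"Data": rows})


# The two-row input from the question.
result = csv_to_nested_json("value_1,value_2\nvalue_3,value_4\n")
```

Because each row is split independently, the number of columns does not need to be known in advance.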

How to get schema while dynamically creating external table in synapse serverless sql

I want to get the schema through the Get Metadata activity so that it can be passed as an output to a stored procedure.
If the dataset you provide to the Get Metadata activity points only to a folder, the structure property will not appear in the field list.
It will show only folder properties such as item name and item type.
To get the structure property, you need to give the file name in the dataset and check First row as header.
Now you can see file properties such as structure and column count in the activity.
Sample Output of a Blob file:
You can pass this JSON string to your Stored Procedure activity.

Load single column from csv file

I have a CSV file that contains a large number of columns. I want to load just one column from that file using Spark.
I know that we can use a select statement to filter a column. But what I want is for the read operation itself to load just one column.
That way I should be able to avoid the extra memory used by the other columns. Is there any way to do this?
Spark will load the complete file and parse it for columns. As you mentioned, you can use select to restrict the columns in the dataframe, so the dataframe will have only one column.
Spark will read the complete file and then filter down to the column you want with the help of the select statement you mentioned.
This is because every read operation in Spark reads and scans the whole file: a distributed stream reader is created (the reader is instantiated on every node where the data is stored).
If your goal is to read the data column-wise, you can store the file in Parquet format and read that instead. Parquet is columnar storage and is meant for exactly this type of use case (you can verify it using explain).
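As a rough analogy in plain Python (not Spark), the sketch below keeps a single column from a CSV. Even though only one column is returned, every row still has to be read and parsed, because a CSV stores values row by row. The read_single_column helper and the sample data are hypothetical.

```python
import csv
import io


def read_single_column(csv_file, column):
    """Stream a CSV and keep one column; every row is still read and parsed."""
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader]


# Hypothetical sample data standing in for a wide CSV file.
sample = io.StringIO("id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n")
names = read_single_column(sample, "name")
```

A columnar format avoids this because the values of each column are stored together and can be read without touching the others.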

Get column names from Hive to CSV using Airflow hook using python

I am using Airflow hiveserver2 hook to get results from Hive table and load into CSV. The hook to_csv function has a parameter 'output_headers'. If set to true, it gets column names in the form of tablename.columnname along with data and writes to a CSV file. In the CSV header I just need the column names and need to get rid of the tablename from tablename.columnname. Can I override the parameter somehow to just get column names? Is there any other way to just retrieve column names using HiveServer2Hook?
I have connected to Hive using HiveServer2Hook. I have also executed the hook's to_csv function. I just need to change the format of the column names returned by the function. Here is the link to the hook. You can find the to_csv, get_records and get_results functions under HiveServer2Hook.
https://airflow.apache.org/_modules/airflow/hooks/hive_hooks.html
I also tried running 'describe tablename' and 'show columns from tablename' as HQL, but the hive hook's get_records and get_results functions break on a header issue because the result returned by 'describe' and 'show columns' is not in the expected format.
I tried the following:
1) describe tablename;
2) show columns from tablename;
The Airflow hook has the functions get_records and get_results. Both break on the following line when I use the above HQL statements.
header = next(results_iter)
Is there any other way to get column names, write to CSV and pull data using HiveServer2Hook and Python?
I ran into the same problem, and here is what I found to be an easier way to do it.
Pass the hive_conf parameter below to the to_csv(..) method
hive_conf={"hive.resultset.use.unique.column.names": "false"}
This will suppress the table name before the column name.
Use HiveMetastoreHook's get_table(..) function to get the exact column names as follows
# imports
from airflow.hooks.hive_hooks import HiveMetastoreHook
from hmsclient.genthrift.hive_metastore import ttypes
from typing import List
# create hook
hive_metastore_hook: HiveMetastoreHook = HiveMetastoreHook(metastore_conn_id="my-hive-metastore-conn-id")
# fetch table object
table: ttypes.Table = hive_metastore_hook.get_table(table_name="my_table_name", db="my_db_name")
# determine column names
column_names: List[str] = [field_schema.name for field_schema in table.sd.cols]
..
After this you must subclass HiveServer2Hook to modify the to_csv(..) method. In particular, changing the header value to the column_names extracted above should suffice.
Alternatively, if you do not wish to subclass HiveServer2Hook, you can just implement its to_csv(..) separately (such as in a hive_utils.py file) and achieve the same behaviour.
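A minimal sketch of what such a standalone helper could look like, using only the standard library. The names strip_table_prefix and write_csv are hypothetical (they are not part of Airflow), and the header and rows here are stand-ins for whatever the hook returns.

```python
import csv
import io


def strip_table_prefix(header):
    """Turn 'tablename.columnname' entries into bare column names."""
    return [name.split(".", 1)[-1] for name in header]


def write_csv(out, column_names, rows, delimiter=","):
    """Write a header of bare column names followed by the data rows."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(column_names)
    writer.writerows(rows)


# Hypothetical header and rows standing in for the query result.
buffer = io.StringIO()
write_csv(buffer, strip_table_prefix(["my_table.id", "my_table.name"]), [(1, "a"), (2, "b")])
output = buffer.getvalue()
```

The column_names list taken from the metastore could be passed to write_csv directly instead of stripping the prefix from the returned header.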
Most queries that throw this error have their own function from the HiveMetastoreHook object that can provide the correct result. Most of these have to do with table and partition metadata.
