Adding custom metadata to DataFrame schema using Iceberg table format - apache-spark

I'm adding custom metadata to the DataFrame schema in my PySpark application using StructField's metadata field.
It worked fine when I wrote Parquet files directly to S3: the custom metadata was available when reading those files back, as expected.
But it doesn't work with the Iceberg table format. There is no error, but the metadata on df.schema.fields is always empty.
Is there a way to solve this?

Solved by making sure the metadata key is always 'comment'.
For example:
{'comment': 'my_metadata_info_field'}
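
A minimal sketch of what the fix looks like in practice, assuming an existing SparkSession spark configured with an Iceberg catalog; the catalog, namespace, and table names below are placeholders, not from the original post:

from pyspark.sql.types import StructField, StructType, StringType

# The metadata only carries through when the key is 'comment'
schema = StructType([
    StructField("user_id", StringType(), True,
                metadata={"comment": "my_metadata_info_field"}),
])

df = spark.createDataFrame([("u1",)], schema)

# Write to an Iceberg table (placeholder identifier)
df.writeTo("my_catalog.db.events").createOrReplace()

# Read back: the comment is visible again in the column metadata
print(spark.table("my_catalog.db.events").schema.fields[0].metadata)

The 'comment' key is what Spark maps to a column comment, which appears to be why other metadata keys don't survive the round trip through the table format.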

Related

Write DataFrame to Hive table in Spark

Could you please tell me if this command could cause problems by overwriting all tables in the DB:
df.write.option("path", "path_to_the_db/hive/").mode("overwrite").saveAsTable("result_data")
result_data is a new table in the DB; it didn't exist before.
After these commands, all tables disappeared.
I was using Spark 3 and was trying to solve this error:
Can not create the managed table('result_data').
The associated location('dbfs:/user/hive/warehouse/result_data') already exists.
I expected that a new table would be created without any issues if it doesn't already exist.
If path_to_the_db/hive contains other tables and you overwrite into that folder, it seems possible that the whole directory would be emptied first, yes. Perhaps you should instead use path_to_the_db/hive/result_data.
According to the error, though, your table does already exist.
You can also use Spark to register a temporary view, then run an INSERT OVERWRITE query against the existing table, as in the sketch below.
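
A quick sketch of both options, assuming an existing SparkSession spark; the paths and names are the placeholders from the question:

# Option 1: give the table its own sub-directory instead of the database root
df.write.option("path", "path_to_the_db/hive/result_data") \
    .mode("overwrite") \
    .saveAsTable("result_data")

# Option 2: register a temporary view and overwrite an existing table in SQL
df.createOrReplaceTempView("result_data_tmp")
spark.sql("INSERT OVERWRITE TABLE result_data SELECT * FROM result_data_tmp")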

How to read Hive managed table data using Spark?

I am able to read a Hive external table using spark-shell, but when I try to read data from a Hive managed table it only shows the column names.
Please find queries here:
Could you please try using the database name along with the table name?
sql("select * from db_name.test_managed")
If the result is still the same, could you please share the output of DESCRIBE FORMATTED for both tables?
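
For reference, the same checks written for PySpark rather than spark-shell, assuming a SparkSession spark built with Hive support enabled; test_external is a placeholder for the external table's actual name:

spark.sql("select * from db_name.test_managed").show()

# If the result is still only column names, compare both table definitions
spark.sql("describe formatted db_name.test_managed").show(truncate=False)
spark.sql("describe formatted db_name.test_external").show(truncate=False)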

How to overwrite data with PySpark's JDBC without losing schema?

I have a DataFrame that I want to write to a PostgreSQL database. If I simply use the "overwrite" mode, like:
df.write.jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)
The table is recreated and the data is saved. But the problem is that I'd like to keep the PRIMARY KEY and indexes on the table. So, I'd like to either overwrite only the data, keeping the table schema, or add the primary key constraint and indexes afterward. Can either be done with PySpark? Or do I need to connect to PostgreSQL and execute the commands to add the indexes myself?
The default behavior for mode="overwrite" is to first drop the table, then recreate it with the new data. You can instead truncate the existing data by including option("truncate", "true") and then push your own:
df.write.option("truncate", "true").jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)
This way, you are not recreating the table so it shouldn't make any modifications to your schema.
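
If you did let Spark drop and recreate the table instead, re-adding the constraint and indexes afterward would have to happen over a plain database connection outside Spark. A rough sketch with psycopg2 (the library choice, connection URL, and table/column names are assumptions, not from the original answer):

import psycopg2

# psycopg2 needs a libpq-style URL (postgresql://...), not the JDBC URL used by Spark
conn = psycopg2.connect("postgresql://user:password@host:5432/mydb")
with conn, conn.cursor() as cur:
    cur.execute("ALTER TABLE my_table ADD PRIMARY KEY (id)")
    cur.execute("CREATE INDEX IF NOT EXISTS my_table_created_idx ON my_table (created_at)")
conn.close()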

NiFi - Loading XML data into Cassandra

I am trying to insert XML data into a Cassandra DB. Can somebody please suggest the flow in NiFi? I have JMS, on which I need to post the message data, then consume it and insert the data into Cassandra.
I'm not sure if you can directly ingest XML into Cassandra. However, you could convert the XML to JSON using the TransformXml processor (and this XSLT), or, as of NiFi 1.2.0, you can use ConvertRecord by specifying the input and output schemas.
If there are multiple XML records per flow file and you need one CQL statement per record, you may need SplitJson or SplitRecord after the XML-to-JSON conversion has taken place.
Then you can use ReplaceText to form a CQL statement to insert the JSON, then PutCassandraQL to push to Cassandra. Alternatively you can use CQL map syntax to insert into a map field, etc.

SchemaCrawler reading data from a table

I understand we can read data from a table using a command in SchemaCrawler.
How do I do that programmatically in Java? I could see examples for reading schemas, tables, etc., but how do I get the data?
Thanks in advance.
SchemaCrawler allows you to obtain database metadata, including result set metadata. Standard JDBC gives you a way to get data using java.sql.ResultSet, and you can use SchemaCrawler to obtain result set metadata with schemacrawler.utility.SchemaCrawlerUtility.getResultColumns(ResultSet).
Sualeh Fatehi, SchemaCrawler
