Since Spark 3.2, there is an interesting feature available from Parquet: Parquet Columnar Encryption.
The documentation is pretty clear on how to specify which key to use for a specific column in the DataFrame schema. For example:
squaresDF.write
  .option("parquet.encryption.column.keys", "keyA:square")
if we want to encrypt a column called square with a key identified by the keyA tag in our KMS system.
The problem is: how do I specify the column name if my column is an array of a struct type?
For example
myDF.printSchema
root
|-- int_column: integer (nullable = false)
|-- square_int_column: double (nullable = false)
|-- more: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- description: string (nullable = true)
How can I specify the key for the column more, or for the column more.name? Is it supported? I cannot find anything in the Parquet or Spark documentation about this.
After some research, I decided to explore a generated Parquet file with parquet-tools in order to understand how arrays of structs are organised in the file.
So, after creating a Parquet file with the needed schema, I inspected it with:
java -jar ~/parquet-tools-1.11.0.jar meta <my-parquet-file-path> | less
Checking the metadata of the columns, I found:
[...cut...]
more: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL F:8
...name: OPTIONAL BINARY L:STRING R:1 D:4
[...cut...]
So, to encrypt that nested field, we need to specify the column using its full flattened Parquet path:
myDF.write
  .option("parquet.encryption.column.keys", "keyA:more.list.element.name")
Similar to this question, I want to add a column to my PySpark DataFrame containing nothing but an empty map. If I use the suggested answer from that question, however, the type of the map is <null,null>, unlike in the answer posted there.
from pyspark.sql.functions import create_map
spark.range(1).withColumn("test", create_map()).printSchema()
root
|-- test: map (nullable = false)
| |-- key: null
| |-- value: null (valueContainsNull = false)
I need an empty <string,string> map. I can do it in Scala like so:
import org.apache.spark.sql.functions.typedLit
spark.range(1).withColumn("test", typedLit(Map[String, String]())).printSchema()
root
|-- test: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
How can I do it in PySpark? I am using Spark 3.0.1 with underlying Scala 2.12 on Databricks Runtime 7.3 LTS. I need the <string,string> map because otherwise I can't save my DataFrame to Parquet:
AnalysisException: Parquet data source does not support map<null,null> data type.;
You can cast the map to the appropriate type after creating it with create_map:
from pyspark.sql.functions import create_map
spark.range(1).withColumn("test", create_map().cast("map<string,string>")).printSchema()
root
|-- id: long (nullable = false)
|-- test: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
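With that cast in place, the Parquet write that was failing should go through. A quick sketch, where the output path is just a placeholder:
from pyspark.sql.functions import create_map

df = spark.range(1).withColumn("test", create_map().cast("map<string,string>"))
# map<string,string> is a type the Parquet writer accepts, so this no longer
# raises "Parquet data source does not support map<null,null> data type"
df.write.mode("overwrite").parquet("/tmp/empty_map_demo")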
Spark Cassandra Connector write fails with a java.lang.IllegalArgumentException: udtId is not a field defined in this definition error when using case-sensitive field names
I need the fields in the Cassandra table to maintain case, so I have used quotes to create them.
My Cassandra schema
CREATE TYPE my_keyspace.my_udt (
"udtId" text,
"udtValue" text
);
CREATE TABLE my_keyspace.my_table (
"id" text PRIMARY KEY,
"someCol" text,
"udtCol" list<frozen<my_udt>>
);
My Spark DataFrame schema is
root
|-- id: string (nullable = true)
|-- someCol: string (nullable = true)
|-- udtCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- udtId: string (nullable = true)
| | |-- udtValue: string (nullable = true)
Are there any other options to get this write to work, other than defining my UDT with lowercase names? Making them lowercase would force me to invoke case-management code everywhere this is used, and I'd like to avoid that.
Because I couldn't write successfully, I haven't tried reads yet. Is this an issue with reads as well?
You need to upgrade to Spark Cassandra Connector 2.5.0. I can't find the specific commit that fixes it, or the specific Jira ticket that mentions it; I suspect it was fixed in the DataStax version first and then released as part of the merge announced here.
Here is how it works in SCC 2.5.0 + Spark 2.4.6, while it fails with SCC 2.4.2 + Spark 2.4.6:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val data = spark.read.cassandraFormat("my_table", "test").load()
data: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> val data2 = data.withColumn("id", concat(col("id"), lit("222")))
data2: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> data2.write.cassandraFormat("my_table", "test").mode("append").save()
I have a huge dataset with a messy, inconsistent schema. The same field can hold different types of data; for example, data.tags can be a list of strings or a list of objects.
I tried to load the JSON data from HDFS and print the schema, but I get the error below.
TypeError: Can not merge type <class 'pyspark.sql.types.ArrayType'> and <class 'pyspark.sql.types.StringType'>
Here is the code:
import json
data_json = sc.textFile(data_path)
data_dataset = data_json.map(json.loads)
data_dataset_df = data_dataset.toDF()
data_dataset_df.printSchema()
Is it possible to get a schema like
root
|-- children: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: boolean (valueContainsNull = true)
| |-- element: string
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- occupation: string (nullable = true)
in this case?
If I understand correctly, you're asking how to infer the schema of a JSON file. You should look at reading the JSON into a DataFrame directly, instead of going through a Python mapping function.
Also, I'm referring you to How to infer schema of JSON files?, as I think it answers your question.
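As a rough sketch of what that looks like, reusing the data_path variable from your snippet:
# Let Spark's JSON reader scan the files and infer the schema itself, instead of
# json.loads + toDF(). When a field's type varies across records, the reader
# resolves it to a compatible type (often falling back to string) rather than
# failing the RDD type merge.
data_df = spark.read.json(data_path)
data_df.printSchema()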
My Cassandra table columns are in lowercase, like below:
CREATE TABLE model_family_by_id(
model_family_id int PRIMARY KEY,
model_family text,
create_date date,
last_update_date date,
model_family_name text
);
My DataFrame schema is like this:
root
|-- MODEL_FAMILY_ID: decimal(38,10) (nullable = true)
|-- MODEL_FAMILY: string (nullable = true)
|-- CREATE_DATE: timestamp (nullable = true)
|-- LAST_UPDATE_DATE: timestamp (nullable = true)
|-- MODEL_FAMILY_NAME: string (nullable = true)
So while inserting into Cassandra I am getting the error below:
Exception in thread "main" java.util.NoSuchElementException: Columns not found in table sample_cbd.model_family_by_id: MODEL_FAMILY_ID, MODEL_FAMILY, CREATE_DATE, LAST_UPDATE_DATE, MODEL_FAMILY_NAME
at com.datastax.spark.connector.SomeColumns.selectFrom(ColumnSelector.scala:44)
If I correctly understand the source code, the Spark Cassandra Connector wraps the column names in double quotes, so they become case-sensitive and don't match the names in the CQL definition.
You need to change the schema of your DataFrame: either run withColumnRenamed on it for every column, or use select with an alias for every column.
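For example, a minimal PySpark sketch of the select-with-alias route, assuming your DataFrame is called df (a placeholder name); it lowercases every column name so the names match the unquoted, case-insensitive CQL columns:
from pyspark.sql.functions import col

# alias every column to its lowercase name in a single select
renamed_df = df.select([col(c).alias(c.lower()) for c in df.columns])
The withColumnRenamed route is equivalent; it is just one call per column instead of a single select.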
I'm trying to create a table in Hive from a Spark job with the following data format:
{'Group1': {[start=0, end=20]: 'Data goes here'}}
The Spark DataFrame schema for this is:
MapType(StringType(),
MapType(StructType([
StructField('start', IntegerType(), False),
StructField('end', IntegerType(), False)]),
StringType()))
which displays as:
root
|-- column_1: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: struct
| | | |-- start: integer (nullable = true)
| | | |-- end: integer (nullable = true)
| | |-- value: string (valueContainsNull = true)
This seems to work just fine in Spark, but when I try to create a Hive table from this schema:
CREATE EXTERNAL TABLE test_table (
column_1 MAP<STRING, MAP<STRUCT<`start`:BIGINT,`end`:BIGINT>, STRING>>
)
STORED AS PARQUET
LOCATION 'path_to_files';
I get:
FAILED: ParseException cannot recognize input near 'STRUCT' '<' 'start' in primitive type specification
It looks like a legal table definition as far as I can tell. I can't find anything that says you can't have a struct as a map key in Hive 2.0, and Spark 2.0 handles it just fine.
In Hive the key for a Map column must be a primitive (i.e. not a Struct).
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes
I would highly recommend that you not make the key a struct. In your example, how would someone access a value in the map without knowing the exact start and end? The user would need to know both, and do they change for each row in your table?
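If you control the Spark side, one possible workaround (a sketch only, with the data field name being a placeholder) is to fold the struct key into the value, so the inner map<struct, string> becomes an array<struct<start, end, data>>, which Hive can declare as MAP<STRING, ARRAY<STRUCT<...>>>:
from pyspark.sql.types import (MapType, ArrayType, StructType, StructField,
                               StringType, IntegerType)

# Hive-friendly shape: no struct in a map key position
alt_schema = MapType(
    StringType(),
    ArrayType(StructType([
        StructField("start", IntegerType(), False),
        StructField("end", IntegerType(), False),
        StructField("data", StringType(), True),
    ])))
Lookups then become a filter over the array rather than a keyed access, so a reader no longer needs to know the exact start and end up front.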