Is DataFrame schema saved when using parquet format? - apache-spark

If one calls df.write.parquet(destination), is the DataFrame schema (i.e. StructType information) saved along with the data?
If the Parquet files are generated by programs other than Spark, how does sqlContext.read.parquet figure out the schema of the DataFrame?

Parquet automatically preserves the schema of the original data when saving, so it makes no difference whether Spark or another system writes or reads the data.
If one or more columns are used to partition the data when saving, the data types of those columns are lost (the information is stored only in the directory structure). Spark can automatically infer the types of these columns when reading (currently only numeric types and strings are supported).
This automatic inference can be turned off by setting spark.sql.sources.partitionColumnTypeInference.enabled to false, which makes those columns be read as strings. For more information, see the partition discovery section of the Spark SQL documentation.
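A minimal PySpark sketch of this behaviour (the path, column names and sample data are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical DataFrame with an integer column used for partitioning.
    df = spark.createDataFrame(
        [(1, "a", 2020), (2, "b", 2021)],
        ["id", "value", "year"],
    )

    # The types of "id" and "value" are stored in the Parquet file footers;
    # the type of the partition column "year" exists only in the directory
    # names (year=2020/, year=2021/).
    df.write.partitionBy("year").parquet("/tmp/example_parquet")

    # By default Spark infers "year" back as an integer ...
    spark.read.parquet("/tmp/example_parquet").printSchema()

    # ... unless partition-column type inference is disabled, in which case
    # "year" comes back as a string.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
    spark.read.parquet("/tmp/example_parquet").printSchema()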

Related

Column Type mismatch in Spark Dataframe and source

I am trying to read data from Elasticsearch. I can see that the column is present as an array of strings in Elasticsearch, but when I read it into a Spark DataFrame it shows up as a string. How can I handle this data in Spark?
Note: I am reading with sqlContext.read.format("org.elasticsearch.spark.sql") because I need to write it out as a CSV file later.
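A rough sketch of reading such a field with the elasticsearch-hadoop connector (the index and field names are placeholders, and the es.read.field.as.array.include option is assumed to be available in your connector version; it hints which fields should be read as arrays, since Elasticsearch mappings do not distinguish a single value from an array):

    from pyspark.sql.functions import concat_ws

    df = (sqlContext.read
          .format("org.elasticsearch.spark.sql")
          # Treat the "tags" field as array<string> instead of string.
          .option("es.read.field.as.array.include", "tags")
          .load("my_index/my_type"))

    df.printSchema()

    # Arrays cannot be written to CSV directly, so flatten them first,
    # e.g. by joining the values into a single delimited string.
    df.withColumn("tags", concat_ws(",", "tags")).write.csv("/tmp/out_csv")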

Load single column from csv file

I have a CSV file that contains a large number of columns. I want to load just one column from that file using Spark.
I know that we can use a select statement to filter a column, but what I want is for the read operation itself to load just that one column.
That way I should be able to avoid the extra memory used by the other columns. Is there any way to do this?
Spark will load the complete file and parse it for columns. As you mentioned, you can use select to restrict the columns in the DataFrame, so the DataFrame will contain only the one column.
Spark loads the complete file into memory and then filters down to the column you want with the select you mentioned, because every read operation in Spark scans the whole file: a distributed reader is instantiated on every node where the data is stored.
If your problem is reading the data column-wise, store the file in Parquet format and read that instead. Parquet is columnar storage and is designed exactly for this kind of use case (you can verify the column pruning with explain, as sketched below).
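A sketch comparing the two approaches (paths and the column name are placeholders):

    # With CSV, Spark still scans and parses every line; select() only
    # limits which columns end up in the DataFrame.
    csv_df = spark.read.option("header", "true").csv("/tmp/data.csv").select("col_a")

    # With Parquet, only the bytes of the requested column are read,
    # because the format is columnar.
    pq_df = spark.read.parquet("/tmp/data.parquet").select("col_a")

    # Compare the physical plans: the Parquet scan shows a ReadSchema
    # restricted to col_a (column pruning).
    csv_df.explain()
    pq_df.explain()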

After creating a dataframe from a hive table, if the data in the table is altered, will the dataframe contain new data or old data?

Data gets loaded into a DataFrame only when an action is performed on it.
But if, after creating the DataFrame from a Hive table and before performing any action, the data in the table is modified, will the changes be reflected in the DataFrame?
The DataFrame will not contain the old data, because the DataFrame does not contain any data at all. A DataFrame is nothing more than a query plan, not materialized data.
In your case, you will get the new data, or possibly a FileNotFoundException if Spark has already cached the Hive table metadata and file names and those changed with the new data.
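A short sketch of this behaviour (the table name is a placeholder):

    df = spark.table("my_db.my_table")   # nothing is read yet, only a query plan

    # ... the underlying table is modified by another process here ...

    # If Spark has already cached stale file listings for the table,
    # refresh its metadata before running an action.
    spark.catalog.refreshTable("my_db.my_table")

    df.count()   # the action runs against the table's current state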

Should columns generated during ETL be added to schema?

I read a CSV, do some transformations that add a few columns using .withColumn(...), and then write the result in Parquet format. Would it help performance to add those derived columns to the read() schema?
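A sketch of the pipeline described (schema, column names and paths are placeholders). The read() schema describes only the columns physically present in the CSV; columns created with withColumn() are appended to the DataFrame's schema afterwards and do not need to be declared up front:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType
    from pyspark.sql.functions import col

    # Schema for the columns that actually exist in the CSV file.
    csv_schema = StructType([
        StructField("id", StringType()),
        StructField("amount", DoubleType()),
    ])

    df = spark.read.schema(csv_schema).csv("/tmp/input.csv", header=True)

    # Derived column added during the ETL step.
    df = df.withColumn("amount_with_tax", col("amount") * 1.2)

    df.write.parquet("/tmp/output.parquet")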

mapping spark dataframe datatypes to jdbc datatype

I am using df.write.jdbc to save a dataframe to a database table.
One of the columns in my DataFrame is of type map<string, string>. How do I write this to a database table? The target is a Redshift database.
You should be able to serialize the map to a string, for example by converting it to JSON, and then save it.
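A sketch of one way to do this in PySpark, converting the map to a JSON string with to_json before the JDBC write (the URL, table and column names are placeholders; Redshift has no native map type, so a string column is the practical target):

    from pyspark.sql.functions import to_json

    # Replace the map<string,string> column with its JSON representation.
    out = df.withColumn("attributes", to_json("attributes"))

    out.write.jdbc(
        url="jdbc:redshift://host:5439/db",
        table="my_table",
        mode="append",
        properties={"user": "user", "password": "password"},
    )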
