Should columns generated during ETL be added to schema? - apache-spark

I read a CSV, do some transformations that add a few columns using .withColumn(...), and then write the result in Parquet format. Would it help performance to include those added columns in the read() schema?
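
For reference, here is a minimal sketch of the pattern being described, with a hypothetical CSV schema and hypothetical derived columns:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# schema describing only the columns that actually exist in the CSV file
csv_schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

df = spark.read.csv("input.csv", header=True, schema=csv_schema)

# columns added during the transformation step with .withColumn(...)
df = (df
      .withColumn("load_date", F.current_date())
      .withColumn("amount_with_tax", F.col("amount") * 1.2))

df.write.parquet("output/")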

Related

Load single column from csv file

I have a CSV file that contains a large number of columns, and I want to load just one of them using Spark.
I know we can use a select statement to filter down to one column, but what I want is for the read operation itself to load only that column.
That way I should be able to avoid the extra memory used by the other columns. Is there any way to do this?
Spark will load and parse the complete file. As you mentioned, you can use select to restrict the columns of the DataFrame, so the DataFrame will contain only the one column you want, but the whole file is still read into memory first.
This is because every read operation in Spark scans the entire file: a distributed stream reader is created and instantiated on every node where the data is stored.
If your goal is to read the data column-wise, store the file in Parquet format and read that instead. Parquet is columnar storage and is meant exactly for this type of use case (you can verify it using explain).
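
As a rough sketch of the two approaches (paths and column names are assumptions):

# CSV: the whole file is still scanned; select only restricts the resulting DataFrame
df_csv = spark.read.csv("data.csv", header=True).select("col1")

# Parquet: columnar storage lets Spark read just the requested column
df_parquet = spark.read.parquet("data.parquet").select("col1")
df_parquet.explain()  # the physical plan's ReadSchema should list only col1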

Attach description of columns in Apache Spark using parquet format

I read a Parquet file with:
df = spark.read.parquet(file_name)
and get the columns with:
df.columns
which returns a list of columns: ['col1', 'col2', 'col3'].
I have read that the Parquet format is able to store some metadata in the file.
Is there a way to store and read extra metadata, for example to attach a human-readable description of what each column is?
Thanks.
There is no way to read or store arbitrary additional metadata in a Parquet file.
When metadata in a Parquet file is mentioned, it refers to the technical metadata associated with each field, including the number of nested fields, type information, length information, etc. If you look at the SchemaElement class in the Parquet documentation (https://static.javadoc.io/org.apache.parquet/parquet-format/2.6.0/org/apache/parquet/format/SchemaElement.html) you will find all of the metadata available for each field in a schema. This does not include any human-readable description beyond the field name.
A good overview of the Parquet metadata can be found in the "File Format" section here - https://parquet.apache.org/documentation/latest/
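
For completeness, a small sketch of the technical metadata Spark does expose from a Parquet file (the path is an assumption):

df = spark.read.parquet("file.parquet")
df.printSchema()  # field names, types, nullability
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)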

Is DataFrame schema saved when using parquet format?

If one calls df.write.parquet(destination), is the DataFrame schema (i.e. StructType information) saved along with the data?
If the Parquet files are generated by programs other than Spark, how does sqlContext.read.parquet figure out the schema of the DataFrame?
Parquet files automatically preserve the schema of the original data when saving, so it makes no difference whether the data is written or read by Spark or by another system.
If one or more columns are used to partition the data when saving, the data types of those columns are lost, since that information is stored in the file structure instead. Spark can automatically infer those data types when reading (currently only numeric data types and strings are supported).
This automatic inference can be turned off by setting spark.sql.sources.partitionColumnTypeInference.enabled to false, which makes these columns be read as strings. For more information, see the partition discovery section of the Spark SQL documentation.
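
A minimal sketch of this behaviour, assuming a hypothetical partition column and output path:

from pyspark.sql import Row

df = spark.createDataFrame([Row(year=2023, value="a"), Row(year=2024, value="b")])

# the partition column's values are stored in the directory names, not in the data files
df.write.mode("overwrite").partitionBy("year").parquet("/tmp/events")

# on read, Spark infers the partition column's type from the directory names
spark.read.parquet("/tmp/events").printSchema()  # year comes back as a numeric type

# with inference disabled, partition columns are read as strings
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("/tmp/events").printSchema()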

Refresh Dataframe created by writing to Parquet files in append mode

Is there a way to refresh a DataFrame that is backed by Parquet files written in append mode in PySpark?
Basically, I am writing data that I receive daily to Parquet in append mode.
To check the Parquet files created, I load them in PySpark and do a count of the data. However, if new data is appended to the Parquet and I run the count again on the same DataFrame without reloading it, I do not get the updated count. Basically, I have to create a new DataFrame every time anything changes in my Parquet files. Is there a way in Spark for the changes to be loaded automatically once my Parquet data is updated?
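
To illustrate the situation being described (the path and the new_data DataFrame are hypothetical):

df = spark.read.parquet("/data/daily")
df.count()  # count at the time df was defined

# ... the daily job appends more data ...
new_data.write.mode("append").parquet("/data/daily")

df.count()  # as described above, this does not reflect the appended files
spark.read.parquet("/data/daily").count()  # re-reading creates a new DataFrame and picks them up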

How to quickly migrate from one table into another one with different table structure in the same/different cassandra?

I have a Cassandra table with more than 10,000,000 records. For various reasons I want to build another Cassandra table with the same fields plus several additional fields, and migrate the existing data into it. For now, the two tables are in the same Cassandra cluster.
How can I finish this task in the shortest time?
And if the new table is in a different Cassandra cluster, how would I do it?
Any advice will be appreciated!
If you just need to add blank fields to a table, then the best thing to do is use the alter table command to add the fields to the existing table. Then no copying of the data would be needed and the new fields would show up as null in the existing rows until you set them to something.
If you want to change the structure of the data in the new table, or write it to a different cluster, then you'd probably need to write an application to read each row of the old table, transform the data as needed, and then write each row to the new location.
You could also do this by exporting the data to a csv file, write a program to restructure the csv file as needed, then import the csv file into the new location.
Another possible method would be to use Apache Spark. You'd read the existing table into an RDD, transform and filter the data into a new RDD, then save the transformed RDD to the new table. That would only work within the same cluster and would be fairly complex to set up.
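
As a rough sketch of that Spark approach, using the DataFrame API with the spark-cassandra-connector rather than raw RDDs (keyspace, table names, and the added column are assumptions; the connector package must be on the classpath and configured with the cluster's contact points):

from pyspark.sql import functions as F

# read the existing table
old = (spark.read
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="my_keyspace", table="old_table")
       .load())

# add the extra fields required by the new table (null until populated)
migrated = old.withColumn("new_field", F.lit(None).cast("string"))

# write into the new table, which must already exist in Cassandra
(migrated.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_keyspace", table="new_table")
 .mode("append")
 .save())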
