Exclude column/list of columns while loading into dataframe - apache-spark

I have a weird question or issue which I am facing, my parquet file is having around 1200 columns out of which few are corrupt columns for some reason. I can't edit the files for some reason so I want to exclude those columns from being introduced to a data frame I want to create. I don't seem to find any such feature which would help me to do so.
The idea of creating a data frame first and then dropping the columns do crossed my mind but that can't be done because getting the issue of
Failed to merge incompatible data types string and bigint
So any way we can exclude the columns before hand even before loading in called is welcomed. Declaring a schema is going to be tedious job which I want to avoid at any cost because I want to keep the process as dynamic as possible.
Any leads or ideas are more than welcome.
Thanks
Ashit

can you provide some more information?
do all files have the same schema? except for some "corrupted" columns - which seems that they have a different type?
if this is the case - I would recommend not avoiding providing a schema. you can do something like this (this is the scala version) :
val df = spark.read.parquet("path to on file that has a good schema")
val allColumns = df.columns.toSet //get all columns
val goodColumns = allColumns - "badColumn"
val allData = spark.read.parquet("path to all data").select(goodColumns: _*) //syntax for exploding the set and providing the strings
spark should project only the needed columns and skip the corrupted once
you can also do the same with getting the schemaType from one file drop the not needed columns and provide this schema

Related

How to ingest multiple csv files into a Spark dataframe?

I am trying to ingest 2 csv files into a single spark dataframe. However, the schema of these 2 datasets is very different, and when I perform the below operation, I get back only the schema of the second csv, as if the first one doesn't exist. How can I solve this? My final goal is to count the total number of words.
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
df0_spark=spark.read.format("csv").option("header","false").load(paths)
df0_spark.write.mode("overwrite").saveAsTable("ML_reddit2")
df0_spark.show()
I tried to load both of the files into a single spark dataframe, but it only gives me back one of the tables.
I have reproduced the above and got the below results.
For sample, I have two csv files in dbfs with different schemas. when I execute the above code, I got the same result.
To get the desired schema enable mergeSchemaand header while reading the files.
Code:
df0_spark=spark.read.format("csv").option("mergeSchema","true").option("header","true").load(paths)
df0_spark.show()
If you want to combine the two files without nulls, we should have a common identity column and we have to read the files individually and use inner join for that.
The solution that has worked for me the best in such cases was to read all distinct files separately, and then union them after they have been put into DataFrames. So your code could look something like this:
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
# Load all distinct CSV files
df1 = spark.read.option("header", false).csv(paths[0])
df2 = spark.read.option("header", false).csv(paths[1])
# Union DataFrames
combined_df = df1.unionByName(df2, allowMissingColumns=True)
Note: if the names of columns differ between the files, then for all columns from first file that are not present in second one, you will have null values. If the schema should be matching, then you can always rename the columns, before the unionByName step.

Assigning indexes across rows within python DataFrame

I'm currently trying to assign a unique indexer across rows, rather than alongside columns. The main problem is these values can never repeat, and must be preserved with every monthly report that I run.
I've thought about merging the columns and assigning an indexer to that, but my concern is that I won't be able to easily modify the dataframe and still preserve the same index values for each cell with this method.
I'm expecting my df to look something like this below:
Sample DataFrame
I haven't yet found a viable solution so haven't got any code to show yet. Any solutions would be much appreciated. Thank you.

spark reading missing columns in parquet

I have parquet files which I need to read from spark. Some files have few columns missing which are present in new files.
Since I do not know which files have column missing, I need to read all the files in spark. I have list of columns that I need to read. It may also be the case that all the files may have some column missing. I need to put a null in those columns which are missing.
When I try to do a
sqlContext.sql('query') it gives me error saying that columns are missing
If I define the schema and do a
sqlContext.read.parquet('s3://....').schema(parquet_schema)
It gives me the same error.
Help me here
You need to use parquet schema evolution strategy to address this situation.
As defined in the spark documentation
Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
All you need to do is
val mergedDF = spark.read.option("mergeSchema", "true").parquet("'s3://....'")
This will give you parquet data with complete schema.
Pain point
In case your schema is non compatible for example one parquet file has col1 DataType as String and another parquet file has col1 DataType as Long.
Then the merge schema will fail.

Partition column is moved to end of row when saving a file to Parquet

For a given DataFrame just before being save'd to parquet here is the schema: notice that the centroid0 is the first column and is StringType:
However when saving the file using:
df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)
and with the partitionCols as centroid0:
then there is a (to me) surprising result:
the centroid0 partition column has been moved to the end of the Row
the data type has been changed to Integer
I confirmed the output path via println :
path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters
And here is the schema upon reading back from the saved parquet:
Why are those two modifications to the input schema occurring - and how can they be avoided - while still maintaining the centroid0 as a partitioning column?
Update A preferred answer should mention why /when the partitions were added to the end (vs the beginning) of the columns list. We need an understanding of the deterministic ordering.
In addition - is there any way to cause spark to "change it's mind" on the inferred column types? I have had to change the partitions from 0, 1 etc to c0, c1 etc in order to get the inference to map to StringType. Maybe that were required .. but if there were some spark setting to change the behavior that would make for an excellent answer.
When you write.partitionBy(...) Spark saves the partition field(s) as folder(s)
This is can be beneficial for reading data later as (with some file types, parquet included) it can optimize to read data just from partitions that you use (i.e. if you'd read and filter for centroid0==1 spark wouldn't read the other partitions
The effect of this is that the partition fields (centroid0 in your case) are not written into the parquet file only as folder names (centroid0=1, centroid0=2, etc.)
The side effect of these are 1. the type of the partition is inferred at run time (since the schema is not saved in the parquet) and in your case it happened that you only had integer values so it was inferred to integer.
The other side effect is that the partition field is added at the end/beginning of the schema as it reads the schema from the parquet files as one chunk and then it adds to that the partition field(s) as another (again, it is no longer part of the schema that is stored in the parquet)
You can actually pretty easily make use of ordering of the columns of a case class that holds the schema of your partitioned data. You will need to read the data from the path, inside which the partitioning columns are stored underneath to make Spark infer the values of these columns. Then simply apply re-ordering by using the case class schema with a statement like:
val encoder: Encoder[RecordType] = Encoders.product[RecordType]
spark.read
.schema(encoder.schema)
.format("parquet")
.option("mergeSchema", "true")
.load(myPath)
// reorder columns, since reading from partitioned data, the partitioning columns are put to end
.select(encoder.schema.fieldNames.head, encoder.schema.fieldNames.tail: _*)
.as[RecordType]
The reason is in fact pretty simple. When you partition by a column, each partition can only contain one value of the said column. Therefore it is useless to actually write the same value everywhere in the file, and this is why Spark does not. When the file is read, Spark uses the information contained in the names of the files to reconstruct the partitioning column and it is put at the end of the schema. The type of the column is not stored, it is inferred when reading, hence the integer type in your case.
NB: There is no particular reason as to why the column is added at the end. It could have been at the beginning. I guess it is just an arbitrary choice of implementation.
To avoid losing the type and the order of the columns, you could duplicate the partitioning column like this df.withColumn("X", 'YOUR_COLUMN).write.partitionBy("X").parquet("...").
You will waste space though. Also, spark uses the partitioning to optimize filters for instance. Don't forget to use the X column for filters after reading the data and not your column or Spark won't be able to perform any optimizations.

Auto infer schema from parquet/ selectively convert string to float

I have a parquet file with 400+ columns, when I read it, the default datatypes attached to a lot of columns is String (may be due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True' #for spark.read.parquet, present for spark.read.csv
I tried changing
mergeSchema=True #but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
this runs without error, but converts all the actual string column values to Null. I can't wrap this in a try, catch block as its not throwing any error.
Is there a way where i can check whether the columns contains only 'integer/ float' values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way where i can check whether the columns contains only 'integer/ float' values and selectively cast those columns to float?
You can use the same logic as Spark - define preferred type hierarchy and attempt to cast, until you get to the point, where you find the most selective type, that parses all values in the column.
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
There's no easy way currently,
there's a Github issue already existing which can be referred
https://github.com/databricks/spark-csv/issues/264
somthing like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
existing for scala this can be created for pyspark

Resources