Spark Dataframe validating column names for parquet writes - apache-spark

I'm processing events using Dataframes converted from a stream of JSON events which eventually gets written out as Parquet format.
However, some of the JSON events contains spaces in the keys which I want to log and filter/drop such events from the data frame before converting it to Parquet because ;{}()\n\t= are considered special characters in Parquet schema (CatalystSchemaConverter) as listed in [1] below and thus should not be allowed in the column names.
How can I do such validations in Dataframe on the column names and drop such an event altogether without erroring out the Spark Streaming job.
[1]
Spark's CatalystSchemaConverter
def checkFieldName(name: String): Unit = {
// ,;{}()\n\t= and space are special characters in Parquet schema
checkConversionRequirement(
!name.matches(".*[ ,;{}()\n\t=].*"),
s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
|Please use alias to rename it.
""".stripMargin.split("\n").mkString(" ").trim
)
}

For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
df = df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(df.schema).parquet(file)

You can use a regex to replace all invalid characters with an underscore before you write into parquet. Additionally, strip accents from the column names too.
Here's a function normalize that do this for both Scala and Python :
Scala
/**
* Normalize column name by replacing invalid characters with underscore
* and strips accents
*
* #param columns dataframe column names list
* #return the list of normalized column names
*/
def normalize(columns: Seq[String]): Seq[String] = {
columns.map { c =>
org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
}
}
// using the function
val df2 = df.toDF(normalize(df.columns):_*)
Python
import unicodedata
import re
def normalize(column: str) -> str:
"""
Normalize column name by replacing invalid characters with underscore
strips accents and make lowercase
:param column: column name
:return: normalized column name
"""
n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()
# using the function
df = df.toDF(*map(normalize, df.columns))

This is my solution using Regex in order to rename all the dataframe's columns following the parquet convention:
df.columns.foldLeft(df){
case (currentDf, oldColumnName) => currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps,

I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry but I have only the pyspark code ready:
from pyspark.sql import functions as F
df_tmp.select(*(F.col("`" + c+ "`").alias(c.replace(' ', '_')) for c in df_tmp.columns)

Using alias to change your field names without those special characters.

I have encounter this error "Error in SQL statement: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping Or you can use alias to rename it."
The issue was because I used MAX(COLUM_NAME) when creating a table based on a parquet / Delta table, and the new name of the new table was "MAX(COLUM_NAME)" because forgot to use Aliases and parquet files doesn't support brackets '()'
Solved by using aliases (removing the brackets)

It was fixed in Spark 3.3.0 release at least for the parquet files (I tested), it might work with JSON as well.

Related

How to convert dataframe to a text file in spark?

I unloaded snowflake table and created a data frame.
this table has data of various datatype.
I tried to save it as a text file but got an error:
Text data source does not support Decimal(10,0).
So to resolve the error, I casted my select query and converted all columns to string datatype.
Then I got the below error:
Text data source supports only single column, and you have 5 columns.
my requirement is to create a text file as follows.
"column1value column2value column3value and so on"
You can use a CSV output with a space delimiter:
import pyspark.sql.functions as F
df.select([F.col(c).cast('string') for c in df.columns]).write.csv('output', sep=' ')
If you want only 1 output file, you can add .coalesce(1) before .write.
You need to have one column if you want to write using spark.write.text. You can use csv instead as suggested in #mck's answer or you can concatenate all columns into one before you write:
df.select(
concat_ws(" ", df.columns.map(c => col(c).cast("string")): _*).as("value")
).write
.text("output")

Row delimiter change while writing a tsv file in adls gen 1 from databrick

I want to save the tsv file to adls gen1. Using the below command to save the data but it writing a row delimiter as "\n"(LF) I want to writing a row delimiter "\r\n"
df.coalesce(1).write.mode("overwrite").format("csv").options(delimiter="\t",header="true",nullValue= None,lineSep ='\r\n').save(gen1temp)
I am having a 400+columns and 2M rows and file size in 6GB.
Please help with optimal solumn.
support for lineSep option for CSV files exists only in Spark 3.0, and doesn't exist in the earlier versions, like, 2.4, so it simply ignored.
Initially I thought about following workaround - append \r to the last column:
from pyspark.sql.functions import concat, col, lit
data = spark.range(1, 100).withColumn("abc", col("id")).withColumn("def", col("id"))
cols = map(lambda cn: col(cn), data.schema.fieldNames())
cols[-1] = concat(cols[-1].cast("string"), lit("\r"))
data.select(cols).write.csv("1.csv")
but unfortunately it doesn't work - it looks like that it's stripping ending whitespace when writing data into CSV...

pyspark dataframe column name

what is limitation for pyspark dataframe column names. I have issue with following code ..
%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})
It gives ...
u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'
The column name is evidently typed correctly. I got the dataframe by transformation from pandas dataframe. It there any issue with dot in the column name string?
Dots are used for nested fields inside a structure type. So if you had a column that was called "address" of type StructType, and inside that you had street1, street2, etc you would access it the individual fields like this:
df.select("address.street1", "address.street2", ..)
Because of that, if you want to used a dot in your field name you need to quote the field whenever you refer to it. For example:
from pyspark.sql.types import *
schema = StructType([StructField("my.field", StringType())])
rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)
# Using backticks to quote the field name
df.select("`my.field`").show()

How to handle min_itemsize exception in writing to pandas HDFStore

I am using pandas HDFStore to store dfs which I have created from data.
store = pd.HDFStore(storeName, ...)
for file in downloaded_files:
try:
with gzip.open(file) as f:
data = json.loads(f.read())
df = json_normalize(data)
store.append(storekey, df, format='table', append=True)
except TypeError:
pass
#File Error
I have received the error:
ValueError: Trying to store a string with len [82] in [values_block_2] column but
this column has a limit of [72]!
Consider using min_itemsize to preset the sizes on these columns
I found that it is possible to set min_itemsize for the column involved but this is not a viable solution as I do not know the max length I will encounter and all the columns which I will encounter the problem.
Is there a solution to automatically catch this exception and handle it each item it occur?
I think you can do it this way:
store.append(storekey, df, format='table', append=True, min_itemsize={'Long_string_column': 200})
basically it's very similar to the following create table SQL statement:
create table df(
id int,
str varchar(200)
);
where 200 is the maximal allowed length for the str column
The following links might be very helpful:
https://www.google.com/search?q=pandas+ValueError%3A+Trying+to+store+a+string+with+len+in+column+but+min_itemsize&pws=0&gl=us&gws_rd=cr
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex

How to read multiple text files into a single RDD?

I want to read a bunch of text files from a hdfs location and perform mapping on it in an iteration using spark.
JavaRDD<String> records = ctx.textFile(args[1], 1); is capable of reading only one file at a time.
I want to read more than one file and process them as a single RDD. How?
You can specify whole directories, use wildcards and even CSV of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
As Nick Chammas points out this is an exposure of Hadoop's FileInputFormat and therefore this also works with Hadoop (and Scalding).
Use union as follows:
val sc = new SparkContext(...)
val r1 = sc.textFile("xxx1")
val r2 = sc.textFile("xxx2")
...
val rdds = Seq(r1, r2, ...)
val bigRdd = sc.union(rdds)
Then the bigRdd is the RDD with all files.
You can use a single textFile call to read multiple files. Scala:
sc.textFile(','.join(files))
You can use this
First You can get a Buffer/List of S3 Paths :
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket:String, base_prefix : String) = {
var files = new ArrayList[String]
//S3 Client and List Object Request
var s3Client = new AmazonS3Client();
var objectListing: ObjectListing = null;
var listObjectsRequest = new ListObjectsRequest();
//Your S3 Bucket
listObjectsRequest.setBucketName(s3_bucket)
//Your Folder path or Prefix
listObjectsRequest.setPrefix(base_prefix)
//Adding s3:// to the paths and adding to a list
do {
objectListing = s3Client.listObjects(listObjectsRequest);
for (objectSummary <- objectListing.getObjectSummaries().asScala) {
files.add("s3://" + s3_bucket + "/" + objectSummary.getKey());
}
listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
//Removing Base Directory Name
files.remove(0)
//Creating a Scala List for same
files.asScala
}
Now Pass this List object to the following piece of code, note : sc is an object of SQLContext
var df: DataFrame = null;
for (file <- files) {
val fileDf= sc.textFile(file)
if (df!= null) {
df= df.unionAll(fileDf)
} else {
df= fileDf
}
}
Now you got a final Unified RDD i.e. df
Optional, And You can also repartition it in a single BigRDD
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
In PySpark, I have found an additional useful way to parse files. Perhaps there is an equivalent in Scala, but I am not comfortable enough coming up with a working translation. It is, in effect, a textFile call with the addition of labels (in the below example the key = filename, value = 1 line from file).
"Labeled" textFile
input:
import glob
from pyspark import SparkContext
SparkContext.stop(sc)
sc = SparkContext("local","example") # if running locally
sqlContext = SQLContext(sc)
for filename in glob.glob(Data_File + "/*"):
Spark_Full += sc.textFile(filename).keyBy(lambda x: filename)
output: array with each entry containing a tuple using filename-as-key and with value = each line of file. (Technically, using this method you can also use a different key besides the actual filepath name- perhaps a hashing representation to save on memory). ie.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine either as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back to single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the filepathing.):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
you can use
JavaRDD<String , String> records = sc.wholeTextFiles("path of your directory")
here you will get the path of your file and content of that file. so you can perform any action of a whole file at a time that saves the overhead
All answers are correct with sc.textFile
I was just wondering why not wholeTextFiles For example, in this case...
val minPartitions = 2
val path = "/pathtohdfs"
sc.wholeTextFiles(path,minPartitions)
.flatMap{case (path, text)
...
one limitation is that, we have to load small files otherwise performance will be bad and may lead to OOM.
Note :
The wholefile should fit in to memory
Good for file formats that are NOT splittable by line... such as XML files
Further reference to visit
There is a straight forward clean solution available. Use the wholeTextFiles() method. This will take a directory and forms a key value pair. The returned RDD will be a pair RDD.
Find below the description from Spark docs:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file
TRY THIS
Interface used to write a DataFrame to external storage systems (e.g. file systems, key-value stores, etc). Use DataFrame.write() to access this.
New in version 1.4.
csv(path, mode=None, compression=None, sep=None, quote=None, escape=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None, timestampFormat=None)
Saves the content of the DataFrame in CSV format at the specified path.
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists.
compression – compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).
sep – sets the single character as a separator for each field and value. If None is set, it uses the default value, ,.
quote – sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
escape – sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, \
escapeQuotes – A flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value true, escaping all values containing a quote character.
quoteAll – A flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value false, only escaping values containing a quote character.
header – writes the names of columns as the first line. If None is set, it uses the default value, false.
nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value value, yyyy-MM-dd.
timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value value, yyyy-MM-dd'T'HH:mm:ss.SSSZZ.
rdd = textFile('/data/{1.txt,2.txt}')

Resources