pyspark dataframe column name - string

what is limitation for pyspark dataframe column names. I have issue with following code ..
%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})
It gives ...
u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'
The column name is evidently typed correctly. I got the dataframe by transformation from pandas dataframe. It there any issue with dot in the column name string?

Dots are used for nested fields inside a structure type. So if you had a column that was called "address" of type StructType, and inside that you had street1, street2, etc you would access it the individual fields like this:
df.select("address.street1", "address.street2", ..)
Because of that, if you want to used a dot in your field name you need to quote the field whenever you refer to it. For example:
from pyspark.sql.types import *
schema = StructType([StructField("my.field", StringType())])
rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)
# Using backticks to quote the field name
df.select("`my.field`").show()

Related

Get the Dataframe from temp view name

I can register a Dataframe in the catalog using
df.createOrReplaceTempView("my_name")
I can check if the dataframe is registered
next(filter(lambda table:table.name=='my_name',spark.catalog.listTables()))
But how I can get the DataFrame associated with the name?
a_df = spark.catalog.getTempView()
assert id(a_df)==id(df)
spark.table() method will return the dataframe for the given table/view name
a_df = spark.table('my_name')
You may also use:
df=spark.sql("select * from my_name")
if you want to do some SQL before loading your df.

Change the datatype of a column in delta table

Is there a SQL command that I can easily use to change the datatype of a existing column in Delta table. I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use but no luck.
%sql ALTER TABLE [TABLE_NAME] ALTER COLUMN [COLUMN_NAME] STRING
Error I'm getting:
org.apache.spark.sql.AnalysisException
ALTER TABLE CHANGE COLUMN is not supported for changing column 'bam_user' with type
'IntegerType' to 'bam_user' with type 'StringType'
SQL doesn't support this, but it can be done in python:
from pyspark.sql.functions import col
# set dataset location and columns with new types
table_path = '/mnt/dataset_location...'
types_to_change = {
'column_1' : 'int',
'column_2' : 'string',
'column_3' : 'double'
}
# load to dataframe, change types
df = spark.read.format('delta').load(table_path)
for column in types_to_change:
df = df.withColumn(column,col(column).cast(types_to_change[column]))
# save df with new types overwriting the schema
df.write.format("delta").mode("overwrite").option("overwriteSchema",True).save("dbfs:" + table_path)
No Option to change the data type of column or dropping the column. You can read the data in datafame, modify the data type and with help of withColumn() and drop() and overwrite the table.
There is no real way to do this using SQL, unless you copy to a different table altogether. This option includes INSERT data to a new table, DROP TABLE and re-CREATE with the new structure and therefore risky.
The way to do this in python is as follows:
Let's say this is your table :
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
You can check the table structure using the following:
DESCRIBE TABLE person
IF you need to change the id to String:
This is the code:
%py
from pyspark.sql.functions import col
df = spark.read.table("person")
df1 = df.withColumn("id",col("id").cast("string"))
df1.write
.format ("parquet")
.mode("overwrite")
.option("overwriteSchema", "true")
.saveAsTable("person")
Couple of pointers: the format is parquet in this table. That's the default for Databricks. So you can omit the "format" line (note that Python is very sensitive regarding spaces).
Re databricks:
If the format is "delta" you must specify this.
Also, if the table is partitioned, it's important to mention that in the code:
For example:
df1.write
.format ("delta")
.mode("overwrite")
.partitionBy("col_to_partition1", "col_to_partition2")
.option("overwriteSchema", "true")
.save(table_location)
When table_location is where the delta table is saved.
(some of this answer is based on this)
Suppose you want to change data type of column "column_name" to "int" of table "delta_table_name"
spark.read.table("delta_table_name") .withColumn("Column_name",col("Column_name").cast("new_data_type")) .write.format("delta").mode("overwrite").option("overwriteSchema",true).saveAsTable("delta_table_name")
Read the table using spark.
Use withColumn method to transform the column you want.
Write the table back, mode overwrite and overwriteSchema True
Reference: https://docs.databricks.com/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name
from pyspark.sql import functions as F
spark.read.table("<TABLE NAME>") \
.withColumn("<COLUMN NAME> ",F.col("<COLUMN NAME>").cast("<DATA TYPE>")) \
.write.format("delta").mode("overwrite").option("overwriteSchema",True).saveAsTable("<TABLE NAME>")

I would like to convert an int to string in dataframe

I would like to convert a column in dataframe to a string
it looks like this :
company department id family name start_date end_date
abc sales 38221925 Levy nali 16/05/2017 01/01/2018
I want to convert the id from int to string
I tried
data['id']=data['id'].to_string()
and
data['id']=data['id'].astype(str)
got dtype('O')
I expect to receive string
This is intended behaviour. This is how pandas stores strings.
From the docs
Pandas uses the object dtype for storing strings.
For a simple test, you can make a dummy dataframe and check it's dtype too.
import pandas as pd
df = pd.DataFrame(["abc", "ab"])
df[0].dtype
#Output:
dtype('O')
You can do that by using apply() function in this way:
data['id'] = data['id'].apply(lambda x: str(x))
This will convert all the values of id column to string.
You can ensure the type of the values like this:
type(data['id'][0]) (It is checking the first value of 'id' column)
This will give the output str.
And data['id'].dtype will give dtype('O') that is object.
You can also use data.info() to check all the information about that DataFrame.
str(12)
>>'12'
Can easily convert to a String

Spark Dataframe validating column names for parquet writes

I'm processing events using Dataframes converted from a stream of JSON events which eventually gets written out as Parquet format.
However, some of the JSON events contains spaces in the keys which I want to log and filter/drop such events from the data frame before converting it to Parquet because ;{}()\n\t= are considered special characters in Parquet schema (CatalystSchemaConverter) as listed in [1] below and thus should not be allowed in the column names.
How can I do such validations in Dataframe on the column names and drop such an event altogether without erroring out the Spark Streaming job.
[1]
Spark's CatalystSchemaConverter
def checkFieldName(name: String): Unit = {
// ,;{}()\n\t= and space are special characters in Parquet schema
checkConversionRequirement(
!name.matches(".*[ ,;{}()\n\t=].*"),
s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
|Please use alias to rename it.
""".stripMargin.split("\n").mkString(" ").trim
)
}
For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
df = df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(df.schema).parquet(file)
You can use a regex to replace all invalid characters with an underscore before you write into parquet. Additionally, strip accents from the column names too.
Here's a function normalize that do this for both Scala and Python :
Scala
/**
* Normalize column name by replacing invalid characters with underscore
* and strips accents
*
* #param columns dataframe column names list
* #return the list of normalized column names
*/
def normalize(columns: Seq[String]): Seq[String] = {
columns.map { c =>
org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
}
}
// using the function
val df2 = df.toDF(normalize(df.columns):_*)
Python
import unicodedata
import re
def normalize(column: str) -> str:
"""
Normalize column name by replacing invalid characters with underscore
strips accents and make lowercase
:param column: column name
:return: normalized column name
"""
n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()
# using the function
df = df.toDF(*map(normalize, df.columns))
This is my solution using Regex in order to rename all the dataframe's columns following the parquet convention:
df.columns.foldLeft(df){
case (currentDf, oldColumnName) => currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps,
I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry but I have only the pyspark code ready:
from pyspark.sql import functions as F
df_tmp.select(*(F.col("`" + c+ "`").alias(c.replace(' ', '_')) for c in df_tmp.columns)
Using alias to change your field names without those special characters.
I have encounter this error "Error in SQL statement: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping Or you can use alias to rename it."
The issue was because I used MAX(COLUM_NAME) when creating a table based on a parquet / Delta table, and the new name of the new table was "MAX(COLUM_NAME)" because forgot to use Aliases and parquet files doesn't support brackets '()'
Solved by using aliases (removing the brackets)
It was fixed in Spark 3.3.0 release at least for the parquet files (I tested), it might work with JSON as well.

Pulling Name Out of Schema in Spark DataFrame

I am trying to make a function that will pull the name of the column out of a dataframe schema. So what I have is the initial function defined:
val df = sqlContext.parquetFile(inputVal.toString)
val dfSchema = df.schema
def schemaMatchP(schema: StructType) : Map[String,List[Int]] =
schema
// get the 1st word (column type) in upper cases
.map(columnDescr => columnDescr
If I do something like this:
.map(columnDescr => columnDescr.toString.split(',')(0).toUpperCase)
I will get STRUCTFIELD(HH_CUST_GRP_MBRP_ID,BINARYTYPE,TRUE)
How do you handle a StructField so I can grab the 1st element out of each column for the schema. So my Column names: HH_CUST_GRP_MBRP_ID, etc...
When in doubt look what the source does itself. DataFrame.toString has the answer :). StructField is a case class with a name property. So, just do:
schema.map(f => s"${f.name}")

Resources