Pyspark Obtain Substring from Filename and Store as New Column - apache-spark

I am processing CSV files from S3 using PySpark, and I want to add the filename as a new column, for which I am using the code below:
spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])
df=spark.read.csv("s3a://exportcsv-battery/S5/243/101*",sep=',',header=True,inferSchema=True)
df=df.withColumn("filename", 'filenamefunc(input_file_name())')
But instead of the full filename, I want a substring of it. For example, if this is the input_file_name:
s3a://exportcsv-battery/S5/243/101_002932_243_AAA_A_T01_AAA_AAA_0_0_0_0_2_10Hz.csv
I only want 243 to be extracted and stored in a new column for which I defined a UDF as:
spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])
But it doesn't seem to work. Is there something I can do to fix it or a different approach? Thanks!

You can use the split() function:
import pyspark.sql.functions as f
[...]
df = df.withColumn('filename', f.split(f.input_file_name(), '/')[4])
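Index 4 picks out the 243 segment of the example path (splitting on / also yields an empty element after s3a:, so the bucket is index 2). If the directory depth can vary, a slightly more robust variant (a sketch, not part of the original answer) is to capture the path segment just before the file name with regexp_extract:
# Sketch: grab the parent directory name regardless of how deep the file sits
df = df.withColumn('filename', f.regexp_extract(f.input_file_name(), r'/([^/]+)/[^/]+$', 1))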

Related

how to convert the type of an object from "pandas.core.groupby.generic.SeriesGroupBy" to "pandas.core.series.Series"?

I have a variable of type "pandas.core.groupby.generic.SeriesGroupBy" which I got from grouping various fields of a pandas dataframe. I would like to convert that variable into a pandas Series, but my attempt generates a lot of errors.
Here is the code which I have tried:
w = data.groupby(['dt', 'b'])['w']
w = pd.Series(w)
When I try to run this code, it's taking a lot of time to execute and also generating a lot of errors.
I am getting a pandas Series as follows:
But I am expecting something similar to this:
Is there any other way to group the column of the DataFrame below and store it inside a pandas Series?
Pandas groupby objects are iterable. Using a list comprehension, you can extract the partitioned sub-series. Try:
list_of_series = [s for _, s in data.groupby(['dt', 'b'])['w']]
list_of_series is a list and should contain your desired pandas series.
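If you also want to keep the group keys alongside each sub-series, a dict comprehension works the same way (a small extension of the answer above, not part of it):
series_by_key = {key: s for key, s in data.groupby(['dt', 'b'])['w']}
# Each value is a pandas Series holding the 'w' values of one ('dt', 'b') group.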

Equivalent of pandas apply in pyspark?

I really want to be able to run complex functions over a whole column of a Spark dataframe, as I would do in Pandas with the apply function.
For example, in Pandas I have an apply function that takes a messy domain like sub-subdomain.subdomain.facebook.co.nz/somequerystring and just outputs facebook.com.
How would I do that in Spark?
I have looked at UDFs, but I am not clear on how I would run one on a single column.
Let's say I have a simple function like below where I extract different bits of a date from the column of the pandas DF:
def format_date(row):
    year = int(row['Contract_Renewal'][7:])
    month = int(row['Contract_Renewal'][4:6])
    day = int(row['Contract_Renewal'][:3])
    date = datetime.date(year, month, day)
    return date - now
In Pandas I would call it like:
df['days_until'] = df.apply(format_date, axis=1)
Can I achieve the same in Pyspark?
In this scenario, you may be able to use some combination of regexp_extract (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.regexp_extract), regexp_replace (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.regexp_replace), and split (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.split) to reformat the date strings.
It's not as clean as defining your own function and using apply as in Pandas, but it should be more performant than defining a Pandas/Spark UDF.
Good luck!
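As an illustration only (the Contract_Renewal column and its day-month-year layout are assumptions borrowed from the question, and df is taken to be the Spark equivalent of the pandas dataframe), the whole transformation can stay in built-in column functions:
import pyspark.sql.functions as F

# Sketch: assumes 'Contract_Renewal' holds strings like '21-07-2019' (day-month-year)
df = (df
      .withColumn("day",   F.regexp_extract("Contract_Renewal", r"^(\d{2})", 1))
      .withColumn("month", F.regexp_extract("Contract_Renewal", r"-(\d{2})-", 1))
      .withColumn("year",  F.regexp_extract("Contract_Renewal", r"(\d{4})$", 1))
      # Reassemble into an ISO-style string, parse it, and compute days until renewal
      .withColumn("renewal_date", F.to_date(F.concat_ws("-", "year", "month", "day")))
      .withColumn("days_until", F.datediff("renewal_date", F.current_date())))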
The latest version of PySpark provides a way to run the apply() function by leveraging pandas. You can find the example at PySpark apply Function to Column:
# Imports
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Fee': [20000, 25000, 30000, 22000, np.NaN],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}

# Create a pandas-on-Spark DataFrame
psdf = ps.DataFrame(technologies)
print(psdf)

def add(row):
    return row[0] + row[1]

addDF = psdf.apply(add, axis=1)
print(addDF)

Python Pandas dataframe, how to integrate new columns into a new csv

Guys, I need a bit of help with Pandas and would greatly appreciate your input.
My original file looks like this:
I would like to convert it by merging some pairs of columns (generating their averages) and return a new file looking like this:
Also, if possible, I would like to split the column 'RateDateTime' into two columns: one containing the date, the other containing only the time. How should I do it? I tried the code below but it doesn't work:
import pandas as pd
dateparse = lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
df = pd.read_csv('data.csv', parse_dates=['RateDateTime'], index_col='RateDateTime',date_parser=dateparse)
a=pd.to_numeric(df['RateAsk_open'])
b=pd.to_numeric(df['RateAsk_high'])
c=pd.to_numeric(df['RateAsk_low'])
d=pd.to_numeric(df['RateAsk_close'])
e=pd.to_numeric(df['RateBid_open'])
f=pd.to_numeric(df['RateBid_high'])
g=pd.to_numeric(df['RateBid_low'])
h=pd.to_numeric(df['RateBid_close'])
df['Open'] = (a+e) /2
df['High'] = (b+f) /2
df['Low'] = (c+g) /2
df['Close'] = (d+h) /2
grouped = df.groupby('CurrencyPair')
Open=grouped['Open']
High=grouped['High']
Low=grouped['Low']
Close=grouped['Close']
w=pd.concat([Open, High,Low,Close], axis=1, keys=['Open', 'High','Low','Close'])
w.to_csv('w.csv')
Python returns:
TypeError: cannot concatenate object of type "<class 'pandas.core.groupby.groupby.SeriesGroupBy'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Can someone help me please? Many thanks!!!
IIUYC, you don't need grouping here. You can simply add the new columns to the existing dataframe and specify which columns to save to the CSV file in the to_csv method. Here is an example:
df['Open'] = df[['RateAsk_open', 'RateBid_open']].mean(axis=1)
df['RateDate'] = df['RateDateTime'].dt.date
df['RateTime'] = df['RateDateTime'].dt.time
df.to_csv('w.csv', columns=['CurrencyPair', 'Open', 'RateDate', 'RateTime'])
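If you also need the averaged High, Low, and Close columns in the output, they can be built the same way and added to the columns list (these lines extrapolate from the answer above; they are not part of it):
df['High'] = df[['RateAsk_high', 'RateBid_high']].mean(axis=1)
df['Low'] = df[['RateAsk_low', 'RateBid_low']].mean(axis=1)
df['Close'] = df[['RateAsk_close', 'RateBid_close']].mean(axis=1)
df.to_csv('w.csv', columns=['CurrencyPair', 'Open', 'High', 'Low', 'Close', 'RateDate', 'RateTime'])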

Create string from all row elements pandas

I have a CSV file and I would like to create a string with all the elements of each row. Let's say that I have the following CSV...
trump,clinton
google,microsoft,linkedin
linux,windows,osx
data science,operating systems
I would like to create a string like so: trump&clinton | google&microsoft&linkedin, and so forth. I did import the file and create a df with pandas. The solution doesn't have to use pandas; if it can be done with import csv, that is acceptable as well.
I need one string per row... each row will become its own string.
Try
df.apply('&'.join, axis=1)
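If the dataframe contains NaN values (the sample rows have different numbers of fields, so short rows may end up padded with NaN), the plain join will raise a TypeError on the float NaNs; a possible workaround, assuming that is the case, is to drop them per row first:
df.apply(lambda row: '&'.join(row.dropna().astype(str)), axis=1)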

Spark Dataframe validating column names for parquet writes

I'm processing events using Dataframes converted from a stream of JSON events which eventually gets written out as Parquet format.
However, some of the JSON events contain spaces in the keys. I want to log such events and filter/drop them from the dataframe before converting it to Parquet, because space and ,;{}()\n\t= are considered special characters in the Parquet schema (CatalystSchemaConverter), as listed in [1] below, and thus are not allowed in column names.
How can I do such validation on the column names of the Dataframe and drop such an event altogether without erroring out the Spark Streaming job?
[1]
Spark's CatalystSchemaConverter
def checkFieldName(name: String): Unit = {
  // ,;{}()\n\t= and space are special characters in Parquet schema
  checkConversionRequirement(
    !name.matches(".*[ ,;{}()\n\t=].*"),
    s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
       |Please use alias to rename it.
     """.stripMargin.split("\n").mkString(" ").trim
  )
}
For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(df.schema).parquet(file)
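If you would rather drop the offending columns than rename them (closer to what the question asks), here is a minimal sketch reusing the same character class that CatalystSchemaConverter checks; df is assumed to be the DataFrame built from the JSON stream:
import re

# Columns whose names contain Parquet's special characters (space and ,;{}()\n\t=)
invalid = re.compile(r"[ ,;{}()\n\t=]")
bad_cols = [c for c in df.columns if invalid.search(c)]
if bad_cols:
    print("Dropping columns with invalid Parquet names:", bad_cols)  # log, then drop
    df = df.drop(*bad_cols)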
You can use a regex to replace all invalid characters with an underscore before you write into parquet. Additionally, strip accents from the column names too.
Here's a normalize function that does this, for both Scala and Python:
Scala
/**
 * Normalize column name by replacing invalid characters with underscore
 * and strips accents
 *
 * @param columns dataframe column names list
 * @return the list of normalized column names
 */
def normalize(columns: Seq[String]): Seq[String] = {
  columns.map { c =>
    org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
  }
}

// using the function
val df2 = df.toDF(normalize(df.columns): _*)
Python
import unicodedata
import re

def normalize(column: str) -> str:
    """
    Normalize a column name by replacing invalid characters with underscores,
    stripping accents and making it lowercase.
    :param column: column name
    :return: normalized column name
    """
    n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
    return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()

# using the function
df = df.toDF(*map(normalize, df.columns))
This is my solution using Regex in order to rename all the dataframe's columns following the parquet convention:
df.columns.foldLeft(df) {
  case (currentDf, oldColumnName) =>
    currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps,
I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry but I have only the pyspark code ready:
from pyspark.sql import functions as F
df_tmp.select(*(F.col("`" + c + "`").alias(c.replace(' ', '_')) for c in df_tmp.columns))
Use alias to rename your fields to names without those special characters.
I encountered this error: "Error in SQL statement: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping Or you can use alias to rename it."
The issue was that I used MAX(COLUM_NAME) when creating a table based on a Parquet / Delta table, and the resulting column was named "MAX(COLUM_NAME)" because I forgot to use an alias, and Parquet files don't support parentheses '()' in column names.
Solved by using aliases (removing the parentheses).
This was fixed in the Spark 3.3.0 release, at least for Parquet files (I tested); it might work with JSON as well.
