Spark: return null from failed regexp_extract() - apache-spark

Suppose you try to extract a substring from a column of a dataframe. regexp_extract() returns null if the field itself is null, but returns an empty string if the field is not null and the expression is not found. How can you return a null value in the latter case?
from pyspark.sql.functions import regexp_extract
from pyspark.sql.types import StringType
df = spark.createDataFrame([(None),('foo'),('foo_bar')], StringType())
df.select(regexp_extract('value', r'_(.+)', 1).alias('extracted')).show()
# +---------+
# |extracted|
# +---------+
# | null|
# | |
# | bar|
# +---------+

I'm not sure if regexp_extract() could ever return None for a string column. One thing you could do is replace empty strings with None using a user-defined function:
from pyspark.sql.functions import regexp_extract, udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(None),('foo'),('foo_bar')], StringType())
toNoneUDF = udf(lambda val: None if val == "" else val, StringType())
new_df = df.select(regexp_extract('value', r'_(.+)', 1).alias('extracted'))
new_df.withColumn("extracted", toNoneUDF(new_df.extracted)).show()

This should work. Note that when() takes only a condition and a value; the null fallback goes in otherwise():
from pyspark.sql.functions import regexp_extract, when, col, lit
df = spark.createDataFrame([(None),('foo'),('foo_bar')], StringType())
df = df.select(regexp_extract('value', r'_(.+)', 1).alias('extracted'))
df.withColumn(
    'extracted',
    when(col('extracted') != '', col('extracted')).otherwise(lit(None))
).show()

In Spark SQL, I've found a solution to count the number of regex occurrences, ignoring null values:
SELECT COUNT(CASE WHEN rlike(col, "_(.+)") THEN 1 END)
FROM VALUES (NULL), ("foo"), ("foo_bar"), ("") AS tab(col);
Result:
1
I hope this will help some of you.
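As a further sketch (not from the answers above), Spark SQL's built-in nullif can collapse the empty-string case to NULL directly; the example data below mirrors the question and is only illustrative:
from pyspark.sql import functions as F
df = spark.createDataFrame([(None,), ('foo',), ('foo_bar',)], ['value'])
df.select(
    F.expr("nullif(regexp_extract(value, '_(.+)', 1), '')").alias('extracted')
).show()
# rows where the pattern is not found come back as null instead of ''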

Related

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

I found a similar question (link), but no answer explained how to fix the issue.
I want to make a UDF that extracts words from a column. So I want to create a column named new_column by applying my UDF to old_column:
from pyspark.sql.functions import col, regexp_extract
re_string = 'some|words|I|need|to|match'
def regex_extraction(x, re_string):
    return regexp_extract(x, re_string, 0)
extracting = udf(lambda row: regex_extraction(row, re_string))
df = df.withColumn("new_column", extracting(col('old_column')))
AttributeError: 'NoneType' object has no attribute '_jvm'
How can I fix my function? I have many columns and want to loop through a list of columns and apply my UDF.
You don't need a UDF here. A UDF is required when you cannot do something using PySpark itself and need plain Python functions or libraries. In your case you can have a function which accepts a column and returns a column, and that's it; a UDF is not needed.
from pyspark.sql.functions import regexp_extract
df = spark.createDataFrame([('some match',)], ['old_column'])
re_string = 'some|words|I|need|to|match'
def regex_extraction(x, re_string):
    return regexp_extract(x, re_string, 0)
df = df.withColumn("new_column", regex_extraction('old_column', re_string))
df.show()
# +----------+----------+
# |old_column|new_column|
# +----------+----------+
# |some match| some|
# +----------+----------+
"Looping" through columns in a list can be implemented this way:
from pyspark.sql.functions import regexp_extract
cols = ['col1', 'col2']
df = spark.createDataFrame([('some match', 'match')], cols)
re_string = 'some|words|I|need|to|match'
def regex_extraction(x, re_string):
    return regexp_extract(x, re_string, 0)
df = df.select(
    '*',
    *[regex_extraction(c, re_string).alias(f'new_{c}') for c in cols]
)
df.show()
# +----------+-----+--------+--------+
# | col1| col2|new_col1|new_col2|
# +----------+-----+--------+--------+
# |some match|match| some| match|
# +----------+-----+--------+--------+
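If you prefer an explicit loop over the columns list, as the question mentions, withColumn works too; a minimal sketch reusing the same helper:
for c in cols:
    df = df.withColumn(f'new_{c}', regex_extraction(c, re_string))
df.show()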

Find specific word in input file and read the data from next row in PySpark

Input File:
32535
1243
1q332|2
EOH
CUST_ID|CUST_NAME|ORDER_NO|ORDER_ITEM
1|TAM|222|ORANGE
2|AAM|322|APPLE
output
CUST_ID|CUST_NAME|ORDER_NO|ORDER_ITEM
1|TAM|222|ORANGE
2|AAM|322|APPLE
The input and output are shown above. I want to read the input file, find the word 'EOH', and build a DataFrame from the lines after it. Rows before 'EOH' should be ignored; the output format is given above.
Sometimes a few extra rows may be added before 'EOH', so the split must be based on the 'EOH' word.
Please share PySpark code.
I don't know if this is the best approach, but here it is:
from pyspark.sql.window import Window
import pyspark.sql.functions as f

df = (spark
      .read
      .format('csv')
      .option('delimiter', '|')
      .schema('CUST_ID string, CUST_NAME string, ORDER_NO integer, ORDER_ITEM STRING')
      .load(YOUR_PATH))

# Identifying which line is the header
df = (df
      .withColumn('id', f.monotonically_increasing_id())
      .withColumn('header', f.lag('CUST_ID', default=False).over(Window.orderBy('id')) == f.lit('EOH')))

# Collecting only header row to python context
header = df.where(f.col('header')).head()

# Removing all rows before header
df = (df
      .where(f.col('id') > f.lit(header.id))
      .drop('id', 'header'))
df.show()
Output:
+-------+---------+--------+----------+
|CUST_ID|CUST_NAME|ORDER_NO|ORDER_ITEM|
+-------+---------+--------+----------+
| 1| TAM| 222| ORANGE|
| 2| AAM| 322| APPLE|
+-------+---------+--------+----------+
If the schema is fixed (as in the comment), you can pass it into from_csv. This assumes the file was read as plain text, so df has a single value column:
import pyspark.sql.functions as F
schema = """
CUST_ID INT,
CUST_NAME STRING,
ORDER_NO INT,
ORDER_ITEM STRING
"""
# if you know for sure all fields are not null then
(df
 .withColumn('value', F.from_csv('value', schema, {'sep': '|'}))
 .select('value.*')
 .where(
     F.col('CUST_ID').isNotNull() &
     F.col('CUST_NAME').isNotNull() &
     F.col('ORDER_NO').isNotNull() &
     F.col('ORDER_ITEM').isNotNull()
 )
 .show(10, False)
)
# if you are unsure about the nulls, you can filter them out before processing (there are many other options too)
(df
 .withColumn('tmp', F.size(F.split('value', r'\|')))
 .where((F.col('tmp') == 4) & (~F.col('value').startswith('CUST_ID')))
 .withColumn('value', F.from_csv('value', schema, {'sep': '|'}))
 .select('value.*')
 .show(10, False)
)
# +-------+---------+--------+----------+
# |CUST_ID|CUST_NAME|ORDER_NO|ORDER_ITEM|
# +-------+---------+--------+----------+
# |1 |TAM |222 |ORANGE |
# |2 |AAM |322 |APPLE |
# +-------+---------+--------+----------+
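For completeness, a minimal sketch of how the df with a single value column used above could be produced and trimmed to the lines after 'EOH' (the path and variable names are placeholders):
import pyspark.sql.functions as F
# Assumption: read every line of the file as-is into a single 'value' column
df = spark.read.text(YOUR_PATH)
# Keep only the lines after the 'EOH' marker, using a running id as in the first answer
df = df.withColumn('id', F.monotonically_increasing_id())
eoh_id = df.where(F.col('value') == 'EOH').head().id
df = df.where(F.col('id') > eoh_id).drop('id')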

Pyspark Change String Order

I have the dataframe below:
Leadtime
303400
333430
1234111
2356788
258
I padded all the strings in the data to 7 digits:
from pyspark.sql.functions import udf
filler = udf(lambda x: str(x).zfill(7))
df = df.withColumn('Leadtime', filler('Leadtime'))
The output is:
Leadtime
0303400
0333430
1234111
2356788
0000258
After that, I want to write a method that moves the first character of each string to the last position, as follows:
Leadtime
3034000
3334300
2341111
3567882
0002580
Could you please help me with this?
You can select a substring with substr and concatenate strings with concat:
# string change string
import pyspark.sql.functions as F
l = [('303400',),
     ('333430',),
     ('1234111',),
     ('2356788',),
     ('258',)]
df = spark.createDataFrame(l, ['Leadtime'])
filler = F.udf(lambda x: str(x).zfill(7))
df = df.withColumn('Leadtime', filler('Leadtime'))
df.withColumn('Leadtime', F.concat(df.Leadtime.substr(2, 6), df.Leadtime.substr(1, 1))).show()
Output:
+--------+
|Leadtime|
+--------+
| 3034000|
| 3334300|
| 2341111|
| 3567882|
| 0002580|
+--------+
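As a side note, the zero-padding itself can also be done without a UDF by using the built-in lpad; a small sketch:
import pyspark.sql.functions as F
df = spark.createDataFrame([('303400',), ('258',)], ['Leadtime'])
df = df.withColumn('Leadtime', F.lpad('Leadtime', 7, '0'))
df = df.withColumn('Leadtime', F.concat(df.Leadtime.substr(2, 6), df.Leadtime.substr(1, 1)))
df.show()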

Getting the table name from a Spark Dataframe

If I have a dataframe created as follows:
df = spark.table("tblName")
Is there any way I can get tblName back from df?
You can extract it from the plan:
df.logicalPlan().argString().replace("`","")
We can extract the table name from a DataFrame by parsing its unresolved logical plan.
Please follow the method below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, HiveTableRelation}
import org.apache.spark.sql.execution.datasources.LogicalRelation

def getTableName(df: DataFrame): String = {
  Seq(df.queryExecution.logical, df.queryExecution.optimizedPlan).flatMap{_.collect{
    case LogicalRelation(_, _, catalogTable: Option[CatalogTable], _) =>
      if (catalogTable.isDefined) {
        Some(catalogTable.get.identifier.toString())
      } else None
    case hive: HiveTableRelation => Some(hive.tableMeta.identifier.toString())
  }
  }.flatten.head
}
scala> val df = spark.table("db.table")
scala> getTableName(df)
res: String = `db`.`table`
The following utility function may be helpful to determine the table name from a given DataFrame.
import re
import typing
import pyspark.sql

def get_dataframe_tablename(df: pyspark.sql.DataFrame) -> typing.Optional[str]:
    """
    If the dataframe was created from an underlying table (e.g. spark.table('dual') or
    spark.sql("select * from dual")), this function will return the
    fully qualified table name (e.g. `default`.`dual`) as output, otherwise it will return None.
    Tested on: Python 3.7, Spark 3.0.1, but it should work with Spark >= 2.x and Python >= 3.4 too.
    Examples:
    >>> get_dataframe_tablename(spark.table('dual'))
    `default`.`dual`
    >>> get_dataframe_tablename(spark.sql("select * from dual"))
    `default`.`dual`
    It inspects the output of `df.explain()` to determine whether the df was created from a table or not.
    :param df: input dataframe whose underlying table name will be returned
    :return: table name or None
    """
    def _explain(_df: pyspark.sql.DataFrame) -> str:
        # df.explain() does not take a parameter to capture the output;
        # by default it dumps the output on stdout
        import contextlib
        import io
        with contextlib.redirect_stdout(io.StringIO()) as f:
            _df.explain()
        f.seek(0)  # Rewind stream position
        explanation = f.readlines()[1]  # Ignore first output line (== Physical Plan ==)
        return explanation

    pattern = re.compile("Scan hive (.+), HiveTableRelation (.+?), (.+)")
    output = _explain(df)
    match = pattern.search(output)
    return match.group(2) if match else None
The three lines of code below will give the table and database name:
import org.apache.spark.sql.execution.FileSourceScanExec
val df = session.table("dealer")
df.queryExecution.sparkPlan.asInstanceOf[FileSourceScanExec].tableIdentifier
Any answer on this one yet? I found a way, but it's probably not the prettiest. You can access the table name by retrieving the physical execution plan and then doing some string-splitting magic on it.
Let's say you have a table from database_name.tblName. The following should work:
execution_plan = df._jdf.queryExecution().simpleString()
table_name = execution_plan.split('FileScan')[1].split('[')[0].split('.')[1]
The first line will return your execution plan in a string format. That will look similar to this:
== Physical Plan ==\n*(1) ColumnarToRow\n+- FileScan parquet database_name.tblName[column1#2880,column2ban#2881] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:/mnt/lake/database_name/table_name], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<column1:string,column2:string...\n\n'
After that you can run some string splitting to access the relevant information. The first split on 'FileScan' gives you everything after it, which is the element you are interested in. Splitting that on '[' and taking the first element leaves ' parquet database_name.tblName'; the final split on '.' then returns tblName.
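A slightly more robust sketch of the same idea, using a regular expression on the plan string instead of positional splits (the "FileScan <format> db.table[" layout is an assumption about the plan text):
import re
execution_plan = df._jdf.queryExecution().simpleString()
match = re.search(r'FileScan \w+ ([\w.]+)\[', execution_plan)
table_name = match.group(1).split('.')[-1] if match else None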
You can create a table from df. But if the table is a local temporary view or a global temporary view, you should drop it (sqlContext.dropTempTable) before creating a view with the same name, or use the create-or-replace variants (df.createOrReplaceGlobalTempView or df.createOrReplaceTempView). If the table was registered with registerTempTable, you can register the same name again without error:
#Create data frame
>>> d = [('Alice', 1)]
>>> test_df = spark.createDataFrame(sc.parallelize(d), ['name','age'])
>>> test_df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tables
>>> test_df.createTempView("tbl1")
>>> test_df.registerTempTable("tbl2")
>>> sqlContext.tables().show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | tbl1| true|
| | tbl2| true|
+--------+---------+-----------+
#create data frame from tbl1
>>> df = spark.table("tbl1")
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tbl1 again using the df data frame. It will raise an error
>>> df.createTempView("tbl1")
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Temporary view 'tbl1' already exists;"
#drop and create again
>>> sqlContext.dropTempTable('tbl1')
>>> df.createTempView("tbl1")
>>> spark.sql('select * from tbl1').show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create data frame from tbl2 and replace name value
>>> df = spark.table("tbl2")
>>> df = df.replace('Alice', 'Bob')
>>> df.show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+
#create tbl2 again using the df data frame
>>> df.registerTempTable("tbl2")
>>> spark.sql('select * from tbl2').show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np
data = [
    (1, 1, None),
    (1, 2, float(5)),
    (1, 3, np.nan),
    (1, 4, None),
    (1, 5, float(10)),
    (1, 6, float("nan")),
    (1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output
dataframe with count of nan/null for each column
Note:
The previous questions I found on Stack Overflow only check for null, not NaN. That's why I have created a new question.
I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
For null values in a PySpark dataframe:
Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null
# The output in dict where key is column name and value is null values in that column
{'#': 0,
'Name': 0,
'Type 1': 0,
'Type 2': 386,
'Total': 0,
'HP': 0,
'Attack': 0,
'Defense': 0,
'Sp_Atk': 0,
'Sp_Def': 0,
'Speed': 0,
'Generation': 0,
'Legendary': 0}
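A variant of the same dictionary idea that also counts NaNs for float/double columns (a sketch; isnan is only defined for numeric columns):
from pyspark.sql.functions import isnan
dict_null_nan = {}
for c, t in df.dtypes:
    cond = df[c].isNull()
    if t in ('float', 'double'):  # only add the NaN check where it applies
        cond = cond | isnan(df[c])
    dict_null_nan[c] = df.filter(cond).count()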
To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F

def count_missings(spark_df, sort=True):
    """
    Counts number of nulls and nans in each column
    """
    df = spark_df.select([
        F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
        for (c, c_type) in spark_df.dtypes if c_type not in ('timestamp', 'string', 'date')
    ]).toPandas()
    if len(df) == 0:
        print("There are no missing values!")
        return None
    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)
    return df
If you want to see the columns sorted by the number of nans and nulls in descending order:
count_missings(spark_df)
# | Col_A | 10 |
# | Col_C | 2 |
# | Col_B | 1 |
If you don't want ordering and see them as a single row:
count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# | 10 | 1 | 2 |
An alternative to the already provided ways is to simply filter on the column, like so:
import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())
This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
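If you need the count itself rather than the filtered rows, a minimal follow-up to the snippet above:
null_count = df.where(F.col('columnNameHere').isNull()).count()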
Here is my one-liner. Here 'c' is the name of the column:
from pyspark.sql.functions import col
df.select('c').withColumn('isNull_c', col('c').isNull()).where('isNull_c = True').count()
I prefer this solution (count(c) skips nulls, so the total row count minus count(c) gives the null count per column):
from pyspark.sql.functions import count
df = spark.table(selected_table).filter(condition)
counter = df.count()
df = df.select([(counter - count(c)).alias(c) for c in df.columns])
Use the following code to identify the null values in every column using PySpark.
import pandas as pd
from pyspark.sql.functions import count, when, isnull

def check_nulls(dataframe):
    '''
    Check null values and return the null values in pandas Dataframe
    INPUT: Spark Dataframe
    OUTPUT: Null values
    '''
    # Create pandas dataframe
    nulls_check = pd.DataFrame(
        dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
        columns=dataframe.columns).transpose()
    nulls_check.columns = ['Null Values']
    return nulls_check

#Check null values
null_df = check_nulls(raw_df)
null_df
from pyspark.sql import DataFrame
import pyspark.sql.functions as fn

# compatible with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
    'decimal',
    'double',
    'float',
    'int',
    'bigint',
    'smallint',
    'tinyint',
)

def count_nulls(df: DataFrame) -> DataFrame:
    isnan_compat_cols = {c for (c, t) in df.dtypes if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}
    return df.select(
        [fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
        + [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
    )
Builds off of gench and user8183279's answers, but checks via only isnull for columns where isnan is not possible, rather than just ignoring them.
The source code of pyspark.sql.functions seemed to have the only documentation I could really find enumerating these names — if others know of some public docs I'd be delighted.
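A quick usage sketch on the question's dataframe:
count_nulls(df).show()
# for the question's data this reports 0 for session and timestamp1, and 5 for id2 (nulls plus NaNs)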
If you are writing Spark SQL, the following will also work to find null values, and you can count them subsequently:
spark.sql('select * from table where isNULL(column_value)')
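And a sketch of the counting variant in SQL (the table and column names are placeholders, as above):
spark.sql('select count(*) as null_count from table where isNULL(column_value)').show()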
Yet another alternative (improved upon Vamsi Krishna's solutions above):
from pyspark.sql.functions import isnan, isnull

def check_for_null_or_nan(df):
    null_or_nan = lambda x: isnan(x) | isnull(x)
    func = lambda x: df.filter(null_or_nan(x)).count()
    print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i) != 0], sep='\n')

check_for_null_or_nan(df)
id2 has 5 nans/nulls
Here is a readable solution, because code is for people as much as computers ;-)
df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count')
