Pyspark - Split a column and take n elements

Pyspark - Split a column and take n elements - apache-spark

I want to take a column and split a string using a character. As per usual, I understood that the method split would return a list, but when coding I found that the returning object had only the methods getItem or getField with the following descriptions from the API:
#since(1.3)
def getItem(self, key):
"""
An expression that gets an item at position ``ordinal`` out of a list,
or gets an item by key out of a dict.
#since(1.3)
def getField(self, name):
"""
An expression that gets a field by name in a StructField.
Obviously this doesnt meet my requirements, for example for the text within the column "A_B_C_D" I would like to split between "A_B_C_" and "D" in two different columns.
This is the code I'm using
from pyspark.sql.functions import regexp_extract, col, split
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data
split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(3))
Find an example:
from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col, split
l = [("Item1_Item2_ItemN"),("FirstItem_SecondItem_LastItem"),("ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn")]
rdd = sc.parallelize(l)
datax = rdd.map(lambda x: Row(fullString=x))
df = sqlContext.createDataFrame(datax)
split_col=split(df['fullString'],'_')
df=df.withColumn('LastItemOfSplit',split_col.getItem(2))
Result:
fullString LastItemOfSplit
Item1_Item2_ItemN ItemN
FirstItem_SecondItem_LastItem LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn null
My expected result would be having always the last item
fullString LastItemOfSplit
Item1_Item2_ItemN ItemN
FirstItem_SecondItem_LastItem LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn ThisShouldBeInTheLastColumn

You can use getItem(size - 1) to get the last item from the arrays:
Example:
df = spark.createDataFrame([[['A', 'B', 'C', 'D']], [['E', 'F']]], ['split'])
df.show()
+------------+
| split|
+------------+
|[A, B, C, D]|
| [E, F]|
+------------+
import pyspark.sql.functions as F
df.withColumn('lastItem', df.split.getItem(F.size(df.split) - 1)).show()
+------------+--------+
| split|lastItem|
+------------+--------+
|[A, B, C, D]| D|
| [E, F]| F|
+------------+--------+
For your case:
from pyspark.sql.functions import regexp_extract, col, split, size
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data
split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(size(split_col) - 1))

You can pass in a regular expression pattern to split.
The following would work for your example:
from pyspark.sql.functions split
split_col=split(df['fullString'], r"_(?=.+$)")
df = df.withColumn('LastItemOfSplit', split_col.getItem(1))
df.show(truncate=False)
#+--------------------------------------------------------+---------------------------+
#|fullString |LastItemOfSplit |
#+--------------------------------------------------------+---------------------------+
#|Item1_Item2_ItemN |Item2 |
#|FirstItem_SecondItem_LastItem |SecondItem |
#|ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn|ThisShouldBeInTheLastColumn|
#+--------------------------------------------------------+---------------------------+
The pattern means the following:
_ the literal underscore
(?=.+$) positive look-ahead for anything (.) until the end of the string $
This will split the string on the last underscore. Then call .getItem(1) to get the item at index 1 in the resultant list.

Related

How to concat two ArrayType(StringType()) columns element-wise in Pyspark?

I have two ArrayType(StringType()) columns in a spark dataframe, and I want to concatenate the two columns element-wise:
input:
+-------------+-------------+
|col1 |col2 |
+-------------+-------------+
|['a','b'] |['c','d'] |
|['a','b','c']|['e','f','g']|
+-------------+-------------+
output:
+-------------+-------------+----------------+
|col1 |col2 |col3 |
+-------------+-------------+----------------+
|['a','b'] |['c','d'] |['ac', 'bd'] |
|['a','b','c']|['e','f','g']|['ae','bf','cg']|
+-------------+----------- -+----------------+
Thanks.

For Spark 2.4+, you can use zip_with function:
zip_with(left, right, func) - Merges the two given arrays,
element-wise, into a single array using function
df.withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))")).show()
#+------+------+--------+
#| col1| col2| col3|
#+------+------+--------+
#|[a, b]|[c, d]|[ac, bd]|
#+------+------+--------+
Another way using transform function like this:
df.withColumn("col3", expr("transform(col1, (x, i) -> concat(x, col2[i]))"))
The transform function takes as parameters the first array column col1, iterates over its elements and applies a lambda function (x, i) -> concat(x, col2[i]) where x the actual element and i its index used to get the corresponding element from array col2.

Here is an alternative answer that can be used for the updated non-original question. Uses array and array_except to demonstrate the use of such methods. The accepted answer is more elegant.
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Arbitrary max number of elements to apply array over, need not broadcast such a small amount of data afaik.
max_entries = 5
# Gen in this case numeric data, etc. 3 rows with 2 arrays of varying length,but per row constant length.
dfA = spark.createDataFrame([ ( list([x,x+1,4, x+100]), 4, list([x+100,x+200,999,x+500]) ) for x in range(3)], ['array1', 'value1', 'array2'] ).withColumn("s",size(col("array1")))
dfB = spark.createDataFrame([ ( list([x,x+1]), 4, list([x+100,x+200]) ) for x in range(5)], ['array1', 'value1', 'array2'] ).withColumn("s",size(col("array1")))
df = dfA.union(dfB)
# concat the array elements which are variable in size up to a max amount.
df2 = df.select(( [concat(col("array1")[i], col("array2")[i]) for i in range(max_entries)]))
df3 = df2.withColumn("res", array(df2.schema.names))
# Get results but strip out null entires from array.
df3.select(array_except(df3.res, array(lit(None)))).show(truncate=False)
Could not get the s value of column to be passed into range.
This returns:
+------------------------------+
|array_except(res, array(NULL))|
+------------------------------+
|[0100, 1200, 4999, 100500] |
|[1101, 2201, 4999, 101501] |
|[2102, 3202, 4999, 102502] |
|[0100, 1200] |
|[1101, 2201] |
|[2102, 3202] |
|[3103, 4203] |
|[4104, 5204] |
+------------------------------+

It wouldn't really scale, but you could get the 0th and 1st entries in each array and then say col3 is a[0] + b[0] and then a[1] + b[1].
Make all 4 entries separate values and then output them combined.

Here is a generic answer. Just look at res for the result. 2 equally sized arrays, thus n elements for both.
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Gen in this case numeric data, etc. 3 rows with 2 arrays of varying length, but both the same length as in your example
df = spark.createDataFrame([ ( list([x,x+1,4, x+100]), 4, list([x+100,x+200,999,x+500]) ) for x in range(3)], ['array1', 'value1', 'array2'] )
num_array_elements = len(df.select("array1").first()[0])
# concat
df2 = df.select(([ concat(col("array1")[i], col("array2")[i]) for i in range(num_array_elements)]))
df2.withColumn("res", array(df2.schema.names)).show(truncate=False)
returns:

Pyspark Change String Order

I have a dataframe below;
Leadtime
303400
333430
1234111
2356788
258
I completed all the strigns in the data to 7 digits.
filler = udf(lambda x: str(x).zfill(7))
df =df.withColumn('Leadtime',filler('Leadtime'))
Output is;
Leadtime
0303400
0333430
1234111
2356788
0000258
After that,
I want to write a method that will make the first index of the strings the last index as follows;
Leadtime
3034000
3334300
2341111
3567882
0002580
Could you please help me about this?

You can select a substring with substr and concatenate strings with concat:
#string change string
import pyspark.sql.functions as F
l = [('303400',)
,('333430',)
,('1234111',)
,('2356788',)
,('258',)]
df=spark.createDataFrame(l, ['Leadtime'])
filler = F.udf(lambda x: str(x).zfill(7))
df =df.withColumn('Leadtime',filler('Leadtime'))
df.withColumn('Leadtime', F.concat(df.Leadtime.substr(2, 6), df.Leadtime.substr(1, 1)) ).show()
Output:
+--------+
|Leadtime|
+--------+
| 3034000|
| 3334300|
| 2341111|
| 3567882|
| 0002580|
+--------+

efficiently expand array of Row to separate columns

I have a spark dataframe and one of its fields is an array of Row structures. I need to expand it into their own columns. One of the problems is in the array, sometimes a field is missing.
The following is an example:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as udf
spark = SparkSession.builder.getOrCreate()
# data
rows = [{'status':'active','member_since':1990,'info':[Row(tag='name',value='John'),Row(tag='age',value='50'),Row(tag='phone',value='1234567')]},
{'status':'inactive','member_since':2000,'info':[Row(tag='name',value='Tom'),Row(tag='phone',value='1234567')]},
{'status':'active','member_since':2015,'info':[Row(tag='name',value='Steve'),Row(tag='age',value='28')]}]
# create dataframe
df = spark.createDataFrame(rows)
# transform info to dict
to_dict = udf.UserDefinedFunction(lambda s:dict(s),MapType(StringType(),StringType()))
df = df.withColumn("info_dict",to_dict("info"))
# extract name, NA if not exists
extract_name = udf.UserDefinedFunction(lambda s:s.get("name","NA"))
df = df.withColumn("name",extract_name("info_dict"))
# extract age, NA if not exists
extract_age = udf.UserDefinedFunction(lambda s:s.get("age","NA"))
df = df.withColumn("age",extract_age("info_dict"))
# extract phone, NA if not exists
extract_phone = udf.UserDefinedFunction(lambda s:s.get("phone","NA"))
df = df.withColumn("phone",extract_phone("info_dict"))
df.show()
You can see for 'Tom', 'age' is missing; for 'Steve', 'phone' is missing. Like the above code snippet, my current solution is to first transform the array into dict and then parse each individual field into their column. The result is like this:
+--------------------+------------+--------+--------------------+-----+---+-------+
| info|member_since| status| info_dict| name|age| phone|
+--------------------+------------+--------+--------------------+-----+---+-------+
|[[name, John], [a...| 1990| active|[name -> John, ph...| John| 50|1234567|
|[[name, Tom], [ph...| 2000|inactive|[name -> Tom, pho...| Tom| NA|1234567|
|[[name, Steve], [...| 2015| active|[name -> Steve, a...|Steve| 28| NA|
+--------------------+------------+--------+--------------------+-----+---+-------+
I really just want the columns 'status','member_since','name', 'age' and 'phone'. This solution works but rather slow because of the UDF. Is there any faster alternatives? Thanks

I can think of 2 ways to do this using DataFrame functions. I believe the first one should be faster, but the code is much less elegant. The second is more compact, but probably slower.
Method 1: Create Map Dynamically
The heart of this method is to turn your Row into a MapType(). This can be achieved using pyspark.sql.functions.create_map() and some magic using functools.reduce() and operator.add().
from operator import add
import pyspark.sql.functions as f
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(3)]
)
)
The problem is that there isn't a way (AFAIK) to dynamically determine the length of the WrappedArray or iterate through it in an easy way. If a value is missing, this will cause an error because map keys can not be null. However since we know that the list can either contain 1, 2, 3 elements, we can just test for each of these cases.
df.withColumn(
'map',
f.when(f.size(f.col('info')) == 1,
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(1)]
)
)
).otherwise(
f.when(f.size(f.col('info')) == 2,
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(2)]
)
)
).otherwise(
f.when(f.size(f.col('info')) == 3,
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(3)]
)
)
)))
).select(
['member_since', 'status'] + [f.col("map").getItem(k).alias(k) for k in keys]
).show(truncate=False)
The last step turns the 'map' keys into columns using the method described in this answer.
This produces the following output:
+------------+--------+-----+----+-------+
|member_since|status |name |age |phone |
+------------+--------+-----+----+-------+
|1990 |active |John |50 |1234567|
|2000 |inactive|Tom |null|1234567|
|2015 |active |Steve|28 |null |
+------------+--------+-----+----+-------+
Method 2: Use explode, groupBy and pivot
First use pyspark.sql.functions.explode() on the column 'info', and then use the 'tag' and 'value' columns as arguments to create_map():
df.withColumn('id', f.monotonically_increasing_id())\
.withColumn('exploded', f.explode(f.col('info')))\
.withColumn(
'map',
f.create_map(*[f.col('exploded')['tag'], f.col('exploded')['value']]).alias('map')
)\
.select('id', 'member_since', 'status', 'map')\
.show(truncate=False)
#+------------+------------+--------+---------------------+
#|id |member_since|status |map |
#+------------+------------+--------+---------------------+
#|85899345920 |1990 |active |Map(name -> John) |
#|85899345920 |1990 |active |Map(age -> 50) |
#|85899345920 |1990 |active |Map(phone -> 1234567)|
#|180388626432|2000 |inactive|Map(name -> Tom) |
#|180388626432|2000 |inactive|Map(phone -> 1234567)|
#|266287972352|2015 |active |Map(name -> Steve) |
#|266287972352|2015 |active |Map(age -> 28) |
#+------------+------------+--------+---------------------+
I also added a column 'id' using pyspark.sql.functions.monotonically_increasing_id() to make sure we can keep track of which rows belong to the same record.
Now we can explode the map column, groupBy(), and pivot(). We can use pyspark.sql.functions.first() as the aggregate function for the groupBy() because we know there will only be one 'value' in each group.
df.withColumn('id', f.monotonically_increasing_id())\
.withColumn('exploded', f.explode(f.col('info')))\
.withColumn(
'map',
f.create_map(*[f.col('exploded')['tag'], f.col('exploded')['value']]).alias('map')
)\
.select('id', 'member_since', 'status', f.explode('map'))\
.groupBy('id', 'member_since', 'status').pivot('key').agg(f.first('value'))\
.select('member_since', 'status', 'age', 'name', 'phone')\
.show()
#+------------+--------+----+-----+-------+
#|member_since| status| age| name| phone|
#+------------+--------+----+-----+-------+
#| 1990| active| 50| John|1234567|
#| 2000|inactive|null| Tom|1234567|
#| 2015| active| 28|Steve| null|
#+------------+--------+----+-----+-------+

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np
data = [
(1, 1, None),
(1, 2, float(5)),
(1, 3, np.nan),
(1, 4, None),
(1, 5, float(10)),
(1, 6, float("nan")),
(1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output
dataframe with count of nan/null for each column
Note:
The previous questions I found in stack overflow only checks for null & not nan.
That's why I have created a new question.
I know I can use isnull() function in Spark to find number of Null values in Spark column but how to find Nan values in Spark dataframe?

You can use method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+

For null values in the dataframe of pyspark
Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null
# The output in dict where key is column name and value is null values in that column
{'#': 0,
'Name': 0,
'Type 1': 0,
'Type 2': 386,
'Total': 0,
'HP': 0,
'Attack': 0,
'Defense': 0,
'Sp_Atk': 0,
'Sp_Def': 0,
'Speed': 0,
'Generation': 0,
'Legendary': 0}

To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F
def count_missings(spark_df,sort=True):
"""
Counts number of nulls and nans in each column
"""
df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c) for (c,c_type) in spark_df.dtypes if c_type not in ('timestamp', 'string', 'date')]).toPandas()
if len(df) == 0:
print("There are no any missing values!")
return None
if sort:
return df.rename(index={0: 'count'}).T.sort_values("count",ascending=False)
return df
If you want to see the columns sorted based on the number of nans and nulls in descending:
count_missings(spark_df)
# | Col_A | 10 |
# | Col_C | 2 |
# | Col_B | 1 |
If you don't want ordering and see them as a single row:
count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# | 10 | 1 | 2 |

An alternative to the already provided ways is to simply filter on the column like so
import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())
This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.

Here is my one liner.
Here 'c' is the name of the column
from pyspark.sql.functions import isnan, when, count, col, isNull
df.select('c').withColumn('isNull_c',F.col('c').isNull()).where('isNull_c = True').count()

I prefer this solution:
df = spark.table(selected_table).filter(condition)
counter = df.count()
df = df.select([(counter - count(c)).alias(c) for c in df.columns])

Use the following code to identify the null values in every columns using pyspark.
def check_nulls(dataframe):
'''
Check null values and return the null values in pandas Dataframe
INPUT: Spark Dataframe
OUTPUT: Null values
'''
# Create pandas dataframe
nulls_check = pd.DataFrame(dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
columns = dataframe.columns).transpose()
nulls_check.columns = ['Null Values']
return nulls_check
#Check null values
null_df = check_nulls(raw_df)
null_df

from pyspark.sql import DataFrame
import pyspark.sql.functions as fn
# compatiable with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
'decimal',
'double',
'float',
'int',
'bigint',
'smallilnt',
'tinyint',
)
def count_nulls(df: DataFrame) -> DataFrame:
isnan_compat_cols = {c for (c, t) in df.dtypes if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}
return df.select(
[fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
+ [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
)
Builds off of gench and user8183279's answers, but checks via only isnull for columns where isnan is not possible, rather than just ignoring them.
The source code of pyspark.sql.functions seemed to have the only documentation I could really find enumerating these names — if others know of some public docs I'd be delighted.

if you are writing spark sql, then the following will also work to find null value and count subsequently.
spark.sql('select * from table where isNULL(column_value)')

Yet another alternative (improved upon Vamsi Krishna's solutions above):
def check_for_null_or_nan(df):
null_or_nan = lambda x: isnan(x) | isnull(x)
func = lambda x: df.filter(null_or_nan(x)).count()
print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')
check_for_null_or_nan(df)
id2 has 5 nans/nulls

Here is a readable solution because code is for people as much as computers ;-)
df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count'))

Remove duplicated fields in RDD from input data

my input csv data, some rows contains repeated fields or some missing fields, from this data i want to remove the duplicate fields from each row and then all rows should contain all the fields, with value as NULL is wherever it does not contain fields.

Try this:
def transform(line):
"""
>>> s = 'id:111|name:dave|age:33|city:london'
>>> transform(s)
('id:111', {'age': '33', 'name': 'dave', 'city': 'london'})
"""
bits = line.split("|")
key = bits[0]
pairs = [v.split(":") for v in bits[1:]]
return key, {kv[0].strip(): kv[1].strip() for kv in pairs if len(kv) == 2}
rdd = (sc
.textFile("/tmp/sample")
.map(transform))
Find keys:
from operator import attrgetter
keys = rdd.values().flatMap(lambda d: d.keys()).distinct().collect()
Create data frame:
df = rdd.toDF(["id", "map"])
And expand:
df.select(["id"] + [df["map"][k] for k in keys]).show()

So I assume that you have rdd already from the text file. I create one here:
rdd = spark.sparkContext.parallelize([(u'id:111', u'name:dave', u'dept:marketing', u'age:33', u'city:london'),
(u'id:123', u'name:jhon', u'dept:hr', u'city:newyork'),
(u'id:100', u'name:peter', u'dept:marketing', u'name:peter', u'age:30', u'city:london'),
(u'id:222', u'name:smith', u'dept:finance', u'city:boston'),
(u'id:234', u'name:peter', u'dept:service', u'name:peter', u'dept:service', u'age:32', u'city:richmond')])
I just make the function to map the rdd into key and value pair and also remove the duplicated one
from pyspark.sql import Row
from pyspark.sql.types import *
def split_to_dict(l):
l = list(set(l)) # drop duplicate here
kv_list = []
for e in l:
k, v = e.split(':')
kv_list.append({'key': k, 'value': v})
return kv_list
rdd_map = rdd.flatMap(lambda l: split_to_dict(l)).map(lambda x: Row(**x))
df = rdd_map.toDF()
Output example of first 5 rows
+----+---------+
| key| value|
+----+---------+
|city| london|
|dept|marketing|
|name| dave|
| age| 33|
| id| 111|
+----+---------+

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Pyspark - Split a column and take n elements - apache-spark

Related

How to concat two ArrayType(StringType()) columns element-wise in Pyspark?

Pyspark Change String Order

efficiently expand array of Row to separate columns

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

Remove duplicated fields in RDD from input data

Categories

Resources