Separate string by white space in pyspark - apache-spark

I have a column with search queries represented as strings, and I want to split every string into separate words.
Let's say I have this data frame:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true") \
.option("delimiter", "\t") \
.option("inferSchema", "true") \
.csv("/content/drive/MyDrive/my_data.txt")
data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))
from pyspark.sql.functions import array_distinct
from pyspark.sql.functions import udf
data = data.withColumn("New_Data", array_distinct("Query"))
Z = data.drop(data.Query)
+------+------------------------+
|AnonID| New_Data |
+------+------------------------+
| 142|[Big House, Green frog] |
+------+------------------------+
And I want the output to look like this:
+------+--------------------------+
|AnonID| New_Data |
+------+--------------------------+
| 142|[Big, House, Green, frog] |
+------+--------------------------+
I have searched older posts, but I could only find solutions that put each word into a different column, which is not what I want.

To split each string in the array into separate words, you can use the explode and split functions in Spark. The exploded words can then be combined back into an array with collect_set (or collect_list, if duplicates should be kept).
from pyspark.sql.functions import explode, split, collect_set
data = data.withColumn("Phrase", explode(data["New_Data"]))
data = data.withColumn("Word", explode(split(data["Phrase"], " ")))
data = data.groupBy("AnonID").agg(collect_set(data["Word"]).alias("New_Data"))
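A minimal end-to-end sketch of this approach, assuming a small hand-built frame in place of the CSV (the sample rows are hypothetical):
from pyspark.sql import functions as F
df = spark.createDataFrame([(142, "Big House"), (142, "Green frog")], ["AnonID", "Query"])
data = df.groupBy("AnonID").agg(F.array_distinct(F.collect_list("Query")).alias("New_Data"))
words = (data
         .withColumn("Phrase", F.explode("New_Data"))
         .withColumn("Word", F.explode(F.split("Phrase", " ")))
         .groupBy("AnonID")
         .agg(F.collect_set("Word").alias("New_Data")))
words.show(truncate=False)  # e.g. [Big, House, Green, frog]; the order of a set is not guaranteed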

You can do the collect_list first, then use the transform function to split the array elements, then flatten the result, and finally apply array_distinct. Please check out the code and output below.
df=spark.createDataFrame([[142,"Big House"],[142,"Big Green Frog"]],["AnonID","Query"])
import pyspark.sql.functions as F
data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))
data.withColumn("Query", F.array_distinct(F.flatten(F.transform(data["Query"], lambda x: F.split(x, " "))))).show(2, False)
+------+-------------------------+
|AnonID|Query |
+------+-------------------------+
|142 |[Big, House, Green, Frog]|
+------+-------------------------+
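Note that passing a Python lambda to F.transform requires Spark 3.1+. On Spark 2.4/3.0 the same logic can be written as a SQL expression instead, a sketch:
from pyspark.sql.functions import expr
data.withColumn("Query", expr("array_distinct(flatten(transform(Query, x -> split(x, ' '))))")).show(2, False)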

Related

Create new dataframe in pyspark with column names and its associated values in other column using spark/pyspark

I have a dataset like below,
and I would like to create a dataframe like below using the above dataset.
First you need to stack your dataframe, then group by var_name and apply collect_list:
import pyspark.sql.functions as f
expr_columns = ', '.join(map(lambda col: '"{col}", {col}'.format(col=col), df.columns))
expr = "stack(2, {columns}) as (var_name, values)".format(columns=expr_columns)
df_stack = df.selectExpr(expr)
df_final = df_stack.groupBy("var_name").agg(f.collect_list(f.col("values")))
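A small worked sketch, assuming the source frame has just two value columns (named var1 and var2 here purely for illustration, since the original dataset is not shown):
import pyspark.sql.functions as f
df = spark.createDataFrame([(1, 10), (2, 20)], ["var1", "var2"])
expr_columns = ', '.join('"{col}", {col}'.format(col=col) for col in df.columns)
# stack(2, "var1", var1, "var2", var2) turns each input row into two rows
df_stack = df.selectExpr("stack(2, {columns}) as (var_name, values)".format(columns=expr_columns))
df_final = df_stack.groupBy("var_name").agg(f.collect_list("values").alias("values"))
df_final.show()  # var1 -> [1, 2], var2 -> [10, 20]; collect_list order is not guaranteed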

Pyspark - Index from monotonically_increasing_id changes after list aggregation

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results, the index changes in certain cases, specifically on the upper end of the long range when the input data is not too small.
Full example code:
import random
import string
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder\
.appName("test")\
.master("local")\
.config('spark.sql.shuffle.partitions', '8')\
.getOrCreate()
# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
    L = random.randint(1, 3)
    B = ''.join(random.choices(string.ascii_uppercase, k=5))
    for i in range(L):
        C = random.randint(1, 100)
        input_data.append((B,))
        ii += 1
# Create Spark DataFrame:
input_rdd = spark.sparkContext.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)
# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the input is large enough (the problem starts to appear around 10000 rows, though not at 1000), it should show up every time. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected, a workaround is to coalesce(1) the DataFrame before creating the id.
dg and dg2 are two different dataframes, each with their own DAG. These DAGs are executed independently from each other when an action on one of the dataframes is called. So each time show() is called, the DAG of the respective dataframe is evaluated, and during that evaluation f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() being called twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.
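A quick way to see the effect (a sketch, reusing the variables from the question): materialize the cached frame once before building the aggregate, and both outputs should then show the same IDs.
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
dg.count()  # force evaluation so the cached IDs are computed exactly once
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
dg.sort("B", ascending=False).show(10, truncate=False)
dg2.sort("B", ascending=False).show(5, truncate=False)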

PySpark merge 2 column values by index into new list [duplicate]

I have a Pandas dataframe. I tried to join two columns containing string values into a list first, and then, using zip, I joined each element of the lists with '_'. My data set is like below:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I wanted to join these two columns in a third column like below for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in Python using the code below, but the dataframe is quite large and it takes a very long time to run for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data into a Spark dataframe successfully, but I'm having a hard time determining how to replicate Pandas functions with their PySpark equivalents. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
    while index < 3:
        if isinstance(row['column_1'], str):
            row['column_1'] = list(row['column_1'].split(','))
            row['column_2'] = list(row['column_2'].split(','))
            row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark by using the below code
from pyspark.sql.types import ArrayType, IntegerType, StringType
from pyspark.sql.functions import col, split
crash.withColumn("column_1",
                 split(col("column_1"), ",\s*").cast(ArrayType(StringType())).alias("column_1")
                 )
crash.withColumn("column_2",
                 split(col("column_2"), ",\s*").cast(ArrayType(StringType())).alias("column_2")
                 )
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.
A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
.createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
.toDF("column_1", "column_2")
.withColumn("column_1", split("column_1", "\s*,\s*"))
.withColumn("column_2", split("column_2", "\s*,\s*")))
You can just apply it to the result:
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
"zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now, to combine the results, you can use transform (see How to use transform higher-order function? and TypeError: Column is not iterable - How to iterate over ArrayType()?):
from pyspark.sql.functions import expr
df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note: the higher-order function transform and arrays_zip were introduced in Apache Spark 2.4.
For Spark 2.4+, this can be done using only the zip_with function, which zips and concatenates at the same time:
from pyspark.sql.functions import expr
df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function takes 2 arrays to merge, element-wise, using a lambda function (x, y) -> concat(x, '_', y).
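For reference, a self-contained sketch of this expression on the sample data (assuming the two columns are first split into arrays, as above):
from pyspark.sql.functions import expr, split
df = (spark
      .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')], ["column_1", "column_2"])
      .withColumn("column_1", split("column_1", "\s*,\s*"))
      .withColumn("column_2", split("column_2", "\s*,\s*")))
df = df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
df.select("column_3").show(truncate=False)  # [abc_1.0, def_2.0, ghi_3.0]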
You can also use a UDF to zip the split array columns:
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
    return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf, ArrayType(StringType()))
df = df.withColumn('col3', udf1(F.split(df.col1, ','), F.split(df.col2, ',')))
df.show(1, False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
For Spark 3.1+, pyspark.sql.functions.zip_with() now accepts a Python lambda function, so it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))

How to split a spark dataframe column of ArrayType(StructType) to multiple columns in pyspark?

I am reading XML using databricks spark-xml with the below schema. The subelement X_PAT can occur more than once; to handle this I have used ArrayType(StructType). The next transformation is to create multiple columns out of this single column.
<root_tag>
<id>fff9</id>
<X1000>
<X_PAT>
<X_PAT01>IC</X_PAT01>
<X_PAT02>EDISUPPORT</X_PAT02>
<X_PAT03>TE</X_PAT03>
</X_PAT>
<X_PAT>
<X_PAT01>IC1</X_PAT01>
<X_PAT02>EDISUPPORT1</X_PAT02>
<X_PAT03>TE1</X_PAT03>
</X_PAT>
</X1000>
</root_tag>
from pyspark.sql import SparkSession
from pyspark.sql.types import *
jar_path = "/Users/nsrinivas/com.databricks_spark-xml_2.10-0.4.1.jar"
spark = SparkSession.builder.appName("Spark - XML read").master("local[*]") \
.config("spark.jars", jar_path) \
.config("spark.executor.extraClassPath", jar_path) \
.config("spark.executor.extraLibrary", jar_path) \
.config("spark.driver.extraClassPath", jar_path) \
.getOrCreate()
xml_schema = StructType()
xml_schema.add("id", StringType(), True)
x1000 = StructType([
StructField("X_PAT",
ArrayType(StructType([
StructField("X_PAT01", StringType()),
StructField("X_PAT02", StringType()),
StructField("X_PAT03", StringType())]))),
])
xml_schema.add("X1000", x1000, True)
df = spark.read.format("xml").option("rowTag", "root_tag").option("valueTag", False) \
.load("root_tag.xml", schema=xml_schema)
df.select("id", "X1000.X_PAT").show(truncate=False)
I get the output as below:
+------------+--------------------------------------------+
|id |X_PAT |
+------------+--------------------------------------------+
|fff9 |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+------------+--------------------------------------------+
but I want X_PAT to be flattened into multiple columns like below, and then I will rename the columns.
+----+-------+--------+-------+-------+--------+-------+
|id  |X_PAT01|X_PAT02 |X_PAT03|X_PAT01|X_PAT02 |X_PAT03|
+----+-------+--------+-------+-------+--------+-------+
|fff9|IC1    |SUPPORT1|TE1    |IC2    |SUPPORT2|TE2    |
+----+-------+--------+-------+-------+--------+-------+
Then I would rename the new columns as below:
id|XPAT_1_01|XPAT_1_02|XPAT_1_03|XPAT_2_01|XPAT_2_02|XPAT_2_03|
I tried using X1000.X_PAT.* but it throws the below error:
pyspark.sql.utils.AnalysisException: 'Can only star expand struct data types. Attribute: ArrayBuffer(L_1000A, S_PER);'
Any ideas please?
Try this:
df = spark.createDataFrame([('1',[['IC1', 'SUPPORT1', 'TE1'],['IC2', 'SUPPORT2', 'TE2']]),('2',[['IC1', 'SUPPORT1', 'TE1'],['IC2','SUPPORT2', 'TE2']])],['id','X_PAT01'])
Define a function to parse the data:
from pyspark.sql import functions as F
def create_column(df):
    # collect the array from the first row and add each value as a literal column
    data = df.select('X_PAT01').collect()[0][0]
    for each_list in range(len(data)):
        for each_item in range(len(data[each_list])):
            df = df.withColumn('X_PAT_' + str(each_list) + '_0' + str(each_item), F.lit(data[each_list][each_item]))
    return df
Calling:
df = create_column(df)
Output:
This is a simple approach to horizontally explode array elements as per your requirement:
from pyspark.sql.functions import col
df2 = (df1
       .select('id',
               *(col('X_PAT')
                 .getItem(i)   # fetch the nested array elements
                 .getItem(j)   # fetch the individual string elements from each nested array element
                 .alias(f'X_PAT_{i+1}_{str(j+1).zfill(2)}')  # format the column alias
                 for i in range(2)   # outer loop
                 for j in range(3)   # inner loop
                 )
               )
       )
Input vs Output:
Input(df1):
+----+--------------------------------------------+
|id |X_PAT |
+----+--------------------------------------------+
|fff9|[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+----+--------------------------------------------+
Output(df2):
+----+----------+----------+----------+----------+----------+----------+
| id|X_PAT_1_01|X_PAT_1_02|X_PAT_1_03|X_PAT_2_01|X_PAT_2_02|X_PAT_2_03|
+----+----------+----------+----------+----------+----------+----------+
|fff9| IC1| SUPPORT1| TE1| IC2| SUPPORT2| TE2|
+----+----------+----------+----------+----------+----------+----------+
Although this involves for loops, as the operations are directly performed on the dataframe (without collecting/converting to RDD), you should not encounter any issue.
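If the outer and inner sizes are not always 2 and 3, the loop bounds can also be derived from the data instead of being hardcoded (a sketch based on the sample frame above; note it triggers a small job to inspect one row):
first_row = df1.select('X_PAT').first()[0]
n_outer = len(first_row)      # number of nested elements, 2 in the example
n_inner = len(first_row[0])   # number of items per nested element, 3 in the example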

Spark: Reading CSV files from list of paths in a DataFrame Row

I have a Spark DataFrame as follows:
# ---------------------------------
# - column 1 - ... - column 5 -
# ---------------------------------
# - ... - Array of paths
Columns 1 to 4 contain strings and the fifth column contains a list of strings, which are actually paths to CSV files I wish to read as Spark DataFrames. I cannot find any way to read them. Here's a simplified version with just a single column and the column with the list of paths:
from pyspark.sql import SparkSession,Row
spark = SparkSession \
.builder \
.appName('test') \
.getOrCreate()
simpleRDD = spark.sparkContext.parallelize(range(10))
simpleRDD = simpleRDD.map(lambda x: Row(**{'a':x,'paths':['{}_{}.csv'.format(y**2,y+1) for y in range(x+1)]}))
simpleDF = spark.createDataFrame(simpleRDD)
print(simpleDF.head(5))
This gives:
[Row(a=0, paths=['0_1.csv']),
Row(a=1, paths=['0_1.csv', '1_2.csv']),
Row(a=2, paths=['0_1.csv', '1_2.csv', '4_3.csv']),
Row(a=3, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv']),
Row(a=4, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv', '16_5.csv'])]
I would like then to do something like this:
simpleDF = simpleDF.withColumn('data',spark.read.csv(simpleDF.paths))
...but this, of course, does not work.
from pyspark.sql import SparkSession,Row
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName('test') \
.getOrCreate()
inp=[['a','b','c','d',['abc\t1.txt','abc\t2.txt','abc\t3.txt','abc\t4.txt','abc\t5.txt',]],
['f','g','h','i',['def\t1.txt','def\t2.txt','def\t3.txt','def\t4.txt','def\t5.txt',]],
['k','l','m','n',['ghi\t1.txt','ghi\t2.txt','ghi\t3.txt','ghi\t4.txt','ghi\t5.txt',]]
]
inp_data=spark.sparkContext.parallelize(inp)
##Defining the schema
schema = StructType([StructField('field1',StringType(),True),
StructField('field2',StringType(),True),
StructField('field3',StringType(),True),
StructField('field4',StringType(),True),
StructField('field5',ArrayType(StringType(),True))
])
## Create the Data frames
dataframe=spark.createDataFrame(inp_data,schema)
dataframe.createOrReplaceTempView("dataframe")
dataframe.select("field5").filter("field1='a'").show()
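This only selects the list of paths. To actually read the referenced files, one option (a sketch) is to collect the paths and pass the whole list to spark.read.csv, which accepts multiple paths; any reader options (header, delimiter, ...) depend on the files themselves:
rows = dataframe.select("field5").filter("field1 = 'a'").collect()
paths = [p for row in rows for p in row["field5"]]
files_df = spark.read.csv(paths)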
I'm not sure how you intend to store the DataFrame objects once you read them in from their path, but if it's a matter of accessing the values in your DataFrame column, you can use the .collect() method to return your DataFrame as a list of Row objects (just like an RDD).
Each Row object has a .asDict() method that converts it to a Python dictionary object. Once you're there, you can access the values by indexing the dictionary using its key.
Assuming you're content storing the returned DataFrames in a list, you could try the following:
# collect the DataFrame into a list of Rows
rows = simpleDF.collect()
# collect all the values in your `paths` column
# (note that this will return a list of lists)
paths = [row.asDict().get('paths') for row in rows]
# flatten the list of lists
paths_flat = [path for path_list in paths for path in path_list]
# get the unique set of paths
paths_unique = list(set(paths_flat))
# instantiate an empty dictionary in which to collect DataFrames
dfs_dict = {}
for path in paths_unique:
    dfs_dict[path] = spark.read.csv(path)
Your dfs_dict will now contain all of your DataFrames. To get the DataFrame of a particular path, you can access it using the path as the dictionary key:
df_0_01 = dfs_dict['0_1.csv']
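If separate DataFrame objects per file are not strictly needed, another sketch is to read every unique path in a single pass and keep track of which file each row came from:
from pyspark.sql.functions import input_file_name
all_files_df = spark.read.csv(paths_unique).withColumn("source_file", input_file_name())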
