Spark Dataframe | Merge multiple rows with missing values - apache-spark

I have a dataframe with a column that is a list of strings and another column that contains the year.
There are a few rows with missing values in the year column.
Year | fields
2020 | IFDSDEP.7
     | IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60
2020 | IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54
I would like to merge rows with or without a year value into a single row. Is there a way to do it?
In production, we can have multiple years and there could be a million rows.
My output should look like this:
Year | fields
2020 | IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60,IFDSIMP.14
Thanks for the help.

If the fields column is a string, you should first split it into an array of strings; that way you can combine the values into a unique list and then join them back together.
Regarding the nulls in the year column, you will have to fill the missing values. You will need a way to know which year to fill in if there are multiple years.
Once you have done that, a groupBy should do the trick.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Your example DataFrame (`spark` is an active SparkSession)
df: DataFrame = spark.createDataFrame(data=[
    [2020, "IFDSDEP.7"],
    [None, "IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60"],
    [2020, "IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54"],
], schema=StructType([
    StructField("year", IntegerType()),
    StructField("fields", StringType()),
])).cache()
df.show(truncate=False)
+----+-----------------------------------------------------+
|year|fields |
+----+-----------------------------------------------------+
|2020|IFDSDEP.7 |
|null|IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60|
|2020|IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54|
+----+-----------------------------------------------------+
# Replace null with 2020 for `year` column
df = df.fillna({"year": 2020})
df.show(truncate=False)
+----+-----------------------------------------------------+
|year|fields |
+----+-----------------------------------------------------+
|2020|IFDSDEP.7 |
|2020|IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60|
|2020|IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54|
+----+-----------------------------------------------------+
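This toy example only contains 2020, so the constant fill above works. If there can be multiple years, you need a rule for which year to fill in; one possibility (purely an assumption for illustration, not something stated in the question) is a forward fill that, instead of the fillna step, gives each null-year row the most recent non-null year in the current row order:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Forward fill (assumes a null-year row appears after the rows of its own year).
# There is no partitionBy, so this moves all the data to a single partition.
w = (Window.orderBy(F.monotonically_increasing_id())
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df_filled = df.withColumn("year", F.last("year", ignorenulls=True).over(w))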
# Transforming the fields column
df = df.withColumn("fields", F.split(F.col("fields"), ","))
df.show(truncate=False)
+----+-----------------------------------------------------------+
|year|fields |
+----+-----------------------------------------------------------+
|2020|[IFDSDEP.7] |
|2020|[IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60]|
|2020|[IFDSIMP.7, IFDSIMP.14, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54]|
+----+-----------------------------------------------------------+
# Aggregate on year and collect all arrays of fields then combine them all and make them distinct
df_agg = df.groupby("year").agg(F.array_distinct(F.flatten(F.collect_list("fields"))))
df_agg.show(truncate=False)
+----+----------------------------------------------------------------------------------+
|year|array_distinct(flatten(collect_list(fields))) |
+----+----------------------------------------------------------------------------------+
|2020|[IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60, IFDSIMP.7, IFDSIMP.14]|
+----+----------------------------------------------------------------------------------+
Breaking down the last part of the code:
F.collect_list("fields") - collects all the fields arrays for the group-by key (year); at this point you have an array of arrays:
[[IFDSDEP.7], [IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60], [IFDSIMP.7, IFDSIMP.14, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54]]
F.flatten() - flattens the sub-arrays into one large array:
[IFDSDEP.7, IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60, IFDSIMP.7, IFDSIMP.14, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54]
F.array_distinct() - deduplicates the values in the array, which gives the result you expect:
[IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60, IFDSIMP.7, IFDSIMP.14]
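An alternative to the arrays-of-arrays approach (not part of the original answer) is to explode the split values and aggregate with collect_set; a minimal sketch, assuming nulls have already been filled and fields has been split into an array:
from pyspark.sql import functions as F

df_alt = (df.withColumn("field", F.explode("fields"))           # one row per (year, field)
            .groupby("year")
            .agg(F.collect_set("field").alias("fields"))        # distinct fields per year
            .withColumn("fields", F.concat_ws(",", "fields")))  # back to a comma-separated string
df_alt.show(truncate=False)
Note that collect_set does not preserve the original order of the fields.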

Related

How to compare only the column names of 2 data frames using pyspark?

I have 2 DataFrames with matched and unmatched column names. I want to compare the column names of both frames and print a table/dataframe with the unmatched column names.
Please can someone help me with this? I have no idea how I can achieve it.
Below is the expectation for DF1, DF2, and the output: the output should show the actual vs. unmatched column names.
Update:
As per the expected output in the question, the requirement is to compare two dataframes with a similar schema but different column names and build a dataframe of the mismatched column names.
Thus, my best bet would be:
from pyspark.sql import Row

df3 = spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df1.schema.names) if x not in df2.schema.names]).toDF("#", "Uncommon Columns From DF1")\
    .join(spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df2.schema.names) if x not in df1.schema.names]).toDF("#", "Uncommon Columns From DF2"), "#")
The catch here is that the schemas should be similar, since the comparison matches column names by their "ordinals", i.e. their respective positions in the schema.
Change the join type to "full_outer" in case there are extra columns in either dataframe.
df3 = spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df1.schema.names) if x not in df2.schema.names]).toDF("#", "Uncommon Columns From DF1")\
    .join(spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df2.schema.names) if x not in df1.schema.names]).toDF("#", "Uncommon Columns From DF2"), "#", "full_outer")
You can easily do this using set operations.
Data Preparation
from io import StringIO
import pandas as pd

# `sql` is an existing SparkSession
s1 = StringIO("""
firstName,lastName,age,city,country
Alex,Smith,19,SF,USA
Rick,Mart,18,London,UK
""")
df1 = pd.read_csv(s1, delimiter=',')
sparkDF1 = sql.createDataFrame(df1)

s2 = StringIO("""
firstName,lastName,age
Alex,Smith,21
""")
df2 = pd.read_csv(s2, delimiter=',')
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---------+--------+---+------+-------+
|firstName|lastName|age| city|country|
+---------+--------+---+------+-------+
| Alex| Smith| 19| SF| USA|
| Rick| Mart| 18|London| UK|
+---------+--------+---+------+-------+
sparkDF2.show()
+---------+--------+---+
|firstName|lastName|age|
+---------+--------+---+
| Alex| Smith| 21|
+---------+--------+---+
Columns - Intersections & Difference
common = set(sparkDF1.columns) & set(sparkDF2.columns)
diff = set(sparkDF1.columns) - set(sparkDF2.columns)
print("Common - ",common)
## Common - {'lastName', 'age', 'firstName'}
print("Difference - ",diff)
## Difference - {'city', 'country'}
Additionally, you can create tables/dataframes from the above variable values, for instance as sketched below.
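For instance, a small sketch (the column name is illustrative) that turns the difference set into a one-column Spark DataFrame:
# `diff` is the set computed above; `sql` is the SparkSession used in the data preparation
diff_df = sql.createDataFrame([(c,) for c in sorted(diff)], ["Uncommon Columns From DF1"])
diff_df.show()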

Extract values from a complex column in PySpark

I have a PySpark dataframe which has a complex column, refer to below value:
ID | value
1  | [{"label":"animal","value":"cat"},{"label":null,"value":"George"}]
I want to add a new column to the PySpark dataframe which basically converts the value into a list of strings. If label is null, the string should contain only the value; if label is not null, the string should be "label:value". So for the above example dataframe, the output should look like below:
ID | new_column
1  | ["animal:cat", "George"]
You can use transform to transform each array element into a string, which is constructed using concat_ws:
df2 = df.selectExpr(
    'id',
    "transform(value, x -> concat_ws(':', x['label'], x['value'])) as new_column"
)
df2.show()
+---+--------------------+
| id| new_column|
+---+--------------------+
| 1|[animal:cat, George]|
+---+--------------------+
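If you prefer the DataFrame API over selectExpr, the same logic can be written with pyspark.sql.functions.transform (available since Spark 3.1); a sketch assuming the same df:
from pyspark.sql import functions as F

df2 = df.select(
    "id",
    F.transform("value", lambda x: F.concat_ws(":", x["label"], x["value"])).alias("new_column"),
)
concat_ws skips nulls, so a null label leaves just the value, exactly as required.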

Searching for substring across multiple columns

I am trying to find a substring across all columns of my spark dataframe using PySpark. I currently know how to search for a substring through one column using filter and contains:
df.filter(df.col_name.contains('substring'))
How do I extend this statement, or utilize another, to search through multiple columns for substring matches?
You can generalize the statement to filter on all columns in one go:
from pyspark.sql.functions import col, when

# Replace values that do not contain the substring with NULL, then drop every
# row that has a NULL, i.e. keep only rows where all columns match.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()
OR
You can simply loop over the columns and apply the same filter (successive filters, so again every column must match):
for c in df.columns:
    df = df.filter(df[c].contains("substring"))
You can also search each column separately, collect the matching rows into a new dataframe, and union the results, like this:
columns = ["language", "else"]
data = [
    ("Java", "Python"),
    ("Python", "100000"),
    ("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show()

schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)
for c in df.columns:
    df2 = df2.unionByName(df.filter(df[c].like("%Python%")))
df2.show()
+--------+------+
|language| else|
+--------+------+
| Python|100000|
| Java|Python|
+--------+------+
The result contains the first 2 rows, because they have the value 'Python' in one of their columns. If a row matched in more than one column it would appear once per match, so you may want to dropDuplicates(); a union-free alternative is sketched below.
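A union-free option (not from the answers above) is to build a single OR condition over all columns with functools.reduce, so each matching row appears exactly once:
from functools import reduce
from pyspark.sql import functions as F

cond = reduce(lambda a, b: a | b, [F.col(c).contains("Python") for c in df.columns])
df.filter(cond).show()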

How to fill out nulls according to another dataframe pyspark

I recently started using pyspark. I have a two-column dataframe with one column containing some nulls, e.g.
df1
A B
1a3b 7
0d4s 12
6w2r null
6w2r null
1p4e null
and another dataframe has the correct mapping, i.e.
df2
A B
1a3b 7
0d4s 12
6w2r 0
1p4e 3
so I want to fill out the nulls in df1 using df2 s.t. the result is:
A B
1a3b 7
0d4s 12
6w2r 0
6w2r 0
1p4e 3
In pandas, I would first create a lookup dictionary from df2 and then use apply on df1 to populate the nulls. But I'm not really sure which functions to use in pyspark; most of the null-replacement examples I have seen are based on simple conditions, for example filling all the nulls in a column with a single constant value.
What I have tried is:
from pyspark.sql.functions import when, col
df1.withColumn('B', when(df.B.isNull(), df2.where(df2.B== df1.B).select('A')))
although I was getting AttributeError: 'DataFrame' object has no attribute '_get_object_id'. The logic is to first filter out the nulls and then replace them with column B's value from df2, but I think df.B.isNull() evaluates the whole column instead of a single value, which is probably not the right way to do it. Any suggestions?
A left join on the common column A and selecting the appropriate columns should get you your desired output:
df1.join(df2, df1.A == df2.A, 'left').select(df1.A, df2.B).show(truncate=False)
which should give you
+----+---+
|A |B |
+----+---+
|6w2r|0 |
|6w2r|0 |
|1a3b|7 |
|1p4e|3 |
|0d4s|12 |
+----+---+
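If df2 might not contain every key, or you want to keep df1's own non-null values untouched, a common variant (a sketch, not the original answer) is to coalesce after the left join:
from pyspark.sql import functions as F

result = (df1.alias("a")
             .join(df2.alias("b"), on="A", how="left")
             .select("A", F.coalesce(F.col("a.B"), F.col("b.B")).alias("B")))
result.show(truncate=False)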

Is there a way to generate rownumber without converting the dataframe into rdd in pyspark 1.3.1? [duplicate]

I have a very big pyspark.sql.dataframe.DataFrame named df.
I need some way of enumerating records, thus being able to access a record with a certain index (or to select a group of records within an index range).
In pandas, I could make just
indexes=[2,3,6,7]
df[indexes]
Here I want something similar (and without converting the dataframe to pandas).
The closest I can get to is:
Enumerating all the objects in the original dataframe by:
indexes=np.arange(df.count())
df_indexed=df.withColumn('index', indexes)
Searching for the values I need using the where() function.
QUESTIONS:
Why doesn't it work and how do I make it work? How do I add a row to a dataframe?
Would it work later to make something like:
indexes=[2,3,6,7]
df1.where("index in indexes").collect()
Any faster and simpler way to deal with it?
It doesn't work because:
the second argument for withColumn should be a Column, not a collection; np.array won't work here
when you pass "index in indexes" as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier
PySpark >= 1.4.0
You can add row numbers using the respective window function and query using the Column.isin method or a properly formatted query string:
from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))
# Using DSL
indexed.where(col("index").isin(set(indexes)))
# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))
It looks like window functions called without a PARTITION BY clause move all the data to a single partition, so the above may not be the best solution after all.
Any faster and simpler way to deal with it?
Not really. Spark DataFrames don't support random row access.
A paired RDD can be accessed using the lookup method, which is relatively fast if the data is partitioned using a HashPartitioner. There is also the indexed-rdd project, which supports efficient lookups.
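For context, a tiny sketch of that lookup path (a plain pair RDD, not a DataFrame):
# Pair RDD partitioned by key, so lookup() only has to scan one partition
pairs = sc.parallelize([(i, chr(97 + i)) for i in range(10)]).partitionBy(4)
pairs.lookup(3)  # ['d']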
Edit:
Independent of PySpark version you can try something like this:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
row = Row("char")
row_with_index = Row("char", "index")
df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)
## +----+
## |char|
## +----+
## | a|
## | b|
## | c|
## | d|
## | e|
## +----+
## only showing top 5 rows
# This part is not tested but should work and save some work later
schema = StructType(
    df.schema.fields[:] + [StructField("index", LongType(), False)])

indexed = (df.rdd                                                   # Extract rdd
           .zipWithIndex()                                          # Add index
           .map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]]))  # Map to rows
           .toDF(schema))                                           # It will work without schema but will be more expensive
# inSet in Spark < 1.3
indexed.where(col("index").isin(indexes))
If you want a number range that's guaranteed not to collide but does not require a .over(partitionBy()) then you can use monotonicallyIncreasingId().
from pyspark.sql.functions import monotonicallyIncreasingId
df.select(monotonicallyIncreasingId().alias("rowId"),"*")
Note though that the values are not particularly "neat". Each partition is given a value range and the output will not be contiguous. E.g. 0, 1, 2, 8589934592, 8589934593, 8589934594.
This was added to Spark on Apr 28, 2015 here: https://github.com/apache/spark/commit/d94cd1a733d5715792e6c4eac87f0d5c81aebbe2
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("Atr4", monotonically_increasing_id())
If you only need incremental values (like an ID) and if there is no
constraint that the numbers need to be consecutive, you could use
monotonically_increasing_id(). The only guarantee when using this
function is that the values will be increasing for each row; however,
the values themselves can differ between executions.
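If you do need consecutive numbers, one common pattern (with the same single-partition caveat as the window approach above) is to order a row_number window by monotonically_increasing_id; a sketch for recent PySpark versions:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(F.monotonically_increasing_id())
df_indexed = df.withColumn("index", F.row_number().over(w))  # 1, 2, 3, ...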
You can certainly add an array for indexing, indeed an array of your choice:
In Scala, first we need to create an indexing Array:
val index_array=(1 to df.count.toInt).toArray
index_array: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
You can now append this column to your DF. To do that, you need to collect the DF as an array, zip it with your index_array, convert the new array back into an RDD, and finally turn it into a DF:
val final_df = sc.parallelize(
  df.collect.map(x => (x(0), x(1))).zip(index_array).map(
    x => (x._1._1.toString, x._1._2.toString, x._2))
).toDF("column_1", "column_2", "index") // one name per resulting column
The indexing will be clearer after that.
monotonicallyIncreasingId() - this will assign row numbers in increasing order but not in sequence.
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 12 | xz |
|---------------------|------------------|
If you want to assign row numbers, use the following trick.
Tested in spark-2.0.1 and greater versions.
df.createOrReplaceTempView("df")
dfRowId = spark.sql("select *, row_number() over (partition by 0) as rowNo from df")
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 2 | xz |
|---------------------|------------------|
Hope this helps.
To select a single row n of a Pyspark DataFrame, try:
df.where(df.id == n).show()
Given a Pyspark DataFrame:
df = spark.createDataFrame([(1, 143.5, 5.6, 28, 'M', 100000),\
(2, 167.2, 5.4, 45, 'M', None),\
(3, None , 5.2, None, None, None),\
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
To select the 3rd row, try:
df.where('id == 3').show()
Or:
df.where(df.id == 3).show()
To select multiple rows by their ids (the 2nd and the 3rd rows in this case), try:
ids = [2, 3]
df.where(df.id.isin(ids)).show()
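The question also asked about selecting a group of records within an index range; once any of the approaches above has added an index column (for example `indexed` from the zipWithIndex example), between covers that:
from pyspark.sql import functions as F

indexed.where(F.col("index").between(2, 7)).show()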
