Why does createDataFrame reorder the columns? - apache-spark

Suppose I am creating a data frame from a list without a schema:
data = [Row(c=0, b=1, a=2), Row(c=10, b=11, a=12)]
df = spark.createDataFrame(data)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 2| 1| 0|
| 12| 11| 10|
+---+---+---+
Why are the columns reordered in alphabetical order?
Can I preserve the original order of the columns without adding a schema?

Why are the columns reordered in alphabetical order?
Because a Row created with **kwargs sorts its fields by name.
This design choice is required to address the issues described in PEP 468. Please check SPARK-12467 for a discussion.
Can I preserve the original order of the columns without adding a schema?
Not with **kwargs. You can use plain tuples:
df = spark.createDataFrame([(0, 1, 2), (10, 11, 12)], ["c", "b", "a"])
or namedtuple:
from collections import namedtuple
CBA = namedtuple("CBA", ["c", "b", "a"])
spark.createDataFrame([CBA(0, 1, 2), CBA(10, 11, 12)])
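For reference, a quick check (a sketch based on the values above) that both approaches keep the c, b, a order:
spark.createDataFrame([CBA(0, 1, 2), CBA(10, 11, 12)]).show()
# +---+---+---+
# |  c|  b|  a|
# +---+---+---+
# |  0|  1|  2|
# | 10| 11| 12|
# +---+---+---+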

Related

Spark dataframe slice

I've a table with (millions of) entries along the lines of the following example read into a Spark dataframe (sdf):
Id  | C1     | C2
xx1 | c118   | c219
xx1 | c113   | c218
xx1 | c118   | c214
acb | c121   | c201
e3d | c181   | c221
e3d | c132   | c252
abq | c141   | c290
... | ...    | ...
vy1 | c13023 | C23021
I'd like to get a smaller subset of these Id's for further processing. I identify the unique set of Id's in the table using sdf_id = sdf.select("Id").dropDuplicates().
What is an efficient way from here to filter the data (C1, C2) related to, say, 100 randomly selected Ids?
There are several ways to achieve what you want.
My sample data
df = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
], ['id', 'col'])
The initial step is getting the sample of IDs that you want:
ids = df.select('id').distinct().sample(0.2)  # 0.2 is 20%, adjust as needed
+---+
| id|
+---+
| 1|
+---+
Approach #1: using inner join
Since you have two dataframes, you can just perform a single inner join to get all records from df for each id in ids. Note that F.broadcast is there to boost performance, because ids is expected to be small enough; feel free to take it out if you want. Performance-wise, this approach is preferred.
from pyspark.sql import functions as F

df.join(F.broadcast(ids), on=['id'], how='inner').show()
+---+---+
| id|col|
+---+---+
| 1| a|
| 1| b|
| 1| c|
+---+---+
Approach #2: using isin
You can't simply pass ids.collect() as the list of IDs, because that returns a list of Row objects; you have to loop through it to extract the exact column that you want (id in this case).
df.where(F.col('id').isin([r['id'] for r in ids.collect()])).show()
+---+---+
| id|col|
+---+---+
| 1| a|
| 1| b|
| 1| c|
+---+---+
Since you already have the dataframe of unique Ids, you can further sample it down to your desired fraction and filter based on that.
There are other ways you can sample random Ids, which can be found here:
Sampling
### Assuming the DF has 1 million records, 100 records is 0.01%, i.e. a sample fraction of 0.0001
sdf_id = [r['Id'] for r in sdf.select("Id").dropDuplicates().sample(0.0001).collect()]
Filter
from pyspark.sql import functions as F

sdf_filtered = sdf.filter(F.col('Id').isin(sdf_id))
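Putting the two pieces together on the question's sdf (a sketch; sdf, Id and the 0.0001 fraction come from the snippets above):
from pyspark.sql import functions as F

# sample roughly 100 of the unique Ids, then broadcast-join to keep only their rows
ids = sdf.select("Id").dropDuplicates().sample(0.0001)
sdf_filtered = sdf.join(F.broadcast(ids), on=["Id"], how="inner")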

Pyspark -- Filter ArrayType rows which contain null value

I am a beginner of PySpark. Suppose I have a Spark dataframe like this:
test_df = spark.createDataFrame(pd.DataFrame({"a":[[1,2,3], [None,2,3], [None, None, None]]}))
Now I would like to filter for rows whose array does NOT contain any None value (in my case, keep only the first row).
I have tried to use:
test_df.filter(array_contains(test_df.a, None))
But it does not work and throws an error:
AnalysisException: "cannot resolve 'array_contains(a, NULL)' due to
data type mismatch: Null typed values cannot be used as
arguments;;\n'Filter array_contains(a#166, null)\n+- LogicalRDD
[a#166], false\n
How should I filter in the correct way? Many thanks!
You need to use the forall function.
from pyspark.sql import functions as F

df = test_df.filter(F.expr('forall(a, x -> x is not null)'))
df.show(truncate=False)
You can use the aggregate higher-order function to count the number of nulls and filter for rows where the count is 0. This lets you drop all rows with at least one None within the array.
data_ls = [
    (1, ["A", "B"]),
    (2, [None, "D"]),
    (3, [None, None])
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['a', 'b'])
data_sdf.show()
+---+------+
| a| b|
+---+------+
| 1|[A, B]|
| 2| [, D]|
| 3| [,]|
+---+------+
from pyspark.sql import functions as func

# count the number of nulls within each array
data_sdf. \
    withColumn('c', func.expr('aggregate(b, 0, (x, y) -> x + int(y is null))')). \
    show()
+---+------+---+
| a| b| c|
+---+------+---+
| 1|[A, B]| 0|
| 2| [, D]| 1|
| 3| [,]| 2|
+---+------+---+
Once you have the c column, you can apply the filter as filter(func.col('c') == 0).
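For example, a sketch chaining the two steps on the same data_sdf:
# keep only the rows whose array contains no nulls
data_sdf. \
    withColumn('c', func.expr('aggregate(b, 0, (x, y) -> x + int(y is null))')). \
    filter(func.col('c') == 0). \
    show()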
You can use the exists function:
test_df.filter("!exists(a, x -> x is null)").show()
#+---------+
#| a|
#+---------+
#|[1, 2, 3]|
#+---------+

Merge two data frames and retrieve all the information from the right data frame

Hi Stack Overflow community. I am new to spark/pyspark and I have this question.
Say I have two data frames (df2 being the data set of interest with a lot of records, and df1 being a new update). I want to join the two data frames on multiple columns (if possible) and get the updated information from df1 when there is a key match, otherwise keep the df2 information as it is.
Here is my sample data set and my expected output (df30):
df1 = spark.createDataFrame([("a", 4, 'x'), ("b", 3, 'y'), ("c", 4, 'z'), ("d", 4, 'l')], ["C1", "C2", "C3"])
df2 = spark.createDataFrame([("a", 4, 5), ("f", 3, 4), ("b", 3, 6), ("c", 4, 7), ("d", 4, 8)], ["C1", "C2","C3"])
df1_s = df1.select([col(c).alias('s_' + c) for c in df1.columns])
You can use a left join on your list of key columns, then select the remaining columns with a list comprehension, using the coalesce function to pick the updated value from df1 when there is a match and fall back to df2 otherwise:
from pyspark.sql import functions as F
join_columns = ["C1"]
result = df2.alias("df2").join(
    df1.alias("df1"),
    join_columns,
    "left"
).select(
    *join_columns,
    *[
        F.coalesce(f"df1.{c}", f"df2.{c}").alias(c)
        for c in df1.columns if c not in join_columns
    ]
)
result.show()
#+---+---+---+
#| C1| C2| C3|
#+---+---+---+
#| f| 3| 4|
#| d| 4| l|
#| c| 4| z|
#| b| 3| y|
#| a| 4| x|
#+---+---+---+
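The question also asks about joining on multiple columns; the same pattern extends directly by listing more keys (a sketch, reusing the snippet above):
join_columns = ["C1", "C2"]
# the list comprehension now only coalesces the remaining column, C3
result = df2.alias("df2").join(df1.alias("df1"), join_columns, "left").select(
    *join_columns,
    *[F.coalesce(f"df1.{c}", f"df2.{c}").alias(c) for c in df1.columns if c not in join_columns]
)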

How to overwrite entire existing column in Spark dataframe with new column?

I want to overwrite a Spark column with a new column which is a binary flag.
I tried directly overwriting the column id2, but why does it not work like an in-place operation in Pandas?
How can I do it without using withColumn() to create a new column and drop() to drop the old column?
I know that a Spark dataframe is immutable; is that the reason, or is there a different way to overwrite without using withColumn() & drop()?
df2 = spark.createDataFrame(
    [(1, 1, float('nan')), (1, 2, float(5)), (1, 3, float('nan')), (1, 4, float('nan')),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))
df2.select(df2.id2 > 0).show()
+---------+
|(id2 > 0)|
+---------+
| true|
| true|
| true|
| true|
| true|
| true|
| true|
+---------+
# Attempting to overwrite df2.id2
df2.id2=df2.select(df2.id2 > 0).withColumnRenamed('(id2 > 0)','id2')
df2.show()
# Overwriting unsuccessful
+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
| 1| 1| NaN|
| 1| 2| 5.0|
| 1| 3| NaN|
| 1| 4| NaN|
| 1| 5|10.0|
| 1| 6| NaN|
| 1| 6| NaN|
+-------+----------+----+
You can use
d1.withColumnRenamed("colName", "newColName")
d1.withColumn("newColName", d1["colName"])
withColumnRenamed renames the existing column to a new name.
withColumn creates a new column with the given name; if a column with that name already exists, it replaces it and the old one is dropped.
In your case the changes are not applied to the original dataframe df2; these operations return a new dataframe, which has to be assigned to a variable for further use.
d3 = df2.select((df2.id2 > 0).alias("id2"))
Above should work fine in your case.
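If you also want to keep the other columns rather than just the flag, a minimal sketch using the column names from the question:
d3 = df2.select("session", "timestamp1", (df2.id2 > 0).alias("id2"))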
Hope this helps!
As stated above, it's not possible to overwrite a DataFrame object, which is an immutable collection, so all transformations return a new DataFrame.
The fastest way to achieve your desired effect is to use withColumn:
df = df.withColumn("col", some expression)
where col is the name of the column you want to "replace". After running this, the value of the df variable is replaced by a new DataFrame with the new value of column col. You might want to assign it to a new variable instead.
In your case it can look like:
df2 = df2.withColumn("id2", (df2.id2 > 0) & (df2.id2 != float('nan')))
I've added the comparison to nan because I'm assuming you don't want to treat nan as greater than 0 (Spark treats NaN as larger than any other numeric value, so NaN > 0 evaluates to true).
If you're working with multiple columns of the same name in different joined tables, you can use the table alias in the colName passed to withColumn.
E.g. df1.join(df2, df1.id == df2.other_id).withColumn('df1.my_col', F.greatest(df1.my_col, df2.my_col))
And if you only want to keep the columns from df1, you can also call .select('df1.*').
If you instead do df1.join(df2, df1.id == df2.other_id).withColumn('my_col', F.greatest(df1.my_col, df2.my_col)),
I think it overwrites the last column which is called my_col. So it outputs:
id, my_col (df1.my_col original value), id, other_id, my_col (newly computed my_col)

Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark sql dataframe, consisting of an ID column and n "data" columns, i.e.
id | dat1 | dat2 | ... | datn
The id column is unique, whereas, looking at dat1 ... datn, there may be duplicates.
My goal is to find the ids of those duplicates.
My approach so far:
get the duplicate rows using groupBy:
dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')
join the dup_df with the entire df to get the duplicate rows including id:
df.join(dup_df, df.columns[1:])
I am quite certain that this is basically correct, but it fails because the dat1 ... datn columns contain null values.
To do the join on null values, I found e.g. this SO post. But this would require constructing a huge "string join condition".
Thus my questions:
Is there a simple / more generic / more pythonic way to do joins on null values?
Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?
BTW: I am using Spark 2.1.0 and Python 3.5.3
If the number of ids per group is relatively small, you can groupBy and collect_list. Required imports:
from pyspark.sql.functions import collect_list, size
example data:
df = sc.parallelize([
    (1, "a", "b", 3),
    (2, None, "f", None),
    (3, "g", "h", 4),
    (4, None, "f", None),
    (5, "a", "b", 3)
]).toDF(["id"])
query:
(df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1))
and the result:
+----+---+----+------+
| _2| _3| _4| ids|
+----+---+----+------+
|null| f|null|[2, 4]|
| a| b| 3|[1, 5]|
+----+---+----+------+
You can apply explode twice (or use a udf) to get an output equivalent to the one returned from the join.
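For instance, a minimal sketch with a single explode that flattens the ids back into one row per duplicate id (a variation, not necessarily what the answer had in mind):
from pyspark.sql.functions import explode

(df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1)
    .withColumn("id", explode("ids"))
    .show())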
You can also identify groups using minimal id per group. A few additional imports:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min
window definition:
w = Window.partitionBy(df.columns[1:])
query:
(df
    .select(
        "*",
        count("*").over(w).alias("_cnt"),
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))
and the result:
+---+----+---+----+----+-----+
| id| _2| _3| _4|_cnt|group|
+---+----+---+----+----+-----+
| 2|null| f|null| 2| 2|
| 4|null| f|null| 2| 2|
| 1| a| b| 3| 2| 1|
| 5| a| b| 3| 2| 1|
+---+----+---+----+----+-----+
You can further use the group column for a self join.
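One possible reading of that, as a sketch (pairing up duplicate ids within each group; the dups and pairs names are hypothetical):
dups = (df
    .select("*", count("*").over(w).alias("_cnt"), min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))

# each row pairs two distinct ids that share the same group of duplicated values
pairs = (dups.alias("l")
    .join(dups.alias("r"), "group")
    .where(col("l.id") < col("r.id"))
    .select(col("l.id").alias("id_left"), col("r.id").alias("id_right"), "group"))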
