How do you remove an ambiguous column in pyspark? - apache-spark

There are many questions similar to this one that ask a different question about avoiding duplicate columns in a join; that is not what I am asking here.
Given that I already have a DataFrame with ambiguous columns, how do I remove a specific column?
For example, given:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

df = spark.createDataFrame(
    spark.sparkContext.parallelize([
        [1, 0.0, "ext-0.0"],
        [1, 1.0, "ext-1.0"],
        [2, 1.0, "ext-2.0"],
        [3, 2.0, "ext-3.0"],
        [4, 3.0, "ext-4.0"],
    ]),
    StructType([
        StructField("id", IntegerType(), True),
        StructField("shared", DoubleType(), True),
        StructField("shared", StringType(), True),
    ])
)
I wish to retain only the numeric columns.
However, attempting to do something like df.select("id", "shared").show() results in:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'shared' is ambiguous, could be: shared, shared.;"
Many related solutions to this problem amount to 'avoid ever getting into this situation', e.g. by using ['joinkey'] instead of a.joinkey = b.joinkey in the join. I reiterate that this is not the situation here; this relates to a dataframe that has already been converted into this form.
The metadata from the DF disambiguates these columns:
$ df.dtypes
[('id', 'int'), ('shared', 'double'), ('shared', 'string')]
$ df.schema
StructType(List(StructField(id,IntegerType,true),StructField(shared,DoubleType,true),StructField(shared,StringType,true)))
So the data is retained internally... I just can't see how to use it.
How do I pick one column over the other?
I expected to be able to use, e.g., col('shared#11') or similar... but I can't see anything like that.
Is this simply not possible in spark?
To answer this question, please post either a) a working code snippet that solves the problem above, or b) a link to something official from the spark developers confirming that this simply isn't supported.

The easiest solution to this problem is to rename the columns using df.toDF(...<new-col-names>...), but if you don't want to change the column names, then group the duplicated columns by their type into a struct<type1, type2> as below.
Please note that the solution below is written in Scala, but logically similar code can be implemented in Python. This solution will also work for all duplicate columns in the dataframe.
1. Load the test data
val df = Seq((1, 2.0, "shared")).toDF("id", "shared", "shared")
df.show(false)
df.printSchema()
/**
* +---+------+------+
* |id |shared|shared|
* +---+------+------+
* |1 |2.0 |shared|
* +---+------+------+
*
* root
* |-- id: integer (nullable = false)
* |-- shared: double (nullable = false)
* |-- shared: string (nullable = true)
*/
2. get all the duplicated column names
// 1. get all the duplicated column names
val findDupCols = (cols: Array[String]) => cols.map((_ , 1)).groupBy(_._1).filter(_._2.length > 1).keys.toSeq
val dupCols = findDupCols(df.columns)
println(dupCols.mkString(", "))
// shared
3. rename duplicate cols like shared => shared:double, shared:string, without touching the other column names
val renamedDF = df
  // 3. rename duplicate cols like shared => shared:double, shared:string
  .toDF(df.schema
    .map { case StructField(name, dt, _, _) =>
      if (dupCols.contains(name)) s"$name:${dt.simpleString}" else name
    }: _*)
4. create struct of all cols
// 4. create struct of all cols
val structCols = df.schema.map(f => f.name -> f).groupBy(_._1)
  .map { case (name, seq) =>
    if (seq.length > 1)
      struct(
        seq.map { case (_, StructField(fName, dt, _, _)) =>
          expr(s"`$fName:${dt.simpleString}` as ${dt.simpleString}")
        }: _*
      ).as(name)
    else col(name)
  }.toSeq
val structDF = renamedDF.select(structCols: _*)
structDF.show(false)
structDF.printSchema()
/**
* +-------------+---+
* |shared |id |
* +-------------+---+
* |[2.0, shared]|1 |
* +-------------+---+
*
* root
* |-- shared: struct (nullable = false)
* | |-- double: double (nullable = false)
* | |-- string: string (nullable = true)
* |-- id: integer (nullable = false)
*/
5. get columns by their type using <column_name>.<datatype>
// Use the dataframe without losing any columns
structDF.selectExpr("id", "shared.double as shared").show(false)
/**
* +---+------+
* |id |shared|
* +---+------+
* |1 |2.0 |
* +---+------+
*/
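For reference, here is a hedged PySpark sketch of the same idea applied to the DataFrame from the question (the name:type renaming convention is carried over from the Scala version above; treat it as an illustration rather than a drop-in implementation):
from collections import Counter
from pyspark.sql import functions as F

# assuming `df` is the DataFrame with the duplicated 'shared' columns from the question
dup_names = [name for name, n in Counter(df.columns).items() if n > 1]

# 1. rename the duplicates positionally, e.g. 'shared:double', 'shared:string'
new_names = [
    f"{f.name}:{f.dataType.simpleString()}" if f.name in dup_names else f.name
    for f in df.schema.fields
]
renamed = df.toDF(*new_names)

# 2. regroup the renamed duplicates into one struct per original name
struct_cols, seen = [], set()
for f in df.schema.fields:
    if f.name not in dup_names:
        struct_cols.append(F.col(f.name))
    elif f.name not in seen:
        seen.add(f.name)
        members = [
            renamed[f"{g.name}:{g.dataType.simpleString()}"].alias(g.dataType.simpleString())
            for g in df.schema.fields if g.name == f.name
        ]
        struct_cols.append(F.struct(*members).alias(f.name))

# 3. pick a duplicate by its type, exactly as in the Scala version
renamed.select(*struct_cols).selectExpr("id", "shared.double as shared").show()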
Hope this is useful to someone!

It seems this is possible by replacing the schema using .rdd.toDF() on the dataframe.
However, I'll still accept any answer that is less convoluted and annoying than the one below:
import random
import string
from pyspark.sql.types import DoubleType, LongType

def makeId():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(6))

def makeUnique(column):
    return "%s---%s" % (column.name, makeId())

def makeNormal(column):
    return column.name.split("---")[0]

unique_schema = list(map(makeUnique, df.schema))
df_unique = df.rdd.toDF(schema=unique_schema)
df_unique.show()

numeric_cols = filter(lambda c: c.dataType.__class__ in [LongType, DoubleType], df_unique.schema)
numeric_col_names = list(map(lambda c: c.name, numeric_cols))

df_filtered = df_unique.select(*numeric_col_names)
df_filtered.show()

normal_schema = list(map(makeNormal, df_filtered.schema))
df_fixed = df_filtered.rdd.toDF(schema=normal_schema)
df_fixed.show()
Gives:
+-----------+---------------+---------------+
|id---chjruu|shared---aqboua|shared---ehjxor|
+-----------+---------------+---------------+
| 1| 0.0| ext-0.0|
| 1| 1.0| ext-1.0|
| 2| 1.0| ext-2.0|
| 3| 2.0| ext-3.0|
| 4| 3.0| ext-4.0|
+-----------+---------------+---------------+
+-----------+---------------+
|id---chjruu|shared---aqboua|
+-----------+---------------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+-----------+---------------+
+---+------+
| id|shared|
+---+------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+---+------+

Workaround: Simply rename the columns (in order) and then do whatever you wanted to do!
renamed_df = df.toDF("id", "shared_double", "shared_string")
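If you'd rather not hardcode the names, a minimal sketch of the same workaround driven by df.dtypes (this assumes the duplicated columns differ in type; the suffix convention is mine, not from the question):
# append each column's type to its name so duplicates of different types become unique
renamed_df = df.toDF(*[f"{name}_{dtype}" for name, dtype in df.dtypes])

# keep only the numeric columns, then strip the appended suffix again
numeric = [f"{name}_{dtype}" for name, dtype in df.dtypes if dtype in ("int", "bigint", "double")]
renamed_df.select(*numeric).toDF(*[c.rsplit("_", 1)[0] for c in numeric]).show()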

Related

spark schema difference in partitions

I have to read data from a path which is partitioned by region.
The US region has columns a, b, c, d, e.
The EUR region has only a, b, c, d.
When I read data from the path and do a printSchema, I see only a, b, c, d; 'e' is missing.
Is there any way to handle this situation, e.g. so that column e automatically gets populated with null for the EUR data?
You can use the mergeSchema option, which should do exactly what you are looking for, as long as columns with the same name have the same type.
Example:
spark.read.option("mergeSchema", "true").format("parquet").load(...)
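As a sketch with a hypothetical partitioned layout (the base path and partition directory names below are placeholders, not from the question):
# hypothetical layout: /data/events/region=US/..., /data/events/region=EUR/...
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/data/events")  # placeholder path, adjust to your layout
)
df.printSchema()  # column 'e' should now appear, null for the EUR rows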
Once you read the data from the path, you can check whether the data frame contains column 'e'. If it does not, you can add it with a default value, which is None in this case.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder \
    .appName('example') \
    .getOrCreate()

df = spark.createDataFrame(data=data, schema=columns)  # data/columns stand in for whatever you read from the path

if 'e' not in df.columns:
    df = df.withColumn('e', lit(None))
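One refinement worth considering (my addition, assuming 'e' is meant to be a string column): lit(None) on its own produces a NullType column, which can be awkward in later unions or writes, so casting it to the expected type is usually safer:
from pyspark.sql.functions import lit

if 'e' not in df.columns:
    # cast so the new column has a concrete type instead of NullType
    df = df.withColumn('e', lit(None).cast('string'))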
You can collect all the possible columns from both datasets, then fill in None for any column that is not available in a given dataset (a union sketch follows the code below).
from pyspark.sql import functions as F

df_ab = (spark
.sparkContext
.parallelize([
('a1', 'b1'),
('a2', 'b2'),
])
.toDF(['a', 'b'])
)
df_ab.show()
# +---+---+
# | a| b|
# +---+---+
# | a1| b1|
# | a2| b2|
# +---+---+
df_abcd = (spark
.sparkContext
.parallelize([
('a3', 'b3', 'c3', 'd3'),
('a4', 'b4', 'c4', 'd4'),
])
.toDF(['a', 'b', 'c', 'd'])
)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
unique_columns = list(set(df_ab.columns + df_abcd.columns))
# ['d', 'b', 'a', 'c']
for col in unique_columns:
    if col not in df_ab.columns:
        df_ab = df_ab.withColumn(col, F.lit(None))
    if col not in df_abcd.columns:
        df_abcd = df_abcd.withColumn(col, F.lit(None))
df_ab.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- d: null (nullable = true)
# |-- c: null (nullable = true)
df_ab.show()
# +---+---+----+----+
# | a| b| d| c|
# +---+---+----+----+
# | a1| b1|null|null|
# | a2| b2|null|null|
# +---+---+----+----+
df_abcd.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- c: string (nullable = true)
# |-- d: string (nullable = true)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
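To actually end up with one DataFrame covering both regions, a small follow-up sketch (assuming the two frames now have identical column sets, as above):
# union by column name now that both frames expose a, b, c, d
combined = df_ab.unionByName(df_abcd)
combined.show()
On Spark 3.1+ you can skip the column-filling loop entirely with df_ab.unionByName(df_abcd, allowMissingColumns=True).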
I used pyspark and SQLContext. Hope this implementation helps you get an idea; Spark provides an environment to use SQL, and it is very convenient to use Spark SQL for this type of thing.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

class getData(object):
    """Builds the US and EUR region DataFrames and unions them with Spark SQL."""
    def __init__(self):
        self.spark = SparkSession.builder.appName('YourProjectName').getOrCreate()

    def get_data(self, n):
        spark = self.spark
        data2 = [("region 1", "region 2", "region 3", "region 4"),
                 ("region 5", "region 6", "region 7", "region 8")]
        schema = StructType([
            StructField("a", StringType(), True),
            StructField("b", StringType(), True),
            StructField("c", StringType(), True),
            StructField("d", StringType(), True)
        ])
        data3 = [("EU region 1", "EU region 2", "EU region 3"),
                 ("EU region 5", "EU region 6", "EU region 7")]
        schema3 = StructType([
            StructField("a", StringType(), True),
            StructField("b", StringType(), True),
            StructField("c", StringType(), True)
        ])
        df = spark.createDataFrame(data=data2, schema=schema)
        df.createOrReplaceTempView("USRegion")
        sqlDF = spark.sql("SELECT * FROM USRegion")
        sqlDF.show(n=n)
        df1 = spark.createDataFrame(data=data3, schema=schema3)
        df1.createOrReplaceTempView("EURegion")
        sqlDF1 = spark.sql("SELECT * FROM EURegion")
        sqlDF1.show(n=n)
        # union both regions, filling the missing column 'd' with an empty string for EUR
        sql_union_df = spark.sql(
            "SELECT a, b, c, d FROM USRegion UNION ALL SELECT a, b, c, '' AS d FROM EURegion")
        sql_union_df.show(n=n)

# instantiate the class
conn = getData()
# call the method implemented inside the class
conn.get_data(10)

PySpark Compare Empty Map Literal

I want to drop rows in a PySpark DataFrame where a certain column contains an empty map. How do I do this? I can't seem to declare a typed empty MapType against which to compare my column. I have seen that in Scala, you can use typedLit, but there seems to be no such equivalent in PySpark. I have also tried using lit(...) and casting to a struct<string,int> but I have found no acceptable argument for lit() (tried using None which returns null and {} which is an error).
I'm sure this is trivial but I haven't seen any docs on this!
Here is a solution using the pyspark built-in function size:
from pyspark.sql.functions import col, size
df = spark.createDataFrame(
    [(1, {1: 'A'}),
     (2, {2: 'B'}),
     (3, {3: 'C'}),
     (4, {}),
     (5, None)]
).toDF("id", "map")
df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- map: map (nullable = true)
# | |-- key: long
# | |-- value: string (valueContainsNull = true)
df.withColumn("is_empty", size(col("map")) <= 0).show()
# +---+--------+--------+
# | id| map|is_empty|
# +---+--------+--------+
# | 1|[1 -> A]| false|
# | 2|[2 -> B]| false|
# | 3|[3 -> C]| false|
# | 4| []| true|
# | 5| null| true|
# +---+--------+--------+
Note that the condition is size <= 0 since, in the case of null, the function returns -1 (if the spark.sql.legacy.sizeOfNull setting is true; otherwise it returns null). See the documentation of spark.sql.legacy.sizeOfNull for more details.
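To actually drop the rows with an empty (or null) map, which is what the question asks for, a minimal sketch built on the same condition:
# keep only the rows whose map has at least one entry
df_non_empty = df.filter(size(col("map")) > 0)
df_non_empty.show()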
Generic solution: comparing a Map column with a literal Map
For a more generic solution we can use the built-in function size in combination with a UDF that appends the string key + value of each item into a sorted list (thank you @jxc for pointing out the problem with the previous version). The hypothesis here is that two maps are equal when:
they have the same size
the string representation of key + value is identical between the items of the maps
The literal map is created from an arbitrary python dictionary combining keys and values via map_from_arrays:
from pyspark.sql.functions import udf, lit, size, when, map_from_arrays, array
df = spark.createDataFrame([
    [1, {}],
    [2, {1: 'A', 2: 'B', 3: 'C'}],
    [3, {1: 'A', 2: 'B'}]
]).toDF("key", "map")

dict = {1: 'A', 2: 'B'}
map_keys_ = array([lit(k) for k in dict.keys()])
map_values_ = array([lit(v) for v in dict.values()])
tmp_map = map_from_arrays(map_keys_, map_values_)

to_strlist_udf = udf(lambda d: sorted([str(k) + str(d[k]) for k in d.keys()]))

def map_equals(m1, m2):
    return when(
        (size(m1) == size(m2)) &
        (to_strlist_udf(m1) == to_strlist_udf(m2)), True
    ).otherwise(False)
df = df.withColumn("equals", map_equals(df["map"], tmp_map))
df.show(10, False)
# +---+------------------------+------+
# |key|map |equals|
# +---+------------------------+------+
# |1 |[] |false |
# |2 |[1 -> A, 2 -> B, 3 -> C]|false |
# |3 |[1 -> A, 2 -> B] |true |
# +---+------------------------+------+
Note: As you can see the pyspark == operator works pretty well for array comparison as well.

Using flatMap / reduce: dealing with rows containing a list of rows

I have a dataframe containing an array of rows on each row
I want to aggregate all the inner rows into one dataframe
Below is what I have / achieved:
This
df.select('*').take(1)
Gives me this:
[
Row(
body=[
Row(a=1, b=1),
Row(a=2, b=2)
]
)
]
So doing this:
df.rdd.flatMap(lambda x: x).collect()
I get this:
[[
Row(a=1, b=1)
Row(a=2, b=2)
]]
So I am forced to do this:
df.rdd.flatMap(lambda x: x).flatMap(lambda x: x)
So I can achieve the below:
[
Row(a=1, b=1)
Row(a=2, b=2)
]
Using the result above, I can finally convert it to a dataframe and save it somewhere, which is what I want. But calling flatMap twice doesn't look right.
I tried to do the same using reduce, with code like the following:
flatRdd = df.rdd.flatMap(lambda x: x)
dfMerged = reduce(DataFrame.unionByName, [flatRdd])
The second argument of reduce must be iterable, so I was forced to add [flatRdd]. Sadly it gives me this:
[[
Row(a=1, b=1)
Row(a=2, b=2)
]]
There is certainly a better way to achieve what I want.
IIUC, you can explode and then flatten the resulting Rows using the .* syntax.
Suppose you start with the following DataFrame:
df.show()
#+----------------+
#| body|
#+----------------+
#|[[1, 1], [2, 2]]|
#+----------------+
with the schema:
df.printSchema()
#root
# |-- body: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- a: long (nullable = true)
# | | |-- b: long (nullable = true)
You can first explode the body column:
from pyspark.sql.functions import explode
df = df.select(explode("body").alias("exploded"))
df.show()
#+--------+
#|exploded|
#+--------+
#| [1, 1]|
#| [2, 2]|
#+--------+
Now flatten the exploded column:
df = df.select("exploded.*")
df.show()
#+---+---+
#| a| b|
#+---+---+
#| 1| 1|
#| 2| 2|
#+---+---+
Now if you were to call collect, you'd get the desired output:
print(df.collect())
#[Row(a=1, b=1), Row(a=2, b=2)]
See also:
Querying Spark SQL DataFrame with complex types
You don't need to run flatMap() on the Row object, just refer to it directly by its key:
>>> df.rdd.flatMap(lambda x: x.body).collect()
[Row(a=1, b=1), Row(a=2, b=2)]
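If the end goal is a DataFrame again (as the question mentions saving the result somewhere), a small sketch assuming the inner Rows all share the same fields:
# build a DataFrame straight from the flattened Rows; the schema is inferred from them
flat_df = spark.createDataFrame(df.rdd.flatMap(lambda x: x.body))
flat_df.show()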

Spark Aggregating multiple columns (possible to array) from join output

I have the datasets below:
Table1
Table2
Now I would like to get the dataset below. I've tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into XML. I will be doing this conversion using map.
Any help would be appreciated.
Joining alone is not enough to get the desired output. You are probably missing something, and the last element of each nested array is likely departmentid. Assuming the last element of the nested array is departmentid, I've generated the output in the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list
case class department(id: Integer, deptname: String)
case class employee(employeid:Integer, empname:String, departmentid:Integer)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val department_df = Seq(department(1, "physics")
,department(2, "computer") ).toDF()
val emplyoee_df = Seq(employee(1, "A", 1)
,employee(2, "B", 1)
,employee(3, "C", 2)
,employee(4, "D", 2)).toDF()
val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
  selectExpr("id", "deptname", "employeid", "empname").
  rdd.map {
    case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
      (id, deptname, Array(employeid.toString, empname, id.toString))
  }.toDF("id", "deptname", "arrayemp").
  groupBy("id", "deptname").
  agg(collect_list("arrayemp").as("emplist")).
  orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: if I break down the last dataframe transformation into multiple steps, it will probably become clear how the output is generated.
Left outer join between department_df and emplyoee_df:
val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
| 1| physics| 2| B|
| 1| physics| 1| A|
| 2|computer| 4| D|
| 2|computer| 3| C|
+---+--------+---------+-------+
Creating an array from some of the columns' values of the df1 dataframe:
val df2 = df1.rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
| 1| physics|[2, B, 1]|
| 1| physics|[1, A, 1]|
| 2|computer|[4, D, 2]|
| 2|computer|[3, C, 2]|
+---+--------+---------+
Create a new list aggregating the multiple arrays, using the df2 dataframe:
val result = df2.groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
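For completeness, here is a hedged PySpark sketch of the same aggregation (data and column names copied from the Scala example above); it also skips the RDD round-trip by building the array of strings directly:
from pyspark.sql import functions as F

department_df = spark.createDataFrame([(1, "physics"), (2, "computer")], ["id", "deptname"])
employee_df = spark.createDataFrame(
    [(1, "A", 1), (2, "B", 1), (3, "C", 2), (4, "D", 2)],
    ["employeid", "empname", "departmentid"])

result = (department_df
          .join(employee_df, department_df["id"] == employee_df["departmentid"], "left")
          # build the [employeid, empname, id] array as strings, as in the Scala version
          .withColumn("arrayemp", F.array(F.col("employeid").cast("string"),
                                          F.col("empname"),
                                          F.col("id").cast("string")))
          .groupBy("id", "deptname")
          .agg(F.collect_list("arrayemp").alias("emplist"))
          .orderBy("id", "deptname"))
result.show(truncate=False)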
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq(
(1,"Physics"),
(2,"Computer"),
(3,"Maths")
)).toDF("ID","Dept")
val schema = List(
StructField("EMPID", IntegerType, true),
StructField("EMPNAME", StringType, true),
StructField("DeptID", IntegerType, true)
)
val data = Seq(
Row(1,"A",1),
Row(2,"B",1),
Row(3,"C",2),
Row(4,"D",2) ,
Row(5,"E",null)
)
val df_emp = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
val newdf = df_emp.withColumn("CONC",array($"EMPID",$"EMPNAME",$"DeptID")).groupBy($"DeptID").agg(expr("collect_list(CONC) as emplist"))
df.join(newdf, df.col("ID") === newdf.col("DeptID")).select($"ID", $"Dept", $"emplist").show()
+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+

Set schema in pyspark dataframe read.csv with null elements

I have a data set (example) that when imported with
df = spark.read.csv(filename, header=True, inferSchema=True)
df.show()
will assign the column containing 'NA' a StringType(), where I would like it to be IntegerType() (or ByteType()).
I then tried to set
schema = StructType([
    StructField("col_01", IntegerType()),
    StructField("col_02", DateType()),
    StructField("col_03", IntegerType())
])
df = spark.read.csv(filename, header=True, schema=schema)
df.show()
The output then shows every row where 'col_03' contains 'NA' as entirely null.
However col_01 and col_02 return appropriate data if they are called with
df.select(['col_01','col_02']).show()
I can find a way around this by casting the data type of col_03 after the import:
df = spark.read.csv(filename, header=True, inferSchema=True)
df = df.withColumn('col_03', df['col_03'].cast(IntegerType()))
df.show()
but I think it is not ideal, and it would be much better if I could assign the data type for each column directly by setting the schema.
Would anyone be able to tell me what I am doing incorrectly? Or is casting the data types after importing the only solution? Any comment regarding the performance of the two approaches (if we can make assigning the schema work) is also welcome.
Thank you,
You can set a new null value in spark's csv loader using nullValue:
for a csv file looking like this:
col_01,col_02,col_03
111,2007-11-18,3
112,2002-12-03,4
113,2007-02-14,5
114,2003-04-16,NA
115,2011-08-24,2
116,2003-05-03,3
117,2001-06-11,4
118,2004-05-06,NA
119,2012-03-25,5
120,2006-10-13,4
and forcing the schema:
from pyspark.sql.types import StructType, StructField, IntegerType, DateType

schema = StructType([
    StructField("col_01", IntegerType()),
    StructField("col_02", DateType()),
    StructField("col_03", IntegerType())
])
You'll get:
df = spark.read.csv(filename, header=True, nullValue='NA', schema=schema)
df.show()
df.printSchema()
+------+----------+------+
|col_01| col_02|col_03|
+------+----------+------+
| 111|2007-11-18| 3|
| 112|2002-12-03| 4|
| 113|2007-02-14| 5|
| 114|2003-04-16| null|
| 115|2011-08-24| 2|
| 116|2003-05-03| 3|
| 117|2001-06-11| 4|
| 118|2004-05-06| null|
| 119|2012-03-25| 5|
| 120|2006-10-13| 4|
+------+----------+------+
root
|-- col_01: integer (nullable = true)
|-- col_02: date (nullable = true)
|-- col_03: integer (nullable = true)
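As a side note, a minimal variant (assuming Spark 2.3+, where read.csv also accepts a DDL-formatted schema string) that avoids building the StructType by hand:
df = spark.read.csv(
    filename,
    header=True,
    nullValue='NA',
    schema="col_01 INT, col_02 DATE, col_03 INT")  # DDL-style schema string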
Try this once (but it will read every column as string type; you can type cast as per your requirement):
import csv
from pyspark.sql.types import IntegerType

data = []
with open('filename', 'r') as doc:
    reader = csv.DictReader(doc)
    for line in reader:
        data.append(line)

df = sc.parallelize(data).toDF()
df = df.withColumn("col_03", df["col_03"].cast(IntegerType()))
