I have created a PySpark RDD (converted from XML to CSV) that does not have headers. I need to convert it to a DataFrame with headers to perform some SparkSQL queries on it. I cannot seem to find a simple way to add headers. Most examples start with a dataset that already has headers.
df = spark.read.csv('some.csv', header=True, schema=schema)
However, I need to append headers.
headers = ['a', 'b', 'c', 'd']
This seems to be a trivial problem, I am not sure why I cannot find a working solution. Thank you.
Like this ... you need to specify schema and .option("header", "false") if your csv does not contain a header row
spark.version
'2.3.2'
! cat sample.csv
1, 2.0,"hello"
3, 4.0, "there"
5, 6.0, "how are you?"
PATH = "sample.csv"
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = StructType([\
StructField("col1", IntegerType(), True),\
StructField("col2", FloatType(), True),\
StructField("col3", StringType(), True)])
csvFile = spark.read.format("csv")\
.option("header", "false")\
.schema(schema)\
.load(PATH)
csvFile.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| 2.0| hello|
| 3| 4.0| "there"|
| 5| 6.0| "how are you?"|
+----+----+---------------+
# if you have rdd and want to convert straight to df
rdd = sc.textFile(PATH)
# just showing rows
for i in rdd.collect(): print(i)
1, 2.0,"hello"
3, 4.0, "there"
5, 6.0, "how are you?"
# use Row to construct a schema from rdd
from pyspark.sql import Row
csvDF = rdd\
.map(lambda x: Row(col1 = int(x.split(",")[0]),\
col2 = float(x.split(",")[1]),\
col3 = str(x.split(",")[2]))).toDF()
csvDF.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| 2.0| "hello"|
| 3| 4.0| "there"|
| 5| 6.0| "how are you?"|
+----+----+---------------+
csvDF.printSchema()
root
|-- col1: long (nullable = true)
|-- col2: double (nullable = true)
|-- col3: string (nullable = true)
rdd.toDF(schema=['a', 'b', 'c', 'd']
Related
I have to read data from a path which is partitioned by region.
US region has columns a,b,c,d,e
EUR region has only a,b,c,d
When I read data from the path and doing a printSchema, I am seeing only a,b,c,d 'e' is missing.
Is there any way to handle this situation? Like column e automatically gets populated with null for EUR data...?
You can use the mergeSchema option that should do exactly what you are looking for as long as columns with the same name have the same type.
Example:
spark.read.option("mergeSchema", "true").format("parquet").load(...)
Once you read the data from the path, you can check if data frame contains column 'e'. If it does not, then you could add this with default value which is None is this case.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder \
.appName('example') \
.getOrCreate()
df = spark.createDataFrame(data=data, schema = columns)
if 'e' not in df.columns:
df = df.withColumn('e',lit(None))
You can collect all the possible columns from both dataset then fill None if that column is not available in each dataset
df_ab = (spark
.sparkContext
.parallelize([
('a1', 'b1'),
('a2', 'b2'),
])
.toDF(['a', 'b'])
)
df_ab.show()
# +---+---+
# | a| b|
# +---+---+
# | a1| b1|
# | a2| b2|
# +---+---+
df_abcd = (spark
.sparkContext
.parallelize([
('a3', 'b3', 'c3', 'd3'),
('a4', 'b4', 'c4', 'd4'),
])
.toDF(['a', 'b', 'c', 'd'])
)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
unique_columns = list(set(df_ab.columns + df_abcd.columns))
# ['d', 'b', 'a', 'c']
for col in unique_columns:
if col not in df_ab.columns:
df_ab = df_ab.withColumn(col, F.lit(None))
if col not in df_abcd.columns:
df_abcd = df_abcd.withColumn(col, F.lit(None))
df_ab.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- d: null (nullable = true)
# |-- c: null (nullable = true)
df_ab.show()
# +---+---+----+----+
# | a| b| d| c|
# +---+---+----+----+
# | a1| b1|null|null|
# | a2| b2|null|null|
# +---+---+----+----+
df_abcd.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- c: string (nullable = true)
# |-- d: string (nullable = true)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
I used pyspark and SQLContext. Hope this implementation will help you to get an idea. Spark provides an environment to use SQL and it is very convenient to use SPARK SQL for these type of things.
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions
from pyspark.sql import SQLContext
import sys
import os
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
class getData(object):
"""docstring for getData"""
def __init__(self):
def get_data(self, n):
spark = SparkSession.builder.appName('YourProjectName').getOrCreate()
data2 = [("region 1","region 2","region 3","region 4"),
("region 5","region 6","region 7","region 8")
]
schema = StructType([ \
StructField("a",StringType(),True), \
StructField("b",StringType(),True), \
StructField("c",StringType(),True), \
StructField("d", StringType(), True) \
])
data3 = [("EU region 1","EU region 2","EU region 3"),
("EU region 5","EU region 6","EU region 7")
]
schema3 = StructType([ \
StructField("a",StringType(),True), \
StructField("b",StringType(),True), \
StructField("c",StringType(),True) \
])
df = spark.createDataFrame(data=data2,schema=schema)
df.createOrReplaceTempView("USRegion")
sqlDF = self.sparkSession1.sql("SELECT * FROM USRegion")
sqlDF.show(n=600)
df1 = spark.createDataFrame(data=data3,schema=schema3)
df1.createOrReplaceTempView("EURegion")
sqlDF1 = self.sparkSession1.sql("SELECT * FROM EURegion")
sqlDF1.show(n=600)
sql_union_df = self.sparkSession1.sql("SELECT a, b, c, d FROM USRegion uNION ALL SELECT a,b, c, '' as d FROM EURegion ")
sql_union_df.show(n=600)
#call the class
conn = getData()
#call the method implemented inside the class
print(conn.get_data(10))
There are many questions similar to this that are asking a different question with regard to avoid duplicate columns in a join; that is not what I am asking here.
Given that I already have a DataFrame with ambiguous columns, how do I remove a specific column?
For example, given:
df = spark.createDataFrame(
spark.sparkContext.parallelize([
[1, 0.0, "ext-0.0"],
[1, 1.0, "ext-1.0"],
[2, 1.0, "ext-2.0"],
[3, 2.0, "ext-3.0"],
[4, 3.0, "ext-4.0"],
]),
StructType([
StructField("id", IntegerType(), True),
StructField("shared", DoubleType(), True),
StructField("shared", StringType(), True),
])
)
I wish to retain only the numeric columns.
However, attempting to do something like df.select("id", "shared").show() results in:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'shared' is ambiguous, could be: shared, shared.;"
Many related solution to this problem are simply 'avoid ever getting into this situation', eg. by using ['joinkey'] instead of a.joinkey = b.joinkey on the join. I reiterate that this is not the situation here; this relates to a dataframe that has already been converted into this form.
The metadata from the DF disambiguates these columns:
$ df.dtypes
[('id', 'int'), ('shared', 'double'), ('shared', 'string')]
$ df.schema
StructType(List(StructField(id,IntegerType,true),StructField(shared,DoubleType,true),StructField(shared,StringType,true)))
So the data is retained internally... I just can't see how to use it.
How do I pick one column over the other?
I expected to be able to use, eg. col('shared#11') or similar... but there is nothing like that I can see?
Is this simply not possible in spark?
To answer this question, I would ask, please post either a) a working code snippet that solves the problem above, or b) link to something official from the spark developers that this simply isn't supported?
the easiest solution to this problem is to rename using df.toDF(...<new-col-names>...), but if you don't wanted to change the column name then group the duplicated columns by their type as struct<type1, type2> as below-
Please note that below solution is written in scala, but logically similar code can be implemented in python. Also this solution will work for all duplicate columns in the dataframe-
1. Load the test data
val df = Seq((1, 2.0, "shared")).toDF("id", "shared", "shared")
df.show(false)
df.printSchema()
/**
* +---+------+------+
* |id |shared|shared|
* +---+------+------+
* |1 |2.0 |shared|
* +---+------+------+
*
* root
* |-- id: integer (nullable = false)
* |-- shared: double (nullable = false)
* |-- shared: string (nullable = true)
*/
2. get all the duplicated column names
// 1. get all the duplicated column names
val findDupCols = (cols: Array[String]) => cols.map((_ , 1)).groupBy(_._1).filter(_._2.length > 1).keys.toSeq
val dupCols = findDupCols(df.columns)
println(dupCols.mkString(", "))
// shared
3. rename duplicate cols like shared => shared:string, shared:int, without touching the other column names
val renamedDF = df
// 2 rename duplicate cols like shared => shared:string, shared:int
.toDF(df.schema
.map{case StructField(name, dt, _, _) =>
if(dupCols.contains(name)) s"$name:${dt.simpleString}" else name}: _*)
3. create struct of all cols
// 3. create struct of all cols
val structCols = df.schema.map(f => f.name -> f ).groupBy(_._1)
.map{case(name, seq) =>
if (seq.length > 1)
struct(
seq.map { case (_, StructField(fName, dt, _, _)) =>
expr(s"`$fName:${dt.simpleString}` as ${dt.simpleString}")
}: _*
).as(name)
else col(name)
}.toSeq
val structDF = renamedDF.select(structCols: _*)
structDF.show(false)
structDF.printSchema()
/**
* +-------------+---+
* |shared |id |
* +-------------+---+
* |[2.0, shared]|1 |
* +-------------+---+
*
* root
* |-- shared: struct (nullable = false)
* | |-- double: double (nullable = false)
* | |-- string: string (nullable = true)
* |-- id: integer (nullable = false)
*/
4. get column by their type using <column_name>.<datatype>
// Use the dataframe without losing any columns
structDF.selectExpr("id", "shared.double as shared").show(false)
/**
* +---+------+
* |id |shared|
* +---+------+
* |1 |2.0 |
* +---+------+
*/
Hope this is useful to someone!
It seems this is possible by replacing the schema using .rdd.toDf() on the dataframe.
However, I'll still accept any answer that is less convoluted and annoying than the one below:
import random
import string
from pyspark.sql.types import DoubleType, LongType
def makeId():
return ''.join(random.choice(string.ascii_lowercase) for _ in range(6))
def makeUnique(column):
return "%s---%s" % (column.name, makeId())
def makeNormal(column):
return column.name.split("---")[0]
unique_schema = list(map(makeUnique, df.schema))
df_unique = df.rdd.toDF(schema=unique_schema)
df_unique.show()
numeric_cols = filter(lambda c: c.dataType.__class__ in [LongType, DoubleType], df_unique.schema)
numeric_col_names = list(map(lambda c: c.name, numeric_cols))
df_filtered = df_unique.select(*numeric_col_names)
df_filtered.show()
normal_schema = list(map(makeNormal, df_filtered.schema))
df_fixed = df_filtered.rdd.toDF(schema=normal_schema)
df_fixed.show()
Gives:
+-----------+---------------+---------------+
|id---chjruu|shared---aqboua|shared---ehjxor|
+-----------+---------------+---------------+
| 1| 0.0| ext-0.0|
| 1| 1.0| ext-1.0|
| 2| 1.0| ext-2.0|
| 3| 2.0| ext-3.0|
| 4| 3.0| ext-4.0|
+-----------+---------------+---------------+
+-----------+---------------+
|id---chjruu|shared---aqboua|
+-----------+---------------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+-----------+---------------+
+---+------+
| id|shared|
+---+------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+---+------+
Workaround: Simply rename the columns (in order) and then do whatever you wanted to do!
renamed_df = df.toDF("id", "shared_double", "shared_string")
I have a dataframe containing an array of rows on each row
I want to aggregate all the inner rows into one dataframe
Below is what I have / achieved:
This
df.select('*').take(1)
Gives me this:
[
Row(
body=[
Row(a=1, b=1),
Row(a=2, b=2)
]
)
]
So doing this:
df.rdd.flatMap(lambda x: x).collect()
I get this:
[[
Row(a=1, b=1)
Row(a=2, b=2)
]]
So I am forced to do this:
df.rdd.flatMap(lambda x: x).flatMap(lambda x: x)
So I can achieve the below:
[
Row(a=1, b=1)
Row(a=2, b=2)
]
Using the result above, I can finally convert it to a dataframe and save somewhere. Which is what I want. But calling flatMap twice doesnt look right.
I tried to the same by using Reduce, just like the following code:
flatRdd = df.rdd.flatMap(lambda x: x)
dfMerged = reduce(DataFrame.unionByName, [flatRdd])
The second argument of reduce must be iterable, so I was forced to add [flatRdd]. Sadly it gives me this:
[[
Row(a=1, b=1)
Row(a=2, b=2)
]]
There is certainlly a better way to achieve what I want.
IIUC, you can explode and then flatten the resulting Rows using the .* syntax.
Suppose you start with the following DataFrame:
df.show()
#+----------------+
#| body|
#+----------------+
#|[[1, 1], [2, 2]]|
#+----------------+
with the schema:
df.printSchema()
#root
# |-- body: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- a: long (nullable = true)
# | | |-- b: long (nullable = true)
You can first explode the body column:
from pyspark.sql.functions import explode
df = df.select(explode("body").alias("exploded"))
df.show()
#+--------+
#|exploded|
#+--------+
#| [1, 1]|
#| [2, 2]|
#+--------+
Now flatten the exploded column:
df = df.select("exploded.*")
df.show()
#+---+---+
#| a| b|
#+---+---+
#| 1| 1|
#| 2| 2|
#+---+---+
Now if you were to call collect, you'd get the desired output:
print(df.collect())
#[Row(a=1, b=1), Row(a=2, b=2)]
See also:
Querying Spark SQL DataFrame with complex types
You don't need to run flatMap() on the Row object, just refer it directly with the key:
>>> df.rdd.flatMap(lambda x: x.body).collect()
[Row(a=1, b=1), Row(a=2, b=2)]
I have a data set (example) that when imported with
df = spark.read.csv(filename, header=True, inferSchema=True)
df.show()
will assign the column with 'NA' as a stringType(), where I would like it to be IntegerType() (or ByteType()).
I then tried to set
schema = StructType([
StructField("col_01", IntegerType()),
StructField("col_02", DateType()),
StructField("col_03", IntegerType())
])
df = spark.read.csv(filename, header=True, schema=schema)
df.show()
The output shows the entire row with 'col_03' = null to be null.
However col_01 and col_02 return appropriate data if they are called with
df.select(['col_01','col_02']).show()
I can find a way around this by post casting the data type of col_3
df = spark.read.csv(filename, header=True, inferSchema=True)
df = df.withColumn('col_3',df['col_3'].cast(IntegerType()))
df.show()
, but I think it is not ideal and would be much better if I can assign the data type for each column directly with setting schema.
Would anyone be able to guide me what I do incorrectly? Or casting the data types after importing is the only solution? Any comment regarding performance of the two approaches (if we can make assigning schema to work) is also welcome.
Thank you,
You can set a new null value in spark's csv loader using nullValue:
for a csv file looking like this:
col_01,col_02,col_03
111,2007-11-18,3
112,2002-12-03,4
113,2007-02-14,5
114,2003-04-16,NA
115,2011-08-24,2
116,2003-05-03,3
117,2001-06-11,4
118,2004-05-06,NA
119,2012-03-25,5
120,2006-10-13,4
and forcing schema:
from pyspark.sql.types import StructType, IntegerType, DateType
schema = StructType([
StructField("col_01", IntegerType()),
StructField("col_02", DateType()),
StructField("col_03", IntegerType())
])
You'll get:
df = spark.read.csv(filename, header=True, nullValue='NA', schema=schema)
df.show()
df.printSchema()
+------+----------+------+
|col_01| col_02|col_03|
+------+----------+------+
| 111|2007-11-18| 3|
| 112|2002-12-03| 4|
| 113|2007-02-14| 5|
| 114|2003-04-16| null|
| 115|2011-08-24| 2|
| 116|2003-05-03| 3|
| 117|2001-06-11| 4|
| 118|2004-05-06| null|
| 119|2012-03-25| 5|
| 120|2006-10-13| 4|
+------+----------+------+
root
|-- col_01: integer (nullable = true)
|-- col_02: date (nullable = true)
|-- col_03: integer (nullable = true)
Try this once - (But this will read every column as string type. You can type caste as per your requirement)
import csv
from pyspark.sql.types import IntegerType
data = []
with open('filename', 'r' ) as doc:
reader = csv.DictReader(doc)
for line in reader:
data.append(line)
df = sc.parallelize(data).toDF()
df = df.withColumn("col_03", df["col_03"].cast(IntegerType()))
I have a mixed type dataframe.
I am reading this dataframe from hive table using
spark.sql('select a,b,c from table') command.
Some columns are int , bigint , double and others are string. There are 32 columns in total.
Is there any way in pyspark to convert all columns in the data frame to string type ?
Just:
from pyspark.sql.functions import col
table = spark.sql("table")
table.select([col(c).cast("string") for c in table.columns])
Here's a one line solution in Scala :
df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
Let's see an example here :
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
Row(1, "a"),
Row(5, "z")
)
val schema = StructType(
List(
StructField("num", IntegerType, true),
StructField("letter", StringType, true)
)
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)
val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)
I hope it helps
for col in df_data.columns:
df_data = df_data.withColumn(col, df_data[col].cast(StringType()))
For Scala, spark version > 2.0
case class Row(id: Int, value: Double)
import spark.implicits._
import org.apache.spark.sql.functions._
val r1 = Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)).toDF()
r1.show
+---+-----+
| id|value|
+---+-----+
| 1| 1.0|
| 2| 2.0|
| 3| 3.0|
+---+-----+
val castedDF = r1.columns.foldLeft(r1)((current, c) => current.withColumn(c, col(c).cast("String")))
castedDF.printSchema
root
|-- id: string (nullable = false)
|-- value: string (nullable = false)
you can cast single column as this
import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.withColumn("id", F.col("new_id").cast(T.StringType()))
and just for all column to cast