This question already has an answer here:
Adding a nullable column in Spark dataframe
(1 answer)
Closed 10 months ago.
When I create a new column with F.lit(1) and then call printSchema(), I get
column_name: integer (nullable = false)
As the lit function's documentation is quite scarce, do you think there is any simple mapping that can be done to turn it into nullable = true?
Okay, in this scenario (only some specific column mapping, nothing in bulk) it seems like
df.schema['column_name'].nullable = True
does the trick. Nevertheless df.printSchema() isn't updated, although df.schema is.
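As a side note, mutating df.schema only changes the Python-side schema object; printSchema() reads the schema from the JVM plan, which is why it does not reflect the change. A minimal sketch of one way to actually get a nullable column, assuming an existing SparkSession named spark (the example data and column name are illustrative), is to rebuild the DataFrame against an adjusted copy of the schema:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField

df = spark.range(3).withColumn("column_name", F.lit(1))

# copy the schema, flipping nullable on the column in question
new_schema = StructType([
    StructField(f.name, f.dataType, True if f.name == "column_name" else f.nullable)
    for f in df.schema.fields
])

# rebuild the DataFrame against the adjusted schema
df2 = spark.createDataFrame(df.rdd, new_schema)
df2.printSchema()  # column_name: integer (nullable = true)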
Maybe a really silly question, but for:
val ds3 = ds.groupBy($"ip")
.avg("humidity")
it is not clear how, for a Dataset (not a DataFrame), I can rename the column on the fly, e.g. using alias. I tried a few things but to no avail: no errors when trying, but no effect either.
I would like "avg_humidity" as the column name.
Extending the question, what if I issue:
val ds3 = ds.groupBy($"ip")
.avg()
How to handle that?
avg does not provide an alias function, so you might need an extra withColumnRenamed:
val ds3 = ds.groupBy($"ip")
.avg("humidity")
.withColumnRenamed("avg(humidity)","avg_humidity")
Instead, you can use .agg(avg("humidity").as("avg_humidity")):
val ds3 = ds.groupBy($"ip").agg(avg("humidity").as("avg_humidity"))
groupBy(cols: Column*) returns a RelationalGroupedDataset.
The return type of avg(colNames: String*) on it is a DataFrame, so by using as(alias: String) you're simply assigning an alias to the new DataFrame, not to a column (or columns).
SO discussion about renaming columns in a DataFrame is here.
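For the bare .avg() case (no column names given), the aggregated columns come back named avg(colName), so one option is to rename them all afterwards. A rough sketch of the idea, shown in PySpark syntax for brevity (the Scala version is the same withColumnRenamed loop, e.g. via foldLeft); the variable names are just illustrative:
ds3 = ds.groupBy("ip").avg()
for c in ds3.columns:
    if c.startswith("avg(") and c.endswith(")"):
        # turn "avg(humidity)" into "avg_humidity"
        ds3 = ds3.withColumnRenamed(c, "avg_" + c[4:-1])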
This question already has answers here:
How do I split an RDD into two or more RDDs?
(4 answers)
Closed 4 years ago.
I want to split an RDD into multiple RDDs based on a value in each row. The possible values are known in advance and fixed in nature.
For example:
source_rdd = sc.parallelize([('a',1),('a',2),('a',3),('b',4),('b',5),('b',6)])
should be split into two RDDs, one containing only 'a' keys and the other containing only 'b' keys.
I have tried the groupByKey method and was able to do it successfully after a collect() operation on the grouped RDD, which I cannot do in production due to memory constraints:
a_rdd, b_rdd = source_rdd.keyBy(lambda row: row[0]).groupByKey().collect()
The current implementation is to apply a separate filter operation to get each RDD:
a_rdd = source_rdd.filter(lambda row: row[0] == 'a')
b_rdd = source_rdd.filter(lambda row: row[0] == 'b')
Can this be optimized further? What would be the best way to do it in production with data that cannot fit in memory?
Usage: these RDDs will be converted into different DataFrames (one for each key), each with a different schema, and stored in S3 as output.
Note: I would prefer a PySpark implementation. I have read a lot of Stack Overflow answers and blog posts and could not find anything that works for me yet.
I have already seen the question this is marked as a duplicate of, and I mentioned it in my question. I asked this question because the provided solution seems not to be the most optimized way and is 3 years old.
You can use toDF too. Also, a_rdd and b_rdd in your code are not RDDs, since they have been collected!
grouped_rdd = source_rdd.keyBy(lambda row: row[0]).groupByKey()
a_rdd = grouped_rdd.filter(lambda row: row[0] == 'a')
b_rdd = grouped_rdd.filter(lambda row: row[0] == 'b')
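Since the filters are lazy, each one triggers its own pass over the source, so persisting the source avoids recomputing it for every key. A hedged sketch of that approach for the S3 use case, assuming an existing SparkSession named spark (the keys, column names and bucket path are illustrative):
from pyspark import StorageLevel

source_rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk rather than recompute per filter

for key in ['a', 'b']:  # keys are known up front
    # bind key as a default argument so the lambda does not capture the loop variable by reference
    key_rdd = source_rdd.filter(lambda row, k=key: row[0] == k)
    key_df = spark.createDataFrame(key_rdd, ["key", "value"])  # per-key transformation/schema goes here
    key_df.write.mode("overwrite").parquet("s3://my-bucket/output/%s/" % key)

source_rdd.unpersist()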
This question already has answers here:
Create new Dataframe with empty/null field values
(2 answers)
Closed 4 years ago.
I am trying to add a new column using withColumn whose value should be NULL, but it's not working.
val schema = StructType(
StructField("uid",StringType,true)::
StructField("sid",StringType,true)::
StructField("astid",StringType,true)::
StructField("timestamp",StringType,true)::
StructField("start",StringType,true)::
StructField("end",StringType,true)::
StructField("geo",StringType,true)::
StructField("stnid",StringType,true)::
StructField("end_type",LongType,true)::
StructField("like",LongType,true)::
StructField("dislike",LongType,true)::Nil
)
val Mobpath = spark.read.schema(schema).csv("/data/mob.txt")
Mobpath.printSchema()
Mobpath.createOrReplaceTempView("Mobpathsql")
val showall = spark.sql("select * from Mobpathsql")
showall.show()
val newcol = Mobpath.withColumn("new1",functions.lit("null"))
newcol.show()
Using withColumn, it is not showing any error, but it is also not showing any output.
What about this:
val newcol = showall.withColumn("new1",functions.lit("null"))
newcol.show()
I just tested the above code and it worked; I don't know why it does not work with Mobpath.
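One thing worth noting: lit("null") produces the literal string "null", not a SQL NULL. If an actually null column is the goal, a typed null literal is needed; a minimal sketch, written in PySpark syntax and reusing the showall name from the question purely for illustration (the Scala shape is the same, a lit(null) cast to the desired type):
from pyspark.sql import functions as F

# a NULL-valued string column, rather than the string "null"
newcol = showall.withColumn("new1", F.lit(None).cast("string"))
newcol.show()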
Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
However it will be a long time before spark supports that new parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:
https://issues.apache.org/jira/browse/SPARK-10943
So what are folks doing with regard to null column values today when writing out DataFrames to Parquet? I can only think of very ugly, horrible hacks like writing empty strings and... well... I have no idea what to do with numerical values to indicate null - short of putting some sentinel value in and having my code check for it (which is inconvenient and bug prone).
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.
The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema
root
|-- comments: null (nullable = true)
As per the comment by Michael Armbrust, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)
and the result can be safely written to Parquet.
I wrote a PySpark solution for this (df is a DataFrame with columns of NullType):
# get dataframe schema
my_schema = list(df.schema)
null_cols = []
# iterate over schema list to filter for NullType columns
for st in my_schema:
if str(st.dataType) == 'NullType':
null_cols.append(st)
# cast null type columns to string (or whatever you'd like)
for ncol in null_cols:
mycolname = str(ncol.name)
df = df \
.withColumn(mycolname, df[mycolname].cast('string'))
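After the loop the schema no longer contains NullType columns, so the result can be written out as usual; a quick check and write, with an illustrative output path:
df.printSchema()  # the former NullType columns now show up as string
df.write.mode("overwrite").parquet("/tmp/output.parquet")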
This question already has an answer here:
Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)
(1 answer)
Closed 5 years ago.
Let's assume I create a parquet file as follows:
case class A (i:Int,j:Double,s:String)
var l1 = List(A(1,2.0,"s1"),A(2,3.0,"S2"))
val ds = spark.createDataset(l1)
ds.write.parquet("/tmp/test.parquet")
Is it possible to read it into a Dataset of a type with a different schema, where the only difference is a few additional fields?
Eg:
case class B (i:Int,j:Double,s:String,d:Double=1.0) // d is extra and has a default value
Is there a way that I can make this work?
val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]
In Spark, if the schema of the Dataset does not match the desired type U, you can use select along with alias or as to rearrange or rename as required. That means that for the following code to work:
val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]
the following modification needs to be made:
val ds2 = spark.read.parquet("/tmp/test.parquet").withColumn("d", lit(1D)).as[B]
Or, if creating an additional column is not possible, then the following can be done:
val ds2 = spark.read.parquet("/tmp/test.parquet").map{
case row => B(row.getInt(0), row.getDouble(1), row.getString(2))
}