Check for empty row within spark dataframe? - apache-spark

Running over several CSV files, I am trying to do some checks, and for one file I am getting a NullPointerException; I suspect there are some empty rows.
So I am running the following, and for some reason it gives me an OK output:
import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType
# True only when every field in the row is None
check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()
Am I missing something within the filter function, or is it just not possible to extract empty rows from dataframes this way?

You could use df.dropna() to drop empty rows and then compare the counts.
Something like
df_clean = df.dropna()
num_empty_rows = df.count() - df_clean.count()
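If you also want to look at the offending records rather than just count them, a minimal sketch along these lines should work (it reuses df from above and checks every column for nulls):
from functools import reduce
from pyspark.sql import functions as F
# Keep only rows where at least one column is null, so you can inspect them.
null_rows = df.filter(reduce(lambda a, b: a | b,
                             [F.col(c).isNull() for c in df.columns]))
null_rows.show()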

You could use an inbuilt option for dealing with such scenarios.
val df = spark.read
.format("csv")
.option("header", "true")
.option("mode", "DROPMALFORMED") // Drop empty/malformed rows
.load("hdfs:///path/file.csv")
Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
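If you are reading the file with PySpark rather than Scala, the same option should carry over; a rough equivalent (the path is a placeholder):
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")  # drop malformed records while parsing
      .load("hdfs:///path/file.csv"))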

Related

apply window function to multiple columns

I have a DF with over 20 columns. For each column I need to find the lead value and add it to the result.
I've been doing it using withColumn:
df
.withColumn("lead_col1", lead("col1").over(window))
.withColumn("lead_col2", lead("col2").over(window))
.withColumn("lead_col3", lead("col3").over(window))
and 17 more rows like that. Is there a way to do it using less code? I tried using this example, but it doesn't work.
Check the code below; it is faster than foldLeft.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val windowSpec = ...
// Build the lead columns from (alias, source column) pairs.
val leadColumns = Seq(
  ("lead_col1", "col1"),
  ("lead_col2", "col2"),
  ("lead_col3", "col3")
).map { case (alias, name) => lead(col(name), 1).over(windowSpec).as(alias) }
// Keep the original columns and append the new lead columns.
val allColumns = df.columns.map(col) ++ leadColumns
Applying the combined columns to the DataFrame:
df.select(allColumns: _*).show(false)
Like Sath suggested, foldLeft works:
val columns = df.columns
columns.foldLeft(df) { (tempDF, colName) =>
  tempDF.withColumn("lag_" + colName, lag(col(colName), 1).over(window))
}

How to process pyspark dataframe as group by column value

I have a huge dataframe of different item_id values and their related data. I need to process each item_id group separately, in parallel. I tried to repartition the dataframe by item_id using the code below, but it seems it is still being processed as a whole rather than in chunks:
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
result = data.repartition('ITEM_ID') \
.rdd \
.mapPartitions(lambda iter: pd.DataFrame(list(iter), columns=columns))\
.mapPartitions(scan_item_best_model)\
.collect()
Also, is repartition the correct approach, or is there something I am doing wrong?
After looking around I found this, which addresses a similar problem; in the end I had to solve it like so:
import pandas as pd
import pyspark.sql.functions as F
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
df = data.select("ITEM_ID", F.struct(columns).alias("df"))
df = df.groupBy('ITEM_ID').agg(F.collect_list('df').alias('data'))
df = df.rdd.map(lambda big_df: (big_df['ITEM_ID'],
                                pd.DataFrame.from_records(big_df['data'], columns=columns))) \
       .map(scan_item_best_model)
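If you happen to be on Spark 3.x, a hedged alternative is groupBy().applyInPandas, which hands each ITEM_ID group to a pandas function directly. This sketch assumes scan_item_best_model accepts and returns a pandas DataFrame with the same columns as the input, which may not match your real signature:
# One pandas DataFrame per ITEM_ID group; reusing the input schema is only
# valid if the function returns the same columns.
result_df = (data.groupBy("ITEM_ID")
                 .applyInPandas(lambda pdf: scan_item_best_model(pdf),
                                schema=data.schema))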

Spark SQL "select column AS ..." not finding column

I am trying to run a SQL query on a Spark DataFrame. I have registered the DataFrame as a temp table, and now I am trying to run a SELECT on a column where I apply a UDF and then pick up the rows that pass a certain condition.
The problem is that my WHERE clause references the aliased column, but it cannot see the name declared with AS.
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", delimiter)
.load(path);
df.registerTempTable("df");
String sqlDfQuery = "SELECT parseDateTime(start) as start1 FROM df WHERE start1 > 1";
if (sqlContext.sql(sqlDfQuery).take(1) != null) return true;
When I run that, I get back:
org.apache.spark.sql.AnalysisException: cannot resolve 'start1' given input columns: [scores, start, ...
parseDateTime is a UDF defined like this:
sqlContext.udf().register("parseDateTime", (String dt) -> new DateTime(dt).getMillis(), DataTypes.LongType);
Should I not be trying to do that?
This happens because the WHERE clause is evaluated before the SELECT aliases exist, so start1 is not visible at that point.
You could use a nested select statement to solve this issue.
Something like the following:
String sqlDfQuery = "SELECT start1 FROM (
SELECT parseDateTime(start) AS start1 FROM df) TMP
WHERE start1 > 1 ";
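If you work through the DataFrame API instead of SQL strings, the same two-step shape applies: select with the alias first, then filter the resulting frame. A loose PySpark sketch, where parse_date_time is a hypothetical stand-in for the registered parseDateTime UDF and the dates are assumed to be ISO-8601 strings:
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import LongType
# Hypothetical UDF mirroring parseDateTime: ISO-8601 string to epoch millis.
parse_date_time = F.udf(lambda s: int(datetime.fromisoformat(s).timestamp() * 1000),
                        LongType())
result = (df.select(parse_date_time(F.col("start")).alias("start1"))
            .filter(F.col("start1") > 1))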

Python Spark na.fill does not work

I'm working with spark 1.6 and Python.
I merged 2 dataframes:
df = df_1.join(df_2, df_1.id == df_2.id, 'left').drop(df_2.id)
I get a new dataframe with the correct values, and null where the keys don't match.
I would like to replace all the null values in my dataframe.
I used this function, but it does not replace the null values:
new_df = df.na.fill(0.0)
Does someone know why it does not work?
Many thanks for your answer.
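One hedged guess, since the column types are not shown: na.fill(0.0) only fills numeric columns, so nulls in string columns are left as they are. Passing a per-column dict covers mixed types (the column names below are illustrative):
# Fill numeric and string columns with type-appropriate defaults.
new_df = df.na.fill({"numeric_col": 0.0, "string_col": "unknown"})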

get specific row from spark dataframe

Is there any alternative to df[100, c("column")] from R for Scala Spark data frames? I want to select a specific row from a column of a Spark data frame, for example the 100th row as in the R code above.
Firstly, you must understand that DataFrames are distributed, which means you can't access them in a typical procedural way; you have to run an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than any of the other languages' docs.
However, continuing with my explanation, I would use some methods of the RDD API, because every DataFrame has an RDD as an attribute. Please see my example below, and notice how I take the 2nd record.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()
          .filter(lambda pair: pair[1] == myIndex)  # pair is (Row, index)
          .map(lambda pair: tuple(pair[0]))
          .collect())
print(values[0])
# ('b', 2)
Hopefully, someone gives another solution with fewer steps.
This is how I achieved the same in Scala. I am not sure if it is more efficient than the accepted answer, but it requires less coding:
val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
val myRow7th = parquetFileDF.rdd.take(7).last
In PySpark, if your dataset is small (it can fit into the memory of the driver), you can do
df.collect()[n]
where df is the DataFrame object and n is the index of the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
The getrows() function below should get the specific rows you want.
For completeness, I have written down the full code in order to reproduce the output.
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()
# Create the dataframe
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
# Function to get rows at `rownums`
def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])
# Get rows at positions 0 and 2.
getrows(df, rownums=[0, 2]).collect()
# Output:
#> [Row(letter='a', name=1), Row(letter='c', name=3)]
This Works for me in PySpark
df.select("column").collect()[0][0]
There is a Scala way (if you have enough memory on the working machine):
val arr = df.select("column").rdd.collect
println(arr(100))
If the dataframe schema is unknown, and you know the actual type of the "column" field (for example Double), then you can get arr as follows:
val arr = df.select($"column".cast("Double")).as[Double].rdd.collect
You can simply do it with the single line of code below (index 99 returns the 100th row):
val arr = df.select("column").collect()(99)
When you want to fetch the max value of a date column from a dataframe, just the value itself rather than a Row object, you can use the code below:
from pyspark.sql.functions import max
table = "mytable"
max_date = df.select(max('date_col')).first()[0]
# gives 2020-06-26 instead of Row(max(date_col)=datetime.date(2020, 6, 26))
The following is a Java-Spark way to do it: 1) add a sequentially incrementing column, 2) select the row number using that id, 3) drop the column.
import static org.apache.spark.sql.functions.*;
..
ds = ds.withColumn("rownum", monotonically_increasing_id());
ds = ds.filter(col("rownum").equalTo(99));
ds = ds.drop("rownum");
N.B. monotonically_increasing_id starts from 0; the generated ids are increasing and unique, but not guaranteed to be consecutive.
