Check for empty row within spark dataframe? - apache-spark

Running over several CSV files, I am trying to do some checks, and for one file I am getting a NullPointerException; I suspect there are some empty rows.
So I am running the following, and for some reason it gives me an OK output:
import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType
# True only when every field in the row is None
check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()
Am I missing something within the filter function, or is it just not possible to extract empty rows from dataframes this way?

You could use df.dropna() to drop empty rows and then compare the counts.
Something like
df_clean = df.dropna()
num_empty_rows = df.count() - df_clean.count()
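If you also want to look at the offending records rather than just count them, a minimal sketch along these lines should work (it reuses df from above and checks every column for nulls):
from functools import reduce
from pyspark.sql import functions as F
# Keep only rows where at least one column is null, so you can inspect them.
null_rows = df.filter(reduce(lambda a, b: a | b,
                             [F.col(c).isNull() for c in df.columns]))
null_rows.show()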

You could use an inbuilt option for dealing with such scenarios.
val df = spark.read
.format("csv")
.option("header", "true")
.option("mode", "DROPMALFORMED") // Drop empty/malformed rows
.load("hdfs:///path/file.csv")
Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
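If you are reading the file with PySpark rather than Scala, the same option should carry over; a rough equivalent (the path is a placeholder):
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")  # drop malformed records while parsing
      .load("hdfs:///path/file.csv"))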

Related

apply window function to multiple columns

I have a DF with over 20 columns. For each column I need to find the lead value and add it to the result.
I've been doing it using withColumn:
df
.withColumn("lead_col1", lead("col1").over(window))
.withColumn("lead_col2", lead("col2").over(window))
.withColumn("lead_col3", lead("col3").over(window))
and 17 more rows like that. Is there a way to do it using less code? I tried using this example, but it doesn't work.
Check the code below; it is faster than foldLeft.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val windowSpec = ...
// Build the lead columns from (alias, source column) pairs.
val leadColumns = Seq(
  ("lead_col1", "col1"),
  ("lead_col2", "col2"),
  ("lead_col3", "col3")
).map { case (alias, name) => lead(col(name), 1).over(windowSpec).as(alias) }
// Keep the original columns and append the new lead columns.
val allColumns = df.columns.map(col) ++ leadColumns
Applying the combined columns to the DataFrame:
df.select(allColumns: _*).show(false)
Like Sath suggested, foldLeft works:
val columns = df.columns
columns.foldLeft(df) { (tempDF, colName) =>
  tempDF.withColumn("lag_" + colName, lag(col(colName), 1).over(window))
}

How to process pyspark dataframe as group by column value

I have a huge dataframe of different item_id values and their related data. I need to process each item_id group separately, in parallel. I tried to repartition the dataframe by item_id using the code below, but it seems it is still being processed as a whole rather than in chunks:
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
result = data.repartition('ITEM_ID') \
.rdd \
.mapPartitions(lambda iter: pd.DataFrame(list(iter), columns=columns))\
.mapPartitions(scan_item_best_model)\
.collect()
Also, is repartition the correct approach, or is there something I am doing wrong?
After looking around I found this, which addresses a similar problem; in the end I had to solve it like so:
import pandas as pd
import pyspark.sql.functions as F
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
df = data.select("ITEM_ID", F.struct(columns).alias("df"))
df = df.groupBy('ITEM_ID').agg(F.collect_list('df').alias('data'))
df = df.rdd.map(lambda big_df: (big_df['ITEM_ID'],
                                pd.DataFrame.from_records(big_df['data'], columns=columns))) \
       .map(scan_item_best_model)
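If you happen to be on Spark 3.x, a hedged alternative is groupBy().applyInPandas, which hands each ITEM_ID group to a pandas function directly. This sketch assumes scan_item_best_model accepts and returns a pandas DataFrame with the same columns as the input, which may not match your real signature:
# One pandas DataFrame per ITEM_ID group; reusing the input schema is only
# valid if the function returns the same columns.
result_df = (data.groupBy("ITEM_ID")
                 .applyInPandas(lambda pdf: scan_item_best_model(pdf),
                                schema=data.schema))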

Spark SQL "select column AS ..." not finding column

I am trying to run a SQL query on a Spark DataFrame. I have registered the DataFrame as a temp table, and now I am trying to run a SELECT on a column where I apply a UDF and then pick up the rows that pass a certain condition.
The problem is that my WHERE clause references the aliased column, but it cannot see the name declared with AS.
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", delimiter)
.load(path);
df.registerTempTable("df");
String sqlDfQuery = "SELECT parseDateTime(start) as start1 FROM df WHERE start1 > 1";
if (sqlContext.sql(sqlDfQuery).take(1) != null) return true;
When I run that, I get back:
org.apache.spark.sql.AnalysisException: cannot resolve 'start1' given input columns: [scores, start, ...
parseDateTime is a UDF defined like this:
sqlContext.udf().register("parseDateTime", (String dt) -> new DateTime(dt).getMillis(), DataTypes.LongType);
Should I not be trying to do that?
This happens because the WHERE clause is evaluated before the SELECT aliases exist, so start1 is not visible at that point.
You could use a nested select statement to solve this issue.
Something like the following:
String sqlDfQuery = "SELECT start1 FROM (
SELECT parseDateTime(start) AS start1 FROM df) TMP
WHERE start1 > 1 ";
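If you work through the DataFrame API instead of SQL strings, the same two-step shape applies: select with the alias first, then filter the resulting frame. A loose PySpark sketch, where parse_date_time is a hypothetical stand-in for the registered parseDateTime UDF and the dates are assumed to be ISO-8601 strings:
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import LongType
# Hypothetical UDF mirroring parseDateTime: ISO-8601 string to epoch millis.
parse_date_time = F.udf(lambda s: int(datetime.fromisoformat(s).timestamp() * 1000),
                        LongType())
result = (df.select(parse_date_time(F.col("start")).alias("start1"))
            .filter(F.col("start1") > 1))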

Python Spark na.fill does not work

I'm working with spark 1.6 and Python.
I merged 2 dataframes:
df = df_1.join(df_2, df_1.id == df_2.id, 'left').drop(df_2.id)
I get a new dataframe with the correct values, and null where the keys don't match.
I would like to replace all the null values in my dataframe.
I used this function, but it does not replace the null values:
new_df = df.na.fill(0.0)
Does someone know why it does not work?
Many thanks for your answer.
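One hedged guess, since the column types are not shown: na.fill(0.0) only fills numeric columns, so nulls in string columns are left as they are. Passing a per-column dict covers mixed types (the column names below are illustrative):
# Fill numeric and string columns with type-appropriate defaults.
new_df = df.na.fill({"numeric_col": 0.0, "string_col": "unknown"})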

get specific row from spark dataframe

Is there any alternative to df[100, c("column")] from R for Scala Spark data frames? I want to select a specific row from a column of a Spark data frame, for example the 100th row as in the R code above.
Firstly, you must understand that DataFrames are distributed, which means you can't access them in a typical procedural way; you have to run an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than any of the other languages' docs.
However, continuing with my explanation, I would use some methods of the RDD API, because every DataFrame has an RDD as an attribute. Please see my example below, and notice how I take the 2nd record.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()
          .filter(lambda pair: pair[1] == myIndex)  # pair is (Row, index)
          .map(lambda pair: tuple(pair[0]))
          .collect())
print(values[0])
# ('b', 2)
Hopefully, someone gives another solution with fewer steps.
This is how I achieved the same in Scala. I am not sure if it is more efficient than the accepted answer, but it requires less coding:
val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
val myRow7th = parquetFileDF.rdd.take(7).last
In PySpark, if your dataset is small (it can fit into the memory of the driver), you can do
df.collect()[n]
where df is the DataFrame object and n is the index of the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
The getrows() function below should get the specific rows you want.
For completeness, I have written down the full code in order to reproduce the output.
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()
# Create the dataframe
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
# Function to get rows at `rownums`
def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])
# Get rows at positions 0 and 2.
getrows(df, rownums=[0, 2]).collect()
# Output:
#> [Row(letter='a', name=1), Row(letter='c', name=3)]
This Works for me in PySpark
df.select("column").collect()[0][0]
There is a Scala way (if you have enough memory on the working machine):
val arr = df.select("column").rdd.collect
println(arr(100))
If the dataframe schema is unknown, and you know the actual type of the "column" field (for example Double), then you can get arr as follows:
val arr = df.select($"column".cast("Double")).as[Double].rdd.collect
You can simply do it with the single line of code below (index 99 returns the 100th row):
val arr = df.select("column").collect()(99)
When you want to fetch the max value of a date column from a dataframe, just the value itself rather than a Row object, you can use the code below:
from pyspark.sql.functions import max
table = "mytable"
max_date = df.select(max('date_col')).first()[0]
# gives 2020-06-26 instead of Row(max(date_col)=datetime.date(2020, 6, 26))
The following is a Java-Spark way to do it: 1) add a sequentially incrementing column, 2) select the row number using that id, 3) drop the column.
import static org.apache.spark.sql.functions.*;
..
ds = ds.withColumn("rownum", monotonically_increasing_id());
ds = ds.filter(col("rownum").equalTo(99));
ds = ds.drop("rownum");
N.B. monotonically_increasing_id starts from 0; the generated ids are increasing and unique, but not guaranteed to be consecutive.
