Python Spark na.fill does not work - apache-spark

I'm working with Spark 1.6 and Python.
I merged two dataframes:
df = df_1.join(df_2, df_1.id == df_2.id, 'left').drop(df_2.id)
I get a new dataframe with the correct values, and null where the keys don't match.
I would like to replace all the null values in my dataframe.
I used this function, but it does not replace the null values:
new_df = df.na.fill(0.0)
Does anyone know why it does not work?
Many thanks for your answer.
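For what it's worth, here is a minimal hedged sketch of how na.fill behaves (the column names below are made up for illustration): a numeric fill value only replaces nulls in columns of a matching numeric type, so nulls in string columns are left untouched, while a dict lets you target specific columns with per-column values.
new_df = df.na.fill(0.0)  # fills nulls in numeric columns only
new_df = df.na.fill({"amount": 0.0, "category": "unknown"})  # hypothetical columns, per-column values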

Related

PySpark conversion toPandas() problem, ValueError: ordinal must be >= 1

Hello everyone!
I am reading data from a Data Lake (which holds database tables) using PySpark. After applying some filters I put the results in a Spark DataFrame, but when I convert it to a pandas DataFrame using toPandas(), I get this error in Jupyter: ValueError: ordinal must be >= 1.
all_columns = list(df.columns)
df = spark_df.select(all_columns)
new_df = df.toPandas()
ValueError: ordinal must be >= 1
Does anyone have an idea how to fix this, please?
Thank you in advance!
I tried sparkDataFrame.toPandas()
I expected to get a pandas DataFrame
Check out this StackOverflow question. Double-check whether there are strange date values in your PySpark DataFrame before converting to pandas. You can check the MIN and MAX dates that pandas DataFrames support here.
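As a hedged sketch of that check (assuming the timestamp column is called date_col, which is hypothetical): pandas timestamps only cover roughly 1677-09-21 through 2262-04-11, so rows outside that window can be located before calling toPandas().
import pandas as pd
from pyspark.sql import functions as F

out_of_range = spark_df.filter(
    (F.col("date_col") < F.lit(str(pd.Timestamp.min.date()))) |
    (F.col("date_col") > F.lit(str(pd.Timestamp.max.date())))
)
out_of_range.show()  # any rows here are likely to break the pandas conversion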

Check for empty rows within a Spark dataframe?

I am running over several CSV files trying to do some checks, and for one file I am getting a NullPointerException; I suspect there are some empty rows.
So I am running the following, and for some reason it gives me an OK output:
from pyspark.sql import functions as sf
from pyspark.sql.types import BooleanType

check_empty = lambda row: not any([False if k is None else True for k in row])  # True only when every field is None
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()
Am I missing something within the filter function, or is it not possible to extract empty rows from dataframes?
You could use df.dropna(how='all') to drop fully empty rows and then compare the counts.
Something like
df_clean = df.dropna(how='all')  # drops rows where every column is null
num_empty_rows = df.count() - df_clean.count()
You could use an inbuilt option for dealing with such scenarios.
val df = spark.read
.format("csv")
.option("header", "true")
.option("mode", "DROPMALFORMED") // Drop empty/malformed rows
.load("hdfs:///path/file.csv")
Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
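Since the question's code is PySpark, a rough equivalent of the Scala snippet above (assuming Spark 2.x, where the CSV reader is built in) would be:
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")  # drop empty/malformed rows
      .load("hdfs:///path/file.csv"))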

DataFrame object has no attribute 'col'

In Spark: The Definitive Guide it says:
If you need to refer to a specific DataFrame’s column, you can use the
col method on the specific DataFrame.
For example (in Python/PySpark):
df.col("count")
However, when I run that code on a dataframe containing a column named count, I get the error 'DataFrame' object has no attribute 'col'. If I try column I get a similar error.
Is the book wrong, or how should I go about doing this?
I'm on Spark 2.3.1. The dataframe was created with the following:
df = spark.read.format("json").load("/Users/me/Documents/Books/Spark-The-Definitive-Guide/data/flight-data/json/2015-summary.json")
The book you're referring to describes the Scala/Java API. In PySpark, use []:
df["count"]
The book mixes the Scala and PySpark APIs.
In the Scala/Java API, df.col("column_name") or df.apply("column_name") returns the Column.
In PySpark, use either of the following to get a column from a DataFrame:
df.colName
df["colName"]
Applicable to Python Only
Given a DataFrame such as
>>> df
DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]
You can access any column with dot notation
>>> df.DEST_COUNTRY_NAME
Column<'DEST_COUNTRY_NAME'>
You can also use key-based indexing to do the same
>>> df['DEST_COUNTRY_NAME']
Column<'DEST_COUNTRY_NAME'>
However, if your column name clashes with a method name on DataFrame,
the column will be shadowed when you use dot notation.
>>> df['count']
Column<'count'>
>>> df.count
<bound method DataFrame.count of DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]>
In PySpark, col from pyspark.sql.functions can also be used to refer to the column by name:
from pyspark.sql.functions import col
df.select(col("count")).show()

Spark: assign a number to each word in collect

I have collected the data of a DataFrame column in Spark:
temp = df.select('item_code').collect()
Result:
[Row(item_code=u'I0938'),
Row(item_code=u'I0009'),
Row(item_code=u'I0010'),
Row(item_code=u'I0010'),
Row(item_code=u'C0723'),
Row(item_code=u'I1097'),
Row(item_code=u'C0117'),
Row(item_code=u'I0009'),
Row(item_code=u'I0009'),
Row(item_code=u'I0009'),
Row(item_code=u'I0010'),
Row(item_code=u'I0009'),
Row(item_code=u'C0117'),
Row(item_code=u'I0009'),
Row(item_code=u'I0596')]
Now I would like to assign a number to each word; duplicate words should get the same number.
I am using Spark (RDDs/DataFrames), not pandas.
Please help me resolve this problem!
You could create a new dataframe which has distinct values.
val data = temp.distinct()
Now you can assign a unique ID using
import org.apache.spark.sql.functions._
val dataWithId = data.withColumn("uniqueID",monotonicallyIncreasingId)
Now you can join this new dataframe with the original dataframe and select the unique id.
val tempWithId = temp.join(dataWithId, "item_code").select("item_code", "uniqueID")
The code assumes Scala, but something similar exists for PySpark as well; consider this a pointer.
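For reference, a PySpark sketch of the same idea (item_code is taken from the question; the other names are illustrative):
from pyspark.sql.functions import monotonically_increasing_id

distinct_codes = df.select("item_code").distinct()
codes_with_id = distinct_codes.withColumn("uniqueID", monotonically_increasing_id())  # IDs are unique but not consecutive
df_with_id = df.join(codes_with_id, "item_code").select("item_code", "uniqueID")
df_with_id.show()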

JavaRDD subtract result differs depending on whether data is read from disk or is in memory

I'm experiencing a strange behavior when I try to use JavaRDD subtract to compare 2 DataFrames.
This is what I'm doing:
I try to check whether two DataFrames (A, B) are equal by converting them to JavaRDDs and then subtracting A from B and B from A. If they are equal (contain the same data), both results should be empty JavaRDDs.
However, I did not get empty results:
DataFrame A = someFunctionRespondWithDF(param);
DataFrame B = sqlContext.read().json("src/test/resources/expected/exp.json");
Assert.assertTrue(B.toJavaRDD().subtract(A.toJavaRDD()).isEmpty());
Assert.assertTrue(A.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());
...assert fails
If I write the data to disk and read it back into another DataFrame, then it's fine.
A.write().json("target/result.json");
DataFrame AA = sqlContext.read().json("target/result.json");
Assert.assertTrue(B.toJavaRDD().subtract(AA.toJavaRDD()).isEmpty());
Assert.assertTrue(AA.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());
...assert true
I also tried to force evaluation by calling count(), cache() or persist() on the DataFrame (based on this answer), but with no success.
DataFrame AAA = A.cache();
Assert.assertTrue(B.toJavaRDD().subtract(AAA.toJavaRDD()).isEmpty());
Assert.assertTrue(AAA.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());
Has anybody experienced the same? What am I missing here?
Spark version: 1.6.1
OK, I can answer my own question:
The reason the assertion fails is that when I read the DataFrame from JSON, the types differ. Say I had an Integer in my original DataFrame; after reading it back from JSON (without a schema file) it will be a Long.
Solution: use a format that describes the schema, like Avro.
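A hedged PySpark-flavoured illustration of that fix, with A standing in for the original DataFrame (Parquet is used here because it is built in and also carries the schema; Avro behaves the same way via the spark-avro package):
A.write.mode("overwrite").parquet("target/result.parquet")
AA = sqlContext.read.parquet("target/result.parquet")
assert A.schema == AA.schema  # types survive the round trip, so subtract() comparisons match
# Alternatively, re-read the JSON with an explicit schema to avoid the type widening:
AA = sqlContext.read.schema(A.schema).json("target/result.json")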

Resources