Dataframe in Pyspark - apache-spark

I was dropping a column from a DataFrame, and it appeared to be dropped. But after calling the show method again, it seems the column is not dropped from the DataFrame.
Code:
df.drop('Salary').show()
+-----+
| Name|
+-----+
| Arun|
| Joe|
|Jerry|
+-----+
df.show()
+-----+------+
| Name|Salary|
+-----+------+
| Arun| 5000|
| Joe| 6300|
|Jerry| 9600|
+-----+------+
I am using Spark version 2.4.4. Could you please tell me why it is not dropped? I thought it would be like dropping a column from a table in an Oracle database.

The drop method returns a new DataFrame. DataFrames are immutable, so the original df is not changed by this transformation; calling df.show() a second time therefore still shows the original data, including your Salary column.

You need to save the result of the drop into a new variable:
df2 = df.drop('Salary')
df2.show()
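If you prefer, you can also reassign the same variable name; nothing about the original DataFrame is mutated, df just ends up pointing at the new, column-dropped DataFrame:
# Reassign df to the new DataFrame returned by drop
df = df.drop('Salary')
df.show()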

Related

pyspark: read partitioned parquet "my_file.parquet/col1=NOW" string value replaced by <current_time> on read()

With PySpark 3.1.1 on WSL Debian 10:
When reading a Parquet file partitioned by a column containing the string NOW, the string is replaced by the current time at the moment the read() function is executed. I suppose the string NOW is being interpreted as now().
# steps to reproduce
df = spark.createDataFrame(data=[("NOW",1), ("TEST", 2)], schema = ["col1", "id"])
df.write.partitionBy("col1").parquet("test/test.parquet")
>>> /home/test/test.parquet/col1=NOW
df_loaded = spark.read.option(
"basePath",
"test/test.parquet",
).parquet("test/test.parquet/col1=*")
df_loaded.show(truncate=False)
>>>
+---+--------------------------+
|id |col1                      |
+---+--------------------------+
|2  |TEST                      |
|1  |2021-04-18 14:36:46.532273|
+---+--------------------------+
Is this a bug or normal PySpark behaviour?
If the latter, is there a Spark configuration option to avoid that behaviour?
I suspect that's expected behaviour: during partition discovery Spark tries to infer the data types of partition columns, though I'm not sure where it is documented. Anyway, if you want to keep the column as a string column, you can provide a schema while reading the parquet file:
df = spark.read.schema("id long, col1 string").parquet("test/test.parquet")
df.show()
+---+----+
| id|col1|
+---+----+
| 1| NOW|
| 2|TEST|
+---+----+
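If you would rather not spell out a schema, Spark's partition-discovery type inference can also be switched off, in which case partition columns are read back as plain strings. A minimal sketch, assuming the setting is applied before the read:
# Assumption: with partition-column type inference disabled, partition values
# such as "NOW" stay as strings instead of being parsed as timestamps
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
df_loaded = spark.read.parquet("test/test.parquet")
df_loaded.show(truncate=False)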

Create a column with values created from all other columns as a JSON string in PySpark

I have a dataframe as below:
+----------+----------+--------+
|     FNAME|     LNAME|     AGE|
+----------+----------+--------+
|      EARL|     JONES|      35|
|      MARK|      WOOD|      20|
+----------+----------+--------+
I am trying to add a new column, VALUE, to this dataframe, so that it looks like this:
+----------+----------+--------+-------------------------------------------+
|     FNAME|     LNAME|     AGE|                                      VALUE|
+----------+----------+--------+-------------------------------------------+
|      EARL|     JONES|      35|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|      MARK|      WOOD|      20| {"FNAME":"MARK","LNAME":"WOOD","AGE":"20"}|
+----------+----------+--------+-------------------------------------------+
I am not able to achieve this using withColumn or any json function.
Any headstart would be appreciated.
Spark: 2.3
Python: 3.7.x
Please consider using the SQL function to_json, which you can find in org.apache.spark.sql.functions.
Here's the solution:
df.withColumn("VALUE", to_json(struct($"FNAME", $"LNAME", $"AGE")))
You can also avoid specifying the column names explicitly, as follows:
df.withColumn("VALUE", to_json(struct(df.columns.map(col): _*)))
PS: the code I provided is written in Scala, but the logic is the same in Python; you just have to use the Spark SQL function, which is available in both programming languages.
I hope it helps.
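For reference, a PySpark sketch of the same column-agnostic idea (it mirrors what the asker eventually used, further down):
from pyspark.sql.functions import to_json, struct, col

# Build a struct from every column, then serialize it to a JSON string
df = df.withColumn("VALUE", to_json(struct(*[col(c) for c in df.columns])))
df.show(truncate=False)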
Scala solution:
val df2 = df.select(
  to_json(
    map_from_arrays(lit(df.columns), array('*))
  ).as("value")
)
Python solution (I don't know how to do it for n columns like in Scala, because map_from_arrays does not exist in PySpark):
import pyspark.sql.functions as f

df.select(
    f.to_json(
        f.create_map(f.lit("FNAME"), df.FNAME, f.lit("LNAME"), df.LNAME, f.lit("AGE"), df.AGE)
    ).alias("value")
).show(truncate=False)
output:
+-------------------------------------------+
|value                                      |
+-------------------------------------------+
|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
+-------------------------------------------+
Achieved using:
df.withColumn("VALUE", to_json(struct([df[x] for x in df.columns])))

How to join two pyspark data frames on Arraytype operation?

I have two DataFrames, A and B. Each has a column called 'names', and this column is of type ArrayType(StringType()).
Now I want to left join A and B on the condition that A['names'] and B['names'] have at least one common element.
Here is an example:
A:
+---------------+
| names|
+---------------+
|['Mike','Jack']|
| ['Peter']|
+---------------+
B:
+---------------+
| names|
+---------------+
|['John','Mike']|
| null|
+---------------+
After the left join, I should have:
+---------------+---------------+
| A_names| B_names|
+---------------+---------------+
|['Mike','Jack']|['John','Mike']|
| ['Peter']| null|
+---------------+---------------+
In your case you have to explode the values: explode will produce one row per value in your arrays, and then you can join the DataFrames and reduce the final result back to your desired format.
In the code example below, I exploded the names and joined the DataFrames on the newly created column (single_names). Finally, the result is grouped by A_names to remove the duplicates produced by the explode.
For the group-by aggregate function, you can use pyspark.sql.functions.first() with the parameter ignorenulls set to True.
from pyspark.sql.functions import col, explode, first

test_list = [['Mike', 'Jack']], [['Peter']]
test_df = spark.createDataFrame(test_list, ["names"])

test_list2 = [["John", "Mike"]], [["Kate"]]
test_df2 = spark.createDataFrame(test_list2, ["names"])

test_df2 = test_df2.select(
    col("names").alias("B_names"),
    explode("names").alias("single_names")
)

test_df.select(col("names").alias("A_names"), explode("names").alias("single_names"))\
    .join(test_df2, on="single_names", how="left")\
    .groupBy("A_names").agg(first("B_names", ignorenulls=True).alias("B_names")).show()
Result:
+------------+------------+
| A_names| B_names|
+------------+------------+
|[Mike, Jack]|[John, Mike]|
| [Peter]| null|
+------------+------------+
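On Spark 2.4+ you could also skip the explode/groupBy round trip and join directly on arrays_overlap, which tests whether two arrays share an element. A sketch, not from the answer above; note it can still produce duplicate A rows if several B rows overlap the same A row:
from pyspark.sql.functions import arrays_overlap, col

A = spark.createDataFrame([(["Mike", "Jack"],), (["Peter"],)], ["A_names"])
B = spark.createDataFrame([(["John", "Mike"],), (["Kate"],)], ["B_names"])

# Left join rows whose name arrays share at least one element
A.join(B, arrays_overlap(col("A_names"), col("B_names")), "left").show()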

Remove rows from dataframe based on condition in pyspark

I have one dataframe with two columns:
+--------+-----+
| col1| col2|
+--------+-----+
|22 | 12.2|
|1 | 2.1|
|5 | 52.1|
|2 | 62.9|
|77 | 33.3|
I would like to create a new dataframe which keeps only the rows where
"value of col1" > "value of col2"
Just as a note, col1 is of type long and col2 is of type double.
The result should look like this:
+--------+----+
| col1|col2|
+--------+----+
|22 |12.2|
|77 |33.3|
I think the best way would be to simply use "filter".
df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()
+--------+----+
| col1|col2|
+--------+----+
|22 |12.2|
|77 |33.3|
Another possible way could be to use the where function of the DataFrame (shown here in Scala syntax; the same string condition works with df.where in PySpark). For example:
val output = df.where("col1>col2")
will give you the expected result:
+----+----+
|col1|col2|
+----+----+
| 22|12.2|
| 77|33.3|
+----+----+
The best way to keep rows based on a condition is to use filter, as mentioned by others.
To answer the question as stated in the title, one option to remove rows based on a condition is to use a left_anti join in PySpark.
For example, to delete all rows where col1 > col2, use:
rows_to_delete = df.filter(df.col1>df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
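A self-contained sketch of that pattern with the sample data (col1 serves as the key column here purely for illustration):
df = spark.createDataFrame(
    [(22, 12.2), (1, 2.1), (5, 52.1), (2, 62.9), (77, 33.3)],
    ["col1", "col2"]
)

rows_to_delete = df.filter(df.col1 > df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=["col1"], how="left_anti")
df_with_rows_deleted.show()  # only the rows where col1 <= col2 remain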
You can use sqlContext to simplify the challenge.
First register the dataframe as a temp table, for example:
df.createOrReplaceTempView("tbl1")
then run the SQL, like:
sqlContext.sql("select * from tbl1 where col1 > col2")

Subtracting DataFrames by a single ID column - duplicate columns behave differently

I am trying to compare two DataFrames with the same schema (in Spark 1.6.0, using Scala) to determine which rows in the newer table have been added (i.e. are not present in the older table).
I need to do this by ID (i.e. examining a single column, not the whole row, to see what is new). Some rows may have changed between the versions, in that they have the same id in both versions, but the other columns have changed - I do not want these in the output, so I cannot simply subtract the two versions.
Based on various suggestions, I am doing a left-outer join on the chosen ID column, then selecting rows with nulls in columns from the right side of the join (indicating that they were not present in the older version of the table):
def diffBy(field: String, newer: DataFrame, older: DataFrame): DataFrame = {
  newer.join(older, newer(field) === older(field), "left_outer")
    .filter(older(field).isNull)
  // TODO just select the leftmost columns, removing the nulls
}
However, this does not work (row 3 exists only in the newer version, so it should be in the output):
scala> newer.show
+---+-------+
| id| value|
+---+-------+
| 3| three|
| 2|two-new|
+---+-------+
scala> older.show
+---+-------+
| id| value|
+---+-------+
| 1| one|
| 2|two-old|
+---+-------+
scala> diffBy("id", newer, older).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
The join is working as expected:
scala> val joined = newer.join(older, newer("id") === older("id"), "left_outer")
scala> joined.show
+---+-------+----+-------+
| id| value| id| value|
+---+-------+----+-------+
| 2|two-new| 2|two-old|
| 3| three|null| null|
+---+-------+----+-------+
So the problem is in the selection of the column for filtering.
joined.where(older("id").isNull).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
Perhaps it is due to the duplicate id column names in the join? But if I use the value column (which is also duplicated) instead to detect nulls, it works as expected:
joined.where(older("value").isNull).show
+---+-----+----+-----+
| id|value| id|value|
+---+-----+----+-----+
| 3|three|null| null|
+---+-----+----+-----+
What is going on here - and why is the behaviour different for id and value?
You can solve the problem using a special Spark join type called "leftanti". It is equivalent to MINUS in Oracle SQL.
val joined = newer.join(older, newer("id") === older("id"), "leftanti")
This will only select columns from newer.
I have found a solution to my problem, though not an explanation for why it occurs.
It seems to be necessary to create an alias in order to refer unambiguously to the rightmost id column, and then use a textual WHERE clause so that I can substitute in the qualified column name from the variable field:
newer.join(older.as("o"), newer(field) === older(field), "left_outer")
  .where(s"o.$field IS NULL")
