Rename dataframe columns in spark python - python-3.x

I have a CSV with headings that I'd like to save as Parquet (actually a Delta table).
The column headings have spaces in them, which Parquet can't handle. How do I change the spaces to underscores?
This is what I have so far, cobbled together from other SO posts:
from pyspark.sql.functions import *

df = spark.read.option("header", True).option("delimiter", "\u0001").option("inferSchema", True).csv("/mnt/landing/MyFile.TXT")
names = df.schema.names

for name in names:
    df2 = df.withColumnRenamed(name, regexp_replace(name, ' ', '_'))
When I run this, the final line gives me this error:
TypeError: Column is not iterable
I thought this would be a common requirement given that parquet can't handle spaces but it's quite difficult to find any examples.

You need to use the reduce function to apply the renaming iteratively, because in your code df2 ends up with only the last column renamed...
The code would look like the following (instead of the for loop):
from functools import reduce

df2 = reduce(lambda data, name: data.withColumnRenamed(name, name.replace(' ', '_')),
             names, df)

You are getting the exception because the function regexp_replace returns a value of type Column, while the function withColumnRenamed expects a value of type String:
def regexp_replace(e: org.apache.spark.sql.Column,pattern: String,replacement: String): org.apache.spark.sql.Column
def withColumnRenamed(existingName: String,newName: String): org.apache.spark.sql.DataFrame
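So a minimal fix for the original loop (a sketch, assuming a plain Python string replacement is all that's needed) is to build the new name as a string and keep reassigning the same dataframe:
df2 = df
for name in df.schema.names:
    df2 = df2.withColumnRenamed(name, name.replace(' ', '_'))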

Alternatively, use .toDF (or .select) and pass a list of columns to create a new dataframe.
df.show()
#+---+----+----+
#| id|id a|id b|
#+---+----+----+
#| 1| a| b|
#| 2| c| d|
#+---+----+----+
from pyspark.sql.functions import col

new_cols = list(map(lambda x: x.replace(" ", "_"), df.columns))
df.toDF(*new_cols).show()
df.select([col(s).alias(s.replace(' ', '_')) for s in df.columns]).show()
#+---+----+----+
#| id|id_a|id_b|
#+---+----+----+
#| 1| a| b|
#| 2| c| d|
#+---+----+----+
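Whichever renaming approach you use, the renamed dataframe can then be written out; a minimal sketch, assuming a Delta-enabled environment (the output path here is hypothetical):
df.toDF(*new_cols).write.format("delta").mode("overwrite").save("/mnt/landing/my_delta_table")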

Related

How best to shuffle columns in pyspark to calculate permutation feature importance

It seems that pyspark ML doesn't have a built-in permutation feature importance method. So I want to code this up, and to do so I have to individually shuffle each column in the dataframe. I found this resource as a way to do this. However, it seems like it would be very computationally heavy for a large dataframe. Is there a better way?
For example, below is how I could shuffle just column a in a simple pyspark dataframe df. I would then calculate model performance on df with a shuffled. Next I would do the same thing to shuffle b and then calculate model performance, and so on... Is there a better way to do this?
import pandas as pd
from pyspark.sql import Window
from pyspark.sql.functions import row_number, lit, rand

# Create Pandas DF
df = pd.DataFrame({
    'a': [1, 5, 4, 3, 5, 7],
    'b': ['a', 'b', 'a', 'c', 'd', 'b'],
    'c': [400, 200, 150, 300, 174, 225]
})
# Convert to PySpark
df = spark.createDataFrame(df)
# Create 'index' column to join
window = Window().orderBy(lit('A'))
df = df.withColumn('index', row_number().over(window))
# Shuffle just column 'a' in a new dataframe and add 'index'
df_a = df.select('a').withColumn('rand', rand(seed=83)).orderBy('rand')\
    .drop('rand')\
    .withColumnRenamed('a', 'a2')\
    .withColumn('index', row_number().over(window))
# Replace 'a' in df with the shuffled 'a' from df_a
df = df.join(df_a, on=['index']).drop('a').withColumnRenamed('a2', 'a')
df.show()
+-----+---+---+---+
|index| b| c| a|
+-----+---+---+---+
| 1| a|400| 5|
| 2| b|200| 1|
| 3| d|174| 5|
| 4| c|300| 3|
| 5| b|225| 4|
| 6| a|150| 7|
+-----+---+---+---+
Spark dataframes are unordered, so these types of transformations will always be expensive.
You might want to consider converting to pandas to do the shuffle part, then converting back to pyspark:
import numpy as np
pdf = df.toPandas()
pdf["a"] = np.random.permutation(pdf["a"].values)
pdf["b"] = np.random.permutation(pdf["b"].values)
df1 = spark.createDataFrame(pdf)
df1.show()
#+---+---+---+
#| a| b| c|
#+---+---+---+
#| 3| b|400|
#| 4| a|200|
#| 5| d|150|
#| 5| c|300|
#| 1| b|174|
#| 7| a|225|
#+---+---+---+
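To tie this back to permutation feature importance, a rough sketch of the outer loop could look like the following. This assumes you already have a fitted model (e.g. a PipelineModel that does its own feature assembly) and an evaluator; both model and evaluator here are hypothetical placeholders:
import numpy as np

pdf = df.toPandas()
baseline = evaluator.evaluate(model.transform(spark.createDataFrame(pdf)))

importances = {}
for feature in ['a', 'b', 'c']:
    shuffled = pdf.copy()
    shuffled[feature] = np.random.permutation(shuffled[feature].values)
    score = evaluator.evaluate(model.transform(spark.createDataFrame(shuffled)))
    importances[feature] = baseline - score  # drop in performance = importance of the feature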

Finding the max value from a column and populating another column based on the max value

I have an incremental load in csv files. I read the csv into a dataframe. The dataframe has one column containing some strings. I have to find the distinct strings from this column and assign an ID (integer) to each of the values, starting from 0, after joining one other dataframe.
In the next run, I have to assign IDs after finding out the max value in the ID column and incrementing it for the new strings. Wherever there is a null in the ID column, I have to increment (+1) from the max value of the previous run.
FIRST RUN

string  | ID
--------+---
zero    | 0
first   | 1
second  | 2
third   | 3
fourth  | 4

SECOND RUN
MAX(ID) = 4

string  | ID
--------+---
zero    | 0
first   | 1
second  | 2
third   | 3
fourth  | 4
fifth   | 5
sixth   | 6
seventh | 7
eighth  | 8
I have tried this but couldn't make it work:
max = df.agg({"ID": "max"}).collect()[0][0]
df_incremented = df.withcolumn("ID", when(col("ID").isNull(),expr("max += 1")))
Let me know if there is an easy way to achieve this.
As you keep only distinct values, you can use the row_number function over a window:
from pyspark.sql import Window
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("a",), ("a",), ("b",), ("c",), ("d",), ("e",), ("e",)],
    ("string",)
)
w = Window.orderBy("string")
df1 = df.distinct().withColumn("ID", F.row_number().over(w) - 1)
df1.show()
#+------+---+
#|string| ID|
#+------+---+
#| a| 0|
#| b| 1|
#| c| 2|
#| d| 3|
#| e| 4|
#+------+---+
Now let's add some rows to this dataframe and use row_number along with coalesce to assign an ID only for the rows where it's null (no need to get the max):
df2 = df1.union(spark.sql("select * from values ('f', null), ('h', null), ('i', null)"))
df3 = df2.withColumn("ID", F.coalesce("ID", F.row_number().over(w) - 1))
df3.show()
#+------+---+
#|string| ID|
#+------+---+
#| a| 0|
#| b| 1|
#| c| 2|
#| d| 3|
#| e| 4|
#| f| 5|
#| h| 6|
#| i| 7|
#+------+---+
If you wanted to keep duplicated values too and assign them the same ID, then use dense_rank instead of row_number.
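A minimal sketch of that variant, reusing the same dataframe df and window w from above (duplicated strings get the same ID):
df1 = df.withColumn("ID", F.dense_rank().over(w) - 1)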

Managing multiple columns with duplicate names in pyspark dataframe using spark_sanitize_names

I have a dataframe with columns with duplicate names. The contents of these columns are different, but unfortunately the names are the same. I would like to change the names of the columns by adding, say, a number series to the columns to make each column name unique, like this:
foo1 | foo2 | laa3 | boo4 ...
----------------------------------
| | |
Is there a way to do that? I found a tool for scala spark here, but none for pyspark.
https://rdrr.io/cran/sparklyr/src/R/utils.R#sym-spark_sanitize_names
We can use enumerate on df.columns, then append the index value to each column name, and finally create a dataframe with the new column names.
In PySpark:
df.show()
#+---+---+---+---+
#| i| j| k| l|
#+---+---+---+---+
#| a| 1| v| p|
#+---+---+---+---+
new_cols=[elm + str(index+1) for index,elm in enumerate(df.columns)]
#['i1', 'j2', 'k3', 'l4']
#creating new dataframe with new column names
df1=df.toDF(*new_cols)
df1.show()
#+---+---+---+---+
#| i1| j2| k3| l4|
#+---+---+---+---+
#| a| 1| v| p|
#+---+---+---+---+
In Scala:
val new_cols=df.columns.zipWithIndex.collect{case(a,i) => a+(i+1)}
val df1=df.toDF(new_cols:_*)
df1.show()
//+---+---+---+---+
//| i1| j2| k3| l4|
//+---+---+---+---+
//| a| 1| v| p|
//+---+---+---+---+
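If you would rather add a suffix only to the names that are actually duplicated (leaving unique names untouched), a possible sketch in PySpark, reusing the same toDF approach:
from collections import Counter

counts = Counter(df.columns)
seen = Counter()
new_cols = []
for c in df.columns:
    if counts[c] > 1:
        seen[c] += 1
        new_cols.append(c + str(seen[c]))  # e.g. foo, foo -> foo1, foo2
    else:
        new_cols.append(c)
df1 = df.toDF(*new_cols)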

pyspark how do we check if a column value is contained in a list [duplicate]

This question already has answers here:
Filtering a Pyspark DataFrame with SQL-like IN clause
I'm trying to figure out if there is a function that would check if a column of a spark DataFrame contains any of the values in a list:
# define a dataframe
rdd = sc.parallelize([(0,100), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [1]
# filter out records by scores by list l
records = df.filter(~df.score.contains(l))
# expected: (0,100), (0,1), (1,10), (3,18)
I get an error when running this code:
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [1]
Is there a way to do this or do we have to loop through the list to pass contains?
I see some ways to do this without using a udf.
You could use a list comprehension with pyspark.sql.functions.regexp_extract, exploiting the fact that an empty string is returned if there is no match.
Try to extract all of the values in the list l and concatenate the results. If the resulting concatenated string is an empty string, that means none of the values matched.
For example:
from pyspark.sql.functions import concat, regexp_extract
records = df.where(concat(*[regexp_extract("score", str(val), 0) for val in l]) != "")
records.show()
#+---+-----+
#| id|score|
#+---+-----+
#| 0| 100|
#| 0| 1|
#| 1| 10|
#| 3| 18|
#| 3| 18|
#| 3| 18|
#+---+-----+
If you take a look at the execution plan, you'll see that it's smart enough to cast the score column to string implicitly:
records.explain()
#== Physical Plan ==
#*Filter NOT (concat(regexp_extract(cast(score#11L as string), 1, 0)) = )
#+- Scan ExistingRDD[id#10L,score#11L]
Another way is to use pyspark.sql.Column.like (or similarly with rlike):
from functools import reduce
from pyspark.sql.functions import col
records = df.where(
reduce(
lambda a, b: a|b,
map(
lambda val: col("score").like(val.join(["%", "%"])),
map(str, l)
)
)
)
Which produces the same output as above and has the following execution plan:
#== Physical Plan ==
#*Filter Contains(cast(score#11L as string), 1)
#+- Scan ExistingRDD[id#10L,score#11L]
If you wanted only distinct records, you can do:
records.distinct().show()
#+---+-----+
#| id|score|
#+---+-----+
#| 0| 1|
#| 0| 100|
#| 3| 18|
#| 1| 10|
#+---+-----+
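A more compact variant of the rlike idea (a sketch, assuming the values in l are safe to use as regex patterns) is to join them into a single pattern:
from pyspark.sql.functions import col

records = df.where(col("score").rlike("|".join(map(str, l))))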
If I understand you correctly, you have a list of elements (in your case just one) and you want to check whether each element appears in the score. In this case it's easier to work with strings rather than with numbers directly.
You can do this with a custom filter function applied via a udf (applying the filter directly resulted in some strange behavior and only worked sometimes).
Find the Code below:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

rdd = sc.parallelize([(0,100), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
l = [1]

def filter_list(score, l):
    found = True
    for e in l:
        if str(e) not in str(score):  # the filter that checks if an element e
            found = False             # does not appear in the score
    return found                      # True only if all elements were found

def udf_filter(l):
    return udf(lambda score: filter_list(score, l), BooleanType())  # make a udf out of filter_list

# apply the udf and store the result in a "filtered" column,
# then keep only the rows that passed the filter and drop the helper column
df.withColumn("filtered", udf_filter(l)(col("score"))).filter(col("filtered") == True).drop("filtered").show()
Output:
+---+-----+
| id|score|
+---+-----+
| 0| 100|
| 0| 1|
| 1| 10|
| 3| 18|
| 3| 18|
| 3| 18|
+---+-----+

Split Spark dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it to this:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.
pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.
Here's a solution to the general case that doesn't involve needing to know the length of the array ahead of time, using collect, or using udfs. Unfortunately this only works for spark version 2.1 and above, because it requires the posexplode function.
Suppose you had the following DataFrame:
df = spark.createDataFrame(
    [
        [1, 'A, B, C, D'],
        [2, 'E, F, G'],
        [3, 'H, I'],
        [4, 'J']
    ],
    ["num", "letters"]
)
df.show()
#+---+----------+
#|num| letters|
#+---+----------+
#| 1|A, B, C, D|
#| 2| E, F, G|
#| 3| H, I|
#| 4| J|
#+---+----------+
Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.
import pyspark.sql.functions as f
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).show()
#+---+------------+---+---+
#|num| letters|pos|val|
#+---+------------+---+---+
#| 1|[A, B, C, D]| 0| A|
#| 1|[A, B, C, D]| 1| B|
#| 1|[A, B, C, D]| 2| C|
#| 1|[A, B, C, D]| 3| D|
#| 2| [E, F, G]| 0| E|
#| 2| [E, F, G]| 1| F|
#| 2| [E, F, G]| 2| G|
#| 3| [H, I]| 0| H|
#| 3| [H, I]| 1| I|
#| 4| [J]| 0| J|
#+---+------------+---+---+
Now we create two new columns from this result. The first one is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr, which allows us to use column values as parameters.
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).drop("val").select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
).show()
#+---+-------+---+
#|num| name|val|
#+---+-------+---+
#| 1|letter0| A|
#| 1|letter1| B|
#| 1|letter2| C|
#| 1|letter3| D|
#| 2|letter0| E|
#| 2|letter1| F|
#| 2|letter2| G|
#| 3|letter0| H|
#| 3|letter1| I|
#| 4|letter0| J|
#+---+-------+---+
Now we can just groupBy the num and pivot the DataFrame. Putting that all together, we get:
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).drop("val").select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
).groupBy("num").pivot("name").agg(f.first("val")).show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#| 1| A| B| C| D|
#| 3| H| I| null| null|
#| 2| E| F| G| null|
#| 4| J| null| null| null|
#+---+-------+-------+-------+-------+
Here's another approach, in case you want to split a string with a delimiter.
import pyspark.sql.functions as f
df = spark.createDataFrame([("1:a:2001",),("2:b:2002",),("3:c:2003",)],["value"])
df.show()
+--------+
| value|
+--------+
|1:a:2001|
|2:b:2002|
|3:c:2003|
+--------+
df_split = df.select(f.split(df.value, ":")).rdd.flatMap(
    lambda x: x).toDF(schema=["col1", "col2", "col3"])
df_split.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
I don't think this transition back and forth to RDDs is going to slow you down...
Also, don't worry about the last schema specification: it's optional; you can omit it, generalizing the solution to data with an unknown number of columns.
I understand your pain. Using split() can work, but can also lead to breaks.
Let's take your df and make a slight change to it:
df = spark.createDataFrame([('1:"a:3":2001',),('2:"b":2002',),('3:"c":2003',)],["value"])
df.show()
+------------+
| value|
+------------+
|1:"a:3":2001|
| 2:"b":2002|
| 3:"c":2003|
+------------+
If you try to apply split() to this as outlined above:
df_split = df.select(f.split(df.value, ":")).rdd.flatMap(
    lambda x: x).toDF(schema=["col1", "col2", "col3"]).show()
you will get
IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 3 values are provided.
So, is there a more elegant way of addressing this? I was so happy to have it pointed out to me. pyspark.sql.functions.from_csv() is your friend.
Taking my above example df:
from pyspark.sql.functions import from_csv
# Define a column schema to apply with from_csv()
col_schema = ["col1 INTEGER","col2 STRING","col3 INTEGER"]
schema_str = ",".join(col_schema)
# define the separator because it isn't a ','
options = {'sep': ":"}
# create a df from the value column using schema and options
df_csv = df.select(from_csv(df.value, schema_str, options).alias("value_parsed"))
df_csv.show()
+--------------+
| value_parsed|
+--------------+
|[1, a:3, 2001]|
| [2, b, 2002]|
| [3, c, 2003]|
+--------------+
Then we can easily flatten the df to put the values in columns:
df2 = df_csv.select("value_parsed.*").toDF("col1","col2","col3")
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a:3|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
No breaks. Data correctly parsed. Life is good. Have a beer.
Instead of Column.getItem(i) we can use Column[i].
Also, enumerate is useful when there are many new columns to create.
from pyspark.sql import functions as F
Keep parent column:
for i, c in enumerate(['new_1', 'new_2']):
    df = df.withColumn(c, F.split('my_str_col', '-')[i])
or
new_cols = ['new_1', 'new_2']
df = df.select('*', *[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)])
Replace parent column:
for i, c in enumerate(['new_1', 'new_2']):
    df = df.withColumn(c, F.split('my_str_col', '-')[i])
df = df.drop('my_str_col')
or
new_cols = ['new_1', 'new_2']
df = df.select(
    *[c for c in df.columns if c != 'my_str_col'],
    *[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)]
)
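If the number of parts isn't known in advance, one possible sketch (it does one driver-side pass to find the maximum number of parts; the part_ column names are just placeholders) combines F.size with the same indexing:
from pyspark.sql import functions as F

n = df.select(F.max(F.size(F.split('my_str_col', '-')))).first()[0]
df = df.select('*', *[F.split('my_str_col', '-')[i].alias('part_' + str(i)) for i in range(n)])
# shorter strings simply get null in the extra part_ columns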
