Dynamic columns in where clause while reading parquet files in pyspark - apache-spark

I have parquet files and want to read them based on dynamic columns. To take an example, I have 2 data frames and want to select data from df1 based on df2.
I am using the code below, but I want to make the joining columns dynamic: today I have 2 columns, tomorrow I could have 4.
a = dict[keys]
col1 = a[0]
col2 = a[1]
v = df1.join(df2, [df1[col1] == df2[col1],
                   df1[col2] == df2[col2]],
             how='inner')
So how can I make these columns dynamic, so that the join condition does not need to be hard coded and columns can be added to or removed from the join condition?

I would just build a join condition object first based on your dict and then use it in the join:
from functools import reduce

join_condition = reduce(
    lambda a, b: a & b,
    [df1[col] == df2[col] for col in dict[keys]]
)
v = df1.join(
    df2,
    join_condition,
)
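If the join columns have the same names in both dataframes, as they do here, you can also pass the list of names straight to join, which additionally deduplicates the join columns in the result. A minimal sketch, assuming dict[keys] is a list of column names:
# Sketch: join on a dynamic list of identically named columns.
# `dict[keys]` is assumed to be a list such as ['col_a', 'col_b'].
v = df1.join(df2, on=list(dict[keys]), how='inner')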

Related

How do I filter a datetime column according to time ranges specified in two columns?

In one Excel file I have filtered EDA data, and I want to filter that data according to my second Excel file using two columns, StartTime and EndTime, as a time range
(the time columns in both Excel files are datetime64[ns]).
You can see my two Excel files in the picture.
My code is
df1 = pd.read_excel(filename_1)
df2 = pd.read_excel(filename_2, usecols= "A,C")
df3 = df1[df1['BinaryLabels'] == 1]
df2 = df2[(df3["StartTime"] <= df2.Time) & (df2.Time <= df3["EndTime"])]
print(df2)
and I get the error: ValueError: Can only compare identically-labeled Series objects
How can I solve it?
Thanks in advance.
For general DataFrames with different numbers of rows, you have to merge with a cross join before filtering:
df1 = pd.read_excel(filename_1)
df2 = pd.read_excel(filename_2, usecols= "A,C")
df3 = df1[df1['BinaryLabels'] == 1]
df = df2.assign(a=1).merge(df3.assign(a=1), on='a', how='outer')
df = df[(df["StartTime"] <= df.Time) & (df.Time <= df["EndTime"])]
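As a side note (not part of the original answer): with pandas 1.2 or newer the helper column is unnecessary, because merge supports a cross join directly. A small sketch under that assumption:
# Requires pandas >= 1.2, which added how='cross'.
df = df2.merge(df3, how='cross')
df = df[(df["StartTime"] <= df.Time) & (df.Time <= df["EndTime"])]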

Searching for substring across multiple columns

I am trying to find a substring across all columns of my spark dataframe using PySpark. I currently know how to search for a substring through one column using filter and contains:
df.filter(df.col_name.contains('substring'))
How do I extend this statement, or utilize another, to search through multiple columns for substring matches?
You can generalize the statement to apply the filter across all columns in one go:
from pyspark.sql.functions import col, when

# Null out values that do not contain the substring, then drop rows with any null
# (keeps only rows where every column contains the substring).
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()
OR
You can simply loop over the columns and apply the same filter:
for c in df.columns:
    df = df.filter(df[c].contains("substring"))
You can search each column separately, collect the matches into a new dataframe, and union the results, like this:
columns = ["language", "else"]
data = [
    ("Java", "Python"),
    ("Python", "100000"),
    ("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show()

schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)
for c in df.columns:
    df2 = df2.unionByName(df.filter(df[c].like("%Python%")))
df2.show()
+--------+------+
|language| else|
+--------+------+
| Python|100000|
| Java|Python|
+--------+------+
The result contains the first two rows, because they have the value 'Python' in at least one of their columns.
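If you prefer a single filter to a union (the union can also return the same row twice when it matches in more than one column), one alternative sketch is to build an OR condition over all columns with reduce:
from functools import reduce
from pyspark.sql.functions import col

# One OR predicate across all columns: keep a row if any column contains 'Python'.
match_any = reduce(lambda a, b: a | b,
                   [col(c).contains("Python") for c in df.columns])
df.filter(match_any).show()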

pyspark RDD - Left outer join on specific key

I have two tables, A and B, with hundreds of columns. I am trying to apply a left outer join on the two tables, but they have different keys.
I created a new column in B with the same key name as in A, and was then able to apply the left outer join. However, how do I join both tables if I am unable to make the column names consistent?
This is what I have tried:
a = spark.table('a').rdd
b = spark.table('b')
b = b.withColumn("acct_id", col("id"))
b = b.rdd
a.leftOuterJoin(b).collect()
If you have dataframes, why are you creating RDDs from them? Is there any specific need?
Try the below command on the dataframes -
a.join(b, a.column_name==b.column_name, 'left').show()
Here are a few commands you can use to investigate your dataframe:
##Get column names of dataframe
a.columns
##Get column names with their datatype of dataframe
a.dtypes
##What is the type of object (eg. dataframe, rdd etc.)
type(a)
DataFrames are faster than RDDs, and you already have dataframes, so I suggest:
a = spark.table('a')
b = spark.table('b').withColumn("acct_id", col("id"))
result = a.join(b, a["id"] == b["acct_id"], how="left")
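As a small follow-up (my addition, not from the original answer): after this join both id and acct_id appear in the result, so you may want to drop the duplicated key column:
# Drop the helper key column taken from b if it is not needed downstream.
result = result.drop(b["acct_id"])
result.show()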

How to drop single-valued columns efficiently from a dataframe

How can I efficiently drop all columns which have a single value from a dataframe?
I found two ways:
This method ignores nulls and only considers other values; I need to consider nulls in my case:
# apply countDistinct on each column
col_counts = partsDF.agg(*(countDistinct(col(c)).alias(c) for c in partsDF.columns)).collect()[0].asDict()
This method takes too long:
col_counts = {c: partsDF.select(c).distinct().count() for c in partsDF.columns}
# select the cols with count == 1 into a list
cols_to_drop = [c for c in partsDF.columns if col_counts[c] == 1]
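A possible middle ground (a sketch of my own, not from the question) is to stay with a single aggregation pass but make countDistinct null-aware by adding 1 whenever the column contains at least one null:
from pyspark.sql import functions as F

# Distinct non-null values per column, plus 1 if the column contains any null,
# so nulls are counted as a value as required.
agg_exprs = [
    (F.countDistinct(F.col(c))
     + F.max(F.when(F.col(c).isNull(), 1).otherwise(0))).alias(c)
    for c in partsDF.columns
]
col_counts = partsDF.agg(*agg_exprs).collect()[0].asDict()
cols_to_drop = [c for c in partsDF.columns if col_counts[c] == 1]
partsDF = partsDF.drop(*cols_to_drop)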

How to merge edits from one dataframe into another dataframe in Spark?

I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). df2 has only the columns with updates populated; the rest of the columns are null. What I want to do is update the rows in df1 with the corresponding rows from df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in column in df1 should get replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30
You can also use case or coalesce with a full outer join to merge the two dataframes. See the link below for an explanation:
Spark incremental loading overwrite old record
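For illustration only, here is a minimal PySpark sketch of that idea (my own, not the code behind the link), assuming the key column is id and the "~" convention from the question:
from pyspark.sql import functions as F

# Rename df2's columns so they do not clash with df1's after the join.
edits = df2.select([F.col(c).alias(c + "_new") for c in df2.columns])
joined = df1.join(edits, F.col("id") == F.col("id_new"), "full_outer")

merged = joined.select(
    F.coalesce(F.col("id"), F.col("id_new")).alias("id"),
    *[F.when(F.col(c + "_new") == "~", F.lit(None))         # "~" nullifies the column
       .otherwise(F.coalesce(F.col(c + "_new"), F.col(c)))  # else prefer the edit, fall back to df1
       .alias(c)
      for c in df1.columns if c != "id"]
)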
I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
val idsToEdits = df2.rdd.map { row =>
  (row(0),
   row.getValuesMap[AnyVal](row.schema.fieldNames.filterNot(colName => row.isNullAt(row.fieldIndex(colName))))
      .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr = sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) { case (rowSeq, (idx, newValue)) =>
        rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}
Finally, use that function on the RDD derived from df1 and convert back to a dataframe.
val updatedDF = spark.createDataFrame(df1.rdd.map(editRow), df1.schema)
It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf function or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

val cols = df1.schema.filterNot(x => x.name == "id").map({ x =>
  if (x.dataType == StringType) {
    doLogicUdf(col(x.name), col(x.name + "2")).alias(x.name)
  } else {
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name)).alias(x.name)
  }
}) :+ col("id")
val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)
df1.join(df2Renamed, col("id") === col("id2"), "left").select(cols: _*)
