Searching for substring across multiple columns - apache-spark

I am trying to find a substring across all columns of my spark dataframe using PySpark. I currently know how to search for a substring through one column using filter and contains:
df.filter(df.col_name.contains('substring'))
How do I extend this statement, or utilize another, to search through multiple columns for substring matches?

You can generalize the statement the filter in one go:
from pyspark.sql.functions import col, count, when
# Converts all unmatched filters to NULL and drops them.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()
OR
You can simply loop over the columns and apply the same filter:
for col in df.columns:
df = df.filter(df[col].contains("substring"))

You can search through all columns and fill next dataframe and union results, like this:
columns = ["language", "else"]
data = [
("Java", "Python"),
("Python", "100000"),
("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show()
schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)
for col in df.columns:
df2 = df2.unionByName(df.filter(df[col].like("%Python%")))
df2.show()
+--------+------+
|language| else|
+--------+------+
| Python|100000|
| Java|Python|
+--------+------+
Result will contain first 2 rows, because they have value 'Python' in some of the columns.

Related

How to compare only the column names of 2 data frames using pyspark?

I have 2 Data Frames with matched and unmatched column names, I want to compare the column names of the both the frames and print a table/dataframe with unmatched column names.
Please someone help me on this
I have no idea how can i achieve this
Below is the expectation
DF1:
DF2:
Output:
Output should actual vs unmatched column name
Update:
As per Expected output in Questions. The requirement is to compare both dataframes with similar schema but different names and make a dataframe of mismatched column names.
Thus, my best bet would be:
df3 = spark.createDataFrame([Row(idx,x) for idx,x in enumerate(df1.schema.names) if x not in df2.schema.names]).toDF("#","Uncommon Columns From DF1")\
.join(spark.createDataFrame([Row(idx,x) for idx, x in enumerate(df2.schema.names) if x not in df1.schema.names]).toDF("#","Uncommon Columns From DF2"),"#")
The catch here is, the schema should be similar as it compares column names based on "ordinals" i.e. their respective positions in the schema.
Input/Output
Change the join type to "full_outer" in case there are extra columns in either dataframe.
df3 = spark.createDataFrame([Row(idx,x) for idx,x in enumerate(df1.schema.names) if x not in df2.schema.names]).toDF("#","Uncommon Columns From DF1").join(spark.createDataFrame([Row(idx,x) for idx, x in enumerate(df2.schema.names) if x not in df1.schema.names]).toDF("#","Uncommon Columns From DF2"),"#", "full_outer")
Input/Output
You can easily do this by using sets operations
Data Preparation
s1 = StringIO("""
firstName,lastName,age,city,country
Alex,Smith,19,SF,USA
Rick,Mart,18,London,UK
""")
df1 = pd.read_csv(s1,delimiter=',')
sparkDF1 = sql.createDataFrame(df1)
s2 = StringIO("""
firstName,lastName,age
Alex,Smith,21
""")
df2 = pd.read_csv(s2,delimiter=',')
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---------+--------+---+------+-------+
|firstName|lastName|age| city|country|
+---------+--------+---+------+-------+
| Alex| Smith| 19| SF| USA|
| Rick| Mart| 18|London| UK|
+---------+--------+---+------+-------+
sparkDF2.show()
+---------+--------+---+
|firstName|lastName|age|
+---------+--------+---+
| Alex| Smith| 21|
+---------+--------+---+
Columns - Intersections & Difference
common = set(sparkDF1.columns) & set(sparkDF2.columns)
diff = set(sparkDF1.columns) - set(sparkDF2.columns)
print("Common - ",common)
## Common - {'lastName', 'age', 'firstName'}
print("Difference - ",diff)
## Difference - {'city', 'country'}
Additionally you can create tables/dataframes using the above variable values

How do I subset a pandas dataframe based on a list of column names

I have a client data df with 200+ columns, say A,B,C,D...X,Y,Z. There's a column in this df which has CAMPAIGN_ID in it. I have another data mapping_csv that has CAMPAIGN_ID and set of columns I need from df. I need to split df into one csv file for each campaign, that will have rows from that campaign and only those columns that are as per mapping_csv.
I am getting type error as below.
TypeError: unhashable type: 'list'
This is what I tried.
for campaign in df['CAMPAIGN_ID'].unique():
df2 = df[df['CAMPAIGN_ID']==campaign]
# remove blank columns
df2.dropna(how='all', axis=1, inplace=True)
for column in df2.columns:
if df2[column].unique()[0]=="0000-00-00" and df2[column].unique().shape[0]==1:
df2 = df2.drop(column, axis=1)
for column in df2.columns:
if df2[column].unique()[0]=='0' and df2[column].unique().shape[0]==1:
df2 = df2.drop(column, axis=1)
# select required columns
df2 = df2[mapping_csv.loc[mapping_csv['CAMPAIGN_ID']==campaign, 'Variable_List'].str.replace(" ","").str.split(",")]
file_shape = df2.shape[0]
filename = "cart_"+str(dt.date.today().strftime('%Y%m%d'))+"_"+campaign+"_rowcnt_"+str(file_shape)
df2.to_csv(filename+".csv",index=False)
Any help will be appreciated.
This is how data looks like -
This is how mapping looks like -
This addresses your core problem.
df = pd.DataFrame(dict(id=['foo','foo','bar','bar',],a=[1,2,3,4,], b=[5,6,7,8], c=[1,2,3,4]))
mapper = dict(foo=['a','b'], bar=['b','c'])
for each_id in df.id.unique():
df_id = df.query(f'id.str.contains("{each_id}")').loc[:,mapper[each_id]]
print(df_id)

PySpark- How to filter row from this dataframe

I am trying to read the first row from a file and then filter that from the dataframe.
I am using take(1) to read the first row. I then want to filter this from the dataframe (it could appear multiple times within the dataset).
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext(appName = "solution01")
spark = SparkSession(sc)
df1 = spark.read.csv("/Users/abc/test.csv")
header = df1.take(1)
print(header)
final_df = df1.filter(lambda x: x != header)
final_df.show()
However I get the following error TypeError: condition should be string or Column.
I was trying to follow the answer from Nicky here How to skip more then one lines of header in RDD in Spark
The data looks like (but will have multiple columns that i need to do the same for):
customer_id
1
2
3
customer_id
4
customer_id
5
I want the result as:
1
2
3
4
5
take on dataframe results list(Row) we need to get the value use [0][0] and
In filter clause use column_name and filter the rows which are not equal to header
header = df1.take(1)[0][0]
#filter out rows that are not equal to header
final_df = df1.filter(col("<col_name>") != header)
final_df.show()

Pandas data frame merge select columns

I want to get data from only df2 (all columns) by comparing 'no' filed in both df1 and df2.
My 3 line code is below, for this i'm getting all columns from df1 and df2 not able to trim fields from df1. How to achieve ?
I've 2 pandas dataframes like below :
df1:
no,name,salary
1,abc,100
2,def,105
3,abc,110
4,def,115
5,abc,120
df2:
no,name,salary,dept,addr
1,abc,100,IT1,ADDR1
2,abc,101,IT2,ADDR2
3,abc,102,IT3,ADDR3
4,abc,103,IT4,ADDR4
5,abc,104,IT5,ADDR5
6,abc,105,IT6,ADDR6
7,abc,106,IT7,ADDR7
8,abc,107,IT8,ADDR8
df1 = pd.read_csv("D:\\data\\data1.csv")
df2 = pd.read_csv("D:\\data\\data2.csv")
resDF = pd.merge(df1, df2, on='no' , how='inner')
I think you need filter only no column, then on and how parameters are not necessary:
resDF = pd.merge(df1[['no']], df2)
Or use boolean indexing with filtering by isin:
resDF = df2[df2['no'].isin(df1['no'])]

Apache Spark -- Parsing the data and convert columns into rows

I need to convert the columns into rows .Please help me in the below requirement in spark scala code.input file is | delimiter and one of the column having comma delimiter value.based on the comma delimiter i need to convert them into rows
my input records:
c11|c12|a,b|c14
c21|c22|a,c,d|c24
expected output :
a,c11,c12,c14
b,c11,c12,c14
a,c21,c22,c24
c,c21,c22,c24
d,c21,c22,c24
Thanks,
Siva
First read the dataframe as csv with | as a separator:
This provides a dataframe with the base columns you need except the third which would be a string. Lets say you renamed this column to be called _c2 (the default name for the third column). Now you can split the string to get an array
We also remove the previous column as we don't need it anymore.
Lastly we use explode to turn the array to rows and drop the unused column
from pyspark.sql.functions import split
from pyspark.sql.functions import explode
df1 = spark.read.csv("pathToFile", sep="|")
df2 = df1.withColumn("splitted", split(df1["_c2"],",")).drop("_c2")
df3 = df2.withColumn("exploded", explode(df2["splitted"])).drop("splitted")
or in scala (free form)
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.explode
val df1 = spark.read.csv("pathToFile", sep="|")
val df2 = df1.withColumn("splitted", split(df1("_c2"),",")).drop("_c2")
val df3 = df2.withColumn("exploded", explode(df2("splitted"))).drop("splitted")

Resources