Transform data into RDD and analyze - apache-spark

I am new to Spark and have the below data in CSV format, which I want to convert into a proper format.
CSV file with no header:
Student_name=abc, student_grades=A, Student_gender=female
Student_name=Xyz, student_grades=B, Student_gender=male
Now I want to put it in an RDD with a header created:
Student_name student_grades Student_gender
abc A female
Xyz B male
I also want to get the list of students with grades A, B and C.

What you could do is infer the column names from the first line of the file, and then transform the dataframe accordingly, that is:
Remove the column name from the row values.
Rename the columns.
Here is how you could do it. First, let's read your data from a file and display it.
// the options are here to get rid of potential spaces around the ",".
val df = spark.read
.option("ignoreTrailingWhiteSpace", true)
.option("ignoreLeadingWhiteSpace", true)
.csv("path/your_file.csv")
df.show(false)
+----------------+----------------+---------------------+
|_c0 |_c1 |_c2 |
+----------------+----------------+---------------------+
|Student_name=abc|student_grades=A|Student_gender=female|
|Student_name=Xyz|student_grades=B|Student_gender=male |
+----------------+----------------+---------------------+
Then, we extract a mapping between the default names and the new names using the first row of the dataframe.
val row0 = df.head
val cols = df
.columns
.map(c => c -> row0.getAs[String](c).split("=").head )
Finally, we get rid of the column names embedded in the row values with a split on "=" and rename the columns using our mapping:
import org.apache.spark.sql.functions.{col, split}

val new_df = df
  .select(cols.map { case (old_name, new_name) =>
    split(col(old_name), "=")(1) as new_name
  }: _*)
new_df.show(false)
+------------+--------------+--------------+
|Student_name|student_grades|Student_gender|
+------------+--------------+--------------+
|abc |A |female |
|Xyz |B |male |
+------------+--------------+--------------+
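For the second part of the question (the list of students whose grade is A, B or C), a simple filter on the reshaped frame is enough. Here is a minimal sketch in PySpark, assuming the reshaped dataframe built above is available as new_df (the Scala version is the same one-liner with isin):
from pyspark.sql import functions as F

# illustrative follow-up: keep only the students whose grade is A, B or C
students_abc = new_df.filter(F.col("student_grades").isin("A", "B", "C"))
students_abc.select("Student_name", "student_grades").show()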

Related

How to compare only the column names of 2 data frames using pyspark?

I have 2 DataFrames with matched and unmatched column names. I want to compare the column names of both frames and print a table/dataframe with the unmatched column names.
Can someone please help me with this? I have no idea how I can achieve it.
Below is the expectation (DF1, DF2 and the expected output were attached as images): the output should show the actual vs. unmatched column names.
Update:
As per the expected output in the question, the requirement is to compare two dataframes with a similar schema but different column names, and build a dataframe of the mismatched column names.
Thus, my best bet would be:
from pyspark.sql import Row

df3 = spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df1.schema.names) if x not in df2.schema.names]).toDF("#", "Uncommon Columns From DF1") \
    .join(spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df2.schema.names) if x not in df1.schema.names]).toDF("#", "Uncommon Columns From DF2"), "#")
The catch here is that the schemas should be similar, since the join matches column names based on "ordinals", i.e. their respective positions in the schema.
Change the join type to "full_outer" in case there are extra columns in either dataframe.
df3 = spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df1.schema.names) if x not in df2.schema.names]).toDF("#", "Uncommon Columns From DF1") \
    .join(spark.createDataFrame([Row(idx, x) for idx, x in enumerate(df2.schema.names) if x not in df1.schema.names]).toDF("#", "Uncommon Columns From DF2"), "#", "full_outer")
You can easily do this by using set operations.
Data Preparation
from io import StringIO
import pandas as pd

s1 = StringIO("""
firstName,lastName,age,city,country
Alex,Smith,19,SF,USA
Rick,Mart,18,London,UK
""")
df1 = pd.read_csv(s1, delimiter=',')
sparkDF1 = sql.createDataFrame(df1)  # sql is the SparkSession / SQLContext in use
s2 = StringIO("""
firstName,lastName,age
Alex,Smith,21
""")
df2 = pd.read_csv(s2,delimiter=',')
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---------+--------+---+------+-------+
|firstName|lastName|age| city|country|
+---------+--------+---+------+-------+
| Alex| Smith| 19| SF| USA|
| Rick| Mart| 18|London| UK|
+---------+--------+---+------+-------+
sparkDF2.show()
+---------+--------+---+
|firstName|lastName|age|
+---------+--------+---+
| Alex| Smith| 21|
+---------+--------+---+
Columns - Intersections & Difference
common = set(sparkDF1.columns) & set(sparkDF2.columns)
diff = set(sparkDF1.columns) - set(sparkDF2.columns)
print("Common - ",common)
## Common - {'lastName', 'age', 'firstName'}
print("Difference - ",diff)
## Difference - {'city', 'country'}
Additionally, you can create tables/dataframes from the above variable values.
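For example, a minimal sketch that materialises the differences as a small dataframe (assuming the sparkDF1, sparkDF2 and sql objects from above):
# illustrative: turn the column differences into a dataframe
rows = ([(c, "DF1") for c in set(sparkDF1.columns) - set(sparkDF2.columns)] +
        [(c, "DF2") for c in set(sparkDF2.columns) - set(sparkDF1.columns)])
uncommon = sql.createDataFrame(rows, ["uncommon_column", "present_only_in"])
uncommon.show()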

Split pyspark dataframe column and limit the splits

I have the below spark dataframe.
Column_1
Physics=99;name=Xxxx;age=15
Physics=97;chemistry=85;name=yyyy;age=14
Physics=97;chemistry=85;maths=65;name=zzzz;age=14
I have to split the above dataframe column into multiple columns like below.
column_1 name age
Physics=99 Xxxx 15
Physics=97;chemistry=85 yyyy 14
Physics=97;chemistry=85;maths=65 zzzz 14
I tried split with the delimiter ; and a limit, but it splits the subjects into different columns too, and name and age get clubbed together into a single column. I need all subjects in one column and only name and age in separate columns.
Is it possible to achieve this in PySpark?
You can use a replace trick to split the columns.
from pyspark.sql import functions as f

df = spark.createDataFrame([
    'Physics=99;name=Xxxx;age=15',
    'Physics=97;chemistry=85;name=yyyy;age=14',
    'Physics=97;chemistry=85;maths=65;name=zzzz;age=14'
], 'string').toDF('c1')

df.withColumn('c1', f.regexp_replace('c1', ';name', ',name')) \
    .withColumn('c1', f.regexp_replace('c1', ';age', ',age')) \
    .withColumn('c1', f.split('c1', ',')) \
    .select(
        f.col('c1')[0].alias('stat'),
        f.col('c1')[1].alias('name'),
        f.col('c1')[2].alias('age')) \
    .show(truncate=False)
+--------------------------------+---------+------+
|stat |name |age |
+--------------------------------+---------+------+
|Physics=99 |name=Xxxx|age=15|
|Physics=97;chemistry=85 |name=yyyy|age=14|
|Physics=97;chemistry=85;maths=65|name=zzzz|age=14|
+--------------------------------+---------+------+
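If you also want to drop the name=/age= prefixes so the result matches the expected layout, a small variation of the same trick (a sketch, reusing the df and f from above) is to replace the ;name=/;age= separators with plain commas before splitting:
# illustrative variation: turn ";name=" / ";age=" into "," so only the values remain
clean = df.withColumn('c1', f.regexp_replace('c1', ';(name|age)=', ',')) \
    .withColumn('c1', f.split('c1', ',')) \
    .select(
        f.col('c1')[0].alias('column_1'),
        f.col('c1')[1].alias('name'),
        f.col('c1')[2].alias('age'))
clean.show(truncate=False)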
You can do it like this to extract the names with a regex:
import pyspark.sql.functions as F
df = spark.createDataFrame([("Physics=99;name=Xxxx;age=15",), ("Physics=97;chemistry=85;name=yyyy;age=14",),("Physics=97;chemistry=85;maths=65;name=zzzz;age=14",)], ["Column1"])
new_df = df.withColumn("name", F.regexp_extract('Column1', r'name=(\w+)', 1).alias('name'))
new_df.show()
Output:
+--------------------+----+
| Column1|name|
+--------------------+----+
|Physics=99;name=X...|Xxxx|
|Physics=97;chemis...|yyyy|
|Physics=97;chemis...|zzzz|
+--------------------+----+
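To also get the age and the remaining subjects as separate columns, the same regex approach can be extended (a sketch; the column names are illustrative):
# illustrative extension of the regex approach
new_df = df \
    .withColumn("name", F.regexp_extract("Column1", r"name=(\w+)", 1)) \
    .withColumn("age", F.regexp_extract("Column1", r"age=(\w+)", 1)) \
    .withColumn("column_1", F.regexp_replace("Column1", r";name=.*$", ""))
new_df.select("column_1", "name", "age").show(truncate=False)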

Extract values from a complex column in PySpark

I have a PySpark dataframe which has a complex column, refer to below value:
ID value
1 [{"label":"animal","value":"cat"},{"label":null,"value":"George"}]
I want to add a new column to the PySpark dataframe which basically converts it into a list of strings. If label is null, the string should contain only "value", and if label is not null, the string should be "label:value". So for the above example dataframe, the output should look like below:
ID new_column
1 ["animal:cat", "George"]
You can use transform to transform each array element into a string, which is constructed using concat_ws:
df2 = df.selectExpr(
'id',
"transform(value, x -> concat_ws(':', x['label'], x['value'])) as new_column"
)
df2.show()
+---+--------------------+
| id| new_column|
+---+--------------------+
| 1|[animal:cat, George]|
+---+--------------------+
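Note that concat_ws simply skips null arguments, which is why a null label collapses to just the value ("George") with no colon. The same thing can also be written with the DataFrame API instead of selectExpr; a sketch, assuming Spark 3.1+ where pyspark.sql.functions.transform is available:
from pyspark.sql import functions as F

# concat_ws ignores nulls, so a null label yields just the value
df2 = df.select(
    "id",
    F.transform("value", lambda x: F.concat_ws(":", x["label"], x["value"])).alias("new_column"))
df2.show(truncate=False)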

Split one column into multiple columns in Spark DataFrame using comma separator

I want to create multiple columns from one column of a DataFrame using a comma separator in Java Spark.
I have one column in a DataFrame whose value contains commas, and I want to split it into multiple columns using the comma separator. I have the following code:
Dataset<Row> dfreq1 = spark.read().format("json").option("inferSchema", "true")
.load("new.json");
dfreq1.show(5, 300);
dfreq1.createOrReplaceTempView("tempdata");
Dataset<Row> dfreq2 = dfreq1.sqlContext().sql("select split(names, '|') from tempdata");
dfreq2.show(5, 300);
Input
+-----------------------------+
|                         name|
+-----------------------------+
|ABC1,XYZ1,GDH1,KLN1,JUL1,HAI1|
|ABC2,XYZ2,GDH2,KLN2,JUL2,HAI2|
+-----------------------------+
Output
+----+----+----+----+----+----+
| Cl1| Cl2| Cl3| Cl4| Cl5| Cl6|
+----+----+----+----+----+----+
|ABC1|XYZ1|GDH1|KLN1|JUL1|HAI1|
|ABC2|XYZ2|GDH2|KLN2|JUL2|HAI2|
+----+----+----+----+----+----+
You can try this:
scala> var dfd = Seq(("ABC1,XYZ1,GDH1,KLN1,JUL1,HAI1"), ("ABC2,XYZ2,GDH2,KLN2,JUL2,HAI2")).toDF("name")
scala> dfd.withColumn("temp", split(col("name"), ",")).select((0 until 6).map(i => col("temp").getItem(i).as(s"col$i")): _*).show
+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|
+----+----+----+----+----+----+
|ABC1|XYZ1|GDH1|KLN1|JUL1|HAI1|
|ABC2|XYZ2|GDH2|KLN2|JUL2|HAI2|
+----+----+----+----+----+----+
Hope this helps.
// Another approach: split the value column on the delimiter and add one
// column per expected field ("lines" is the input Dataset<Row>; "##" is
// whatever delimiter your data uses, e.g. "," here).
List<String> schemaList = Arrays.asList("name", "gender", "sale_amount", "event", "age", "shop_time");
Column column = functions.col("value");
Column linesSplit = functions.split(column, "##");
for (int i = 0; i < schemaList.size(); i++) {
    lines = lines.withColumn(schemaList.get(i), linesSplit.getItem(i));
}
You can read the CSV content of this column into a Dataset:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;

Dataset<Row> df = spark.read()
    .option("header", false)
    .option("inferSchema", true)
    .option("delimiter", ",")
    .csv(originalDF.map((MapFunction<Row, String>) x -> x.getString(0), Encoders.STRING()));

Searching for substring across multiple columns

I am trying to find a substring across all columns of my spark dataframe using PySpark. I currently know how to search for a substring through one column using filter and contains:
df.filter(df.col_name.contains('substring'))
How do I extend this statement, or utilize another, to search through multiple columns for substring matches?
You can generalize the statement to filter all columns in one go:
from pyspark.sql.functions import col, when

# Replace values that don't contain the substring with NULL, then drop rows
# containing any NULL (i.e. keep rows where every column matches).
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()
OR
You can simply loop over the columns and apply the same filter:
for c in df.columns:
    df = df.filter(df[c].contains("substring"))
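Note that both snippets above keep a row only when every column contains the substring. If a match in any single column should be enough, one alternative (a sketch) is to OR the per-column conditions together with functools.reduce:
from functools import reduce
from pyspark.sql.functions import col

# keep rows where at least one column contains the substring
any_match = reduce(lambda a, b: a | b,
                   [col(c).contains("substring") for c in df.columns])
df_any = df.filter(any_match)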
You can search through all columns, filter into a second dataframe and union the results, like this:
columns = ["language", "else"]
data = [
("Java", "Python"),
("Python", "100000"),
("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show()
schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)
for c in df.columns:
    df2 = df2.unionByName(df.filter(df[c].like("%Python%")))
df2.show()
+--------+------+
|language| else|
+--------+------+
| Python|100000|
| Java|Python|
+--------+------+
The result contains the first 2 rows, because they have the value 'Python' in at least one of their columns.
