Split pyspark dataframe column and limit the splits - apache-spark

I have the below spark dataframe.
Column_1
Physics=99;name=Xxxx;age=15
Physics=97;chemistry=85;name=yyyy;age=14
Physics=97;chemistry=85;maths=65;name=zzzz;age=14
I have to split the above dataframe column into multiple columns like below.
column_1 name age
Physics=99 Xxxx 15
Physics=97;chemistry=85 yyyy 14
Physics=97;chemistry=85;maths=65 zzzz 14
I tried split with the delimiter ; and a limit, but it splits the subjects into different columns as well, and name and age end up clubbed together in a single column. I need all subjects in one column and only name and age in separate columns.
Is it possible to achieve this in PySpark?

You can use a replace trick to split the columns:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    ('Physics=99;name=Xxxx;age=15',),
    ('Physics=97;chemistry=85;name=yyyy;age=14',),
    ('Physics=97;chemistry=85;maths=65;name=zzzz;age=14',)
], ['c1'])

df.withColumn('c1', f.regexp_replace('c1', ';name', ',name')) \
  .withColumn('c1', f.regexp_replace('c1', ';age', ',age')) \
  .withColumn('c1', f.split('c1', ',')) \
  .select(
      f.col('c1')[0].alias('stat'),
      f.col('c1')[1].alias('name'),
      f.col('c1')[2].alias('age')) \
  .show(truncate=False)
+--------------------------------+---------+------+
|stat |name |age |
+--------------------------------+---------+------+
|Physics=99 |name=Xxxx|age=15|
|Physics=97;chemistry=85 |name=yyyy|age=14|
|Physics=97;chemistry=85;maths=65|name=zzzz|age=14|
+--------------------------------+---------+------+
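The name and age columns above still carry the name= and age= keys, while the question's expected output has just the bare values. A minimal follow-up sketch (assuming the same df and import as above; not part of the original answer) strips those keys with another regexp_replace:
df.withColumn('c1', f.regexp_replace('c1', ';name', ',name')) \
  .withColumn('c1', f.regexp_replace('c1', ';age', ',age')) \
  .withColumn('c1', f.split('c1', ',')) \
  .select(
      f.col('c1')[0].alias('column_1'),
      # drop the "name=" / "age=" key so only the value remains
      f.regexp_replace(f.col('c1')[1], '^name=', '').alias('name'),
      f.regexp_replace(f.col('c1')[2], '^age=', '').alias('age')) \
  .show(truncate=False)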

You can extract the names with a regex, like this:
import pyspark.sql.functions as F
df = spark.createDataFrame([("Physics=99;name=Xxxx;age=15",), ("Physics=97;chemistry=85;name=yyyy;age=14",),("Physics=97;chemistry=85;maths=65;name=zzzz;age=14",)], ["Column1"])
new_df = df.withColumn("name", F.regexp_extract('Column1', r'name=(\w+)', 1))
new_df.show()
Output:
+--------------------+----+
| Column1|name|
+--------------------+----+
|Physics=99;name=X...|Xxxx|
|Physics=97;chemis...|yyyy|
|Physics=97;chemis...|zzzz|
+--------------------+----+
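This only extracts the name. A hedged sketch of extending the same regexp_extract idea to also pull out the age and the remaining subjects column (the extra regexes are assumptions based on the sample data, not part of the original answer):
new_df = df.select(
    # everything before ";name=" is the subjects part
    F.regexp_extract('Column1', r'^(.*?);name=', 1).alias('column_1'),
    F.regexp_extract('Column1', r'name=(\w+)', 1).alias('name'),
    F.regexp_extract('Column1', r'age=(\d+)', 1).alias('age'))
new_df.show(truncate=False)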

Related

Concatenate two columns of spark dataframe with null values

I have two columns in my spark dataframe
First_name Last_name
Shiva Kumar
Karthik kumar
Shiva Null
Null Shiva
My requirement is to add a new column to the dataframe by concatenating the above 2 columns with a comma, handling null values as well.
I have tried using concat and coalesce, but I can't get the comma delimiter to appear only when both columns are available.
Expected output
Full_name
Shiva,kumar
Karthik,kumar
Shiva
Shiva
concat_ws concatenates the columns and handles null values for you:
df.withColumn('Full_Name', F.concat_ws(',', F.col('First_name'), F.col('Last_name')))
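A minimal runnable sketch of this answer (the sample data and import are added here for illustration; concat_ws simply skips null inputs, so no stray commas are left behind):
import pyspark.sql.functions as F

data = [('Shiva', 'Kumar'), ('Karthik', 'kumar'), ('Shiva', None), (None, 'Shiva')]
df = spark.createDataFrame(data, ['First_name', 'Last_name'])

df.withColumn('Full_Name', F.concat_ws(',', 'First_name', 'Last_name')) \
  .select('Full_Name').show()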
You can use lit:
import pyspark.sql.functions as F
f = df.withColumn('Full_Name', F.concat(F.col('First_name'), F.lit(','), F.col('Last_name'))).select('Full_Name')
# strip the literal 'Null' placeholders (this assumes 'Null' is stored as a string, not a true null)
f = f.withColumn('Full_Name', F.regexp_replace(F.col('Full_Name'), '(,Null)|(Null,)', ''))
f.show()
+-------------+
| Full_Name|
+-------------+
| Shiva,Kumar|
|Karthik,kumar|
| Shiva|
| Shiva|
+-------------+

Split one column into multiple columns in Spark DataFrame using comma separator

I want to create multiple columns from one column of a DataFrame using a comma separator in Java Spark.
I have one value with commas in one column of the DataFrame and want to split it into multiple columns using the comma separator. I have the following code:
Dataset<Row> dfreq1 = spark.read().format("json").option("inferSchema", "true")
.load("new.json");
dfreq1.show(5, 300);
dfreq1.createOrReplaceTempView("tempdata");
Dataset<Row> dfreq2 = dfreq1.sqlContext().sql("select split(names, '|') from tempdata");
dfreq2.show(5, 300);
Input
+-----------------------------+
|                         name|
+-----------------------------+
|ABC1,XYZ1,GDH1,KLN1,JUL1,HAI1|
|ABC2,XYZ2,GDH2,KLN2,JUL2,HAI2|
+-----------------------------+
Output
+----+----+----+----+----+----+
| Cl1| Cl2| Cl3| Cl4| Cl5| Cl6|
+----+----+----+----+----+----+
|ABC1|XYZ1|GDH1|KLN1|JUL1|HAI1|
|ABC2|XYZ2|GDH2|KLN2|JUL2|HAI2|
+----+----+----+----+----+----+
You can try this:
scala> var dfd =Seq(("ABC1,XYZ1,GDH1,KLN1,JUL1,HAI1"),("ABC2,XYZ2,GDH2,KLN2,JUL2,HAI2")).toDF("name")
scala> dfd.withColumn("temp", split(col("name"), ",")).select((0 until 6).map(i => col("temp").getItem(i).as(s"col$i")): _* ).show
+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|
+----+----+----+----+----+----+
|ABC1|XYZ1|GDH1|KLN1|JUL1|HAI1|
|ABC2|XYZ2|GDH2|KLN2|JUL2|HAI2|
+----+----+----+----+----+----+
Hope this helps you.
In Java, you can split the column and add one new column per item, for example:
// "lines" is the input Dataset<Row> containing a single "value" column to split
List<String> schemaList = Arrays.asList("name", "gender", "sale_amount", "event", "age", "shop_time");
Column column = functions.col("value");
Column linesSplit = functions.split(column, "##");
for (int i = 0; i < schemaList.size(); i++) {
    lines = lines.withColumn(schemaList.get(i), linesSplit.getItem(i));
}
You can read the CSV content of this column into a Dataset:
Dataset<Row> df = spark.read()
    .option("header", false)
    .option("inferSchema", true)
    .option("delimiter", ",")
    .csv(originalDF.map((MapFunction<Row, String>) x -> x.getString(0), Encoders.STRING()));
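If you are doing the same thing from PySpark instead, a hedged equivalent of the split-and-select approach shown above (the sample data is taken from the question):
from pyspark.sql import functions as F

df = spark.createDataFrame([('ABC1,XYZ1,GDH1,KLN1,JUL1,HAI1',),
                            ('ABC2,XYZ2,GDH2,KLN2,JUL2,HAI2',)], ['name'])

parts = F.split(F.col('name'), ',')
# one output column per split item
df.select([parts.getItem(i).alias(f'col{i}') for i in range(6)]).show()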

Searching for substring across multiple columns

I am trying to find a substring across all columns of my spark dataframe using PySpark. I currently know how to search for a substring through one column using filter and contains:
df.filter(df.col_name.contains('substring'))
How do I extend this statement, or utilize another, to search through multiple columns for substring matches?
You can generalize the statement to filter across all columns in one go:
from pyspark.sql.functions import col, when

# Keep a column's value only when it contains the substring; otherwise it becomes NULL.
# na.drop() then removes every row in which at least one column did not match.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()
OR
You can simply loop over the columns and apply the same filter:
for c in df.columns:
    df = df.filter(df[c].contains("substring"))
You can search through each column, filter on it, and union the results into a new dataframe, like this:
columns = ["language", "else"]
data = [
    ("Java", "Python"),
    ("Python", "100000"),
    ("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show()

schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)
for col in df.columns:
    df2 = df2.unionByName(df.filter(df[col].like("%Python%")))
df2.show()
+--------+------+
|language| else|
+--------+------+
| Python|100000|
| Java|Python|
+--------+------+
The result contains the first two rows because they have the value 'Python' in at least one of their columns.
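Note that the union approach can emit the same row more than once if it matches in several columns. A minimal single-pass alternative sketch (not from the original answers) that ORs one contains condition per column:
from functools import reduce
from pyspark.sql import functions as F

# build one predicate: col1 contains "Python" OR col2 contains "Python" OR ...
condition = reduce(lambda a, b: a | b,
                   [F.col(c).contains("Python") for c in df.columns])
df.filter(condition).show()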

Grouping by name and then adding up the number of another column [duplicate]

I am using pyspark to read a parquet file like below:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Then when I do my_df.take(5), it shows a list of Row objects ([Row(...)]) instead of a table format like a pandas data frame.
Is it possible to display the data frame in a table format like a pandas data frame? Thanks!
The show method does what you're looking for.
For example, given the following dataframe of 3 rows, I can print just the first two rows like this:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)
which yields:
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
+---+---+
only showing top 2 rows
As mentioned by @Brent in the comments on @maxymoo's answer, you can try
df.limit(10).toPandas()
to get a prettier table in Jupyter. But this can take some time to run if you are not caching the Spark dataframe. Also, .limit() will not keep the order of the original Spark dataframe.
Let's say we have the following Spark DataFrame:
df = sqlContext.createDataFrame(
[
(1, "Mark", "Brown"),
(2, "Tom", "Anderson"),
(3, "Joshua", "Peterson")
],
('id', 'firstName', 'lastName')
)
There are typically three different ways to print the content of the dataframe:
Print Spark DataFrame
The most common way is to use the show() function:
>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Print Spark DataFrame vertically
Say that you have a fairly large number of columns and your dataframe doesn't fit in the screen. You can print the rows vertically - For example, the following command will print the top two rows, vertically, without any truncation.
>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
id | 1
firstName | Mark
lastName | Brown
-RECORD 1-------------
id | 2
firstName | Tom
lastName | Anderson
only showing top 2 rows
Convert to Pandas and print Pandas DataFrame
Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.
>>> df_pd = df.toPandas()
>>> print(df_pd)
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory. If this is the case, the following configuration will help when converting a large spark dataframe to a pandas one:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
For more details you can refer to my blog post Speeding up the conversion between PySpark and Pandas DataFrames
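For context, a small hedged sketch of how that Arrow setting is typically used together with toPandas() (the key shown is the Spark 3.x name; on Spark 2.x it was spark.sql.execution.arrow.enabled):
# Enable Arrow-based columnar transfer before converting (Spark 3.x key).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Arrow speeds up the transfer, but the result must still fit in driver memory,
# so limiting the rows first is usually a good idea.
df_pd = df.limit(1000).toPandas()
print(df_pd.head())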
Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe!
By default, the show() function prints 20 records of the DataFrame. You can define the number of rows you want to print by providing an argument to show(). You never know in advance how many rows the DataFrame will have, so you can pass df.count() as the argument to show, which will print all records of the DataFrame.
df.show() --> prints 20 records by default
df.show(30) --> prints 30 records according to argument
df.show(df.count()) --> get total row count and pass it as argument to show
If you are using Jupyter, this is what worked for me:
[1]
df= spark.read.parquet("s3://df/*")
[2]
dsp = df
[3]
%%display
dsp
This shows a well-formatted HTML table; you can also draw some simple charts on it straight away. For more documentation of %%display, type %%help.
Maybe something like this is a tad more elegant:
df.display()
# OR
df.select('column1').display()

Transform data into rdd and analyze

I am new to Spark and have the below data in CSV format, which I want to convert into a proper format.
CSV file with no header:
Student_name=abc, student_grades=A, Student_gender=female
Student_name=Xyz, student_grades=B, Student_gender=male
Now I want to put it into an RDD with a header created:
Student_Name student_grades student_gender
abc A female
Xyz B male
Also, I want to get the list of students with grades A, B, and C.
What you could do is infer the schema from the first line of the file, and then transform the dataframe accordingly, that is:
Remove the column name from the row values.
Rename the columns.
Here is how you could do it. First, let's read your data from a file and display it.
// the options are here to get rid of potential spaces around the ",".
val df = spark.read
.option("ignoreTrailingWhiteSpace", true)
.option("ignoreLeadingWhiteSpace", true)
.csv("path/your_file.csv")
df.show(false)
+----------------+----------------+---------------------+
|_c0 |_c1 |_c2 |
+----------------+----------------+---------------------+
|Student_name=abc|student_grades=A|Student_gender=female|
|Student_name=Xyz|student_grades=B|Student_gender=male |
+----------------+----------------+---------------------+
Then, we extract a mapping between the default names and the new names using the first row of the dataframe.
val row0 = df.head
val cols = df
.columns
.map(c => c -> row0.getAs[String](c).split("=").head )
Finally, we get rid of the column names in the values with a split on "=" and rename the columns using our mapping:
val new_df = df
.select(cols.map{ case (old_name, new_name) =>
split(col(old_name), "=")(1) as new_name
} : _*)
new_df.show(false)
+------------+--------------+--------------+
|Student_name|student_grades|Student_gender|
+------------+--------------+--------------+
|abc |A |female |
|Xyz |B |male |
+------------+--------------+--------------+
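Since the question and most of this page use PySpark, here is a hedged PySpark sketch of the same idea (the file path is a placeholder, and the grade filter at the end addresses the question's second requirement; neither is part of the original Scala answer):
from pyspark.sql import functions as F

df = (spark.read
      .option("ignoreTrailingWhiteSpace", True)
      .option("ignoreLeadingWhiteSpace", True)
      .csv("path/your_file.csv"))

# use the first row to recover the real column names (the part before "=")
row0 = df.head()
new_names = [row0[c].split("=")[0] for c in df.columns]

# keep only the part after "=" and rename each column
new_df = df.select([F.split(F.col(c), "=").getItem(1).alias(n)
                    for c, n in zip(df.columns, new_names)])
new_df.show(truncate=False)

# students with grades A, B or C
new_df.filter(F.col("student_grades").isin("A", "B", "C")).show()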
