Spark: Read .dat file with a special case - apache-spark

I have a .dat file delimited by \u0001. It should look like so
+---+---+---+
|A  |B  |C  |
+---+---+---+
|1  |2,3|4  |
|5  |6  |7  |
+---+---+---+
But the file I get has a lot of spaces between the fields of a few rows:
A\u0001B\u0001C
1\u0001"2,3" \u00014
5\u00016\u00017
In the second row above, there are 79 spaces between two columns. Now when I read the file in Spark
val df = spark.read.format("csv").option("header", "true").option("delimiter", "\u0001").load("path")
df.show(false)
+---+-----------------------------------------------------------------------------------+----+
|A  |B                                                                                  |C   |
+---+-----------------------------------------------------------------------------------+----+
|1  |2,3                                                                               4|null|
|5  |6                                                                                  |7   |
+---+-----------------------------------------------------------------------------------+----+
Is there a way to fix this without changing the input file?

Try adding .option("ignoreTrailingWhiteSpace", true)
From the documentation:
ignoreTrailingWhiteSpace (default false): a flag indicating whether or
not trailing whitespaces from values being read should be skipped.
EDIT: you need to turn off quoting to make it work with your example:
val df2 = spark.read.format("csv")
  .option("header", true)
  .option("quote", "")
  .option("delimiter", "\u0001")
  .option("ignoreTrailingWhiteSpace", true)
  .load("data2.txt")
Result:
df2.show()
+---+-----+---+
|  A|    B|  C|
+---+-----+---+
|  1|"2,3"|  4|
|  5|    6|  7|
+---+-----+---+
To remove the quotes you can try (note this will remove quotes inside your string):
import org.apache.spark.sql.functions.regexp_replace
df2.withColumn("B", regexp_replace(df2("B"), "\"", "")).show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1|2,3| 4|
| 5| 6| 7|
+---+---+---+

By setting 'ignoreLeadingWhiteSpace' and 'ignoreTrailingWhiteSpace' to true, you will remove any whitespace:
val df = spark.read.format("csv")
  .option("header", true)
  .option("delimiter", "\u0001")
  .option("ignoreLeadingWhiteSpace", true)
  .option("ignoreTrailingWhiteSpace", true)
  .load("path")
df.show(10, false)

You need to disable the quote character and set the whitespace-trimming properties as follows:
val df = spark.read.format("csv")
  .option("header", true)
  .option("delimiter", "\u0001")
  .option("quote", "")
  .option("ignoreLeadingWhiteSpace", true)
  .option("ignoreTrailingWhiteSpace", true)
  .load("path")
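Another option, if you would rather not disable quoting on the CSV reader, is to bypass the CSV parser altogether. Below is a minimal sketch (not from the answers above) that assumes the simple three-column layout shown in the question; "path" is a placeholder for the actual file location. It reads each line as plain text, splits on \u0001, trims the padding and strips surrounding quotes manually:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// read each line as raw text
val lines = spark.read.textFile("path")
val headerLine = lines.first()
val colNames = headerLine.split("\u0001").map(_.trim)

val df = lines
  .filter(_ != headerLine)                 // drop the header line
  .map { line =>
    // split on the \u0001 delimiter, trim padding, strip surrounding quotes
    val f = line.split("\u0001").map(_.trim.stripPrefix("\"").stripSuffix("\""))
    (f(0), f(1), f(2))                     // assumes exactly three columns, as in the example
  }
  .toDF(colNames: _*)
df.show(false)
This keeps all columns as strings; you would still need to cast them afterwards if numeric types are required.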

Related

How to merge duplicate columns in pyspark?

I have a pyspark dataframe in which some of the columns have the same name. I want to merge all the columns having the same name into one column.
How can I do this in pyspark? Any help would be highly appreciated.
Check the Scala code below. It might help you.
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.annotation.tailrec
import scala.util.Try

implicit class DFHelpers(df: DataFrame) {
  def mergeColumns() = {
    val dupColumns = df.columns
    // make the names unique by appending each column's index, e.g. a -> a0, b -> b1, ...
    val newColumns = dupColumns.zipWithIndex.map(c => s"${c._1}${c._2}")
    // group the renamed columns by their original (single-character) name and build
    // one coalesce(...) expression per group
    val columns = newColumns
      .map(c => (c(0), c))
      .groupBy(_._1)
      .map(c => (c._1, c._2.map(_._2)))
      .map(c => s"""coalesce(${c._2.mkString(",")}) as ${c._1}""")
      .toSeq
    df.toDF(newColumns: _*).selectExpr(columns: _*)
  }
}

// Exiting paste mode, now interpreting.
scala> df.show(false)
+----+----+----+----+----+----+
|a |b |a |c |a |b |
+----+----+----+----+----+----+
|4 |null|null|8 |null|21 |
|null|8 |7 |6 |null|null|
|96 |null|null|null|null|78 |
+----+----+----+----+----+----+
scala> df.printSchema
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- a: string (nullable = true)
|-- c: string (nullable = true)
|-- a: string (nullable = true)
|-- b: string (nullable = true)
scala> df.mergeColumns.show(false)
+---+---+----+
|b |a |c |
+---+---+----+
|21 |4 |8 |
|8 |7 |6 |
|78 |96 |null|
+---+---+----+
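Note that the grouping key in mergeColumns above is the first character of each renamed column (c(0)), so it implicitly assumes single-character column names like a, b, c. A hedged generalization that groups on the full original name instead might look like the following sketch (mergeDuplicateColumns is a hypothetical helper, not part of the original answer):
import org.apache.spark.sql.DataFrame

// Sketch only: coalesce duplicate columns in order of appearance, for any column-name length.
def mergeDuplicateColumns(df: DataFrame): DataFrame = {
  // make names unique by appending the positional index
  val renamed = df.columns.zipWithIndex.map { case (name, idx) => s"${name}_$idx" }
  // one coalesce(...) expression per original name, over all of its renamed occurrences
  val exprs = df.columns.zip(renamed)
    .groupBy(_._1)
    .map { case (orig, pairs) => s"coalesce(${pairs.map(_._2).mkString(",")}) as $orig" }
    .toSeq
  df.toDF(renamed: _*).selectExpr(exprs: _*)
}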
Edited to answer the OP's request to coalesce from a list. Here's a reproducible example:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("z", "a", None, None),
    ("b", None, "c", None),
    ("c", "b", None, None),
    ("d", None, None, "z"),
], ["a", "c", "c", "c"])
df.show()

# fix duplicated column names
old_col = df.schema.names
running_list = []
new_col = []
i = 0
for column in old_col:
    if column in running_list:
        new_col.append(column + "_" + str(i))
        i = i + 1
    else:
        new_col.append(column)
        running_list.append(column)
print(new_col)

df1 = df.toDF(*new_col)

# coalesce columns to get one column from a list
a = ['c', 'c_0', 'c_1']
to_drop = ['c_0', 'c_1']
b = [df1[col] for col in a]

# coalesce columns to get one column
df_merged = df1.withColumn('c', F.coalesce(*b)).drop(*to_drop)
df_merged.show()
Output:
+---+----+----+----+
| a| c| c| c|
+---+----+----+----+
| z| a|null|null|
| b|null| c|null|
| c| b|null|null|
| d|null|null| z|
+---+----+----+----+
['a', 'c', 'c_0', 'c_1']
+---+---+
| a| c|
+---+---+
| z| a|
| b| c|
| c| b|
| d| z|
+---+---+

return alphanumeric values from column in pyspark dataframe

I have a pyspark dataframe df. It has 2 columns, like the example input shown below. I would like to create a new output dataframe, with a new column 'col3' that only has the alphanumeric values from the strings in col2.
I've tried using Spark SQL with
regexp_extract('('+col1+')','[^[A-Za-z0-9] ]', 0)
but it only returns null.
Can anyone suggest how to do this?
input
df.show()
+----+----+
|col1|col2|
+----+----+
|1 |ab& |
+----+----+
|2 |efg |
+----+----+
output
+----+----+
|col1|col3|
+----+----+
|1 |ab |
+----+----+
|2 |efg |
+----+----+
Use the regexp_replace() function in Spark.
Example:
df.show()
#+----+----+
#|col1|col2|
#+----+----+
#| 1| ab&|
#| 2| efg|
#+----+----+
from pyspark.sql.functions import *
df.withColumn("col3",regexp_replace("col2",'[^A-Za-z0-9]','')).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| 1| ab&| ab|
#| 2| efg| efg|
#+----+----+----+
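The original attempt used a SQL expression string, so for reference here is a minimal sketch of the same regexp_replace fix written as a SQL expression. It is shown in Scala with expr(); df is assumed to be a DataFrame with the col1/col2 layout from the example above:
import org.apache.spark.sql.functions.expr

// same fix as above, but expressed as a SQL expression string
df.withColumn("col3", expr("regexp_replace(col2, '[^A-Za-z0-9]', '')")).show()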

How to compute the numerical difference between columns of different dataframes?

Given two Spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it in another dataframe (or optionally another data structure).
For instance, let us have the following datasets:
DataFrame A:
+----+---+
|   A|  B|
+----+---+
|   1|  0|
|   1|  0|
+----+---+
DataFrame B:
+----+---+
|   A|  B|
+----+---+
|   1|  0|
|   0|  0|
+----+---+
How can I obtain B-A, i.e.
+----+---+
|  c1| c2|
+----+---+
|   0|  0|
|  -1|  0|
+----+---+
In practice the real dataframes have a substantial number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing it?
I was able to solve this by using the approach below. This code can work with any number of columns; you just have to change the input DFs accordingly. Note that rdd.zip assumes the two RDDs have the same number of partitions and the same number of elements in each partition.
import org.apache.spark.sql.Row

val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")

val columns = df0.columns
val rdd = df0.rdd.zip(df1.rdd).map { x =>
  // subtract each column of the first row from the corresponding column of the second
  val arr = columns.map(column => x._2.getAs[Int](column) - x._1.getAs[Int](column))
  Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+
If your df A lines up with df B (same schema and row order), you can try the approach below. I don't know if this will work correctly for large datasets; it would be better to already have an id for joining instead of creating one using monotonically_increasing_id().
import spark.implicits._
import org.apache.spark.sql.functions._

val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")

// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)

// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
  .withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
  .withColumn("id", monotonically_increasing_id())

dfA.show()
dfB.show()

// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))

// diff between cols
val calcCols = (dfACols zip dfBCols).map(s => s._2 - s._1)

// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols: _*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+
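Following up on the note above about avoiding monotonically_increasing_id(): here is a minimal sketch of the same column-wise difference when both inputs already carry a shared key column. The "id" column and the data are hypothetical, and the non-key column names are assumed to match between the two frames:
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val left  = Seq((0, 1, 0), (1, 1, 0)).toDF("id", "a", "b")
val right = Seq((0, 1, 0), (1, 0, 0)).toDF("id", "a", "b")

// build one (right - left) column per shared value column
val valueCols = left.columns.filterNot(_ == "id")
val diffs = valueCols.map(c => (right(c) - left(c)).as(s"diff_$c"))

left.join(right, "id").select(diffs: _*).show()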

Spark dataframe self-joins are producing empty dataframe as a result

Below is my data in CSV, which I read into a dataframe.
id,pid,pname,ppid
1, 1, 5, -1
2, 1, 7, -1
3, 2, 9, 1
4, 2, 11, 1
5, 3, 5, 1
6, 4, 7, 2
7, 1, 9, 3
I am reading that data into a dataframe data_df. I am trying to do a self-join on different columns, but the resulting dataframes are empty. I have tried multiple options.
Below is my code. Only the last join, joined4, produces a result.
val joined = data_df.as("first").join(data_df.as("second")).where( col("first.ppid") === col("second.pid"))
joined.show(50, truncate = false)
val joined2 = data_df.as("first").join(data_df.as("second"), col("first.ppid") === col("second.pid"), "inner")
joined2.show(50, truncate = false)
val df1 = data_df.as("df1")
val df2 = data_df.as("df2")
val joined3 = df1.join(df2, $"df1.ppid" === $"df2.id")
joined3.show(50, truncate = false)
val joined4 = data_df.as("df1").join(data_df.as("df2"), Seq("id"))
joined4.show(50, truncate = false)
Below are the outputs of joined, joined2, joined3 and joined4 respectively:
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+-----+----+
|id |pid|pname|ppid|pid|pname|ppid|
+---+---+-----+----+---+-----+----+
| 1 | 1| 5| -1| 1| 5| -1|
| 2 | 1| 7| -1| 1| 7| -1|
| 3 | 2| 9| 1| 2| 9| 1|
| 4 | 2| 11| 1| 2| 11| 1|
| 5 | 3| 5| 1| 3| 5| 1|
| 6 | 4| 7| 2| 4| 7| 2|
| 7 | 1| 9| 3| 1| 9| 3|
+---+---+-----+----+---+-----+----+
Sorry, I later figured out that the spaces in the CSV were causing the issue. If I create a correctly structured CSV of the initial data, the problem disappears.
The correct CSV format is as follows.
id,pid,pname,ppid
1,1,5,-1
2,1,7,-1
3,2,9,1
4,2,11,1
5,3,5,1
6,4,7,2
7,1,9,3
Ideally, I can also use the option to ignore leading whitespace, as shown in the following answer:
val data_df = spark.read
  .schema(dataSchema)
  .option("mode", "FAILFAST")
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true")
  .csv(dataSourceName)
pySpark (v2.4) DataFrameReader adds leading whitespace to column names

How to convert RDD[List[Int]] to DataFrame?

I have an RDD[List[Int]]. I do not know the length of each List[Int]. I want to convert the RDD[List[Int]] to a DataFrame. How should I do it?
This is my input:
val l1=Array(1,2,3,4)
val l2=Array(1,2,3,4)
val Lz=Seq(l1,l2)
val rdd1=sc.parallelize(Lz,2)
This is my expected result:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 2| 3| 4|
| 1| 2| 3| 4|
+---+---+---+---+
There might be some other, better functional way to do this, but this works too:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

def getSchema(myArray: Array[Int]): StructType = {
  var schemaArray = scala.collection.mutable.ArrayBuffer[StructField]()
  for ((el, idx) <- myArray.view.zipWithIndex) {
    schemaArray += StructField("col" + idx, IntegerType, true)
  }
  StructType(schemaArray)
}

val l1 = Array(1, 2, 3, 4)
val l2 = Array(1, 2, 3, 4)
val Lz = Seq(l1, l2)
val rdd1 = sc.parallelize(Lz, 2).map(Row.fromSeq(_))
val schema = getSchema(l1) // since both arrays will be of the same type and size
val df = sqlContext.createDataFrame(rdd1, schema)
df.show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
| 1| 2| 3| 4|
| 1| 2| 3| 4|
+----+----+----+----+
You can do the following:
val l1 = Array(1, 2, 3, 4)
val l2 = Array(1, 2, 3, 4)
val Lz = Seq(l1, l2)
val df = sc.parallelize(Lz, 2).map {
  case Array(val1, val2, val3, val4) => (val1, val2, val3, val4)
}.toDF
df.show
df.show
// +---+---+---+---+
// | _1| _2| _3| _4|
// +---+---+---+---+
// | 1| 2| 3| 4|
// | 1| 2| 3| 4|
// +---+---+---+---+
If you have lots of columns, you would need to proceed differently, but you need to know the schema of your data, otherwise you won't be able to perform the following:
val sch = df.schema // I just took the schema from the old df but you can add one programmatically
val df2 = spark.createDataFrame(sc.parallelize(Lz,2).map{ Row.fromSeq(_) }, sch)
df2.show
// +---+---+---+---+
// | _1| _2| _3| _4|
// +---+---+---+---+
// | 1| 2| 3| 4|
// | 1| 2| 3| 4|
// +---+---+---+---+
Unless you provide a schema, you won't be able to do much except have an array column:
val df3 = sc.parallelize(Lz,2).toDF
// df3: org.apache.spark.sql.DataFrame = [value: array<int>]
df3.show
// +------------+
// | value|
// +------------+
// |[1, 2, 3, 4]|
// |[1, 2, 3, 4]|
// +------------+
df3.printSchema
//root
// |-- value: array (nullable = true)
// | |-- element: integer (containsNull = false)
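As a hedged follow-up (not part of the original answers): if the array length is known up front, that single array column can still be split into separate columns afterwards. A minimal sketch, assuming every array has length 4 as in the example above:
import org.apache.spark.sql.functions.col

val n = 4  // assumed, known array length
val df4 = df3.select((0 until n).map(i => col("value")(i).as(s"_${i + 1}")): _*)
df4.show()
// should print the same four-column layout (_1 .. _4) as the examples above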
