Spark doesn't read columns with null values in first row - apache-spark

Below is the content in my csv file :
A1,B1,C1
A2,B2,C2,D1
A3,B3,C3,D2,E1
A4,B4,C4,D3
A5,B5,C5,,E2
So, there are 5 columns but only 3 values in the first row.
I read it using the following command :
val csvDF : DataFrame = spark.read
.option("header", "false")
.option("delimiter", ",")
.option("inferSchema", "false")
.csv("file.csv")
And the following is what I get using csvDF.show():
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| A1| B1| C1|
| A2| B2| C2|
| A3| B3| C3|
| A4| B4| C4|
| A5| B5| C5|
+---+---+---+
How can I read all the data in all the columns?

Basically your csv-file isn't properly formatted in the sense that it doesn't have an equal number of columns in each row, which is required if you want to read it with spark.read.csv. However, you can instead read it with spark.read.textFile and then parse each row.
As I understand it, you do not know the number of columns beforehand, so you want your code to handle an arbitrary number of columns. To do this you need to establish the maximum number of columns in your data set, so you need two passes over your data set.
For this particular problem, I would actually go with RDDs instead of DataFrames or Datasets, like this:
import org.apache.spark.sql.Row

// first pass: find the maximum number of columns in the file
val data = spark.read.textFile("file.csv").rdd
val rdd = data.map(s => (s, s.split(",").length)).cache
val maxColumns = rdd.map(_._2).max()

// second pass: pad every row with nulls up to maxColumns
val x = rdd.map { row =>
  val rowData = row._1.split(",")
  val extraColumns = Array.ofDim[String](maxColumns - rowData.length)
  Row((rowData ++ extraColumns).toList: _*)
}
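If you then need a DataFrame rather than an RDD, a minimal sketch (assuming every column can be treated as a nullable string) is to build a matching StructType and hand it to spark.createDataFrame together with the padded RDD:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// build a schema with maxColumns nullable string columns: _c0, _c1, ...
val schema = StructType((0 until maxColumns).map(i => StructField(s"_c$i", StringType, nullable = true)))

// x is the RDD[Row] built above; the padded cells come out as null
val paddedDF = spark.createDataFrame(x, schema)
paddedDF.show()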
Hope that helps :)

You can read it as a dataset with only one column (for example, by using another delimiter):
var df = spark.read.format("csv").option("delimiter",";").load("test.csv")
df.show()
+--------------+
| _c0|
+--------------+
| A1,B1,C1|
| A2,B2,C2,D1|
|A3,B3,C3,D2,E1|
| A4,B4,C4,D3|
| A5,B5,C5,,E2|
+--------------+
Then you can manually split your column into five columns; this will add null values when the element does not exist:
import org.apache.spark.sql.functions.split
import spark.implicits._

var csvDF = df.withColumn("_tmp", split($"_c0", ",")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3"),
  $"_tmp".getItem(3).as("col4"),
  $"_tmp".getItem(4).as("col5")
)
csvDF.show()
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A1| B1| C1|null|null|
| A2| B2| C2| D1|null|
| A3| B3| C3| D2| E1|
| A4| B4| C4| D3|null|
| A5| B5| C5| | E2|
+----+----+----+----+----+
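If you would rather not hardcode the five getItem calls, a hedged sketch of the same idea generates them from the maximum column count, here assumed to be held in a variable maxColumns computed beforehand (e.g. as in the RDD-based answer above):
import org.apache.spark.sql.functions.{col, split}

// assumes maxColumns holds the widest row length, computed beforehand
val columns = (0 until maxColumns).map(i => col("_tmp").getItem(i).as(s"col${i + 1}"))
val wideDF = df
  .withColumn("_tmp", split(col("_c0"), ","))
  .select(columns: _*)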

If the column data types and the number of columns are known, then you can define a schema and apply it while reading the csv file as a DataFrame. Below I have defined all five columns as StringType:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true),
  StructField("col5", StringType, true)))

val csvDF : DataFrame = sqlContext.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(schema)
  .csv("file.csv")
You should get a DataFrame like this:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A1 |B1 |C1 |null|null|
|A2 |B2 |C2 |D1 |null|
|A3 |B3 |C3 |D2 |E1 |
|A4 |B4 |C4 |D3 |null|
|A5 |B5 |C5 |null|E2 |
+----+----+----+----+----+
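As an optional alternative, the same schema can also be supplied as a DDL string, which is a bit more compact (a sketch; this assumes a Spark version where DataFrameReader.schema accepts a string, roughly 2.3+):
// equivalent schema expressed as a DDL string
val csvDF = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema("col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING")
  .csv("file.csv")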

Related

Spark: Read dat file with a special case

I have a .dat file delimited by \u0001. It should look like so
+---+---+---+
|A |B |C |
+---+---+---+
|1 |2,3|4 |
+---+---+---+
|5 |6 |7 |
+---+---+---+
But the file I get has a lot of spaces between the fields in a few rows:
A\u0001B\u0001C
1\u0001"2,3" \u00014
5\u00016\u00017
In the second row above, there are 79 spaces between two columns. Now when I read the file in Spark
val df = spark.read.format("csv").option("header", "true").option("delimiter", "\u0001").load("path")
df.show(false)
+---+-----------------------------------------------------------------------------------+----+
|A |B |C |
+---+-----------------------------------------------------------------------------------+----+
|1 |2,3 4|null|
+---+-----------------------------------------------------------------------------------+----+
|5 |6 |7 |
+---+-----------------------------------------------------------------------------------+----+
Is there a way to fix this without changing the input file?
Try adding .option("ignoreTrailingWhiteSpace", true)
From the documentation:
ignoreTrailingWhiteSpace (default false): a flag indicating whether or
not trailing whitespaces from values being read should be skipped.
EDIT: you need to turn off quoting to make it work with your example:
val df2 = spark.read.format("csv")
.option("header", true)
.option("quote", "")
.option("delimiter", "\u0001")
.option("ignoreTrailingWhiteSpace", true)
.load("data2.txt")
Result:
df2.show()
+---+-----+---+
| A| B| C|
+---+-----+---+
| 1|"2,3"| 4|
| 5| 6| 7|
+---+-----+---+
To remove the quotes you can try (note this will remove quotes inside your string):
import org.apache.spark.sql.functions.regexp_replace
df2.withColumn("B", regexp_replace(df2("B"), "\"", "")).show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1|2,3| 4|
| 5| 6| 7|
+---+---+---+
By setting 'ignoreLeadingWhiteSpace' and 'ignoreTrailingWhiteSpace' to true, you will remove any whitespace:
val df = spark.read.format("csv")
.option("header", true)
.option("delimiter", "\u0001")
.option("ignoreLeadingWhiteSpace", true)
.option("ignoreTrailingWhiteSpace", true)
.load("path")
df.show(10, false)
You need to disable the quote character and set the whitespace-trimming options:
val df = spark.read.format("csv")
  .option("header", true)
  .option("delimiter", "\u0001")
  .option("quote", "")
  .option("ignoreLeadingWhiteSpace", true)
  .option("ignoreTrailingWhiteSpace", true)
  .load("path")

How to extract value of json when doing pyspark query

This is what the table looks like,
which I extract using the following command:
query="""
select
distinct
userid,
region,
json_data
from mytable
where
operation = 'myvalue'
"""
table=spark.sql(query)
Now, I wish to extract only the value of msg_id from the column json_data (which is a string column), with the following expected output:
How should I change the query in the above code to extract this value from json_data?
Note:
The JSON format is not fixed (i.e., it may contain other fields), but the value I want to extract is always under msg_id.
I want to achieve this during retrieval for efficiency reasons, though I could retrieve json_data and format it afterwards.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("json", StringType(), True)
])
data = [("a", "b", '{"msg_id":"123","msg":"test"}'), ("c", "d", '{"msg_id":"456","column1":"test"}')]
df = spark.createDataFrame(data, schema)

# infer the JSON schema from the data, then parse the string column
json_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
df2 = df.withColumn('parsed', from_json(col('json'), json_schema))
df2.createOrReplaceTempView("test")
spark.sql("select a, b, parsed.msg_id from test").show()
OUTPUT >>>
+---+---+------+
| a| b|msg_id|
+---+---+------+
| a| b| 123|
| c| d| 456|
+---+---+------+
Instead of reading the file to get the schema, you can specify the schema using StructType and StructField syntax, the struct<> DDL-string syntax, or schema_of_json, as shown below:
df.show() #sampledataframe
#+------+------+-----------------------------------------+
#|userid|region|json_data |
#+------+------+-----------------------------------------+
#|1 |US |{"msg_id":123} |
#|2 |US |{"msg_id":123} |
#|3 |US |{"msg_id":123} |
#|4 |US |{"msg_id":123,"is_ads":true,"location":2}|
#|5 |US |{"msg_id":456} |
#+------+------+-----------------------------------------+
from pyspark.sql import functions as F
from pyspark.sql.types import *
schema = StructType([StructField("msg_id", LongType(), True),
StructField("is_ads", BooleanType(), True),
StructField("location", LongType(), True)])
#OR
schema= 'struct<is_ads:boolean,location:bigint,msg_id:bigint>'
#OR
schema= df.select(F.schema_of_json("""{"msg_id":123,"is_ads":true,"location":2}""")).collect()[0][0]
df.withColumn("json_data", F.from_json("json_data",schema))\
.select("userid","region","json_data.msg_id").show()
#+------+------+------+
#|userid|region|msg_id|
#+------+------+------+
#| 1| US| 123|
#| 2| US| 123|
#| 3| US| 123|
#| 4| US| 123|
#| 5| US| 456|
#+------+------+------+
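If only msg_id is ever needed and you want to keep everything inside the original SQL query, another option worth mentioning is the built-in get_json_object function. Below is a sketch against the table and filter from the question (shown here as a Scala spark.sql call; the identical spark.sql statement can be issued from PySpark, and the value comes back as a string):
// sketch: extract msg_id directly inside the query with get_json_object
val table = spark.sql("""
  select distinct
    userid,
    region,
    get_json_object(json_data, '$.msg_id') as msg_id
  from mytable
  where operation = 'myvalue'
""")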

How to perform one hot encoding in Spark on a string column that has comma separated values?

I have a dataframe that looks like this
val df = Seq(
  (1, "a,b,c"),
  (2, "b,c")
).toDF("id", "page_path")
df.createOrReplaceTempView("df")
df.show()
+---+---------+
| id|page_path|
+---+---------+
| 1| a,b,c|
| 2| b,c|
+---+---------+
I want to perform one-hot encoding on this page_path column such that the output would look like this:
Can I do this using one-hot encoding in Spark?
Column "page_path" can be split, and then values exploded, and pivoted:
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

df
  .withColumn("splitted", split($"page_path", ","))
  .withColumn("exploded", explode($"splitted"))
  .groupBy("id")
  .pivot("exploded")
  .count()
  // replace nulls with 0
  .na.fill(0)
Output:
+---+---+---+---+
|id |a |b |c |
+---+---+---+---+
|1 |1 |1 |1 |
|2 |0 |1 |1 |
+---+---+---+---+
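When the distinct values are known up front, as they are here, you can also pass them explicitly to pivot, which saves Spark a pass over the data to discover them (a small optional tweak to the approach above):
import org.apache.spark.sql.functions.{col, explode, split}

df
  .withColumn("exploded", explode(split(col("page_path"), ",")))
  .groupBy("id")
  .pivot("exploded", Seq("a", "b", "c")) // explicit values skip the scan for distinct values
  .count()
  .na.fill(0)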
Since the question mentions df.createOrReplaceTempView("df"), here is the SQL version of the same thing Pasha did.
The Databricks documentation mentions many use cases for pivot.
Below is the SQL version, for SQL lovers.
In this approach, unlike the DataFrame API, pivot uses implicit grouping, so there is no need for a separate GROUP BY clause in the SQL.
val df: DataFrame = Seq((1, "a,b,c"),(2, "b,c")).toDF("id", "page_path")
df.createOrReplaceTempView("df")
spark.sql(
"""
|Select * from
|( select id, explode(split( page_path ,',')) as exploded from df )
|pivot(count(exploded) for exploded in ('a' as is_a, 'b' as is_b, 'c' as is_c)
|)
""".stripMargin).na.fill(0).show
Result:
+---+----+----+----+
| id|is_a|is_b|is_c|
+---+----+----+----+
|  1|   1|   1|   1|
|  2|   0|   1|   1|
+---+----+----+----+
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Solution {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkClusterApp")
    val sparkSession = SparkSession.builder.config(sparkConf).getOrCreate
    import sparkSession.implicits._

    val df = Seq((1, "a,b,c"), (2, "b,c")).toDF("id", "page_path")
    df.createOrReplaceTempView("df")

    df.withColumn("_tmp", split($"page_path", "\\,")).select(
      $"id",
      when(array_contains($"_tmp", "a"), "1").otherwise("0").as("is_a"),
      when(array_contains($"_tmp", "b"), "1").otherwise("0").as("is_b"),
      when(array_contains($"_tmp", "c"), "1").otherwise("0").as("is_c")).show()
  }
}

Efficient DataFrame lookup in Apache Spark

I want to efficiently look up many IDs. What I have is a dataframe df_source that looks like this, but with a couple of million records distributed across 10 workers:
+-------+----------------+
| URI| Links_lists|
+-------+----------------+
| URI_1|[URI_8,URI_9,...|
| URI_2|[URI_6,URI_7,...|
| URI_3|[URI_4,URI_1,...|
| URI_4|[URI_1,URI_5,...|
| URI_5|[URI_3,URI_2,...|
+-------+----------------+
My first step would be to make an RDD out of df_source:
rdd_source = df_source.rdd
Out of rdd_source I want to create an RDD that contains only the URIs with IDs. I do this like this:
rdd_index = rdd_source.map(lambda x: x[0]).zipWithUniqueId()
Now I also .flatMap() the rdd_source into an RDD that contains all relations, which until now were contained only within the Links_lists column.
rdd_relations = rdd_source.flatMap(lambda x: x)
Now I transform both rdd_index and rdd_relations back into DataFrames because I want to do joins, and I think (I might be wrong on this) joins on DataFrames are faster.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema_index = StructType([
    StructField("URI", StringType(), True),
    StructField("ID", IntegerType(), True)])
df_index = sqlContext.createDataFrame(rdd_index, schema=schema_index)
and
schema_relation = StructType([
    StructField("URI", StringType(), True),
    StructField("LINK", StringType(), True)])
df_relations = sqlContext.createDataFrame(rdd_relations, schema=schema_relation)
The resulting dataframes should look like these two :
df_index:
+-------+-------+
| URI| ID|
+-------+-------+
| URI_1| 1|
| URI_2| 2|
| URI_3| 3|
| URI_4| 4|
| URI_5| 5|
+-------+-------+
df_relations:
+-------+-------+
| URI| LINK|
+-------+-------+
| URI_1| URI_5|
| URI_1| URI_8|
| URI_1| URI_9|
| URI_2| URI_3|
| URI_2| URI_4|
+-------+-------+
Now, to replace the long string URIs in df_relations, I will do joins on df_index. The first join:
df_relations = \
    df_relations.join(df_index, df_relations.URI == df_index.URI, 'inner') \
    .select(col('ID').alias('URI_ID'), col('LINK'))
This should yield me a dataframe looking like this:
df_relations:
+-------+-------+
| URI_ID| LINK|
+-------+-------+
| 1| URI_5|
| 1| URI_8|
| 1| URI_9|
| 2| URI_3|
| 2| URI_4|
+-------+-------+
And the second join:
df_relations = \
    df_relations.join(df_index, df_relations.LINK == df_index.URI, 'inner') \
    .select(col('URI_ID'), col('ID').alias('LINK_ID'))
This should result in the final DataFrame, the one I need, looking like this:
df_relations:
+-------+-------+
| URI_ID|LINK_ID|
+-------+-------+
| 1| 5|
| 1| 8|
| 1| 9|
| 2| 3|
| 2| 4|
+-------+-------+
where all URIs are replaced with IDs from df_index.
Is this an efficient way to look up the IDs for all URIs in both columns of the relation table, or is there a more effective way of doing this?
I'm using Apache Spark 2.1.0 with Python 3.5
First, you do not need to use RDDs for the operations you described; using RDDs can be very costly. Second, you do not need to do two joins, you can do just one:
import pyspark.sql.functions as f

# add a unique id for each URI
withID = df_source.withColumn("URI_ID", f.monotonically_increasing_id())
# create a single line from each element in the array
exploded = withID.select("URI_ID", f.explode("Links_lists").alias("LINK"))
linkID = withID.withColumnRenamed("URI_ID", "LINK_ID").drop("Links_lists")
joined = exploded.join(linkID, on=exploded.LINK == linkID.URI).drop("URI").drop("LINK")
Lastly, if linkID (which is basically df_source with a column replaced) is relatively small (i.e. can be fully contained in a single worker), you can broadcast it. Add the following before the join:
linkID = f.broadcast(linkID)

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo',41,'US',3),
('Foo',39,'UK',1),
('Bar',57,'CA',2),
('Bar',72,'CA',2),
('Baz',22,'US',6),
('Baz',36,'US',6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed
But how do I remove duplicate rows based on columns 1, 3 and 4 only? I.e., remove either one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In Python, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/Pyspark?
Pyspark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+
From your question, it is unclear as to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then, you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    return "{0}{1}{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now, you have a key-value RDD that is keyed by columns 1,3 and 4.
The next step would be either a reduceByKey or groupByKey and filter.
This would eliminate duplicates.
r = m.reduceByKey(lambda x,y: (x))
I know you already accepted the other answer, but if you want to do this as a
DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc) you could do:
myDF.groupBy($"col1", $"col3", $"col4").agg($"col1", max($"col2"), $"col3", $"col4")
Note that in this case, I chose the Max of col2, but you could do avg, min, etc.
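If instead you want to keep one complete original row per key rather than aggregate each column, another option is a window function with row_number (a sketch; the orderBy column decides which duplicate survives):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// keep the row with the smallest col2 within each (col1, col3, col4) group
val w = Window.partitionBy("col1", "col3", "col4").orderBy("col2")
val deduped = myDF
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")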
I agree with David. To add on, it may not be the case that we want to group by all columns other than the column(s) in the aggregate function, i.e., if we want to remove duplicates purely based on a subset of columns and retain all columns in the original DataFrame. So a better way to do this could be using the dropDuplicates DataFrame API, available since Spark 1.4.0.
For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame
I used the built-in function dropDuplicates(). The Scala code is given below:
val data = sc.parallelize(List(("Foo", 41, "US", 3),
                               ("Foo", 39, "UK", 1),
                               ("Bar", 57, "CA", 2),
                               ("Bar", 72, "CA", 2),
                               ("Baz", 22, "US", 6),
                               ("Baz", 36, "US", 6))).toDF("x", "y", "z", "count")
data.dropDuplicates(Array("x", "count")).show()
Output:
+---+---+---+-----+
| x| y| z|count|
+---+---+---+-----+
|Baz| 22| US| 6|
|Foo| 39| UK| 1|
|Foo| 41| US| 3|
|Bar| 57| CA| 2|
+---+---+---+-----+
The program below will help you drop fully duplicate rows, or, if you want, drop duplicates based on certain columns:
import org.apache.spark.sql.SparkSession

object DropDuplicates {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("DataFrame-DropDuplicates")
      .master("local[4]")
      .getOrCreate()

    import spark.implicits._

    // create an RDD of tuples with some data
    val custs = Seq(
      (1, "Widget Co", 120000.00, 0.00, "AZ"),
      (2, "Acme Widgets", 410500.00, 500.00, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (4, "Widgets R Us", 410500.00, 0.0, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
      (6, "Widget Co", 12000.00, 10.00, "AZ")
    )
    val customerRows = spark.sparkContext.parallelize(custs, 4)

    // convert RDD of tuples to DataFrame by supplying column names
    val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")

    println("*** Here's the whole DataFrame with duplicates")
    customerDF.printSchema()
    customerDF.show()

    // drop fully identical rows
    val withoutDuplicates = customerDF.dropDuplicates()
    println("*** Now without duplicates")
    withoutDuplicates.show()

    val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))
    println("*** Now without partial duplicates too")
    withoutPartials.show()
  }
}
This is my DataFrame; the value 4 is repeated twice, so dropDuplicates will remove the repeated value.
scala> df.show
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
scala> val newdf=df.dropDuplicates
scala> newdf.show
+-----+
|value|
+-----+
| 1|
| 3|
| 5|
| 4|
| 18|
+-----+
