I have one or more CSV files that I have to merge in PySpark:
file 1:
c1,c2,c3
1,3,4
file 2:
c4,c5,c6
4,5,6
file 3:
c1,c2
7,8
I need to merge the files so that the outcome will be:
c1,c2,c3,c4,c5,c6
1,3,4,null,null,null
null,null,null,4,5,6
7,8,null,null,null,null
I have tried:
loading all the files from a folder using the load method:
spark.read.format("csv").option("header","true")
merging the files using merge.
Both approaches used just one file's schema:
c1,c2,c3
1,3,4
4,5,6
7,8
Read all the files (f1, f2, f3) and merge their column names. Then, for each file, find the missing (complement) columns and add them with lit(null). Finally, union all the DataFrames, selecting the columns in the same order. Here is the Scala solution.
val f1 = spark.read.format("csv").option("inferSchema","true").option("header","true").load("in/f1.csv")
val f2 = spark.read.format("csv").option("inferSchema","true").option("header","true").load("in/f2.csv")
val f3 = spark.read.format("csv").option("inferSchema","true").option("header","true").load("in/f3.csv")
val fall = f1.columns.union(f2.columns).union(f3.columns).distinct
val f1c = fall.diff(f1.columns)
val f1a = f1c.foldLeft(f1)( (acc,r) => acc.withColumn(r,lit(null)) )
val f2c = fall.diff(f2.columns)
val f2a = f2c.foldLeft(f2)( (acc,r) => acc.withColumn(r,lit(null)) )
val f3c = fall.diff(f3.columns)
val f3a = f3c.foldLeft(f3)( (acc,r) => acc.withColumn(r,lit(null)) )
val result = f1a.select(fall.head,fall.tail:_*).union(f2a.select(fall.head,fall.tail:_*)).union(f3a.select(fall.head,fall.tail:_*))
result.printSchema
result.show(false)
Results:
root
|-- c1: integer (nullable = true)
|-- c2: integer (nullable = true)
|-- c3: integer (nullable = true)
|-- c4: integer (nullable = true)
|-- c5: integer (nullable = true)
|-- c6: integer (nullable = true)
+----+----+----+----+----+----+
|c1 |c2 |c3 |c4 |c5 |c6 |
+----+----+----+----+----+----+
|1 |3 |4 |null|null|null|
|null|null|null|4 |5 |6 |
|7 |8 |null|null|null|null|
+----+----+----+----+----+----+
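Since the question asks for PySpark, here is a minimal equivalent sketch of the same approach in PySpark (a sketch assuming the same in/f1.csv, in/f2.csv, in/f3.csv paths and an existing spark session):
from functools import reduce
from pyspark.sql import functions as F
f1 = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("in/f1.csv")
f2 = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("in/f2.csv")
f3 = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("in/f3.csv")
dfs = [f1, f2, f3]
# all distinct column names across the files, in first-seen order
all_cols = []
for df in dfs:
    for c in df.columns:
        if c not in all_cols:
            all_cols.append(c)
# add the missing columns as nulls, select in a common order, then union
aligned = [df.select([F.col(c) if c in df.columns else F.lit(None).alias(c) for c in all_cols]) for df in dfs]
result = reduce(lambda a, b: a.union(b), aligned)
result.show()
On Spark 3.1+, reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs) achieves the same alignment in one step.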
I have an input PySpark df:
+---------+-------+--------+----------+----------+
|timestamp|user_id|results |event_name|product_id|
+---------+-------+--------+----------+----------+
|1000 |user_1 |result 1|Click |1 |
|1001 |user_1 |result 1|View |1 |
|1002 |user_1 |result 2|Click |3 |
|1003 |user_1 |result 2|View |4 |
|1004 |user_1 |result 2|View |5 |
+---------+-------+--------+----------+----------+
root
|-- timestamp: timestamp (nullable = true)
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- event_name: string (nullable = true)
|-- product_id: string (nullable = true)
I'd like to convert this to the following, making sure that I keep unique combinations of user_id and results, and aggregate product_ids based on the given event_name, like this:
+-------+--------+---------------+---------------+
|user_id|results |product_clicked|products_viewed|
+-------+--------+---------------+---------------+
|user_1 |result 1|[1] |[1] |
|user_1 |result 2|[3]            |[4,5]          |
+-------+--------+---------------+---------------+
root
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- product_clicked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- products_viewed: array (nullable = true)
| |-- element: string (containsNull = true)
I have looked into pivot; it's close, but I do not need the aggregation part of it. Instead, I need array creation in columns that are created based on the event_name column. I cannot figure out how to do it.
NOTE: The order in the product_clicked and product_viewed columns above is important and is based on the timestamp column of the input dataframe.
You can use collect_list during the pivot aggregation:
import pyspark.sql.functions as F
df2 = (df.groupBy('user_id', 'results')
       .pivot('event_name')
       .agg(F.collect_list('product_id'))
       .selectExpr("user_id", "results", "Click as product_clicked", "View as product_viewed")
      )
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
To ensure ordering, you can collect a list of structs containing the timestamp, sort the list, and transform the list to only keep the product_id:
df2 = (df.groupBy('user_id', 'results')
       .pivot('event_name')
       .agg(F.sort_array(F.collect_list(F.struct('timestamp', 'product_id'))))
       .selectExpr("user_id", "results", "transform(Click, x -> x.product_id) as product_clicked", "transform(View, x -> x.product_id) as product_viewed")
      )
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
Data schema:
root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|id|col1 |col2 |
|1 |["x","y","z"]|[123,"null","null"]|
From the above data I want to filter rows where "x" exists in col1 and get the corresponding value for "x" from col2.
(The values of col1 and col2 are ordered: if "x" is at index 2 in col1, its value is at index 2 in col2.)
Result (col1 and col2 need to be of array type):
|id |col1 |col2 |
|1 |["x"]|[123]|
If "x" is not present in col1, then I need a result like:
|id| col1 |col2 |
|1 |["null"] |["null"]|
I tried:
val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))
The trick is to transform your data from dumb string columns into a more usable data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests they should be), you can use Spark's built-in functions rather than a messy UDF as suggested by @baitmbarek.
To start, use trim and split to convert col1 and col2 to arrays:
scala> val df = Seq(
| ("1", """["x","y","z"]""","""[123,"null","null"]"""),
| ("2", """["a","y","z"]""","""[123,"null","null"]""")
| ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]
scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
.withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_array.show(false)
+---+---------+-----------------+
|id |col1 |col2 |
+---+---------+-----------------+
|1 |[x, y, z]|[123, null, null]|
|2 |[a, y, z]|[123, null, null]|
+---+---------+-----------------+
scala> df_array.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
From here, you should be able to achieve what you want using array_position to find the index of 'x' (if any) in col1 and retrieve the matching data from col2. However, converting the two arrays into a map first makes it clearer what your code is doing:
scala> val df_map = df_array.select(
$"id",
map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
)
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]
scala> df_map.show(false)
+---+--------------------------------+
|id |col_map |
+---+--------------------------------+
|1 |[x -> 123, y -> null, z -> null]|
|2 |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+
scala> val df_final = df_map.select(
$"id",
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(lit("x")))
.as("col1"),
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(element_at($"col_map", "x")))
.as("col2")
)
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_final.show
+---+------+------+
| id| col1| col2|
+---+------+------+
| 1| [x]| [123]|
| 2|[null]|[null]|
+---+------+------+
scala> df_final.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = false)
| |-- element: string (containsNull = false)
|-- col2: array (nullable = false)
| |-- element: string (containsNull = true)
Dataframe schema:
root
|-- ID: decimal(15,0) (nullable = true)
|-- COL1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL3: array (nullable = true)
| |-- element: string (containsNull = true)
Sample data
+--------------------+--------------------+--------------------+
| COL1 | COL2 | COL3 |
+--------------------+--------------------+--------------------+
|[A, B, C, A] |[101, 102, 103, 104]|[P, Q, R, S] |
+--------------------+--------------------+--------------------+
I want to apply nested conditions on array elements.
For example:
Find the COL3 elements where the COL1 element is A and the COL2 element is even.
Expected output: [S]
I looked at various functions, e.g. array_position, but it returns only the first occurrence.
Is there any straightforward way, or do I have to explode the arrays?
Assuming your condition applies to array elements with the same index, it is possible to filter arrays with lambda functions in SQL since Spark 2.4.0, but this is still not exposed via the other language APIs and you need to use expr(). You simply zip the three arrays and then filter the resulting array of structs:
scala> df.show()
+---+------------+--------------------+------------+
| ID| COL1| COL2| COL3|
+---+------------+--------------------+------------+
| 1|[A, B, C, A]|[101, 102, 103, 104]|[P, Q, R, S]|
+---+------------+--------------------+------------+
scala> df.select($"ID", expr(s"""
| filter(
| arrays_zip(COL1, COL2, COL3),
| e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
| ).COL3 AS result
| """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+
Since this uses expr() to supply an SQL expression as a column, it also works with PySpark:
>>> from pyspark.sql.functions import expr
>>> df.select(df.ID, expr("""
... filter(
... arrays_zip(COL1, COL2, COL3),
... e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
... ).COL3 AS result
... """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+
I have a Dataset with the following schema
|-- Name: string (nullable = true)
|-- Values: long (nullable = true)
|-- Count: integer (nullable = true)
Input Dataset
+------------+-----------------------+--------------+
|Name |Values |Count |
+------------+-----------------------+--------------+
|A |1000 |1 |
|B |1150 |0 |
|C |500 |3 |
+------------+-----------------------+--------------+
My result dataset needs to be of the format
+------------+-----------------------+--------------+
|Sum(count>0)| sum(all) | Percentage |
+------------+-----------------------+--------------+
| 1500 | 2650 | 56.60 |
+------------+-----------------------+--------------+
I am currently able to get the sum(count>0) and sum(all) in individual datasets by running
val non_zero = df.filter(col(COUNT).>(0)).select(sum(VALUES).as(NON_ZERO_SUM))
val total = df.select(sum(col(VALUES)).as(TOTAL_SUM))
I'm at a loss on what to do to merge the two independent datasets into a single dataset, with which I would calculate the percentage.
Also could this same problem be solved in a better way?
Thanks,
I'd use a single aggregation:
import org.apache.spark.sql.functions._
df.select(
  sum(when($"count" > 0, $"values")).alias("NON_ZERO_SUM"),
  sum($"values").alias("TOTAL_SUM")
)
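If you also need the Percentage column from the expected output, it can be derived from the two sums in the same pass. A minimal sketch in PySpark (assuming the Values and Count column names from the question, and reusing the NON_ZERO_SUM / TOTAL_SUM aliases):
from pyspark.sql import functions as F
agg = (df.select(
           F.sum(F.when(F.col("Count") > 0, F.col("Values"))).alias("NON_ZERO_SUM"),
           F.sum("Values").alias("TOTAL_SUM"))
       .withColumn("Percentage", F.round(F.col("NON_ZERO_SUM") / F.col("TOTAL_SUM") * 100, 2)))
agg.show()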
It's CDH with Spark 1.6.
I am trying to import this hypothetical CSV into an Apache Spark DataFrame:
$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a
I use the databricks-csv jar.
val textData = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss")
.option("inferSchema", "true")
.option("nullValue", "null")
.load("test.csv")
I use inferSchema to make the schema for the resulting DataFrame. The printSchema() function gives me the following output for the code above:
scala> textData.printSchema()
root
|-- C0: string (nullable = true)
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: string (nullable = true)
|-- C4: string (nullable = true)
|-- C5: timestamp (nullable = true)
|-- C6: string (nullable = true)
scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2| C3| C4| C5| C6|
+---+---+---+----------+---+--------------------+---+
| a| b| c|2016-09-09| a|2016-11-11 09:09:...| a|
| a| b| c|2016-09-10| a|2016-11-11 09:09:...| a|
+---+---+---+----------+---+--------------------+---+
The C3 column has string type. I want C3 to have date type. To get it to date type, I tried the following code:
val textData = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd")
.option("inferSchema", "true")
.option("nullValue", "null")
.load("test.csv")
scala> textData.printSchema
root
|-- C0: string (nullable = true)
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: timestamp (nullable = true)
|-- C4: string (nullable = true)
|-- C5: timestamp (nullable = true)
|-- C6: string (nullable = true)
scala> textData.show()
+---+---+---+--------------------+---+--------------------+---+
| C0| C1| C2| C3| C4| C5| C6|
+---+---+---+--------------------+---+--------------------+---+
| a| b| c|2016-09-09 00:00:...| a|2016-11-11 00:00:...| a|
| a| b| c|2016-09-10 00:00:...| a|2016-11-11 00:00:...| a|
+---+---+---+--------------------+---+--------------------+---+
The only difference between this code and the first block is the dateFormat option line (I use "yyyy-MM-dd" instead of "yyyy-MM-dd HH:mm:ss"). Now I get both C3 and C5 as timestamps (C3 is still not of date type). But for C5, the HH:mm:ss part is ignored and shows up as zeroes in the data.
Ideally I want C3 to be of type date, C5 to be of type timestamp, and its HH:mm:ss part not to be ignored. My solution right now looks like this: I make the CSV by pulling data in parallel from my DB, and I make sure that I pull all dates as timestamps (not ideal). So, the test CSV looks like this now:
$ hadoop fs -cat new-test.csv
a,b,c,2016-09-09 00:00:00,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10 00:00:00,a,2016-11-11 09:09:10.0,a
This is my final working code:
val textData = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss")
.schema(finalSchema)
.option("nullValue", "null")
.load("new-test.csv")
Here, I use the complete timestamp format ("yyyy-MM-dd HH:mm:ss") in dateFormat. I manually create the finalSchema instance where C3 is date type and C5 is timestamp type (Spark SQL types). I apply this schema using the schema() function. The output looks as follows:
scala> finalSchema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(C0,StringType,true), StructField(C1,StringType,true), StructField(C2,StringType,true), StructField(C3,DateType,true), StructField(C4,StringType,true), StructField(C5,TimestampType,true), StructField(C6,StringType,true))
scala> textData.printSchema()
root
|-- C0: string (nullable = true)
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: date (nullable = true)
|-- C4: string (nullable = true)
|-- C5: timestamp (nullable = true)
|-- C6: string (nullable = true)
scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2| C3| C4| C5| C6|
+---+---+---+----------+---+--------------------+---+
| a| b| c|2016-09-09| a|2016-11-11 09:09:...| a|
| a| b| c|2016-09-10| a|2016-11-11 09:09:...| a|
+---+---+---+----------+---+--------------------+---+
Is there an easier or out-of-the-box way to parse a CSV file (that has both date and timestamp types) into a Spark DataFrame?
Relevant Links:
http://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options
https://github.com/databricks/spark-csv
With the infer option, for non-trivial cases it will probably not return the expected result. As you can see in InferSchema.scala:
if (field == null || field.isEmpty || field == nullValue) {
  typeSoFar
} else {
  typeSoFar match {
    case NullType => tryParseInteger(field)
    case IntegerType => tryParseInteger(field)
    case LongType => tryParseLong(field)
    case DoubleType => tryParseDouble(field)
    case TimestampType => tryParseTimestamp(field)
    case BooleanType => tryParseBoolean(field)
    case StringType => StringType
    case other: DataType =>
      throw new UnsupportedOperationException(s"Unexpected data type $other")
  }
}
It will only try to match each column against a timestamp type, not a date type, so an "out of the box" solution for this case is not possible. But in my experience the "easier" solution is to directly define the schema with the needed types; it avoids the infer option setting a type that only fits the rows evaluated during inference rather than the entire data. Your final schema is an efficient solution.
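For reference, a minimal sketch of constructing such an explicit schema, shown here in PySpark (the asker's finalSchema above is the Scala equivalent):
from pyspark.sql.types import StructType, StructField, StringType, DateType, TimestampType
final_schema = StructType([
    StructField("C0", StringType(), True),
    StructField("C1", StringType(), True),
    StructField("C2", StringType(), True),
    StructField("C3", DateType(), True),        # date column
    StructField("C4", StringType(), True),
    StructField("C5", TimestampType(), True),   # timestamp column
    StructField("C6", StringType(), True),
])
It can then be passed to the reader with .schema(final_schema) in place of the inferSchema option, as in the final working code above.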
It's not really elegant but you can convert from timestamp to date like this (check last line):
val textData = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd")
.option("inferSchema", "true")
.option("nullValue", "null")
.load("test.csv")
.withColumn("C3", expr("""to_date(C3)"""))