I have been struggling to understand this odd Spark Streaming behaviour.
I want to write two CSV files into a Delta table using Spark Structured Streaming.
I made this example only to understand how streams work; I don't want to use other solutions, I just need to understand why this is not working.
So, I have two CSV files in /test/input:
A.csv
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
B.csv
+---+---+
| id| x|
+---+---+
| 4| D|
| 5| E|
+---+---+
I read the directory (so the union of the two dataframes above) as a stream:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("x", StringType(), True)
])

df = (spark.readStream.format("csv")
      .schema(schema)
      .option("ignoreChanges", "true")
      .option("delimiter", ";")
      .option("header", True)
      .load("/test/input"))
I then wanted to write this stream using the following code:
def processDf(df, epoch_id):
    ids = [x.id for x in df.select("id").distinct().collect()]
    for i in ids:
        temp_df = df.filter(df.id == i)
        temp_df.write.format("delta") \
            .option("mergeSchema", "true") \
            .partitionBy("id") \
            .option("replaceWhere", "id==" + str(i)) \
            .mode("append") \
            .save("/test/res")

df.writeStream.format("delta") \
    .foreachBatch(processDf) \
    .queryName("x") \
    .option("checkpointLocation", "/test/check") \
    .trigger(once=True) \
    .start()
No errors are shown. The code executes successfully.
When I go to check the files in /test/res, all the data files are there.
But when I read the Delta table back, I notice that only the first row is present:
df= (spark.read.format("delta").option("sep", ";").option("header", "true").load("/test/res")).cache()
+---+---+
| id| x|
+---+---+
| 1| A|
+---+---+
Why isn't it inserting all the lines? Is it the replaceWhere option?
replaceWhere is supposed to replace only the partitions that are already in the table and were updated in the source data.
What am I doing wrong, please?
EDIT:
The same behaviour occurs even if I read only one CSV as input; the code still writes only one line to the output instead of all lines.
This was actually a syntax error. I changed the loop block to the following and it worked:
for i in ids:
    i = str(i)
    tmp = df.filter(df.id == i)
    tmp.write.format("delta") \
        .option("mergeSchema", "true") \
        .partitionBy("id") \
        .option("replaceWhere", "id == '{i}'".format(i=i)) \
        .save("/res/")
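As a side note, the Delta documentation pairs replaceWhere with overwrite mode rather than append. A minimal sketch of what the per-id write inside the loop could look like in that form, reusing the paths and the id column from the example above (illustrative only, not the exact code that ran):
(tmp.write.format("delta")
    .mode("overwrite")                                  # replaceWhere is designed for selective overwrites
    .option("replaceWhere", "id == '{i}'".format(i=i))  # only rows matching this predicate are replaced
    .partitionBy("id")
    .save("/test/res"))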
If I partition a data set, will it be in the correct order when I read it back? For example, consider the following pyspark code:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# read a csv
df = sql_context.read.csv(input_filename)
# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))
# write out to parquet
df.write.parquet(output_path, partitionBy=['hash'])
# read back the file
df2 = sql_context.read.parquet(output_path)
I am partitioning on a customer_id bucket. When I read back the whole data set, are the partitions guaranteed to be merged back together in the original insertion order?
Right now, I'm not so sure, so I'm adding a sequence column:
df = df.withColumn('seq', monotonically_increasing_id())
However, I don't know if this is redundant.
No, it's not guaranteed. Try it with even a tiny data set:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([(1,'a'),(2,'b'),(3,'c'),(4,'d')],['customer_id', 'name'])
# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))
# write out to parquet
df.write.parquet("test", partitionBy=['hash'], mode="overwrite")
# read back the file
df2 = spark.read.parquet("test")
df.show()
+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
| 1| a| 1|
| 2| b| 2|
| 3| c| 3|
| 4| d| 0|
+-----------+----+----+
df2.show()
+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
| 2| b| 2|
| 1| a| 1|
| 4| d| 0|
| 3| c| 3|
+-----------+----+----+
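In other words, the seq column from the question is not redundant: if the original row order matters, keep it, sort on it after reading the data back, and drop it again. A minimal sketch, reusing the names from the example above:
from pyspark.sql.functions import monotonically_increasing_id

# tag every row with a monotonically increasing id before writing
df = df.withColumn('seq', monotonically_increasing_id())
df.write.parquet("test", partitionBy=['hash'], mode="overwrite")

# read back and restore the original row order
df2 = spark.read.parquet("test").orderBy('seq').drop('seq')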
On input I have a DataFrame similar to this:
+-----+-----+
|data1|data2|
+-----+-----+
| 1.0| 0.33|
| 1.0| 0|
| 2.0| 0.33|
| 1.0| 0|
| 1.0| 0.33|
| 2.0| 0.33|
+-----+-----+
After performing a pivot
pivot = df.groupBy('data1').pivot('data2').count()
the structure looks like this:
+-----+----+----+
|data1| 0|0.33|
+-----+----+----+
| 1.0| 2| 2|
| 2.0|null| 2|
+-----+----+----+
Attempting to do anything with the 0.33 column results in
AnalysisException: Can't extract value from 0#1535L;
How can I handle this case?
The problem is that your column name contains a dot, which Spark SQL interprets as access to a nested field:
Spark SQL doesn't support field names that contain dots
Solution 1
Rename the columns with new names that do not contain dots.
There are many ways to do this; here is one example:
>>> from functools import reduce
>>> oldColumns = pivot.schema.names
>>> newColumns = ["data1","col1","col2"]
>>> newPivot = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), pivot)
>>> newPivot.show()
+-----+----+----+
|data1|col1|col2|
+-----+----+----+
| 1.0| 2| 2|
| 2.0|null| 2|
+-----+----+----+
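A shorter variant of the same idea, if you just want to rename all the columns positionally, is DataFrame.toDF, which takes the new names directly and produces the same result as above:
>>> newPivot = pivot.toDF("data1", "col1", "col2")
>>> newPivot.show()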
Solution 2
Use backquotes (`) to select a column whose name contains dots, for example:
>>> newPivot = pivot.groupBy().sum("`0.33`")
>>> newPivot.show()
+---------+
|sum(0.33)|
+---------+
| 4|
+---------+
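If you would rather remove the dot altogether, you can also rename just that one column; withColumnRenamed matches the existing name literally, so no backquotes are needed there:
>>> newPivot = pivot.withColumnRenamed("0.33", "col033")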
I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it to this:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.
pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.
Here's a solution to the general case that doesn't require knowing the length of the array ahead of time, using collect, or using UDFs. Unfortunately, this only works for Spark 2.1 and above, because it requires the posexplode function.
Suppose you had the following DataFrame:
df = spark.createDataFrame(
    [
        [1, 'A, B, C, D'],
        [2, 'E, F, G'],
        [3, 'H, I'],
        [4, 'J']
    ],
    ["num", "letters"]
)
df.show()
#+---+----------+
#|num| letters|
#+---+----------+
#| 1|A, B, C, D|
#| 2| E, F, G|
#| 3| H, I|
#| 4| J|
#+---+----------+
Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.
import pyspark.sql.functions as f

df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).show()
#+---+------------+---+---+
#|num| letters|pos|val|
#+---+------------+---+---+
#| 1|[A, B, C, D]| 0| A|
#| 1|[A, B, C, D]| 1| B|
#| 1|[A, B, C, D]| 2| C|
#| 1|[A, B, C, D]| 3| D|
#| 2| [E, F, G]| 0| E|
#| 2| [E, F, G]| 1| F|
#| 2| [E, F, G]| 2| G|
#| 3| [H, I]| 0| H|
#| 3| [H, I]| 1| I|
#| 4| [J]| 0| J|
#+---+------------+---+---+
Now we create two new columns from this result. The first is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr, which allows us to use column values as parameters.
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
)\
.show()
#+---+-------+---+
#|num| name|val|
#+---+-------+---+
#| 1|letter0| A|
#| 1|letter1| B|
#| 1|letter2| C|
#| 1|letter3| D|
#| 2|letter0| E|
#| 2|letter1| F|
#| 2|letter2| G|
#| 3|letter0| H|
#| 3|letter1| I|
#| 4|letter0| J|
#+---+-------+---+
Now we can just groupBy the num and pivot the DataFrame. Putting that all together, we get:
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
)\
.groupBy("num").pivot("name").agg(f.first("val"))\
.show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#| 1| A| B| C| D|
#| 3| H| I| null| null|
#| 2| E| F| G| null|
#| 4| J| null| null| null|
#+---+-------+-------+-------+-------+
Here's another approach, in case you want to split a string on a delimiter.
import pyspark.sql.functions as f
df = spark.createDataFrame([("1:a:2001",),("2:b:2002",),("3:c:2003",)],["value"])
df.show()
+--------+
| value|
+--------+
|1:a:2001|
|2:b:2002|
|3:c:2003|
+--------+
df_split = df.select(f.split(df.value, ":")).rdd.flatMap(
    lambda x: x).toDF(schema=["col1", "col2", "col3"])
df_split.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
I don't think this transition back and forth to RDDs is going to slow you down...
Also, don't worry about the schema specification at the end: it's optional, and you can omit it to generalize the solution to data with an unknown number of columns.
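For example, if you drop the schema argument, toDF() auto-names the columns _1, _2, ..., so the same code works no matter how many pieces the split produces (a small sketch reusing the df and the f alias from above):
df_split = df.select(f.split(df.value, ":")).rdd.flatMap(
    lambda x: x).toDF()
df_split.show()  # columns come out named _1, _2, _3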
I understand your pain. Using split() can work, but it can also break, for example when the data contains the delimiter inside quoted fields.
Let's take your df and make a slight change to it:
df = spark.createDataFrame([('1:"a:3":2001',),('2:"b":2002',),('3:"c":2003',)],["value"])
df.show()
+------------+
| value|
+------------+
|1:"a:3":2001|
| 2:"b":2002|
| 3:"c":2003|
+------------+
If you try to apply split() to this as outlined above:
from pyspark.sql.functions import split

df.select(split(df.value, ":")).rdd.flatMap(
    lambda x: x).toDF(schema=["col1", "col2", "col3"]).show()
you will get
IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 3 values are provided.
So, is there a more elegant way of addressing this? I was so happy to have this pointed out to me: pyspark.sql.functions.from_csv() is your friend.
Taking my above example df:
from pyspark.sql.functions import from_csv
# Define a column schema to apply with from_csv()
col_schema = ["col1 INTEGER","col2 STRING","col3 INTEGER"]
schema_str = ",".join(col_schema)
# define the separator because it isn't a ','
options = {'sep': ":"}
# create a df from the value column using schema and options
df_csv = df.select(from_csv(df.value, schema_str, options).alias("value_parsed"))
df_csv.show()
+--------------+
| value_parsed|
+--------------+
|[1, a:3, 2001]|
| [2, b, 2002]|
| [3, c, 2003]|
+--------------+
Then we can easily flatten the df to put the values in columns:
df2 = df_csv.select("value_parsed.*").toDF("col1","col2","col3")
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a:3|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
No breaks. Data correctly parsed. Life is good. Have a beer.
Instead of Column.getItem(i) we can use Column[i].
Also, enumerate is useful when you need to create many columns.
from pyspark.sql import functions as F
Keep parent column:
for i, c in enumerate(['new_1', 'new_2']):
    df = df.withColumn(c, F.split('my_str_col', '-')[i])
or
new_cols = ['new_1', 'new_2']
df = df.select('*', *[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)])
Replace parent column:
for i, c in enumerate(['new_1', 'new_2']):
    df = df.withColumn(c, F.split('my_str_col', '-')[i])
df = df.drop('my_str_col')
or
new_cols = ['new_1', 'new_2']
df = df.select(
    *[c for c in df.columns if c != 'my_str_col'],
    *[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)]
)
I have a DataFrame and I want to randomize its rows. I tried sampling the data with a fraction of 1, which didn't work (interestingly, this works in Pandas).
It works in Pandas because taking a sample on a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes which rows end up in the sample, not their order.
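You can see the difference with a tiny example: sampling with a fraction of 1.0 returns the rows in their original order, while ordering by a random key actually reorders them (a small illustrative sketch, assuming a SparkSession named spark):
from pyspark.sql.functions import rand

df = spark.range(5).toDF("x")
df.sample(False, 1.0).show()   # (almost) all rows, still in the original order
df.orderBy(rand()).show()      # rows come back in a random order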
You can order DataFrame by a column of random numbers:
from pyspark.sql.functions import rand
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)
## +---+
## | x|
## +---+
## | 2|
## | 7|
## | 14|
## +---+
## only showing top 3 rows
but it is:
expensive, because it requires a full shuffle, which is something you typically want to avoid.
suspicious, because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing, it is relatively useless without collecting.
This code works for me without any RDD operations:
import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())
Here is a more elaborated example:
import pandas as pd
import pyspark.sql.functions as F

# create an example DataFrame
pandas_df = pd.DataFrame(([1,2],[3,1],[4,2],[7,2],[32,7],[123,3]), columns=["id","col1"])
df = sqlContext.createDataFrame(pandas_df)
df = df.select("*").orderBy(F.rand())
df.show()
+---+----+
| id|col1|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 7| 2|
| 32| 7|
|123| 3|
+---+----+
df.select("*").orderBy(F.rand()).show()
+---+----+
| id|col1|
+---+----+
| 7| 2|
|123| 3|
| 3| 1|
| 4| 2|
| 32| 7|
| 1| 2|
+---+----+