Join dataframes with list field - apache-spark

I'm using structured streaming with pyspark, and I'm reading from a kafka topic a key, which is an integer, and a value, which is a coma separated list of integers
I'm trying to make a join of this dataframe with another dataframe that I get from MongoDB. I could also mke a filter based on the values of the kafka dataframe, that are present in the column "id" of the MongoDB dataframe (though I don't know if the concept is also correct)
kafka dataframe:
key
value
1
2,9,7
MongoDB dataframe
name
id
camp_1
1
camp_2
9
camp_3
5
camp_4
7
camp_5
2
So, the result should be
name
id
camp_5
2
camp_2
9
camp_4
7
I'm thinking in a join since I've been unable to iterate over the values of the list in the "value" field of the kafka dataframe

You can try to use explode and then join
from pyspark.sql.functions import explode
kafkaData = [("1", [2, 7, 9])]
kafkaDf = spark.createDataFrame(kafkaData, schema=["key", "Value"])
mongoData = [("Camp_1", 1), ("Camp_2", 9), ("Camp_3", 5), ("Camp_4", 7), ("Camp_5", 2)]
mongoDf = spark.createDataFrame(mongoData, schema=["name", "id"])
kafkaDfExploded = kafkaDf.select(explode("Value").alias("id"))
kafkaDfExploded.join(mongoDf, "id", "left").select("name", "id").show()
output:
+------+---+
| name| id|
+------+---+
|Camp_4| 7|
|Camp_2| 9|
|Camp_5| 2|
+------+---+
You may add order by if it is relevant for you, you may also broadcast one of df if you know its going to be small.
If your list 2, 7, 9 is not an array but string you can first split it, then rest will be similar

Related

Extract values from a complex column in PySpark

I have a PySpark dataframe which has a complex column, refer to below value:
ID value
1 [{"label":"animal","value":"cat"},{"label":null,"value":"George"}]
I want to add a new column in PySpark dataframe which basically convert it into a list of strings. If Label is null, string should contain "value" and if label is not null, string should be "label:value". So for above example dataframe, the output should look like below:
ID new_column
1 ["animal:cat", "George"]
You can use transform to transform each array element into a string, which is constructed using concat_ws:
df2 = df.selectExpr(
'id',
"transform(value, x -> concat_ws(':', x['label'], x['value'])) as new_column"
)
df2.show()
+---+--------------------+
| id| new_column|
+---+--------------------+
| 1|[animal:cat, George]|
+---+--------------------+

Spark Scala sort PIVOT column

The following:
val pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.show()
I cannot recall seeing the ability to sort the pivoted column. What is the assumption of sorting? Ascending always. Cannot find it. Non-deterministic?
Tips welcome.
According to scala docs:
There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
Taking a look how the latter one works
// This is to prevent unintended OOM errors when the number of distinct values is large
val maxValues = df.sparkSession.sessionState.conf.dataFramePivotMaxValues
// Get the distinct values of the column and sort them so its consistent
val values = df.select(pivotColumn)
.distinct()
.limit(maxValues + 1)
.sort(pivotColumn) // ensure that the output columns are in a consistent logical order
.collect()
.map(_.get(0))
.toSeq
and values is passed to the former version. So when using the version that auto-detects the values, the columns are always sorted using the natural ordering of values. If you need another sorting, it is easy enough to replicate the auto-detection mechanism and then call the version with explicit values:
val df = Seq(("Foo", "UK", 1), ("Bar", "UK", 1), ("Foo", "FR", 1), ("Bar", "FR", 1))
.toDF("Product", "Country", "Amount")
df.groupBy("Product")
.pivot("Country", Seq("UK", "FR")) // natural ordering would be "FR", "UK"
.sum("Amount")
.show()
Output:
+-------+---+---+
|Product| UK| FR|
+-------+---+---+
| Bar| 1| 1|
| Foo| 1| 1|
+-------+---+---+

Grouping by name and then adding up the number of another column [duplicate]

I am using pyspark to read a parquet file like below:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame.
Is it possible to display the data frame in a table format like pandas data frame? Thanks!
The show method does what you're looking for.
For example, given the following dataframe of 3 rows, I can print just the first two rows like this:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)
which yields:
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
+---+---+
only showing top 2 rows
As mentioned by #Brent in the comment of #maxymoo's answer, you can try
df.limit(10).toPandas()
to get a prettier table in Jupyter. But this can take some time to run if you are not caching the spark dataframe. Also, .limit() will not keep the order of original spark dataframe.
Let's say we have the following Spark DataFrame:
df = sqlContext.createDataFrame(
[
(1, "Mark", "Brown"),
(2, "Tom", "Anderson"),
(3, "Joshua", "Peterson")
],
('id', 'firstName', 'lastName')
)
There are typically three different ways you can use to print the content of the dataframe:
Print Spark DataFrame
The most common way is to use show() function:
>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Print Spark DataFrame vertically
Say that you have a fairly large number of columns and your dataframe doesn't fit in the screen. You can print the rows vertically - For example, the following command will print the top two rows, vertically, without any truncation.
>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
id | 1
firstName | Mark
lastName | Brown
-RECORD 1-------------
id | 2
firstName | Tom
lastName | Anderson
only showing top 2 rows
Convert to Pandas and print Pandas DataFrame
Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.
>>> df_pd = df.toPandas()
>>> print(df_pd)
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory. If this is the case, the following configuration will help when converting a large spark dataframe to a pandas one:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
For more details you can refer to my blog post Speeding up the conversion between PySpark and Pandas DataFrames
Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe !
By default show() function prints 20 records of DataFrame. You can define number of rows you want to print by providing argument to show() function. You never know, what will be the total number of rows DataFrame will have. So, we can pass df.count() as argument to show function, which will print all records of DataFrame.
df.show() --> prints 20 records by default
df.show(30) --> prints 30 records according to argument
df.show(df.count()) --> get total row count and pass it as argument to show
If you are using Jupyter, this is what worked for me:
[1]
df= spark.read.parquet("s3://df/*")
[2]
dsp = users
[3]
%%display
dsp
This shows well-formated HTML table, you can also draw some simple charts on it straight away. For more documentation of %%display, type %%help.
Maybe something like this is a tad more elegant:
df.display()
# OR
df.select('column1').display()

How to fill out nulls according to another dataframe pyspark

I currently started using pyspark. I have a two columns dataframe with one column containing some nulls, e.g.
df1
A B
1a3b 7
0d4s 12
6w2r null
6w2r null
1p4e null
and another dataframe has the correct mapping, i.e.
df2
A B
1a3b 7
0d4s 12
6w2r 0
1p4e 3
so I want to fill out the nulls in df1 using df2 s.t. the result is:
A B
1a3b 7
0d4s 12
6w2r 0
6w2r 0
1p4e 3
in pandas, I would first create a lookup dictionary from df2 then use apply on the df1 to populate the nulls. But I'm not really sure what functions to use in pyspark, most of replacing nulls I saw is based on simple conditions, for example, filling all the nulls to be a single constant value for certain column.
What I have tried is:
from pyspark.sql.functions import when, col
df1.withColumn('B', when(df.B.isNull(), df2.where(df2.B== df1.B).select('A')))
although I was getting AttributeError: 'DataFrame' object has no attribute '_get_object_id'. The logic is to first filter out the nulls then replace it with the column B's value from df2, but I think df.B.isNull() evaluates the whole column instead of single value, which is probably not the right way to do it, any suggestions?
left join on common column A and selecting appropriate columns should get you your desired output
df1.join(df2, df1.A == df2.A, 'left').select(df1.A, df2.B).show(truncate=False)
which should give you
+----+---+
|A |B |
+----+---+
|6w2r|0 |
|6w2r|0 |
|1a3b|7 |
|1p4e|3 |
|0d4s|12 |
+----+---+

Is there a way to generate rownumber without converting the dataframe into rdd in pyspark 1.3.1? [duplicate]

I have a very big pyspark.sql.dataframe.DataFrame named df.
I need some way of enumerating records- thus, being able to access record with certain index. (or select group of records with indexes range)
In pandas, I could make just
indexes=[2,3,6,7]
df[indexes]
Here I want something similar, (and without converting dataframe to pandas)
The closest I can get to is:
Enumerating all the objects in the original dataframe by:
indexes=np.arange(df.count())
df_indexed=df.withColumn('index', indexes)
Searching for values I need using where() function.
QUESTIONS:
Why it doesn't work and how to make it working? How to add a row to a dataframe?
Would it work later to make something like:
indexes=[2,3,6,7]
df1.where("index in indexes").collect()
Any faster and simpler way to deal with it?
It doesn't work because:
the second argument for withColumn should be a Column not a collection. np.array won't work here
when you pass "index in indexes" as a SQL expression to where indexes is out of scope and it is not resolved as a valid identifier
PySpark >= 1.4.0
You can add row numbers using respective window function and query using Column.isin method or properly formated query string:
from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))
# Using DSL
indexed.where(col("index").isin(set(indexes)))
# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))
It looks like window functions called without PARTITION BY clause move all data to the single partition so above may be not the best solution after all.
Any faster and simpler way to deal with it?
Not really. Spark DataFrames don't support random row access.
PairedRDD can be accessed using lookup method which is relatively fast if data is partitioned using HashPartitioner. There is also indexed-rdd project which supports efficient lookups.
Edit:
Independent of PySpark version you can try something like this:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
row = Row("char")
row_with_index = Row("char", "index")
df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)
## +----+
## |char|
## +----+
## | a|
## | b|
## | c|
## | d|
## | e|
## +----+
## only showing top 5 rows
# This part is not tested but should work and save some work later
schema = StructType(
df.schema.fields[:] + [StructField("index", LongType(), False)])
indexed = (df.rdd # Extract rdd
.zipWithIndex() # Add index
.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])) # Map to rows
.toDF(schema)) # It will work without schema but will be more expensive
# inSet in Spark < 1.3
indexed.where(col("index").isin(indexes))
If you want a number range that's guaranteed not to collide but does not require a .over(partitionBy()) then you can use monotonicallyIncreasingId().
from pyspark.sql.functions import monotonicallyIncreasingId
df.select(monotonicallyIncreasingId().alias("rowId"),"*")
Note though that the values are not particularly "neat". Each partition is given a value range and the output will not be contiguous. E.g. 0, 1, 2, 8589934592, 8589934593, 8589934594.
This was added to Spark on Apr 28, 2015 here: https://github.com/apache/spark/commit/d94cd1a733d5715792e6c4eac87f0d5c81aebbe2
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("Atr4", monotonically_increasing_id())
If you only need incremental values (like an ID) and if there is no
constraint that the numbers need to be consecutive, you could use
monotonically_increasing_id(). The only guarantee when using this
function is that the values will be increasing for each row, however,
the values themself can differ each execution.
You certainly can add an array for indexing, an array of your choice indeed:
In Scala, first we need to create an indexing Array:
val index_array=(1 to df.count.toInt).toArray
index_array: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
You can now append this column to your DF. First, For that, you need to open up our DF and get it as an array, then zip it with your index_array and then we convert the new array back into and RDD. The final step is to get it as a DF:
final_df = sc.parallelize((df.collect.map(
x=>(x(0),x(1))) zip index_array).map(
x=>(x._1._1.toString,x._1._2.toString,x._2))).
toDF("column_name")
The indexing would be more clear after that.
monotonicallyIncreasingId() - this will assign row numbers in incresing order but not in sequence.
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 12 | xz |
|---------------------|------------------|
If you want assign row numbers use following trick.
Tested in spark-2.0.1 and greater versions.
df.createOrReplaceTempView("df")
dfRowId = spark.sql("select *, row_number() over (partition by 0) as rowNo from df")
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 2 | xz |
|---------------------|------------------|
Hope this helps.
Selecting a single row n of a Pyspark DataFrame, try:
df.where(df.id == n).show()
Given a Pyspark DataFrame:
df = spark.createDataFrame([(1, 143.5, 5.6, 28, 'M', 100000),\
(2, 167.2, 5.4, 45, 'M', None),\
(3, None , 5.2, None, None, None),\
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
Selecting the 3rd row, try:
df.where('id == 3').show()
Or:
df.where(df.id == 3).show()
Selecting multiple rows with rows' ids (the 2nd & the 3rd rows in this case), try:
id = {"2", "3"}
df.where(df.id.isin(id)).show()

Resources