Combine associated items in Spark - apache-spark

In Spark, I have a large list (millions) of elements that contain items associated with each other. Examples:
1: ("A", "C", "D") # Each of the items in this array is associated with any other element in the array, so A and C are associated, A, and D are associated, and C and D are associated.
2: ("F", "H", "I", "P")
3: ("H", "I", "D")
4: ("X", "Y", "Z")
I want to perform an operation to combine the associations where there are associations that go across the lists. In the example above, we can see that all the items of the first three lines are associated with each other (line 1 and line 2 should be combined because, according to line 3, D and I are associated). Therefore, the output should be:
("A", "C", "D", "F", "H", "I", "P")
("X", "Y", "Z")
What type of transformations in Spark can I use to perform this operation? I have looked at various ways of grouping the data, but haven't found an obvious way to combine list elements if they share common elements.
Thank you!

As a couple of users have already stated, this can be seen as a graph problem, where you want to find the connected components in a graph.
As you are using Spark, I think this is a nice opportunity to show how to use GraphFrames in Python.
To run this example you will need the pyspark and graphframes Python packages.
from pyspark.sql import SparkSession
from graphframes import GraphFrame
from pyspark.sql import functions as f

spark = (
    SparkSession.builder.appName("test")
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
    .getOrCreate()
)
# graphframes requires defining a checkpoint dir.
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
# let's create a sample dataframe
df = spark.createDataFrame(
    [
        (1, ["A", "C", "D"]),
        (2, ["F", "H", "I", "P"]),
        (3, ["H", "I", "D"]),
        (4, ["X", "Y", "Z"]),
    ],
    ["id", "values"],
)
# We can use the explode function to explode the lists into new rows, one (id, node) pair per row
df = df.withColumn("node", f.explode("values"))
df.createOrReplaceTempView("temp_table")
# Then we can join the table with itself to generate an edge table with source and destination nodes.
edge_table = spark.sql(
    """
    SELECT DISTINCT
        a.node AS src, b.node AS dst
    FROM
        temp_table a JOIN temp_table b
        ON a.id = b.id AND a.node != b.node
    """
)
# Now we define our graph using a vertex table (a table with the node ids)
# and the edge table we just created, then we use the connectedComponents
# method to find the components
cc_df = GraphFrame(
    df.selectExpr("node as id").drop_duplicates(), edge_table
).connectedComponents()
# The cc_df dataframe will have two columns, the node id and the connected component.
# To get the desired result we can group by the component and create a list
cc_df.groupBy("component").agg(f.collect_list("id")).show(truncate=False)
The output you get will look something like this (the component ids are arbitrary values assigned by the algorithm, and the ordering inside each list is not guaranteed):
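+----------+---------------------+
|component |collect_list(id)     |
+----------+---------------------+
|0         |[A, C, D, F, H, I, P]|
|1         |[X, Y, Z]            |
+----------+---------------------+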
You can install the dependencies by using:
pip install -q pyspark==3.2 graphframes

There probably isn't enough information in the question to fully solve this, but I would suggest creating an adjacency matrix/list using GraphX to represent the data as a graph. From there hopefully you can solve the rest of your problem; a rough sketch of building such an edge list follows the links below.
https://en.wikipedia.org/wiki/Adjacency_matrix
https://spark.apache.org/docs/latest/graphx-programming-guide.html
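As a rough illustration only (plain PySpark rather than GraphX, and assuming a SparkContext sc with the question's lists loaded as an RDD), the edge list could be built by pairing every item in a row with every other item in the same row:
from itertools import combinations

rows = sc.parallelize([
    ("A", "C", "D"),
    ("F", "H", "I", "P"),
    ("H", "I", "D"),
    ("X", "Y", "Z"),
])
# every unordered pair of items within a row becomes an edge
edges = rows.flatMap(lambda items: combinations(items, 2)).distinct()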

If you are using a PySpark kernel, this solution should work (the merging itself is plain Python and runs on the driver):
iset = set([frozenset(s) for s in tuple_list])  # Convert to a set of sets
result = []
while iset:                           # While there are sets left to process:
    nset = set(iset.pop())            # Pop a new set
    check = len(iset)                 # Does iset contain more sets
    while check:                      # Until no more sets to check:
        check = False
        for s in iset.copy():         # For each other set:
            if nset.intersection(s):  # if they intersect:
                check = True          # Must recheck previous sets
                iset.remove(s)        # Remove it from remaining sets
                nset.update(s)        # Add it to the current set
    result.append(tuple(nset))        # Convert back to a list of tuples
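For example, with the lists from the question as input (tuple_list is the variable the snippet above expects; since sets are unordered, the element order inside each tuple may vary):
tuple_list = [
    ("A", "C", "D"),
    ("F", "H", "I", "P"),
    ("H", "I", "D"),
    ("X", "Y", "Z"),
]
# ... run the snippet above ...
print(result)
# e.g. [('A', 'C', 'D', 'F', 'H', 'I', 'P'), ('X', 'Y', 'Z')]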

Related

Selecting specific rows with multi-level indexes

I am learning Python on DataCamp, where I am taking a course titled Data Manipulation with pandas.
In an exercise I want to select some rows from a data frame, but I get an error and I don't really understand the error message.
Here is the code
# Load packages
import pandas as pd
# Import data
sales = pd.read_pickle("https://assets.datacamp.com/production/repositories/5386/datasets/b0e855c644024f850a7df3fe8b7bf0e23f7bd2fc/walmart_sales.pkl.bz2", 'bz2')
print(sales)
# Create multi-level indexes
sales_ind = sales.set_index(["store", "type"])
print(sales_ind)
# Define the rows to select
rows_to_keep = [(1, "A"), (2, "B")]
print(rows_to_keep)
# Use .loc to select the rows
print(sales_ind.loc[rows_to_keep])
The problem is that when I run print(sales_ind.loc[rows_to_keep]) I get this error message: KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike', and I don't understand how to fix the code.
I am using Python 3.7.6 and pandas 1.0.0
I figured out your problem and it is not a problem with df.loc, but instead with the data you provided. There is no value in your multiindex that matches the tuple (2,'B').
You can see this if you try:
sales_ind.loc[(2,'B'),:]
which returns:
KeyError: (2, 'B')
while (1,'A') works just fine.
By the way, when selecting on a MultiIndex, I always recommend using pd.IndexSlice, for example:
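A minimal sketch using the sales_ind frame from the question (the index must be sorted before slicing; the store/type values here are just illustrative):
import pandas as pd

idx = pd.IndexSlice
# sort the MultiIndex, then slice stores 1-2 restricted to type "A"
print(sales_ind.sort_index().loc[idx[1:2, "A"], :])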

Spark joins performance issue

I'm trying to merge historical and incremental data. As part of the incremental data, I'm getting deletes. Below is the case.
historical data - 100 records ( 20 columns, id is the key column)
incremental data - 10 records ( 20 columns, id is the key column)
Out of the 10 records in incremental data, only 5 will match with historical data.
Now I want 100 records in the final dataframe, of which 95 records belong to the historical data and 5 records belong to the incremental data (wherever the id column matches).
Update timestamp field is available in both historical and incremental data.
Below is the approach I tried.
DF1 - Historical Data
DF2 - Incremental Delete Dataset
DF3 = DF1 LEFTANTIJOIN DF2
DF4 = DF2 INNERJOIN DF1
DF5 = DF3 UNION DF4
However, I observed that it has a lot of performance issues, as I'm running these joins on billions of records. Any better way to do this?
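In PySpark terms, a minimal sketch of the approach described above (assuming both dataframes share the same schema and id is the key; unionByName needs Spark 2.3+, otherwise use union with matching column order):
df3 = df1.join(df2, on="id", how="left_anti")           # historical rows with no incremental match
df4 = df2.join(df1.select("id"), on="id", how="inner")  # incremental rows whose id exists in historical data
df5 = df3.unionByName(df4)                              # 95 historical + 5 incremental records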
You can use the cogroup operator combined with a user-defined function to construct the different variations of the join.
Suppose we have these two RDDs as an example:
visits = sc.parallelize([("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h", "1.3.3.1")])
pageNames = sc.parallelize([("h", "Home"), ("a", "About"), ("o", "Other")])
cg = visits.cogroup(pageNames).map(lambda x: (x[0], (list(x[1][0]), list(x[1][1]))))
You can implement an inner join as follows:
innerjoin = cg.flatMap(lambda x: J(x))
Where J is defined as:
def J(x):
    j = []
    k = x[0]
    if x[1][0] != [] and x[1][1] != []:
        for l in x[1][0]:
            for r in x[1][1]:
                j.append((k, (l, r)))
    return j
For a right outer join, for example, you just need to change the J function to an roJ function defined as:
def roJ(x):
    j = []
    k = x[0]
    if x[1][0] != [] and x[1][1] != []:
        for l in x[1][0]:
            for r in x[1][1]:
                j.append((k, (l, r)))
    elif x[1][1] != []:
        for r in x[1][1]:
            j.append((k, (None, r)))
    return j
And call it like so:
rightouterjoin = cg.flatMap(lambda x: roJ(x))
And so on for the other types of join you'd wish to implement.
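For reference, a quick check of what these sketches produce on the sample RDDs above (collect() order is not deterministic, hence the sorting):
print(sorted(innerjoin.collect()))
# [('a', ('3.4.5.6', 'About')), ('h', ('1.2.3.4', 'Home')), ('h', ('1.3.3.1', 'Home'))]
print(sorted(rightouterjoin.collect()))
# [('a', ('3.4.5.6', 'About')), ('h', ('1.2.3.4', 'Home')), ('h', ('1.3.3.1', 'Home')), ('o', (None, 'Other'))]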
Performance issues are not just related to the size of your data. They depend on many other parameters, such as the keys you used for partitioning, your partition file sizes, and the cluster configuration you are running your job on. I would recommend going through the official documentation on tuning Spark jobs and making the necessary changes.
https://spark.apache.org/docs/latest/tuning.html
Below is the approach I took.
historical_data.as("a").join(
incr_data.as("b"),
$"a.id" === $"b.id", "full")
.select(historical_data.columns.map(f => expr(s"""case when a.id=b.id then b.${f} else a.${f} end as $f""")): _*)
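For anyone working in PySpark, a rough equivalent of the Scala snippet above (a sketch, assuming both dataframes have identical columns and id is the join key):
from pyspark.sql import functions as F

a = historical_data.alias("a")
b = incr_data.alias("b")
merged = a.join(b, F.col("a.id") == F.col("b.id"), "full").select(
    [F.when(F.col("a.id") == F.col("b.id"), F.col("b." + c))
      .otherwise(F.col("a." + c))
      .alias(c)
     for c in historical_data.columns]
)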

How do I give a text key to a dataframe stored as a value in a dictionary?

So I have 3 dataframes: df1, df2, df3. I'm trying to loop through each dataframe so that I can run some preprocessing: set datetime, extract the hour to a separate column, etc. However, I'm running into some issues:
If I store the df in a dict as in df_dict = {'df1' : df1, 'df2' : df2, 'df3' : df3} and then loop through it as in
for k, v in df_dict.items():
    if k == 'df1':
        v['Col1']....
    else:
        v['Coln']....
I get a NameError: name 'df1' is not defined
What am I doing wrong? I initially thought I was not reading the df1..3 data in, but that seems to operate OK (as in it doesn't fail, and it's clearly reading the data in given the time lag; they are big files). The code preceding it (for the load) is:
DF_DATA = { 'df1': 'df1.csv','df2': 'df2.csv', 'df3': 'df3.csv' }
for k, v in DF_DATA.items():
    print(k, v)         # this works to print out both key and value
    k = pd.read_csv(v)  # this does not
I am thinking this may be the cause but I'm not sure. I'm expecting the load loop to create the 3 dataframes and put them into memory. Then, in the loop at the top of the page, I want to reference the string key in my if block condition so that each df can get a slightly different preprocessing treatment.
Thanks very much in advance for your assist.
You didn't create df_dict correctly: assigning k = pd.read_csv(v) inside the loop only rebinds the loop variable k, so it never defines df1, df2, df3 or stores the dataframes anywhere. Build the dict of dataframes directly instead:
DF_DATA = { 'df1': 'df1.csv','df2': 'df2.csv', 'df3': 'df3.csv' }
df_dict= {k:pd.read_csv(v) for k,v in DF_DATA.items()}
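With df_dict built that way, the preprocessing loop from the question works as intended; a small sketch (Col1/Coln and the datetime conversion are placeholders for whatever preprocessing each frame needs):
import pandas as pd

DF_DATA = {'df1': 'df1.csv', 'df2': 'df2.csv', 'df3': 'df3.csv'}
df_dict = {k: pd.read_csv(v) for k, v in DF_DATA.items()}

for k, v in df_dict.items():
    if k == 'df1':
        v['Col1'] = pd.to_datetime(v['Col1'])  # df1-specific preprocessing
    else:
        v['Coln'] = pd.to_datetime(v['Coln'])  # preprocessing for the other frames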

How to remove rows in DataFrame on column based on another DataFrame?

I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
    Row(name='Alice', age=2),
    Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
    Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob' I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
    Row(name='Alice'),
    Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
    Row(zage=2, name='Alice'),
    Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well, there are some bugs here (the first issue looks like it is related to the same problem as SPARK-6231), so filing a JIRA looks like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name")))

Spark dataframes convert nested JSON to separate columns

I have a stream of JSONs with the following structure that gets converted to a dataframe:
{
    "a": 3936,
    "b": 123,
    "c": "34",
    "attributes": {
        "d": "146",
        "e": "12",
        "f": "23"
    }
}
The dataframe show function results in the following output:
sqlContext.read.json(jsonRDD).show
+----+-----------+---+---+
| a| attributes| b| c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+
How can I split the attributes column (nested JSON structure) into attributes.d, attributes.e and attributes.f as separate columns in a new dataframe, so I can have columns a, b, c, attributes.d, attributes.e and attributes.f in the new dataframe?
If you want columns named from a to f:
df.select("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
If you want columns named with attributes. prefix:
df.select($"a", $"b", $"c", $"attributes.d" as "attributes.d", $"attributes.e" as "attributes.e", $"attributes.f" as "attributes.f")
If names of your columns are supplied from an external source (e.g. configuration):
val colNames = Seq("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
df.select(colNames.head, colNames.tail: _*).toDF(colNames:_*)
Using the attributes.d notation, you can create new columns and you will have them in your DataFrame. Look at the withColumn() method in Java.
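In PySpark, a minimal sketch of that withColumn approach, using the sqlContext and jsonRDD from the question (underscored names are chosen here because dots in column names would need backticks):
from pyspark.sql import functions as F

df = sqlContext.read.json(jsonRDD)
flat = (df.withColumn("attributes_d", F.col("attributes.d"))
          .withColumn("attributes_e", F.col("attributes.e"))
          .withColumn("attributes_f", F.col("attributes.f"))
          .drop("attributes"))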
Use Python:
1. Extract the DataFrame using the pandas library.
2. Change the data type from 'str' to 'dict'.
3. Get the values of each feature.
4. Save the results to a new file.
import pandas as pd
data = pd.read_csv("data.csv") # load the csv file from your disk
json_data = data['Desc'] # get the DataFrame of Desc
data = data.drop('Desc', 1) # delete Desc column
Total, Defective = [], []  # output lists
for i in json_data:
    i = eval(i)                       # change the data type from 'str' to 'dict'
    Total.append(i['Total'])          # append 'Total' feature
    Defective.append(i['Defective'])  # append 'Defective' feature
# finally, complete the DataFrame
data['Total'] = Total
data['Defective'] = Defective
data.to_csv("result.csv") # save to the result.csv and check it
