Using sc.parallelize inside map() or any other solution? - apache-spark

I have the following issue: I need to find all combinations of values in column B for each id in column A and return the results as a DataFrame.
For example, given the input DataFrame below:
   A   B
0  5  10
1  1  20
2  1  15
3  3  50
4  5  14
5  1  30
6  1  15
7  3  33
I need to get the following output DataFrame (it is for GraphX / GraphFrames):
   src  dst  A
0   10   14  5
1   50   33  3
2   20   15  1
3   30   15  1
4   20   30  1
The only solution I have come up with so far is:
import itertools

df_result = df.drop_duplicates().\
    map(lambda (A, B): (A, [B])).\
    reduceByKey(lambda p, q: p + q).\
    map(lambda (A, B_values_array): (A, [k for k in itertools.combinations(B_values_array, 2)]))
print df_result.take(3)
output: [(1, [(20,15),(30,20),(30,15)]),(5,[(10,14)]),(3,[(50,33)])]
And here I'm stuck :( How do I turn this back into the DataFrame I need? One idea was to use parallelize:
import spark_sc
edges = df_result.map(lambda (A,B_pairs): spark_sc.sc.parallelize([(k[0],k[1],A) for k in B_pairs]))
spark_sc comes from a separate file named spark_sc.py:
def init():
    global sc
    global sqlContext
    sc = SparkContext(conf=conf,
                      appName="blablabla",
                      pyFiles=['my_file_with_code.py'])
    sqlContext = SQLContext(sc)
but my code fails with:
AttributeError: 'module' object has no attribute 'sc'
If I use spark_sc.sc outside of map() it works.
Any idea what I'm missing in the last step? Is it possible to use parallelize() at all here, or do I need a completely different solution?
Thanks!

You definitely need another solution which could be as simple as:
from pyspark.sql.functions import greatest, least, col
df.alias("x").join(df.alias("y"), ["A"]).select(
    least("x.B", "y.B").alias("src"), greatest("x.B", "y.B").alias("dst"), "A"
).where(col("src") != col("dst")).distinct()
where:
df.alias("x").join(df.alias("y"), ["A"])
joins table with itself by A,
least("x.B", "y.B").alias("src")
and
greatest("x.B", "y.B")
choose the lower of the two B values as the source and the higher as the destination. Finally:
where(col("src") != col("dst"))
drops self loops.
In general it is not possible to use SparkContext from an action or a transformation (not that it would make any sense to do this in your case).
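For reference, here is a minimal end-to-end sketch of this approach on the sample data from the question (the SparkSession variable name spark and the app name are assumptions, not from the original post):
from pyspark.sql import SparkSession
from pyspark.sql.functions import greatest, least, col

spark = SparkSession.builder.appName("b-combinations").getOrCreate()

# Sample input reconstructed from the question.
df = spark.createDataFrame(
    [(5, 10), (1, 20), (1, 15), (3, 50), (5, 14), (1, 30), (1, 15), (3, 33)],
    ["A", "B"])

edges = (df.alias("x").join(df.alias("y"), ["A"])
           .select(least("x.B", "y.B").alias("src"),
                   greatest("x.B", "y.B").alias("dst"),
                   "A")
           .where(col("src") != col("dst"))
           .distinct())

edges.show()  # one row per unordered pair of B values within each A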

Related

How to assign key value pair to two matrices from text file in Pyspark RDD using Python

I have a text file that looks like:
1 2 3 4
4 5 6 7
3 4 5 6
2 3 4
3 4 5
4 5 7
6 7 9
I want to create two matrices, ab and bc (e.g. 3x4 and 4x3 here), for further matrix operations, and assign key-value pairs using a PySpark RDD in Python.
I tried:
import pyspark
from pyspark import SparkContext, SparkConf
sc = SparkContext.getOrCreate()
data = sc.textFile('file.txt')
data2 = data.filter(lambda x: x.strip()).map(lambda x: x.split(' '))
I don't understand the next step of mapping the key-value pairs to the two matrices, because if I apply a function it iterates over a single row.
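No answer is attached here, but a minimal sketch of one way to build the key-value pairs, assuming the first three rows form the 3x4 matrix ab and the remaining rows the 4x3 matrix bc (the ((row, col), value) key layout is an assumption for illustration):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Parse non-empty lines into lists of integers.
rows = (sc.textFile('file.txt')
          .filter(lambda x: x.strip())
          .map(lambda x: [int(v) for v in x.split()]))

# Tag each row with its position so the file can be split into the two matrices.
indexed = rows.zipWithIndex()  # (row_values, row_index)

# Key-value pairs of the form ((row, col), value) for each matrix.
ab = (indexed.filter(lambda r: r[1] < 3)
             .flatMap(lambda r: [((r[1], j), v) for j, v in enumerate(r[0])]))
bc = (indexed.filter(lambda r: r[1] >= 3)
             .flatMap(lambda r: [((r[1] - 3, j), v) for j, v in enumerate(r[0])]))

print(ab.collect())  # [((0, 0), 1), ((0, 1), 2), ...]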

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of indices of the rows (from dataframe B) whose position column falls within an interval (specified by the start_coordinate and end_coordinate columns) in dataframe A.
The result for this task will be:
lst = [0, 1]  # because row 0 of B falls in the interval of row 1 in A, and row 1 of B falls in the interval of row 3 of A.
Using the indices I get from task 1, I want to keep those rows of dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
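For completeness, the same boolean mask also yields the list of indices asked for in task 1; a self-contained sketch using the sample data from the question:
import pandas as pd

dfA = pd.DataFrame({'Name.': ['Peter', 'Hugo', 'Jennie'],
                    'gender': ['M', 'M', 'F'],
                    'start_coordinate': [30, 4500, 300],
                    'end_coordinate': [150, 6000, 700],
                    'ID': [1, 2, 3]})
dfB = pd.DataFrame({'ID_sim.': [1, 4, 5],
                    'position': [89, 568, 938437],
                    'string': ['aa', 'bb', 'cc']})

s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]

mask = ((p >= s) & (p <= e)).any(axis=1)
lst = dfB.index[mask].tolist()   # [0, 1] -- task 1
result = dfB[mask]               # filtered rows of B -- task 2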
You could use Pandas IntervalIndex to get the positions and afterwards use a boolean mask to pull the relevant rows from B:
Create IntervalIndex:
intervals = pd.IntervalIndex.from_tuples(
    [*zip(A['start_coordinate'], A['end_coordinate'])],
    closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
This should work. Less elegant but easier to comprehend.
import pandas as pd

data = [['Name.', 'gender', 'start_coordinate', 'end_coordinate', 'ID'],
        ['Peter', 'M', 30, 150, 1],
        ['Hugo', 'M', 4500, 6000, 2],
        ['Jennie', 'F', 300, 700, 3]]
data2 = [['ID_sim.', 'position', 'string'],
         ['1', 89, 'aa'],
         ['4', 568, 'bb'],
         ['5', 938437, 'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
merged = pd.merge(df1, df2, left_index=True, right_index=True)
print(merged[(merged['position'] > merged['start_coordinate']) & (merged['position'] < merged['end_coordinate'])])

pandas: get rows from one dataframe that exist in another dataframe

I have two dataframes. They are as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6).
In df1, the numbers column contains unique numbers. These numbers also appear in df2, where they are repeated depending on the duration. I want to get all the data in df2 that matches the numbers present in df1.
Here is the code I've tried, but I'm not able to figure out how to solve this using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique values of the numbers column that are in df2, but I need all the data, including the other columns from the second dataframe.
It would be great that anyone can help me on this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same.
import pandas as pd

df1 = pd.DataFrame({"numbers": [123, 1234, 12345, 5421]})
df2 = pd.DataFrame({"numbers": [123, 1234, 12345, 123, 123, 45643],
                    "B": [1, 2, 3, 4, 5, 6], "C": [2, 3, 4, 5, 6, 7]})
final_df = df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: all rows of df2 whose numbers are present in df1 are returned:
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
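An equivalent result can be obtained with an inner merge on the same sample frames defined above; a small sketch (note that, unlike the isin mask, merge resets the row index):
# drop_duplicates guards against duplicating df2 rows if df1 ever contained repeats.
final_df = df2.merge(df1[['numbers']].drop_duplicates(), on='numbers', how='inner')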

How can I delete useless strings by index from a Pandas DataFrame defining a function?

I have a DataFrame, namely 'traj', as follows:
x y z
0 5 3 4
1 4 2 8
2 1 1 7
3 Some string here
4 This is spam
5 5 7 8
6 9 9 7
... #continues repeatedly a lot with the same strings here in index 3 and 4
79 4 3 3
80 Some string here
I'm defining a function in order to delete useless strings positioned in certain index from the DataFrame. Here is what I'm trying:
def spam(names, df):  # names is a list composed, for instance, of "Some" and "This" in 'traj'
    return df.drop(index=([traj[(traj.iloc[:, 0] == n)].index for n in names]))
But when I call it, it returns the error:
traj_clean = spam(my_list_of_names, traj)
...
KeyError: '[(3,4,...80)] not found in axis'
If I try alone:
traj.drop(index = ([traj[(traj.iloc[:,0] == 'Some')].index for n in names]))
it works.
I solved it in a different way:
df = traj[~traj[:].isin(names)].dropna()
Where names is a list of the terms you wish to delete.
df will contain only the rows without these terms.
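For reference, a sketch of how the original drop-based function could be made to work by flattening the list of Index objects before passing it to drop (names and traj are the objects from the question; this is a sketch, not the solution the asker settled on):
def spam(names, df):
    # Collect the labels of every row whose first column matches one of the names,
    # then drop them in a single call.
    idx = [i for n in names for i in df[df.iloc[:, 0] == n].index]
    return df.drop(index=idx)

traj_clean = spam(my_list_of_names, traj)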

Find a row matching multiple column criteria

I have a dataframe with 2M rows which is in the below format:
ID Number
1 30
1 40
1 60
2 10
2 30
3 60
I need to select the IDs that have both the numbers 30 and 40 present (in this case, the output should be 1).
I know we can create a new DF containing only the numbers 30 & 40 and then group by to see which IDs have a count greater than 1. But is there a way to do both in the groupby statement?
My code:
a = df[(df['Number'] == 30) | (df['Number'] == 40)]
b = a.groupby('ID')['Number'].nunique().to_frame(name='tt').reset_index()
b[b['tt'] > 1]
Use groupby filter and issubset
s = {30, 40}
df.groupby('ID').filter(lambda x: s.issubset(set(x.Number)))
Out[158]:
ID Number
0 1 30
1 1 40
2 1 60
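If only the qualifying IDs are needed (1 in this example) rather than all of their rows, a small variation on the same idea, sketched with the question's sample data:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Number': [30, 40, 60, 10, 30, 60]})
s = {30, 40}

# True for every ID whose set of Numbers contains both 30 and 40.
has_both = df.groupby('ID')['Number'].apply(lambda x: s.issubset(set(x)))
print(has_both[has_both].index.tolist())   # [1]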
I find the fact that the describe() method of Groupby objects returns a dataframe to be extremely helpful.
Output temp1 = a.groupby("ID").describe() and temp2 = a.groupby("ID").describe()["Number"] to a Jupyter notebook to see what they look like, then the following code (which follows on from yours) should make sense.
summary = a.groupby("ID").describe()["Number"]
summary.loc[summary["count"] > 1].index
I would create a df for each condition and then inner join them:
df1 = df[df.Number == 30][['ID']]
df2 = df[df.Number == 40][['ID']]
df3 = df1.merge(df2, how='inner', on='ID')
