How to assign key-value pairs to two matrices from a text file in a PySpark RDD using Python - python-3.x

I have a text file that looks like:
1 2 3 4
4 5 6 7
3 4 5 6
2 3 4
3 4 5
4 5 7
6 7 9
I want to create two matrices ab and bc (e.g. 3×4 and 4×3 here) for further matrix operations and assign key-value pairs using a PySpark RDD with Python.
I tried:
import pyspark
from pyspark import SparkContext, SparkConf
sc = SparkContext.getOrCreate()
data = sc.textFile('file.txt')
data2 = data.filter(lambda x: x.strip()).map(lambda x: x.split(' '))
I don't understand the next step: how to map key-value pairs onto the two matrices, because any function I apply iterates over a single row.
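One way to approach this (a minimal sketch, assuming the first 3 lines form the 3×4 matrix ab, the remaining 4 lines form the 4×3 matrix bc, and that ((matrix, i, j), value) keys are what you want): use zipWithIndex to recover row numbers, split the RDD by row index, and flatMap each row into indexed entries.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Parse non-empty lines into lists of ints and attach a row index.
rows = (sc.textFile('file.txt')
          .filter(lambda x: x.strip())
          .map(lambda x: [int(v) for v in x.split()])
          .zipWithIndex())                      # (row_values, row_index)

AB_ROWS = 3  # assumption: the first 3 lines are matrix ab, the remaining 4 are bc

def entries(name, offset):
    # Turn one indexed row into ((matrix, i, j), value) pairs.
    def expand(pair):
        values, idx = pair
        return [((name, idx - offset, j), v) for j, v in enumerate(values)]
    return expand

ab = rows.filter(lambda p: p[1] < AB_ROWS).flatMap(entries('ab', 0))
bc = rows.filter(lambda p: p[1] >= AB_ROWS).flatMap(entries('bc', AB_ROWS))

print(ab.take(4))  # e.g. [(('ab', 0, 0), 1), (('ab', 0, 1), 2), ...]
print(bc.take(3))  # e.g. [(('bc', 0, 0), 2), (('bc', 0, 1), 3), ...]

Once every element is keyed by its (i, j) position, the two RDDs can be re-keyed by the shared dimension and joined/reduced to implement matrix multiplication.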

Related

Compare two dataframes and export unmatched data using pandas or other packages?

I have two dataframes, and one is a subset of the other (picture below). I am not sure whether pandas can compare two dataframes, filter the data that is not in the subset, and export it as a dataframe. Or is there any package that does this kind of task?
The subset dataframe was generated by RandomUnderSampler, but RandomUnderSampler does not have a function that exports the unselected data. Any comments are welcome.
Use drop_duplicates with the keep=False parameter:
Example:
>>> df1
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
>>> df2
A B
0 0 1
1 2 3
2 6 7
>>> pd.concat([df1, df2]).drop_duplicates(keep=False)
A B
2 4 5
4 8 9
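For reference, a self-contained version of that example (a minimal sketch; df1 and df2 are rebuilt here from the output shown above):

import pandas as pd

df1 = pd.DataFrame({'A': [0, 2, 4, 6, 8], 'B': [1, 3, 5, 7, 9]})
df2 = pd.DataFrame({'A': [0, 2, 6], 'B': [1, 3, 7]})

# Rows present in both frames appear twice after the concat;
# keep=False drops every copy of them, leaving only the unmatched rows.
unmatched = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(unmatched)
#    A  B
# 2  4  5
# 4  8  9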

How can I delete useless strings by index from a Pandas DataFrame by defining a function?

I have a DataFrame, namely 'traj', as follow:
x y z
0 5 3 4
1 4 2 8
2 1 1 7
3 Some string here
4 This is spam
5 5 7 8
6 9 9 7
... #continues repeatedly a lot with the same strings here in index 3 and 4
79 4 3 3
80 Some string here
I'm defining a function in order to delete useless strings positioned at certain indices from the DataFrame. Here is what I'm trying:
def spam(names, df):  # names is a list composed, for instance, of "Some" and "This" in 'traj'
    return df.drop(index=([traj[(traj.iloc[:, 0] == n)].index for n in names]))
But when I call it, it returns an error:
traj_clean = spam(my_list_of_names, traj)
...
KeyError: '[(3,4,...80)] not found in axis'
If I try it on its own:
traj.drop(index = ([traj[(traj.iloc[:,0] == 'Some')].index for n in names]))
it works.
I solved it in a different way:
df = traj[~traj.isin(names)].dropna()
where names is a list of the terms you wish to delete; df will contain only the rows without those terms.
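A minimal self-contained sketch of that approach; the traj frame below is a hypothetical reconstruction of the example, with the junk text in the first column and NaN in the others:

import numpy as np
import pandas as pd

traj = pd.DataFrame({
    'x': [5, 4, 1, 'Some', 'This', 5, 9],
    'y': [3, 2, 1, np.nan, np.nan, 7, 9],
    'z': [4, 8, 7, np.nan, np.nan, 8, 7],
})

names = ['Some', 'This']

# Cells whose value appears in `names` are masked to NaN, and dropna()
# then removes every row containing a NaN (i.e. the junk rows).
clean = traj[~traj.isin(names)].dropna()
print(clean)  # rows 3 and 4 (the string rows) are gone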

Replace missing dataframe values with values from a reference dataframe in Python

This is regarding a project using pandas in Python 3.7
I have a reference DataFrame df1:
code name
0 1 A
2 2 B
3 3 C
4 4 D
And I have another, bigger DataFrame df2 with missing values:
code name
0 3 C
1 2
2 1 A
3 4
4 3
5 1 B
6 4
7 2
8 3 C
9 2
As you can see, df2 has missing values in the name column.
How can I fill these missing values using the reference dataframe df1?
I used the following:
df2 = df2.merge(df1, on='code', how='left')
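One possible approach (a sketch, not an answer from the thread) is to treat df1 as a code-to-name lookup and fill only the blank names, which also avoids the duplicate name_x/name_y columns a plain merge would produce; this assumes the missing names are NaN:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'code': [1, 2, 3, 4], 'name': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'code': [3, 2, 1, 4, 3, 1, 4, 2, 3, 2],
                    'name': ['C', np.nan, 'A', np.nan, np.nan,
                             'B', np.nan, np.nan, 'C', np.nan]})

# Map each code to its reference name and use it only where name is missing.
lookup = df1.set_index('code')['name']
df2['name'] = df2['name'].fillna(df2['code'].map(lookup))
print(df2)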

How to specify a random seed while using Python's numpy random choice?

I have a list of four strings. Then, in a Pandas dataframe, I want to create a variable that randomly selects a value from this list and assigns it to each row. I am using numpy's random choice, but reading its documentation, there is no seed option. How can I specify the random seed for the random assignment so that it is the same every time?
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
df['SERVICE_CODE'] = [np.random.choice(service_code_options) for i in df.index]
You need to define it beforehand with numpy.random.seed; also, the list comprehension is not necessary, because you can use numpy.random.choice with the size parameter:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'a':range(10)})
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
df['SERVICE_CODE'] = np.random.choice(service_code_options, size=len(df))
print (df)
a SERVICE_CODE
0 0 13.59P
1 1 12.42R
2 2 13.59P
3 3 13.59P
4 4 899.59O
5 5 13.59P
6 6 13.59P
7 7 12.42R
8 8 204.68L
9 9 13.59P
Documentation numpy.random.seed
np.random.seed(this_is_my_seed)
The seed can be an integer or a list of integers:
np.random.seed(300)
Or
np.random.seed([3, 1415])
Example
np.random.seed([3, 1415])
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
np.random.choice(service_code_options, 3)
array(['899.59O', '204.68L', '13.59P'], dtype='<U7')
Notice that I passed a 3 to the choice function to specify the size of the array.
numpy.random.choice
According to the notes of numpy.random.seed in the NumPy documentation:
Best practice is to use a dedicated Generator instance rather than the random variate generation methods exposed directly in the random module.
Such a Generator is constructed using np.random.default_rng.
Thus, instead of np.random.seed, the current best practice is to use np.random.default_rng with a seed to construct a Generator, which can then be used for reproducible results.
Combining jezrael's answer and the current best practice, we have:
import pandas as pd
import numpy as np
rng = np.random.default_rng(seed=121)
df = pd.DataFrame({'a':range(10)})
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
df['SERVICE_CODE'] = rng.choice(service_code_options, size=len(df))
print(df)
a SERVICE_CODE
0 0 12.42R
1 1 13.59P
2 2 12.42R
3 3 12.42R
4 4 899.59O
5 5 204.68L
6 6 204.68L
7 7 13.59P
8 8 12.42R
9 9 13.59P

Using sc.parallelize inside map() or any other solution?

I have the following issue: I need to find all combinations of values in column B for each id from column A and return the results as a DataFrame.
For example, given the input DataFrame below:
A B
0 5 10
1 1 20
2 1 15
3 3 50
4 5 14
5 1 30
6 1 15
7 3 33
I need to get the following output DataFrame (it is for GraphX/GraphFrames):
src dst A
0 10 14 5
1 50 33 3
2 20 15 1
3 30 15 1
4 20 30 1
The only solution I have come up with so far is:
df_result = df.drop_duplicates().\
    map(lambda (A, B): (A, [B])).\
    reduceByKey(lambda p, q: p + q).\
    map(lambda (A, B_values_array): (A, [k for k in itertools.combinations(B_values_array, 2)]))
print df_result.take(3)
output: [(1, [(20,15),(30,20),(30,15)]),(5,[(10,14)]),(3,[(50,33)])]
And here I'm stuck :( How do I turn this back into the DataFrame that I need? One idea was to use parallelize:
import spark_sc
edges = df_result.map(lambda (A,B_pairs): spark_sc.sc.parallelize([(k[0],k[1],A) for k in B_pairs]))
For spark_sc, I have another file named spark_sc.py:
def init():
    global sc
    global sqlContext
    sc = SparkContext(conf=conf,
                      appName="blablabla",
                      pyFiles=['my_file_with_code.py'])
    sqlContext = SQLContext(sc)
but my code fails with:
AttributeError: 'module' object has no attribute 'sc'
If I use spark_sc.sc outside of map(), it works.
Any idea what I'm missing in the last step? Is it possible to use parallelize() here at all, or do I need a completely different solution?
Thanks!
You definitely need another solution which could be as simple as:
from pyspark.sql.functions import greatest, least, col
df.alias("x").join(df.alias("y"), ["A"]).select(
least("x.B", "y.B").alias("src"), greatest("x.B", "y.B").alias("dst"), "A"
).where(col("src") != col("dst")).distinct()
where:
df.alias("x").join(df.alias("y"), ["A"])
joins the table with itself on A,
least("x.B", "y.B").alias("src")
and
greatest("x.B", "y.B")
choose the value with the lower id as the source and the higher id as the destination. Finally:
where(col("src") != col("dst"))
drops self loops.
In general it is not possible to use SparkContext from an action or a transformation (not that it would make any sense to do this in your case).
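A quick usage sketch with the example data, assuming an active SparkSession named spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import greatest, least, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(5, 10), (1, 20), (1, 15), (3, 50), (5, 14), (1, 30), (1, 15), (3, 33)],
    ["A", "B"])

# Self-join on A, order each pair with least/greatest, drop self-loops and duplicates.
edges = (df.alias("x").join(df.alias("y"), ["A"])
           .select(least("x.B", "y.B").alias("src"),
                   greatest("x.B", "y.B").alias("dst"),
                   "A")
           .where(col("src") != col("dst"))
           .distinct())

edges.show()
# Expected rows (in some order): (10, 14, 5), (33, 50, 3),
# (15, 20, 1), (15, 30, 1), (20, 30, 1)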
