Alternate or better approach to aggregateByKey in pyspark RDD - apache-spark

I have a weather data CSV file in which each entry has a station ID and the minimum or maximum value recorded for that day. The second element is a keyword that says what the value represents. Sample input is below.
stationID feature value
ITE00100554 TMAX -75
ITE00100554 TMIN -148
GM000010962 PRCP 0
EZE00100082 TMAX -86
EZE00100082 TMIN -135
ITE00100554 TMAX -60
ITE00100554 TMIN -125
GM000010962 PRCP 0
EZE00100082 TMAX -44
EZE00100082 TMIN -130
ITE00100554 TMAX -23
I have kept only the entries with TMIN or TMAX. Each entry is recorded for a given date. I have stripped the date while building my RDD as it's not of interest. My goal is to find the min and max value of each station among all of its records, i.e.,
ITE00100554, 'TMIN', <global_min_value recorded by that station>
ITE00100554, 'TMAX', <global_max_value>
EZE00100082, 'TMIN', <global_min_value>
EZE00100082, 'TMAX', <global_max_value>
I was able to accomplish this using aggregateByKey, but according to this link https://backtobazics.com/big-data/spark/apache-spark-aggregatebykey-example/ I don't have to use aggregateByKey, since the input and output value formats are the same. So I would like to know if there is an alternative or better way to code this without defining so many functions.
stationtemps = entries.filter(lambda x: x[1] in ['TMIN', 'TMAX']).map(lambda x: (x[0], (x[1], x[2]))) # (stationID, (tempkey, value))
max_temp = stationtemps.values().values().max()
min_temp = stationtemps.values().values().min()
def max_seqOp(accumulator, element):
    return (accumulator if accumulator[1] > element[1] else element)
def max_combOp(accu1, accu2):
    return (accu1 if accu1[1] > accu2[1] else accu2)
def min_seqOp(accumulator, element):
    return (accumulator if accumulator[1] < element[1] else element)
def min_combOp(accu1, accu2):
    return (accu1 if accu1[1] < accu2[1] else accu2)
station_max_temps = stationtemps.aggregateByKey(('', min_temp), max_seqOp, max_combOp).sortByKey()
station_min_temps = stationtemps.aggregateByKey(('', max_temp), min_seqOp, min_combOp).sortByKey()
min_max_temps = station_max_temps.zip(station_min_temps).collect()
import csv
with open('1800_min_max.csv', 'w') as fd:
    writer = csv.writer(fd)
    writer.writerows(map(lambda x: list(list(x)), min_max_temps))
I am learning PySpark and haven't mastered all the different transformation functions.

Here is some simulated input. If the min and max are filled in correctly, why is the TMIN/TMAX indicator needed? Indeed, there is no need for an accumulator.
rdd = sc.parallelize([ ('s1','tmin',-3), ('s1','tmax', 5), ('s2','tmin',0), ('s2','tmax', 7), ('s0','tmax',14), ('s0','tmin', 3) ])
rddcollect = rdd.collect()
#print(rddcollect)
rdd2 = rdd.map(lambda x: (x[0], x[2]))
#rdd2collect = rdd2.collect()
#print(rdd2collect)
rdd3 = rdd2.groupByKey().sortByKey()
rdd4 = rdd3.map(lambda k_v: ( k_v[0], (sorted(k_v[1]))) )
rdd4.collect()
returns:
Out[27]: [('s0', [3, 14]), ('s1', [-3, 5]), ('s2', [0, 7])]
ALTERNATE ANSWER
After clarification, and assuming that the min and max values make sense, here is a version with my own data. There are other solutions, BTW.
Here goes:
include = ['tmin','tmax']
rdd0 = sc.parallelize([ ('s1','tmin',-3), ('s1','tmax', 5), ('s2','tmin',0), ('s2','tmin',-12), ('s2','tmax', 7), ('s2','tmax', 17), ('s2','tother', 17), ('s0','tmax',14), ('s0','tmin', 3) ])
rdd1 = rdd0.filter(lambda x: any(e in x for e in include) )
rdd2 = rdd1.map(lambda x: ( (x[0],x[1]), x[2]))
rdd3 = rdd2.groupByKey().sortByKey()
rdd4Min = rdd3.filter(lambda k_v: k_v[0][1] == 'tmin').map(lambda k_v: ( k_v[0][0], min( k_v[1] ) ))
rdd4Max = rdd3.filter(lambda k_v: k_v[0][1] == 'tmax').map(lambda k_v: ( k_v[0][0], max( k_v[1] ) ))
rdd5=rdd4Min.union(rdd4Max)
rdd6 = rdd5.groupByKey().sortByKey()
res = rdd6.map(lambda k_v: ( k_v[0], (sorted(k_v[1]))))
rescollect = res.collect()
print(rescollect)
returns:
[('s0', [3, 14]), ('s1', [-3, 5]), ('s2', [-12, 17])]

Following the same logic as @thebluephantom, this was my final code when reading from CSV:
def get_temp_points(item):
    if item[0][1] == 'TMIN':
        return (item[0], min(item[1]))
    else:
        return (item[0], max(item[1]))
data = lines.filter(lambda x: any(ele for ele in x if ele in ['TMIN', 'TMAX']))
temps = data.map(lambda x: ((x[0], x[2]), float(x[3])))
temp_list = temps.groupByKey().mapValues(list)
# ((stationID, 'TMIN'/'TMAX'), listofvalues)
min_max_temps = temp_list.map(get_temp_points).collect()
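Because the value format does not change between input and output, as the question notes, a plain two-argument reducer is enough and no zero value or separate seqOp/combOp is needed. The sketch below is plain Python, not Spark: `reduce_by_key` is a hypothetical stand-in that mimics what `RDD.reduceByKey` does for each key, and the records are copied from the sample input in the question. In PySpark the corresponding step would presumably be a filter on the feature followed by `reduceByKey(min)` or `reduceByKey(max)`.

```python
# Sample records copied from the question: (stationID, feature, value).
records = [
    ("ITE00100554", "TMAX", -75), ("ITE00100554", "TMIN", -148),
    ("GM000010962", "PRCP", 0),
    ("EZE00100082", "TMAX", -86), ("EZE00100082", "TMIN", -135),
    ("ITE00100554", "TMAX", -60), ("ITE00100554", "TMIN", -125),
    ("GM000010962", "PRCP", 0),
    ("EZE00100082", "TMAX", -44), ("EZE00100082", "TMIN", -130),
    ("ITE00100554", "TMAX", -23),
]


def reduce_by_key(pairs, func):
    """Hypothetical helper: fold values sharing a key with a two-argument function."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


# One pass per feature; the built-in min and max serve as the reducers.
station_min = reduce_by_key(((s, v) for s, f, v in records if f == "TMIN"), min)
station_max = reduce_by_key(((s, v) for s, f, v in records if f == "TMAX"), max)

print(station_min)
print(station_max)
```

Unlike the groupByKey versions, a pairwise reducer never needs the full list of values for a station in memory at once.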

Related

How does LabelEncoder() encode values?

I want to know how LabelEncoder() works.
This is part of my code:
for att in all_features_test:
    if (str(test_home_data[att].dtypes) == 'object'):
        test_home_data[att].fillna( 'Nothing', inplace = True)
        train_home_data[att].fillna( 'Nothing', inplace = True)
        train_home_data[att] = LabelEncoder().fit_transform(train_home_data[att])
        test_home_data[att] = LabelEncoder().fit_transform(test_home_data[att])
    else:
        test_home_data[att].fillna( 0, inplace = True)
        train_home_data[att].fillna( 0, inplace = True)
Both the train and test data sets have an attribute 'Condition' which can hold the values Bad, Average and Good.
Let's say LabelEncoder() encodes Bad as 0, Average as 2, and Good as 1 in train_home_data. Would that be the same for test_home_data?
If not, what should I do?
You should not encode the labels after the split, but before.
The unique labels (= classes) are ordered alphabetically, see uniques = sorted(set(values)) in this source code snippet from sklearn.preprocessing.LabelEncoder, whose documentation page links to the [source] on the upper right.
python method:
def _encode_python(values, uniques=None, encode=False):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        uniques = sorted(set(values))
        uniques = np.array(uniques, dtype=values.dtype)
    if encode:
        table = {val: i for i, val in enumerate(uniques)}
        try:
            encoded = np.array([table[v] for v in values])
        except KeyError as e:
            raise ValueError("y contains previously unseen labels: %s"
                             % str(e))
        return uniques, encoded
    else:
        return uniques
Same for numpy arrays as classes, see return np.unique(values), because unique() sorts by default:
numpy method:
def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        if encode:
            uniques, encoded = np.unique(values, return_inverse=True)
            return uniques, encoded
        else:
            # unique sorts
            return np.unique(values)
    if encode:
        if check_unknown:
            diff = _encode_check_unknown(values, uniques)
            if diff:
                raise ValueError("y contains previously unseen labels: %s"
                                 % str(diff))
        encoded = np.searchsorted(uniques, values)
        return uniques, encoded
    else:
        return uniques
You can never be sure that the test set and training set have exactly the same classes. The training or testing set might simply lack one of the three classes of the label column 'Condition'.
If you desperately want to encode after the train/test split, you need to check that the classes are the same in both sets before encoding.
Quoting the script:
Uses pure python method for object dtype, and numpy method for all
other dtypes.
python method (object type):
assert sorted(set(train_home_data[att])) == sorted(set(test_home_data[att]))
numpy method (all other types):
assert np.array_equal(np.unique(train_home_data[att]), np.unique(test_home_data[att]))
I got the answer for this I guess.
Code
data1 = [('A', 1), ('B', 2),('C', 3) ,('D', 4)]
data2 = [('D', 1), ('A', 2),('A', 3) ,('B', 4)]
df1 = pd.DataFrame(data1, columns = ['col1', 'col2'])
df2 = pd.DataFrame(data2, columns = ['col1', 'col2'])
print(df1['col1'])
print(df2['col1'])
df1['col1'] = LabelEncoder().fit_transform(df1['col1'])
df2['col1'] = LabelEncoder().fit_transform(df2['col1'])
print(df1['col1'])
print(df2['col1'])
Output
0 A
1 B
2 C
3 D
Name: col1, dtype: object # df1
0 D
1 A
2 A
3 B
Name: col1, dtype: object # df2
0 0
1 1
2 2
3 3
Name: col1, dtype: int64 #df1 encoded
0 2
1 0
2 0
3 1
Name: col1, dtype: int64 #df2 encoded
B of df1 is encoded to 1,
and
B of df2 is encoded to 1 as well.
Note, however, that D is encoded to 3 in df1 but to 2 in df2, because df2 contains no C. So separately encoded training and testing sets get the same codes only if both contain the same set of values.
I would suggest fitting the label encoder on one dataset and transforming both:
data1 = [('A', 1), ('B', 2),('C', 3) ,('D', 4)]
data2 = [('D', 1), ('A', 2),('A', 3) ,('B', 4)]
df1 = pd.DataFrame(data1, columns = ['col1', 'col2'])
df2 = pd.DataFrame(data2, columns = ['col1', 'col2'])
# here comes the new code:
le = LabelEncoder()
df1['col1'] = le.fit_transform(df1['col1'])
df2['col1'] = le.transform(df2['col1'])
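Why fitting once matters can be seen in a plain-Python sketch of the sorted-unique encoding quoted earlier. This is not scikit-learn itself: `fit` and `transform` are hypothetical helpers that only imitate the `sorted(set(values))` lookup table, and the data is the `col1` column from the example above.

```python
def fit(values):
    """Return the sorted unique classes, like sorted(set(values)) in the quoted source."""
    return sorted(set(values))


def transform(classes, values):
    """Map each value to its index in classes; reject labels that were never seen."""
    table = {c: i for i, c in enumerate(classes)}
    unseen = sorted(set(values) - set(table))
    if unseen:
        raise ValueError("previously unseen labels: %s" % unseen)
    return [table[v] for v in values]


train = ["A", "B", "C", "D"]   # col1 of df1
test = ["D", "A", "A", "B"]    # col1 of df2

separate = transform(fit(test), test)   # encoder fitted on the test column alone
shared = transform(fit(train), test)    # encoder fitted on train, reused for test

print(separate)  # D gets code 2 here because C is missing from test
print(shared)    # D keeps code 3, the same code it has in train
```

Only the shared encoder assigns D the same code in both sets. It would raise an error if the test column held a class that the training column lacks.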

How to iterate over dfs and append data with combine names

I have this problem to solve. It is a continuation of a previous question, How to iterate over pandas df with a def function variable function, and the given answer worked perfectly, but now I have to append all the data into a two-column dataframe (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "adducts" for a given "Compound". Both represent numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where Exact_mass = number, M and Charge = numbers (1, 2, 3, etc.) according to each type of adduct, and Adduct_mass = number (positive or negative) according to each adduct.
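As a quick check of the formula, here is a minimal sketch in plain Python. The function name `adduct_value` and its parameter names are hypothetical; the numbers are taken from the sample data frames given further down in the question (note that in df_al the "Charge" column comes before "M").

```python
def adduct_value(exact_mass, m, charge, adduct_mass):
    """Exact_mass * M / Charge + Adduct_mass, as stated in the question."""
    return exact_mass * m / charge + adduct_mass


# "M+3H" row of df_al: Charge = 3, M = 1, Adduct_mass = 1.007276
m_plus_3h = adduct_value(596.465179, m=1, charge=3, adduct_mass=1.007276)

# "2M+H" row of df_al: Charge = 1, M = 2, Adduct_mass = 1.007276
two_m_plus_h = adduct_value(596.465179, m=2, charge=1, adduct_mass=1.007276)

print(round(m_plus_3h, 6), round(two_m_plus_h, 6))
```

Both values agree with the first row of the output table shown later in the question.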
My data: two data frames. One has the adduct names, M, Charge and Adduct_mass. The other holds the Compound_name and Exact_mass of the compounds I want to iterate over (I include just a small data set).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1,
1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4,
"C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3",
316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
#Defining general function
def Adduct(x,i):
    return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function in a range from 0 to 5.
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
Those are the right calculations, but I now need a file where:
- only 2 columns exist (Name and Mass)
- all the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
I also need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice, and a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
RT is the same value for all forms of a compound; in this example RT for a = 1, b = 3, c = 2, etc.
Is it possible to keep this column from the data set df (which I update below)? As you can see, df has more columns, like "Formula" and "RT", which disappear after the calculations.
import pandas as pd
data1 = [["a", "C3H64O7", 596.465179, 1], ["b", "C30H42O7", 514.293038, 3], ["c",
"C44H56O8", 712.397498, 2], ["d", "C24H32O6S", 448.191949, 4], ["e", "C20H28O3",
316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (sorry and thank you)
This is a trial I did on a small data set (df) using the code below, with the same df_al as above.
df=
Code
#Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID= df["Name"]
#Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))
#Removing RT column
df=df.drop(columns=["RT"])
#Defining general function
def Adduct(x,i):
    return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function in a range from 0 to 46.
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
df
output
#Melting
df = pd.melt(df, id_vars=['Name'], var_name = "Adduct", value_name= "Exact_mass", value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
output
Why NaN?
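One plausible cause, assuming the real compound names are longer than one character (the actual df is not shown here): when x is a string, `x[0]` is only its first character, so `RT[x[0]]` finds nothing and the fallback `np.nan` is used. The sketch below is plain Python rather than pandas. The names `cmpd_a` and `cmpd_b` are hypothetical, and the masses are copied from the output table earlier in the question.

```python
# Wide results: compound name -> {adduct name: mass}.
# Names are hypothetical multi-character ones, chosen to expose the lookup problem.
wide = {
    "cmpd_a": {"M+3H": 199.829002, "M+H": 597.472455},
    "cmpd_b": {"M+3H": 172.438289, "M+H": 515.300314},
}
RT = {"cmpd_a": 1, "cmpd_b": 3}

# Lookup by first character, which is what RT[x[0]] does for a string x.
# "c" is not a key of RT, so every lookup misses (None here, NaN in pandas).
first_char_hits = [RT.get(name[0]) for name in wide]

# Long format: look RT up with the full name before merging it with the adduct.
long_rows = [
    (name + "_" + adduct, mass, RT[name])
    for name, masses in wide.items()
    for adduct, mass in masses.items()
]

print(first_char_hits)
for row in long_rows:
    print(row)
```

Looking RT up before the Name column is deleted or merged into the combined name avoids having to parse the compound name back out of a string such as `cmpd_a_M+3H`.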
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep=r"\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
del df['Name']
del df['variable']
RT = {'a':1, 'b':2, 'c':3, 'd':5, 'e':1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Here is the output:

New column based on a row with conditions in Pandas

I'm trying to do an operation with DataFrames, but I'm not sure how to solve the problem using the built-in pandas operations (actually my code is based on a for loop, so I'm trying to build a more elegant solution).
Given the following DataFrames, defined with the columns described below:
original_df = [o1, o2, o3, o4]
weights_df = [w1, w2, w3, w4]
conditions_df = [c1, c2, c3, c4]
I need to build a new column on original_df based on the division o1/w1, but depending on the value of c1, which takes the values "+" or "-": for "-" I need to compute -o1/w1 instead.
All I have done so far is:
original_df['newcolumn'] = original_df / weights_df
Here, of course, I divide the two terms without applying the condition. I'm trying to do it with the map and apply functions, but I'm not sure how to add the third column to the function.
original_df = [100, 200, 300, 400]
weights_df = [10, 20, 30, 40]
conditions_df = [1, 2, 3, 4]
df = pd.DataFrame({'x':original_df, 'y':weights_df, 'z':conditions_df})
def div(x, y, z):
    if z > 2:
        return float(x/y)
    else:
        return float(-1*x/y)
df['new_feature'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
This is one way of solving it. If your conditions_df contains '+'/'-', then you can change the condition in def div(x, y, z) accordingly.
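For the '+'/'-' case, one way to adapt the condition is to map each symbol to a multiplier. This is a minimal plain-Python sketch: the `conditions` list and the `sign` dictionary are assumptions made for illustration, not data from the question.

```python
original = [100, 200, 300, 400]
weights = [10, 20, 30, 40]
conditions = ["-", "-", "+", "+"]   # hypothetical sign column

# Translate each symbol into the factor applied to the quotient.
sign = {"+": 1, "-": -1}

new_feature = [sign[c] * o / w for o, w, c in zip(original, weights, conditions)]
print(new_feature)
```

In pandas the same lookup can be written as `df['z'].map({'+': 1, '-': -1})` and multiplied into `df['x'] / df['y']`.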
You can use numpy.where to pick the sign from the condition:
#data from lisa answer
#df = pd.DataFrame({'x':original_df, 'y':weights_df, 'z':conditions_df})
df['new_feature'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
print (df)
x y z new_feature
0 100 10 1 -10.0
1 200 20 2 -10.0
2 300 30 3 10.0
3 400 40 4 10.0
Timings:
#4k rows
df = pd.concat([df]*1000).reset_index(drop=True)
#lisa answer
In [95]: %timeit df['new_feature1'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
10 loops, best of 3: 123 ms per loop
In [96]: %timeit df['new_feature2'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
1000 loops, best of 3: 595 µs per loop

Spark Accumulator confusion

I'm writing a Spark job that takes in data from multiple sources, filters bad input rows, and outputs a slightly modified version of the input. The job has two additional requirements:
I must keep track of the number of bad input rows per source to notify those upstream providers.
I must support an output limit per source.
The job seemed straightforward, and I approached the problem using accumulators to keep track of the number of filtered rows per source. However, when I implemented the final .limit(N), my accumulator behavior changed. Here's some stripped-down sample code that triggers the behavior on a single source:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import *
from random import randint
def filter_and_transform_parts(rows, filter_int, accum):
    for r in rows:
        if r[0] == filter_int:
            accum.add(1)
            continue
        yield r[0], r[1] + 1, r[2] + 1
def main():
    spark = SparkSession \
        .builder \
        .appName("Test") \
        .getOrCreate()
    sc = spark.sparkContext
    accum = sc.accumulator(0)
    # 20 inputs w/ tuple having 4 as first element
    inputs = [(4, randint(1, 10), randint(1, 10)) if x % 5 == 0 else (randint(6, 10), randint(6, 10), randint(6, 10)) for x in xrange(100)]
    rdd = sc.parallelize(inputs)
    # filter out tuples where 4 is first element
    rdd = rdd.mapPartitions(lambda r: filter_and_transform_parts(r, 4, accum))
    # if not limit, accumulator value is 20
    # if limit and limit_count <= 63, accumulator value is 0
    # if limit and limit_count >= 64, accumulator value is 20
    limit = True
    limit_count = 63
    if limit:
        rdd = rdd.map(lambda r: Row(r[0], r[1], r[2]))
        df_schema = StructType([StructField("val1", IntegerType(), False),
                                StructField("val2", IntegerType(), False),
                                StructField("val3", IntegerType(), False)])
        df = spark.createDataFrame(rdd, schema=df_schema)
        df = df.limit(limit_count)
        df.write.mode("overwrite").csv('foo/')
    else:
        rdd.saveAsTextFile('foo/')
    print "Accum value: {}".format(accum.value)
if __name__ == "__main__":
    main()
The problem is that my accumulator sometimes reports the number of filtered rows and sometimes doesn't, depending on the limit specified and number of inputs for a source. However, in all situations the filtered rows don't make it into the output meaning the filter occurred and the accumulator should have a value.
If you can shed some light on this that'd be very helpful, thanks!
Update:
Adding a rdd.persist() call after mapPartitions made the accumulator behavior consistent.
Actually, it doesn't matter what the value of limit_count is.
The reason the accum value is sometimes 0 is that you update the accumulator inside transformations (e.g. rdd.map, rdd.mapPartitions).
Spark only guarantees that accumulators work reliably inside actions (e.g. rdd.foreach).
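The effect of laziness can be illustrated without Spark. The sketch below is only an analogy in plain Python: a generator plays the role of the lazy transformation, the hypothetical `Counter` class stands in for the accumulator, and `itertools.islice` stands in for a limit. It is not a description of how Spark schedules the job, only of why a side effect inside lazily evaluated code depends on how much of the data is actually consumed.

```python
from itertools import islice


class Counter:
    """Hypothetical stand-in for an accumulator."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def filter_and_count(rows, filter_int, counter):
    # Lazy: nothing here runs until the generator is consumed.
    for r in rows:
        if r[0] == filter_int:
            counter.add(1)
            continue
        yield r


rows = [(4, i) if i % 5 == 0 else (7, i) for i in range(100)]

lazy_counter = Counter()
lazy = filter_and_count(rows, 4, lazy_counter)
count_before_consumption = lazy_counter.value   # generator not started yet

first_ten = list(islice(lazy, 10))              # consume only ten output rows
count_after_partial = lazy_counter.value        # counts only the rows scanned so far

full_counter = Counter()
all_rows = list(filter_and_count(rows, 4, full_counter))
count_after_full = full_counter.value           # every input row was scanned

print(count_before_consumption, count_after_partial, count_after_full)
```

Only the fully consumed run sees all 20 filtered rows; the limited run stops early and reports a partial count.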
Let's make a small change to your code:
from pyspark.sql import *
from random import randint
def filter_and_transform_parts(rows, filter_int, accum):
    for r in rows:
        if r[0] == filter_int:
            accum.add(1)
def main():
    spark = SparkSession.builder.appName("Test").getOrCreate()
    sc = spark.sparkContext
    print(sc.applicationId)
    accum = sc.accumulator(0)
    inputs = [(4, x * 10, x * 100) if x % 5 == 0 else (randint(6, 10), x * 10, x * 100) for x in xrange(100)]
    rdd = sc.parallelize(inputs)
    rdd.foreachPartition(lambda r: filter_and_transform_parts(r, 4, accum))
    limit = True
    limit_count = 10 or 'whatever'
    if limit:
        rdd = rdd.map(lambda r: Row(val1=r[0], val2=r[1], val3=r[2]))
        df = spark.createDataFrame(rdd)
        df = df.limit(limit_count)
        df.write.mode("overwrite").csv('file:///tmp/output')
    else:
        rdd.saveAsTextFile('file:///tmp/output')
    print "Accum value: {}".format(accum.value)
if __name__ == "__main__":
    main()
The accum value is equal to 20 every time.
For more information:
http://spark.apache.org/docs/2.0.2/programming-guide.html#accumulators

Reduce ResultIterable objects after groupByKey in PySpark

I'm working on temperature forecasting data using PySpark.
The raw temperature data is in the following format:
station;date;time;temperature;quality
102170;2012-11-01;06:00:00;6.8;G
102185;2012-11-02;06:00:00;5.8;G
102170;2013-11-01;18:00:00;2.8;G
102185;2013-11-01;18:00:00;7.8;G
The target result is the min/max temperature for each year, along with the station where it was recorded, like the following:
year;station;max_temp
2013;102185;7.8
2012;102170;6.8
My current code is as follows:
sc = SparkContext(appName="maxMin")
lines = sc.textFile('data/temperature-readings.csv')
lines = lines.map(lambda a: a.split(";"))
lines = lines.filter(lambda x: int(x[1][0:4]) >= 1950 and int(x[1][0:4]) <= 2014)
temperatures = lines.map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
So far, the result is as follows:
temperatures.take(4)
(2012, (102170,6.8))
(2012, (102185,5.8))
(2013, (102170,2.8))
(2013, (102185,7.8))
After grouping by key, it becomes the following:
temperatures = temperatures.groupByKey()
temperatures.take(2)
[(u'2012', <pyspark.resultiterable.ResultIterable object at 0x2a0be50>),
(u'2013', <pyspark.resultiterable.ResultIterable object at 0x2a0bc50>)]
So, how can I reduce these ResultIterable objects to get only the element with the min or max temperature?
Just don't. Use reduceByKey:
lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))) \
    .mapValues(lambda x: (x, x)) \
    .reduceByKey(lambda x, y: (
        min(x[0], y[0], key=lambda x: x[1]),
        max(x[1], y[1], key=lambda x: x[1])))
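The merge logic can be checked without a cluster. The following is a plain-Python sketch, not Spark: each record is duplicated into a (min_record, max_record) pair and pairs for the same year are merged two at a time, which is the kind of function a reduce-by-key step expects. The `merge` helper and the dictionary loop are assumptions made for illustration, and the readings are the four sample rows from the question.

```python
# (year, (station, temperature)) pairs from the sample data in the question.
readings = [
    ("2012", ("102170", 6.8)),
    ("2012", ("102185", 5.8)),
    ("2013", ("102170", 2.8)),
    ("2013", ("102185", 7.8)),
]


def by_temp(record):
    """Sort key: the temperature of a (station, temperature) record."""
    return record[1]


def merge(a, b):
    """Combine two (min_record, max_record) pairs into one."""
    return (min(a[0], b[0], key=by_temp), max(a[1], b[1], key=by_temp))


extremes = {}
for year, record in readings:
    pair = (record, record)   # a single record is both its own min and max
    extremes[year] = merge(extremes[year], pair) if year in extremes else pair

for year, (coldest, warmest) in sorted(extremes.items()):
    print(year, coldest, warmest)
```

Because `merge` takes and returns the same shape of value, it can be applied in any order, so no per-year list ever has to be built.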

Resources