Ranking of single datapoint against reference dataset - python-3.x

I have the following hypothetical dataframe:
import pandas as pd

data = {'score1': [60, 30, 80, 120],
        'score2': [20, 21, 19, 18],
        'score3': [12, 43, 71, 90]}

# Create the pandas DataFrame
df = pd.DataFrame(data)
# Calculate the percentile ranks of the reference columns
df['score1_rank'] = df['score1'].rank(pct=True)
df['score2_rank'] = df['score2'].rank(pct=True)
df['score3_rank'] = df['score3'].rank(pct=True)
I then have individual datapoints that I would like to rank against the reference, for example:
data_to_test = {'score1': [12],
                'score2': [4],
                'score3': [6]}
How could I compare these new values against this reference?
Thank you for any help!
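One possible approach (a sketch, not from the original thread): for each score column, compute the fraction of reference values that the new value meets or exceeds, which mirrors what rank(pct=True) reports for the reference rows; scipy.stats.percentileofscore gives the same idea on a 0-100 scale. Assuming df and data_to_test as defined above:
new = pd.Series({k: v[0] for k, v in data_to_test.items()})
# Fraction of reference values less than or equal to each new value
pct = {col: (df[col] <= new[col]).mean() for col in ['score1', 'score2', 'score3']}
print(pct)  # {'score1': 0.0, 'score2': 0.0, 'score3': 0.0} for these example values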

Related

openpyxl : Update multiple columns & rows from dictionary

I have a nested dictionary:
aDictionary = {'Asset': {'Name': 'Max', 'Age': 28, 'Job': 'Nil'}, 'Parameter': {'Marks': 60, 'Height': 177, 'Weight': 76}}
I want to write the values to an Excel sheet as follows:
|Asset |Name |Max|
|Asset |Age |28 |
|Asset |Job |Nil|
|Parameter|Marks |60 |
|Parameter|Height|177|
|Parameter|Weight|76 |
I tried something like this, but the result is not what I was expecting. I'm pretty new to openpyxl and can't seem to wrap my head around it.
from openpyxl import *

workbook = load_workbook('Empty.xlsx')
worksheet = workbook['Sheet1']
for m in range(1, 7):
    for i in aDictionary:
        worksheet["A" + str(m)].value = i
        for j, k in aDictionary[i].items():
            worksheet["B" + str(m)].value = j
            worksheet["C" + str(m)].value = k
workbook.save('Empty.xlsx')
One way to do this is to convert the dictionary to a DataFrame, stack it the way you indicated, rearrange the columns, and then load it into Excel. I've used pandas to_excel as it is a single line of code, but you can use load_workbook() as well.
The stacking part was borrowed from here.
Code:
import pandas as pd

aDictionary = {'Asset': {'Name': 'Max', 'Age': 28, 'Job': 'Nil'}, 'Parameter': {'Marks': 60, 'Height': 177, 'Weight': 76}}
df = pd.DataFrame(aDictionary)    # Convert to a DataFrame
df = df.stack().reset_index()     # Stack into (key, category, value) rows

# Rearrange the columns to the order you want
cols = df.columns.tolist()
cols[0], cols[1] = cols[1], cols[0]
df = df[cols]

# Write to Excel
df.to_excel('Empty.xlsx', sheet_name='Sheet1', index=False, header=None)
Output in Excel
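If you'd rather write with openpyxl directly, as mentioned above, a minimal sketch (assuming the same stacked and reordered df) could look like this:
from openpyxl import load_workbook

workbook = load_workbook('Empty.xlsx')
worksheet = workbook['Sheet1']
# Write one worksheet row per (category, key, value) row of the stacked DataFrame
for row_idx, (category, key, value) in enumerate(df.itertuples(index=False), start=1):
    worksheet.cell(row=row_idx, column=1, value=category)
    worksheet.cell(row=row_idx, column=2, value=key)
    worksheet.cell(row=row_idx, column=3, value=value)
workbook.save('Empty.xlsx')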

Is there a numpy way of looping/getting sub arrays of an array to extract info?

First of all, thank you for the time you took to answer me.
To give a small example, I have a huge dataset (n instances, 3 features) like this:
data = np.array([[7.0, 2.5, 3.1], [4.3, 8.8, 6.2], [1.1, 5.5, 9.9]])
It's labeled in another array:
label = np.array([0, 1, 0])
Questions:
I know that I can solve my problem with a Python for loop, but I'm looking for a numpy way (without a for loop) to reduce the time consumption (i.e., do it as fast as possible).
If there isn't a way without a for loop, which would be the best one (M1, M2, or some other wizardry)?
My solution:
clusters = []
for lab in range(label.max() + 1):
    # M1: creating a new object
    c = data[label == lab]
    clusters.append([c.min(axis=0), c.max(axis=0)])
    # M2: comparing multiple times (called views?)
    # clusters.append([data[label == lab].min(axis=0), data[label == lab].max(axis=0)])
print(clusters)
# [[array([1.1, 2.5, 3.1]), array([7. , 5.5, 9.9])], [array([4.3, 8.8, 6.2]), array([4.3, 8.8, 6.2])]]
You could start from an easier variant of this problem:
Given arr and its label, could you find a minimum and maximum values of arr items in each group of labels?
For instance:
arr = np.array([55, 7, 49, 65, 46, 75, 4, 54, 43, 54])
label = np.array([1, 3, 2, 0, 0, 2, 1, 1, 1, 2])
Then you would expect the minimum and maximum values of arr in each label group to be:
min_values = np.array([46, 4, 49, 7])
max_values = np.array([65, 55, 75, 7])
Here is a numpy approach to this kind of problem:
def groupby_minmax(arr, label, return_groups=False):
    # Sort values by label so each group occupies a contiguous block
    arg_idx = np.argsort(label)
    arr_sort = arr[arg_idx]
    label_sort = label[arg_idx]
    # Indices where a new label group starts
    div_points = np.r_[0, np.flatnonzero(np.diff(label_sort)) + 1]
    min_values = np.minimum.reduceat(arr_sort, div_points)
    max_values = np.maximum.reduceat(arr_sort, div_points)
    if return_groups:
        return min_values, max_values, label_sort[div_points]
    else:
        return min_values, max_values
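As a quick check on the 1-D example above (assuming numpy is imported as np), this reproduces the expected values:
min_values, max_values, groups = groupby_minmax(arr, label, return_groups=True)
print(groups)      # [0 1 2 3]
print(min_values)  # [46  4 49  7]
print(max_values)  # [65 55 75  7]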
Now there's not much to change in order to adapt it to your use case:
def groupby_minmax_OP(arr, label, return_groups=False):
    arg_idx = np.argsort(label)
    arr_sort = arr[arg_idx]
    label_sort = label[arg_idx]
    div_points = np.r_[0, np.flatnonzero(np.diff(label_sort)) + 1]
    min_values = np.minimum.reduceat(arr_sort, div_points, axis=0)
    max_values = np.maximum.reduceat(arr_sort, div_points, axis=0)
    if return_groups:
        return min_values, max_values, label_sort[div_points]
    else:
        return np.array([min_values, max_values]).swapaxes(0, 1)
groupby_minmax_OP(data, label)
Output:
array([[[1.1, 2.5, 3.1],
        [7. , 5.5, 9.9]],

       [[4.3, 8.8, 6.2],
        [4.3, 8.8, 6.2]]])
This has already been answered; you can go to this link for your answer: python numpy access list of arrays without for loop

Spark Core: How to fetch the max n rows of an RDD without using RDD.max()

I have an RDD having below elements:
('09', [25, 66, 67])
('17', [66, 67, 39])
('04', [25])
('08', [120, 122])
('28', [25, 67])
('30', [122])
I need to fetch the elements whose list has the maximum number of elements, which is 3 in the above RDD. The output should be filtered into another RDD, without using the max function or Spark DataFrames:
('09', [25, 66, 67])
('17', [66, 67, 39])
max_len = uniqueRDD.max(lambda x: len(x[1]))
maxRDD = uniqueRDD.filter(lambda x : (len(x[1]) == len(max_len[1])))
I am able to do this with the above lines of code, but Spark Streaming won't support it since max_len is a tuple and not an RDD.
Can someone suggest an approach? Thanks in advance.
Does this work for you? I tried filtering on the streaming RDDs as well, and it seems to work.
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
from pyspark.sql.functions import *
from pyspark.streaming import StreamingContext

sc = SparkContext('local')
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 1)

data1 = [
    ('09', [25, 66, 67]),
    ('17', [66, 67, 39]),
    ('04', [25]),
    ('08', [120, 122]),
    ('28', [25, 67]),
    ('30', [122])
]

df1Columns = ["id", "list"]
df1 = sqlContext.createDataFrame(data=data1, schema=df1Columns)
df1.show(20, truncate=False)

uniqueRDD = df1.rdd
max_len = uniqueRDD.map(lambda x: len(x[1])).max(lambda x: x)
maxRDD = uniqueRDD.filter(lambda x: len(x[1]) == max_len)
print("printing out maxlength = ", max_len)

dStream = ssc.queueStream([uniqueRDD])
resultStream = dStream.filter(lambda x: len(x[1]) == max_len)
print("Printing the filtered streaming result")

def printResultStream(rdd):
    mylist = rdd.collect()
    for ele in mylist:
        print(ele)

resultStream.foreachRDD(printResultStream)
ssc.start()
ssc.awaitTermination()
ssc.stop()
Here's the output:
+---+------------+
|id |list |
+---+------------+
|09 |[25, 66, 67]|
|17 |[66, 67, 39]|
|04 |[25] |
|08 |[120, 122] |
|28 |[25, 67] |
|30 |[122] |
+---+------------+
printing out maxlength = 3
Printing the filtered streaming result
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])
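Note that the snippet above still computes max_len through RDD.max(); if the goal is to avoid max() entirely, the same length can be obtained with a plain reduce. A short sketch using the same uniqueRDD:
max_len = uniqueRDD.map(lambda x: len(x[1])).reduce(lambda a, b: a if a > b else b)
maxRDD = uniqueRDD.filter(lambda x: len(x[1]) == max_len)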
You can try something like this:
dStream = ssc.queueStream([uniqueRDD, uniqueRDD, uniqueRDD])

def maxOverRDD(input_rdd):
    if not input_rdd.isEmpty():
        # Keep the row whose list is longest, then filter all rows of that length
        reduced_rdd = input_rdd.reduce(lambda acc, value: value if (len(acc[1]) < len(value[1])) else acc)
        internal_result = input_rdd.filter(lambda x: len(x[1]) == len(reduced_rdd[1]))
        return internal_result

result = dStream.transform(maxOverRDD)
print("Printing the finalStream")
result.foreachRDD(printResultStream)
The output would look like this (it is repeated because the same RDD is provided three times in the stream):
Printing the finalStream
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])

How to extract specific pages based on a formula?

I'm trying to extract pages from a PDF that is 1000 pages long but I only need pages in the pattern of [9,10,17,18,25,26,33,34,...etc]. These numbers can be represented in the formula: pg = 1/2 (7 - 3 (-1)^n + 8*n).
I tried to define the formula and plug it into tabula.read_pdf, but I'm not sure how to define the 'n' variable, where 'n' ranges from 0 up to 25. Right now I defined it as a list, which I think is the problem...
n = list(range(25+1))
pg = 1/2 (7 - 3 (-1)^n + 8*n)
df = tabula.read_pdf(path, pages = 'pg',index_col=0, multiple_tables=False)
When trying to execute, I get a TypeError: 'int' object is not callable on line pg = 1/2 (7 - 3 (-1)^n + 8*n). How would I define the variables so that tabula extracts pages that fit the condition of the formula?
The formula is x = (8n - 3(-1)^n + 7) / 2.
Step 1:
pg = []  # empty list to store the page numbers calculated by the formula
for n in range(1, 25 + 1):  # for a 1000-page pdf, extend the range beyond 25 as needed
    k = int(1/2 * ((8 * n) - 3 * ((-1) ** n) + 7))
    pg.append(k)
print(pg)  # this will give you the list of page numbers
# [9, 10, 17, 18, 25, 26, 33, 34, 41, 42, 49, 50, 57, 58, 65, 66, 73, 74, 81, 82, 89, 90, 97, 98, 105]
Step 2:
# Now loop through each of the pages containing a table
df_combine = pd.DataFrame([])
for page in pg:
    df = tabula.read_pdf(path, pages=page, index_col=0, multiple_tables=False, guess=False)  # modify as per your requirement
    df_combine = pd.concat([df, df_combine])  # you can choose between merge or concat as per your need
OR
df_data = []
for page in pg:
    df = tabula.read_pdf(path, pages=page, index_col=0, multiple_tables=False, guess=False)  # modify as per your requirement
    df_data.append(df)
df_combine = pd.concat(df_data, axis=1)
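As a possible shortcut (an assumption worth checking against your installed version), tabula-py's read_pdf also accepts a list of page numbers for pages, so the whole extraction can be done in a single call. A sketch, assuming tabula and pandas are imported and path is defined as above:
pg = [int((8 * n - 3 * (-1) ** n + 7) / 2) for n in range(1, 26)]  # same formula as above
dfs = tabula.read_pdf(path, pages=pg, multiple_tables=True)  # list with one DataFrame per table
df_combine = pd.concat(dfs)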
Reference link used to derive the formula:
https://www.wolframalpha.com/widgets/view.jsp?id=a3af2e675c3bfae0f2ecce820c2bef43

How to add a new column to a Spark RDD?

I have an RDD with MANY columns (e.g., hundreds); how do I add one more column at the end of this RDD?
For example, if my RDD is like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
how can I add a column to it, whose value is the sum of the second and the third columns?
Thank you very much.
You do not have to use Tuple* objects at all for adding a new column to an RDD.
It can be done by mapping each row, taking its original contents plus the elements you want to append, for example:
val rdd = ...
val withAppendedColumnsRdd = rdd.map(row => {
  val originalColumns = row.toSeq.toList
  val secondColValue = originalColumns(1).asInstanceOf[Int]
  val thirdColValue = originalColumns(2).asInstanceOf[Int]
  val newColumnValue = secondColValue + thirdColValue
  Row.fromSeq(originalColumns :+ newColumnValue)
  // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...)) // or add several new columns
})
If you have an RDD of Tuple4, apply a map and convert it to a Tuple5:
val rddTuple4RDD = ...........
val rddTuple5RDD = rddTuple4RDD.map(r => Tuple5(r._1, r._2, r._3, r._4, r._2 + r._3))
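For completeness, the same mapping idea in PySpark (a sketch, not from the original answers, assuming a SparkContext sc and rows of numeric tuples):
rdd = sc.parallelize([(123, 523, 534, 893), (536, 98, 1623, 98472)])
# Append the sum of the second and third columns to each row
with_sum = rdd.map(lambda row: tuple(row) + (row[1] + row[2],))
print(with_sum.collect())
# [(123, 523, 534, 893, 1057), (536, 98, 1623, 98472, 1721)]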
