How to access integer named attribute from pyspark row?

For the row object defined as
from pyspark.sql import Row
rr=Row('a',3)
r = rr(3,4)
print(r)
output:
Row(a=3, 3=4)
I can access the value of a as either r.a or r['a'], but neither form works for the key 3. How can I access the value of a Row field whose name is an integer?

You can extract it by using its position within the row. So in your case, that would be the value at index 1:
>>> r[1]
4
Or, if you don't know the position beforehand and just want to look up the column named 3, you could do something like this:
# Grabbing index of column with 3 as name
>>> wantedIndex = r.__fields__.index(3)
>>> r[wantedIndex]
4

Related

pandas first_valid_index() as integer key

I have a pandas dataframe whose index is a date string like so:
'2015-07-15'
and a column alongside it with a value associated with each date.
When I look up the first time the column equals 5 with:
df[df['Column'] == 5].first_valid_index()
it gives me back
'2020-12-19'
Instead, I want the integer position of this occurrence rather than the date label, so that I can pass it to .iloc.
How would I do so? Thank you.

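You can call reset_index first so that the index becomes a default integer index (note that inplace=True modifies df and moves the dates into a regular column):
df.reset_index(inplace=True)
df[df['Column'] == 5].first_valid_index()
An alternative that does not reset the index is get_loc. This assumes your data contains at least one matching value and that the index labels are unique (with duplicate labels, get_loc returns a slice or a boolean mask instead of an integer):
df.index.get_loc(df.index[df['Column'] == 5][0])
Combining both approaches looks like this:
df.index.get_loc(df[df['Column'] == 5].first_valid_index())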
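As a minimal sketch of the get_loc approach, assuming a small made-up frame with unique date-string labels (the data and the names mask, first_label and first_pos are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical sample data: date strings as the index, one value column
df = pd.DataFrame(
    {"Column": [1, 5, 3, 5]},
    index=["2015-07-15", "2015-07-16", "2015-07-17", "2015-07-18"],
)

mask = df["Column"] == 5

# Label of the first matching row (a date string here)
first_label = df[mask].first_valid_index()

# Integer position of that label; an int because the index labels are unique
first_pos = df.index.get_loc(first_label)

# The position can now be used with .iloc
row = df.iloc[first_pos]
print(first_label, first_pos)
```

Unlike reset_index(inplace=True), this leaves the date index of df untouched.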

Pandas multi-column match and then backward shift to find value

I would really appreciate your help with this. I'm a pandas/Python noob and have been thrown into the deep end with this problem.
I searched through around 100 different questions on this website and could not find anything that fits. The closest I got was applying boolean masks.
Please click on the text below to find my dataframe.
I would like to run a query on the dataset to find the previous row where the 'AwayTeam' string is found in the 'HomeTeam' column -> I would then like to pull in the value of 'home_form' for that matching incidence as an additional column
Date        HomeTeam    AwayTeam  home_form  away_form  new_column
25/08/2019  Strasbourg  Rennes    1.0        3.0        NaN         (Row 25, just an example)
01/09/2019  Rennes      Nice      3.0        3.0        3.0         (Row 37, just an example)
I want to pull in the previous 'away_form' value for the last row where a HomeTeam appeared in the AwayTeam column

This is not a complete solution, but I think I found a way to help you make some progress.
Here are the steps:
Create a sample dataframe, just for illustration
Convert the 'HomeTeam' column into a list … this is the target column.
Create an empty list to store the results of searching 'HomeTeam' column
Loop through the teams in the 'AwayTeam' column
Use Python's list.index() method to return the index of the match … but use a try-except just in case you don't find a match.
Store the result into list
When finished with the for-loop, add the list as a new column in the pandas dataframe.
import pandas as pd
import numpy as np
# create sample dataframe
df = pd.DataFrame({
    'Date': ['2019-08-18', '2019-08-25'],
    'HomeTeam': ['Rennes', 'Strasbourg'],
    'AwayTeam': ['Paris SG', 'Rennes'],
    'home_form': [np.nan, 1.0],
    'away_form': [np.nan, 3.0],
})
# convert your 'HomeTeam' column into a Python list
list_HomeTeam = list(df['HomeTeam'])
print(list_HomeTeam)
# create an empty list to capture the index position of matches in 'HomeTeam'
list_results_in_home = []
# loop through each team in the 'AwayTeam' column
for each_team in df['AwayTeam']:
    # if you find a match in the list, store its index as the result
    try:
        result = list_HomeTeam.index(each_team)
    # list.index() raises ValueError when there is no match; store a string instead
    except ValueError:
        result = 'team not in list'
    # add the result to the list that is capturing the index position in 'HomeTeam'
    list_results_in_home.append(result)
print(list_results_in_home)
# add column to dataframe with the index position
df['index_match_in_HomeTeam'] = list_results_in_home
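One possible way to get from an index lookup to the value the question asks for is to walk the rows in date order and remember, for every team, the away_form from its most recent away match. This is only a sketch under assumptions: it follows the second description in the question (the previous away_form for the row's HomeTeam), the third sample row is copied from the question's example, and the names last_away_form and new_column are hypothetical.

```python
import pandas as pd

# Sample data: the two rows used above plus the 01/09/2019 row from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-18", "2019-08-25", "2019-09-01"]),
    "HomeTeam": ["Rennes", "Strasbourg", "Rennes"],
    "AwayTeam": ["Paris SG", "Rennes", "Nice"],
    "home_form": [float("nan"), 1.0, 3.0],
    "away_form": [float("nan"), 3.0, 3.0],
}).sort_values("Date", ignore_index=True)

# Maps each team to the away_form of its latest match as the away side
last_away_form = {}
new_column = []

for home, away, away_form in zip(df["HomeTeam"], df["AwayTeam"], df["away_form"]):
    # Look up the home team's previous away match, NaN if it has none yet
    new_column.append(last_away_form.get(home, float("nan")))
    # Only now record the current match, so a row never matches itself
    last_away_form[away] = away_form

df["new_column"] = new_column
print(df)
```

Because the dictionary is updated after the lookup, each row only sees earlier matches, which is the "previous row" behaviour the question describes.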

Looping through a panda dataframe

My variable noExperience1 is a dataframe
I am trying to go through this loop:
num = 0
for row in noExperience1:
    if noExperience1[row+1] - noExperience1[row] > num:
        num = noExperience1[row+1] - noExperience1[row]
print(num)
My goal is to find the biggest difference in y values from one x value to the next. But I get the error that the line of my if statement needs to be a string and not an integer. How do I fix this so I can have a number?

Iterating over a dataframe with for row in noExperience1 yields its column names (strings), not its rows, which is why row+1 fails. To access a row, use loc or iloc. The loop below assumes the default integer index and that the values to compare are in the first column:
import pandas as pd
noExperience1 = pd.read_csv("../input/data.csv")  # reading CSV file
num = 0
for row in range(1, len(noExperience1)):  # iterate over all rows, starting from the second
    # difference between this row and the previous one, taken from the first column
    diff = int((noExperience1.loc[row] - noExperience1.loc[row-1]).iloc[0])
    if diff > num:
        num = diff
print(num)
Note:
1. Column selection: DataFrame[ColName] gives you all entries of the specified column.
2. Row selection: DataFrame.loc[RowLabel] gives you the complete row with that label. With the default index, the labels are the row numbers, starting at 0.
Hope this helps.
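If the loop is not required, the same number can usually be obtained without iterating at all. A minimal sketch, assuming the y values live in a column called y (the column names and sample values here are made up, since the question does not show the data):

```python
import pandas as pd

# Hypothetical stand-in for noExperience1
no_experience = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 13, 12, 20]})

# Difference between each y value and the one before it; the first entry is NaN
steps = no_experience["y"].diff()

# Largest increase from one row to the next (max() skips the NaN)
largest_increase = steps.max()

# Largest change in either direction, if decreases should count too
largest_change = steps.abs().max()

print(largest_increase, largest_change)
```

One difference from the loop: the loop starts at num = 0, so it reports 0 when the values never increase, whereas max() would return a negative number in that case.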

How to add a value to a data frame which has given columns and rows

I would like to set some values in a data frame. Here is my code:
algorithm_choice = ['DUMMY','LINEAR_REGRESSION','RIDGE_REGRESSION','MLP','SVM','RANDOM_FOREST']
model_type_choice = ['POPULATION_INFORMED','REGULAR','SINGLE_CYCLE','CYCLE_PREDICTION']
rmse_summary = pd.DataFrame(columns=algorithm_choice, index=model_type_choice)
How can I add a specific value to rmse_summary?

Use .loc or .iloc
To set a single value, you can use either .loc or .iloc.
.loc addresses a cell by row label and column label:
rmse_summary.loc['REGULAR','DUMMY'] = 3
.iloc addresses a cell by zero-based integer position (row first, then column):
rmse_summary.iloc[2,4] = 5
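Put together with the frame from the question, a short runnable sketch shows which cells the two assignments end up in (the values 3 and 5 are just placeholders):

```python
import pandas as pd

algorithm_choice = ['DUMMY', 'LINEAR_REGRESSION', 'RIDGE_REGRESSION',
                    'MLP', 'SVM', 'RANDOM_FOREST']
model_type_choice = ['POPULATION_INFORMED', 'REGULAR', 'SINGLE_CYCLE',
                     'CYCLE_PREDICTION']

# Empty frame: every cell starts out as NaN
rmse_summary = pd.DataFrame(columns=algorithm_choice, index=model_type_choice)

# By label: row 'REGULAR', column 'DUMMY'
rmse_summary.loc['REGULAR', 'DUMMY'] = 3

# By position: third row ('SINGLE_CYCLE'), fifth column ('SVM')
rmse_summary.iloc[2, 4] = 5

print(rmse_summary)
```

Label-based access is usually easier to read here, since the row and column names carry the meaning and do not depend on the order of the two lists.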

Pandas .apply() function not always being called in python 3

Hello, I want to increment a global variable 'count' from a function that is applied to a pandas dataframe of length 1458.
I have read other answers that talk about .apply() not being in-place.
I followed their advice, but the count variable still ends up as 4.
count = 0
def cc(x):
    global count
    count += 1
    print(count)
# Expected final value of count is 1458 but instead it is 4
# I think it's 4 because 'PoolQC' is a categorical column with 4 possible values
# I want the count variable to be 1458 by the end, instead it shows 4
all_data['tempo'] = all_data['PoolQC'].apply(cc)
# prints 4 instead of 1458
print("Count final value is ", count)

Yes, the effect you observe is because the column has a categorical dtype. pandas is being smart here: it evaluates apply once per category rather than once per row. Is counting the only thing you're doing there? I guess not, but why do you need such a calculation? Can't you use all_data.shape?
I see a couple of options here:
You can change the type of the column, e.g.
all_data['tempo'] = all_data['PoolQC'].astype(str).apply(cc)
You can use a different, non-categorical column.
You can use all_data.shape to see how many rows the dataframe has.
You can use apply on the whole DataFrame, like all_data['tempo'] = all_data.apply(cc, axis=1).
In that case you can still use whatever is in all_data['PoolQC'] within the cc function, like:
def cc(x):
    global count
    count += 1
    print(count)
    return x['PoolQC']
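To see the effect on a small scale, here is a rough sketch that counts how often the applied function actually runs. The six-row series is a made-up stand-in for all_data['PoolQC'], the helper count_calls is hypothetical, and a list is used instead of a global counter. How many calls the categorical case makes may depend on the pandas version, so no expected number is given for it.

```python
import pandas as pd

# Hypothetical stand-in for all_data['PoolQC']: 6 rows, 3 distinct categories
pool_qc = pd.Series(["Ex", "Gd", "Ex", "Fa", "Gd", "Ex"], dtype="category")

def count_calls(series):
    """Apply a recording function to the series and return the number of calls."""
    calls = []

    def record(value):
        calls.append(value)
        return value

    series.apply(record)
    return len(calls)

# On a categorical column, the function may run once per category only
calls_on_categorical = count_calls(pool_qc)

# After converting to plain strings, it runs once per row
calls_on_strings = count_calls(pool_qc.astype(str))

print(calls_on_categorical, calls_on_strings, len(pool_qc))
```

If the row count is all that is needed, len(all_data) or all_data.shape[0] gives it directly without calling apply.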
