pandas first_valid_index() as integer key - python-3.x

I have a pandas dataframe with an index as a date string like so:
'2015-07-15'
and another column alongside it with a value associated with the dates.
When I use the following to find out where the column first equals 5:
df[df['Column'] == 5].first_valid_index()
it gives me back
'2020-12-19'
Instead, I want the integer index number of this occurrence rather than the date index itself, so I can use the .iloc method with it.
How would I do so? Thank you.

You need to reset_index first so that you can get an integer index.
df.reset_index(inplace=True)
df[df['Column'] == 5].first_valid_index()
An alternate way, without resetting the index, is to use get_loc. Assuming your data contains at least one matching value:
df.index.get_loc(df.index[df['Column'] == 5][0])
Combination of both would look like,
df.index.get_loc(df[df['Column'] == 5].first_valid_index())
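As a minimal, self-contained sketch (the dates and values here are made up for illustration), the get_loc approach would look like this:
import pandas as pd
df = pd.DataFrame({'Column': [1, 3, 5, 5, 7]},
                  index=['2015-07-15', '2015-07-16', '2020-12-19', '2020-12-20', '2020-12-21'])
label = df[df['Column'] == 5].first_valid_index()  # '2020-12-19'
pos = df.index.get_loc(label)                      # 2, an integer usable with .iloc
df.iloc[pos]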

Related

how to find the max values of groupby pandas with the column name

Actually I want to find the maximum value for teacher_prefix along with the teacher's prefix (for example, in this case mrs, and the value, which is 639471).
Try using pd.Series.idxmax(). This returns the index label of the column/series where the max value occurs.
grouped_df[[grouped_df.idxmax()]]    # if grouped_df is a Series
grouped_df.loc[grouped_df.idxmax()]  # if grouped_df is a DataFrame
teacher_prefix
mrs 639471
Name: sum, dtype: int64
If you want one or more rows based on the top n largest values:
grouped_df.nlargest(2, 'sum')
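For context, a small sketch of how this might be used end to end (the data and the 'quantity' column are assumptions; the grouped result is kept as a Series named 'sum'):
import pandas as pd
df = pd.DataFrame({'teacher_prefix': ['mrs', 'ms', 'mr', 'mrs'],
                   'quantity': [639000, 120000, 90000, 471]})
grouped_df = df.groupby('teacher_prefix')['quantity'].sum().rename('sum')
grouped_df.loc[[grouped_df.idxmax()]]  # prefix with the largest total (mrs, 639471)
grouped_df.nlargest(2)                 # top two prefixes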

Pyspark: Trying to Convert a Column to binary using a 'greater than' boolean expression

Is there a way to create a new column that only holds values for something 'greater than 1'?
There's a column for retweets and I need to make a new column that is binary: 0 for zero retweets, 1 for one retweet or more, in PySpark.
You can use
df.withColumn('greater_than_1', (F.col('retweets').cast('int') >= 1).cast('int'))
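A minimal sketch of how this could be run (the DataFrame contents here are made up; F is the conventional alias for pyspark.sql.functions):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (1,), (7,)], ['retweets'])  # hypothetical retweet counts
df = df.withColumn('greater_than_1', (F.col('retweets').cast('int') >= 1).cast('int'))
df.show()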

How to populate one column in a dataframe from the truncated value of another column

I have a column in a Pandas dataframe (final_combine_df) called GEOID. It holds a 15-character string number like this: '371899201001045'. I want to create a new column in my data frame called 'CB_GrpID' that is equal to just the first 12 characters of the GEOID values (ex: '371899201001'). I tried this, but it just returned the same GEOID value (non-truncated) in the new 'CB_GrpID':
final_combine_df['CB_GrpID'] = final_combine_df['GEOID'][:12]
What am I doing wrong here?
final_combine_df.iloc[0]['CB_GrpID']
>>371899201001045
The str accessor is what you're looking for (see pandas.Series.str and the "Working with text" guide in the pandas docs). It gives access to the strings in each cell along with "vectorized" string methods.
final_combined_df['GEOID'].str[:12]
What you were doing:
final_combined_df['GEOID'][:12]
was just getting the first 12 elements (rows) of the column.
Alternatively, use a lambda function to return the first 12 characters of the string. Note that Python starts at index 0 and the upper slice limit is exclusive, not inclusive, so the last character you want is at index 11 and the upper limit is set to 12 to ensure it is included. Just FYI in case you were unaware.
df['new_var'] = df['old_var'].apply(lambda x: x[:12])
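A quick sketch of both approaches, assuming GEOID is stored as a string (the data here is just the example value from the question):
import pandas as pd
final_combine_df = pd.DataFrame({'GEOID': ['371899201001045', '371899201001046']})
final_combine_df['CB_GrpID'] = final_combine_df['GEOID'].str[:12]                  # vectorized slice
final_combine_df['CB_GrpID'] = final_combine_df['GEOID'].apply(lambda x: x[:12])   # or with a lambda
final_combine_df.iloc[0]['CB_GrpID']  # '371899201001'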

How to add a value to a data frame which has given columns and rows

I would like to add some values to the data frame. Here is my code:
algorithm_choice = ['DUMMY','LINEAR_REGRESSION','RIDGE_REGRESSION','MLP','SVM','RANDOM_FOREST']
model_type_choice=['POPULATION_INFORMED','REGULAR','SINGLE_CYCLE','CYCLE_PREDICTION']
rmse_summary=pd.DataFrame(columns=algorithm_choice, index = model_type_choice)
How can I add a specific value to rmse_summary?
Use .loc and .iloc
To add a specific value (I assume a single value), you can use either .loc or .iloc.
.loc will give you a specific position by name:
rmse_summary.loc['REGULAR','DUMMY'] = 3
.iloc will give you access to a position by index number:
rmse_summary.iloc[2,4] = 5
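Putting it together with the lists from the question, a minimal sketch:
import pandas as pd
algorithm_choice = ['DUMMY', 'LINEAR_REGRESSION', 'RIDGE_REGRESSION', 'MLP', 'SVM', 'RANDOM_FOREST']
model_type_choice = ['POPULATION_INFORMED', 'REGULAR', 'SINGLE_CYCLE', 'CYCLE_PREDICTION']
rmse_summary = pd.DataFrame(columns=algorithm_choice, index=model_type_choice)

rmse_summary.loc['REGULAR', 'DUMMY'] = 3  # by label
rmse_summary.iloc[2, 4] = 5               # row 'SINGLE_CYCLE', column 'SVM' by position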

pandas iterate rows and then break until condition

I have a column that's unorganized, like this:
Name
Jack
James
Riddick
Random value
Another random value
What I'm trying to do is get only the names from this column, but I'm struggling to find a way to differentiate real names from random values. Fortunately the names are all together, and the random values are all together as well. The only thing I can do is iterate the rows until it reaches 'Random value' and then break off.
I've tried using lambdas for this but with no success, as I don't think there's a way to break. And I'm not sure how a comprehension could work in this case.
Here's the example I've been trying to play with:
df['Name'] = df['Name'].map(lambda x: True if x != 'Random value' else break)
But the above doesn't work. Any suggestions on what could work based on what I'm trying to achieve? Thanks.
Find index of row containing 'Random value':
index_split = df[df.Name == 'Random value'].index.values[0]
Save the random values for later use if you want:
random_values = df.iloc[index_split+1:].values
Remove the random values from the Name column:
df = df.iloc[:index_split]
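A self-contained sketch of the whole approach, assuming the default integer (RangeIndex) index on the column from the question:
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'James', 'Riddick', 'Random value', 'Another random value']})

index_split = df[df.Name == 'Random value'].index.values[0]  # 3
random_values = df.iloc[index_split+1:].values               # rows after the marker
df = df.iloc[:index_split]                                   # only the real names remain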
