Pandas - Behaviour/Issue of reindex vs loc for setting values - python-3.x

I've been trying to use reindex instead of loc in pandas, since 0.24 emits a warning about indexing with lists.
The issue is that I use loc to change the values of my dataframes.
Now if I use reindex I lose this, and when I try to be clever I even hit a bug.
Consider the following case:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((4, 2)), columns=['a', 'b'])
ds = pd.Series(data=[1] * 3)
I want to change a subset of values (while retaining the others), so df keeps the same shape.
So this is the original behaviour, which works (it changes the first three values of df['a'] to 1):
df.loc[range(3), 'a'] = ds
But when I use reindex I fail to change anything:
df.reindex(range(3)).loc['a'] = ds
Now when I try something like this:
df.loc[:, 'a'].reindex(range(3)) = ds
I get a SyntaxError: can't assign to function call.
For reference, I am using pandas 0.24 and Python 3.6.8.

The quick answer from #coldspeed was the easiest, though the behaviour suggested by the warning is misleading.
reindex returns a copy, whereas loc does not.
From the pandas docs:
A new object is produced unless the new index is equivalent to the current one and copy=False.
So presenting reindex as an alternative to loc, as the warning does, is actually misleading.
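To make the copy behaviour concrete, here is a minimal sketch (the subset variable name is mine, for illustration only):
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((4, 2)), columns=['a', 'b'])
ds = pd.Series(data=[1] * 3)

# loc writes into df itself: rows 0-2 of column 'a' become 1.0
df.loc[range(3), 'a'] = ds

# reindex returns a new object, so assigning into it leaves df untouched
subset = df.reindex(range(3))
subset['a'] = 0          # modifies the copy only
print(df['a'].tolist())  # [1.0, 1.0, 1.0, 0.0] -- df is unchanged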
Hope this helps people who face the same situation.

Related

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
    Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
    df = spark.sql('select * from Merchant_Segments_View limit 5')
    return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid spark.sql for calculating the median across numerous fields, using something like the code below, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F

exprs = {x: F.percentile_approx(x, 0.5) for x in df.columns if x not in exclusion_list}
df = df.groupBy(['BANK_NAME', 'BUS_SEGMENT']).agg(exprs)
Try createGlobalTempView. It worked for me.
e.g.:
df.createGlobalTempView("people")
(I don't know the root cause of why the local temp view does not work.)
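One detail worth noting (this is standard Spark behaviour, not specific to Foundry): a global temp view is registered under the global_temp database, so the name must be qualified when you query it:
# The global temp view lives in the spark-managed `global_temp` database
df.createGlobalTempView("people")
result = spark.sql("select * from global_temp.people limit 5")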
I managed to avoid using dynamic sql for calculating median across columns using the following code:
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.
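For anyone who wants to try this outside Foundry, a minimal self-contained sketch; the column names and data here are made up, not the asker's actual table:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the merchant table (hypothetical columns)
df = spark.createDataFrame(
    [('A', 'X', 10.0), ('A', 'X', 0.0), ('A', 'X', 30.0)],
    ['BANK_NAME', 'BUS_SEGMENT', 'REVENUE'],
)
exclusion_list = ['BANK_NAME', 'BUS_SEGMENT']

df_result = df.groupBy(['BANK_NAME', 'BUS_SEGMENT']).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
df_result.show()  # nullif() turns the 0.0 into NULL, so it is ignored by the median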

Why is SettingWithCopyWarning raised when using .loc?

I have checked similar questions on SO about SettingWithCopyWarning being raised by .loc, but I still don't understand why I get the warning in the following example.
It appears on the third line; I managed to make it disappear with .copy(), but I would like to understand why .loc specifically didn't work here.
Does making a conditional slice create a view, even with .loc?
df = pd.DataFrame(data=[0, 1, 2, 3, 4, 5], columns=['A'])
df.loc[:, 'B'] = df.loc[:, 'A'].values
dfa = df.loc[df.loc[:, 'A'] < 4, :]  # here .copy() removes the warning
dfa.loc[:, 'C'] = [3, 2, 1, 0]
Edit: the pandas version is 1.2.4.
dfa = df.loc[df.loc[:,'A'] < 4, :]
dfa is a slice of the df dataframe, still referencing the dataframe: a view. .copy() creates a separate copy, not just a view of the first dataframe.
dfa.loc[:,'C'] = [3,2,1,0]
When it's a view, not a copy, you get the warning: A value is trying to be set on a copy of a slice from a DataFrame.
.loc locates the rows matching the condition you give it, but you are still setting values on a view if you don't make a copy of the dataframe.
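Concretely, the warning-free version of the example (this is just the question's own .copy() fix spelled out):
import pandas as pd

df = pd.DataFrame(data=[0, 1, 2, 3, 4, 5], columns=['A'])
df.loc[:, 'B'] = df.loc[:, 'A'].values

# .copy() detaches dfa from df, so the new column is written to an independent frame
dfa = df.loc[df.loc[:, 'A'] < 4, :].copy()
dfa.loc[:, 'C'] = [3, 2, 1, 0]  # no SettingWithCopyWarning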

How to convert an object of type "pandas.core.groupby.generic.SeriesGroupBy" to "pandas.core.series.Series"?

I have a variable of type pandas.core.groupby.generic.SeriesGroupBy, which I got by grouping various fields of a pandas dataframe. I would like to convert that variable into a pandas Series, but my attempt generates a lot of errors.
Here is the code which I have tried:
w = data.groupby(['dt', 'b'])['w']
w = pd.Series(w)
When I try to run this code, it takes a long time to execute and also generates a lot of errors.
I am getting a pandas Series as follows (screenshot not included):
But I am expecting something similar to this (screenshot not included):
Is there any other way to group a column of a DataFrame and store the result inside a pandas Series?
Pandas groupby objects are iterable. Using a list comprehension, you can extract the partitioned sub-series. Try:
list_of_series = [s for _, s in data.groupby(['dt', 'b'])['w']]
list_of_series is a list containing your desired pandas Series.
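For example, with a small stand-in frame (the asker's real data isn't shown, so the values here are made up):
import pandas as pd

data = pd.DataFrame({
    'dt': ['2021-01', '2021-01', '2021-02'],
    'b':  ['x', 'x', 'y'],
    'w':  [1.0, 2.0, 3.0],
})

list_of_series = [s for _, s in data.groupby(['dt', 'b'])['w']]
print(len(list_of_series))  # 2 groups: ('2021-01', 'x') and ('2021-02', 'y')
print(list_of_series[0])    # the 'w' Series for the first group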

How to use a lambda function to select larger values from two pandas dataframes while comparing them by date?

I want to map over the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, keeping only the values in df1 which are larger than those in df2 and putting them into a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1 = pd.DataFrame({'Date': ['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'],
                    'Values': [-5.6, -5.6, 0, 3.9, 9.4]})
df1.Date = pd.to_datetime(df1.Date)
df1['Day'] = pd.DatetimeIndex(df1['Date']).day
df1['Month'] = pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month', 'Day'], inplace=True)
df1

df2 = pd.DataFrame({'Date': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04', '2005-01-05'],
                    'Values': [-13.3, -12.2, 6.7, 8.8, 15.5]})
df2.Date = pd.to_datetime(df2.Date)
df2['Day'] = pd.DatetimeIndex(df2['Date']).day
df2['Month'] = pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month', 'Day'], inplace=True)
df2
With df1 and df2 set up this way,
df2['New'] = df2[df2['Values'] < df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried:
df2['New'] = df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use NumPy. NumPy has a function called where that helps a lot in cases like this.
This is how the statement works:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
So I think that solves your problem... You can change the value passed for the False condition to anything you want; this is only an example.
Tell us if it worked!
Let's try two possible solutions.
The first solution is to sort the indexes:
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
It is possible that this raises some kind of error, so if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
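Putting the pieces together, a minimal end-to-end sketch of the question's data (with the dates cleaned up so both frames cover January 1-5):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Date': ['2015-01-01', '2015-01-02', '2015-01-03',
                             '2015-01-04', '2015-01-05'],
                    'Values': [-5.6, -5.6, 0, 3.9, 9.4]})
df2 = pd.DataFrame({'Date': ['2005-01-01', '2005-01-02', '2005-01-03',
                             '2005-01-04', '2005-01-05'],
                    'Values': [-13.3, -12.2, 6.7, 8.8, 15.5]})

for df in (df1, df2):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Day'] = df['Date'].dt.day
    df['Month'] = df['Date'].dt.month
    df.set_index(['Month', 'Day'], inplace=True)
    df.sort_index(inplace=True)

# With both frames aligned on (Month, Day), the element-wise comparison works
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'], np.nan)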

Selecting specific rows with multi-level indexes

I am learning Python on DataCamp, where I am taking a course titled Data Manipulation with pandas.
In an exercise I want to select some rows from a data frame, but I get an error and I really don't understand the error message.
Here is the code:
# Load packages
import pandas as pd
# Import data
sales = pd.read_pickle("https://assets.datacamp.com/production/repositories/5386/datasets/b0e855c644024f850a7df3fe8b7bf0e23f7bd2fc/walmart_sales.pkl.bz2", 'bz2')
print(sales)
# Create multi-level indexes
sales_ind = sales.set_index(["store", "type"])
print(sales_ind)
# Define the rows to select
rows_to_keep = [(1, "A"), (2, "B")]
print(rows_to_keep)
# Use .loc to select the rows
print(sales_ind.loc[rows_to_keep])
The problem is when I run print(sales_ind.loc[rows_to_keep]), which raises KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike', and I don't understand how to fix the code.
I am using Python 3.7.6 and pandas 1.0.0
I figured out your problem, and it is not a problem with df.loc but with the data you provided: there is no value in your multiindex that matches the tuple (2, 'B').
You can see this if you try:
sales_ind.loc[(2,'B'),:]
which returns:
KeyError: (2, 'B')
while (1,'A') works just fine.
By the way, when selecting on a multiindex, I always recommend using pd.IndexSlice:
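For example (same sales_ind frame as above; note that slicing a MultiIndex requires it to be sorted):
idx = pd.IndexSlice
sales_sorted = sales_ind.sort_index()  # lexsort the MultiIndex before slicing

# Select store 1, type 'A' -- a key that actually exists
print(sales_sorted.loc[idx[1, 'A'], :])

# Slice a range of stores while fixing the type
print(sales_sorted.loc[idx[1:3, 'A'], :])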
