I have a dataframe of categorical variables. I want to replace all the values in one column with an arbitrary placeholder string if the count of that category within the column is less than 100.
So, for example, in the column color, if any color appears fewer than 100 times, I want it to be replaced by the string "base".
I tried the code below, along with different things I found on Stack Overflow, but it fails with a broadcasting error:
df['color'] = numpy.where(df.groupby("color").filter(lambda x: len(x) < 100), 'dummy', df['color'])
operands could not be broadcast together with shapes (45638872,878) () (8765878782788,)
IIUC, you need this,
df.loc[df.groupby('color')['color'].transform('count') < 100, 'color'] = 'dummy'
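For context, a minimal self-contained sketch of the same idea (the column name color and the 'dummy' placeholder come from the thread; the toy data and the threshold of 2 are made up for illustration):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'red', 'red', 'blue', 'green']})

# Count how often each color occurs, broadcast back to every row ...
counts = df.groupby('color')['color'].transform('count')

# ... and overwrite the rare categories in place
df.loc[counts < 2, 'color'] = 'dummy'

print(df['color'].tolist())  # ['red', 'red', 'red', 'dummy', 'dummy']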
I use the following code to create two arrays for a histogram, one for the counts (as percentages) and the other for the values.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
So, an output looks like
counts = 66.7, 8.3, 8.3, 8.3, 8.3
values = 1024, 356352, 73728, 16384, 4096
The problem is that some values occur only once and I would like to ignore them. In the example above, only 1024 is repeated multiple times; the others appear just once. I could manually check the number of occurrences of each value in the row and ignore the ones that are not repeated:
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df  # contains percentages
values = df.keys().tolist()

for v in values:
    # N = get_number_of_instances in row
    # if N == 1:
    #     remove v in row
I would like to know if there are other ways to do this using the built-in functions in Pandas.
Some clarification was requested on your question in the comments above.
If keys is a column and you want to retain only the non-duplicated entries, please try:
values = df.loc[~df['keys'].duplicated(keep=False), 'keys'].to_list()
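If the goal is instead to drop the values that occur only once before computing the percentages, one possible sketch (assuming row is a pandas Series, as in the question; the sample data is made up) is to filter on value_counts:

import pandas as pd

row = pd.Series([1024, 1024, 1024, 356352, 73728])

# Keep only the values that appear more than once in the row
vc = row.value_counts()
repeated = row[row.isin(vc[vc > 1].index)]

df = repeated.value_counts(normalize=True).mul(100).round(1)
counts = df                  # percentages, singletons excluded
values = df.keys().tolist()  # [1024]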
In my Excel file, I have data split up over different tables for different values of parameter X.
I have tables for parameter X at the values 0.1, 0.5, 1, 5 and 10. Each table has a parameter Y at the far left that I want to be able to search for, with a few data cells to the right of it. Like so:
X = 0.1
Y    Data_0      Data_1      Data_2
1    0.071251    0.681281    0.238509
2    0.283393    0.509497    0.397196
3    0.678296    0.789879    0.439004
4    0.788525    0.363215    0.248953
etc.
Now I want to find Data_0, Data_1 and Data_2 for a given X and Y value (in two separate cells).
My thought was to name the tables X0.1, X0.5, etc. and, when defining the table array for the lookup function, use some syntax that changes which table it searches in. With three of these functions in adjacent cells, I would obtain the three desired values.
Is that possible, or is there some other method that would give me the result I want?
Thanks in advance
On the question of what my desired result from this data would be:
I would like A1 to hold the X value I'm searching for (so 0.1 in this case),
A2 would be the value of Y (let's pick 3),
and then I want C1:E1 to give the values 0.678..., 0.789... and 0.439...
Now from usmanhaq, I think it should be something like:
=vlookup(A2,concatenate("X",A1),2)
=vlookup(A2,concatenate("X",A1),3)
=vlookup(A2,concatenate("X",A1),4)
for the three cells.
This exact formulation doesn't work and I can't find the formulation that does work.
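For what it's worth, one common pattern for switching the lookup table by cell value is to build the table name as text and pass it through INDIRECT. This is only a sketch and assumes the tables really are defined names (or named Excel tables) called X0.1, X0.5, etc., with Y in the first column and Data_0 to Data_2 in columns 2-4:

=VLOOKUP($A$2, INDIRECT("X" & $A$1), 2, FALSE)
=VLOOKUP($A$2, INDIRECT("X" & $A$1), 3, FALSE)
=VLOOKUP($A$2, INDIRECT("X" & $A$1), 4, FALSE)

Whether a name like X0.1 is accepted depends on Excel's naming rules, so a period-free naming scheme (for example X0_1, an assumption on my part) may be safer.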
I would like to add some values to a data frame. Here is my code:
algorithm_choice = ['DUMMY', 'LINEAR_REGRESSION', 'RIDGE_REGRESSION', 'MLP', 'SVM', 'RANDOM_FOREST']
model_type_choice = ['POPULATION_INFORMED', 'REGULAR', 'SINGLE_CYCLE', 'CYCLE_PREDICTION']
rmse_summary = pd.DataFrame(columns=algorithm_choice, index=model_type_choice)
How can I add a specific value to rmse_summary?
Use .loc and .iloc
To add a specific value (I assume you mean a single value), you can use either .loc or .iloc.
.loc gives you access to a position by label:
rmse_summary.loc['REGULAR','DUMMY'] = 3
.iloc gives you access to a position by integer position:
rmse_summary.iloc[2,4] = 5
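Putting it together, a minimal runnable sketch (the frame and labels are the ones from the question; the values 3 and 5 are arbitrary):

import pandas as pd

algorithm_choice = ['DUMMY', 'LINEAR_REGRESSION', 'RIDGE_REGRESSION', 'MLP', 'SVM', 'RANDOM_FOREST']
model_type_choice = ['POPULATION_INFORMED', 'REGULAR', 'SINGLE_CYCLE', 'CYCLE_PREDICTION']
rmse_summary = pd.DataFrame(columns=algorithm_choice, index=model_type_choice)

# Set a single cell by row/column label ...
rmse_summary.loc['REGULAR', 'DUMMY'] = 3

# ... or by integer position (row 2 = 'SINGLE_CYCLE', column 4 = 'SVM')
rmse_summary.iloc[2, 4] = 5

print(rmse_summary.loc['SINGLE_CYCLE', 'SVM'])  # 5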
Hello, I want to increment a global variable count through a function that will be called on a pandas dataframe column of length 1458.
I have read other answers that talk about .apply() not being in place.
I therefore followed their advice, but the count variable is still 4.
count = 0

def cc(x):
    global count
    count += 1
    print(count)

# Expected final value of count is 1458 but instead it is 4
# I think it's 4 because 'PoolQC' is a categorical column with 4 possible values
# I want the count variable to be 1458 by the end, but instead it shows 4
all_data['tempo'] = all_data['PoolQC'].apply(cc)

# prints 4 instead of 1458
print("Count final value is ", count)
Yes, the observed effect is because the column has a categorical dtype. Pandas is being clever here: it evaluates the applied function once per category rather than once per row. Is counting the only thing you're doing there? I guess not, but why do you need such a calculation? Can't you use df.shape?
A couple of options I see here:
You can change the type of the column, e.g.
all_data['tempo'] = all_data['PoolQC'].astype(str).apply(cc)
You can use a different, non-categorical column.
You can use df.shape to see how many rows you have in the dataframe.
You can use apply on the whole DataFrame, like all_data['tempo'] = all_data.apply(cc, axis=1).
In such a case you can still use whatever is in all_data['PoolQC'] within the cc function, like:
def cc(x):
    global count
    count += 1
    print(count)
    return x['PoolQC']
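A small self-contained sketch of the difference (the toy data and category labels are made up; exactly how many times the function is called on a categorical column can vary by pandas version, since apply may be mapped over the categories rather than the rows):

import pandas as pd

count = 0

def cc(x):
    global count
    count += 1
    return x

s = pd.Series(['Gd', 'Gd', 'Fa', 'Ex', 'Ex'], dtype='category')

s.apply(cc)
print(count)        # typically 3 here: once per category, not once per row

count = 0
s.astype(str).apply(cc)
print(count)        # 5: once per row after casting away the categorical dtype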
I have a strange problem which I am not able to figure out. I have a dataframe, subset, that looks like this.
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
and I get a result similar to this.
Now when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
I don't understand why I sometimes get zeros and other times get either NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional', 'IPNotional'], 0))
# you can also define each column separately
# subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or simpler:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now when I do similar things to another dataframe I get zero columns with a mix of NaN and zero rows as shown below. This is really strange.
I think the problem is different index values, so it is necessary to create the new column with the same index; otherwise the non-matching indices produce NaNs:
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)), index=subset.index)
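A small sketch that reproduces the effect (the toy frame is made up; the column names come from the thread). Assignment aligns on index labels, so a zeros frame built with the default 0..n-1 index produces NaN wherever subset's labels don't match, while passing index=subset.index lines everything up:

import numpy
import pandas as pd

# A frame whose index does NOT start at 0 (e.g. a filtered slice of a bigger frame)
subset = pd.DataFrame({'a': [10, 20, 30]}, index=[5, 6, 7])

# Default RangeIndex 0..2 does not match the labels 5..7 -> all NaN
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))

# Reusing subset's own index aligns correctly -> all zeros
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)), index=subset.index)

print(subset)
#     a  IRNotional  IPNotional
# 5  10         NaN         0.0
# 6  20         NaN         0.0
# 7  30         NaN         0.0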