I am one-hot encoding three columns of the Kaggle Mercedes-Benz dataset in a for loop.
But when I use pd.get_dummies in the loop, the encoding is applied to the "X6" column only. I assume the loop also iterates through the other two columns, "X3" and "X4", but those changes do not show up in the dataframe.
X3 should give 3 dummy columns, X4 should give 4 and X6 should give 12.
for column in ['X3', 'X4', 'X6']:
    dummies = pd.get_dummies(df[column])
    df[dummies.columns] = dummies
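A minimal sketch of one likely cause and a possible fix. It assumes (the data is not shown in the question) that X3, X4 and X6 share category labels such as 'a' and 'b'. If so, each pass of the loop produces dummy columns with the same names, so later assignments overwrite earlier ones and only the last column's encoding survives. The toy frame below is made up for illustration. Passing columns= to pd.get_dummies prefixes every dummy with its source column name, which avoids the collision:

```python
import pandas as pd

# Toy data (an assumption): the three columns reuse the same labels.
df = pd.DataFrame({
    'y':  [1, 2, 3, 4],
    'X3': ['a', 'b', 'a', 'c'],
    'X4': ['a', 'a', 'b', 'b'],
    'X6': ['b', 'c', 'd', 'a'],
})

# Encode all three columns in one call. Each dummy column is named
# '<source column>_<label>', e.g. 'X3_a', so names cannot clash.
encoded = pd.get_dummies(df, columns=['X3', 'X4', 'X6'])

print(encoded.columns.tolist())
```

The original X3, X4 and X6 columns are dropped from the result and replaced by their dummies.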
I have a dataframe with numerous float columns. I want to filter the dataframe, keeping only the values that lie between the High and Low columns of the same dataframe.
I know how to do this when the conditions are one column compared to another column. But there are 102 columns, so I cannot write a condition for each column. And all my research just illustrates how to compare two columns and not one column against all others (or I am not typing the right search terms).
I tried df = df[(df['High'] <= df[DFColRBs]) & (df['Low'] >= df[DFColRBs])].copy(), but it erases everything.
I also tried: booleanselction = df[df[DFColRBs].between(df['High'], df['Low'])]
and: df = df[(df[DFColRBs].ge(df['Low'])) & (df[DFColRBs].le(df['Low']))].copy()
and:
BoolMatrix = (df[DFColRBs].ge(DF_copy['Low'], axis=0)) & (df[DFColRBs].le(DF_copy['Low'], axis=0))
df= df[BoolMatrix].copy()
But it erases everything in the dataframe, even the 3 columns that are not included in the list.
I appreciate the guidance.
Example Dataframe:
High Low Close _1m_21 _1m_34 _1m_55 _1m_89 _1m_144 _1m_233 _5m_21 _5m_34 _5m_55
0 1.23491 1.23456 1.23456 1.23401 1.23397 1.23391 1.2339 1.2337 1.2335 1.23392 1.23363 1.23343
1 1.23492 1.23472 1.23472 1.23422 1.23409 1.234 1.23392 1.23375 1.23353 1.23396 1.23366 1.23347
2 1.23495 1.23479 1.23488 1.23454 1.23422 1.23428 1.23416 1.23404 1.23372 1.23415 1.234 1.23367
3 1.23494 1.23472 1.23473 1.23457 1.23425 1.23428 1.23417 1.23405 1.23373 1.23415 1.234 1.23367
Based on what you've said in the comments, best to split the df into the pieces you want to operate on and the ones you don't, then use matrix operations.
tmp_df = DF_copy.iloc[:, 3:].copy()
# or tmp_df = DF_copy[DFColRBs].copy()
# mask by comparing test columns with the high and low columns
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
# combine the masked df with the original cols
DF_copy2 = pd.concat([DF_copy.iloc[:, :3], tmp_df.where(m)], axis=1)
# or replace DF_copy.iloc[:, :3] with DF_copy.drop(columns=DFColRBs)
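Here is a small self-contained sketch of the same idea, using made-up numbers and hypothetical column names (ma_a, ma_b) rather than the real data. It may also explain why df[BoolMatrix] blanked the untouched columns: as far as I know, indexing a dataframe with a boolean dataframe behaves like df.where(mask), and columns missing from the mask are treated as False.

```python
import pandas as pd

# Illustrative values only; 'ma_a' and 'ma_b' stand in for the 102 test columns.
df = pd.DataFrame({
    'High':  [10, 10, 10],
    'Low':   [5, 5, 5],
    'Close': [7, 8, 9],
    'ma_a':  [6, 11, 5],
    'ma_b':  [4, 7, 10],
})
test_cols = ['ma_a', 'ma_b']

# Row-wise comparison: axis=0 aligns the High/Low Series with the row index.
mask = df[test_cols].le(df['High'], axis=0) & df[test_cols].ge(df['Low'], axis=0)

# Keep the other columns untouched and blank out-of-range values with NaN.
filtered = pd.concat([df.drop(columns=test_cols), df[test_cols].where(mask)], axis=1)

print(filtered)
```

Values outside the Low..High band become NaN, while High, Low and Close are left as they were.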
I have a dataframe like this:
df = pd.DataFrame({'id':[10,20,30,40],'text':['some text','another text','random stuff', 'my cat is a god'],
'A':[0,0,1,1],
'B':[1,1,0,0],
'C':[0,0,0,1],
'D':[1,0,1,0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1. The real dataframe has 100k records.
For example, column A is related to the 3rd and 4th rows of text, because they are labeled 1. In the same way, A is not related to the 1st and 2nd rows of text, because they are labeled 0.
What I need to do is sample this dataframe so that each feature appears the same, or about the same, number of times.
In this case, feature C has only one occurrence, so I need to filter all the other columns so that I have one text with A, one text with B, one text with C, etc.
Ideally, I could set, for example, n=100, meaning I want to sample so that I have 100 records for each of the features.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s in each column.
For example, with a final dataset of 1k records, I would like to keep all columns from A to the final_column, each with the same number of 1s and 0s. To accomplish this I will need to randomly discard rows (text and id) only.
The approach I was trying was to find the feature with the lowest counts of 1 and 0 and then use this value as a threshold.
Edit 1: One possible way I thought is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum as the threshold to filter the text column. I don't know how to do this filtering step.
Thanks
The exact output you expect is unclear, but assuming you want to get 1 random row per letter with 1 you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))
 .stack()
 .groupby(level=-1).sample(n=1)
 .reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
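If more than one row per label is wanted, a variation along these lines might work. This is only a sketch: sample_per_label is a hypothetical helper, and it uses melt plus a filter instead of stack so that it does not rely on stack dropping NaNs. It takes up to n rows per label and falls back to all available rows for rare labels such as C.

```python
import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})

def sample_per_label(frame, label_cols, n, seed=0):
    """Hypothetical helper: pick up to n rows for each label that is set to 1."""
    # Long format: one row per (id, text, label) pair, keeping only the 1s.
    long = frame.melt(id_vars=['id', 'text'], value_vars=label_cols,
                      var_name='label', value_name='flag')
    long = long[long['flag'] == 1]
    # Sample each label separately, capped at the number of rows available.
    parts = [grp.sample(n=min(n, len(grp)), random_state=seed)
             for _, grp in long.groupby('label')]
    return pd.concat(parts).reset_index(drop=True)

sampled = sample_per_label(df, ['A', 'B', 'C', 'D'], n=2)
print(sampled)
```

Note that the same text can be picked for several labels, so this balances how often each label is sampled rather than guaranteeing equal 1/0 counts in every column.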
I have multiple time series in csv files from NetLogo model runs. I would like to join those series into one dataframe so that I can draw a boxplot showing the variation across simulation runs. The x values in each csv are the time iterations (integers). The y values are the values of a particular measure in the model, e.g., population count. I can join the csvs with concat, but then there are repeated column names for the y variables. My thought is to combine columns with the same name into one column holding a list of numbers (y values). Then I can pass that x, y to boxplot to plot that variable across time with its variation (median, etc.). Data is of the form:
x population groups color
0 0 0.00 0.00 0.00
1 1 74.47 42.48 40.96
2 2 74.46 42.48 40.96
would become
x population groups color
0 0 [0.00, 1.2] [0.00, 5] [0.00, 4]
1 1 [74.47, 3.2] [42.48, 55] [40.96, 55]
2 2 [74.46, NaN] [42.48, NaN] [40.96, NaN]
There are multiples of this dataframe from different csv files (thousands). The x axis value can have a different maximum time value for different runs / csvs.
How do I combine the dataframes so that I get one dataframe with a list of y values for a given y (column) at each x value? There will be NaNs for some y values for runs that ended early. Note that there are multiple y columns, and that each column is a separate boxplot (overlaid on the same plot).
I have tried concat, join, merge, and not been able to convert multiple columns with the same or different names into one column with a list of values rather than a single value.
Or, is there even a better way to do what I want to do with the data?
The answer ended up being simpler than I expected. Insight into how to do this came from this answer.
Make a list of the time series dataframes: dl = [d1,d2,d3,...]
Concatenate the dataframes: dn = pd.concat(dl, axis=1)
Create a new column with the list of values:
dn['new'] = dn['data column name'].values.tolist()
This generates the new column with the list of values that I can now use to make a box plot.
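The steps above can be sketched as follows, with two tiny made-up runs standing in for the csv files. Setting x as the index before concatenating is my own addition (an assumption about the data), so that runs of different length line up on the time step and shorter runs are padded with NaN. The column name population_list is also just illustrative.

```python
import pandas as pd

# Two toy runs; the second one ends a step early.
d1 = pd.DataFrame({'x': [0, 1, 2],
                   'population': [0.00, 74.47, 74.46],
                   'groups': [0.00, 42.48, 42.48]})
d2 = pd.DataFrame({'x': [0, 1],
                   'population': [1.2, 3.2],
                   'groups': [5.0, 55.0]})

# Align on the time step so that rows from different runs match up.
dl = [d.set_index('x') for d in (d1, d2)]
dn = pd.concat(dl, axis=1)

# Selecting a repeated name returns every column with that name,
# so each row becomes a list with one value per run.
population_runs = dn['population']
dn['population_list'] = population_runs.values.tolist()

print(dn['population_list'])
```

Depending on the plotting library, the list column may not be needed at all: population_runs.T has one column per time step, which is the shape DataFrame.boxplot expects for one box per x value.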
I've created an output variable 'a = pd.Series()', then run a number of simulations using a for loop that append the results of the simulation, temporarily stored in 'x', to 'a' in successive columns, each renamed to coincide with the simulation number, starting at the zero-th position, using the following code:
a = pandas.concat([a, x.rename(sim_count)], axis=1)
For some reason, the resulting dataframe includes a column of "NaN" values to the left of my first column of simulated results that I can't get rid of, as follows (example shows the results of three simulations):
0 0 1 2
0 NaN 0.136799 0.135325 -0.174987
1 NaN -0.010517 0.108798 0.003726
2 NaN 0.116757 0.030352 0.077443
3 NaN 0.148347 0.045051 0.211610
4 NaN 0.014309 0.074419 0.109129
Any idea how to prevent this column of NaN values from being generated?
Basically, by creating your output variable via pd.Series() you are creating an empty dataset. This is carried over in the concatenation, with the empty dataset's size being defined as the same size (well, same number of rows) as x[sim_count]. The only way Python/Pandas knows to represent this "empty" series is by using a series of NaN values. When you concatenate you are effectively saying: I want to add my new dataframe/series onto the "empty" series...and the empty series just gets NaN.
A more effective way of doing this is to assign "a" to a dataframe then concatenate.
a = pd.DataFrame()
a = pd.concat([a, x.rename(sim_count)], axis=1)
You might be asking yourself why this works while pd.Series() forces a column of NaNs. My understanding is that an empty dataframe has no columns, so it contributes nothing when concatenated along axis=1 (i.e. you are putting your new data INTO an empty dataframe), whereas with pd.concat([pd.Series(), x.rename(sim_count)], axis=1) the empty series still counts as one column, so pandas retains it and adds the new data ONTO "a". Hence the column of NaNs.
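Another option, sketched here with random numbers standing in for the real simulation output (an assumption, since the simulation itself is not shown), is to avoid the empty starting object altogether: collect each renamed Series in a list and concatenate once after the loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

results = []
for sim_count in range(3):
    # Stand-in for one simulation run; replace with the real result.
    x = pd.Series(rng.normal(size=5))
    results.append(x.rename(sim_count))

# One concat at the end: no empty placeholder, so no NaN column.
a = pd.concat(results, axis=1)

print(a)
```

This also tends to be faster than concatenating inside the loop, because the growing frame is not copied on every iteration.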
I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops are performing correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT the values are for the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
    #listname = i[4:] + '_norm'
    df2 = pd.read_excel(i,header=0,index_col=None, skip_blank_lines=True)
    df3 = df2.dropna(axis=0, how='any')
    cols = []
    for column in df3:
        cols.append(column)
    for x in cols:
        listname = x + ' norm'
        maxval = df3[x].max()
        print(maxval)
        mylist = []
        for j in df3[x]:
            findNL = (j/maxval)
            mylist.append(findNL)
        df3[listname] = mylist
    saveloc = 'E:/test/'
    filename = i[:-18] + '_Normalised.xlsx'
    df3.to_excel(saveloc+filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You don't need a for loop for this. You can simplify your code:
# Make example dataframe, this is not provided
df = pd.DataFrame({'col1':[1, 2, 3, 4],
'col2':[5, 6, 7, 8]})
print(df)
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8
Now we can use DataFrame.apply to divide each column by its maximum, use add_suffix to give the new columns a _norm suffix, and then concat the original and new columns into one final dataframe:
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
col1 col2 col1_norm col2_norm
0 1 5 0.25 0.625
1 2 6 0.50 0.750
2 3 7 0.75 0.875
3 4 8 1.00 1.000
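To apply this to several spreadsheets, something like the sketch below might do. The helper name add_normalised_columns and the in-memory frames dictionary are hypothetical stand-ins for the files read with pd.read_excel. It assumes that only the numeric columns should be normalised, and it keeps the ' norm' suffix from the question.

```python
import pandas as pd

def add_normalised_columns(df):
    """Hypothetical helper: append '<column> norm' = value / column maximum."""
    numeric = df.select_dtypes('number')
    # Dividing a DataFrame by the Series of column maxima aligns on column names.
    normed = (numeric / numeric.max()).add_suffix(' norm')
    return pd.concat([df, normed], axis=1)

# Stand-ins for the spreadsheets; in practice each would come from pd.read_excel.
frames = {
    'sheet_a': pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}),
    'sheet_b': pd.DataFrame({'col1': [2.0, 4.0, None], 'col2': [1.0, 3.0, 5.0]}),
}

# Drop incomplete rows first, as in the question, then normalise each sheet.
normalised = {name: add_normalised_columns(frame.dropna(axis=0, how='any'))
              for name, frame in frames.items()}

print(normalised['sheet_a'])
```

Each result could then be written out with to_excel, using whatever output path and file name suit your setup.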
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't notable.
Thanks for your help @Erfan