I have a CSV file which is missing the first two column names, so it looks like this:
,,MedKD,AvgKD,Q,AKD,PKD(.2),GPKD(.2),FS
Genomics,First Query,0.007704,0.008301,0.005379,0.002975,0.000551,0.000547,0.000100
Genomics,PayOff,44,51,62,17,82,3,-
Genomics,Convergence,-,-,*,*,4062,*,-
Genomics,Robustness,-,-,8E-07,3E-07,5E-08,1E-08,-
Genomics,Time,0.968427,1.007088,1.445911,0.313453,1.088019,1.142605,2.277355
Power,First Query,0.000084,0.000049,0.000034,0.000035,0.000016,0.000016,0.000014
Power,PayOff,4,2,2,1,0,0,-
Power,Convergence,-,-,7,12,16,1993,-
Power,Robustness,-,-,9E-11,7E-11,2E-11,2E-11,-
Power,Time,0.017216,0.023430,0.019345,0.022128,0.017878,0.019799,0.044971
When read into a pandas DataFrame, the first two columns therefore end up without proper names.
I have tried all sorts of ways to give the first two columns a name, but every attempt overwrote the other column names instead of touching the first two.
Example:
m.columns = ['Dataset', 'Measure'] + m.columns[2:].tolist()
This only results in moving all the other ones to the right, starting with MedKD
How is it possible to insert those by using pandas?
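For what it's worth, a minimal sketch of one way to do this: pandas gives the empty header fields the placeholder names "Unnamed: 0" and "Unnamed: 1", so you can rename exactly those two and leave the rest alone (the CSV here is a shortened stand-in for the question's file):

```python
import io
import pandas as pd

csv_text = """,,MedKD,AvgKD
Genomics,First Query,0.007704,0.008301
Genomics,PayOff,44,51
"""

# pandas names the two empty header fields "Unnamed: 0" and "Unnamed: 1"
m = pd.read_csv(io.StringIO(csv_text))

# Rename just those two columns; the remaining names stay untouched
m = m.rename(columns={"Unnamed: 0": "Dataset", "Unnamed: 1": "Measure"})
print(m.columns.tolist())  # ['Dataset', 'Measure', 'MedKD', 'AvgKD']
```

Because `rename` matches by name rather than position, nothing shifts to the right.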
I have some data I imported from an Excel spreadsheet as a CSV. I created a DataFrame using pandas, and want to change a specific column. The column contains strings such as "5.15.1.0.0", which I want to change to floats like "5.15100".
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15.1.0.0)
This, however, does not work: when I reprint the DataFrame after calling replace, no changes have been made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the "." but I'd prefer not to do it this way as some of the strings represent numbers of different lengths and decimal place values.
replace takes an inplace parameter, which defaults to False. Setting it to True changes the DataFrame in place; the replaced values can then be cast to float.
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
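Since the strings have different lengths, a per-value replace won't scale. A hedged sketch of a generic alternative: keep the first dot as the decimal point, drop the rest, and convert (the sample values other than "5.15.1.0.0" are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Fix versions": ["5.15.1.0.0", "4.2.0", "10.1.2.3"]})

def version_to_float(s):
    # Keep the first dot as the decimal point and join the rest:
    # "5.15.1.0.0" -> "5.15100" -> 5.151
    head, *rest = s.split(".")
    return float(head + "." + "".join(rest)) if rest else float(head)

df["Fix versions"] = df["Fix versions"].map(version_to_float)
print(df["Fix versions"].tolist())  # [5.151, 4.2, 10.123]
```

This handles any number of dot-separated segments without hard-coding each string.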
I have a dataframe with 2 duplicate columns. I need to rename one of the column based on the position given in a configuration file. Using rename() is not working because it is renaming both the columns, even though I am passing the position of the column which needs to be renamed. Could you please suggest how to achieve this?
I want the logic to be generic as I have multiple files and each file has duplicates but different column names which are mentioned in the configuration file.
columns - state city country state
config file -
position HEADER
0 statename
Df.rename(columns = {Df.columns[position]:row['HEADER']}, inplace = True)
It is not working because both the columns are renamed even when position is passed.
IIUC you can convert the column names to a NumPy array with to_numpy and change the name at the given position there. Because the assignment is positional, only the column at that position changes, even when the names are duplicated. Unlike to_list, this also lets you assign several positions at once:
Df.columns.to_numpy()[position] = row['HEADER']
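A small sketch of the idea, using the question's column layout; position and statename stand in for the values read from the configuration file. To be safe about whether to_numpy returns a view or a copy, the array is assigned back to df.columns explicitly:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=["state", "city", "country", "state"])
position, new_name = 0, "statename"   # would come from the config file

cols = df.columns.to_numpy()          # positional, so duplicates are no problem
cols[position] = new_name
df.columns = cols                     # reassign explicitly to be safe
print(df.columns.tolist())  # ['statename', 'city', 'country', 'state']
```

Only the first "state" is renamed; the duplicate at the end is untouched, which rename's name-based mapping cannot do.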
This is not about dropping columns whose name contains a string.
I have a dataframe with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as invalid value encountered in double_scalars (XYZ), where XYZ is a filler name for the column name.
I would like to delete all columns that contain, in any of their elements, the string 'invalid'.
Purging columns with strings in general would work too. What I want is to clean it up so I can fit a machine learning model to it, so removing any/all columns that are not boolean or real would work.
This must be a duplicate question, but I can only find answers to how to remove a column with a specific column name.
You can use df.select_dtypes(include=[float,bool]) or df.select_dtypes(exclude=['object'])
Link to docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
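A quick sketch of the dtype-based approach on toy data (the column names and values are made up): string columns come in as object dtype, so excluding it keeps only the numeric and boolean columns.

```python
import pandas as pd

df = pd.DataFrame({
    "score": [0.1, 0.2],            # float
    "flag": [True, False],          # bool
    "note": ["invalid...", "ok"],   # object (string)
})

# Keep only numeric/boolean columns; string (object) columns are dropped
clean = df.select_dtypes(exclude=["object"])
print(clean.columns.tolist())  # ['score', 'flag']
```

Note this drops every string column, whether or not it contains 'invalid'; the question says that is acceptable here.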
Use apply to build a mask checking whether each column contains 'invalid', then pass that mask as the column indexer of .loc:
df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]
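The one-liner above, demonstrated on a tiny made-up frame: apply produces a boolean DataFrame, .any() collapses it to one flag per column, and negating it keeps only the clean columns.

```python
import pandas as pd

df = pd.DataFrame({
    "good": [1.0, 2.0],
    "bad": ["invalid value encountered in double_scalars (X)", "ok"],
})

# True for columns where no element contains the string 'invalid'
mask = ~df.apply(lambda col: col.astype(str).str.contains("invalid")).any()
df = df.loc[:, mask]
print(df.columns.tolist())  # ['good']
```

astype(str) is needed so that str.contains also works on numeric columns without raising.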
This is my excel file:
Here I read the entire column A of the Excel sheet named "Lazy_eight". The problem is that in my different sheets, column A has a different number of elements, so I want to import only the numbers without specifying the length of the column vector.
I use the function readmatrix with the following syntax in order to read the entire column:
p_time = readmatrix('Input_signals.xlsx','Sheet','Lazy_eight','Range','A:A')
I get this in the MATLAB workspace:
So, I wish to give the readmatrix function only the first element of the column I want to import, and have it stop at the last element without specifying its coordinate, in order to avoid the NaN values that you can see in the last image. I want to import only the numbers, without the NaN values.
I cannot specify the initial and the last element (in this way: 'Range','A3:A13') because in every sheet column A (like the other ones) has a different number of elements.
I've solved the problem by using the rmmissing function, which removes NaN values from an array:
p_time = rmmissing(readmatrix('Input_signals.xlsx','Sheet','Lazy_eight','Range','A:A'))