Pandas read_excel removes columns under empty header - python-3.x

I have an Excel file where A1,A2,A3 are empty but A4:A53 contains column names.
In R, when you read that data, the column names for A1, A2, A3 would be "X_1, X_2, X_3", but pandas.read_excel simply skips the first three columns, ignoring them. The problem is that the number of columns in each file is dynamic, so I cannot hard-code the column range, and I cannot edit the files to add dummy names for A1, A2, A3.

Use parameter skip_blank_lines=False, like so:
pd.read_excel('your_excel.xlsx', header=None, skip_blank_lines=False)
This Stack Overflow question (finally) pointed me in the right direction:
Python Pandas read_excel doesn't recognize null cell
The pandas.read_excel docs don't document this keyword explicitly, but you can find it in the general IO docs here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table

A quick fix would be to pass header=None to pandas' read_excel() function, manually insert the missing values into the first row (which now contains the column names), then assign that row to df.columns and drop it afterwards. Not the most elegant way, but I don't know of a built-in solution to your problem.
EDIT: by "manually insert" I mean some massaging with fillna(), since this appears to be an automated process of some sort
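A minimal sketch of that approach, with a small in-memory frame standing in for read_excel(..., header=None); the placeholder names X_0, X_1, ... are an assumption (echoing R's behaviour), and the blank cells are filled with a boolean mask rather than fillna():

```python
import pandas as pd

# Stand-in for pd.read_excel(path, header=None): the first row holds the
# header, with the leading cells blank (None).
raw = pd.DataFrame([[None, None, None, "name1", "name2"],
                    [1, 2, 3, 4, 5]])

# Fill the blank header cells with placeholder names, R-style.
header = raw.iloc[0].copy()
header[header.isna()] = [f"X_{i}" for i in range(header.isna().sum())]

# Promote the patched row to column names and drop it from the data.
raw.columns = header
df = raw.drop(index=0).reset_index(drop=True)
print(list(df.columns))  # ['X_0', 'X_1', 'X_2', 'name1', 'name2']
```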

I realize this is an old thread, but I solved it by specifying the column names and naming the final empty column, rather than importing with no names and then having to deal with a row of names in the data (I also used usecols). See below:
use_cols = 'A:C'  # must select as many columns as there are entries in names
column_names = ['Col Name1', 'Col Name 2', 'Empty Col']
df = pd.read_excel(self._input_path, usecols=use_cols, names=column_names)

Related

Pandas: Add missing column name

I have a csv file which is missing the first two column names, so it looks like this:
,,MedKD,AvgKD,Q,AKD,PKD(.2),GPKD(.2),FS
Genomics,First Query,0.007704,0.008301,0.005379,0.002975,0.000551,0.000547,0.000100
Genomics,PayOff,44,51,62,17,82,3,-
Genomics,Convergence,-,-,*,*,4062,*,-
Genomics,Robustness,-,-,8E-07,3E-07,5E-08,1E-08,-
Genomics,Time,0.968427,1.007088,1.445911,0.313453,1.088019,1.142605,2.277355
Power,First Query,0.000084,0.000049,0.000034,0.000035,0.000016,0.000016,0.000014
Power,PayOff,4,2,2,1,0,0,-
Power,Convergence,-,-,7,12,16,1993,-
Power,Robustness,-,-,9E-11,7E-11,2E-11,2E-11,-
Power,Time,0.017216,0.023430,0.019345,0.022128,0.017878,0.019799,0.044971
When read into a pandas dataframe, the first two columns therefore end up without names (screenshot not included).
I have tried all sorts of ways to give the first two columns a name, but every attempt either overwrote the other column names or left the first two untouched.
Example:
m.columns = ['Dataset', 'Measure'] + m.columns[2:].tolist()
This only shifts all the other names one position to the right, starting with MedKD.
How is it possible to insert those by using pandas?
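One possible approach, sketched against a small in-memory copy of the CSV: pandas names empty leading header fields "Unnamed: 0", "Unnamed: 1", and so on, so those two columns can be renamed directly without shifting the other headers.

```python
import io
import pandas as pd

# A trimmed in-memory copy of the CSV above.
csv = """,,MedKD,AvgKD
Genomics,First Query,0.007704,0.008301
Genomics,PayOff,44,51
"""

m = pd.read_csv(io.StringIO(csv))
# The empty header fields come in as "Unnamed: 0" / "Unnamed: 1";
# rename them in place of inventing a new column list.
m = m.rename(columns={"Unnamed: 0": "Dataset", "Unnamed: 1": "Measure"})
print(list(m.columns))  # ['Dataset', 'Measure', 'MedKD', 'AvgKD']
```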

Change data in Pandas dataframe by column

I have some data I imported from an Excel spreadsheet as a csv. I created a dataframe using Pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats like 5.15100.
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15.1.0.0)
This, however, does not work. When I print the dataframe again after calling replace, it shows the same dataframe with no changes made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the "." but I'd prefer not to do it this way as some of the strings represent numbers of different lengths and decimal place values.
Add the parameter inplace, which defaults to False. Setting it to True changes the dataframe in place (without it, replace returns a new dataframe that your code never assigns back); the replaced values can then be type cast.
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
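A sketch of the full round trip on a hypothetical column (the second version string and the mapping to "1.2000" are made up for illustration): replace the string forms, then cast the column to float in one assignment, which avoids inplace entirely.

```python
import pandas as pd

# Hypothetical data matching the question's format.
df = pd.DataFrame({"Fix versions": ["5.15.1.0.0", "1.2.0.0.0"]})

# Replace the version strings with their numeric forms, then cast to float.
df["Fix versions"] = (
    df["Fix versions"]
    .replace({"5.15.1.0.0": "5.15100", "1.2.0.0.0": "1.2000"})
    .astype(float)
)
print(df["Fix versions"].tolist())  # [5.151, 1.2]
```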

How to rename one column(based on given position) if there are duplicate column names

I have a dataframe with 2 duplicate columns. I need to rename one of them based on the position given in a configuration file. Using rename() does not work because it renames both columns, even though I am passing the position of the column that needs to be renamed. Could you please suggest how to achieve this?
I want the logic to be generic as I have multiple files and each file has duplicates but different column names which are mentioned in the configuration file.
columns - state city country state
config file -
position HEADER
0 statename
Df.rename(columns = {Df.columns[position]:row['HEADER']}, inplace = True)
It is not working because both the columns are renamed even when position is passed.
IIUC you can convert the column names to a NumPy array with to_numpy and alter them there. This changes the column name at the specified position without reassigning all the column names. Unlike to_list, to_numpy also lets you assign to multiple positions at once:
Df.columns.to_numpy()[position] = row['HEADER']
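A sketch of that idea on a hypothetical frame with a duplicate "state" column; for safety this version assigns the edited array back to df.columns explicitly rather than relying on in-place mutation of the index's buffer:

```python
import pandas as pd

# Hypothetical frame with duplicate column names, as in the question.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["state", "city", "country", "state"])

position, new_name = 0, "statename"  # values the config file would supply

# Pull the names into a numpy array, edit one position, assign back:
# only the column at `position` is renamed, the duplicate is untouched.
cols = df.columns.to_numpy()
cols[position] = new_name
df.columns = cols
print(list(df.columns))  # ['statename', 'city', 'country', 'state']
```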

How to drop columns from a pandas DataFrame that have elements containing a string?

This is not about dropping columns whose name contains a string.
I have a dataframe with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as "invalid value encountered in double_scalars (XYZ)", where XYZ is a stand-in for the column name.
I would like to delete all columns that contain, in any of their elements, the string "invalid".
Purging columns with strings in general would work too. What I want is to clean it up so I can fit a machine learning model to it, so removing any/all columns that are not boolean or real would work.
This must be a duplicate question, but I can only find answers to how to remove a column with a specific column name.
You can use df.select_dtypes(include=[float,bool]) or df.select_dtypes(exclude=['object'])
Link to docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
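Applied to a small hypothetical frame: string columns come in as dtype object, so excluding object keeps only the numeric and boolean columns, which matches the "fit a model to it" goal.

```python
import pandas as pd

# Hypothetical frame: one float, one bool, one garbage string column.
df = pd.DataFrame({
    "score": [1.0, 2.0],
    "valid": [True, False],
    "log": ["invalid value encountered in double_scalars (x)", "ok"],
})

# Dropping object-dtype columns removes the string-valued garbage.
clean = df.select_dtypes(exclude=["object"])
print(list(clean.columns))  # ['score', 'valid']
```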
Use apply to build a mask checking whether each column contains "invalid", then pass the negated mask as the column indexer of .loc:
df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]
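The same one-liner, unpacked on a hypothetical two-column frame so the intermediate mask is visible:

```python
import pandas as pd

# Hypothetical frame: column "b" contains the garbage phrase.
df = pd.DataFrame({
    "a": [1, 2],
    "b": ["fine", "invalid value encountered in double_scalars (b)"],
})

# Per-column boolean frame, collapsed to one flag per column with any().
mask = df.apply(lambda col: col.astype(str).str.contains("invalid")).any()
df = df.loc[:, ~mask]
print(list(df.columns))  # ['a']
```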

How to import a column from an Excel file?

This is my Excel file (screenshot not included):
Here I read the entire column A of the Excel sheet named "Lazy_eight", but the problem is that I have different sheets in which column A has a different number of elements. So, I want to import only the numbers without specifying the length of the column vector.
I use the function readmatrix with the following syntax in order to read the entire column:
p_time = readmatrix('Input_signals.xlsx','Sheet','Lazy_eight','Range','A:A')
I get this in the MATLAB workspace (screenshot not included): the imported vector is padded with NaN values at the end.
So, I wish to give the readmatrix function only the first element of the column I want to import and have it stop at the last element, without specifying the coordinate of that last element, in order to avoid those trailing NaN values. I want to import only the numbers, without the NaN values.
I cannot specify the first and last elements (like this: 'Range', 'A3:A13') because in every sheet column A (like the other columns) has a different number of elements.
I've solved the problem by using the rmmissing function, which removes NaN values from an array.
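For reference, a sketch of that fix in MATLAB, assuming the same file and sheet names as above:

```matlab
p_time = readmatrix('Input_signals.xlsx', 'Sheet', 'Lazy_eight', 'Range', 'A:A');
p_time = rmmissing(p_time);  % drop the trailing NaN entries
```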
