How to import a column from an Excel file? - excel

This is my Excel file:
Here I read the entire column A of the Excel sheet named "Lazy_eight", but the problem is that column A has a different number of elements in each sheet. So I want to import only the numbers without specifying the length of the column vector.
I use the function readmatrix with the following syntax to read the entire column:
p_time = readmatrix('Input_signals.xlsx','Sheet','Lazy_eight','Range','A:A')
I get this in the MATLAB workspace:
So I would like to give readmatrix only the first element of the column I want to import and have it stop at the last element, without specifying the coordinate of the last element, in order to avoid the NaN values that you can see in the last image. I want to import only the numbers, without the NaN values.
I cannot give the first and last element (like this: 'Range', 'A3:A13') because in every sheet column A (like the other columns) has a different number of elements.

I've solved the problem by using the rmmissing function, which removes NaN values from an array.
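For readers working in Python rather than MATLAB, a rough analogue of the same idea is sketched below. It assumes pandas and NumPy, and the sample values are made up to stand in for a column that was read with trailing empty cells; it is not the asker's data.

```python
import numpy as np
import pandas as pd

# Stand-in for a column read from a sheet: numbers followed by empty cells (NaN).
p_time = pd.Series([0.0, 0.5, 1.0, 1.5, np.nan, np.nan])

# Drop the missing entries, whatever the length of the column is.
p_time_clean = p_time.dropna().to_numpy()
print(p_time_clean)
```

As with rmmissing, the column length never has to be specified: whatever is not a number is dropped after the read.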

Related

Change data in Pandas dataframe by column

I have some data I imported from an Excel spreadsheet as a CSV. I created a dataframe using pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats like "5.15100".
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15100)
This, however, does not work. When I reprint the dataframe after the replace calls, it shows the same dataframe with no changes made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the "." but I'd prefer not to do it this way as some of the strings represent numbers of different lengths and decimal place values.
By default, replace returns a new object and leaves the original unchanged. Add the parameter "inplace", which defaults to False; setting it to True changes the dataframe in place, and the column can then be cast to float.
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
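A minimal sketch of a more general conversion, assuming the intended rule is "keep the first dot as the decimal point and drop the rest" (the question implies this but does not state it). The helper name version_to_float and the second sample value are hypothetical. The result is assigned back to the column, an alternative to inplace=True that does not depend on how pandas treats in-place changes to a selected column.

```python
import pandas as pd


def version_to_float(version: str) -> float:
    """Keep the first dot as the decimal point and drop the remaining dots."""
    head, _, tail = version.partition(".")
    return float(f"{head}.{tail.replace('.', '')}")


df = pd.DataFrame({"Fix versions": ["5.15.1.0.0", "4.2.0.0.0"]})

# Assign the converted values back to the column.
df["Fix versions"] = df["Fix versions"].map(version_to_float)
print(df["Fix versions"].tolist())
```

Note that a float cannot preserve trailing zeros, so "5.15100" and 5.151 are the same number once converted.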

How to replace numbers with corresponding text values in Excel without making it treat a number individually?

Okay, so here's the thing. I have a list of compounds which are listed as compound_1, compound_2, compound_3 and so on. I would like to change them to the corresponding compound names using Excel. There are over 140 compounds, and each compound appears several times in the data table.
The problem arises when Excel treats a two-digit number as two individual digits instead of as one whole unit and substitutes values accordingly. I'll give you an example. Say compound_2 is Nicotine and compound_20 is quercetin: Excel treats 20 as 2 and 0, and the cell is replaced with Nicotine0 instead of quercetin.
Is there a way to do the replacement without this hassle?
P.S: The table looks something like this:
compound_1 Nicotine
compound_2 β- sitosterol
compound_3 D - mannitol
compound_4 Stigma Sterol
compound_5 Betulinic Acid
I've also attached a collage comparing the column values before (left) and after (right) I tried replacing compound_2 with Nicotine. (This pair is only an example, not the real one.)
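The behaviour described above is what substring replacement does: "compound_2" also matches the start of "compound_20". A small sketch in pandas (not Excel, and only an illustration using the example pair from the question) contrasts it with a whole-value lookup:

```python
import pandas as pd

# Example pair from the question; not the real compound table.
names = {"compound_2": "Nicotine", "compound_20": "quercetin"}
column = pd.Series(["compound_2", "compound_20", "compound_2"])

# Substring replacement reproduces the problem:
# "compound_2" is also found inside "compound_20".
naive = column.str.replace("compound_2", "Nicotine", regex=False)

# Looking up the whole cell value avoids it.
mapped = column.map(names)

print(naive.tolist())
print(mapped.tolist())
```

The same principle applies in a spreadsheet: match the entire cell content against a lookup table rather than replacing part of the text.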

Pandas read_excel removes columns under empty header

I have an Excel file where A1, A2, A3 are empty but A4:A53 contain column names.
In R, reading that data gives the columns for A1, A2, A3 the names "X_1, X_2, X_3", but pandas.read_excel simply skips the first three columns and ignores them. The problem is that the number of columns in each file is dynamic, so I cannot specify the column range, and I cannot edit the files to add dummy names for A1, A2, A3.
Use parameter skip_blank_lines=False, like so:
pd.read_excel('your_excel.xlsx', header=None, skip_blank_lines=False)
This Stack Overflow question (finally) pointed me in the right direction:
Python Pandas read_excel doesn't recognize null cell
The pandas.read_excel docs don't contain any info about this, since it is one of the pass-through keywords, but you can find it in the general IO docs here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
A quick fix would be to pass header=None to pandas' read_excel() function, manually insert the missing values into the first row (which will now contain the column names), then assign that row to df.columns and drop it afterwards. Not the most elegant way, but I don't know of a built-in solution to your problem.
EDIT: by "manually insert" I mean some messing around with fillna(), since this appears to be an automated process of some sort.
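The quick fix above can be sketched roughly as follows. The raw frame is a made-up stand-in for what pd.read_excel(path, header=None) might return, and promote_header is a hypothetical helper; the X_1, X_2, ... placeholder names follow the R convention mentioned in the question.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, header=None): the first row holds the
# header, with the first three header cells empty.
raw = pd.DataFrame(
    [
        [None, None, None, "name", "value"],
        [1, 2, 3, "a", 10],
        [4, 5, 6, "b", 20],
    ]
)


def promote_header(frame: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column names, filling empty cells with X_<n>."""
    header = frame.iloc[0]
    names = [
        f"X_{position + 1}" if pd.isna(cell) else str(cell)
        for position, cell in enumerate(header)
    ]
    result = frame.iloc[1:].reset_index(drop=True)
    result.columns = names
    return result


df = promote_header(raw)
print(list(df.columns))
```

This builds the names with a list comprehension instead of fillna(), which keeps the placeholder numbering tied to the column position.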
I realize this is an old thread, but I solved it by specifying the column names and naming the final empty column, rather than importing with no names and then having to deal with a row of names (I also used usecols). See below:
use_cols = 'A:L'
column_names = ['Col Name1', 'Col Name 2', 'Empty Col']
df = pd.read_excel(self._input_path, usecols=use_cols, names=column_names)

Sort a spread sheet via gspread

I have a Google spreadsheet full of names, dates, and some other numbers.
I made a desktop application that provides a nice UI for said info.
After using the application a bit I became slightly annoyed with the order the data was being displayed.
I have been researching all day and I cannot seem to find anything on the topic of sorting the spreadsheet from a python script.
All I need is some function that I can call every time someone adds something to it to re-sort the sheet.
I would very much appreciate it if someone could help me out.
Thanks in advance.
GSpread has a .sort() method to sort a worksheet using given sort orders. Here's how you can use it (Source - GSpread Docs):
Parameters:
specs (list) – The sort order per column. Each sort order represented by a tuple where the first element is a column index and the second element is the order itself: ‘asc’ or ‘des’.
range (str) – The range to sort in A1 notation. By default sorts the whole sheet excluding frozen rows.
Example:
# Sort sheet A -> Z by column 'B'
wks.sort((2, 'asc'))
# Sort range A2:G8 basing on column 'G' A -> Z
# and column 'B' Z -> A
wks.sort((7, 'asc'), (2, 'des'), range='A2:G8')
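A minimal sketch of the "re-sort after every addition" function the question asks for, built on the sort() call shown above and gspread's append_row(). The helper name add_and_resort and the sample row are hypothetical, and FakeWorksheet is only a stand-in so the sketch runs without Google credentials; in real use you would pass an actual gspread worksheet.

```python
class FakeWorksheet:
    """Hypothetical stand-in that records calls instead of talking to Google."""

    def __init__(self):
        self.rows = []
        self.sort_calls = []

    def append_row(self, values):
        self.rows.append(values)

    def sort(self, *specs, **kwargs):
        self.sort_calls.append((specs, kwargs))


def add_and_resort(worksheet, values, column=2, order="asc"):
    """Append one row, then re-sort the sheet by a 1-based column index."""
    worksheet.append_row(values)
    worksheet.sort((column, order))


wks = FakeWorksheet()
add_and_resort(wks, ["2024-01-05", "Alice", 3])
print(wks.sort_calls)
```

Keeping the append and the sort in one function means the application only has one place to call whenever someone adds an entry.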
You can use the pygsheets library, which uses Sheets API v4 at the lower level.
The my_worksheet.sort_range() function will help you, but it has some quirks:
Numbering in the start and end cells starts at 1, but basecolumnindex starts at 0.
You can pass a cell's address in two ways: as a text index (like "A3") or as a tuple with 2 elements (like (1, 3)). The second way doesn't work for me.
The range bounded by the start and end cells should contain the column passed in basecolumnindex.

Join a path variable to each row in the import csv file

I have many import files, which look like this:
So there are sales values per team member, but no period inside the file.
The period is encoded in the path, like this:
AllData\201501\Revenues.txt
AllData\201502\Revenues.txt
AllData\201503\Revenues.txt
I want to have the period from the path on each data row, so my final output table should look like this:
So I have to bring the period from the path into the file somehow.
The question of how to access the path is solved, with a perfect example, here:
How can I save a path criteria when I import from folders?
But there the period is still attached to the whole text, not to each row.
In the linked question you can change the custom column formula from:
Text.FromBinary([Content])
to
Text.Split(Text.FromBinary([Content]), "#(000a)")
(depending on how line breaks are represented, you may need to use "#(000d)#(000a)", i.e. carriage return followed by line feed, instead).
This will split the text at each new line, and you'll get a list of the name;value pairs. Click on the box with the two arrows next to the column name to expand the column. Each row should now have the period associated with the name;value pair. Finally, split the column by delimiter on the semicolon to separate the name from the value.
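Outside Power Query, the same "split into lines, then attach the folder's period to each line" idea might look like the sketch below. It assumes the AllData\<period>\Revenues.txt layout from the question and name;value lines as described above; the function name rows_with_period and the sample names and figures are made up, and the files are written to a temporary directory only so that the example is self-contained.

```python
import tempfile
from pathlib import Path


def rows_with_period(root: Path):
    """Return (period, name, value) for every line of every Revenues.txt."""
    rows = []
    for path in sorted(root.glob("*/Revenues.txt")):
        period = path.parent.name  # the folder name carries the period
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # skip empty lines
            name, value = line.split(";", 1)
            rows.append((period, name, value))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "AllData"
    samples = [("201501", "Alice;100\nBob;200\n"), ("201502", "Alice;150\n")]
    for period, text in samples:
        folder = root / period
        folder.mkdir(parents=True)
        (folder / "Revenues.txt").write_text(text, encoding="utf-8")
    rows = rows_with_period(root)

print(rows)
```

splitlines() handles both line-feed and carriage-return/line-feed endings, so the line-break question from the Power Query version does not arise here.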
There are two options; both involve horrible-looking formulas.
First option: we assume the period is always in the same position in the path string.
For the example, we want the number between the 1st and 2nd backslashes.
=TRIM(LEFT(SUBSTITUTE(MID(A1,FIND("|",SUBSTITUTE(A1,"\","|",1))+1,LEN(A1)),"\",REPT(" ",LEN(A1))),LEN(A1)))
If it's between a different pair of backslashes, alter the ,1 (the instance number passed to SUBSTITUTE) to tell the formula which backslash to start from. If the number of backslashes can vary, then we will have to try the second option.
Second option: we assume that those are the only digits in the path.
This formula will extract those numbers:
=SUMPRODUCT(MID(0&A1,LARGE(INDEX(ISNUMBER(--MID(A1,ROW($1:$25),1))* ROW($1:$25),0),ROW($1:$25))+1,1)*10^ROW($1:$25)/10)
Note that this will extract all the digits from the string. If the path contains other numbers, these will be included in the result, e.g. C:\2014Data\201401\Revenues.txt would return 2014201401.
If this doesn't take care of it, then it may be easier to add a column to the table yourself.
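For comparison, the logic behind the two formulas can be written out in a few lines of Python. This is only an illustration of what each option does, not a replacement for the Excel formulas; the function names are hypothetical.

```python
def period_between_slashes(path: str, index: int = 1) -> str:
    """Option 1: take the segment at a fixed position between backslashes."""
    return path.split("\\")[index]


def all_digits(path: str) -> str:
    """Option 2: keep every digit found anywhere in the path."""
    return "".join(ch for ch in path if ch.isdigit())


first = period_between_slashes(r"AllData\201501\Revenues.txt")
second = all_digits(r"C:\2014Data\201401\Revenues.txt")
print(first, second)
```

The second call shows the caveat noted above: digits from other folder names end up in the result as well.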

Resources