Pandas: Add missing column name - python-3.x

I have a CSV file that is missing the first two column names, so it looks like this:
,,MedKD,AvgKD,Q,AKD,PKD(.2),GPKD(.2),FS
Genomics,First Query,0.007704,0.008301,0.005379,0.002975,0.000551,0.000547,0.000100
Genomics,PayOff,44,51,62,17,82,3,-
Genomics,Convergence,-,-,*,*,4062,*,-
Genomics,Robustness,-,-,8E-07,3E-07,5E-08,1E-08,-
Genomics,Time,0.968427,1.007088,1.445911,0.313453,1.088019,1.142605,2.277355
Power,First Query,0.000084,0.000049,0.000034,0.000035,0.000016,0.000016,0.000014
Power,PayOff,4,2,2,1,0,0,-
Power,Convergence,-,-,7,12,16,1993,-
Power,Robustness,-,-,9E-11,7E-11,2E-11,2E-11,-
Power,Time,0.017216,0.023430,0.019345,0.022128,0.017878,0.019799,0.044971
When read into a pandas DataFrame, the first two columns therefore come in without proper names.
I have tried all sorts of ways to give those first two columns a name, but every time I only overwrote the existing names and never touched the first two.
Example:
m.columns = ['Dataset', 'Measure'] + m.columns[2:].tolist()
This only moves all the other names to the right, starting with MedKD.
How can I insert those two names using pandas?
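One way this can be approached (a sketch, not from the original thread; the file name is a placeholder): pandas names blank header fields "Unnamed: 0", "Unnamed: 1", and so on, so they can either be renamed directly or replaced wholesale while reading.
import pandas as pd

# Option 1: rename the auto-generated placeholder columns.
m = pd.read_csv("results.csv")
m = m.rename(columns={"Unnamed: 0": "Dataset", "Unnamed: 1": "Measure"})

# Option 2: discard the broken header row and supply all names up front.
cols = ["Dataset", "Measure", "MedKD", "AvgKD", "Q", "AKD",
        "PKD(.2)", "GPKD(.2)", "FS"]
m = pd.read_csv("results.csv", header=0, names=cols)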

Related

Change data in Pandas dataframe by column

I have some data I imported from an Excel spreadsheet as a CSV. I created a DataFrame using pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats like "5.15100".
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15100)
This however does not work. When I reprint the dataframe after calling replace, it shows the same dataframe with no changes made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the extra "."s, but I'd prefer not to do it this way, as the strings represent numbers of different lengths and decimal place values.
Add the parameter inplace, which defaults to False. Setting it to True changes the DataFrame in place; the column can then be type-cast.
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
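If the goal is to convert the whole column rather than replace values one by one, a small helper generalizes this to version strings of any length (a sketch under that assumption; df and the column name are taken from the question):
import pandas as pd

def version_to_float(s):
    # "5.15.1.0.0" -> "5.15100": keep the first dot and drop the rest.
    head, _, tail = s.partition(".")
    return float(head + "." + tail.replace(".", ""))

df["Fix versions"] = df["Fix versions"].map(version_to_float)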

How can I filter all values in a CSV spreadsheet that don't match one of hundreds of values from another spreadsheet?

I'm using Google Sheets and Google Colab together and trying to clean up data I've downloaded as a CSV file. The problem is that I want to filter out all rows that don't match one of the 100+ group names I've grabbed from another spreadsheet and currently have stored in an array. There are one or two other filters I'll want to apply, but those only have four or five possible values each.
I succeeded using pandas' isin():
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
Specifically I used something like the example here:
titanic[titanic["Pclass"].isin([2, 3])]
As I understand it, isin() provides booleans telling you whether each value is in the array. Used as above, it acts as a filter, keeping only the rows whose values match the items in the array.
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
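Applied to this case, the pattern might look like the sketch below (the file, column, and group names are made-up placeholders):
import pandas as pd

# The 100+ group names would normally be loaded from the other spreadsheet.
group_names = ["Alpha", "Beta", "Gamma"]

df = pd.read_csv("downloaded_data.csv")

# Keep only the rows whose "Group" value appears in group_names.
filtered = df[df["Group"].isin(group_names)]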

Pandas read_excel removes columns under empty header

I have an Excel file where A1,A2,A3 are empty but A4:A53 contain column names.
In R, when you read that data, the column names for A1,A2,A3 would become "X_1,X_2,X_3", but pandas.read_excel simply skips the first three columns, ignoring them entirely. The problem is that the number of columns in each file is dynamic, so I cannot specify the column range, and I cannot edit the files to add dummy names for A1,A2,A3.
Use the parameter skip_blank_lines=False, like so:
pd.read_excel('your_excel.xlsx', header=None, skip_blank_lines=False)
This stackoverflow question (finally) pointed me in the right direction:
Python Pandas read_excel doesn't recognize null cell
The pandas.read_excel docs don't contain any info about this, since it is one of the keyword arguments passed through to the parser, but you can find it in the general IO docs here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
A quick fix would be to pass header=None to pandas' read_excel() function, manually insert the missing values into the first row (which will now contain the column names), then assign that row to df.columns and drop it afterwards. Not the most elegant way, but I don't know of a built-in solution to your problem.
EDIT: by "manually insert" I mean some messing with fillna(), since this appears to be an automated process of some sort.
I realize this is an old thread, but I solved it by specifying the column names and naming the final empty column, rather than importing with no names and then having to deal with a row of names in it (I also used usecols). See below:
use_cols = 'A:L'
# note: names must supply one entry per column selected by usecols
column_names = ['Col Name1', 'Col Name 2', 'Empty Col']
df = pd.read_excel(self._input_path, usecols=use_cols, names=column_names)

Removing duplicates between multiple large CSV files

I am trying to find the best way to remove duplicates from large CSV files.
I receive CSV files of around 5 to 6 million rows every month.
I need to adjust these (I only need some of the columns, and I need to add some others).
The files also contain a lot of duplicate and incomplete rows.
I've come up with a solution in Python where I use a set, check for each row whether it's in the set, and change what needs changing.
Now, I get the second file, and it contains a lot of duplicates that are in the previous file.
I'm trying to find an efficient solution to remove duplicates within the file, and between the different files. In the end I want to have a list (table or csv file) that contains only the new entries for that month.
I would like to use Python, and I was thinking about using an SQLite database for storing the data, but I'm unsure which way would be most efficient.
I would use numpy.unique():
import numpy as np

# Load both files as strings so mixed text/number columns parse cleanly.
file1 = np.loadtxt("path/to/file1.csv", delimiter=",", dtype=str)
file2 = np.loadtxt("path/to/file2.csv", delimiter=",", dtype=str)

# Stack both arrays on top of each other, creating one giant array.
data = np.vstack((file1, file2))
data = np.unique(data, axis=0)
np.unique takes the entire array and returns only the unique elements. Make sure you set axis=0 so that it goes row by row and not cell by cell.
One caveat: This should work, but if there are several million rows, it may take a while. Still better than doing it by hand though! Good luck!
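If pandas is already in the toolchain, a merge with indicator=True can also separate the new month's rows from those already seen (a sketch, assuming both files share the same columns; the file names are placeholders):
import pandas as pd

old = pd.read_csv("path/to/file1.csv").drop_duplicates()
new = pd.read_csv("path/to/file2.csv").drop_duplicates()

# Left-merge on all shared columns; "_merge" marks rows found only in new.
merged = new.merge(old, how="left", indicator=True)
new_only = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
new_only.to_csv("new_entries.csv", index=False)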

Join a path variable to each row in the import csv file

I have many import files containing sales values per team member, but with NO period inside.
The period is coded in the Path like:
AllData\201501\Revenues.txt
AllData\201502\Revenues.txt
AllData\201503\Revenues.txt
I want to have the period from the path on each data row of my final output table.
So I must bring the period from the path into the file somehow.
The question of how to access the path is solved in a perfect example here:
How can I save a path criteria when I import from folders?
But there the period is still attached to the whole file's text, not to each row.
In the linked question you can change the custom column formula from:
Text.FromBinary([Content])
to
Text.Split(Text.FromBinary([Content]), "#(000a)")
(depending on how line breaks are represented, you may need to use "#(000d)#(000a)" instead).
This will split the text at each new line, and you'll get a list of the name;value pairs. Click on the box with the two arrows next to the column name to expand the column. Each row should now have the period associated with the name;value pair. Finally, split the column by delimiter on the semicolon to separate the name from the value.
There are two options, both involving horrible-looking formulas.
First option, we assume the paths are going to have the period in the same position in the string.
For the example, we want the number between the first and second backslashes.
=TRIM(LEFT(SUBSTITUTE(MID(A1,FIND("|",SUBSTITUTE(A1,"\","|",1))+1,LEN(A1)),"\",REPT(" ",LEN(A1))),LEN(A1)))
If it's between a different pair of backslashes, alter the ,1 to tell the formula which backslash to start from. If the number of backslashes can vary, then we will have to try the second option.
Second option, we assume that those are the only numbers in the path.
This formula will extract those numbers:
=SUMPRODUCT(MID(0&A1,LARGE(INDEX(ISNUMBER(--MID(A1,ROW($1:$25),1))* ROW($1:$25),0),ROW($1:$25))+1,1)*10^ROW($1:$25)/10)
Note that this will extract all the numbers from the string; if the path contains other numbers, they will be included in the result. E.g. C:\2014Data\201401\Revenues.txt would return 2014201401.
If this doesn't take care of it, it may be easier to add a column to the table yourself.
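For readers coming from pandas, the same result can be had by globbing the folder structure and taking the period from each file's parent directory (a sketch, assuming the AllData\<period>\Revenues.txt layout and semicolon-separated name;value rows described above):
import pandas as pd
from pathlib import Path

frames = []
for path in Path("AllData").glob("*/Revenues.txt"):
    df = pd.read_csv(path, sep=";", header=None, names=["Name", "Value"])
    df["Period"] = path.parent.name  # e.g. "201501", taken from the folder name
    frames.append(df)

result = pd.concat(frames, ignore_index=True)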
