Pandas - How to stop the first row of a csv file from being made the header when combining multiple csv files [duplicate] - python-3.x

I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), and then read the csv from that object.)

You want header=None; the False gets type-promoted to the int 0 (see the docs, emphasis mine):
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
a b c
0 0 1 2
1 3 4 5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
0 1 2
0 a b c
1 0 1 2
2 3 4 5
Note that in the latest version, 0.19.1, this now raises a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names
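The quoted docs also mention passing a list of ints to build a MultiIndex on the columns; here is a minimal sketch of that case (my own example, not from the original answer):
import io
import pandas as pd
t2 = """x,y,z
a,b,c
0,1,2"""
# Rows 0 and 1 together become a two-level column MultiIndex
df = pd.read_csv(io.StringIO(t2), header=[0, 1])
print(df.columns.tolist())  # [('x', 'a'), ('y', 'b'), ('z', 'c')]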

I think you need the parameter header=None in read_csv:
Sample:
import pandas as pd
from io import StringIO
temp=u"""a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp),header=None)
print (df)
0 1
0 a b
1 2 1
2 1 1

If you're using pd.ExcelFile to read all the sheets of an Excel file, then:
df = pd.ExcelFile("path_to_file.xlsx")
df.sheet_names # Lists the sheet names in the Excel file
df = df.parse(2, header=None) # Parse the sheet at index 2 (0-indexed) with header=None
df
Output:
0 1
0 a b
1 1 1
2 0 1
3 5 2
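Alternatively, pd.read_excel can load every sheet in one call; a small sketch (the file path is the same placeholder as above):
import pandas as pd
# sheet_name=None returns a dict of DataFrames keyed by sheet name
sheets = pd.read_excel("path_to_file.xlsx", sheet_name=None, header=None)
for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)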

You can set custom column names in order to prevent this:
Let's say you have two columns in your dataset; then:
df = pd.read_csv(your_file_path, names = ['first column', 'second column'])
You can also generate the column names programmatically if you have more columns, and pass that list to the names parameter, as in the sketch below.
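A minimal sketch of that (the file path and column count here are hypothetical):
import pandas as pd
# Hypothetical headerless file with 5 columns
names = ['col_{}'.format(i) for i in range(5)]
df = pd.read_csv('your_file_path.csv', header=None, names=names)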

Related

Problems reading a csv file [duplicate]

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file.
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try it yourself:
>>> import pandas as pd
>>> from io import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6
I don't have the reputation to comment yet, but I want to add to alko's answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I ran into the same issue when using skiprows while reading a csv file. I was writing skip_rows=1, which does not work (the parameter is named skiprows).
A simple example gives an idea of how to use skiprows while reading a csv file.
import pandas as pd
#skiprows=1 will skip first line and try to read from second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)
#print the data frame
df
All of these answers miss one important point: the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that contains the labels, next comes a line that describes the data types, and last comes the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ...
> agency_cd site_no datetime tz_cd 139719_00065 139719_00065_cd
> 5s 15s 20d 6s 14n 10s
> USGS 08041780 2018-05-06 00:00 CDT 1.98 A
It would be nice if there were a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). The original column names are skipped as well (index 0), and the first remaining line is used for column names. To assign column names, use the names=['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
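A callable can also keep the header line while skipping data rows; a sketch using the same csv string:
# Keep line 0 (the header) and skip every second data line (index 2 and 4)
pd.read_csv(StringIO(csv), skiprows=lambda x: x > 0 and x % 2 == 0)
# Output:
col1 col2
0 1 a
1 3 c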
skiprows=[1] will skip the second line, not the first one.

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data, but they are not in the same order. For example, if file 1 has the results of measurements A, B, C and D from River X printed in columns 1, 2, 3 and 4 respectively, but file 2 has the same measurements taken for a different river, River Y, printed in columns 6, 7, 8 and 9 respectively, is there a way to use pandas to reorganise one dataframe to match the layout of another (i.e. make it so that Sheet2 has the measurements for River Y printed in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally, not vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the presentation/layout of the data is erratic across files, it would be useful to be able to impose a single order on every spreadsheet without having to manually shift columns in Excel.
Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::-1]]  # reversed order
data = np.arange(9).reshape(3,3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list()  # Get a list of column names (use .sort() to sort it in place)
sorted_col_list = sorted(col_list, reverse=False)  # Use reverse=True to invert the order
def rearrange_df_cols(df, target_order):
    df = df[target_order]
    print(df)
    return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on what's above and apply it to all of your files/sheets, provided that all the column names exist (NB: they must be written identically).
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle, you should check the dimensions of your data, e.g. with df.shape, and based on the shape either use df.transpose() and then your function to reorder the column names, or use your function directly.
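A rough sketch of that check, reusing rearrange_df_cols and sorted_col_list from above (the transposition test is an assumption about how the horizontal sheets look):
def normalise_layout(df, target_order):
    # If the expected names appear in the index rather than the columns,
    # the sheet was laid out horizontally, so transpose it first
    if not set(target_order).issubset(df.columns) and set(target_order).issubset(df.index):
        df = df.transpose()
    return rearrange_df_cols(df, target_order)

normalise_layout(df_2, sorted_col_list)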

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in pandas DataFrames with a different increment for each column, so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1 Col2 Col3
0 0.3 0.2
1 1.1 1.2
2 2.2 2.4
3 3 3.1
but with hundreds of columns and thousands of values.
When graphing this on a line graph in Excel or matplotlib, the values overlap with each other, so I would like to separate the columns by adding a different constant to each column, like so:
Col1(+0) Col2(+10) Col3(+20)
0 10.3 20.2
1 11.1 21.2
2 12.2 22.4
3 13 23.1
By adding a constant to each column, increased in increments of 10 from one column to the next, I am able to see each line in one graph without overlap.
I thought of using loops and iterations to automate this value-adding process, but I couldn't find any previous solutions on Stack Overflow that address how I could change the increment value (e.g. adding 0 to Col1 in one loop, then adding 10 to Col2 in the next loop) between different columns, but not within the values of a column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd
#import CSV file
df = pd.read_csv ('data.csv')
#store csv data into dataframe
df1 = pd.DataFrame (data = df)
# Locate columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
Here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() over axis=1 with a list made from the length of df.columns:
df1 = df.add(list(range(0,len(df.columns)*10))[::10],axis=1)
Or, as @jezrael suggested, better:
df1=df.add(range(0,len(df.columns)*10, 10),axis=1)
print(df1)
Col1 Col2 Col3
0 0 10.3 20.2
1 1 11.1 21.2
2 2 12.2 22.4
3 3 13.0 23.1
Details :
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
I would recommend that you avoid looping over the data frame, as it is inefficient; instead, think of it as adding matrices.
e.g.
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: In case you want to increment by 0, 10, 20 rather than 10, 20, 30, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the 1 so there is only an end value -> the default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x

Appending values to a column in a loop

I have various files containing data. I want to extract one specific column from each file and create a new dataframe with one column containing all the extracted data.
So for example I have 3 files:
A B C
1 2 3
4 5 6
A B C
7 8 9
8 7 6
A B C
5 4 3
2 1 0
The new dataframe should only contain the values from column C:
C
3
6
9
6
3
0
So the column of the first file should be copied to the new dataframe, and the column from the second file should be appended to it.
My code looks like this so far:
import pandas as pd
import glob
for filename in glob.glob('*.dat'):
    df = pd.read_csv(filename, delimiter="\t", header=6)
    df1 = df["Bias"]
    print(df)
Now df1 is overwritten in each loop step. Would it be a good idea to create a temporary dataframe in each loop step and then copy the data to the new dataframe?
Any input is appreciated!
Use a list comprehension, or a for loop with append, to build a list of DataFrames; if you need only some columns, add the usecols parameter. Finally, concat them all together into one big DataFrame:
dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C']) for f in glob.glob('*.dat')]
Or:
dfs = []
for filename in glob.glob('*.dat'):
    df = pd.read_csv(filename, delimiter="\t", header=6, usecols=['C'])
    # if you need all columns:
    # df = pd.read_csv(filename, delimiter="\t", header=6)
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
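One caveat: glob.glob does not guarantee any particular file order, so if the row order of the combined frame matters, sort the file list first:
dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C'])
       for f in sorted(glob.glob('*.dat'))]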

Accumulate a column through pandas

I have multiple tab-delimited files, all having the same entries. I intend to read each file, choosing the first column as the index. My final table will have the first column as the index, mapped against the last column from all the files. For this, I wrote some pandas code, but it is not great. Is there an alternate way to do this?
import pandas as pd
df1 = pd.read_csv("FB_test.tsv",sep='\t')
df1_idx = df1.set_index('target_id')
df1_idx.drop(df1_idx[['length','eff_length','est_counts']],inplace=True, axis=1)
print(df1_idx)
df2 = pd.read_csv("Myc_test.tsv",sep='\t')
df2_idx = df2.set_index('target_id')
df2_idx.drop(df2_idx[['length','eff_length','est_counts']],inplace=True, axis=1)
print(df2_idx)
frames = [df1_idx, df2_idx]
results = pd.concat(frames, axis=1)
results
The output it generated was,
tpm
target_id
A 0
B 0
C 0
D 0
E 0
tpm
target_id
A 1
B 1
C 1
D 1
E 1
Out[18]:
target_id tpm tpm
A 0 1
B 0 1
C 0 1
D 0 1
E 0 1
How can I loop this so that I read each file and achieve the same output?
Thanks,
AP
I think you can use the parameters index_col and usecols in read_csv with a list comprehension. But you get duplicate column names (which is a problem for selecting), so it is better to add the keys parameter to concat; after flattening the resulting MultiIndex you get nice unique column names:
files = ["FB_test.tsv", "Myc_test.tsv"]
dfs = [pd.read_csv(f,sep='\t', index_col=['target_id'], usecols=['target_id','tpm'])
for f in files]
results = pd.concat(dfs, axis=1, keys=('a','b'))
results.columns = results.columns.map('_'.join)
results = results.reset_index()
print (results)
target_id a_tpm b_tpm
0 A 0 1
1 B 0 1
2 C 0 1
3 D 0 1
4 E 0 1
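If you would rather have the column suffixes reflect the source files instead of the generic ('a', 'b') keys, the file stems can serve as keys; a small sketch based on the same files:
import os
import pandas as pd
files = ["FB_test.tsv", "Myc_test.tsv"]
dfs = [pd.read_csv(f, sep='\t', index_col=['target_id'], usecols=['target_id','tpm'])
       for f in files]
# Use each file name without its extension (e.g. 'FB_test') as the key
results = pd.concat(dfs, axis=1, keys=[os.path.splitext(f)[0] for f in files])
results.columns = results.columns.map('_'.join)
print(results.columns.tolist())  # ['FB_test_tpm', 'Myc_test_tpm']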
To clean the code and use a looping mechanism, you can put both your file names and the columns you are dropping in two separate lists, and then use list comprehension on the file names to import each dataset. Subsequently, you concatenate the output of the list comprehension into one dataframe:
import pandas as pd
drop_cols = ['length','eff_length','est_counts']
filenames = ["FB_test.tsv", "Myc_test.tsv"]
results = pd.concat([pd.read_csv(filename,sep='\t').set_index('target_id').drop(drop_cols, axis=1) for filename in filenames], axis=1)
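For readability, the same one-liner can be expanded into a plain loop (same files and columns as above):
dfs = []
for filename in filenames:
    df = pd.read_csv(filename, sep='\t').set_index('target_id')
    dfs.append(df.drop(drop_cols, axis=1))
results = pd.concat(dfs, axis=1)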
I hope this helps.
