Problems reading a csv file [duplicate] - python-3.x

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how to avoid importing it, because the relevant argument seems ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file.
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?

You can try yourself:
>>> import pandas as pd
>>> from io import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6

I don't have the reputation to comment yet, but I want to add to alko's answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows

I ran into the same issue with skiprows while reading a csv file.
I was passing skip_rows=1, which will not work; the parameter is named skiprows.
Simple example gives an idea how to use skiprows while reading csv file.
import pandas as pd
# skiprows=1 will skip the first line and start reading from the second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)
#print the data frame
df

All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that is the labels, next comes a line that describes the data types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ...
> agency_cd  site_no   datetime          tz_cd  139719_00065  139719_00065_cd
> 5s         15s       20d               6s     14n           10s
> USGS       08041780  2018-05-06 00:00  CDT    1.98          A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
# comment='#' drops the commented header block; the format-specification row
# still comes through as the first data row, so drop it afterwards
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
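One caveat, as a small follow-up sketch (using the column names from the example above): because the format-specification row ("5s 15s 20d ...") is read as data before being dropped, the numeric columns end up as strings (object dtype), so reset the index and convert them afterwards.
ds = ds.reset_index(drop=True)
# the gauge-height column from the example; adjust to your own column names
ds['139719_00065'] = pd.to_numeric(ds['139719_00065'], errors='coerce')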

Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
import pandas as pd
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). The column names are skipped as well (index 0), and the first remaining line is used for column names. To set column names yourself, pass the names=['col1', 'col2'] parameter (a small sketch follows the example below):
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
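Since names was mentioned above, a small sketch of the same call with explicit column names (the skipped header row is then replaced):
pd.read_csv(StringIO(csv), skiprows=2, names=['col1', 'col2'])
# Output:
  col1 col2
0    2    b
1    3    c
2    4    d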
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d

skiprows=[1] will skip the second line, not the first one.

Related

How to separate specific strings from a text and add them as column names?

Here is an example that looks like the data I have, but with far fewer lines.
So imagine I have a txt file like this:
'''
Useless information 1
Useless information 2
Useless information 3
Measurement:
Len. (cm) :length of the object
Hei. (cm) :height of the object
Tp. :type of the object
~A DATA
10 5 2
8 7 2
5 6 1
9 9 1
'''
and I would like to put the values below '~A DATA' into a DataFrame. I already managed to get the DataFrame without column names (although it got a little messy, as there are some nonsense lines in my code), as you can see:
with open(r'C:\Users\Lucas\Desktop\...\text.txt') as file:
    for line in file:
        if line.startswith('~A'):
            measures = line.split()[len(line):]
            break
    df = pd.read_csv(file, names=measures, sep='~A', engine='python')
newdf = df[0].str.split(expand=True)
newdf
0 1 2
0 10 5 2
1 8 7 2
2 5 6 1
3 9 9 1
Now, I would like to put 'Len', 'Hei' and 'Tp' from the text as column names on the DataFrame. Just these measurement codes (without the consequent strings). How can I do that to have a df like this?
Len Hei Tp
0 10 5 2
1 8 7 2
2 5 6 1
3 9 9 1
One solution would be to take every line below the string 'Measurement:' (or beginning with the 'Len...' line) up to every line above the string '~A' (or ending with the 'Tp...' line), and then split each of those lines. But I don't know how to do that.
Solution 1: If you want to scrape the column names from the text file itself, you need to know from which line the column-name information starts, then read the file line by line and process only those lines that you know contain column names.
To answer the specific question you asked, assume the variable line contains one of those strings, say line = 'Len. (cm) :length of the object'. You can do regex-based splitting, where you split on any character that is not a digit or a letter.
import re
splited_line = re.split(r"[^a-zA-Z0-9]", line)  # add to the character class anything else you want to keep
print(splited_line)
This results in
['Len', '', '', 'cm', '', '', 'length', 'of', 'the', 'object']
Then, to get the column name, you pick the first element of the list, splited_line[0].
Solution 2: If you already know the column names, you could just do
df.columns = ['Len','Hei','Tp']
Here is the complete solution to what you are looking for:
import re

f = open('text.txt', 'rb')
flag = False
column_names = []
for line in f:
    splited_line = re.split(r"[^a-zA-Z0-9~]", line.decode('utf-8'))
    if splited_line[0] == "Measurement":
        flag = True
        continue
    elif splited_line[0] == "~A":
        flag = False
    if flag == True:
        column_names.append(splited_line[0])
f.close()
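As a quick follow-up sketch (assuming the newdf built in the question): the loop above leaves column_names == ['Len', 'Hei', 'Tp'], which you can then assign to that DataFrame.
newdf.columns = column_names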

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in pandas.Dataframes with different increments in each column so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1 Col2 Col3
0 0.3 0.2
1 1.1 1.2
2 2.2 2.4
3 3 3.1
but with hundreds of columns and thousands of values.
When graphing this on a line-graph on excel or matplotlib, the values overlap with each other, so I would like to separate each column by adding the same values for each column like so:
Col1(+0) Col2(+10) Col3(+20)
0 10.3 20.2
1 11.1 21.2
2 12.2 22.4
3 13 23.1
By adding the same value to one column and increasing by an increment of 10 over each column, I am able to see each line without it overlapping in one graph.
I thought of using loops and iterations to automate this value-adding process, but I couldn't find any previous solutions on Stackoverflow that address how to change the increment value between columns (e.g. adding 0 to Col1 in one loop, then adding 10 to Col2 in the next loop) without changing it within the values of a column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd
#import CSV file
df = pd.read_csv ('data.csv')
#store csv data into dataframe
df1 = pd.DataFrame (data = df)
# Locate columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
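For reference, a minimal sketch of the kind of loop the question is reaching for (assuming the df1 and columns variables from the code above): enumerate gives a running column counter, so each column gets its own offset.
for i, col in enumerate(columns.columns):
    df1[col] = df1[col] + i * 10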
IIUC, just use df.add() over axis=1 with a list built from the length of df.columns:
df1 = df.add(list(range(0,len(df.columns)*10))[::10],axis=1)
Or as #jezrael suggested, better:
df1=df.add(range(0,len(df.columns)*10, 10),axis=1)
print(df1)
Col1 Col2 Col3
0 0 10.3 20.2
1 1 11.1 21.2
2 2 12.2 22.4
3 3 13.0 23.1
Details :
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
I would recommend avoiding loops over the data frame, as they are inefficient; instead, think of it as adding two matrices.
e.g.
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: in case you want the increments to be 0, 10, 20 rather than 10, 20, 30, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the 1 so there is only an end value -> the default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x
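Note that the ones matrix isn't strictly required; NumPy broadcasting lets you add the offset vector to the DataFrame directly, a small sketch:
df + 10 * np.arange(df.shape[1])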

Pandas - How to skip the first row of a csv file to be made the header with combining multiple csv files [duplicate]

I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), then build the csv from that file object.)
You want header=None; the False gets type-promoted to int, i.e. 0. See the docs, emphasis mine:
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
a b c
0 0 1 2
1 3 4 5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
0 1 2
0 a b c
1 0 1 2
2 3 4 5
Note that in the latest version (0.19.1), this will now raise a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names
I think you need parameter header=None to read_csv:
Sample:
import pandas as pd
from io import StringIO
temp=u"""a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp),header=None)
print (df)
0 1
0 a b
1 2 1
2 1 1
If you're using pd.ExcelFile to read all the excel file sheets then:
df = pd.ExcelFile("path_to_file.xlsx")
df.sheet_names # Provide the sheet names in the excel file
df = df.parse(2, header=None) # Parse the sheet at index 2 (the third sheet) with header=None
df
Output:
0 1
0 a b
1 1 1
2 0 1
3 5 2
You can set custom column names to prevent this.
Say you have two columns in your dataset; then:
df = pd.read_csv(your_file_path, names = ['first column', 'second column'])
You can also generate column names programmatically if you have more columns, and pass that list to the names parameter (a small sketch follows).
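A small sketch of that (the column count here is hypothetical): generate generic names up front and pass them to names.
n_cols = 5  # hypothetical number of columns in your file
df = pd.read_csv(your_file_path, header=None, names=[f'col{i}' for i in range(n_cols)])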

pandas: split dataframe into multiple csvs

I have a large file, imported into a single dataframe in Pandas.
I'm using pandas to split up a file into many segments, by the number of rows in the dataframe.
eg: 10 rows:
file 1 gets [0:4]
file 2 gets [5:9]
Is there a way to do this without having to create more dataframes?
Assign a new column g here; you just need to specify how many items you want in each group. Here I am using 3.
df.assign(g=df.index//3)
Out[324]:
0 g
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 3
and you can use df[df.g == 1] to get the group you need (a small follow-up sketch is below).
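Building on that, a small sketch that writes one csv per group (assuming the default RangeIndex and a chunk size of 3):
for g, chunk in df.assign(g=df.index // 3).groupby('g'):
    chunk.drop(columns='g').to_csv(f"out{g}.csv", index=False)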
There are two ways of doing this. I believe you are looking for the former. Basically, we open a series of csv writers, then we write to the correct csv writer by using some basic math with the index, then we close all files.
A single DataFrame evenly divided into N number of CSV files
import pandas as pd
import csv, math
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values: 10 rows, one column
NUMBER_OF_SPLITS = 2
fileOpens = [open(f"out{i}.csv","w") for i in range(NUMBER_OF_SPLITS)]
fileWriters = [csv.writer(v, lineterminator='\n') for v in fileOpens]
for i, row in df.iterrows():
    fileWriters[math.floor((i/df.shape[0])*NUMBER_OF_SPLITS)].writerow(row.tolist())
for file in fileOpens:
    file.close()
More than one DataFrame evenly divided into N number of CSV files
import pandas as pd
import numpy as np
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values: 10 rows, one column
NUMBER_OF_SPLITS = 2
for i, new_df in enumerate(np.array_split(df, NUMBER_OF_SPLITS)):
    with open(f"out{i}.csv","w") as fo:
        fo.write(new_df.to_csv())
use numpy.array_split to split your dataframe dfX and save it in N csv files of equal size: dfX_1.csv to dfX_N.csv
import numpy as np

N = 10
for i, df in enumerate(np.array_split(dfX, N)):
    df.to_csv(f"dfX_{i + 1}.csv", index=False)
Iterating with iloc slices will also do the trick; a small sketch:
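A minimal sketch of that iloc approach (assuming you split by a fixed chunk size rather than a number of files):
chunk_size = 5
for i in range(0, len(df), chunk_size):
    df.iloc[i:i + chunk_size].to_csv(f"chunk_{i // chunk_size}.csv", index=False)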

Accumulate columns through pandas

I have multiple tab-delimited files, all with the same entries. I intend to read each file and use the first column as the index. My final table should have that first column as the index, mapped against the last column from each of the files. I wrote some pandas code for this, but it isn't great. Is there an alternative way to do it?
import pandas as pd
df1 = pd.read_csv("FB_test.tsv",sep='\t')
df1_idx = df1.set_index('target_id')
df1_idx.drop(df1_idx[['length','eff_length','est_counts']],inplace=True, axis=1)
print(df1_idx)
df2 = pd.read_csv("Myc_test.tsv",sep='\t')
df2_idx = df2.set_index('target_id')
df2_idx.drop(df2_idx[['length','eff_length','est_counts']],inplace=True, axis=1)
print(df2_idx)
frames = [df1_idx, df2_idx]
results = pd.concat(frames, axis=1)
results
The output it generated was,
tpm
target_id
A 0
B 0
C 0
D 0
E 0
tpm
target_id
A 1
B 1
C 1
D 1
E 1
Out[18]:
target_id tpm tpm
A 0 1
B 0 1
C 0 1
D 0 1
E 0 1
How can I loop this so that I read each file and achieve the same output?
Thanks,
AP
I think you can use the index_col and usecols parameters of read_csv with a list comprehension. You do get duplicate column names, though (which is a problem for selecting), so it is better to add the keys parameter to concat; after flattening the resulting MultiIndex you get nice unique column names:
files = ["FB_test.tsv", "Myc_test.tsv"]
dfs = [pd.read_csv(f, sep='\t', index_col=['target_id'], usecols=['target_id','tpm'])
       for f in files]
results = pd.concat(dfs, axis=1, keys=('a','b'))
results.columns = results.columns.map('_'.join)
results = results.reset_index()
print (results)
target_id a_tpm b_tpm
0 A 0 1
1 B 0 1
2 C 0 1
3 D 0 1
4 E 0 1
To clean the code and use a looping mechanism, you can put both your file names and the columns you are dropping in two separate lists, and then use list comprehension on the file names to import each dataset. Subsequently, you concatenate the output of the list comprehension into one dataframe:
import pandas as pd
drop_cols = ['length','eff_length','est_counts']
filenames = ["FB_test.tsv", "Myc_test.tsv"]
results = pd.concat([pd.read_csv(filename,sep='\t').set_index('target_id').drop(drop_cols, axis=1) for filename in filenames], axis=1)
I hope this helps.
