How to read a big CSV file and read each column separately? - python-3.x

I have a big CSV file that I read as a dataframe.
But I cannot figure out how to read each column separately.
I have tried to use sep = '\' but it gives me an error.
I read my file with this code:
filename = "D:\\DLR DATA\\status maritime\\nperf-data_2019-01-01_2019-09-10_bC5HIhAfJWqdQckz.csv"
#Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv(filename)
df1 = df.head()
When I print my dataframe head I get this result:
In the variable explorer my dataframe consists of only 1 column with all the data inside:
I tried to set sep = '' with a space and with a comma, but it didn't work.
How can I read all the data into the appropriate separate columns?
I would appreciate any help.

You have a tab-separated file, so use \t as the separator:
df = dd.read_csv(filename, sep='\t')
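If you are not sure which separator a file uses, one option is to let the standard library guess it from a small sample before calling read_csv. The sketch below is only an illustration: the sample text is made up, and guess_separator is a hypothetical helper name, not part of pandas or dask.

```python
import csv


def guess_separator(sample, candidates=",;\t "):
    """Return the delimiter csv.Sniffer considers most likely for the sample text."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Made-up sample standing in for the first few lines of the real file.
sample = "id\tname\tvalue\n1\talpha\t3.5\n2\tbeta\t4.0\n"

sep = guess_separator(sample)
print(repr(sep))
```

With a real file, the sample could be the first few kilobytes read with open(filename).read(4096), and the result could then be passed as sep=sep. Note that csv.Sniffer raises csv.Error when it cannot decide, so the guess is a convenience rather than a guarantee.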

Related

How do I convert my response with byte characters to readable CSV - PYTHON

I am building an API to save CSVs from the SharePoint REST API using Python 3. I am using a public dataset as an example. The original CSV has 3 columns (Group, Team, FIFA Ranking) with corresponding data in the rows. For reference, the original CSV in the SharePoint UI looks like this:
After using data=response.content the output of data is:
b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\nA,Netherlands,8\r\nB,England,5\r\nB,Iran,20\r\nB,United States,16\r\nB,Wales,19\r\nC,Argentina,3\r\nC,Saudi Arabia,51\r\nC,Mexico,13\r\nC,Poland,26\r\nD,France,4\r\nD,Australia,38\r\nD,Denmark,10\r\nD,Tunisia,30\r\nE,Spain,7\r\nE,Costa Rica,31\r\nE,Germany,11\r\nE,Japan,24\r\nF,Belgium,2\r\nF,Canada,41\r\nF,Morocco,22\r\nF,Croatia,12\r\nG,Brazil,1\r\nG,Serbia,21\r\nG,Switzerland,15\r\nG,Cameroon,43\r\nH,Portugal,9\r\nH,Ghana,61\r\nH,Uruguay,14\r\nH,South Korea,28\r\n'
How do I convert the above to a CSV that pandas can manipulate, with the columns being Group, Team, FIFA Ranking followed by the corresponding data, and do it dynamically so that this method works for any CSV?
I tried:
data=response.content.decode('utf-8', 'ignore').split(',')
However, when I convert the data variable to a dataframe and then export the CSV, the CSV just has all the values in one column.
I tried:
data=response.content.decode('utf-8') or data=response.content.decode('utf-8', 'ignore') without the split
However, pandas does not accept this as a valid df and complains about invalid use of the DataFrame constructor.
I tried:
data=json.loads(response.content)
However, the content is not valid JSON, so you get the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Given:
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n' #...
If you just want a CSV version of your data you can simply do:
with open("foo.csv", "wt", encoding="utf-8", newline="") as file_out:
    file_out.write(data.decode())
If your objective is to load this data into a pandas dataframe and the CSV is not actually important, you can:
import io
import pandas
foo = pandas.read_csv(io.StringIO(data.decode()))
print(foo)
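To see why splitting the decoded text on commas alone loses the row structure, it can help to parse the same bytes with the standard library's csv module, which separates records at line breaks and fields at commas. This is only an illustrative sketch using a shortened copy of the data above; pandas.read_csv handles both levels of splitting for you.

```python
import csv
import io

# Shortened copy of the response bytes from the question.
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n'

text = data.decode("utf-8")

# newline="" leaves the \r\n line endings for the csv module to interpret.
rows = list(csv.reader(io.StringIO(text, newline="")))
header, records = rows[0], rows[1:]

print(header)
for record in records:
    print(record)
```

Each record comes back as its own list of three fields, whereas text.split(',') would return one flat list in which the last field of one row and the first field of the next are glued together around the line break.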

Split CSV File into two files keeping header in both files

I am trying to split a large CSV file into two files. I am using the code below:
import pandas as pd
#csv file name to be read in
in_csv = 'Master_file.csv'
#get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))
#size of rows of data to write to the csv,
#you can change the row size according to your need
rowsize = 600000
#start looping through data writing it to a new file for each set
for i in range(0,number_lines,rowsize):
    df = pd.read_csv(in_csv,
                     nrows = rowsize,#number of rows to read at each loop
                     skiprows = i)#skip rows that have been read
    #csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'File_Number' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=True,
              mode='a',#append data to csv file
              chunksize=rowsize)#size of data to append for each loop
It splits the file, but the header is missing in the second file. How can I fix it?
In the original code, skiprows = i also skips the header line on the second pass, so the first remaining data row is taken as the header. read_csv() returns an iterator when used with chunksize and keeps track of the header for every chunk. The following is an example. It should also be much faster: the original code above reads the entire file to count the lines and then re-reads all previous lines in each iteration, whereas the code below reads through the file only once:
import pandas as pd
# chunksize matches the rowsize used in the question
with pd.read_csv('Master_file.csv', chunksize=600000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv(f'File_Number{i}.csv', index=False, header=True)
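The same single-pass idea can be sketched without pandas, which may be useful when the rows do not need to be parsed into a dataframe at all. This is a minimal sketch under stated assumptions: split_csv is a hypothetical helper name, the input is assumed to have exactly one header line, and the tiny demo file stands in for Master_file.csv.

```python
import csv
import itertools
import os
import tempfile


def split_csv(in_path, out_dir, rows_per_file):
    """Split in_path into numbered files in out_dir, repeating the header in each."""
    out_paths = []
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the first line is the header
        for i in itertools.count():
            # Take the next block of rows; an empty block means the input is exhausted.
            rows = list(itertools.islice(reader, rows_per_file))
            if not rows:
                break
            out_path = os.path.join(out_dir, f"File_Number{i}.csv")
            with open(out_path, "w", newline="") as dst:
                writer = csv.writer(dst)
                writer.writerow(header)
                writer.writerows(rows)
            out_paths.append(out_path)
    return out_paths


# Demo with a made-up three-row file standing in for the real input.
tmp_dir = tempfile.mkdtemp()
master = os.path.join(tmp_dir, "Master_file.csv")
with open(master, "w", newline="") as f:
    csv.writer(f).writerows([["id", "value"], [1, "a"], [2, "b"], [3, "c"]])

parts = split_csv(master, tmp_dir, rows_per_file=2)
print(parts)
```

Because the header is read once and written at the top of every output file, no part can end up with a data row in place of its header.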

How to read text file data in scientific format using pandas DataFrame

I have a text file (input.txt) with 2 columns of data which is in scientific format as shown.
input.txt file contents:
4.6277245181485196e-02 -3.478992280123e-02
5.147225314664928553e-02 -3.626645537995224627e-02
5.719622597261836416e-02 -3.778369677696073736e-02
6.351385032440140521e-02 -3.9348512512335400e-02
7.049988917103996999e-02 -4.096034949794334634e-02
7.822948857937785105e-02 -4.261684461302106541e-02
8.67649433797989394455e-02 -4.77e-02
9.614380281036348508e-02 -4.604114963738591831e-02
1.063651118650106309e-01 -4.777947266421164740e-02
1.173824105396738815e-01 -4.950717696170207904e-02
1.291006932795119577e-01 -5.119743181445588626e-02
I used the code below to read the data as a DataFrame.
import pandas as pd
from tabulate import tabulate
df = pd.read_csv('input.txt',delim_whitespace=True,engine='python',header=None,skip_blank_lines=True)
f=open('output.txt','w')
f.write(tabulate(df.values,tablefmt="plain"))
f.close()
But the data is not read in scientific format. I'm writing the same data to another output file using tabulate (so that it looks evenly spaced, like a table), and the output is not in scientific format and the digits are truncated, as shown.
output.txt file contents:
0.0462772 -0.0347899
0.0514723 -0.0362665
0.0571962 -0.0377837
0.0635139 -0.0393485
0.0704999 -0.0409603
0.0782295 -0.0426168
0.0867649 -0.0477
0.0961438 -0.0460411
0.106365 -0.0477795
0.117382 -0.0495072
0.129101 -0.0511974
I need the data to be read as-is, i.e. in scientific format in this case, and written to another file using tabulate. What needs to be modified in the above code?
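When reading the file, specify dtype=str so that pandas keeps the values as text, and pass disable_numparse=True so that tabulate does not convert them back to numbers: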
df = pd.read_csv("input.txt", sep=r"\s+", engine="python", dtype=str, header=None)
print(tabulate(df.values, tablefmt="plain", disable_numparse=True))
Prints:
4.6277245181485196e-02 -3.478992280123e-02
5.147225314664928553e-02 -3.626645537995224627e-02
5.719622597261836416e-02 -3.778369677696073736e-02
6.351385032440140521e-02 -3.9348512512335400e-02
7.049988917103996999e-02 -4.096034949794334634e-02
7.822948857937785105e-02 -4.261684461302106541e-02
8.67649433797989394455e-02 -4.77e-02
9.614380281036348508e-02 -4.604114963738591831e-02
1.063651118650106309e-01 -4.777947266421164740e-02
1.173824105396738815e-01 -4.950717696170207904e-02
1.291006932795119577e-01 -5.119743181445588626e-02
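The digits are lost at parse time because a 64-bit float holds only about 15 to 17 significant decimal digits, so a numeric dtype cannot preserve a 19-digit input. tabulate then shortens the numbers further with its default float format. A small sketch of the first effect, using one value from input.txt and only the standard library:

```python
from decimal import Decimal

text = "5.147225314664928553e-02"  # copied from input.txt

as_float = float(text)      # what a numeric dtype would store
as_decimal = Decimal(text)  # exact decimal value of the original text

print(repr(as_float))                         # shortest repr of the stored float
print(Decimal(repr(as_float)) == as_decimal)  # False: the float no longer matches the text exactly
print(text)                                   # a str column keeps the text untouched
```

This is why reading the columns as strings is the simplest way to reproduce the input character for character.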

How to read comma separated data into data frame without column header?

I am getting a comma-separated data set as bytes, which I need to:
1. Convert from bytes to a string.
2. Create a CSV (this can be skipped if there is a way to jump straight to step 3).
3. Format and read it as a data frame without the first row being converted to column names.
(Later I will be using this df to compare with Oracle DB output.)
Input data:
val = '-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
type(val)
I managed to get as far as step 3, but my first row is becoming the header. I am fine with any values as column headers, e.g. a, b, c, ...
#1 Code I tried to convert bytes to str
strval = val.decode('ascii').strip()
#2 Code to create the csv. First I created a blank csv and later appended the data
import csv
import pandas as pd
abc = ""
with open('csvfile.csv', 'w') as csvOutput:
    testData = csv.writer(csvOutput)
    testData.writerow(abc)
with open('csvfile.csv', 'a') as csvAppend:
    csvAppend.write(val)
#3 Now converting it into a dataframe
df = pd.read_csv('csvfile.csv')
# hdf = pd.read_csv('csvfile.csv', column=none) -- this gives NameError: name 'none' is not defined
output:
df
According to the read_csv documentation, it should be enough to add header=None as a parameter:
df = pd.read_csv('csvfile.csv', header=None)
This way the first line will be interpreted as a row of data. If you want to exclude that line, add the skiprows=1 parameter:
df = pd.read_csv('csvfile.csv', header=None, skiprows=1)
You can do it without saving to a CSV file, like this; you don't need to convert the bytes to a string or save them to a file.
Here val is of type bytes. If it is of type str, as in your example, you can use io.StringIO instead of io.BytesIO.
import pandas as pd
import io
val = b'-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
buf_bytes = io.BytesIO(val)
pd.read_csv(buf_bytes, header=None)
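Since the question mentions being fine with arbitrary labels such as a, b, c, here is a hedged sketch of the same "no header row" idea using only the standard library: csv.DictReader treats every line as data when fieldnames is passed explicitly. The letter labels and the shortened two-row input are assumptions for illustration. In pandas, the names parameter of read_csv plays a similar role together with header=None.

```python
import csv
import io
import string

# Shortened copy of the bytes from the answer above.
val = (b'-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n'
       b'-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n')

text = val.decode("ascii")

# Count the fields in the first line (naive split: assumes no quoted commas).
n_columns = len(text.splitlines()[0].split(","))
labels = list(string.ascii_lowercase[:n_columns])  # 'a', 'b', 'c', ...

# With explicit fieldnames, the first line is kept as a data row.
records = list(csv.DictReader(io.StringIO(text), fieldnames=labels))

print(labels)
print(records[0])
```

Each record is a dict keyed by the generated labels, which may already be enough for a row-by-row comparison against database output.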

Choose a specific column of an imported text file

I am trying to import a text file into Python. The first column is date and others are integers. After importing the text file I want to extract each column, name them and plot each variable vs date (the first column). How can I extract columns? And how can I choose the 2nd column onwards? I tried two different methods for importing the file:
btcv = np.genfromtxt('example_Feb.388.btcv.txt', dtype=None);
and
btcv = pd.read_csv('example_Feb.388.btcv.txt', header = None)
The text file looks like:
"2015-06-17 00:00" -6.830000 -5.642747 -5.642747 -4.057440 -3.867922 -4.377454
"2015-06-18 00:00" -6.830000 -5.630413 -5.630413 -4.045107 -3.855588 -4.365120
"2015-06-19 00:00" -5.245973 -5.627623 -5.627623 -3.967911 -3.836147 -4.309624
"2015-06-20 00:00" -4.568952 -5.620628 -5.620628 -3.871517 -3.837915 -4.238232
"2015-06-21 00:00" -4.620864 -5.615302 -5.615302 -3.980928 -4.001598 -4.272657
"2015-06-22 00:00" -4.673435 -5.622433 -5.622433 -4.025599 -4.071035 -4.285809
With 1000 rows and 188 columns.
I tried
btcv.date = btcv[:,0]
and it did not work! and btcv[0] returns the full array.
Thanks.
Using pandas, you can read it as a CSV and set the delimiter to whitespace:
df = pd.read_csv('example.csv', delim_whitespace=True, header=None)
This will read the file into a pandas dataframe. You can then name your columns (one name per column in the file). For example:
df.columns = ['date', 'first', 'second']
Then you can access each column by name, e.g.:
date = df.date
Make the date the frame index:
df.index = df.date
and then plot the data frame with a plotting tool.
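For readers who want to see the column extraction spelled out, here is a minimal standard-library sketch. It uses shortened copies of the lines from the question (only three value columns), and the variable names are illustrative; plotting is left out.

```python
import csv
from datetime import datetime

# Shortened copies of the lines shown in the question.
lines = [
    '"2015-06-17 00:00" -6.830000 -5.642747 -5.642747',
    '"2015-06-18 00:00" -6.830000 -5.630413 -5.630413',
    '"2015-06-19 00:00" -5.245973 -5.627623 -5.627623',
]

# The quoted date contains a space, so let the csv module honour the quotes.
rows = list(csv.reader(lines, delimiter=" ", skipinitialspace=True))

# First column: dates. Remaining columns: floats.
dates = [datetime.strptime(row[0], "%Y-%m-%d %H:%M") for row in rows]

# Transpose so that series[k] holds value column k for all dates.
series = list(zip(*[[float(x) for x in row[1:]] for row in rows]))

print(dates[0], series[0])
```

Each entry of series can then be plotted against dates. In pandas, the equivalent selection of the second column onwards is df.iloc[:, 1:].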
