I want to read a few rows from a CSV file using Pandas, convert the data into NumPy arrays, do some arithmetic on those arrays, and then overwrite the original rows in the CSV file with the results. I have the following code to do this, but I am not sure how to specify the exact write location when using Pandas:
import numpy as np
import pandas as pd

# Read the node coordinates (the original snippet did not show how
# Nodes and Nnodes were defined; this read step is an assumption)
Nodes = pd.read_csv("test.csv", header=None, sep='\t').to_numpy()
Nnodes = 10  # hypothetical: the number of rows to process

X = Nodes[0:Nnodes, 0]
Y = Nodes[0:Nnodes, 1]
Z = Nodes[0:Nnodes, 2]
X = X.reshape(Nnodes, 1)
Y = Y.reshape(Nnodes, 1)
Z = Z.reshape(Nnodes, 1)
# Do some mathematical operation on the selected arrays
X = X * 0.5
Y = Y * 0.5
Z = Z * 0.5
# Concatenate the three arrays
Data = np.concatenate((X, Y, Z), axis=1)
# Convert the array back into a dataframe
df = pd.DataFrame(Data)
# Overwrite the previously selected rows in the csv file
# (mode='w' rewrites the whole file, which is the problem described above)
df.to_csv("test.csv", header=False, index=False, sep='\t', mode='w')
I am not sure if Pandas is the most appropriate tool for this. Any help/suggestion is greatly appreciated.
Thank you in advance.
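A CSV file cannot be patched in place: to_csv with mode='w' always rewrites the whole file, so writing only the modified rows discards everything else. A minimal sketch of the usual workaround, assuming the file is tab-separated and small enough to load fully: read everything, modify the target rows with .iloc, and rewrite the file.
import pandas as pd

Nnodes = 10  # hypothetical: how many rows to modify

# Load the complete file so that the untouched rows are kept
df = pd.read_csv("test.csv", header=None, sep='\t')

# Scale the first Nnodes rows of the three coordinate columns in place
df.iloc[0:Nnodes, 0:3] = df.iloc[0:Nnodes, 0:3] * 0.5

# Rewrite the whole file; the modified rows stay at their original location
df.to_csv("test.csv", header=False, index=False, sep='\t')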
I have a single netCDF4 file of historical average sea surface temperatures around a selected area of the Caribbean. The data spans a wide time scale (from 1850 to 2012), and I am trying to extract a snippet of it (the decade from 1990 to 2000) and convert it to .csv format.
So far, my code runs without any errors, but it outputs what looks like a Word document file (not a .csv file) in my local .py3 directory. The file contains only a single line of text, ",temperature", i.e. no actual data is present.
I am new to Python and to computer programming in general, with nearly no previous experience, so I have no clue what is wrong with my code. Any help is vastly appreciated.
Here is my code:
# Converting netCDF4 files to .csv
import netCDF4
from netCDF4 import Dataset
import numpy as np
import pandas as pd
# Reading in file
data = Dataset('/Users/terrysoennecken/Documents/Fish movement model data/Ocean Temp/adaptor.esgf_wps.retrieve-1628265122.7610238-22900-3-fb77a99c-36ca-48e6-ac6a-2895a39eddd0 2/historical.nc', 'r')
print(data)
tb = data.variables['time_bnds']
t = data.variables['time']
sst = data.variables['tos'] # Sea surface temp
lat = data.variables['latitude']
lon = data.variables['longitude']
# Storing relevant data into variables
time_data = data.variables['time'][:]
lon_data = data.variables['longitude'][:]
lat_data = data.variables['latitude'][:]
temp_data = data.variables['tos'][:]
# Creating a dataframe
start_date = time_data[1971]
end_date = time_data[1825]
time_scale = pd.date_range(start = start_date, end = end_date)
df = pd.DataFrame(1, columns = ['Temperature'], index = time_scale)
# Final dataframe saved to a .csv file
df.to_csv('Sea Surface Temperature around the Bahamas for the decade, 1990 - 2000')
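For reference, here is a sketch of one way to get there (untested against this particular file; it assumes the time variable carries standard units and calendar attributes and that tos is laid out as (time, lat, lon)): decode the time axis with netCDF4.num2date, average each time step over space, build a date-indexed DataFrame, slice the decade, and write a filename that actually ends in .csv.
from netCDF4 import Dataset, num2date
import pandas as pd

data = Dataset('historical.nc', 'r')  # path shortened for the example
t = data.variables['time']
sst = data.variables['tos']

# Decode the raw time values with the file's own metadata
dates = num2date(t[:], units=t.units, calendar=t.calendar)

# Average each time step over the spatial dimensions
mean_sst = sst[:].reshape(sst.shape[0], -1).mean(axis=1)

df = pd.DataFrame({'Temperature': mean_sst},
                  index=pd.to_datetime([str(d) for d in dates]))

# Slice the decade and write a file that actually ends in .csv
# (the original output had no extension, so the OS guessed a document type)
df.loc['1990':'2000'].to_csv('sst_bahamas_1990_2000.csv')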
I have many .csv files of NYC taxi data from nyc.gov; one .csv corresponds to one year-month. I take roughly 15 of the CSVs and build HDF5 files from them:
import h5py
import pandas as pd
import os
import glob
import numpy as np
import vaex
from tqdm import tqdm_notebook as tqdm

#hdf = pd.HDFStore('c:/Projekty/H5Edu/NYCTaxi/NYCTaxi.hp')
#df1 = pd.read_csv('path to some csv')
#hdf.put('DF1', df1, format = 'table', data_columns = True)

csv_list = np.sort(np.array(glob.glob('G:\\NYCTaxi\\*.csv')))[::-1]
csv_list = csv_list[20:39]
output_dir = 'c:\\Datasety\\YelowTaxi\\DataH5\\'

for file in tqdm(csv_list, leave=False, desc='Converting to hdf5...'):
    # Build the output path from the csv file name
    output_file = file.split('\\')[-1][:-3] + 'hdf5'
    output = output_dir + output_file
    # Check if a converted file already exists: if it does skip it,
    # otherwise read in the raw csv and convert it
    if os.path.exists(output) and os.path.isfile(output):
        pass
    else:
        # Importing the data into pandas
        pandas_df = pd.read_csv(file, index_col=None, header=0, low_memory=False)
        # Importing the data from pandas to vaex
        vaex_df = vaex.from_pandas(pandas_df, copy_index=False)
        # Export the data with vaex to hdf5
        vaex_df.export_hdf5(path=output, progress=False)
Next, I merge them into one big HDF5 file:
import re
import glob
import vaex
import numpy as np

def tryint(s):
    try:
        return int(s)
    except:
        return s

def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [tryint(c) for c in re.split('([0-9]+)', s)]

hdf5_list = glob.glob('c:\\Datasety\\YelowTaxi\\DataH5\\*.hdf5')
hdf5_list.sort(key=alphanum_key)
hdf5_list = np.array(hdf5_list)
#assert len(hdf5_list) == 3, "Incorrect number of files"

# This is an important step
master_df = vaex.open_many(hdf5_list)

# exporting
#master_df.export_hdf5(path='c:\\Datasety\\YelowTaxi\\DataH5\\Spojene.hd5', progress=True)
master_df.export_hdf5(path='c:\\Datasety\\YelowTaxi\\DataH5\\Spojene.hdf5', progress=True)
So far everything is OK and I can open the output file Spojene.hdf5.
Next, I append new .csv files to Spojene.hdf5:
for file in csv_list:
    #file = csv_list[0]
    df2 = pd.read_csv(file, index_col=None, header=0, low_memory=False)
    filename = 'c:\\Datasety\\YelowTaxi\\DataH5\\Spojene.hdf5'
    df2.to_hdf(filename, 'data', append=True)
But when I append a new .csv to Spojene.hdf5 this way, I can't open it any more:
df = vaex.open('c:\\Datasety\\YelowTaxi\\DataH5\\Spojene.hdf5')
ValueError: First columns has length 289184484, while column table has length 60107988
Please, what can I do?
I think this is linked to how pandas creates HDF5 files. According to vaex's documentation, you can't open an HDF5 file with vaex if it has been created via the pandas to_hdf method. I assume the same applies when you append to an existing HDF5 file with it.
To avoid this error you can reuse your earlier logic: convert the pandas dataframe to a vaex dataframe, export it to HDF5, and then use open_many. Something like this should work:
main_hdf5_file_path = "c:\\Datasety\\YelowTaxi\\DataH5\\Spojene.hdf5"
hdf5_files_created = []

for file in csv_list:
    hdf5_file = file.replace(".csv", ".hdf5")
    # from_csv can take additional parameters to forward to pd.read_csv
    # You can also use convert=True to convert automatically to hdf5 without export_hdf5
    # Refer to https://vaex.readthedocs.io/en/docs/api.html#vaex.from_csv
    df = vaex.from_csv(file)
    df.export_hdf5(hdf5_file)
    hdf5_files_created.append(hdf5_file)

hdf5_to_read = hdf5_files_created + [main_hdf5_file_path]
final_df = vaex.open_many(hdf5_to_read)
# If writing onto a file that open_many still has mapped causes trouble,
# export to a fresh path (e.g. Spojene_new.hdf5) instead
final_df.export_hdf5(main_hdf5_file_path)
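Afterwards the combined file should open again:
df = vaex.open(main_hdf5_file_path)
print(len(df))  # total number of rows across everything merged so far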
I'm trying to separate columns by slicing them because I need to assign a dtype to each one. So I grouped them by dtype, assigned the respective dtype to each group, and now I want to join (or concat) the groups back together in the same column order as the main dataframe. Note that I can't do this by column name, because the names may change.
Example:
import pandas as pd
file = pd.read_csv(f, encoding='utf8')  # 12 columns, indices 0 to 11
intg = file.iloc[:, [0, 2, 4, 6, 8, 9, 11]].astype("Int64")
obj = file.iloc[:, [1, 3, 5, 7, 10]].astype(str)
After doing this I need to join them back together in the same order as the main file, that is, columns 0 to 11.
To join these two chunks you can use the DataFrame join method, and then reindex based on your original dataframe's columns. It should look something like this:
import pandas as pd

file = pd.read_csv(f, encoding='utf8')  # 12 columns, indices 0 to 11
intg = file.iloc[:, [0, 2, 4, 6, 8, 9, 11]].astype("Int64")
obj = file.iloc[:, [1, 3, 5, 7, 10]].astype(str)

# join aligns the two pieces on the index; reindex restores the
# original column order
out = intg.join(obj).reindex(file.columns, axis="columns")
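Since the question mentions concat as an option too, pd.concat along the column axis gives the same result here:
out = pd.concat([intg, obj], axis=1).reindex(file.columns, axis="columns")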
I am new to Python and am using numpy to read a CSV into an array, so I tried two approaches:
Approach 1
train = np.asarray(np.genfromtxt(open("/Users/mac/train.csv","rb"),delimiter=","))
Approach 2
with open('/Users/mac/train.csv') as csvfile:
    rows = csv.reader(csvfile)
    for row in rows:
        newrow = np.array(row).astype(np.int)
        train.append(newrow)
I am not sure what the difference between these two approaches is. Which one is recommended?
I am not concerned with speed, since my data size is small, but rather with differences in the resulting data type.
You can also use pandas; it is simpler to use:
import pandas as pd
import numpy as np

dataset = pd.read_csv('file.csv')
# get all column headers in the csv
values = list(dataset.columns.values)
# get the labels, assuming the last column in the csv holds the labels
# (values[-1:] is a one-element list, so y comes out with shape (n, 1))
y = dataset[values[-1:]]
y = np.array(y, dtype='float32')
X = dataset[values[0:-1]]
X = np.array(X, dtype='float32')
So what is the difference in the result?
genfromtxt is numpy's CSV reader. It returns an array, so the extra asarray is not needed.
The second snippet is incomplete (csv is never imported and train is never initialized); it looks like it would produce a list of arrays, one per line of the file. It uses Python's generic csv reader, which does little more than read each line and split it into strings.
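To make the comparison concrete, here is a completed version of the second approach; the import csv line and the empty train list are the assumed missing pieces:
import csv
import numpy as np

train = []
with open('/Users/mac/train.csv') as csvfile:
    rows = csv.reader(csvfile)
    for row in rows:
        # each row arrives as a list of strings and has to be converted
        train.append(np.array(row).astype(int))

# stacking the per-row arrays yields one 2-D int array, whereas
# genfromtxt returns a 2-D float array in a single call
train = np.array(train)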
I have successfully gotten the volumes to add up correctly, but the sum comes back as a decimal. All volumes in the CSV file are whole numbers, and I would like the result without the decimal part.
The code is below.
import pandas as pd
datagrid = pd.read_csv("Daily Receipts.csv")
daily_vols = datagrid.groupby("Txn")["Scan Volume"].sum()
print(daily_vols)
When you sum with pandas, the result comes back as float here, most likely because pandas read the "Scan Volume" column as float in the first place (which happens, for example, when the column contains missing values). Use astype(int) to convert the result back to integers:
import pandas as pd
datagrid = pd.read_csv("Daily Receipts.csv")
daily_vols = datagrid.groupby("Txn")["Scan Volume"].sum().astype(int)
print(daily_vols)
This will also do it:
import pandas as pd
datagrid = pd.read_csv("Daily Receipts.csv")
daily_vols = datagrid.groupby("Txn")["Scan Volume"].sum()
daily_vols = [ int(x) for x in daily_vols ]
print(daily_vols)
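Note that the list comprehension returns a plain Python list, which drops the Txn index; the astype(int) version above keeps the result as an indexed Series.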