numpy reading a csv file into a numpy array - python-3.x

I am new to Python and am using numpy to read a csv into an array, so I tried two approaches:
Approach 1
train = np.asarray(np.genfromtxt(open("/Users/mac/train.csv","rb"),delimiter=","))
Approach 2
with open('/Users/mac/train.csv') as csvfile:
    rows = csv.reader(csvfile)
    for row in rows:
        newrow = np.array(row).astype(np.int)
        train.append(newrow)
I am not sure what the difference is between these two approaches. Which one is recommended?
I am not concerned with which is faster, since my data size is small, but more with differences in the resulting data type.

You can also use pandas; it is simpler to use.
import pandas as pd
import numpy as np
dataset = pd.read_csv('file.csv')
# get all headers in csv
values = list(dataset.columns.values)
# get the labels, assuming the last column in the csv holds the labels
y = dataset[values[-1:]]
y = np.array(y, dtype='float32')
X = dataset[values[0:-1]]
X = np.array(X, dtype='float32')

So what is the difference in the result?
genfromtxt is the numpy csv reader. It returns an array, so there is no need for the extra asarray.
The second approach is incomplete: as written it would produce a list of one-dimensional arrays, one per line of the file. It uses the generic Python csv reader, which does little more than read each line and split it into strings.
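Here is a minimal sketch of the difference in resulting types, assuming a small all-numeric train.csv in the working directory (file name and contents are placeholders):
import csv
import numpy as np

# Approach 1: genfromtxt parses the whole file into one 2-D float64 array
train1 = np.genfromtxt("train.csv", delimiter=",")
print(type(train1), train1.dtype, train1.shape)

# Approach 2: csv.reader yields each line as a list of strings; converting
# and appending gives a Python list of 1-D int arrays, not a 2-D array
train2 = []
with open("train.csv") as f:
    for row in csv.reader(f):
        train2.append(np.array(row).astype(int))

# the list still has to be stacked to get a single 2-D array
train2 = np.vstack(train2)
print(type(train2), train2.dtype, train2.shape)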

Related

How can I link this file to my .ipynb file to collect frequent data from the first dataset to the 9th dataset

[dataset image] Please use the Python language. I'm a beginner in frequent-pattern data mining. I'm trying to understand, so please be as simple and detailed as possible.
I tried using a for loop to collect data from a range, but I'm still learning and don't know how to implement it (it keeps giving me the error "index 1 is out of bounds for axis 1 with size 1"). Please guide me.
NB: I was also trying to construct a data frame, but I don't know how to. Help me with that too.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
# Calling the DataFrame constructor
Data = pd.read_csv('retail.txt', header=None)
# Initializing the list
transacts = []
# Populating the list of transactions
for i in range(1, 9):
    transacts.append([str(Data.values[i, j]) for j in range(1, 2000)])
df = pd.DataFrame()
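A possible fix, as a sketch: the "index 1 is out of bounds for axis 1 with size 1" error suggests the file was read as a single column, which happens when the transactions are whitespace-delimited but parsed with the default comma separator. Assuming retail.txt holds one whitespace-delimited transaction per line (a guess, since the file is not shown), something like this reads the transactions and builds the data frame:
import pandas as pd

# assuming one whitespace-delimited transaction per line (hypothetical layout)
transacts = []
with open('retail.txt') as f:
    for _ in range(9):          # first 9 transactions
        line = f.readline()
        if not line:
            break
        transacts.append(line.split())

# pandas pads shorter transactions with None; one transaction per row
df = pd.DataFrame(transacts)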

Overwrite entries in an existing csv file using Pandas in Python

I want to read a few rows from a csv file using Pandas, convert the read data into Numpy arrays, do some arithmetic on these arrays, and then overwrite the existing rows in the csv file with the results. I have the following code, but I am not sure how to specify the exact write location when using Pandas.
The sample csv file looks as follows:
import numpy as np
import pandas as pd

# (assumed setup, not shown in the question: read the node coordinates)
Nodes = pd.read_csv("test.csv", sep='\t', header=None).values
Nnodes = 3   # hypothetical number of rows to operate on

# Select the first Nnodes rows of each coordinate column
X = Nodes[0:Nnodes, 0]
Y = Nodes[0:Nnodes, 1]
Z = Nodes[0:Nnodes, 2]
X = X.reshape(Nnodes, 1)
Y = Y.reshape(Nnodes, 1)
Z = Z.reshape(Nnodes, 1)
# Do some mathematical operation on the selected arrays
X = X * 0.5
Y = Y * 0.5
Z = Z * 0.5
# Concatenate the three arrays
Data = np.concatenate((X, Y, Z), axis=1)
# Convert the array back into a dataframe
df = pd.DataFrame(Data)
# Overwrite the previously selected arrays at the same location in the csv file
df.to_csv("test.csv", header=None, index=False, sep='\t', mode='w')
I am not sure if Pandas is the most appropriate tool for this. Any help/suggestion is greatly appreciated.
Thank you in advance.
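One hedged way to do this (a sketch, not a definitive answer): pandas cannot write to an arbitrary offset inside an existing csv, so read the whole file, modify the target rows in memory, and rewrite the file so the untouched rows are preserved. The tab separator and the row count Nnodes here are assumptions carried over from the question:
import pandas as pd

# read the full file so rows outside the edited range are kept
full = pd.read_csv("test.csv", sep='\t', header=None)

# hypothetical: scale the first Nnodes rows of the three coordinate columns
Nnodes = 3
full.iloc[0:Nnodes, 0:3] = full.iloc[0:Nnodes, 0:3] * 0.5

# write everything back, overwriting the original file
full.to_csv("test.csv", sep='\t', header=False, index=False)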

Dask Dataframe groupby and aggregate for column

I had a pd.DataFrame that I converted to Dask.DataFrame for faster computations.
My requirement is that I have to find out the 'Total Views' of a channel.
In pandas it would be df.groupby(['ChannelTitle'])['VideoViewCount'].sum(), but in dask the column's dtype is object, so groupby treats the values as strings rather than ints (see image 2).
To handle the above issue, I added two columns separating the figure (115) and the multiplier (6 for M, 3 for K) of the views, hoping to do an operation like ddf['new_views_f'] * (10**ddf['new_views_m']), but now I cannot find a way to multiply two columns in dask.
Either I am missing something or complicating the requirement.
It does sound like you are complicating the requirement. For column multiplication, the regular pandas syntax will work (df['c'] = df['a'] * df['b']). In your case, it's possible to use pd.eval to get the actual numeric value for views:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import random

# build a toy frame with strings like '3.57M views'
df = pd.DataFrame(15*np.random.rand(15), columns=['views'])
df['views'] = df['views'].round(2).astype('str') + [random.choice(['K views', 'M views']) for _ in range(len(df))]
df['group'] = [random.choice([1,2,3]) for _ in range(len(df))]
ddf = dd.from_pandas(df, npartitions=2)

# rewrite 'K views'/'M views' as '*1e3'/'*1e6', then evaluate each string to a number
ddf['views_digits'] = ddf['views'].replace({'K views': '*1e3', 'M views': '*1e6'}, regex=True).map(pd.eval, meta=ddf['group'])
aggregate_df = ddf.groupby(['group']).agg({'views_digits': 'sum'}).compute()

Plotting Pandas DF with Numpy Arrays

I have a Pandas df with multiple columns, and each cell holds a Numpy array with a varying number of elements. I would like to plot all the elements of the array for every cell within a column.
I have tried
plt.plot(df['column'])
plt.plot(df['column'][0:])
Both give a ValueError: setting an array element with a sequence.
It is very important that these values get plotted against their corresponding index, as the index represents linear time in this dataframe. I would really appreciate it if someone showed me how to do this properly. Perhaps there is a package other than matplotlib.pyplot that is better suited for this?
Thank you
plt.plot needs a list of x-coordinates together with an equally long list of y-coordinates. As you seem to want to use the index of the dataframe for the x-coordinates and each cell's contents for the y-coordinates, you need to repeat each x-value as many times as the length of the corresponding y-array.
Note that this format doesn't suit a line plot, as connecting subsequent points would create some strange vertical lines. plt.plot accepts a marker as its third parameter, for example '.' to draw a simple dot at each position.
A code example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 30
df = pd.DataFrame({f'column{c}':
                       [np.random.normal(np.random.uniform(10, 100), 1, np.random.randint(3, 11)) for _ in range(N)]
                   for c in range(1, 6)})
legend_handles = []
colors = plt.cm.Set1.colors
desired_columns = df.columns
for column, color in zip(desired_columns, colors):
    for ind, cell in df[column].items():
        if len(cell) > 0:
            plotted, = plt.plot([ind] * len(cell), cell, '.', color=color)
    legend_handles.append(plotted)
plt.legend(legend_handles, desired_columns)
plt.show()
Note that pandas really isn't meant to store complete arrays inside cells. The preferred way is to create a dataframe in "long" form, with each value in a separate row (and the "index" repeated). Most functions of pandas and seaborn don't understand arrays inside cells.
Here's a way to create a long form which can be called using Seaborn:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
N = 30
df = pd.DataFrame({f'column{c}':
                       [np.random.normal(np.random.uniform(10, 100), 1, np.random.randint(3, 11)) for _ in range(N)]
                   for c in range(1, 6)})
desired_columns = df.columns
df_long_data = []
for column in desired_columns:
    for ind, cell in df[column].items():
        for val in cell:
            row = {'timestamp': ind, 'column_name': column, 'value': val}
            df_long_data.append(row)
df_long = pd.DataFrame(df_long_data)
sns.scatterplot(x='timestamp', y='value', hue='column_name', data=df_long)
plt.show()
As per your problem, you have numpy arrays in each cell that you want to plot. You may need to pass each cell to plt.plot() individually: passing a whole column at once, as you did, hands matplotlib a sequence of arrays, which it cannot interpret, whereas plot() will happily accept a single numpy array.
This might help:
for column in df.columns:
    for cell in df[column]:
        plt.plot(cell)
plt.show()
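Note that this plots each array against positions 0 to len(cell)-1 rather than against the dataframe index, so the time alignment the question asks for is lost; the earlier marker-based approach preserves it.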

Computing dask delayed objects stored in dataframe

I am looking for the best way to compute many dask delayed objects stored in a dataframe. I am unsure whether the pandas dataframe should be converted to a dask dataframe with the delayed objects inside, or whether compute should be called on all values of the pandas dataframe.
I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed object across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute
steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()
for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells like so: enr_df.applymap(compute) (which I believe calls compute on each value individually).
However, if I convert to a dask dataframe, the delayed objects I want to compute are buried inside the dask dataframe structure:
enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
and the computation does not produce the output I expect.
You can pass a list of delayed objects into dask.compute
results = dask.compute(*list_of_delayed_objects)
So you need to get a list from your Pandas dataframe. This is something you can do with normal Python code.
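For example, a minimal sketch assuming the enr_df of delayed objects built in the question: flatten the cells into a plain list, compute them all in one call, then reshape the results back into a pandas dataframe.
import dask
import numpy as np
import pandas as pd

# flatten every cell of the frame into one list of delayed objects
delayed_values = enr_df.values.ravel().tolist()

# a single compute call lets dask schedule all the tasks together
results = dask.compute(*delayed_values)

# reshape the computed values back to the original layout
computed_df = pd.DataFrame(np.array(results).reshape(enr_df.shape),
                           columns=enr_df.columns,
                           index=enr_df.index)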
