python, loading a string from file - string

I'm trying to load a .txt file into my python project using numpy:
import numpy as np
import sys
g = np.loadtxt(sys.argv[1])
This command worked when the .txt file was a 0/1 matrix, but it fails now that the file is a matrix of strings (a 4x7 table of words like "crew").
The error says "can't convert string to float". Any help?

Take a look at the dtype parameter of np.loadtxt:
dtype : data-type, optional
Data-type of the resulting array; default: float. If this is a structured data-type, the resulting array will be 1-dimensional, and each row will be interpreted as an element of the array. In this case, the number of columns used must match the number of fields in the data-type.
The default is float, which results in the error you are pointing out in your question.
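As a minimal sketch of that suggestion, passing dtype=str keeps every field as text. The in-memory sample below is a hypothetical stand-in for the real file, and it assumes the words are separated by whitespace:

```python
import io
import numpy as np

# Hypothetical stand-in for the text file: a small whitespace-separated table of words.
sample = io.StringIO("crew ship mate\ndeck sail mast\n")

# dtype=str tells loadtxt to keep each field as text instead of converting it to float.
g = np.loadtxt(sample, dtype=str)

print(g.shape)   # (2, 3)
print(g[0, 0])   # crew
```

With a real file, the same call would take the path, for example np.loadtxt(sys.argv[1], dtype=str).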

One option is using pandas:
import sys
import pandas as pd
arr = pd.read_table(sys.argv[1], sep=" ", header=None).values
(This assumes the separator is a single space and the file has no header row. Adjust sep and header otherwise.)

Related

Have I used the index argument of to_csv correctly according to the problem statement?

Problem Statement :
1) Create another Series named heights_B from a 1-D numpy array of 5 elements derived from the normal distribution of mean 170.0 and standard deviation 25.0.
Note: Set random seed to 100 before creating heights_B series. Use numpy.
2) Create another Series named weights_B from a 1-D numpy array of 5 elements derived from the normal distribution of mean 75.0 and standard deviation 12.0.
Note: Set random seed to 100 again before creating weights_B series. Use numpy.
3) Label both Series elements with s1, s2, s3, s4 and s5.
4) Create a dataframe df_B containing the height and weight of students s1, s2, s3, s4 and s5 belonging to class B.
5) Label the columns as Student_height and Student_weight respectively.
Write the contents of df_B without the index to a CSV file named classB.csv.
Note: Use the index argument of to_csv method.
Solution :
import pandas as pd
import numpy as np
import random
height_A=np.array([176.2,158.4,167.6,156.2,161.4])
s1=pd.Series(height_A,index=['s1','s2','s3','s4','s5'])
weight_A=np.array([85.1,90.2,76.8,80.4,78.9])
s2=pd.Series(weight_A,index=['s1','s2','s3','s4','s5'])
df={'Student_height':s1,'Student_weight':s2}
hdf=pd.DataFrame(df)
random.seed(100)
x=np.random.normal(loc=170.0,scale=25.0,size=5)
s3=pd.Series(x,index=['s1','s2','s3','s4','s5'])
random.seed(100)
y=np.random.normal(loc=75.0,scale=12.0,size=5)
s4=pd.Series(y,index=['s1','s2','s3','s4','s5'])
df1=df={'Student_height':s3,'Student_weight':s4}
hdf1=pd.DataFrame(df1)
hdf1.to_csv('classB.csv',index=False)
I have written the code according to the problem statement, but the online compiler is not accepting my solution. Please tell me if I have made a mistake.
Add one more line to the code:
m = pd.read_csv("classB.csv"); print(m)
Use this code. Note that random.seed(100) seeds Python's built-in random module, not NumPy's generator, so the seed must be set with np.random.seed(100):
import os
import numpy as np
import pandas as pd
np.random.seed(100)
x=np.random.normal(loc=170.0,scale=25.0,size=5)
heights_B=pd.Series(x,index=['s1','s2','s3','s4','s5'])
np.random.seed(100)
y=np.random.normal(loc=75.0,scale=12.0,size=5)
weights_B=pd.Series(y,index=['s1','s2','s3','s4','s5'])
df_B = pd.DataFrame({'Student_height': heights_B,'Student_weight':weights_B}, index = weights_B.index)
df_B.to_csv("classB.csv",index=False)
os.system("cat classB.csv")
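The question's code calls random.seed(100), while this answer calls np.random.seed(100). A small sketch of why that matters, using the same distribution parameters as the problem statement (the variable names a, b, c and d are only for illustration):

```python
import random
import numpy as np

# Seeding Python's built-in random module leaves NumPy's global generator untouched,
# so these two draws are just consecutive samples from the same stream.
random.seed(100)
a = np.random.normal(loc=170.0, scale=25.0, size=5)
random.seed(100)
b = np.random.normal(loc=170.0, scale=25.0, size=5)

# Seeding NumPy's generator resets the stream, so the two draws repeat exactly.
np.random.seed(100)
c = np.random.normal(loc=170.0, scale=25.0, size=5)
np.random.seed(100)
d = np.random.normal(loc=170.0, scale=25.0, size=5)

print(np.array_equal(a, b))
print(np.array_equal(c, d))  # True
```

An automated checker that expects the values produced after np.random.seed(100) would therefore reject output generated after random.seed(100).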

Changing specific strings into floats in multidimensional array

I saved all the data into an array full of strings, but I want to change the strings in that array into float without changing the header (the first row) and first column of the array. How should I change my code?
import numpy as np
import csv
with open('MI_5MINS_INDEX.csv', encoding="utf-8") as f:
    data=list(csv.reader(f))
for line in data:
    line.remove('')
ary=np.array(data)
ary.astype(float)
Use pandas read_csv() and it will work as you wish.
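A minimal sketch of that suggestion follows. The sample data and its column names (Time, Open, Close) are hypothetical stand-ins for MI_5MINS_INDEX.csv, whose real layout is not shown in the question:

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV: a header row, a label column, and numeric cells.
sample = io.StringIO("Time,Open,Close\n09:00,10.5,11.0\n09:05,11.0,10.8\n")

# index_col=0 keeps the first column as row labels; the header row becomes column names.
df = pd.read_csv(sample, index_col=0)

# Only the data cells remain, and they are already parsed as floats.
values = df.to_numpy(dtype=float)

print(df.dtypes)
print(values)
```

The header and the first column stay available as df.columns and df.index, so nothing has to be stripped by hand before converting.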

Bokeh ValueError: expected an element of either Seq(String)

I'm trying to build a simple bar chart via bokeh but struggling for it to recognize the x-axis and keep getting a ValueError... I think it needs to be in string format but for some reason whatever I try it just won't work. Please note, the column that contains the Years (as floats by the looks of it) is called RegionName, if it seems confusing. Please see my code below, any suggestions?
import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
import os
from bokeh.palettes import Spectral5
from bokeh.transform import factor_cmap
os.chdir("C:/Users/Vladimir.Tikhnenko/Python/Land Reg")
# Pivot data
def pivot2(infile="Land Registry.csv", outfile="SalesVolume.csv"):
    df=pd.read_csv(infile)
    table=pd.pivot_table(df,index=["RegionName"],columns="Year",values="SalesVolume",aggfunc=sum)
    table.to_csv(outfile)
    return table
pivot2()
# Transpose data
df=pd.read_csv("SalesVolume.csv")
df=df.drop(df.columns[1:28],1)
df=pd.read_csv("SalesVolume.csv", index_col=0, header=None).T
df.to_csv("C:\\Users\Vladimir.Tikhnenko\Python\Land Reg\SalesVolume.csv",index=None)
df=pd.read_csv("SalesVolume.csv")
source = ColumnDataSource(df)
years = source.data['RegionName'].tolist()
p = figure(x_range=['RegionName'])
color_map = factor_cmap(field_name='RegionName',palette=Spectral5,
                        factors=years)
p.vbar(x='RegionName', top='Southwark', source=source, width=1,
       color=color_map)
p.title.text ='Transactions'
p.xaxis.axis_label = 'Years'
p.yaxis.axis_label = 'Number of Sales'
show(p)
the error message is
ValueError: expected an element of either Seq(String), Seq(Tuple(String,
String)) or Seq(Tuple(String, String, String)), got [1968.0, 1969.0, 1970.0,
1971.0, 1972.0, 1973.0, 1974.0, 1975.0, 1976.0, 1977.0, 1978.0, 1979.0,
1980.0, 1981.0, 1982.0, 1983.0, 1984.0, 1985.0, 1986.0, 1987.0, 1988.0,
1989.0, 1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0,
1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0,
2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0,
2016.0, 2017.0, 2018.0]
Categorical factors must only be strings (or sequences of strings for nested factors), so factor_cmap only accepts lists of those. You passed it a list of numbers, which causes the error shown. To use the years as categorical factors, you need to convert them to strings as suggested, and use those string values to initialize x_range and for the x coordinates passed to vbar.
Alternatively, if you want to use numerical values for the years, but just want to have fixed, controlled tick locations, do this:
p = figure() # don't pass x_range
p.xaxis.ticker = years
Then also use linear_cmap (instead of factor_cmap) to map the numerical values.
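For the categorical route, the conversion to strings could look like the sketch below. The small DataFrame is a hypothetical stand-in for the transposed SalesVolume.csv, where the years come back as floats:

```python
import pandas as pd

# Hypothetical stand-in for the transposed SalesVolume.csv: years were read back as floats.
df = pd.DataFrame({"RegionName": [1968.0, 1969.0, 1970.0],
                   "Southwark": [120, 135, 150]})

# Go through int first to drop the trailing ".0", then convert to str for use as factors.
df["RegionName"] = df["RegionName"].astype(int).astype(str)
years = df["RegionName"].tolist()

print(years)  # ['1968', '1969', '1970']
```

The resulting years list is what would be passed as x_range to figure and as factors to factor_cmap, with the ColumnDataSource built from the converted df so that the vbar x coordinates are the same strings.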

Program using ast.literal_eval is too slow

I am trying to convert strings into lists using ast.literal_eval for one column of a CSV file. Each string looks like "['abbb','cddd','cdcdc']". For some reason the column holds strings instead of lists, so I use ast.literal_eval to turn each one into a list with the elements 'abbb', 'cddd' and 'cdcdc'. The problem is that execution is too slow (there are 1326101 rows to process). The code I use is this:
import pandas as pd
import ast
import sys
user_dataset = pd.read_csv('user.csv')
for x in range(len(user_dataset['friends'])):
    if user_dataset['friends'][x]!=[]:
        """Convert string to list"""
        user_dataset['friends'][x] = ast.literal_eval(user_dataset['friends'][x])
Thanks a lot!
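One possible direction, offered as a sketch rather than a measured fix: the loop above assigns into the DataFrame one cell at a time through chained indexing, which is likely a large part of the cost. Converting the column in one pass avoids that, either with user_dataset['friends'].apply(ast.literal_eval) or with the converters argument of read_csv, as below. The in-memory sample is a hypothetical stand-in for user.csv, and ast.literal_eval is still called once per row:

```python
import ast
import io
import pandas as pd

# Hypothetical stand-in for user.csv with a 'friends' column holding list literals as text.
sample = io.StringIO(
    "user_id,friends\n"
    "1,\"['abbb', 'cddd', 'cdcdc']\"\n"
    "2,\"[]\"\n"
)

# converters applies ast.literal_eval to each 'friends' field while the file is parsed,
# so no per-row assignment into the DataFrame is needed afterwards.
user_dataset = pd.read_csv(sample, converters={"friends": ast.literal_eval})

print(user_dataset["friends"][0])  # ['abbb', 'cddd', 'cdcdc']
```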

Matplotlib: Import and plot multiple time series with legends direct from .csv

I have several spreadsheets containing data saved as comma delimited (.csv) files in the following format: The first row contains column labels as strings ('Time', 'Parameter_1'...). The first column of data is Time and each subsequent column contains the corresponding parameter data, as a float or integer.
I want to plot each parameter against Time on the same plot, with parameter legends which are derived directly from the first row of the .csv file.
My spreadsheets have different numbers of (columns of) parameters to be plotted against Time; so I'd like to find a generic solution which will also derive the number of columns directly from the .csv file.
The attached minimal working example shows what I'm trying to achieve using np.loadtxt (minus the legend); but I can't find a way to import the column labels from the .csv file to make the legends using this approach.
np.genfromtxt offers more functionality, but I'm not familiar with it and am struggling to find a way of using it to do the above.
Plotting data in this style from .csv files must be a common problem, but I've been unable to find a solution on the web. I'd be very grateful for your help & suggestions.
Many thanks
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('Data.csv', skiprows=1, delimiter=',') # skip the column labels
cols = data.shape[1] # get the number of columns in the array
for n in range (1,cols):
    plt.plot(data[:,0],data[:,n]) # plot each parameter against time
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
plt.show()
Here's my minimal working example for the above using genfromtxt rather than loadtxt, in case it is helpful for anyone else.
I'm sure there are more concise and elegant ways of doing this (I'm always happy to get constructive criticism on how to improve my coding), but it makes sense and works OK:
import numpy as np
import matplotlib.pyplot as plt
arr = np.genfromtxt('Data.csv', delimiter=',', dtype=None) # dtype=None automatically defines appropriate format (e.g. string, int, etc.) based on cell contents
names = (arr[0]) # select the first row of data = column names
for n in range (1,len(names)): # plot each column in turn against column 0 (= time)
    plt.plot (arr[1:,0],arr[1:,n],label=names[n]) # omitting the first row ( = column names)
plt.legend()
plt.show()
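A variation on the genfromtxt approach, sketched under the assumption that every data column is numeric: names=True reads the first row as field names, so the labels and the number of columns both come from the file. The in-memory sample reuses the example data from the question, and the series list is only an illustrative helper:

```python
import io
import numpy as np

# Stand-in for Data.csv, using the example data from the question.
sample = io.StringIO(
    "Time,Parameter_1,Parameter_2,Parameter_3\n"
    "0,10,0,10\n1,20,30,10\n2,40,20,20\n3,20,10,30\n"
)

# names=True reads the first row as field names and returns a structured array.
arr = np.genfromtxt(sample, delimiter=",", names=True)
names = arr.dtype.names

# Collect (label, x, y) for every column after the first; each triple could be
# drawn with plt.plot(x, y, label=label).
series = [(name, arr[names[0]], arr[name]) for name in names[1:]]

for label, x, y in series:
    print(label, y)
```

Because the columns are addressed by name, the numeric values stay floats and the header never has to be sliced off by hand.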
The function numpy.genfromtxt is more for broken tables with missing values rather than what you're trying to do. What you can do is simply open the file before handing it to numpy.loadtxt and read the first line. Then you don't even need to skip it. Here is an edited version of what you have here above that reads the labels and makes the legend:
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
#open the file
with open('Data.csv') as f:
    #read the names of the columns first
    names = f.readline().strip().split(',')
    #np.loadtxt can also handle already open files
    data = np.loadtxt(f, delimiter=',') # no skip needed anymore
cols = data.shape[1]
for n in range (1,cols):
    #labels go in here
    plt.plot(data[:,0],data[:,n],label=names[n])
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
#And finally the legend is made
plt.legend()
plt.show()

Resources