I have a text file that contains strings, floating-point numbers and integers, separated by double spaces
cat input.txt
nms val pet dzl
sdt 2.5 3.5 1
tyu 2.8 7.5 5
I want to load the text file and assign each row's values to new variables in order to perform some task inside the loop
My trial script
import numpy as np
import pandas as pd
main_file=np.loadtxt("input.txt")
for file in main_file:
    a= should be sdt
    b= should be 2.5
    c= should be 3.5
    d= should be 1
Similarly, for the second row I want to do the same, so that
a= should be tyu
b= should be 2.8
c= should be 7.5
d= should be 5
Error: ValueError: could not convert string to float: 'sdt'
How can I fix this?
I guess in your example a combination of open with readlines and split on two spaces does the job.
# Using readlines()
with open('input.txt', 'r') as file1:
    lines = file1.readlines()
for i, line in enumerate(lines):
    if i == 0:
        continue  # skip the header row
    # strip() removes the newline character; columns are separated by two spaces
    a, b, c, d = line.strip().split(' ' * 2)
    print(f'In line {i} a is {a}, b is {b}, c is {c} and d is {d}.')
This is the output:
In line 1 a is sdt, b is 2.5, c is 3.5 and d is 1.
In line 2 a is tyu, b is 2.8, c is 7.5 and d is 5.
If you have more than 4 columns, you can call split() without unpacking and assign the result to a single variable. That variable is a list, and you can select its items by zero-based index.
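A minimal sketch of that list-based variant. The row below is hypothetical (the sample file only has four columns), and the extra values are made up for illustration:

```python
# Hypothetical row with more columns than the sample file has.
line = "sdt  2.5  3.5  1  9.9  abc\n"

# split() with no argument splits on any run of whitespace
# and drops the trailing newline.
fields = line.split()

name = fields[0]            # first column, zero-based index 0
val = float(fields[1])      # second column, converted from str to float
last = fields[-1]           # negative indices count from the end

print(name, val, last, len(fields))
```

Because everything comes back as a string, numeric columns still need an explicit float() or int() call before doing arithmetic.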
In [244]: cat test.csv
nms val pet dzl
sdt 2.5 3.5 1
tyu 2.8 7.5 5
An easy way to load a csv is with pandas - I use the engine and sep to handle the white space separator (rather than the default comma):
In [245]: df = pd.read_csv('test.csv', sep=r'\s+', engine='python')
In [246]: df
Out[246]:
   nms  val  pet  dzl
0  sdt  2.5  3.5    1
1  tyu  2.8  7.5    5
Since you tagged dataframe and pandas, I'll assume you can take it from there.
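For completeness, one possible way to take it from there and get a, b, c, d per row is itertuples. This is only a sketch: io.StringIO stands in for the file so the example is self-contained, and the text variable is an assumption, not part of the question.

```python
import io
import pandas as pd

# Stand-in for input.txt so the example runs on its own.
text = "nms  val  pet  dzl\nsdt  2.5  3.5  1\ntyu  2.8  7.5  5\n"
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

rows = []
# index=False drops the row index; name=None yields plain tuples.
for a, b, c, d in df.itertuples(index=False, name=None):
    # a is the string column, b and c the float columns, d the integer column
    rows.append((a, b, c, d))
    print(a, b, c, d)
```

Unlike the readlines approach, the values arrive already converted to numbers, because read_csv infers a dtype per column.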
genfromtxt can read it as well, but for strings that aren't floats it inserts a nan:
In [247]: data = np.genfromtxt('test.csv')
In [248]: data
Out[248]:
array([[nan, nan, nan, nan],
[nan, 2.5, 3.5, 1. ],
[nan, 2.8, 7.5, 5. ]])
np.loadtxt given the same thing raises errors because it can't convert those strings to float. That should be clear from the docs.
With a few more parameters you can get a nice structured array:
In [251]: data = np.genfromtxt('test.csv', dtype=None, names=True, encoding=None)
In [252]: data
Out[252]:
array([('sdt', 2.5, 3.5, 1), ('tyu', 2.8, 7.5, 5)],
dtype=[('nms', '<U3'), ('val', '<f8'), ('pet', '<f8'), ('dzl', '<i8')])
data['val'] gives the whole val column. Or you can iterate on the rows with:
In [258]: for row in data:
...: print(row)
...:
('sdt', 2.5, 3.5, 1)
('tyu', 2.8, 7.5, 5)
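A similar structured array from loadtxt. Note that a bare str dtype has zero width, so the string field comes back empty; a sized dtype such as 'U3' in place of str keeps the text: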
In [262]: data = np.loadtxt('test.csv', dtype='str,f,f,i', skiprows=1,encoding=None)
In [263]: data
Out[263]:
array([('', 2.5, 3.5, 1), ('', 2.8, 7.5, 5)],
dtype=[('f0', '<U'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<i4')])
With base readlines you can get a list of strings for each row, and parse those as you want:
In [264]: with open('test.csv','r') as f: lines = f.readlines()
In [265]: lines
Out[265]: ['nms val pet dzl\n', 'sdt 2.5 3.5 1\n', 'tyu 2.8 7.5 5\n']
In [266]: lines[1]
Out[266]: 'sdt 2.5 3.5 1\n'
In [267]: lines[1].split()
Out[267]: ['sdt', '2.5', '3.5', '1']
In [268]: a,b,c,d = lines[1].split()
In [269]: b
Out[269]: '2.5'
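Note that b is still the string '2.5' at this point. A short sketch of parsing every data row with explicit conversions, using the same lines list shown above (the parsed name is just for illustration):

```python
# The list of strings as returned by readlines() above.
lines = ['nms val pet dzl\n', 'sdt 2.5 3.5 1\n', 'tyu 2.8 7.5 5\n']

parsed = []
for line in lines[1:]:              # skip the header row
    a, b, c, d = line.split()       # all four are str at this point
    parsed.append((a, float(b), float(c), int(d)))

print(parsed)
```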
Related
I am new to Python and using pandas.
I am trying to convert data in a dictionary to a CSV file.
Here is the Dictionary
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
'sqft_lot': 5800, 'floors': 2.0,
'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
'sqft_basement': 20,
'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
I use the method below to save the dictionary as a CSV file and read it back:
with open('test.csv', 'w') as f:
    for key in data_new:
        f.write("%s,%s\n" % (key, data_new[key]))
df1 = pd.read_csv("test.csv")
df1
When I read df1, I get it in the format below,
but I want all rows to be columns, so I used the transpose function as below.
However, in the output above, bathrooms is at index 0, but I want the index to start from bedrooms; with this output, if I try tdf1.info() I do not see the bedrooms data at all.
Could you please guide me how I can fix this?
Regards
Aravind Viswanathan
I think it would be easier to just use pandas to both write and read your csv file. Does this satisfy what you're trying to do?
import pandas as pd
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
'sqft_lot': 5800, 'floors': 2.0,
'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
'sqft_basement': 20,
'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
df1 = pd.DataFrame([data_new])  # a list holding one dict gives a one-row frame
df1.to_csv('test.csv', index=False) # index=False prevents index being added as column 1
df2 = pd.read_csv('test.csv')
print(df1)
print(df2)
Output:
bedrooms bathrooms sqft_living ... yr_built yr_renovated city
0 2.0 3.0 1200 ... 1925 2003 Shoreline
[1 rows x 13 columns]
bedrooms bathrooms sqft_living ... yr_built yr_renovated city
0 2.0 3.0 1200 ... 1925 2003 Shoreline
[1 rows x 13 columns]
Identical.
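As for why bedrooms went missing in the original attempt: read_csv treats the first line of a file as the header by default, so the bedrooms,2.0 line became column names instead of data. A sketch of reading the key,value layout with header=None and then transposing, assuming the same file format the question wrote out (only three of the keys are used here, and io.StringIO stands in for test.csv):

```python
import io
import pandas as pd

# Same key,value layout the question wrote, shortened for illustration.
text = "bedrooms,2.0\nbathrooms,3.0\ncity,Shoreline\n"

# header=None: the first line is data, not column names.
# index_col=0: use the keys as the row index.
kv = pd.read_csv(io.StringIO(text), header=None, index_col=0)

# Transpose so that each key becomes a column, then tidy the labels.
tdf1 = kv.T.reset_index(drop=True)
tdf1.columns.name = None

print(list(tdf1.columns))
```

One caveat with this layout: the single value column mixes numbers and text, so the values will most likely come back as strings, which is another reason the one-row to_csv approach above is simpler.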
I am indexing and slicing my data using Pandas in Python3 to calculate spatial statistics.
When I run a for loop over the range of latitudes and longitudes using .loc, I get KeyError: (slice(None, None, None), ) for any combination of latitude and longitude that has no values in the input file. Instead of skipping those values, the code raises the error and stops. Following is my code.
import numpy as np
import pandas as pd
from scipy import stats
filename='input.txt'
df = pd.read_csv(filename,delim_whitespace=True, header=None, names = ['year','month','lat','lon','aod'], index_col = ['year','month','lat','lon'])
idx=pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            tmp = df.loc[idx[:, i, lat0, lon0], :]
            if len(tmp) <= 0:
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run for tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and provides the following output, which I used for the further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But when I run the same code for tmp = df.loc[idx[:,1,0.0,32.75],:], a latitude and longitude for which no values are available in the input file, it does not skip them and instead gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that produced a too many indexers error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month','lat','lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
    pass
This has three benefits.
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching for the second group, etc.
Simplicity.
Robustness. If the dataframe doesn't have any rows matching, for example, "month=1,lat=0.0,lon=32.75", groupby simply never creates that group, so there is no KeyError to handle.
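A small sketch of the robustness point, using made-up rows in the question's layout. The combination month=1, lat=0.0, lon=32.75 is deliberately absent, and the seen dictionary is only there for illustration:

```python
import pandas as pd

# Made-up rows; (month=1, lat=0.0, lon=32.75) does not occur.
df = pd.DataFrame(
    {
        "year": [2003, 2006, 2003],
        "month": [1, 1, 2],
        "lat": [0.0, 0.0, 0.0],
        "lon": [34.0, 34.0, 34.0],
        "aod": [0.032, 0.114, 0.050],
    }
).set_index(["year", "month", "lat", "lon"])

seen = {}
# Index level names can be passed to groupby just like column names.
for (month, lat, lon), tmp in df.groupby(["month", "lat", "lon"]):
    seen[(month, lat, lon)] = len(tmp)

print(seen)
```

Only the two combinations that exist in the data are visited, so no lookup ever fails.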
More information: User guide on grouping
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
     a  c
b
1.0  2  3
2.0  2  5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.
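For instance, a per-location average could look like the sketch below. The values are made up, and the column names follow the question's file layout; whether mean and count are the statistics actually needed is an assumption.

```python
import pandas as pd

# Made-up values; column names follow the question's file layout.
df = pd.DataFrame(
    {
        "month": [1, 1, 2],
        "lat": [0.0, 0.0, 0.0],
        "lon": [34.0, 34.0, 34.0],
        "aod": [0.25, 0.75, 0.5],
    }
)

# One row per (month, lat, lon) combination that actually occurs in the data.
stats = df.groupby(["month", "lat", "lon"])["aod"].agg(["mean", "count"])
print(stats)
```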
I have 3 big CSV files. I try to randomly extract some samples from the files without loading them into the memory. I am doing this:
SITS = dd.read_csv("sits_train_0.csv", blocksize="512MB",
usecols=band_blue + ["samplefid"]).set_index("samplefid")
MASK = dd.read_csv("mask_train_0.csv", blocksize="512MB",
usecols=band_mask + ["samplefid"]).set_index("samplefid")
GP = dd.read_csv("sits_gp_train_0.csv", blocksize="512MB",
usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
# SITS = pd.read_csv("sits_train_0.csv",
# usecols=band_blue + ["samplefid"]).set_index("samplefid")
# MASK = pd.read_csv("mask_train_0.csv",
# usecols=band_mask + ["samplefid"]).set_index("samplefid")
# GP = pd.read_csv("sits_gp_train_0.csv",
# usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
np.random.seed(0)
NSAMPLES=100
samples = np.random.choice(MASK.index, size=NSAMPLES, replace=False)
s = SITS.loc[samples][band_blue].compute().values
m = MASK.loc[samples][band_mask].compute().values
sg = GP.loc[samples][band_blue_gp].compute().values
# s = SITS.loc[samples][band_blue].values
# m = MASK.loc[samples][band_mask].values
# sg = GP.loc[samples][band_blue_gp].values
I got strange results, so I compared against pandas on smaller files (see the commented code above), which gives correct results.
If I set blocksize to None, the results are fine, but it loads everything into memory, so using dask is not useful in that case, and my CSVs are too big to fit in memory. The rows in my CSVs are written in random order, so I need to use the index to recover the same samples from the 3 CSVs.
I feel I am missing something about dask, but I don't see what.
I'd recommend using sample:
In [16]: import pandas as pd
In [17]: import dask.dataframe as dd
In [18]: df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...: 'num_wings': [2, 0, 0, 0],
...: 'num_specimen_seen': [10, 2, 1, 8]},
...: index=['falcon', 'dog', 'spider', 'fish'])
In [19]: ddf = dd.from_pandas(df, npartitions=2)
In [20]: ddf.sample??
In [21]: df.sample(frac=0.5, replace=True, random_state=1)
Out[21]:
num_legs num_wings num_specimen_seen
dog 4 0 2
fish 0 0 8
In [22]: ddf.sample(frac=0.5, replace=True, random_state=1)
Out[22]:
Dask DataFrame Structure:
num_legs num_wings num_specimen_seen
npartitions=2
dog int64 int64 int64
fish ... ... ...
spider ... ... ...
Dask Name: sample, 4 tasks
In [23]: ddf.sample(frac=0.5, replace=True, random_state=1).compute()
Out[23]:
num_legs num_wings num_specimen_seen
falcon 2 2 10
fish 0 0 8
I am using a dataframe df as follows
DeviceID TimeStamp A B C
00234 11-03-2014 05:55 5.6 2.3 3.3
00235 11-03-2014 05:33 2.8 0.9 4.2
00236 11-03-2014 06:15 3.5 0.1 1.3
00234 11-03-2014 07:23 2.5 0.2 3.9
00236 11-03-2014 07:33 2.5 4.5 2.9
As we can see from the sample df above, for DeviceID 00234 the max value among A, B and C is 5.6. Similarly, for DeviceID 00236 the max value among A, B and C is 4.5.
I want to retrieve the TimeStamp value based on the max value for each DeviceID. Clearly for DeviceID 00234 it is 11-03-2014 05:55.
I have not tried any approach yet, but would the following work?
from pyspark.sql import functions as F
max_value = df.groupby('DeviceID').agg(F.greatest('A','B','C').alias('max_value'))
df.withColumn('Max-TimeStamp',where(# please help me in putting the right codes))
The resulting df should look as follows
DeviceID Max_Value Max-TimeStamp
00234 5.6 11-03-2014 05:55
00236 4.5 11-03-2014 07:33
You can achieve this with a Window function:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('00234' , '11-03-2014 05:55', 5.6 , 2.3 , 3.3),
('00235' , '11-03-2014 05:33' , 2.8, 0.9 , 4.2),
('00236' , '11-03-2014 06:15' , 3.5 , 0.1 , 1.3),
('00234' , '11-03-2014 07:23' , 2.5 , 0.2 , 3.9),
('00236' , '11-03-2014 07:33', 2.5 , 4.5, 2.9)]
columns = ['DeviceID', 'TimeStamp', 'A','B','C']
df=spark.createDataFrame(l, columns)
w = Window.partitionBy('DeviceID')
df = df.select('DeviceID', 'TimeStamp', F.greatest('A','B','C').alias('max_value'))
df.withColumn('bla', F.max('max_value').over(w)).where(F.col('max_value') == F.col('bla')).drop('bla').show()
Output:
+--------+----------------+---------+
|DeviceID| TimeStamp |max_value|
+--------+----------------+---------+
| 00236|11-03-2014 07:33| 4.5|
| 00234|11-03-2014 05:55| 5.6|
| 00235|11-03-2014 05:33| 4.2|
+--------+----------------+---------+
Spark 3.3+
max_by is available. Be cautious, as it returns only one value: if several rows tie for the maximum, you get the corresponding value from only one of them.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('00234', '11-03-2014 05:55', 5.6, 2.3, 3.3),
('00235', '11-03-2014 05:33', 2.8, 0.9, 4.2),
('00236', '11-03-2014 06:15', 3.5, 0.1, 1.3),
('00234', '11-03-2014 07:23', 2.5, 0.2, 3.9),
('00236', '11-03-2014 07:33', 2.5, 4.5, 2.9)],
['DeviceID', 'TimeStamp', 'A', 'B', 'C'])
greatest = F.greatest('A', 'B', 'C')
df = df.groupBy('DeviceID').agg(
F.max_by('TimeStamp', greatest).alias('TimeStamp'),
F.max(greatest).alias('max_value')
)
df.show()
# +--------+----------------+---------+
# |DeviceID| TimeStamp|max_value|
# +--------+----------------+---------+
# | 00234|11-03-2014 05:55| 5.6|
# | 00235|11-03-2014 05:33| 4.2|
# | 00236|11-03-2014 07:33| 4.5|
# +--------+----------------+---------+
I need to read a text file that contains comma-delimited values into a 2D numpy array. The first 2 values on each line are the index values for the numpy array and the third value is the value to be stored in the array. As a catch, the index values are 1-based and need to be converted to the 0-based index values used by numpy. I've reviewed the documentation and examples using genfromtxt and loadtxt, but it's still not clear to me how to go about it. I've also tried the following code with no success:
a = np.arange(6).reshape(2,3)
for line in infile:
    fields = line.split() #split fields into list
    rindex = int(fields[0]) - 1
    cindex = int(fields[1]) - 1
    a[rindex,cindex] = float(fields[2])
Here is an example of the input file:
1,1,10.1
1,2,11.2
1,3,12.3
2,3,13.4
2,2,14.5
2,3,15.6
And here is my desired output array. Ideally I'd like it to work on any array size without having to predefine the size of the array.
10.1 11.2 12.3
13.4 14.5 15.6
Here's one way you can do it. numpy.genfromtxt() is used to read the data into a structured array with three fields. The row and column indices are pulled out of the structured array and used to figure out the shape of the desired array, and to assign the values to the new array using numpy's "fancy" indexing:
In [46]: !cat test_data.csv
1,1,10.1
1,2,11.2
1,3,12.3
2,3,13.4
2,2,14.5
2,3,15.6
In [47]: data = np.genfromtxt('test_data.csv', dtype=None, delimiter=',', names=['i', 'j', 'value'])
In [48]: data
Out[48]:
array([(1, 1, 10.1), (1, 2, 11.2), (1, 3, 12.3), (2, 3, 13.4),
(2, 2, 14.5), (2, 3, 15.6)],
dtype=[('i', '<i8'), ('j', '<i8'), ('value', '<f8')])
In [49]: rows = data['i']
In [50]: cols = data['j']
In [51]: nrows = rows.max()
In [52]: ncols = cols.max()
In [53]: a = np.zeros((nrows, ncols))
In [54]: a[rows-1, cols-1] = data['value']
In [55]: a
Out[55]:
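array([[ 10.1, 11.2, 12.3],
       [ 0. , 14.5, 15.6]])
The 0. in the second row appears because the sample input lists index (2,3) twice and never (2,1); for a repeated index, the later value (15.6) is the one that is kept.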
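The same idea can be wrapped in a small function. This is a sketch under a few assumptions: triplets_to_array and its fill parameter are hypothetical names, loadtxt is swapped in for genfromtxt, io.StringIO stands in for the file, and the fourth input line is taken to be 2,1,13.4, which is what the desired output in the question implies.

```python
import io
import numpy as np

def triplets_to_array(source, fill=0.0):
    """Build a dense 2D array from lines of 'row,col,value' with 1-based indices."""
    # ndmin=2 keeps the result two-dimensional even for a single input line.
    data = np.loadtxt(source, delimiter=",", ndmin=2)
    rows = data[:, 0].astype(int) - 1    # convert to 0-based
    cols = data[:, 1].astype(int) - 1
    # The shape is taken from the largest indices, so no size is predefined.
    out = np.full((rows.max() + 1, cols.max() + 1), fill)
    out[rows, cols] = data[:, 2]         # fancy-index assignment
    return out

# Assumed input: the question's data with the fourth line as 2,1,13.4.
text = "1,1,10.1\n1,2,11.2\n1,3,12.3\n2,1,13.4\n2,2,14.5\n2,3,15.6\n"
a = triplets_to_array(io.StringIO(text))
print(a)
```

Positions that never appear in the input keep the fill value, which makes gaps in the file easy to spot.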