Question regarding converting one dictionary to csv file - python-3.x

I am new to Python and using pandas.
I am trying to convert the data in a dictionary to a CSV file.
Here is the dictionary:
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
'sqft_lot': 5800, 'floors': 2.0,
'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
'sqft_basement': 20,
'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
I use the code below to save the dictionary and read it back as a CSV file:
with open('test.csv', 'w') as f:
    for key in data_new:
        f.write("%s,%s\n" % (key, data_new[key]))
df1 = pd.read_csv("test.csv")
df1
When I read df1, I get it in the format below,
but I want all rows to be columns, so I used the transpose function as below.
However, in the above output bathrooms is at index 0, but I want the index to start from bedrooms: with the below output, tdf1.info() does not show the bedrooms data at all.
Could you please guide me on how I can fix this?
Regards
Aravind Viswanathan

I think it would be easier to just use pandas to both write and read your csv file. Does this satisfy what you're trying to do?
import pandas as pd
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
'sqft_lot': 5800, 'floors': 2.0,
'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
'sqft_basement': 20,
'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
df1 = pd.DataFrame([data_new])  # a list with one dict gives one row
df1.to_csv('test.csv', index=False) # index=False prevents the index being added as column 1
df2 = pd.read_csv('test.csv')
print(df1)
print(df2)
Output:
bedrooms bathrooms sqft_living ... yr_built yr_renovated city
0 2.0 3.0 1200 ... 1925 2003 Shoreline
[1 rows x 13 columns]
bedrooms bathrooms sqft_living ... yr_built yr_renovated city
0 2.0 3.0 1200 ... 1925 2003 Shoreline
[1 rows x 13 columns]
Identical.
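As for why bedrooms vanished in the original attempt: by default read_csv treats the first line of the file as the header, so the "bedrooms,2.0" line became column names rather than data. The sketch below reproduces that on a shortened dictionary and shows one possible fix with header=None. The in-memory io.StringIO buffer and the column labels 'key' and 'value' are assumptions made for illustration, not part of the original post.

```python
import io
import pandas as pd

# Shortened version of the question's dictionary.
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'city': 'Shoreline'}

# Same "key,value" layout as the hand-written file in the question.
text = "".join(f"{key},{value}\n" for key, value in data_new.items())

# Default behaviour: the first line is consumed as the header row.
df_default = pd.read_csv(io.StringIO(text))
print(list(df_default.columns))  # ['bedrooms', '2.0']

# header=None keeps every line as data; names= supplies column labels.
df_fixed = pd.read_csv(io.StringIO(text), header=None, names=['key', 'value'])

# Transpose so that each key becomes a column of a single-row frame.
wide = df_fixed.set_index('key').T
print(list(wide.columns))  # ['bedrooms', 'bathrooms', 'city']
```

Note that with this layout every value is read back as a string, because numbers and text share one column; writing the file with to_csv as shown above avoids that.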

Related

Encoding in Python such that numbering starts with 1

I have a dataframe in which the column 'team' needs to be encoded.
This is my code:
#Load the required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
#Create dictionary
data = {'team': ['A', 'A', 'B', 'B', 'C'],
'Income': [5849, 4583, 3000, 2583, 6000],
'Coapplicant Income': [0, 1508, 0, 2358, 0],
'LoanAmount': [123, 128, 66, 120, 141]}
#Convert dictionary to dataframe
df = pd.DataFrame(data)
print("\n df",df)
# Initiate label encoder
le = LabelEncoder()
# return encoded label
label = le.fit_transform(df['team'])
# printing label
print("\n label =",label )
# removing the column 'team' from df
df.drop("team", axis=1, inplace=True)
# Appending the array to our dataFrame
df["team"] = label
# printing Dataframe
print("\n df",df)
I am getting the below result after encoding:
However, I want to ensure the following two things:
Encoding starts with 1 and not 0
The location of column 'team' remains the same as in the original
i.e. I wish to have the following result:
Can somebody please help me with this?
Do not drop the column and increment the label on assignment:
le = LabelEncoder()
# return encoded label
label = le.fit_transform(df['team'])
# Replacing the column
df["team"] = label + 1
Output:
   team  Income  Coapplicant Income  LoanAmount
0     1    5849                   0         123
1     1    4583                1508         128
2     2    3000                   0          66
3     2    2583                2358         120
4     3    6000                   0         141
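If scikit-learn is not otherwise needed, the same numbering can probably be obtained with pandas alone. This is a minimal sketch that swaps in pd.factorize for LabelEncoder (a different technique from the one in the answer above), using a shortened copy of the question's data.

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'C'],
                   'Income': [5849, 4583, 3000, 2583, 6000]})

# sort=True numbers the categories in sorted order, as LabelEncoder does.
codes, uniques = pd.factorize(df['team'], sort=True)

# Assigning to the existing column keeps its position;
# adding 1 makes the codes start at 1 instead of 0.
df['team'] = codes + 1
print(df)
```

The uniques value holds the original labels in code order, so code k corresponds to uniques[k - 1].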

How to extract the specific part of text file in python?

I have a large data file, as shown in the uploaded picture. It has 90 BAND-INDEX entries and each BAND-INDEX has 300 rows.
I want to search the text file for a specific value like -24.83271 and extract the BAND-INDEX containing that value in an array form. Can you please write the code to do so? Thank you in advance
I am unable to extract the specific BAND-INDEX in array form.
Try reading the file line by line and using a generator. Here is an example:
import csv
import pandas as pd
# generate and save demo csv
pd.DataFrame({
    'Band-Index': (0.01, 0.02, 0.03, 0.04, 0.05, 0.06),
    'value': (1, 2, 3, 4, 5, 6),
}).to_csv('example.csv', index=False)
def search_values_in_file(search_values: list):
    with open('example.csv') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip header
        for row in reader:
            band_index, value = row
            if value in search_values:
                yield row
# get lines from csv where value in ['4', '6']
df = pd.DataFrame(list(search_values_in_file(['4', '6'])), columns=['Band-Index', 'value'])
print(df)
# Band-Index value
# 0 0.04 4
# 1 0.06 6
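The example above compares values as strings, which only matches when the text in the file is written exactly like the search term. For a floating-point target such as -24.83271, a numeric comparison with a tolerance may be more forgiving. The following is only a sketch: the two-column layout, the demo data, and the rows_close_to helper are hypothetical, because the real file format is visible only in the picture.

```python
import csv
import io
import math

# Hypothetical two-column layout standing in for the real file.
demo = io.StringIO(
    "Band-Index,value\n"
    "1,-24.83271\n"
    "1,-20.5\n"
    "2,-24.832710\n"
    "3,7.0\n"
)

def rows_close_to(lines, target, tol=1e-6):
    """Yield (band_index, value) for rows whose value is within tol of target."""
    reader = csv.reader(lines)
    next(reader)  # skip header
    for band_index, value in reader:
        if math.isclose(float(value), target, abs_tol=tol):
            yield band_index, float(value)

matches = list(rows_close_to(demo, -24.83271))
print(matches)
```

Passing an open file object instead of the StringIO buffer keeps the line-by-line behaviour, so the whole file never has to sit in memory.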

Python Pandas indexing provides KeyError: (slice(None, None, None), )

I am indexing and slicing my data using Pandas in Python3 to calculate spatial statistics.
When I run a for loop over the range of latitudes and longitudes using .loc, I get KeyError: (slice(None, None, None), ) for any latitude/longitude combination that has no values in the input file. Instead of skipping those combinations, the code raises an error and stops. Here is my code:
import numpy as np
import pandas as pd
from scipy import stats
filename='input.txt'
df = pd.read_csv(filename,delim_whitespace=True, header=None, names = ['year','month','lat','lon','aod'], index_col = ['year','month','lat','lon'])
idx=pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            tmp = df.loc[idx[:, i, lat0, lon0], :]
            if len(tmp) <= 0:
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and gives the following output, which I use for further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But when I run the same code with tmp = df.loc[idx[:,1,0.0,32.75],:], a latitude and longitude for which no values are available in the input file, it gives me the following error instead of skipping them:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that raised a "too many indexers" error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month','lat','lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
This has three benefits.
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching for the second group, etc.
Simplicity. A single loop replaces the three nested loops and the .loc lookup.
Robustness. If a dataframe doesn't have, for example, any rows matching "month=1,lat=0.0,lon=32.75", then groupby will not create that group, so there is nothing to skip and no KeyError.
More information: User guide on grouping
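To make the robustness point concrete, here is a small sketch on made-up data with the same four index levels as the question. The values and the results dictionary are assumptions for illustration; level= is used because month, lat, and lon live in the index.

```python
import pandas as pd

# Made-up sample with the same index levels as the question's file.
df = pd.DataFrame({
    'year':  [2003, 2006, 2003, 2006],
    'month': [1, 1, 1, 2],
    'lat':   [0.0, 0.0, 0.25, 0.0],
    'lon':   [34.0, 34.0, 34.0, 34.0],
    'aod':   [0.032, 0.114, 0.050, 0.070],
}).set_index(['year', 'month', 'lat', 'lon'])

results = {}
# Only combinations that actually occur in the data produce a group.
for (month, lat, lon), tmp in df.groupby(level=['month', 'lat', 'lon']):
    results[(month, lat, lon)] = tmp['aod'].mean()

print(len(results))                   # 3 groups
print((1, 0.0, 32.75) in results)     # False: never created, never raised
```

The combination that triggered the KeyError in the question simply never appears in the loop.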
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.
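For the spatial statistics case, a per-location mean and count could be written without any explicit loop. This is a sketch on made-up data, assuming month, lat, and lon are ordinary columns here rather than index levels.

```python
import pandas as pd

# Made-up data; column names follow the question's file.
df = pd.DataFrame({
    'month': [1, 1, 2],
    'lat':   [0.0, 0.0, 0.0],
    'lon':   [34.0, 34.0, 34.0],
    'aod':   [0.032, 0.114, 0.070],
})

# One row per (month, lat, lon) combination that exists in the data.
stats = df.groupby(['month', 'lat', 'lon'])['aod'].agg(['mean', 'count'])
print(stats)
```

The result is indexed by the three grouping keys, so stats.loc[(1, 0.0, 34.0), 'mean'] looks up a single cell.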

reading values from the text file and defining the variables

I have a text file that contains strings, floating point numbers, and integers, separated by double spaces
cat input.txt
nms val pet dzl
sdt 2.5 3.5 1
tyu 2.8 7.5 5
I want to load the txt file and assign each row's values to new variables in order to perform some task inside the loop
My trial script
import numpy as np
import pandas as pd
main_file=np.loadtxt("input.txt")
for file in main_file:
a= should be sdt
b= should be 2.5
c= should be 3.5
d= should be 1
similarly for second row i want to do the same so that
a= should be tyu
b= should be 2.8
c= should be 7.5
d= should be 5
Error: ValueError: could not convert string to float: 'sdt'
How can I fix this?
I guess in your example a combination of open with readlines and split on whitespace does the job.
# Using readlines()
with open('input.txt', 'r') as file1:
    lines = file1.readlines()
for i, line in enumerate(lines):
    if i == 0:
        continue  # skip the header row
    # split() with no argument splits on any run of whitespace
    # and drops the trailing newline
    a, b, c, d = line.split()
    print(f'In line {i} a is {a}, b is {b}, c is {c} and d is {d}.')
This is the output:
In line 1 a is sdt, b is 2.5, c is 3.5 and d is 1.
In line 2 a is tyu, b is 2.8, c is 7.5 and d is 5.
If you have more than 4 columns, you can use split() and assign the result to one variable. This will be a list, and you can select the items by zero-based index.
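A short sketch of that list-based variant follows. The lines are inlined here instead of being read from input.txt, and the conversions to float and int are an assumption about how the values will be used, since split() always returns strings.

```python
# Inlined copy of the file's lines, including the header row.
lines = ["nms  val  pet  dzl\n", "sdt  2.5  3.5  1\n", "tyu  2.8  7.5  5\n"]

records = []
for line in lines[1:]:          # skip the header row
    fields = line.split()       # list of strings, one per column
    name = fields[0]
    val, pet = float(fields[1]), float(fields[2])
    dzl = int(fields[3])
    records.append((name, val, pet, dzl))

print(records)  # [('sdt', 2.5, 3.5, 1), ('tyu', 2.8, 7.5, 5)]
```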
In [244]: cat test.csv
nms val pet dzl
sdt 2.5 3.5 1
tyu 2.8 7.5 5
An easy way to load a csv is with pandas. I use engine and sep to handle the whitespace separator (rather than the default comma):
In [245]: df = pd.read_csv('test.csv', sep=r'\s+', engine='python')
In [246]: df
Out[246]:
nms val pet dzl
0 sdt 2.5 3.5 1
1 tyu 2.8 7.5 5
Since you tagged dataframe and pandas, I'll assume you can take it from there.
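One way to take it from there, so that each row's values land in a, b, c, and d inside a loop, is itertuples. This is a sketch that reads the same content from an in-memory buffer (an assumption made to keep it self-contained) rather than from test.csv.

```python
import io
import pandas as pd

text = "nms  val  pet  dzl\nsdt  2.5  3.5  1\ntyu  2.8  7.5  5\n"
df = pd.read_csv(io.StringIO(text), sep=r'\s+')

rows = []
# name=None yields plain tuples, which unpack directly into four variables.
for a, b, c, d in df.itertuples(index=False, name=None):
    rows.append((a, b, c, d))
    print(a, b, c, d)
```

Unlike the readlines approach, the numeric columns arrive already converted, so b and c are floats and d is an integer.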
genfromtxt can read it as well, but for strings that aren't floats it inserts a nan:
In [247]: data = np.genfromtxt('test.csv')
In [248]: data
Out[248]:
array([[nan, nan, nan, nan],
[nan, 2.5, 3.5, 1. ],
[nan, 2.8, 7.5, 5. ]])
np.loadtxt given the same thing raises errors because it can't convert those strings to float. That should be clear from the docs.
With a few more parameters you can get a nice structured array:
In [251]: data = np.genfromtxt('test.csv', dtype=None, names=True, encoding=None)
In [252]: data
Out[252]:
array([('sdt', 2.5, 3.5, 1), ('tyu', 2.8, 7.5, 5)],
dtype=[('nms', '<U3'), ('val', '<f8'), ('pet', '<f8'), ('dzl', '<i8')])
data['val'] gives all the val column. Or you can iterate on the rows with:
In [258]: for row in data:
...: print(row)
...:
('sdt', 2.5, 3.5, 1)
('tyu', 2.8, 7.5, 5)
A similar structured array from loadtxt (note that the bare 'str' dtype has zero width, so the string field comes out empty):
In [262]: data = np.loadtxt('test.csv', dtype='str,f,f,i', skiprows=1,encoding=None)
In [263]: data
Out[263]:
array([('', 2.5, 3.5, 1), ('', 2.8, 7.5, 5)],
dtype=[('f0', '<U'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<i4')])
With base readlines you can get a list of strings for each row, and parse those as you want:
In [264]: with open('test.csv','r') as f: lines = f.readlines()
In [265]: lines
Out[265]: ['nms val pet dzl\n', 'sdt 2.5 3.5 1\n', 'tyu 2.8 7.5 5\n']
In [266]: lines[1]
Out[266]: 'sdt 2.5 3.5 1\n'
In [267]: lines[1].split()
Out[267]: ['sdt', '2.5', '3.5', '1']
In [268]: a,b,c,d = lines[1].split()
In [269]: b
Out[269]: '2.5'

Random selection of sample from CSV file with dask different than with pandas

I have 3 big CSV files. I try to randomly extract some samples from the files without loading them into the memory. I am doing this:
SITS = dd.read_csv("sits_train_0.csv", blocksize="512MB",
usecols=band_blue + ["samplefid"]).set_index("samplefid")
MASK = dd.read_csv("mask_train_0.csv", blocksize="512MB",
usecols=band_mask + ["samplefid"]).set_index("samplefid")
GP = dd.read_csv("sits_gp_train_0.csv", blocksize="512MB",
usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
# SITS = pd.read_csv("sits_train_0.csv",
# usecols=band_blue + ["samplefid"]).set_index("samplefid")
# MASK = pd.read_csv("mask_train_0.csv",
# usecols=band_mask + ["samplefid"]).set_index("samplefid")
# GP = pd.read_csv("sits_gp_train_0.csv",
# usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
np.random.seed(0)
NSAMPLES=100
samples = np.random.choice(MASK.index, size=NSAMPLES, replace=False)
s = SITS.loc[samples][band_blue].compute().values
m = MASK.loc[samples][band_mask].compute().values
sg = GP.loc[samples][band_blue_gp].compute().values
# s = SITS.loc[samples][band_blue].values
# m = MASK.loc[samples][band_mask].values
# sg = GP.loc[samples][band_blue_gp].values
I got strange results, so I compared against pandas on smaller files (see the commented code above), for which the results are correct.
If I set blocksize to None, the results are fine, but that loads everything into memory, so using dask is not useful in that case, and my CSV files are too big to fit in memory. The rows in my CSV files are written in random order, so I need to use the index to recover the same samples from the 3 CSV files.
I feel I am missing something about dask, but I don't see what.
I'd recommend using sample
In [16]: import pandas as pd
In [17]: import dask.dataframe as dd
In [18]: df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...: 'num_wings': [2, 0, 0, 0],
...: 'num_specimen_seen': [10, 2, 1, 8]},
...: index=['falcon', 'dog', 'spider', 'fish'])
In [19]: ddf = dd.from_pandas(df, npartitions=2)
In [20]: ddf.sample??
In [21]: df.sample(frac=0.5, replace=True, random_state=1)
Out[21]:
num_legs num_wings num_specimen_seen
dog 4 0 2
fish 0 0 8
In [22]: ddf.sample(frac=0.5, replace=True, random_state=1)
Out[22]:
Dask DataFrame Structure:
num_legs num_wings num_specimen_seen
npartitions=2
dog int64 int64 int64
fish ... ... ...
spider ... ... ...
Dask Name: sample, 4 tasks
In [23]: ddf.sample(frac=0.5, replace=True, random_state=1).compute()
Out[23]:
num_legs num_wings num_specimen_seen
falcon 2 2 10
fish 0 0 8
