How to extract a specific part of a text file in Python?

I have big data, as shown in the uploaded pic: it has 90 BAND-INDEX blocks and each BAND-INDEX has 300 rows.
I want to search the text file for a specific value like -24.83271 and extract the BAND-INDEX block containing that value in array form. Can you please write the code to do so? Thank you in advance.
I am unable to extract the specific BAND-INDEX in array form.

Try reading the file line by line and using a generator. Here is an example:
import csv
import pandas as pd

# generate and save a demo csv
pd.DataFrame({
    'Band-Index': (0.01, 0.02, 0.03, 0.04, 0.05, 0.06),
    'value': (1, 2, 3, 4, 5, 6),
}).to_csv('example.csv', index=False)

def search_values_in_file(search_values: list):
    with open('example.csv') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip header
        for row in reader:
            band_index, value = row
            if value in search_values:
                yield row

# get rows from the csv where value is in ['4', '6']
df = pd.DataFrame(list(search_values_in_file(['4', '6'])), columns=['Band-Index', 'value'])
print(df)
#   Band-Index value
# 0       0.04     4
# 1       0.06     6
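For the actual BAND-INDEX file, here is a sketch of the same line-by-line generator idea. The exact layout is an assumption on my part, since only the pic describes it: each block is taken to start with a header line containing 'BAND-INDEX', followed by its numeric rows, and the filename 'bands.txt' is hypothetical:
import numpy as np

def find_band_blocks(path, target, tol=1e-5):
    # yield (header, values) for each BAND-INDEX block that contains target
    header, values = None, []
    with open(path) as fh:
        for line in fh:
            if 'BAND-INDEX' in line:  # assumed block delimiter
                if values and any(abs(v - target) < tol for v in values):
                    yield header, np.array(values)
                header, values = line.strip(), []
            else:
                values.extend(float(tok) for tok in line.split())
    # don't forget the last block in the file
    if values and any(abs(v - target) < tol for v in values):
        yield header, np.array(values)

# usage: every block containing -24.83271
for header, block in find_band_blocks('bands.txt', -24.83271):
    print(header, block.shape)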

Related

Encoding in Python such that numbering starts with 1

I have a dataframe, wherein the column 'team' needs to be encoded.
This is my code:
#Load the required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
# Create dictionary
data = {'team': ['A', 'A', 'B', 'B', 'C'],
        'Income': [5849, 4583, 3000, 2583, 6000],
        'Coapplicant Income': [0, 1508, 0, 2358, 0],
        'LoanAmount': [123, 128, 66, 120, 141]}
#Convert dictionary to dataframe
df = pd.DataFrame(data)
print("\n df",df)
# Initiate label encoder
le = LabelEncoder()
# return encoded label
label = le.fit_transform(df['team'])
# printing label
print("\n label =",label )
# removing the column 'team' from df
df.drop("team", axis=1, inplace=True)
# Appending the array to our dataFrame
df["team"] = label
# printing Dataframe
print("\n df",df)
After encoding, I get 'team' moved to the last column and encoded as 0, 0, 1, 1, 2.
However, I wish to ensure the following two things:
Encoding starts with 1 and not 0.
The location of column 'team' should remain the same as in the original.
i.e. I wish 'team' to be encoded as 1, 1, 2, 2, 3 and to keep its original position.
Can somebody please help me out with how to do this?
Do not drop the column and increment the label on assignment:
le = LabelEncoder()
# return encoded label
label = le.fit_transform(df['team'])
# Replacing the column
df["team"] = label + 1
Output:
df
   team  Income  Coapplicant Income  LoanAmount
0     1    5849                   0         123
1     1    4583                1508         128
2     2    3000                   0          66
3     2    2583                2358         120
4     3    6000                   0         141
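For reference, a pandas-only sketch (my addition, not from the original answer) that does the same thing with pd.factorize, using the df from the question. Note that factorize numbers categories by order of first appearance rather than alphabetically, which happens to coincide for this data:
# 0-based codes in order of appearance; add 1 so numbering starts at 1
codes, uniques = pd.factorize(df['team'])
df['team'] = codes + 1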

Question regarding converting a dictionary to a csv file

I am new to Python and using pandas.
I am trying to convert data in a dictionary to a csv file.
Here is the dictionary:
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
            'sqft_lot': 5800, 'floors': 2.0,
            'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
            'sqft_basement': 20,
            'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
And I use the method below to save and read the dictionary as a csv file:
import pandas as pd

with open('test.csv', 'w') as f:
    for key in data_new:
        f.write("%s,%s\n" % (key, data_new[key]))
df1 = pd.read_csv("test.csv")
df1
When I read df1 back, each key/value pair appears as a row (the screenshot is not shown here), but I want all rows to be columns, so I used the transpose function.
However, in the transposed output 'bathrooms' is at index 0, and I want the index to start from 'bedrooms', because with that output tdf1.info() does not show the bedrooms data at all.
Could you please guide me on how I can fix this?
Regards,
Aravind Viswanathan
I think it would be easier to just use pandas to both write and read your csv file. Does this satisfy what you're trying to do?
import pandas as pd

data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
            'sqft_lot': 5800, 'floors': 2.0,
            'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
            'sqft_basement': 20,
            'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
df1 = pd.DataFrame.from_dict([data_new])
df1.to_csv('test.csv', index=None)  # index=None prevents the index being added as column 1
df2 = pd.read_csv('test.csv')
print(df1)
print(df2)
Output:
   bedrooms  bathrooms  sqft_living  ...  yr_built  yr_renovated       city
0       2.0        3.0         1200  ...      1925          2003  Shoreline

[1 rows x 13 columns]

   bedrooms  bathrooms  sqft_living  ...  yr_built  yr_renovated       city
0       2.0        3.0         1200  ...      1925          2003  Shoreline

[1 rows x 13 columns]
Identical.
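If you would rather keep the manual key,value writing from the question, here is a minimal sketch of the read side (my addition; it assumes test.csv holds one key,value pair per line, as written above): tell read_csv there is no header row, use the first column as the index, and transpose so the keys become columns again.
import pandas as pd

# no header row; column 0 holds the keys, then transpose keys into columns
df1 = pd.read_csv('test.csv', header=None, index_col=0).T
print(df1.columns.tolist())  # starts with 'bedrooms' again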

openpyxl : Update multiple columns & rows from dictionary

I have a nested Dictionary
aDictionary = {'Asset': {'Name': 'Max', 'Age': 28, 'Job': 'Nil'}, 'Parameter': {'Marks': 60, 'Height': 177, 'Weight': 76}}
I want to update the values in an excel as follows
|Asset |Name |Max|
|Asset |Age |28 |
|Asset |Job |Nil|
|Parameter|Marks |60 |
|Parameter|Height|177|
|Parameter|Weight|76 |
I tried something like this, but the result is not what I was expecting. I'm pretty new to openpyxl and can't seem to wrap my head around it.
from openpyxl import *
workbook = load_workbook('Empty.xlsx')
worksheet = workbook['Sheet1']
for m in range(1, 7):
    for i in aDictionary:
        worksheet["A" + str(m)].value = i
        for j, k in aDictionary[i].items():
            worksheet["B" + str(m)].value = j
            worksheet["C" + str(m)].value = k
workbook.save('Empty.xlsx')
One way to do this is to convert the dictionary to a DataFrame, stack it the way you indicated, rearrange the columns, and then write it to Excel. I've used pandas to_excel since it is a single line of code, but you could use load_workbook() as well.
The stacking part was borrowed from here.
Code
import pandas as pd

aDictionary = {'Asset': {'Name': 'Max', 'Age': 28, 'Job': 'Nil'}, 'Parameter': {'Marks': 60, 'Height': 177, 'Weight': 76}}
df = pd.DataFrame(aDictionary)  # Convert to dataframe
df = df.stack().reset_index()   # Stack; NaN cells (e.g. Asset/Marks) are dropped
# Rearrange the columns: outer key first, then inner key, then value
cols = list(df.columns)
cols[0], cols[1] = cols[1], cols[0]
df = df[cols]
# Write to Excel
df.to_excel('Empty.xlsx', sheet_name='Sheet1', index=False, header=None)
The output in Excel matches the six-row layout shown in the question.
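For completeness, here is a pure-openpyxl sketch (my addition, since the answer notes load_workbook() is possible too; it assumes an existing 'Empty.xlsx' with a 'Sheet1', as in the question):
from openpyxl import load_workbook

workbook = load_workbook('Empty.xlsx')
worksheet = workbook['Sheet1']

row = 1
for outer_key, inner in aDictionary.items():
    for inner_key, value in inner.items():
        # one row per inner key: outer key, inner key, value
        worksheet.cell(row=row, column=1, value=outer_key)
        worksheet.cell(row=row, column=2, value=inner_key)
        worksheet.cell(row=row, column=3, value=value)
        row += 1
workbook.save('Empty.xlsx')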

How to encode multiple categorical columns for test data efficiently?

I have multiple categorical columns (nearly 50). I am using a custom-made frequency encoding on the training data and saving the mapping as a nested dictionary. For the test data I use the map function to encode, with unseen labels replaced by 0. But I need a more efficient way.
I have already tried pandas' replace method, but it does not take care of unseen labels and leaves them as they are. I am also concerned about time: I want, say, 80 columns and 1 row to be encoded within 60 ms. I just need the most efficient way to do this. I have taken my example from here.
import pandas
from sklearn import preprocessing

df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                       'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                       'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                    'New_York']})
My dict looks something like this:
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
       'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
       'location': {'New_York': 0, 'San_Diego': 1}}
for col in enc:
    if col in input_df.columns:
        input_df[col] = input_df[col].map(enc[col]).fillna(0)
Further, I want multiple columns to be encoded at once, without a loop over every column. I guess we can't do that with map, so replace seems the natural choice, but as said it doesn't take care of unseen labels.
EDIT:
This is the code I am using for now. Please note there is only 1 row in the test data frame (I am not sure whether I should handle it as a numpy array to reduce time). I need to get the time under 60 ms, and currently it is 331.74 ms. I only have a dictionary for the mapping (can't use one-hot because of the use case). Any idea how to do this more efficiently? I am not sure multiprocessing will work. Further, the replace method gave me several issues: 1. it does not handle unseen labels and leaves them as they are (an issue for strings); 2. it has problems with overlapping keys and values.
from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time

def iter_all_strings():
    for size in itertools.count(1):
        for s in itertools.product(ascii_lowercase, repeat=size):
            yield "".join(s)

# build column names 'a', 'b', ..., 'gr' (200 columns)
l = []
for s in iter_all_strings():
    l.append(s)
    if s == 'gr':
        break
columns = l

df = pd.DataFrame(columns=columns)
for col in df.columns:
    df[col] = np.random.randint(1, 4000, 3000)

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")

# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
    df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of the 2nd data frame is {df2.shape}")

t1 = time.time()
for col in df2.columns:
    df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)
Firstly, when you want to encode categorical variables that are not ordinal (meaning there is no inherent ordering between the values of the column, e.g. cat, dog), you should use one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                   'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                   'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                'New_York']})
enc = [['cat', 'dog', 'monkey'],
       ['Brick', 'Champ', 'Ron', 'Veronica'],
       ['New_York', 'San_Diego']]
ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse=False)
Here, I have modified your enc in a way that can be fed into the OneHotEncoder.
Now, how are we going to handle the unseen labels?
With handle_unknown='ignore', an unseen value simply gets zeros in all of its dummy variables, which in a way helps the model understand that it is an unknown value. Note that ohe.categories_ only exists after fitting, so fit_transform is called first:
encoded = ohe.fit_transform(df)
colnames = ['{}_{}'.format(col, val)
            for col, unique_values in zip(df.columns, ohe.categories_)
            for val in unique_values]
pd.DataFrame(encoded, columns=colnames)
Update:
If you are fine with ordinal encoding, the following change could help (row.items() yields (column, value) pairs):
df2.apply(lambda row: [transform_dict[col].get(val, 0)
                       for col, val in row.items()],
          axis=1,
          result_type='expand')
# 1000 loops, best of 3: 1.17 ms per loop
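Since the test frame has exactly one row, a plain-Python sketch (my addition, reusing transform_dict and df2 from above) that avoids per-column pandas overhead entirely may be faster still:
# one dict lookup per cell; .get(val, 0) sends unseen labels to 0
row = df2.iloc[0]
encoded = {col: transform_dict[col].get(val, 0) for col, val in row.items()}
df2_encoded = pd.DataFrame([encoded])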

Read values from text file into 2D numpy array using index values from the text file

I need to read a text file that contains comma-delimited values into a 2D numpy array. The first 2 values on each line contain the index values for the numpy array and the third values contains the value to be stored in the array. As a catch, the index values are 1-based and need to be converted to the 0-based index values used by numpy. I've reviewed documentation and examples using genfromtxt and loadtxt but it's still not clear to me how to go about it. I've also tried the following code with no success:
a = np.arange(6).reshape(2, 3)
for line in infile:
    fields = line.split(',')  # split the comma-delimited fields into a list
    rindex = int(fields[0]) - 1
    cindex = int(fields[1]) - 1
    a[rindex, cindex] = float(fields[2])
Here is an example of the input file:
1,1,10.1
1,2,11.2
1,3,12.3
2,3,13.4
2,2,14.5
2,3,15.6
And here is my desired output array. Ideally I'd like it to work on any array size without having to predefine the size of the array.
10.1 11.2 12.3
13.4 14.5 15.6
Here's one way you can do it. numpy.genfromtxt() is used to read the data into a structured array with three fields. The row and column indices are pulled out of the structured array and used to figure out the shape of the desired array, and to assign the values to the new array using numpy's "fancy" indexing:
In [46]: !cat test_data.csv
1,1,10.1
1,2,11.2
1,3,12.3
2,3,13.4
2,2,14.5
2,3,15.6
In [47]: data = np.genfromtxt('test_data.csv', dtype=None, delimiter=',', names=['i', 'j', 'value'])
In [48]: data
Out[48]:
array([(1, 1, 10.1), (1, 2, 11.2), (1, 3, 12.3), (2, 3, 13.4),
       (2, 2, 14.5), (2, 3, 15.6)],
      dtype=[('i', '<i8'), ('j', '<i8'), ('value', '<f8')])
In [49]: rows = data['i']
In [50]: cols = data['j']
In [51]: nrows = rows.max()
In [52]: ncols = cols.max()
In [53]: a = np.zeros((nrows, ncols))
In [54]: a[rows-1, cols-1] = data['value']
In [55]: a
Out[55]:
array([[ 10.1,  11.2,  12.3],
       [  0. ,  14.5,  15.6]])
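For comparison, here is a compact pandas variant of the same fancy-indexing idea (my addition, not part of the original answer; it assumes the same test_data.csv):
import numpy as np
import pandas as pd

# read the 1-based (row, col, value) triples, then scatter into a zero array
data = pd.read_csv('test_data.csv', header=None, names=['i', 'j', 'value'])
a = np.zeros((data['i'].max(), data['j'].max()))
a[data['i'] - 1, data['j'] - 1] = data['value']
print(a)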
