Replace items like A2 with AA in the dataframe - python-3.x

I have a list of items like "A2BCO6" and "ABC2O6". I want to replace them so that A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. There are many more items than shown here.
My dataframe is like:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I created a duplicate array and tried to replace the values in the following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i,j in range(len(B)), range(len(C)):
    listAB["Finctional_Group"] = listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce the correct output. The output is:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.

For simplicity I used the chemparse package, which seems to suit your needs.
As always, we import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
    columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula, which returns a dict mapping each element to its frequency in a molecular formula.
def parse_molecule(molecule: str) -> str:
    # initializing empty string
    molecule_in_string = ""
    # iterating over all keys & values in the dict
    for key, value in chemparse.parse_formula(molecule).items():
        # appending each element, repeated by its count, to the string
        molecule_in_string += key * int(value)
    return molecule_in_string
molecule_in_string now contains the molecular formula without numbers. We just need to map this function over every element in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
  Finctional_Group
0   BaBaNbFeOOOOOO
1   BaBaScIrOOOOOO
2    MnPbPbWOOOOOO
Source code for chemparse: https://gitlab.com/gmboyer/chemparse
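For completeness, the loop in the question can also be fixed without an extra package by pairing the two lists with zip (a minimal sketch using the question's own data):

```python
import pandas as pd

listAB = pd.DataFrame({"Finctional_Group": ["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]})

B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]

# walk the two lists in parallel so each pattern is paired with its own replacement
for b, c in zip(B, C):
    listAB["Finctional_Group"] = (
        listAB["Finctional_Group"].str.strip().str.replace(b, c, regex=False)
    )

print(listAB)
```

Unlike the chemparse approach, this only expands the substrings you list explicitly, so "O6" stays as-is.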

Related

Make some transformation to a column and then set it as index

Let's say I have the pandas DataFrame below:
import pandas as pd
dat = pd.DataFrame({'A' : [1,2,3,4], 'B' : [3,4,5,6]})
dat['A1'] = dat['A'].astype(str) + '_Something'
dat.set_index('A1')
While this is alright, I want to achieve two things:
Instead of the line dat['A1'] = dat['A'].astype(str) + '_Something', can I transform column A on the fly and pass the transformed values directly to dat.set_index? My transformation function is rather complex, so I am looking for a general approach.
After setting the index, can I remove A1, which is now sitting as the header of the index?
Any pointer will be very helpful
You can pass a np.array to df.set_index. So just chain Series.to_numpy after the transformation, and make sure you set the inplace parameter to True inside set_index.
dat.set_index(
    (dat['A'].astype(str) + '_Something')  # transformation
    .to_numpy(),
    inplace=True)
print(dat)
print(dat)
             A  B
1_Something  1  3
2_Something  2  4
3_Something  3  5
4_Something  4  6
So, generalized with a function applied, that would be something like:
def f(x):
    y = f'{x}_Something'
    return y
dat.set_index(dat['A'].apply(f).to_numpy(), inplace=True)
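Regarding the second point in the question (removing the A1 label sitting above the index): if you keep the intermediate-column approach instead, the leftover index name can be dropped with rename_axis (a small sketch):

```python
import pandas as pd

dat = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
dat['A1'] = dat['A'].astype(str) + '_Something'

# set_index('A1') keeps 'A1' as the index name; rename_axis(None) drops it
dat = dat.set_index('A1').rename_axis(None)

print(dat)
```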

How to create a dictionary of dates as keys with value pair as list of three temperatures in python

The function extracts the max, min and avg temperatures for all days in the list. I want to combine the data into a dictionary, i.e. the returned temperature values with the dates as keys. I can't seem to get this to work and may be going about it the wrong way. The end aim is to create a chart with the date and the three temperatures for each day. I was anticipating something like: my_dict = {date: [list of 3 temps], date2: [list of 3 temps], ...}
lstdates=['09-27','09-28','09-29','09-30','10-1']
def daily_normals(date):
    """Daily Normals.

    Args:
        date (str): A date string in the format '%m-%d'

    Returns:
        A list of tuples containing the daily normals, tmin, tavg, and tmax
    """
    sel = [func.min(meas.tobs), func.avg(meas.tobs), func.max(meas.tobs)]
    return session.query(*sel).filter(func.strftime("%m-%d", meas.date) == date).all()
lstdaynorm = []
my_dict = {}
for i in lstdates:
    print(i)
    dn = daily_normals(l)
    lstdaynorm.append(dn)
    my_dict.append(i, dn)
For starters, a dict object has no method called append, so my_dict.append(i, dn) raises an AttributeError. Also, your iterator variable is i, but you called daily_normals on l. You should convert the tuple dn to a list and insert that list directly into the dict to achieve what you want:
lstdaynorm = []
my_dict = {}
for i in lstdates:
    dn = daily_normals(i)
    lstdaynorm.append(dn)
    my_dict[i] = list(dn[0])  # convert the (tmin, tavg, tmax) tuple to a list
To put this in a dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['tmin', 'tavg', 'tmax'])
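Putting it together as a runnable sketch (with daily_normals stubbed out, since the SQLAlchemy session and the meas model are not shown in the question):

```python
import pandas as pd

lstdates = ['09-27', '09-28', '09-29', '09-30', '10-1']

def daily_normals(date):
    # stub standing in for the SQLAlchemy query in the question;
    # .all() there returns a list with one (tmin, tavg, tmax) tuple
    return [(60.0, 70.0, 80.0)]

my_dict = {}
for d in lstdates:
    dn = daily_normals(d)
    my_dict[d] = list(dn[0])

df = pd.DataFrame.from_dict(my_dict, orient='index',
                            columns=['tmin', 'tavg', 'tmax'])
print(df)
```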

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map (in the labels_df) to culture::french, tag::dogs, tag::men. I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am wayyy off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from @Danny:
sample_train_df['attribute_ids'].apply(
    lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]
               ['attribute_id_num'] for i in x])
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned: (screenshot omitted)
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
This is super ugly and one day, hopefully sooner than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need.
split train_df['attribute_ids'] into their own cell/column
helper_df = train_df['attribute_ids'].str.split(expand=True)
combine train_df with the helper_df so I have the id column (they are photo id's)
train_df2 = pd.concat([train_df, helper_df], axis=1)
drop the original attribute_ids column
train_df2.drop(columns = 'attribute_ids', inplace=True)
rename the new columns
train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4', 4: 'attr5', 5: 'attr6',
                                      6: 'attr7', 7: 'attr8', 8: 'attr9', 9: 'attr10', 10: 'attr11'})
convert the labels_df into a dictionary
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
map and replace the tag numbers with their corresponding tag names
my_map = create_file_mapping(labels_df)
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
create a new column of the observations tags in a list of concatenated values
train_df3['new_col'] = train_df3[train_df3.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis=1)
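For reference, the whole multi-step pipeline above can usually be collapsed into one dictionary lookup plus a single apply, assuming labels_df has the attribute_id_num and attribute_name columns used above (a sketch on made-up sample rows, not the real csv files):

```python
import pandas as pd

# hypothetical samples mimicking the structure of train.csv and labels.csv
train_df = pd.DataFrame({'id': ['img1', 'img2'],
                         'attribute_ids': ['147 616 813', '51 616']})
labels_df = pd.DataFrame({'attribute_id_num': [51, 147, 616, 813],
                          'attribute_name': ['culture::greek', 'culture::french',
                                             'tag::dogs', 'tag::men']})

# one dict lookup replaces the split/concat/rename/applymap steps
id_map = dict(zip(labels_df['attribute_id_num'].astype(str),
                  labels_df['attribute_name']))
train_df['new_col'] = train_df['attribute_ids'].apply(
    lambda s: ','.join(id_map[i] for i in s.split()))

print(train_df)
```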

Append each value in a DataFrame to a np vector, grouping by column

I am trying to create a list, which will be fed as input to the neural network of a Deep Reinforcement Learning model.
What I would like to achieve:
This list should have the properties of this code's output
vec = []
lines = open("data/" + "GSPC" + ".csv", "r").read().splitlines()
for line in lines[1:]:
    vec.append(float(line.split(",")[4]))
i.e. just a flat list of values.
The original dataframe looks like:
Out[0]:
Close sma15
0 1.26420 1.263037
1 1.26465 1.263193
2 1.26430 1.263350
3 1.26450 1.263533
but by using df.transpose() I obtained the following:
0 1 2 3
Close 1.264200 1.264650 1.26430 1.26450
sma15 1.263037 1.263193 1.26335 1.263533
from here I would like to obtain a list grouped by column, of the type:
[1.264200, 1.263037, 1.264650, 1.263193, 1.26430, 1.26335, 1.26450, 1.263533]
I tried
x = np.array(df.values.tolist(), dtype = np.float32).reshape(1,-1)
but this gives me a float with 1 row and 6 columns, how could I achieve a result that has the properties I am looking for?
From what I can understand, you just want a flattened version of the DataFrame's values. That can be done simply with the ndarray.flatten() method rather than reshaping it.
import pandas as pd

# Creating your DataFrame object
a = [[1.26420, 1.263037],
     [1.26465, 1.263193],
     [1.26430, 1.263350],
     [1.26450, 1.263533]]
df = pd.DataFrame(a, columns=['Close', 'sma15'])

df.values.flatten()
This gives array([1.2642, 1.263037, 1.26465, 1.263193, 1.2643, 1.26335, 1.2645, 1.263533]) as is (presumably) desired.
PS: I am not sure why you have not included the last row of the DataFrame as the output of your transpose operation. Is that an error?
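If a plain Python list is needed (to match the list built from the csv file in the question), tolist() can be chained onto the flattened array:

```python
import pandas as pd

a = [[1.26420, 1.263037],
     [1.26465, 1.263193],
     [1.26430, 1.263350],
     [1.26450, 1.263533]]
df = pd.DataFrame(a, columns=['Close', 'sma15'])

# flatten() walks the values row by row: Close, sma15, Close, sma15, ...
vec = df.values.flatten().tolist()
print(vec)
```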

How to convert Excel negative value to Pandas negative value

I am a beginner in python pandas. I am working on a dataset named fortune_company; the data are shown below.
In this dataset the Profits_In_Million column has some negative values, indicated by red color and parentheses,
but in pandas they show up like the screenshot below.
I was trying to convert the datatype of the Profits_In_Million column using the code below:
import pandas as pd
fortune.Profits_In_Million = fortune.Profits_In_Million.str.replace("$","").str.replace(",","").str.replace(")","").str.replace("(","-").str.strip()
fortune.Profits_In_Million.astype("float")
But I am getting the error below. Could someone please help me with that? How can I convert this string datatype to float?
ValueError: could not convert string to float: '-'
Assuming you have no control over the cell format in Excel, the converters kwarg of read_excel can be used:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels, values are functions that take
one input argument, the Excel cell content, and return the transformed
content.
From read_excel's docs.
def negative_converter(x):
    # a somewhat naive implementation
    if '(' in x:
        x = '-' + x.strip('()')
    return x

df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 $1000
# 1 -$1000
Note however that the values of this column are still strings and not numbers (int/float). You can quite easily implement the conversion in negative_converter (remove the dollar sign, and most probably the comma as well), for example:
def negative_converter(x):
    # a somewhat naive implementation
    x = x.replace('$', '')
    if '(' in x:
        x = '-' + x.strip('()')
    return float(x)

df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 1000.0
# 1 -1000.0
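If the file has already been read without converters, the same cleanup can be done afterwards with vectorized string methods (a sketch on a hypothetical raw column, not the actual fortune_company data):

```python
import pandas as pd

# hypothetical raw strings as they might appear after a plain read_excel
s = pd.Series(["$1,000", "($1,000)"])

cleaned = (
    s.str.replace(r"[$,]", "", regex=True)          # drop $ and thousands separators
     .str.replace(r"\((.*)\)", r"-\1", regex=True)  # turn (x) into -x
     .astype(float)
)
print(cleaned)
```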
