I've created a function to check the range of values in a pandas dataframe. But the output is producing all values with scientific notation.
When I select_dtypes to include only int, I don't get this problem. It only happens when I include float. How can I get non-scientific values?
# Function to check the range of values in a column.
def value_range(col):
    # avoid shadowing the built-ins max() and min()
    col_max = data[col].max()
    col_min = data[col].min()
    return col_max - col_min

# value_range('total_revenue')
numerical_data = data.select_dtypes(include=[float, int]).columns
print(value_range(numerical_data))
Out:
Unnamed: 0 3.081290e+05
number_of_sessions 1.340000e+02
total_bounce 1.080000e+02
total_hits 3.706000e+03
days_difference_from_last_first_visit 1.800000e+02
transactions 2.500000e+01
total_revenue 2.312950e+10
organic_search 1.110000e+02
direct 8.300000e+01
referral 1.070000e+02
social 1.120000e+02
paid_search 4.400000e+01
affiliates 3.700000e+01
display 8.500000e+01
target_if_purchase 1.000000e+00
target_total_revenue 2.039400e+09
dtype: float64
value_range(numerical_data).apply(lambda x: format(x, 'f')) solves the problem. Thanks Sparrow1029.
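For reference, here is a minimal sketch of that fix next to a display-only alternative. The two sample values are copied from the output above; the names ranges, as_text and shown are just for illustration. Note that format(x, 'f') turns the values into strings, while the display.float_format option only changes how floats are printed.

```python
import pandas as pd

# Two of the ranges from the output above, used as sample data.
ranges = pd.Series({"total_hits": 3706.0, "total_revenue": 2.31295e10})

# Option 1: fixed-point strings (the Series now holds str, not float).
as_text = ranges.apply(lambda x: format(x, "f"))

# Option 2: keep the floats and only change how pandas prints them.
with pd.option_context("display.float_format", "{:.2f}".format):
    shown = ranges.to_string()

print(as_text)
print(shown)
```

Option 2 is probably the safer choice if the numbers are still needed for arithmetic afterwards.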
I have 2 Excel files which contains names as the only column:
File 1: file1.xlsx
Names
Vinay adkz
Sagarbhansali
Jeffery Jonas
Kiara Francis
Dominic
File 2: file2.xlsx
Names:
bhansali Sagar
Dominic
Jenny
adkzVinay
Sample Output:
I want to match the names in file 1 against the names in file 2, and I am trying to get output like this:
Names            File2Matchname    Match%
Vinay adkz       adkzVinay         98%
Sagarbhansali    bhansali Sagar    97%
Jeffery Jonas    NA                0%
Kiara Francis    NA                0%
Dominic          Dominic           100%
Is there any logic by which this output can be produced in Python?
I tried to do this in Excel, but VLOOKUP doesn't give a match %. I know this is possible in Python using cosine similarity, but I can't work out the logic that produces this output.
Any help would be much appreciated.
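You can use pandas together with difflib from Python's standard library: difflib.get_close_matches() picks the closest name in file 2, and the find_longest_match() method of difflib.SequenceMatcher finds the longest common substring between the two names, which is used here for the match percentage.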
Example code:
import pandas as pd
import difflib

# For testing
dict_lists = {"Names": ["Vinay adkz", "Shailesh", "Seema", "Nidhi", "Ajitesh"]}
dict_lists2 = {"Names": ["Ajitesh", "Vinay adkz", "Seema", "Nid"]}

# Excel to dataframes
df1 = pd.DataFrame(dict_lists)   # pd.read_excel('file1.xlsx')
df2 = pd.DataFrame(dict_lists2)  # pd.read_excel('file2.xlsx')

# Empty lists to store the matched name and the match percentage
match_name = []
match_percent = []

# Iterate through the first dataframe
for i, row in df1.iterrows():
    name = row['Names']
    # closest name in file 2, or an empty list if nothing scores at least 0.8
    match = difflib.get_close_matches(name, df2['Names'], n=1, cutoff=0.8)
    if match:
        match_name.append(match[0])
        # longest common substring, as a share of the original name's length
        match_string = difflib.SequenceMatcher(None, name, match[0]).find_longest_match(0, len(name), 0, len(match[0]))
        match_percentage = (match_string.size / len(name)) * 100
        match_percent.append(match_percentage)
    else:
        match_name.append('NA')
        match_percent.append(0)

df1['File2names'] = match_name
df1['Match_per'] = match_percent
print(df1)

# Write in Excel
# df1.to_excel('output.xlsx', index=False)
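As a variation, the percentage could instead come from SequenceMatcher.ratio(), which scores the whole string rather than only the longest common block. This is a sketch, not the method above: best_match is a hypothetical helper name, and lower-casing before comparing is an assumption about how the names should be treated.

```python
import difflib

def best_match(name, candidates):
    """Return (candidate, percent) for the highest SequenceMatcher ratio.

    Hypothetical helper; returns ('NA', 0.0) when nothing overlaps at all.
    """
    best, best_score = "NA", 0.0
    for cand in candidates:
        # ratio() is a similarity score between 0.0 and 1.0
        score = difflib.SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best, round(best_score * 100, 1)

file2_names = ["bhansali Sagar", "Dominic", "Jenny", "adkzVinay"]
for name in ["Vinay adkz", "Sagarbhansali", "Dominic"]:
    print(name, best_match(name, file2_names))
```

A minimum score (like the cutoff=0.8 used with get_close_matches) would still be needed to report weak matches as NA.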
I hope this helps you. This is the first time I am answering a question here.
Read also: How to use SequenceMatcher to find similarity between two strings?
I am trying to sort a given Series in pandas, but the result does not look correct to me; I expected something like [1, 3, 5, 10, python].
Can you please explain on what basis it sorts this way?
s1 = pd.Series(['1','3','python','10','5'])
s1.sort_values(ascending=True)
As explained in the comments, you have strings so '5' is greater than '10' (strings are compared character by character and '5' > '1').
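The character-by-character comparison can be checked in plain Python, without pandas. A small sketch using the same values as the question:

```python
values = ['1', '3', 'python', '10', '5']

# Strings compare one character at a time, so '5' vs '10' is decided by '5' vs '1'.
five_is_greater = '5' > '10'

# Sorting the strings therefore puts '10' right after '1'.
as_strings = sorted(values)

print(five_is_greater)
print(as_strings)
```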
One workaround is to use natsort for natural sorting:
from natsort import natsort_key
s1.sort_values(ascending=True, key=natsort_key)
output:
0 1
1 3
4 5
3 10
2 python
dtype: object
Alternative without natsort (numbers first, strings after):
key = lambda s: (pd.concat([pd.to_numeric(s, errors='coerce')
                              .fillna(float('inf')), s], axis=1)
                   .agg(tuple, axis=1)
                 )
s1.sort_values(ascending=True, key=key)
I have a list of items, like "A2BCO6" and "ABC2O6". I want to replace them as A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. There are many more items than shown here.
My dataframe is like:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I created a second array with the replacements and tried to apply them in the following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i,j in range(len(B)), range(len(C)):
    listAB["Finctional_Group"]= listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce the correct output. The output looks like this:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
For simplicity, I used the chemparse package, which seems to suit your needs.
As always, we import the required packages first, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula, which returns a dict mapping each element to its count in a molecular formula.
def parse_molecule(molecule: str) -> str:
    # initializing empty string
    molecule_in_string = ""
    # iterating over all keys & values in the dict
    for key, value in chemparse.parse_formula(molecule).items():
        # repeating the element symbol as many times as it occurs
        molecule_in_string += key * int(value)
    return molecule_in_string
The returned string contains the molecular formula without numbers. We just need to map this function over all elements in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
0 BaBaNbFeOOOOOO
1 BaBaScIrOOOOOO
2 MnPbPbWOOOOOO
dtype: object
Source code for chemparse: https://gitlab.com/gmboyer/chemparse
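If only the specific pairs from the question should be replaced (rather than expanding every count, as above), a minimal sketch is to walk the two lists in step with zip. The loop in the question unpacks range(len(B)) itself into i and j, so i is 0 and j is 1 on both passes, which is why "Ba2" became "PbPb" and "Pb2" was never touched. Passing regex=False is my assumption that the patterns are meant as literal text.

```python
import pandas as pd

listAB = pd.DataFrame({"Finctional_Group": ["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]})

B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]

col = listAB["Finctional_Group"].str.strip()
# zip pairs B[0] with C[0], B[1] with C[1], and so on
for old, new in zip(B, C):
    col = col.str.replace(old, new, regex=False)  # literal replacement
listAB["Finctional_Group"] = col

print(listAB)
```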
I am a beginner in pandas. I am working on a dataset named fortune_company.
In this dataset, the Profits_In_Million column contains some negative values, which are indicated in Excel by red color and parentheses.
In pandas, however, they show up as strings with the parentheses still in place.
I tried to convert the data type of the Profits_In_Million column using the code below:
import pandas as pd
fortune.Profits_In_Million = fortune.Profits_In_Million.str.replace("$","").str.replace(",","").str.replace(")","").str.replace("(","-").str.strip()
fortune.Profits_In_Million.astype("float")
But I am getting the error below. Could someone please help me with this? How can I convert this string column to float?
ValueError: could not convert string to float: '-'
Assuming you have no control over the cell format in Excel, the converters kwarg of read_excel can be used:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels, values are functions that take
one input argument, the Excel cell content, and return the transformed
content.
From read_excel's docs.
def negative_converter(x):
    # a somewhat naive implementation
    if '(' in x:
        x = '-' + x.strip('()')
    return x

df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
#   Profits_In_Million
# 0              $1000
# 1             -$1000
Note, however, that the values of this column are still strings and not numbers (int/float). You can quite easily implement the conversion in negative_converter (remove the dollar sign, and most probably the comma as well), for example:
def negative_converter(x):
    # a somewhat naive implementation
    x = x.replace('$', '').replace(',', '')
    if '(' in x:
        x = '-' + x.strip('()')
    return float(x)

df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
#    Profits_In_Million
# 0              1000.0
# 1             -1000.0
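If the column is already loaded as strings, a similar clean-up can be sketched with vectorized string methods. The sample values below are made up for illustration, and treating a bare '-' as missing (NaN) is an assumption about what that cell means in the spreadsheet; it is the value that makes astype("float") fail in the question.

```python
import pandas as pd

# Hypothetical sample values, including a bare '-' like the one in the error message.
raw = pd.Series(["$1,000", "($2,500.5)", "-", " $30 "])

cleaned = (
    raw.str.strip()
       .str.replace(r"^\((.*)\)$", r"-\1", regex=True)  # (x) -> -x
       .str.replace(r"[$,]", "", regex=True)            # drop $ and thousands separators
)

# errors="coerce" turns anything that is still not a number into NaN
profits = pd.to_numeric(cleaned, errors="coerce")
print(profits)
```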
I have a DF that looks like this (it is matlab data):
datesAvail date
0 737272 737272
1 737273 737273
2 737274 737274
3 737275 737275
4 737278 737278
5 737279 737279
6 737280 737280
7 737281 737281
After reading around online, I wanted to convert the MATLAB datenum values into Python dates using the following solution that I found:
python_datetime = datetime.fromordinal(int(matlab_datenum)) + timedelta(days=matlab_datenum%1) - timedelta(days = 366)
where matlab_datenum is, in my case, DF['date'] or DF['datesAvail'].
I get the error TypeError: cannot convert the series to <class 'int'>
Note that the data type is int:
Out[102]:
datesAvail int64
date int64
dtype: object
I am not sure where I am going wrong. Any help is much appreciated.
I am not sure what you are expecting as an output from this, but I assume it is a list?
The error is telling you exactly what is wrong: you are trying to convert a whole Series with int(). The only arguments int() can accept are strings, bytes-like objects or numbers.
When you call DF['date'] it gives you a Series, so each element needs to be converted individually, which means iterating over the whole Series. I would change it to a list first with DF['date'].tolist().
If you want the output as a list, you can use a list comprehension as shown here (sorry, this is long):
from datetime import datetime, timedelta
python_datetime_list = [datetime.fromordinal(int(i)) + timedelta(days=i%1) - timedelta(days = 366) for i in DF['date'].tolist()]
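A vectorized sketch of the same conversion, in case a datetime column is preferred over a list. It relies on the MATLAB datenum for 1970-01-01 being 719529, which follows from the formula above (Python's ordinal for that date is 719163, plus the 366-day offset). The names MATLAB_EPOCH_OFFSET and converted are my own.

```python
from datetime import datetime, timedelta

import pandas as pd

DF = pd.DataFrame({"date": [737272, 737273, 737278]})

# MATLAB datenum of 1970-01-01 (Python ordinal 719163 + the 366-day offset)
MATLAB_EPOCH_OFFSET = 719529

# Days since the Unix epoch, converted for the whole column at once
converted = pd.to_datetime(DF["date"] - MATLAB_EPOCH_OFFSET, unit="D")

# The per-element formula from the question, for comparison
looped = [
    datetime.fromordinal(int(i)) + timedelta(days=i % 1) - timedelta(days=366)
    for i in DF["date"].tolist()
]

print(converted)
```

Both approaches should agree; the vectorized one just keeps the result as a datetime64 Series that can be assigned back to the dataframe.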