Using pandas to manipulate number format - python-3.x

Just out of curiosity: I have a list of names with phone numbers in a CSV file, and I want to change the phone numbers from ########### (11 digits) to the format ###-####-####, adding a hyphen after the 3rd digit and another after the 7th.
Is this possible?

If it's a DataFrame, you can use apply with a format string:
df
           num
0  09187543839
1  08745763412
df.num = df.num.apply(lambda x: "{}-{}-{}".format(x[:3], x[3:7], x[7:]))
df
             num
0  091-8754-3839
1  087-4576-3412
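One caveat with the slicing approach is that it needs the column to hold strings: if the CSV is read with default settings, numbers such as 09187543839 are parsed as integers and lose their leading zero. A minimal sketch of one way around this, assuming the column is called num as in the answer above (the CSV text and the name column here are made up for illustration), is to read the column as text and insert the hyphens with a regular expression:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the asker's file
csv_text = "name,num\nAlice,09187543839\nBob,08745763412\n"

# dtype=str keeps the leading zeros that integer parsing would drop
df = pd.read_csv(io.StringIO(csv_text), dtype={"num": str})

# Capture groups of 3, 4 and 4 digits, then join them with hyphens
df["num"] = df["num"].str.replace(
    r"^(\d{3})(\d{4})(\d{4})$", r"\1-\2-\3", regex=True
)
print(df)
```

Values that do not consist of exactly 11 digits are left unchanged by this pattern, which makes malformed rows easy to spot afterwards.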

Yes, it is possible. Below is a code snippet that does what you want:
phone = str(55512354567)
print(f'{phone[:3]}-{phone[3:7]}-{phone[7:]}')
You can adapt the same idea to your pandas DataFrame as shown below:
import pandas as pd
# Sample data
data_df = pd.DataFrame([[55512345678], [55587654321]], columns=['phone'])
# Create a string column
data_df['phone_str'] = data_df['phone'].astype(str)
# Convert the column values to the right format
data_df['phone_str'] = data_df['phone_str'].map(lambda x: f'{x[:3]}-{x[3:7]}-{x[7:]}')

This does not use pandas, but plain string slicing could work as well:
n = 3
n1 = 7
s = "12345678901"  # named s rather than str, which would shadow the built-in
l, m, r = s[:n], s[n:n1], s[n1:]
final = l + "-" + m + "-" + r
print(final)
Output:
123-4567-8901
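The slicing idea above can be wrapped in a small reusable function. This is only a sketch: the name format_phone and the decision to raise an error on anything other than 11 digits are assumptions made for illustration, not part of the original answer.

```python
def format_phone(number):
    """Return an 11-digit number formatted as ###-####-####."""
    digits = str(number)
    # Reject input that slicing would silently format incorrectly
    if len(digits) != 11 or not digits.isdigit():
        raise ValueError(f"expected 11 digits, got {digits!r}")
    return f"{digits[:3]}-{digits[3:7]}-{digits[7:]}"


formatted = format_phone("12345678901")
print(formatted)
```

The length check matters mostly for numbers that were stored as integers and lost a leading zero, since those would otherwise be split in the wrong places.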

Related

Replace items like A2 as AA in the dataframe

I have a list of items, like "A2BCO6" and "ABC2O6". I want to replace them as A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. There are many more items than shown here.
My dataframe looks like this:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I created a second array with the replacements and tried to replace in the following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i,j in range(len(B)), range(len(C)):
    listAB["Finctional_Group"]= listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce the correct output. The output looks like this:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
For simplicity I used the chemparse package, which seems to suit your needs.
As always, we first import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
    columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function uses chemparse.parse_formula, which returns a dict mapping each element to its count in a molecular formula.
def parse_molecule(molecule: str) -> str:
    # start with an empty string
    molecule_in_string = ""
    # iterate over all element/count pairs in the dict
    for key, value in chemparse.parse_formula(molecule).items():
        # repeat the element symbol as many times as it occurs
        molecule_in_string += key * int(value)
    return molecule_in_string
The returned molecule_in_string now contains the molecular formula without numbers. We just need to map this function over all elements of our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
0 BaBaNbFeOOOOOO
1 BaBaScIrOOOOOO
2 MnPbPbWOOOOOO
dtype: object
Source code for chemparse: https://gitlab.com/gmboyer/chemparse
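If only the listed replacements are wanted (Ba2 to BaBa, Pb2 to PbPb) and no extra package, the loop from the question can be repaired directly. The expression `for i, j in range(len(B)), range(len(C))` iterates over a tuple of two range objects and unpacks each range into i=0, j=1, so it replaces B[0] with C[1] twice, which explains the PbPb output. A minimal sketch that pairs the lists with zip instead, using the sample data from the question:

```python
import pandas as pd

listAB = pd.DataFrame(
    {"Finctional_Group": ["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]}
)

B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]

# zip pairs B[0] with C[0], B[1] with C[1], and so on
for old, new in zip(B, C):
    listAB["Finctional_Group"] = (
        listAB["Finctional_Group"]
        .str.strip()
        .str.replace(old, new, regex=False)  # plain substring replacement
    )

print(listAB)
```

Unlike the chemparse approach, this leaves counts that are not in the list (such as O6) untouched.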

How to create a dataframe from extracted hashtags?

I used the code below to extract hashtags from tweets.
def find_tags(row_string):
    tags = [x for x in row_string if x.startswith('#')]
    return tags
df['split'] = df['text'].str.split(' ')
df['hashtags'] = df['split'].apply(lambda row : find_tags(row))
df['hashtags'] = df['hashtags'].apply(lambda x : str(x).replace('\\n', ',').replace('\\', '').replace("'", ""))
df.drop('split', axis=1, inplace=True)
df
However, when I count them using the code below, the output counts each character instead of each hashtag.
from collections import Counter
d = Counter(df.hashtags.sum())
data = pd.DataFrame([d]).T
data
Output I am getting is:
I think the problem lies with the code that I am using to extract hashtags. But I don't know how to solve this issue.
Change find_tags so that it splits the text itself and applies the replace calls inside the list comprehension, which keeps each row a list instead of a string. To count the values, use Series.explode with Series.value_counts:
def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]
df['hashtags'] = df['text'].apply(find_tags)
and then:
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
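To see the explode and value_counts combination on concrete data, here is a small self-contained sketch. The tweets are made up, and it uses Series.str.findall with the pattern `#\w+` rather than find_tags, which is a swapped-in technique that also drops trailing punctuation from a tag:

```python
import pandas as pd

# Hypothetical tweets for illustration
df = pd.DataFrame(
    {"text": ["Loving #python, and #pandas!", "More #python today", "no tags here"]}
)

# Each row becomes a list of hashtags (possibly empty)
df["hashtags"] = df["text"].str.findall(r"#\w+")

# explode gives one row per hashtag; empty lists become NaN,
# which value_counts ignores by default
counts = df["hashtags"].explode().value_counts()
print(counts)
```

Because every row stays a list, no step concatenates strings, which is what made the original Counter count single characters.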

What is the best way to concatenate two float64 columns without converting to str in python?

I'm still a Python beginner, and the project I'm working on right now requires me to combine two float columns into a datetime.
The data looks something like this
Date = [20191219, 20190812]
Time = [31547, 121652]
What's the best way to convert this to datetime format?
I tried to join both columns using dat['Date'].map(str) + '' + dat['Time'].map(str), but the values come back with .0 appended, i.e. 20191219.0 and 31547.0, so datetime does not recognize this format.
Convert both columns to datetimes separately, then add them together after subtracting the helper date that gets attached to the times when they are converted:
Date = [20191219.0, 20190812]
Time = [31547, 121652.0]
dat = pd.DataFrame({'Date':Date,'Time':Time})
d = pd.to_datetime(dat['Date'], format='%Y%m%d')
t = pd.to_datetime(dat['Time'], format='%H%M%S')
dat['date'] = d + (t - t.dt.floor('d'))
print (dat)
         Date      Time                date
0  20191219.0   31547.0 2019-12-19 03:15:47
1  20190812.0  121652.0 2019-08-12 12:16:52
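Another solution is to strip a trailing .0 from the string values (note the raw string and regex=True, since newer pandas versions treat the pattern as a literal by default):
s = (dat['Date'].astype(str).str.replace(r'\.0$', '', regex=True) + ' ' +
     dat['Time'].astype(str).str.replace(r'\.0$', '', regex=True))
dat['date'] = pd.to_datetime(s, format='%Y%m%d %H%M%S')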
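A variation on the string approach, sketched here under the assumption that neither column contains missing values, is to cast the floats to integers first and zero-pad the time to six digits. That way a time such as 31547 is unambiguously read as 03:15:47, and no .0 suffix ever appears:

```python
import pandas as pd

dat = pd.DataFrame(
    {"Date": [20191219.0, 20190812.0], "Time": [31547.0, 121652.0]}
)

# Casting to int removes the fractional part before converting to text
date_str = dat["Date"].astype(int).astype(str)

# Pad times to HHMMSS so 31547 becomes 031547
time_str = dat["Time"].astype(int).astype(str).str.zfill(6)

dat["date"] = pd.to_datetime(date_str + time_str, format="%Y%m%d%H%M%S")
print(dat)
```

If the columns can contain NaN, the astype(int) step would fail, so those rows would need to be handled before the cast.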

Question on calculating incoming data from file

I am reading a data file and need to calculate running totals for the different items by adding up their values across lines. For example:
Fruit,Number
banana,25
apple,12
kiwi,29
apple,44
apple,81
kiwi,3
banana,109
kiwi,113
kiwi,68
We need to add a third column holding the running total for that fruit, and a fourth holding the running total of all the fruits.
So the output should look like the following:
Fruit,Number,TotalFruit,TotalAllFruits
banana,25,25,25
apple,12,12,37
kiwi,29,29,66
apple,44,56,110
apple,81,137,191
kiwi,3,32,194
banana,109,134,303
kiwi,113,145,416
kiwi,68,213,484
I was able to print the first two columns, but I am having a problem with the last two:
import sys
import re
f1 = open("SampleInput.csv", "r")
f2 = open('SampleOutput.csv', 'a')
sys.stdout = f2
print("Fruit,Number,TotalFruit,TotalAllFruits")
for line1 in f1:
    fruit_list = line1.split(',')
    exec("%s = %d" % (fruit_list[1], 0))
    print(fruit_list[0] + ',' + fruit_list[1])
I am just learning Python, so I apologize in advance if I am missing something very simple.
You need a 2D array to hold the values read from the input file.
Inside the loop, look up the totals from the previous lines and use them to calculate the values for the current line.
Then print the 2D array after all input lines have been read.
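The steps above can be sketched in plain Python as follows. Instead of scanning a 2D array for the previous totals, this version keeps a dictionary of per-fruit running totals plus one grand total, which is a small change of technique; the function name add_running_totals is made up for illustration, and the sample text mirrors the data from the question.

```python
import csv
import io
from collections import defaultdict

sample = """Fruit,Number
banana,25
apple,12
kiwi,29
apple,44
apple,81
kiwi,3
banana,109
kiwi,113
kiwi,68
"""


def add_running_totals(lines):
    """Return rows of (fruit, number, total for that fruit, total of all)."""
    per_fruit = defaultdict(int)  # running total for each fruit seen so far
    grand_total = 0               # running total over all rows
    rows = []
    for row in csv.DictReader(lines):
        number = int(row["Number"])
        per_fruit[row["Fruit"]] += number
        grand_total += number
        rows.append((row["Fruit"], number, per_fruit[row["Fruit"]], grand_total))
    return rows


rows = add_running_totals(io.StringIO(sample))
print("Fruit,Number,TotalFruit,TotalAllFruits")
for fruit, number, fruit_total, all_total in rows:
    print(f"{fruit},{number},{fruit_total},{all_total}")
```

To use it on the real file, an open file object (for example from open("SampleInput.csv", newline="")) can be passed in place of the StringIO object.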
I would recommend using the pandas library, as it makes this easier:
import pandas as pd
df1 = pd.read_csv("SampleInput.csv", sep=",")
df2 = df1[['Fruit', 'Number']].copy()
# running total per fruit; change these lines to whatever you need
df2['TotalFruit'] = df1.groupby('Fruit')['Number'].cumsum()
# running total over all rows
df2['TotalAllFruits'] = df1['Number'].cumsum()
df2.to_csv('SampleOutput.csv', sep=",", index=False)
df2 format:
Fruit | Number | TotalFruit | TotalAllFruits
Feel free to change the columns to suit your needs and add your own logic.

How to convert Excel negative value to Pandas negative value

I am a beginner with pandas. I am working on a dataset named fortune_company. The dataset looks like the one below.
In this dataset, the Profits_In_Million column contains some negative values, which are indicated by red color and parentheses.
But in pandas they show up as in the screenshot below.
I tried to convert the data type of the Profits_In_Million column using the code below:
import pandas as pd
fortune.Profits_In_Million = fortune.Profits_In_Million.str.replace("$","").str.replace(",","").str.replace(")","").str.replace("(","-").str.strip()
fortune.Profits_In_Million.astype("float")
But I am getting the error below. Could someone please help me with this? How can I convert this string column to float?
ValueError: could not convert string to float: '-'
Assuming you have no control over the cell format in Excel, the converters kwarg of read_excel can be used:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels, values are functions that take
one input argument, the Excel cell content, and return the transformed
content.
From read_excel's docs.
def negative_converter(x):
    # a somewhat naive implementation
    if '(' in x:
        x = '-' + x.strip('()')
    return x
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 $1000
# 1 -$1000
Note, however, that the values of this column are still strings and not numbers (int/float). You can quite easily implement the conversion in negative_converter (remove the dollar sign, and most probably the comma as well), for example:
def negative_converter(x):
    # a somewhat naive implementation
    x = x.replace('$', '').replace(',', '')
    if '(' in x:
        x = '-' + x.strip('()')
    return float(x)
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 1000.0
# 1 -1000.0
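If the data is already loaded as strings, the same cleanup can be done on the column itself, close to what the question attempted. The sketch below uses made-up values and assumes the cells look like $1,000 or ($2,500.50). It strips the dollar sign, commas and closing parenthesis with one character-class pattern, turns the opening parenthesis into a minus sign, and uses pd.to_numeric with errors="coerce" so that a leftover value such as a lone "-" becomes NaN instead of raising the ValueError from the question:

```python
import pandas as pd

# Hypothetical values in the style of the Profits_In_Million column
profits = pd.Series(["$1,000", "($2,500.50)", "$0", "-"])

cleaned = (
    profits.str.strip()
    # inside [...] the characters $ , ) are matched literally
    .str.replace(r"[$,)]", "", regex=True)
    # an opening parenthesis marks a negative amount
    .str.replace("(", "-", regex=False)
)

# Anything that still is not a number (such as "-") becomes NaN
profits_float = pd.to_numeric(cleaned, errors="coerce")
print(profits_float)
```

Passing regex=False or regex=True explicitly avoids the ambiguity in the original chain, where characters such as $ and ( have a special meaning when interpreted as a regular expression.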
