Pandas dataframe column float inside string (i.e. "float") to int - string

I'm trying to clean some data in a pandas df and I want the 'volume' column to go from a float to an int.
EDIT: The main issue was that the dtype for the float variable I was looking at was actually a str. So first it needed to be floated, before being changed.
I deleted the two other solutions I was considering, and left the one I used. The top one is the one with the errors, and the bottom one is the solution.
import pandas as pd
import numpy as np
#Call the df
t_df = pd.DataFrame(client.get_info())
#isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]
def tick_data(tickers):
for i in tickers:
tick_df = pd.DataFrame(client.get_ticker())
tick = tick_df.loc[:, ['symbol', 'volume']]
tick.iloc[:,['volume']].astype(int)
if tick['volume'].dtype != np.number:
print('yes')
else:
print('no')
return tick
Below is the revised code:
import pandas as pd
#Call the df
def ticker():
t_df = pd.DataFrame(client.get_info())
#isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]
for i in tickers:
#pulls out market data for each symbol
tickers = pd.DataFrame(client.get_ticker())
#isolates the symbol and volume
tickers = tickers.loc[:, ['symbol', 'volume']]
#floats volume
tickers['volume'] = tickers.loc[:, ['volume']].astype(float)
#volume to int
tickers['volume'] = tickers.loc[:, ['volume']].astype(int)
#deletes all symbols > 20,000 in volume, returns only symbol
tickers = tickers.loc[tickers['volume'] >= 20000, 'symbol']
return tickers

You have a few issues here.
In your first example, iloc only accepts integer locations for the rows and columns in the DataFrame, which is generating your error. I.e.
tick.iloc[:,['volume']].astype(int)
doesn't work. If you want label-based indexing, use .loc:
tick.loc[:,['volume']].astype(int)
Alternately, use bracket-based indexing, which allows you to take a whole column directly without using slice syntax (:) on the rows:
tick['volume'].astype(int)
Next, astype(int) returns a new value, it does not modify in-place. So what you want is
tick['volume'] = tick['volume'].astype(int)
As for your dtype is a number check, you don't want to check == np.number, but you don't want to check is either, which only returns True if it's np.number and not if it's a subclass like np.int64. Use np.issubdtype, or pd.api.types.is_numeric_dtype, i.e.:
if np.issubdtype(tick['volume'].dtype, np.number):
or:
if pd.api.types.is_numeric_dtype(tick['volume'].dtype):

Related

Replace items like A2 as AA in the dataframe

I have a list of items, like "A2BCO6" and "ABC2O6". I want to replace them as A2BCO6--> AABCO6 and ABC2O6 --> ABCCO6. The number of items are much more than presented here.
My dataframe is like:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I create a duplicate array and tried to replace with following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i,j in range(len(B)), range(len(C)):
listAB["Finctional_Group"]= listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce correct output. The output is like:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
I used for simplicity purpose chemparse package that seems to suite your needs.
As always we import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
then we create a pandas.DataFrame object like in your example with your example data.
df = pd.DataFrame(
columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula which returns a dict of element and their frequency in a molecular formula.
def parse_molecule(molecule: str) -> dict:
# initializing empty string
molecule_in_string = ""
# iterating over all key & values in dict
for key, value in chemparse.parse_formula(molecule).items():
# appending number of elements to string
molecule_in_string += key * int(value)
return molecule_in_string
molecule_in_string contains the molecule formula without numbers now. We just need to map this function to all elements in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
0 BaBaNbFeOOOOOO
1 BaBaScIrOOOOOO
2 MnPbPbWOOOOOO
dtype: object
Source code for chemparse: https://gitlab.com/gmboyer/chemparse

Only import ascending values from CSV to List - Python

I'm trying to only import part of a CSV file to a list.
In short, the CSV I recveive contains two columns [depth and speed]. Depth always starts at zero, gets larger and then back to zero again.
I would like to add the first part of the CSV to the list (depth 0-13+). I then want to add the second part of the CSV (13-0) to another list.
I assume a for loop would be the way to go, but I don't know how to check each row for ascending/descending numbers.
pullData = open("svp3.csv","r").read()
dataArray = pullData.split('\n')
depthArrayY = []
speedArrayX = []
depthArrayLength = len(depthArrayY)
for eachLine in dataArray:
if len(eachLine)>1:
x,y = eachLine.split(',')
speedArrayX.append(round(float(x), 2))
depthArrayY.append(round(float(y), 2))
I'd suggest using Pandas, I think it will allow you for much more when you need to deal with imported data.
import pandas as pd
df = pd.read_csv('svp3.csv')
tmp = df[df.depth <= df.depth.shift(-1)].values
depth_increase = tmp[:,0]
speed_while_depth_increase = tmp[:,1]
tmp = df[df.depth > df.depth.shift(-1)].values
depth_decrease = tmp[:,0]
speed_while_depth_decrease = tmp[:,1]
I assumed that your CSV has the first the depth column then the speed column.
Depth column had values from 0 to a certain max value say 14, then from 13 to 0 depth column->[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,13,12,11,10,9,8,7,6,5,4,3,2,1]
and I populated speed column with some random values.
The following code makes use of pandas library and splits the column of depth into 2 lists of ascending and descending values using a simple logic of storing the current max value to determine when the ascending part of the column ends.
import pandas as pd
data = pd.read_csv('svp3.csv')
max_val = -10000
depthArrayAscendingY = []
speedArrayX = []
depthArrayDescendingY = []
for a in data.values:
if a[0]>max_val:
depthArrayAscendingY.append(a[0])
speedArrayX.append(a[1])
max_val = a[0]
else:
depthArrayDescendingY.append(a[0])
speedArrayX.append(a[1])
The answer to this question by Baleato is more efficient and cleaner than this answer, you should definitely check their answer.

How to save tuples output form for loop to DataFrame Python

I have some data 33k rows x 57 columns.
In some columns there is a data which I want to translate with dictionary.
I have done translation, but now I want to write back translated data to my data set.
I have problem with saving tuples output from for loop.
I am using tuples for creating good translation. .join and .append is not working in my case. I was trying in many case but without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
row["translated"] = (tuple(slownik.get(znak) for znak in row["1st_service"]))
I just want to see in print(data["1st_service"] a translated data not the previous one before for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is the row object you're trying to write to is only a view of the dataframe, it's not the dataframe itself. Plus you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have manage it, below working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is an faster way to do it?
I am doing it for 12 columns in one for loop, as shown up.

How to convert Excel negative value to Pandas negative value

I am a beginner in python pandas. I am working on a data-set named fortune_company. Data set are like below.
In this data-set for Profits_In_Million column there are some negative value which is indicating by red color and parenthesis.
but in pandas it's showing like below screenshot
I was trying to convert the data type Profits_In_Million column using below code
import pandas as pd
fortune.Profits_In_Million = fortune.Profits_In_Million.str.replace("$","").str.replace(",","").str.replace(")","").str.replace("(","-").str.strip()
fortune.Profits_In_Million.astype("float")
But I am getting the below error. Please someone help me one that. How I can convert this string datatype to float.
ValueError: could not convert string to float: '-'
Assuming you have no control over the cell format in Excel, the converters kwarg of read_excel can be used:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels, values are functions that take
one input argument, the Excel cell content, and return the transformed
content.
From read_excel's docs.
def negative_converter(x):
# a somewhat naive implementation
if '(' in x:
x = '-' + x.strip('()')
return x
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 $1000
# 1 -$1000
Note however that the values of this column are still strings and not numbers (int/float). You can quite easily implement the conversion in negative_converter (remove the the dollar sign, and most probably the comma as well), for example:
def negative_converter(x):
# a somewhat naive implementation
x = x.replace('$', '')
if '(' in x:
x = '-' + x.strip('()')
return float(x)
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 1000.0
# 1 -1000.0

Store the value from pandas dataframe without index or header

I am trying to get the values from a CSV file using python and pandas. To avoid index and headers i am using .values with iloc but as an output my value is stored in [] brackets. I dont want them i just need the value. I dont want to print it but want to use it for other operations.
My code is :
import pandas as pd
ctr_x = []
ctr_y = []
tl_list = []
br_list = []
object_list = []
img = None
obj = 'red_hat'
df = pd.read_csv('ring_1_05_sam.csv')
ctr_x = df.iloc[10:12, 0:1].values #to avoid headers and index
ctr_y = df.iloc[10:12, 1:2].values #to avoid headers and index
ctr_x =[]
ctr_y =[]
If i print the result of ctr_x and ctr_y to check if correct values are recorded
The output i get is :
[[1536.25]
[1536.5 ]]
[[895.25]
[896. ]]
So i short i am getting the correct values but i don't want the brackets. Can anyone please suggest any other alternatives to my method. Note : I dont want to print the values but store it(without index and headers) for further operations
When you use column slice, pandas returns a Dataframe. Try
type(df.iloc[10:12, 0:1])
pandas.core.frame.DataFrame
This in turn will return a 2-D array when you use
df.iloc[10:12, 0:1].values
If you want a 1 dimensional array, you can use integer indexing which will return a Series,
type(df.iloc[10:12, 0])
pandas.core.series.Series
And a one dimensional array,
df.iloc[10:12, 0].values
So use
ctr_x = df.iloc[10:12, 0].values
ctr_y = df.iloc[10:12, 1].values

Resources