Unable to convert pandas str column with .0 to int [duplicate] - python-3.x

This question already has answers here:
Change column type in pandas
(16 answers)
Remove decimals fom pandas column(String type)
(2 answers)
Closed 8 months ago.
Let's say I have a dataframe that looks like this...
df = pd.DataFrame({ 'foo': ['1', '2', '3']})
foo
1
2
3
And I want to convert the column to type int. I can do that easily with...
df = df.astype({ 'foo': 'int' })
But if my dataframe looks like this...
df = pd.DataFrame({ 'foo': ['1.0', '2.0', '3.0']})
foo
1.0
2.0
3.0
And I try to convert it from object to int, I get this error:
ValueError: invalid literal for int() with base 10: '1.0'
Why doesn't this work? How would I go about converting this to an int properly?

Python's int() does not accept a string that contains a decimal point, so just do a two-step conversion: string to float, then float to int.
>>> df.astype({ 'foo': 'float' }).astype({ 'foo': 'int' })
foo
0 1
1 2
2 3
It works with or without the decimal point.
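The same behaviour can be reproduced in plain Python, without pandas. The following is a minimal sketch that assumes every string represents a number; the helper name to_int is made up for illustration.

```python
def to_int(text):
    """Convert a numeric string such as '3' or '3.0' to int by going through float."""
    return int(float(text))


try:
    int("1.0")  # int() only parses integer literals, so this raises
except ValueError as exc:
    error_message = str(exc)

values = [to_int(s) for s in ["1", "2.0", "3.0"]]

print(error_message)
print(values)
```

Note that going through float truncates any fractional part, so to_int("2.7") gives 2 rather than raising an error.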

You can use the downcast option of the pd.to_numeric function. It picks the smallest integer dtype that can hold the values.
df['foo'] = pd.to_numeric(df['foo'], downcast='integer')
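As a rough comparison of the two approaches, the sketch below applies both to the sample column and looks at the resulting dtypes. The variable names two_step, downcast and mixed are illustrative only, and the exact dtype of the two-step result depends on the platform's default integer.

```python
import pandas as pd

df = pd.DataFrame({"foo": ["1.0", "2.0", "3.0"]})

# Two-step conversion: string -> float -> platform default integer.
two_step = df["foo"].astype(float).astype(int)

# to_numeric with downcast: smallest integer dtype that fits the values.
downcast = pd.to_numeric(df["foo"], downcast="integer")

# If any value is not a whole number, the downcast is skipped and floats remain.
mixed = pd.to_numeric(pd.Series(["1.0", "2.5"]), downcast="integer")

print(two_step.dtype, downcast.dtype, mixed.dtype)
```

If a specific width such as int64 is required, the downcast result can still be followed by an explicit astype call.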

Related

Formatted String Literals

Why can one formatted string literal be displayed without print() but the other cannot?
>>> price = 11.23
>>> f"Price in Euro: {price}"
>>> for article in ["bread", "butter", "tea"]:
...     print(f"{article:>10}:")
The interactive interpreter echoes the value of an expression entered at the prompt. A format string is an expression and thus has a value, but a for loop is a statement: it has no value of its own to display, which is why the loop body calls print() explicitly.
>>> 3 # expression
3
>>> a = 3
>>> a # expression
3
>>> a + 4 # expression
7
>>> if a: # statement, no value to echo
...     pass
...
>>> f"{a}" # expression
'3'
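The echo behaviour can also be observed from a script. This is a small sketch, assuming CPython's default display hook; it compiles each snippet in "single" mode, which is the mode the interactive prompt uses, and captures what gets written. The helper name echo_of is hypothetical.

```python
import contextlib
import io


def echo_of(source):
    """Run one statement as the interactive prompt would and return what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # "single" mode echoes the value of expression statements, like the REPL.
        exec(compile(source, "<demo>", "single"), {"price": 11.23})
    return buffer.getvalue()


expression_echo = echo_of('f"Price in Euro: {price}"')  # expression: repr is echoed
assignment_echo = echo_of("total = price * 2")          # statement: nothing echoed
loop_echo = echo_of('for article in ["bread", "tea"]:\n    print(f"{article:>10}:")\n')

print(repr(expression_echo))
print(repr(assignment_echo))
print(repr(loop_echo))
```

The first capture contains the quoted repr of the string, the second is empty, and the third contains only what print() wrote.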
It's pretty unclear what you want to do with your code, but this should work:
for article in ['bread', 'butter']:
    print(f"{article}")
Can you please properly format your code and edit your question?

How to convert list to strings in python? [duplicate]

This question already has answers here:
How to convert list to string [duplicate]
(3 answers)
Closed 3 years ago.
I have this list:
x = ['nm0000131', 'nm0000432', 'nm0000163']
And I would like to convert it to:
'nm0000131',
'nm0000432',
'nm0000163'
i.e., I would like to convert a list of strings (x) into 3 independent strings.
If you want three separate strings, you can use a for loop.
Try the following code:
x = ['nm0000131', 'nm0000432', 'nm0000163']
for value in x:
    print(value)
The output will be:
nm0000131
nm0000432
nm0000163
The following code will display an output like "nm0000131" ,"nm0000432" ,"nm0000163":
x = ['nm0000131', 'nm0000432', 'nm0000163']
str1 = '" ,"'.join(x)
x = '"'+str1+'"'
print(x)
Following up on your comment, here are some more points.
If you want to collect the values of a dictionary, try the following code.
y = {'131': 'a', '432': 'b', '163': 'c'}
w = []
for key, value in y.items():
    w.append(value)
print(w)
Output (on Python 3.7+, where dictionaries keep insertion order):
['a', 'b', 'c']
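If the goal is literally three independent variables, or the quoted one-per-line display from the question, a short sketch along these lines may help. The names first, second, third and quoted are illustrative, and the unpacking assumes the list holds exactly three items.

```python
x = ['nm0000131', 'nm0000432', 'nm0000163']

# Sequence unpacking binds each element to its own name (needs exactly three items).
first, second, third = x

# repr() adds the quotes; join places a comma and a newline between the items.
quoted = ",\n".join(repr(item) for item in x)

print(first, second, third)
print(quoted)
```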

Problem converting python list to np.array. Process is dropping string type data

My goal is to convert this list of strings to a NumPy array.
I want to convert the first 2 columns to numerical data (integer).
import numpy as np
list1 = [['380850', '625105', 'Dota 2'],
         ['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
         ['204354', '467109', 'Counter-Strike: Global Offensive']]
dt = np.dtype('i,i,U')
cast_array = np.array([tuple(row) for row in list1], dtype=dt)
print(cast_array)
The result is ...
[OUT] [(380850, 625105, '') (354804, 846193, '') (204354, 467109, '')]
I am losing the string data. I am interested in:
Understanding why the string data is getting dropped.
Finding any solution that converts the first 2 columns to integer type in a numpy array.
This answer gave me the approach but doesn't seem to work for strings.
Thanks to user 9769953's comment above, this is the solution.
# When specifying a string field you need to give its length (derived here from the longest string in the list)
dtypestr = 'int, int, U' + str(max([len(i[2]) for i in list1]))
cast_array = np.array([tuple(row) for row in list1], dtype=dtypestr)
print(cast_array)
The simplest high-level way to do that is to use pandas, as said in the comments, which silently handles these tricky problems:
In [64]: df=pd.DataFrame(list1)
In [65]: df2=df.apply(pd.to_numeric,errors='ignore')
In [66]: df2
Out[66]:
        0       1                                 2
0  380850  625105                            Dota 2
1  354804  846193     PLAYERUNKNOWN'S BATTLEGROUNDS
2  204354  467109  Counter-Strike: Global Offensive
In [67]: df2.dtypes
Out[67]:
0 int64
1 int64
2 object
dtype: object
df2.iloc[:,:2].values gives the numeric part as a NumPy array, so you can use all of NumPy's fast operations on it.
Your dtype is not what you expect it to be - you're running into https://github.com/numpy/numpy/issues/8969:
>>> dt = np.dtype('i,i,U')
>>> dt
dtype([('f0', '<i4'), ('f1', '<i4'), ('f2', '<U')])
>>> dt['f2'].itemsize
0 # 0-length strings!
You need to either specify a maximum number of characters (longer strings are truncated to that length):
>>> dt = np.dtype('i,i,U16')
Or use an object type to store variable length strings:
>>> dt = np.dtype('i,i,O')
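Putting the fixed-width option together with the data from the question gives something like the sketch below. The field names a, b and name are made up for illustration, and the integer conversion is done explicitly rather than left to NumPy.

```python
import numpy as np

rows = [['380850', '625105', 'Dota 2'],
        ['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
        ['204354', '467109', 'Counter-Strike: Global Offensive']]

# Size the string field from the longest title so nothing is truncated.
width = max(len(row[2]) for row in rows)
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('name', f'U{width}')])

# Structured arrays are filled from tuples, one tuple per record.
records = [(int(a), int(b), name) for a, b, name in rows]
cast_array = np.array(records, dtype=dt)

print(cast_array['a'])
print(cast_array['name'])
```

Each field can then be accessed by name, and the two integer fields behave like ordinary integer arrays.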

Convert string to list of mixed float, int and str knowing the format in Python [duplicate]

This question already has answers here:
Use Python format string in reverse for parsing
(5 answers)
Closed 5 years ago.
Consider a simple string like '1.5 1e+05 test 4'. The format used in Python 2.7 to generate that string is '%f %e %s %i'. I want to retrieve a list of the form [1.5,100000,'test',4] from my input string knowing the string formatter. How can I do that (Python 2.7 or Python 3)?
Thanks a lot,
Ch.
Use the parse module. This module can do 'reverse formatting'. Example:
from parse import parse
format_str = '{:1f} {:.2e} {} {:d}'
data = [1.5, 100000, 'test', 4]
data_str = format_str.format(*data)
print(data_str) # Output: 1.500000 1.00e+05 test 4
parsed = parse(format_str, data_str)
print(parsed) # Output: <Result (1.5, 100000.0, 'test', 4) {}>
a, b, c, d = parsed # Whatever
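If installing a third-party module is not an option, a standard-library-only sketch is possible when the fields are separated by whitespace and none of them contains a space. The helper parse_fields and the converter list are assumptions made for illustration; they are not part of the parse module.

```python
def parse_fields(text, converters):
    """Split text on whitespace and apply one converter per field."""
    parts = text.split()
    if len(parts) != len(converters):
        raise ValueError(f"expected {len(converters)} fields, got {len(parts)}")
    return [convert(part) for convert, part in zip(converters, parts)]


# One converter per placeholder of '%f %e %s %i'.
result = parse_fields('1.5 1e+05 test 4', [float, float, str, int])
print(result)
```

The %e field comes back as a float (100000.0); wrap it in int() afterwards if an integer is wanted.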

Minimize float format in Pandas df.to_csv()

For large datasets, I would like to encode floats minimally when writing the CSV.
0.0 or 1.0 should be written 0 or 1
1.234567 should be written 1.235
123.0 should be written 123
DataFrame.to_csv() allows a float_format, but that makes every float look the same, which doesn't save space when writing integers.
You could do something hacky like this:
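def to_str(item):
    # Format numbers compactly; leave everything else untouched.
    # np.int was removed in NumPy 1.24, so test against the numeric base types instead.
    if isinstance(item, (int, float, np.integer, np.floating)):
        return '{:g}'.format(item)
    else:
        return item
# In pandas 2.1+, DataFrame.applymap is deprecated in favor of DataFrame.map.
pd.DataFrame({'int': [1, 2], 'float': [1.03, 1.0], 'str': ['a', 'b']}).applymap(to_str)
which returns
  int float str
0   1  1.03   a
1   2     1   b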
If that's too slow, you can also skip the type checking and just apply the string conversion to columns matching the numeric type.
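To check the formatting against the examples in the question, here is a small sketch in plain Python. It assumes four significant digits are wanted (the default '{:g}' keeps six, which would give 1.23457 rather than 1.235); the helper name minimal and the digits parameter are illustrative.

```python
def minimal(value, digits=4):
    """Format a float with at most `digits` significant digits and no trailing zeros."""
    return format(value, f".{digits}g")


samples = [0.0, 1.0, 1.234567, 123.0]
written = [minimal(v) for v in samples]
print(written)

# Values with more integer digits than `digits` switch to exponent notation.
large = minimal(12345.678)
print(large)
```

The switch to exponent notation for large values is a property of the g format, so a larger digits value may be needed if the data contains big numbers.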
