Is there any difference between using/not using "astype(np.float)" for an ndarray? - python-3.x

I'm importing a txt file that contains numbers only, for some coding practice.
I noticed that I get the same result with either code_1 or code_2:
code_1 = np.array(pd.read_csv('e:/data.txt', sep='\t', header=None)).astype(np.float)
code_2 = np.array(pd.read_csv('e:/data.txt', sep='\t', header=None))
So I wonder: is there any difference between using or not using .astype(np.float)?
Please point me to a similar question if one exists. Thanks a lot.

The DataFrame.astype() method is used to cast a pandas object to a specified dtype. astype() also provides the ability to convert any suitable existing column to a categorical type.
DataFrame.astype() comes in very handy when we want to cast a particular column's data type to another data type.
In your case, the file is loaded as a DataFrame. The numbers will be loaded as integers or floats, depending on the numbers. astype(np.float) converts the numbers to floats; if the numbers are already floats, then, as you saw, there is no difference between the two. (As an aside, the np.float alias is deprecated and has been removed in recent NumPy versions; use plain float or np.float64 instead.)
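A minimal sketch of that point, using StringIO in place of the e:/data.txt file (and plain float, since the np.float alias has been removed from recent NumPy):

```python
from io import StringIO
import pandas as pd

# Integer-only text loads as int64; astype(float) then changes the dtype.
ints = pd.read_csv(StringIO("1\t2\n3\t4"), sep="\t", header=None)
print(ints.dtypes[0])                # int64
print(ints.astype(float).dtypes[0])  # float64

# Text that already contains decimals loads as float64, so astype is a no-op.
floats = pd.read_csv(StringIO("1.5\t2.0\n3.25\t4.0"), sep="\t", header=None)
print(floats.dtypes[0])              # float64
```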

Related

How is this error possible and what can be done about it? "ValueError: invalid literal for int() with base 10: '1.0'"

I'm using Python 3 with the pandas library and some other data science libraries. I ran into a variety of subtle type errors while just trying to compare values across two columns that should contain like integer values in a single pandas DataFrame (although the interpreter variously treats the values as float, string, or Series, seemingly almost at random). Now I'm running into this inexplicable-seeming error while attempting to cast back to integer, after converting the values to string to strip out blank spaces that were introduced much further upstream in the program flow (presumably by pandas' internal processing, because my code tries to keep the type int throughout).
ValueError: invalid literal for int() with base 10: '1.0'
The main problem I have with this error message is that there should be no reason a type conversion to int should ever blow up on the value '1.0'. Just taking the error message at face value, it makes no sense to me and seems like a deeper problem or bug in pandas.
But ignoring more fundamental problems or bugs in Python or pandas, any help resolving this in a generalizable way that will play nice consistently in every reasonable scenario (behaving more like strongly-typed, type-safe code, basically) would be appreciated.
Here's the bit of code where I'm trying to deal with all the various type-conversion and blank-value issues I've bumped into at once. I've gone round and round on this a few times in subtly different scenarios, and every time I thought I'd finally bullet-proofed this bit of code and gotten it working as intended in every case, some new unexpected type-conversion issue like this crops up.
df[getVariableLabel(outvar)] = df[getVariableLabel(outvar)].astype(str).str.strip()
df['prediction'] = df['prediction'].astype(str).str.strip()
actual = np.array(df[getVariableLabel(outvar)].fillna(-1).astype(int))
# this is the specific line that throws the error
predicted = np.array(df['prediction'].fillna(-1).astype(int))
For further context on the code above, the "df" object is a pandas dataframe passed in by parameter. "getVariableLabel" is a helper function used to format a dynamic field name. Both columns contain simple "1" and "0" values, except where there may be nAn/blanks (which I'm attempting to fill with dummy values).
It doesn't really have to be a conversion to int for my needs. String values would be fine too, if it were possible to keep pandas/Python from arbitrarily treating one Series as ints and the other as floats before the conversion to string, which makes value comparisons between the two sets of values fail.
Here's the bit of the call stack dump where pandas is throwing the error, for further context:
File "C:\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py",
line 874, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas_libs\lib.pyx", line 560, in
pandas._libs.lib.astype_intsafe
Solved it for myself with the following substitution, in case anyone else runs into this. It may also have helped that I updated pandas from 1.0.1 to 1.0.2, since that update did include some type-conversion bug fixes, but more likely it was this workaround (where pd is of course the alias for the pandas library):
df[getVariableLabel(outvar)] = pd.to_numeric(df[getVariableLabel(outvar)])
df['prediction'] = pd.to_numeric(df['prediction'])
The original value error message is still confusing and seems like a bug but this worked in my particular case.
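A minimal sketch of why the original error is legitimate, and why the to_numeric workaround resolves it (using a toy Series standing in for the real columns):

```python
import pandas as pd

# Mixed string column like the one described: "1"/"0" values plus a blank
s = pd.Series(["1.0", "0", ""])

# astype(int) on strings roughly amounts to calling int() on each one,
# and int() cannot parse a decimal string:
try:
    int("1.0")
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: '1.0'

# pd.to_numeric parses "1.0" as a float; errors="coerce" turns blanks into
# NaN, which can then be filled with a dummy value and safely cast to int.
clean = pd.to_numeric(s, errors="coerce").fillna(-1).astype(int)
print(clean.tolist())  # [1, 0, -1]
```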

Is there any difference between using Dataframe.columns and Dataframe.keys() to obtain the column names?

For the sake of curiosity is there any practical difference between getting the column names of a DataFrame (let's say df) by using df.columns or df.keys()?
I've checked the outputs by type and they seem to be exactly the same. Am I missing something, or are these two methods just as redundant as they seem? Is one more appropriate to use than the other?
Thanks.
One difference I noticed: you can use .keys() with a Series, but you cannot use .columns with a Series.
Doesn't look like there's a practical difference and if there is, I'd really like to know what it is. You probably saw in the documentation that DataFrame.columns has the column labels and it is an axis property and DataFrame.keys gets the info axis. I would think that since the former is an attribute or reference and the latter a callable method, the method takes a little more time to execute. I have not tested this but I'm pretty sure that, even if there's a difference, it is not significant. Also they both return the same type:
>>> type(data.columns)
<class 'pandas.core.indexes.base.Index'>
>>> type(data.keys())
<class 'pandas.core.indexes.base.Index'>
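A quick sketch of both points, using a toy DataFrame and Series:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# For a DataFrame, both return the same Index of column labels.
print(df.columns.equals(df.keys()))  # True

# A Series has .keys() (an alias for its index) but no .columns attribute.
s = pd.Series([1, 2], index=["x", "y"])
print(list(s.keys()))                # ['x', 'y']
print(hasattr(s, "columns"))         # False
```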

Defining dtypes at time of import of a tab-delimited file into a dataframe

As some data are ambiguous (e.g. customer numbers that should be interpreted as strings, not integers), I am using the dtype option (pd.read_table('BSC.csv', dtype=str)).
It works fine, as pandas no longer complains about ambiguous types.
Nevertheless, when I stored the DataFrame in an HDFStore, I got a warning that using untyped objects will result in a performance loss. I looked at my DataFrame with .dtypes and saw that all types had gone back to 'object'.
I looked at the pandas.read_table docs, but I did not find any setting that would freeze the type to string after the import. Does that mean the only option is an .apply(to_string) step just before storing the DataFrame?
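No answer is recorded here, but note that the dtype=str option is working as designed: pandas stores Python strings in object arrays, so string columns report dtype 'object'; the types haven't silently moved back to something else. A minimal sketch with a hypothetical stand-in for BSC.csv (pandas 1.0+ also offers a dedicated "string" dtype that is more explicit):

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for BSC.csv: customer numbers with leading zeros
data = StringIO("customer\tamount\n00042\t3.5\n00007\t1.2")

df = pd.read_table(data, dtype=str)
print(df["customer"].tolist())   # ['00042', '00007'] - leading zeros preserved
print(df.dtypes["customer"])     # object: str columns are stored as generic objects

# pandas >= 1.0: opt in to the dedicated string dtype
df2 = df.astype({"customer": "string"})
print(df2.dtypes["customer"])    # string
```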

Python float() limitation on scientific notation

python 3.6.5
numpy 1.14.3
scipy 1.0.1
cerberus 1.2
I'm trying to convert the string '6.1e-7' to the float 0.00000061 so I can save it in a MongoDB field.
My problem here is that float('6.1e-7') doesn't work (it works for float('6.1e-4'), but not for float('6.1e-5') and beyond).
Python float
I can't seem to find any information about why this happens or about float limitations, and every example I found shows a conversion at e-3, never beyond that.
Numpy
I installed NumPy to try float96()/float128()... float96() doesn't exist, and float128() returns the float 6.09999999999999983e-07.
Format
I tried format(6.1E-07, '.8f'), which works, as it returns the string '0.00000061', but when I convert the string back to a float (so it can pass cerberus validation) it reverts to 6.1e-07.
Any help on this subject would be greatly appreciated.
Thanks
'6.1e-7' is a string:
>>> type('6.1e-7')
<class 'str'>
While 6.1e-7 is a float:
>>> type(6.1e-7)
<class 'float'>
0.00000061 is the same as 6.1e-7
>>> 0.00000061 == 6.1e-7
True
And, internally, this float is represented by 0's and 1's. That's just yet another representation of the same float.
However, when converted into a string, they're no longer compared as numbers, they are just characters:
>>> '0.00000061' == '6.1e-7'
False
And you can't compare strings with numbers either:
>>> 0.00000061 == '6.1e-7'
False
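To confirm the point: float('6.1e-7') itself succeeds in any standard Python 3, whatever the exponent; only the default display uses scientific notation, and fixed-point formatting changes the text, not the value:

```python
x = float("6.1e-7")                   # parses fine for any exponent
print(x == 6.1e-7)                    # True
print(format(x, ".8f"))               # 0.00000061 - a different text form
print(float(format(x, ".8f")) == x)   # True: same float either way
```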
Your problem description is too tangled to be understood precisely, but I'll attempt some telepathy.
In their internal format, numbers don't keep any formatting information; neither integers nor floats do. For an integer 123, you can't recover whether it was written as "123", " 123 " (with tons of spaces before and after it), 000000123, or +0123. For a floating-point number, 0.1, +0.0001e00003, 1.000000e-1, and myriad other forms can be used. Internally, they all result in the same number.
(There are some specifics with it when you use IEEE754 "decimal floating", but I am sure it is not your case.)
When saving to a database, the internal representation stops mattering much. Instead, the database's specifics start playing a role, and they can be quite different. For example, SQL suggests column types like numeric(10,4), and each value will be converted to the decimal format corresponding to the column type (typically saved on disk as a text string, with or without a decimal point). In MongoDB, you can keep a floating-point value either as a JSON number (IEEE754 double) or as text. Each variant has its own specifics, but, if you choose text, it is your own responsibility to produce the proper formatting each time you form that text. You want a fixed-point decimal number with 8 digits after the point? OK, no problem: you just format according to %.8f each time you prepare such a representation.
The issues with representation selection are:
Uniqueness: no different forms should be available for the same value. Otherwise you can, for example, store the same contents under multiple keys, and then mistake an older one for the latest one.
Ordering awareness: DB should be able to provide natural order of values, for requests like "ceiling key-value pair".
If you always format values using %.8f, you will achieve uniqueness, but not ordering. The same goes for %g, %e, and virtually any other text format except special (non-human-readable) ones constructed to preserve such ordering. If you need ordering, just store numbers as numbers, and don't worry about how they look in text form.
(And, your problem is not tied with numpy.)
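A small illustration of the ordering caveat with %.8f formatting:

```python
vals = [2.0, 10.0]
texts = ["%.8f" % v for v in vals]  # one canonical text form per value
print(texts)          # ['2.00000000', '10.00000000']
# String comparison is character-by-character, so '1' < '2' puts 10 first:
print(sorted(texts))  # ['10.00000000', '2.00000000'] - not numeric order
```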

how to write xdr packed data in binary files

I want to convert an array into xdr format and save it in binary format. Here's my code:
# myData is a pandas DataFrame whose 3rd column is int (but it could be anything else)
import xdrlib
p=xdrlib.Packer()
p.pack_list(myData[2],p.pack_int)
newFile=open("C:\\Temp\\test.bin","wb")
# not sure what to put
# p.get_buffer() returns a string as per document, but how can I provide xdr object?
newFile.write(???)
newFile.close()
How can I provide the XDR-"packed" data to newFile.write function?
Thanks
XDR is a pretty raw data format. Its specification (RFC 1832) doesn't specify any file headers, or anything else, beyond the encoding of various data types.
The binary string you get from p.get_buffer() is the XDR encoding of the data you've fed to p. There is no other kind of XDR object.
I suspect that what you want is simply newFile.write(p.get_buffer()).
Unrelated to the XDR question, I'd suggest using a with statement to take care of closing the file.
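Putting that together, a runnable sketch (writing to a temp directory rather than C:\Temp, with a toy list standing in for myData[2]; note that xdrlib was deprecated in Python 3.11 and removed in 3.13):

```python
import os
import tempfile
import xdrlib  # deprecated in Python 3.11, removed in 3.13

values = [1, 2, 3]  # hypothetical stand-in for myData[2]

p = xdrlib.Packer()
p.pack_list(values, p.pack_int)
packed = p.get_buffer()  # bytes: the XDR encoding, ready to write as-is

path = os.path.join(tempfile.gettempdir(), "test.bin")
with open(path, "wb") as f:  # 'with' closes the file even on error
    f.write(packed)

# Round trip to confirm the bytes on disk decode back to the original list
with open(path, "rb") as f:
    u = xdrlib.Unpacker(f.read())
decoded = u.unpack_list(u.unpack_int)
print(decoded)  # [1, 2, 3]
```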
