Pandas Dataframe Integer Column Data Cleanup - python-3.x

I have a CSV file, which I read with pandas' read_csv module.
There is one column which is supposed to contain numbers only, but the data has some bad values.
Some rows (very few) have "alphanumeric" strings, a few rows are empty, and a few others have floating point numbers. Also, for some reason, some numbers are being read as strings.
I want to convert the column in the following way:
Alphanumeric, None, or empty (numpy.nan) values should be converted to 0
Floating point values should be cast to int
Integers should remain as they are
And obviously, numbers should be read as numbers only.
How should I proceed? I have no idea other than to read each row one by one and cast it to int in a try-except block, assigning 0 if an exception is raised,
like:
def typecast_int(n):
    try:
        return int(n)
    except (ValueError, TypeError):
        return 0

for idx, row in df.iterrows():
    df.at[idx, "number_column"] = typecast_int(row["number_column"])
But there are some issues with this approach. Firstly, iterrows is bad performance-wise, my dataframe may have up to 700k to 1M records, and I have to process ~500 such CSV files. Secondly, it just doesn't feel right to do it this way.
I could do a tad better by using df.apply instead of iterrows, but that is not too different either.

Given your 4 conditions, this should cover everything:
df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
                    .fillna(0)
                    .astype(int))
This first converts the column to numeric values only; anything that cannot be parsed (e.g., the alphanumerics) gets coerced to NaN. Then we fill those NaNs with 0 and finally cast everything to integers.
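For illustration, here is a minimal sketch of the conversion end to end (the column name number_column is kept from the question; the sample values are assumptions, not from the original data):
import numpy as np
import pandas as pd

# hypothetical sample with the kinds of bad values described in the question
df = pd.DataFrame({"number_column": ["12", "abc123", None, np.nan, 3.7, 42, ""]})

df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
                    .fillna(0)
                    .astype(int))

print(df.number_column.tolist())  # [12, 0, 0, 0, 3, 42, 0]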

Related

Pandas: Problem in Concatenation of three Data Frames

I am loading three .csv files into three different Pandas data frames. The number of rows in each file should be the same, but it is not. Sometimes one file contains four to five additional rows compared to another; sometimes two files include two to three extra rows compared to the third. I'm concatenating the files row by row, so I'm removing these additional rows to make them the same length. I wrote the following code to accomplish this.
df_ch = pd.read_csv("./file1.csv")
df_wr = pd.read_csv("./file2.csv")
df_an = pd.read_csv("./file3.csv")
# here df_wr has fewer rows than df_ch and df_an, so drop rows from the other two frames (df_ch and df_an)
df_ch.drop(df_ch.index[274299:274301], inplace=True)
df_an.drop(df_an.index[274299], inplace=True)
I did it manually in the above code, and now I want to do it automatically. One method is to use if-else to check the length of all frames and make it equal to the shortest length frame. But I'm curious whether Pandas has a faster technique to compare these frames and delete the additional rows.
A couple of ways you can do this:
In general:
# take the intersection of all the indexes
common = df1.index.intersection(df2.index).intersection(df3.index)
pd.concat([df1, df2, df3], axis=1).reindex(common)
Or, in your case, maybe just:
min_rows = min(map(len, [df1, df2, df3]))
pd.concat([df1, df2, df3], axis=1).iloc[:min_rows]
Update: the best option is probably join:
df1.join(df2, how='inner').join(df3, how='inner')
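A minimal sketch of the join approach (the toy frames and lengths below are assumptions for illustration, not from the original files):
import pandas as pd

# hypothetical frames of unequal length sharing a default RangeIndex
df1 = pd.DataFrame({"a": range(5)})
df2 = pd.DataFrame({"b": range(4)})
df3 = pd.DataFrame({"c": range(3)})

# inner joins keep only the row labels present in all three frames,
# so the result is automatically trimmed to the shortest frame
result = df1.join(df2, how='inner').join(df3, how='inner')
print(len(result))  # 3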

Convert all strings with numbers to integers in DataFrames

I am using pandas with openpyxl to process multiple Excel files into a single Excel file as output. In this output file, cells can contain a combination of numbers and other characters or exclusively numbers, and all cells are stored as text.
I want all cells that only contain numbers in the output file to be stored as numbers. As the columns with numbers are known (5 to 8), I used the following code to transform the text to floats:
for dictionary in list_of_Excelfiles:
    dictionary[DataFrame].iloc[:, 5:8] = dictionary[DataFrame].iloc[:, 5:8].astype(float)
However, this manual procedure is not scalable and might be prone to errors when characters other than numbers are present in a column. As such, I want to create a statement that transforms any cell containing only numbers to an integer.
What condition can filter for cells with only numbers and transform these to integers?
You could use try/except with applymap. Here is a full example:
Create some random data for the example:
import random
import string
import pandas as pd

def s():
    return [''.join(random.choices(string.ascii_letters[:6] + string.digits, k=random.randint(1, 5))) for x in range(5)]

df = pd.DataFrame()
for c in range(4):
    df[c] = s()
Define a try/except function:
def try_int(s):
    try:
        return int(s)
    except ValueError:
        return s
Apply it on each cell:
df2 = df.applymap(try_int)
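As a side note, on newer pandas releases (2.1 and later) applymap is deprecated in favor of the element-wise DataFrame.map, so the same idea can be written as below, reusing the df and try_int defined above:
# element-wise map; the modern replacement for applymap on pandas >= 2.1
df2 = df.map(try_int)

# spot-check: numeric-only strings became ints, everything else stayed a string
print(df2.iloc[0].map(type).tolist())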

Efficient way to convert a string column with comma-separated floats into 2D NumPy array?

I have a Pandas dataset with one string column containing a series of floats separated by commas (as one big string):
4677579 -0.26751,0.559024,0.690269,0.0446298,0.967434,...
5193597 -0.587694,0.15063,-0.163831,-0.696565,0.972488...
596398 -0.732648,0.69533,0.722288,-0.0453636,0.435788...
5744417 -0.354733,-0.782564,-0.301927,0.263827,0.96237...
2464195 -0.326526,0.341944,0.330533,-0.250108,0.673552...
So, the first row has the following format:
{4677579: '-0.26751,0.559024,[..skipped..],-0.394059,0.974787'}
I need to convert that column into a (preferably 2D NumPy) array of floats:
array([[-2.67510e-01, 5.59024e-01, 6.90269e-01, skipped, 4.45222e-01, -1.82369e-01],
[-5.87694e-01, 1.50630e-01, -1.63831e-01, skipped, 9.47768e-01]])
The problem is that the dataset is very large (>10M rows, >400 floats/row), and my naive approaches to conversion, like:
vectordata.apply(lambda x: np.fromstring(x, sep=','))
or
vectordata.apply(lambda x: list(map(float,x.split(','))))
just time out (they work OK if the total size of the dataset is <5M rows, though).
Any ideas how one could optimize this operation to make it work on a large dataset?
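One possible direction (a sketch, not from the original post, assuming every row holds the same number of floats, which the rectangular target array implies) is to hand the whole column to NumPy in a single call instead of parsing row by row in Python:
import numpy as np
import pandas as pd

# hypothetical stand-in for the real column; the values are assumptions
vectordata = pd.Series({
    4677579: '-0.26751,0.559024,0.690269',
    5193597: '-0.587694,0.15063,-0.163831',
    596398: '-0.732648,0.69533,0.722288',
})

n_cols = vectordata.iloc[0].count(',') + 1
# one big parse instead of millions of small ones
flat = np.fromstring(','.join(vectordata.tolist()), sep=',', dtype=np.float64)
arr = flat.reshape(len(vectordata), n_cols)
print(arr.shape)  # (3, 3)

For a dataset too large to join into one string in memory, the same idea can be applied in chunks of rows.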

Adding zero columns to a dataframe

I have a strange problem which I am not able to figure out. I have a dataframe subset that looks like this:
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
and I get a result similar to this.
Now, when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
I don't understand why I sometimes get zeros and other times get either NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional','IPNotional'], 0))
#you can define each column separately
#subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or simpler:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now, when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
I think the problem is different index values, so it is necessary to create the new column with the same index; otherwise, the non-matching indices produce NaNs:
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)), index=subset.index)
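A small sketch reproducing the effect (the index values are assumptions for illustration): assigning a DataFrame or Series aligns on the index, so labels that do not match become NaN, while passing index=subset.index (or simply assigning the scalar 0) avoids that.
import numpy
import pandas as pd

# hypothetical subset whose index does not start at 0 (e.g. after filtering)
subset = pd.DataFrame({'x': [10, 20, 30]}, index=[5, 6, 7])

# alignment on index: the zeros frame has index 0..2, nothing matches -> all NaN
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
print(subset['IRNotional'].tolist())  # [nan, nan, nan]

# passing the right index fixes it
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)), index=subset.index)
print(subset['IPNotional'].tolist())  # [0.0, 0.0, 0.0]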

Python having issues with large numbers used as IDs

I am putting together a script to analyze campaigns and report out. I'm building it in Python in order to make this easy the next time around. I'm running into issues with the IDs involved in my data; they are essentially really large numbers (no strings, no characters).
When pulling the data in from Excel I get floats like this (7.000000e+16) when in reality it is an integer like so (70000000001034570). My problem is that I'm losing a ton of data: all kinds of unique IDs are getting converted to a couple of different floats. I realize this may be an issue with the read_csv function I use to pull these in, as it all comes from .csv. I am not sure what to do, as converting to string gives me the same result as the float, only as a string datatype, and converting to int gives me the literal result of the scientific notation (i.e. 70000000000000000). Is there a datatype I can store these as, or a method I can use, to preserve the data? I will have to merge on the IDs later with data pulled from a query, so ideally I would like to find a datatype that can preserve them. The few lines of code below run but return only a handful of rows because of the issue I described.
high_lvl_df = pd.read_csv(r"mycsv.csv")
full_df = low_lvl_df.merge(right=high_lvl_df, on='fact', how='outer')
full_df.to_csv(r'fullmycsv.csv')
It may have to do with missing values.
Consider this CSV:
70000000001034570,2.
70000000001034571,3.
Then:
>>> pandas.read_csv('asdf.csv', header=None)
                   0    1
0  70000000001034570  2.0
1  70000000001034571  3.0
Gives you the expected result.
However, with:
70000000001034570,2.
,1.
70000000001034571,3.
You get:
>>> pandas.read_csv('asdf.csv', header=None)
              0    1
0  7.000000e+16  2.0
1           NaN  1.0
2  7.000000e+16  3.0
That is because integers do not have NaN values, whereas floats do have that as a valid value. Pandas, therefore, infers the column type is float, not integer.
You can use pandas.read_csv()'s dtype parameter to force the column to type str, for example:
pandas.read_csv('asdf.csv', header=None, dtype={0: str})
                   0    1
0  70000000001034570  2.0
1                NaN  1.0
2  70000000001034571  3.0
According to Pandas' documentation:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
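Building on that documentation note, pandas' nullable 'Int64' dtype is another option if you want the IDs to stay integers (rather than strings) even when values are missing. A small sketch, reusing the hypothetical asdf.csv from above:
import pandas as pd

df = pd.read_csv('asdf.csv', header=None, dtype={0: 'Int64'})
print(df)
#                    0    1
# 0  70000000001034570  2.0
# 1               <NA>  1.0
# 2  70000000001034571  3.0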
