Change data in Pandas dataframe by column - python-3.x

I have some data that I exported from an Excel spreadsheet as a CSV. I created a dataframe using Pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats such as 5.15100.
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15100)
However, this does not work. When I print the dataframe again after calling replace, it shows the same dataframe with no changes. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the "." but I'd prefer not to do it this way as some of the strings represent numbers of different lengths and decimal place values.

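replace returns a new object rather than modifying the column, which is why the dataframe looks unchanged. Add the parameter "inplace", which defaults to False; setting it to True changes the data in place. Alternatively, assign the result back to the column, which avoids relying on in-place modification of a selected column. The replaced values below are still strings, so the column can then be cast to float.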
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
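For what it's worth, here is a minimal sketch of the assign-back variant, plus one possible way to convert every version string without listing each value by hand. The helper version_to_float and the sample data are made up for illustration; it assumes the rule is "keep the first dot, drop the rest", which may not match every string in the real column.

```python
import pandas as pd

# Made-up sample data; the real column comes from the CSV.
df = pd.DataFrame({"Fix versions": ["5.15.1.0.0", "5.15.1.0.0", "4.2.0"]})

# replace() returns a new Series; the dataframe itself is untouched
# until the result is assigned back.
replaced = df["Fix versions"].replace("5.15.1.0.0", "5.15100")
unchanged = df["Fix versions"].tolist()


def version_to_float(version: str) -> float:
    """Hypothetical helper: keep the first dot, drop the remaining ones."""
    major, _, rest = version.partition(".")
    return float(f"{major}.{rest.replace('.', '')}")


# Convert the whole column in one pass and assign the result back.
df["Fix versions"] = df["Fix versions"].map(version_to_float)
print(df)
```

Note that a float cannot keep trailing zeros, so "5.15100" is stored as 5.151; if the zeros matter, the column probably has to stay a string.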

Related

Problem and a Solution for pd.DataFrame values changing to Nan while changing index/row default names

I had a dataframe like the one in image-1 (input dataframe), whose rows/indices I wanted to rename to dates (dtype='datetime64[ns]') in YYYY-MM-DD format.
So I used the index renaming option shown in image-2 below, where each row gets the last date of every sixth month, incrementing until the end. It did rename the rows, but it ended up turning all data values into NaN. I also tried it on the transpose of the dataframe, with the same result.
I then tried a few other things, shown in image-3, which were all unfruitful; mostly I got an error: TypeError: 'DatetimeIndex' object is not callable
As the final solution, I ended up creating a dataframe of all the dates (image-4), then merging the two dataframes by columns (image-5), and then setting the very first column as the row names (image-6).
The dates have a weird format when converted to a list, and I wonder why (image-7). How do we get exactly the year-month-day? I tried different combinations but didn't get useful results. strftime is the way to go here, but how?
The reason I went for the strftime approach: I was thinking of producing a list of dates in a sensible YYYY-MM-DD format and then using something like DataFrame.rename(index=list_dates) to replace the default 0 1 2 with dates as the new index names.
So, I have a solution, but is it an economical one, or are there better solutions available?
This is an attempt to share my solution for those who can use it, and to learn new solutions from the wizards here.
BRgrds,
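A possible shorter route, sketched under assumptions since the images are not available: the data and dates below are made up. One common cause of all-NaN values is label alignment (reindex looks up the new labels among the old ones), whereas assigning to .index relabels rows by position. Whether that is what happened in image-2 is a guess.

```python
import pandas as pd

# Made-up data and dates standing in for the ones in the images.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
dates = pd.to_datetime(["2020-06-30", "2020-12-31", "2021-06-30"])

# Assigning to .index relabels rows positionally, so the values are kept.
relabelled = df.copy()
relabelled.index = dates

# reindex() aligns on labels instead; none of 0, 1, 2 match the dates,
# so every value comes back as NaN.
aligned = df.reindex(dates)

# strftime on a DatetimeIndex gives plain 'YYYY-MM-DD' strings.
date_strings = dates.strftime("%Y-%m-%d").tolist()

# rename() expects a mapping (or a function) from old label to new label,
# not a list or an index object.
renamed = df.rename(index=dict(zip(df.index, date_strings)))
print(renamed)
```

Passing a DatetimeIndex directly as rename(index=...) makes pandas try to call it as a function, which would explain the "not callable" TypeError.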

How to drop columns from a pandas DataFrame that have elements containing a string?

This is not about dropping columns whose name contains a string.
I have a dataframe with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as invalid value encountered in double_scalars (XYZ), where 'XYZ' is a placeholder for the column name.
I would like to delete all columns that contain, in any of their elements, the string invalid.
Purging columns with strings in general would work too. What I want is to clean it up so I can fit a machine learning model to it, so removing any/all columns that are not boolean or real would work.
This must be a duplicate question, but I can only find answers to how to remove a column with a specific column name.
You can use df.select_dtypes(include=[float,bool]) or df.select_dtypes(exclude=['object']) to keep only the non-string columns.
Docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
Alternatively, use apply to build a mask that marks each column containing invalid, then pass the negated mask as the column selector (the second position) of .loc:
df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]
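Both suggestions can be tried on a toy frame. This is only a sketch: the column names and values below are invented, and the real frame has 1600 columns.

```python
import pandas as pd

# Invented toy frame: two usable columns and one garbage column.
df = pd.DataFrame({
    "x": [1.0, 2.5, 3.0],
    "flag": [True, False, True],
    "bad": ["0.1", "invalid value encountered in double_scalars (bad)", "0.3"],
})

# Option 1: keep only numeric and boolean columns.
numeric_only = df.select_dtypes(include=["number", "bool"])

# Option 2: build a per-column mask that is True where any element
# contains the substring "invalid", then keep the other columns.
has_invalid = df.apply(lambda col: col.astype(str).str.contains("invalid")).any()
cleaned = df.loc[:, ~has_invalid]

print(list(numeric_only.columns))
print(list(cleaned.columns))
```

Option 1 also drops string columns that contain no "invalid" at all, which matches the "purge strings in general" variant of the question; option 2 only drops columns that actually contain the phrase.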

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV',
dtype = {'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (the L9 appears in an italic serif font, and the first and second $ vanish).
Faced with an entire column of ID numbers like this, what is the best way of dealing with such data? I can imagine:
Before pd.read_csv: replacing the offending symbols with substitutes that don't create this problem (and what would those be?), or
Is there a way of preserving the IDs as they are, using a data type that ignores these symbols while keeping them present?
Thank you. I've attached a screenshot of the .csv side by side with resulting df (Jupyter notebook) below.
[Screenshot: csv column to pandas df with $$]
I cannot replicate this using the same values as yours in a mock CSV file.
Are you sure the $-based formatting is not happening in whatever tool renders your dataframe values? Have you checked that the data stored in the dataframe is what you expect, or are you only looking at the rendered output?
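One way to check this, as a rough sketch: read the same four IDs from an in-memory CSV (io.StringIO stands in for the real file) and look at the raw Python values rather than the notebook's HTML table. If the values are intact, the likely culprit is Jupyter treating text between $ signs as math; pandas has a display option for that, though whether it fixes this particular notebook is an assumption.

```python
import io

import pandas as pd

# In-memory stand-in for the real CSV file, using the IDs from the question.
csv_text = "SCRAMBLE_ID\nAL9LLL677\nAL9$AM657\n$L9$$4440\n#L9$306A1\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={"SCRAMBLE_ID": str})

# Inspect the stored values directly, bypassing any HTML rendering.
ids = df["SCRAMBLE_ID"].tolist()
print(ids)

# Ask pandas not to let the notebook process table contents with MathJax.
pd.set_option("display.html.use_mathjax", False)
```

print(df) or df.to_string() shows the plain-text form as well, which is another quick way to confirm that the $ and # characters survived the import.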

How to convert recordset.field to a string

I am currently trying to compare certain values in a column from an Access query against a vector of strings, looking for a match between any two values.
I used recordset.fields("column1") to access specific records from my desired column, but I am unable to get matches, apparently because the values are of different data types.
How do I convert the records from recordset.fields("column1") into a string?
Thanks!
If you are working in VBA, wrap your value in the CStr() function, which returns the value converted to a string.

Best way to import numeric and non-numeric data (string) from an excel file into MATLAB?

I want to know the best way of importing both number and non-numeric data (which is string in the present case) from an excel file into MATLAB? By best (or better) way, I mean all the data together in a variable (or data structure).
First, I tried the uiopen(filename) function, which opens a wizard from which I can import the data into a MATLAB variable. The problem is that it replaces all the non-numeric data with zeros, which is not what I want. I later found that this function calls another function, xlsread(filename), which is the actual way of importing an Excel file.
The second (and last) way I tried, which seems better, is the importdata(filename) function, which imports numeric and non-numeric data into separate structure variables.
However, I am wondering whether there is some other way to import everything into a single variable or data structure.
xlsread is the correct way to import data from Excel spreadsheets, both numeric and non-numeric. Check the documentation:
[num,txt,raw] = xlsread(___) additionally returns the text fields in
cell array txt, and the unprocessed data (numbers and text) in cell
array raw using any of the input arguments in the previous syntaxes.
If xlRange is specified, leading blank rows and columns in the
worksheet that precede rows and columns with data are returned in raw.
