Convert all strings with numbers to integers in DataFrames - excel

I am using pandas with openpyxl to process multiple Excel files into a single Excel file as output. In this output file, cells can contain a combination of numbers and other characters or exclusively numbers, and all cells are stored as text.
I want all cells that only contain numbers in the output file to be stored as numbers. As the columns with numbers are known (5 to 8), I used the following code to transform the text to floats:
for dictionary in list_of_Excelfiles:
    dictionary[DataFrame][5:8].astype(float)
However, this manual procedure is not scalable and might be prone to errors when other characters than numbers are present in the column. As such, I want to create a statement that transforms any cell with only numbers to an integer.
What condition can filter for cells with only numbers and transform these to integers?

You could use try/except together with applymap. Here is a full example.
Create some random data:
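import random
import string
import pandas as pd

def s():
    # five random strings, each 1 to 5 characters drawn from 'a'-'f' and the digits
    return [''.join(random.choices(string.ascii_letters[:6] + string.digits, k=random.randint(1, 5))) for _ in range(5)]
df = pd.DataFrame()
for c in range(4):
    df[c] = s()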
Define a try/except function:
def try_int(s):
    try:
        return int(s)
    except ValueError:
        return s
Apply it to each cell (note that pandas 2.1 and later deprecate applymap in favor of DataFrame.map):
df2 = df.applymap(try_int)
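As a deterministic illustration of the same idea, here is a minimal sketch on fixed data. The column names `a` and `b` and the sample values are made up for the example, and it uses `Series.map` per column rather than `applymap`, an assumption made only so the sketch does not depend on a particular pandas version.

```python
import pandas as pd

def try_int(s):
    """Return int(s) when s parses as an integer, otherwise return s unchanged."""
    try:
        return int(s)
    except (ValueError, TypeError):
        return s

# Hypothetical sample data: every cell is stored as text.
df = pd.DataFrame(
    {"a": ["12", "3f", "007"], "b": ["x9", "45", "abc"]},
    dtype=object,
)

# Apply try_int to every cell, one column at a time.
df2 = df.apply(lambda col: col.map(try_int))

print(df2["a"].tolist())  # [12, '3f', 7]
print(df2["b"].tolist())  # ['x9', 45, 'abc']
```

Columns that mix integers and strings keep the object dtype, so only the purely numeric cells become integers.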

Related

Splitting the data of one excel column into two columns using python

I have a problem splitting the content of one Excel column, which contains numbers and letters, into two columns: the numbers in one column and the letters in the other.
As you can see in the first photo, there is no space between the numbers and the letters, but the good thing is that the letters are always "ms". I need a method to split them as in the second photo.
Before
After
I tried to use replace, but it did not work; it did not split them.
Is there any other method?
You can use the extract method. Here is an example:
import pandas as pd
df = pd.DataFrame({'time': ['34ms', '239ms', '126ms']})
df[['time', 'unit']] = df['time'].str.extract(r'(\d+)(\D+)')
# convert time column into integer
df['time'] = df['time'].astype(int)
print(df)
# output:
#    time unit
# 0    34   ms
# 1   239   ms
# 2   126   ms
This is simple with pandas.Series.str.split (see the pandas.Series.str.split documentation for the syntax).
The code should be:
import pandas as pd
data_before = {'data' : ['34ms','56ms','2435ms']}
df = pd.DataFrame(data_before)
# the capture group keeps the digits in the result; column 0 is the empty text before them
result = df['data'].str.split(pat=r'(\d+)', expand=True)
result = result.loc[:, [1, 2]]
result.rename(columns={1:'number', 2:'string'}, inplace=True)
print(result)

Using value_counts() and filter elements based on number of instances

I use the following code to create two arrays in a histogram, one for the counts (percentages) and the other for values.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
So, an output looks like
counts = 66.7, 8.3, 8.3, 8.3, 8.3
values = 1024, 356352, 73728, 16384, 4096
The problem is that some values occur only once, and I would like to ignore them. In the example above, only 1024 is repeated multiple times; the others appear only once. I could manually check the number of occurrences of each value in the row and ignore those that are not repeated.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
for v in values:
    # N = get_number_of_instances in row
    # if N == 1
    #     remove v in row
    pass
I would like to know if there are other ways for that using the built-in functions in Pandas.
I have asked for some clarification on your question in the comments above.
If keys is a column and you want to retain only the values that are not duplicated, please try:
values=df.loc[~df['keys'].duplicated(keep=False), 'keys'].to_list()
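If the goal is instead to drop the values that occur only once, one possible sketch is to filter the raw counts before converting them to percentages. The sample `row` below is invented to reproduce the percentages from the question, and the names `repeated` and `percentages` are hypothetical.

```python
import pandas as pd

# Hypothetical row: 1024 appears 8 times, four other values appear once each.
row = pd.Series([1024] * 8 + [356352, 73728, 16384, 4096])

counts = row.value_counts()        # absolute number of occurrences per value
repeated = counts[counts > 1]      # keep only values seen more than once

# Percentages are still computed against the full row length.
percentages = repeated.div(len(row)).mul(100).round(1)
values = percentages.index.tolist()

print(values)                # [1024]
print(percentages.tolist())  # [66.7]
```

Filtering on the unnormalized counts avoids having to guess which percentage corresponds to a single occurrence.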

Pandas Dataframe Integer Column Data Cleanup

I have a CSV file, which I read with the pandas read_csv function.
There is one column, which is supposed to have numbers only, but the data has some bad values.
Some rows (very few) have "alphanumeric" strings, few rows are empty while a few others have floating point numbers. Also, for some reason, some numbers are also being read as strings.
I want to convert it in the following way:
Alphanumeric, None, empty (numpy.nan) should be converted to 0
Floating point should be typecasted to int
Integers should remain as they are
And obvs, numbers should be read as numbers only.
How should I proceed, as I have no other idea than to read each row one by one and typecast into int, in a try-except block, while assigning 0 if exception is raised.
like:
def typecast_int(n):
    try:
        return int(n)
    except:
        return 0
for idx, row in df.iterrows():
    row["number_column"] = typecast_int(row["number_column"])
But there are some issues with this approach. Firstly, iterrows performs poorly, and my dataframe may have 700k to 1M records, and I have to process ~500 such CSV files. Secondly, it just doesn't feel right to do it this way.
I could do a tad better by using df.apply instead of iterrows but that is also not too different.
Based on your four conditions:
df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
                      .fillna(0)
                      .astype(int))
This first converts the column to numeric values only. Values that cause errors (e.g., alphanumerics) are coerced to NaN. Then we fill those NaNs with 0 and lastly cast everything to integers.
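To see the three steps on concrete values, here is a small sketch. The sample column contents are invented for illustration; only `number_column` comes from the question.

```python
import pandas as pd

# Hypothetical messy column: numeric strings, junk, missing values, floats, ints.
df = pd.DataFrame({"number_column": ["12", "abc123", None, 3.7, 5]})

# Step 1: anything that cannot be parsed as a number becomes NaN.
cleaned = pd.to_numeric(df["number_column"], errors="coerce")

# Steps 2 and 3: replace NaN with 0, then cast to int (3.7 is truncated to 3).
df["number_column"] = cleaned.fillna(0).astype(int)

print(df["number_column"].tolist())  # [12, 0, 0, 3, 5]
```

Note that the cast truncates floats toward zero rather than rounding them.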

python xlsxwriter extract value from cell

Is it possible to extract data that I've written to a xlsxwriter.worksheet?
import xlsxwriter
output = "test.xlsx"
workbook = xlsxwriter.Workbook(output)
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, 'top left')
if conditional:
    worksheet.write(1, 1, 'bottom right')
for row in range(2):
    for col in range(2):
        # Now how can I check if a value was written at this coordinate?
        # something like worksheet.get_value_at_row_col(row, col)
        pass
workbook.close()
Is it possible to extract data that I've written to a xlsxwriter.worksheet?
Yes. Even though XlsxWriter is write only, it stores the table values in an internal structure and only writes them to file when workbook.close() is executed.
Every Worksheet has a table attribute. It is a dictionary, containing entries for all populated rows (row numbers starting at 0 are the keys). These entries are again dictionaries, containing entries for all populated cells within the row (column numbers starting at 0 are the keys).
Therefore, table[row][col] will give you the entry at the desired position (but only in case there is an entry, it will fail otherwise).
Note that these entries are still not the text, number or formula you are looking for, but named tuples, which also contain the cell format. You can type check the entries and extract the contents depending on their nature. Here are the possible outcomes of type(entry) and the fields of the named tuples that are accessible:
xlsxwriter.worksheet.cell_string_tuple: string, format
xlsxwriter.worksheet.cell_number_tuple: number, format
xlsxwriter.worksheet.cell_blank_tuple: format
xlsxwriter.worksheet.cell_boolean_tuple: boolean, format
xlsxwriter.worksheet.cell_formula_tuple: formula, format, value
xlsxwriter.worksheet.cell_arformula_tuple: formula, format, value, range
For numbers, booleans, and formulae, the contents can be accessed by reading the respective field of the named tuple.
For array formulae, the contents are only present in the upper left cell of the output range, while the rest of the cells are represented by number entries with 0 value.
For strings, the situation is more complicated, since Excel's storage concept has a shared string table, while the individual cell entries only point to an index of this table. The shared string table can be accessed as the str_table.string_table attribute of the worksheet. It is a dictionary, where the keys are strings and the values are the associated indices. In order to access the strings by index, you can generate a sorted list from the dictionary as follows:
shared_strings = sorted(worksheet.str_table.string_table, key=worksheet.str_table.string_table.get)
I expanded your example from above to include all the explained features. It now looks like this:
import xlsxwriter
output = "test.xlsx"
workbook = xlsxwriter.Workbook(output)
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, 'top left')
worksheet.write(0, 1, 42)
worksheet.write(0, 2, None)
worksheet.write(2, 1, True)
worksheet.write(2, 2, '=SUM(X5:Y7)')
worksheet.write_array_formula(2,3,3,4, '{=TREND(X5:X7,Y5:Y7)}')
worksheet.write(4,0, 'more text')
worksheet.write(4,1, 'even more text')
worksheet.write(4,2, 'more text')
worksheet.write(4,3, 'more text')
for row in range(5):
    row_dict = worksheet.table.get(row, None)
    for col in range(5):
        if row_dict is not None:
            col_entry = row_dict.get(col, None)
        else:
            col_entry = None
        print(row, col, col_entry)
shared_strings = sorted(worksheet.str_table.string_table, key=worksheet.str_table.string_table.get)
print()
if type(worksheet.table[0][0]) == xlsxwriter.worksheet.cell_string_tuple:
    print(shared_strings[worksheet.table[0][0].string])
# type checking omitted for the rest...
print(worksheet.table[0][1].number)
print(bool(worksheet.table[2][1].boolean))
print('='+worksheet.table[2][2].formula)
print('{='+worksheet.table[2][3].formula+'}')
workbook.close()
Is it possible to extract data that I've written to a xlsxwriter.worksheet?
No. XlsxWriter is write only. If you need to keep track of your data you will need to do it in your own code, outside of XlsxWriter.
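One way to keep track of the data in your own code is a thin wrapper that records every write before forwarding it. The sketch below is only an illustration: `RecordingSheet` and `FakeWorksheet` are hypothetical names, and the fake stands in for a real worksheet so the example runs without XlsxWriter installed. It assumes nothing about the worksheet beyond a `write(row, col, value, ...)` method.

```python
class RecordingSheet:
    """Hypothetical wrapper: forwards writes and remembers what was written."""

    def __init__(self, worksheet):
        self._worksheet = worksheet
        self.values = {}  # maps (row, col) -> value passed to write()

    def write(self, row, col, value, *args):
        self.values[(row, col)] = value
        return self._worksheet.write(row, col, value, *args)

    def get(self, row, col, default=None):
        """Return the value written at (row, col), or default if none was."""
        return self.values.get((row, col), default)


class FakeWorksheet:
    """Stand-in for a real worksheet; it just accepts writes."""

    def write(self, row, col, value, *args):
        return 0


sheet = RecordingSheet(FakeWorksheet())
sheet.write(0, 0, "top left")

print(sheet.get(0, 0))  # top left
print(sheet.get(1, 1))  # None
```

In real use, the object returned by `workbook.add_worksheet()` would be passed in place of `FakeWorksheet()`.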

About lists in python

I have an Excel file with a column in which the values are in multiple rows in this format: 25/02/2016. I want to save all these rows of dates in a list. Each row is a separate value. How do I do this? So far this is my code:
import openpyxl
wb = openpyxl.load_workbook ('LOTERIAREAL.xlsx')
sheet = wb.get_active_sheet()
rowsnum = sheet.get_highest_row()
wholeNum = []
for n in range(1, rowsnum):
    wholeNum = sheet.cell(row=n, column=1).value
print (wholeNum[0])
When I use the print statement, instead of printing the value of the first row, which should be the first item in the list (e.g. 25/02/2016), it prints the first character of the row, which is the number 2. Apparently it is slicing through the date. I want the first row and subsequent rows saved as separate items in the list. What am I doing wrong? Thanks in advance
wholeNum = sheet.cell(row=n, column=1).value assigns the value of the cell to the variable wholeNum, so you never add anything to the initial empty list and just overwrite the value each time. When you call wholeNum[0] at the end, wholeNum is the last string that was read, and you get its first character.
You probably want wholeNum.append(sheet.cell(row=n, column=1).value) to accumulate a list.
wholeNum =
This is an assignment. It makes the name wholeNum refer to whatever object the expression to the right of the = operator evaluates to.
for ...:
    wholeNum = ...
Performing assignment in a loop is frequently not useful. The name wholeNum will refer to whatever value was assigned to it in the last iteration of the loop. The other iterations have no discernible effect.
To append values to a list, use the .append() method.
for ...:
    wholeNum.append( ... )
print( wholeNum )
print( wholeNum[0] )
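The difference between assigning and appending can be shown without a workbook. In this sketch the list `cell_values` is a hypothetical stand-in for the values read from the sheet.

```python
# Hypothetical stand-in for the cell values read from the first column.
cell_values = ["25/02/2016", "03/03/2016", "10/03/2016"]

# Assignment in a loop: the name is rebound each time, only the last value survives.
last_value = None
for value in cell_values:
    last_value = value
print(last_value[0])  # 1  (first character of "10/03/2016")

# Appending in a loop: every value is kept as a separate list item.
dates = []
for value in cell_values:
    dates.append(value)
print(dates[0])  # 25/02/2016
```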
