Splitting the data of one Excel column into two columns using Python - excel

I have a problem splitting the content of one Excel column, which contains numbers and letters, into two columns: the numbers in one column and the letters in the other.
As you can see in the first photo, there is no space between the numbers and the letters, but the good thing is that the letters are always "ms". I need a method to split them as in the second photo.
Before
After
I tried to use replace, but it did not work: it did not split them.
Is there any other method?

You can use the extract method. Here is an example:
import pandas as pd
df = pd.DataFrame({'time': ['34ms', '239ms', '126ms']})
# capture the digits and the non-digit suffix as two separate columns
df[['time', 'unit']] = df['time'].str.extract(r'(\d+)(\D+)')
# convert time column into integer
df['time'] = df['time'].astype(int)
print(df)
# output:
#    time unit
# 0    34   ms
# 1   239   ms
# 2   126   ms
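A variation on the extract approach, as a minimal sketch: named groups in the pattern become the column names directly, so the original column can stay untouched. The column name raw and the group names time and unit are just illustrative choices here, not anything from the question.

```python
import pandas as pd

# "raw" is a hypothetical column name standing in for the Excel column
df = pd.DataFrame({"raw": ["34ms", "239ms", "126ms"]})

# Named groups (?P<name>...) become the column names of the extracted frame
parts = df["raw"].str.extract(r"(?P<time>\d+)(?P<unit>\D+)")

# extract returns strings, so convert the numeric part explicitly
parts["time"] = parts["time"].astype(int)

print(parts)
```

If the data really comes from a workbook, the DataFrame line would presumably be replaced by a pd.read_excel call on your own file.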

It is pretty simple.
You need to use pandas.Series.str.split.
For the syntax, see the documentation of pandas.Series.str.split.
The code should be:
import pandas as pd
data_before = {'data': ['34ms', '56ms', '2435ms']}
df = pd.DataFrame(data_before)
# the capture group keeps the digits in the split result: ['', '34', 'ms']
result = df['data'].str.split(pat=r'(\d+)', expand=True)
# column 0 is the empty string before the digits, so keep columns 1 and 2
result = result.loc[:, [1, 2]]
result.rename(columns={1: 'number', 2: 'string'}, inplace=True)
print(result)

Related

How can I define a parameter from specific columns and rows from excel?

I want to obtain a list of certain values from an excel file.
I tried this:
import pandas as pd
df = pd.read_excel('Data.xlsx')
orders = df[['Order']].loc[[4,129]]
print(orders)
I obtained this result:
        Order
4    18292839
129  83938292
But what I want is the orders from rows 4 to 129 as a list of values, like this:
['18292839', .............. (other orders), '83938292']
If someone can help me I will be very grateful!
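You can use orders.values.tolist() to convert a pd.Series into a list. Note that .loc[[4, 129]] selects only rows 4 and 129, while .loc[4:129] selects every row in that range, and that df['Order'] (single brackets) gives a Series, whereas df[['Order']] gives a one-column DataFrame whose .values.tolist() is a list of one-element lists. You can read more about converting DataFrames and Series into lists here.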
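To make the range selection concrete, here is a small sketch. The DataFrame is a made-up stand-in for pd.read_excel('Data.xlsx'): only the first and last order numbers come from the question, the two in between are placeholders, and the labels 0 to 3 play the role of 4 to 129.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel('Data.xlsx'); middle values are placeholders
df = pd.DataFrame({"Order": [18292839, 11111111, 22222222, 83938292]})

# .loc[[0, 3]] would pick only rows 0 and 3;
# .loc[0:3] picks the whole label range, both ends included
orders = df.loc[0:3, "Order"].astype(str).tolist()

print(orders)
```

The astype(str) step is only there because the desired output in the question shows quoted values; drop it if integers are fine.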

Using value_counts() and filter elements based on number of instances

I use the following code to create two arrays in a histogram, one for the counts (percentages) and the other for values.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
So, an output looks like
counts = 66.7, 8.3, 8.3, 8.3, 8.3
values = 1024, 356352, 73728, 16384, 4096
The problem is that some values occur only once, and I would like to ignore them. In the example above, only 1024 is repeated multiple times; the others occur only once. I could manually check the number of occurrences of each value in the row and ignore the ones that are not repeated:
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
for v in values:
    # N = get_number_of_instances in row
    # if N == 1:
    #     remove v in row
    pass
I would like to know if there is another way to do this using the built-in functions in pandas.
I have asked for some clarification of your question in the comments above.
If keys is a column and you want to retain the non-duplicated values, please try:
values=df.loc[~df['keys'].duplicated(keep=False), 'keys'].to_list()
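If the goal is the opposite, i.e. dropping the values that occur only once before building the histogram arrays, one possible sketch uses the raw (non-normalized) counts as a filter. The sample row below is an assumption: it is reconstructed so that it reproduces the percentages from the question (8 of 12 entries are 1024).

```python
import pandas as pd

# Assumed sample data chosen to match the 66.7 / 8.3 percentages in the question
row = pd.Series([1024] * 8 + [356352, 73728, 16384, 4096])

# Absolute counts per value, then keep only values seen more than once
raw_counts = row.value_counts()
repeated = raw_counts[raw_counts > 1].index

# Drop the single-occurrence entries from the row itself
filtered = row[row.isin(repeated)]

# Percentages are now computed over the remaining entries only
pct = filtered.value_counts(normalize=True).mul(100).round(1)
values = pct.index.tolist()
counts = pct.tolist()

print(values, counts)
```

Because the percentages are recomputed after filtering, they are relative to the remaining entries. To keep the original percentages instead, compute value_counts(normalize=True) on the full row and index the result with repeated.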

Display 2 decimal places, and use comma as separator in pandas?

Is there any way to replace the dot in a float with a comma and keep a precision of 2 decimal places?
Example 1 : 105 ---> 105,00
Example 2 : 99.2 ---> 99,20
I used a lambda function, df['abc'] = df['abc'].apply(lambda x: f"{x:.2f}".replace('.', ',')), but then I get an invalid format in Excel.
I'm updating a specific sheet in Excel, so I'm using:
wb = load_workbook(filename)
ws = wb["FULL"]
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
Let us try
out = (s//1).astype(int).astype(str)+','+(s%1*100).astype(int).astype(str).str.zfill(2)
0 105,00
1 99,20
dtype: object
Input data
s=pd.Series([105,99.2])
s = pd.Series([105, 99.22]).apply(lambda x: f"{x:.2f}".replace('.', ','))
First, .apply takes a function and calls it on each element.
The f-string f"{x:.2f}" turns the float into a string with 2 decimal places, using '.' as the separator.
After that, .replace('.', ',') just replaces '.' with ','.
You can change pd.Series([105, 99.22]) to match your dataframe.
I think there is a misunderstanding here. In Excel you can set the display format, i.e. the format in which numbers are shown (the icon with +/-0).
But that is not a format of the cell's value: the cell stays numeric either way. Your approach changes only the cell value, not its formatting. In your question you save the value as a string, so Excel reads it as a string.
That said, don't format the value. Upgrade your pandas (if you haven't done so already) and try something along these lines: https://stackoverflow.com/a/51072652/11610186
To elaborate, try replacing your for loop with:
i = 1
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
    # replace D with the letter of the column you want formatted;
    # i matches the appended row only if the sheet was empty before the loop
    ws[f'D{i}'].number_format = '#,##0.00'
    i += 1
Well, I found another way to specify the float format directly in Excel, using this code:
for col_cell in ws['S':'CP']:
    for i in col_cell:
        i.number_format = '0.00'
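To illustrate the difference between a cell's value and its display format, here is a minimal sketch with an in-memory openpyxl workbook. The workbook, the use of column A and the two sample numbers are assumptions for the demo; in the question the sheet would come from load_workbook.

```python
from openpyxl import Workbook

# In-memory workbook used only for illustration
wb = Workbook()
ws = wb.active

# Write plain numbers, not pre-formatted strings
for value in (105, 99.2):
    ws.append([value])

# Set the display format on every cell in column A;
# the stored values stay numeric
for cell in ws["A"]:
    cell.number_format = "0.00"

print([(cell.value, cell.number_format) for cell in ws["A"]])
```

As far as I know, whether Excel then shows a comma or a dot as the decimal separator depends on its regional settings, not on the format code stored in the file.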

Convert all strings with numbers to integers in DataFrames

I am using pandas with openpyxl to process multiple Excel files into a single Excel file as output. In this output file, cells can contain a combination of numbers and other characters or exclusively numbers, and all cells are stored as text.
I want all cells that only contain numbers in the output file to be stored as numbers. As the columns with numbers are known (5 to 8), I used the following code to transform the text to floats:
for dictionary in list_of_Excelfiles:
    dictionary[DataFrame][5:8].astype(float)
However, this manual procedure is not scalable and might be prone to errors when other characters than numbers are present in the column. As such, I want to create a statement that transforms any cell with only numbers to an integer.
What condition can filter for cells with only numbers and transform these to integers?
You could use try and except together with applymap. Here is a full example.
Create some random data for the example:
import random
import string
import pandas as pd
def s():
    # five random strings of 1 to 5 characters drawn from 'abcdef' and the digits
    return [''.join(random.choices(string.ascii_letters[:6] + string.digits, k=random.randint(1, 5))) for _ in range(5)]
df = pd.DataFrame()
for c in range(4):
    df[c] = s()
Define a try and except function:
def try_int(s):
    try:
        return int(s)
    except ValueError:
        return s
Apply it to each cell:
# applymap is deprecated since pandas 2.1 in favor of DataFrame.map
df2 = df.applymap(try_int)
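The same idea on fixed data, so the result is easy to check. This is only a sketch: the column names a and b and the cell contents are invented, and it goes through Series.map column by column, which should behave the same across pandas versions.

```python
import pandas as pd

def try_int(value):
    """Return int(value) if int() accepts it, otherwise the value unchanged."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return value

# Hypothetical all-text data, as it might come out of the merged Excel files
df = pd.DataFrame({"a": ["12", "ab3", "007"], "b": ["x", "45", "6f"]})

# Series.map applies the function cell by cell, one column at a time
converted = df.apply(lambda col: col.map(try_int))

print(converted)
```

Note that int() strips leading zeros ("007" becomes 7), which may or may not be what you want for things like ID columns.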

using pandas to extract a string from text file

import pandas as pd
s = pd.read_csv("DIM.txt")
print(s)
This works well, and I get output like the following, on separate lines:
abc,fgc,vvb....
sdc,trl,bgv...
I would like to show it as below, line by line:
abc:fgc
sdc:trl
As far as I can see, your input file has no "title" (column names) row.
So in this case you should pass the header=None parameter.
Another detail is that the variable name s suggests a Series. As the result of read_csv is a DataFrame, it is better to name the variable df.
To sum up, the code to read your file should be:
df = pd.read_csv("DIM.txt", header=None)
The result (for your input sample) is:
     0    1        2
0  abc  fgc  vvb....
1  sdc  trl   bgv...
(if your sample contains more commas, there will be more columns).
To generate your desired result (concatenation of column 0 and 1),
run:
result = df[0] + ':' + df[1]
The result is:
0    abc:fgc
1    sdc:trl
dtype: object
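Putting it together as a self-contained sketch: io.StringIO stands in for DIM.txt here, and the sample text is cut down to three columns (the elided parts of the original sample are not reproduced). Looping over the result prints one value per line without the index.

```python
import io
import pandas as pd

# Stand-in for the contents of DIM.txt (assumed to have no header row)
text = "abc,fgc,vvb\nsdc,trl,bgv\n"

# header=None makes pandas number the columns 0, 1, 2, ...
df = pd.read_csv(io.StringIO(text), header=None)

# Join the first two columns with a colon
result = df[0] + ":" + df[1]

# Print line by line, without the index or the dtype footer
for line in result:
    print(line)
```

With the real file, io.StringIO(text) would simply be replaced by "DIM.txt".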
