Using pandas to extract a string from a text file - python-3.x

import pandas as pd
s = pd.read_csv("DIM.txt")
print(s)
This works well and I get output like the following on separate lines:
abc,fgc,vvb....
sdc,trl,bgv...
I would like to display it as below, line by line:
abc:fgc
sdc:trl

As I can see, your input file has no header (column names) row, so you should pass the header=None parameter.
Another detail: the variable name s is conventionally associated with a Series. Since the result of read_csv is a DataFrame, the name df is preferable.
To sum up, the code to read your file should be:
df = pd.read_csv("DIM.txt", header=None)
The result (for your input sample) is:
     0    1        2
0  abc  fgc  vvb....
1  sdc  trl   bgv...
(if your sample contains more commas, there will be more columns).
To generate your desired result (concatenation of column 0 and 1),
run:
result = df[0] + ':' + df[1]
The result is:
0    abc:fgc
1    sdc:trl
dtype: object
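Putting the two snippets together, a minimal end-to-end sketch (the sample file is written inline here so the example is self-contained; with a real DIM.txt, skip that step):

```python
import pandas as pd

# Create the sample input file so the example runs on its own
with open("DIM.txt", "w") as f:
    f.write("abc,fgc,vvb\nsdc,trl,bgv\n")

# header=None because the file has no column-name row
df = pd.read_csv("DIM.txt", header=None)

# Concatenate column 0 and column 1 with ':' between them
result = df[0] + ':' + df[1]
print(result.tolist())  # ['abc:fgc', 'sdc:trl']
```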

Related

Splitting the data of one Excel column into two columns using Python

I have a problem splitting the content of one Excel column, which contains numbers and letters, into two columns: the numbers in one column and the letters in the other.
As you can see in the first screenshot, there is no space between the numbers and the letters, but the good thing is the letters are always "ms". I need a method to split them as in the second screenshot.
(Before and After screenshots not shown.)
I tried to use replace, but it did not work: it did not split them.
Is there any other method?
You can use the extract method. Here is an example:
df = pd.DataFrame({'time': ['34ms', '239ms', '126ms']})
df[['time', 'unit']] = df['time'].str.extract(r'(\d+)(\D+)')
# convert the time column to integer
df['time'] = df['time'].astype(int)
print(df)
# output:
#    time unit
# 0    34   ms
# 1   239   ms
# 2   126   ms
It is pretty simple.
You need to use pandas.Series.str.split.
Attaching the syntax here: pandas.Series.str.split
The code should be:
import pandas as pd
data_before = {'data': ['34ms', '56ms', '2435ms']}
df = pd.DataFrame(data_before)
result = df['data'].str.split(pat=r'(\d+)', expand=True)
result = result.loc[:, [1, 2]]
result.rename(columns={1: 'number', 2: 'string'}, inplace=True)
print(result)
Output:
  number string
0     34     ms
1     56     ms
2   2435     ms

Display 2 decimal places, and use comma as separator in pandas?

Is there any way to replace the dot in a float with a comma and keep a precision of 2 decimal places?
Example 1 : 105 ---> 105,00
Example 2 : 99.2 ---> 99,20
I used a lambda function: df['abc'] = df['abc'].apply(lambda x: f"{x:.2f}".replace('.', ',')). But then I have an invalid format in Excel.
I'm updating a specific sheet in Excel, so I'm using:
wb = load_workbook(filename)
ws = wb["FULL"]
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
Let us try:
s = pd.Series([105, 99.2])
out = (s//1).astype(int).astype(str) + ',' + (s%1*100).astype(int).astype(str).str.zfill(2)
The result is:
0    105,00
1     99,20
dtype: object
s = pd.Series([105, 99.22]).apply(lambda x: f"{x:.2f}".replace('.', ','))
First, .apply takes a function, and
the f-string f"{x:.2f}" turns the float into a 2-decimal-place string that uses '.'.
After that, .replace('.', ',') just replaces the '.' with ','.
You can change pd.Series([105, 99.22]) to match your dataframe.
I think you're mistaking something here. In Excel you can set the display format, i.e. the format in which numbers are printed (the icon with +/-0).
But that is not the format of the cell's value: the cell is numeric either way. Your approach changes only the cell's value, not its formatting; in your question you save it as a string, so Excel reads it as a string.
Having said that: don't format the value. Upgrade your pandas (if you haven't already) and try something along these lines: https://stackoverflow.com/a/51072652/11610186
To elaborate, try replacing your for loop with:
i = 1
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
    # replace D with the letter of the column you want to see formatted:
    ws[f'D{i}'].number_format = '#,##0.00'
    i += 1
Well, I found another way to specify the float format directly in Excel, using this code:
for col_cell in ws['S':'CP']:
    for i in col_cell:
        i.number_format = '0.00'

Split all column names by specific characters and take the last part as new column names in Pandas

I have a dataframe which has column names like this:
id, xxx>xxx>x, yy>y, zzzz>zzz>zz>z, ...
I need to split each name on '>' and take the last part as the new column name: id, x, y, z, ....
I have used: 'zzzz>zzz>zz>z'.rsplit('>', 1)[-1] to get z as the expected new column name for the third column.
When I use: df.columns = df.columns.rsplit('>', 1)[-1]:
Out:
ValueError: Length mismatch: Expected axis has 13 elements, new values have 2 elements
How could I do that correctly?
try doing:
names = pd.Index(['xxx>xxx>x', 'yy>y', 'zzzz>zzz>zz>z'])
names = pd.Index([idx[-1] for idx in names.str.rsplit('>')])
print(names)
# Index(['x', 'y', 'z'], dtype='object')
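An alternative sketch that works directly on df.columns with the vectorized .str accessor (this sidesteps the length-mismatch error from the question, because .str[-1] takes the last piece element-wise rather than indexing the whole split result):

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'xxx>xxx>x', 'yy>y', 'zzzz>zzz>zz>z'])

# rsplit each name once from the right; .str[-1] keeps the last piece per name
df.columns = df.columns.str.rsplit('>', n=1).str[-1]
print(list(df.columns))  # ['id', 'x', 'y', 'z']
```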

Python loop through a range and print into a csv for different columns with different ranges

I have three different columns that I need to print to a CSV, where each column has a different range, using Python.
e.g column 1 - range: 0 to 5000
column 2 - range: 450123 to 565123
column 3 - range: 125000 to 130000
I would like to print these columns into a CSV, something like this:
column1, column2 , column3
0,450123,125000
1,450124,125001
......
......
5000,565123,130000
import csv
header = ['column1']
with open('test.csv', 'w') as file:
    writer = csv.writer(file, delimiter='\t', lineterminator='\n')
    writer.writerow(i for i in header)
    for j in range(1, 500001):
        row = ["martec" + str(j)]
        writer.writerow(row)
I managed to get it working for one column but would like to print multiple columns.
As scotscotmcc pointed out, the output won't be exactly as you want it. There are unequal ranges in the three columns. But assuming the columns should all have the same amount of numbers in them then something like the following should work. You just need to add all three columns to the list (row) to be written to the csv.
header = ['column1', 'column2', 'column3']
max = 5000
with open('test.csv', 'w') as file:
    writer = csv.writer(file, delimiter=',', lineterminator='\n')
    writer.writerow(header)
    # max + 1 so the final row (5000, 565123, 130000) is included
    for i in range(0, max + 1):
        row = [i, 450123 + i, 125000 + i]
        writer.writerow(row)
Output would be something like
column1,column2,column3
0,450123,125000
1,450124,125001
2,450125,125002
3,450126,125003
4,450127,125004
5,450128,125005
6,450129,125006
7,450130,125007
8,450131,125008
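If the three ranges really do have different lengths (column 2 spans 115001 values while the other two span 5001), one option is itertools.zip_longest, which pads the shorter columns once they run out. A sketch, assuming empty cells are acceptable in the padded rows:

```python
import csv
from itertools import zip_longest

cols = [range(0, 5001), range(450123, 565124), range(125000, 130001)]

with open('test_unequal.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['column1', 'column2', 'column3'])
    # zip_longest pads exhausted columns with '' so every row has 3 cells
    for row in zip_longest(*cols, fillvalue=''):
        writer.writerow(row)
```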

How to select a column from a text file which has no header using python

I have a text file that is tabulated. When I open the file in Python using pandas, it shows that the file contains only one column, even though there are many columns in it. I've tried pd.DataFrame, sep='\s*', and sep='\t', but I can't select a column since there is only one. I've also tried specifying the header, but the header moves to the far right side and the whole file is still treated as a single column. The .loc method with a specific column number always returns rows. I want to select the first column (A, A), the third column (HIS, PRO) and the fourth column (0, 0).
I want to get the above-mentioned specific columns and print them to a CSV file.
Here is the code I have used along with some file components.
1) After opening the file using pd:
[599 rows x 1 columns]
2) The file format:
pdb_id: 1IHV
0 radii_filename: MD_threshold: 4
1 A 20 HIS 0 MaximumDistance
2 A 21 PRO 0 MaximumDistance
3 A 22 THR 0 MaximumDistance
Any help will be highly appreciated.
3) code:
import pandas as pd
df = pd.read_table("file_path.txt", sep='\t')
U = df.loc[:][2:4]
If anybody gets a file like this, it can be opened and the columns selected with the following code:
f = open('file.txt', "r")
lines = f.readlines()
result = []
for x in lines:
    result.append(x.split(' ')[start:end])
for w in result:
    s = '\t'.join(w)
    print(s)
Where start:end is the slice of columns you want to select.
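Alternatively, pandas can split on runs of whitespace instead of a single tab. A sketch (the sample file is written inline so it runs standalone; the skiprows value and column indices are assumptions based on the sample in the question and may need adjusting for the real file):

```python
import pandas as pd

# Write a small file matching the sample in the question
with open('file.txt', 'w') as f:
    f.write("pdb_id: 1IHV\n"
            "0 radii_filename: MD_threshold: 4\n"
            "1 A 20 HIS 0 MaximumDistance\n"
            "2 A 21 PRO 0 MaximumDistance\n"
            "3 A 22 THR 0 MaximumDistance\n")

# sep=r'\s+' splits on any whitespace; skiprows drops the two header-ish lines
df = pd.read_csv('file.txt', sep=r'\s+', header=None, skiprows=2)

# Keep the chain (A), residue (HIS/PRO/THR) and distance (0) columns
subset = df[[1, 3, 4]]
subset.to_csv('columns.csv', index=False, header=['chain', 'residue', 'distance'])
print(subset)
```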
