How do I take the punctuation off each line of a column of an xlsx file in Python?

I have an Excel file (.xlsx) with a column containing rows of strings. I used the following code to read the file:
import pandas as pd
df = pd.read_excel("file.xlsx")
db = df['Column Title']
I am removing the punctuation for the first line (row) of the column using this code:
import string
translator = str.maketrans('', '', string.punctuation)
sent_pun = db[0].translate(translator)
I would like to remove the punctuation for each line (until the last row). How would I correctly write this with a loop? Thank you.

Well, given that this code works for one value and produces the right kind of result, you can write it in a loop that visits every row of the column:
import string

translator = str.maketrans('', '', string.punctuation)
sent_pun = []
for value in db:  # db is the pandas Series holding the column
    sent_pun.append(value.translate(translator))
Slice db (for example db[:6]) if you only need a subset of the rows.
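As a side note, pandas can apply the same translation table to the whole column at once, without an explicit Python loop. A minimal sketch, assuming the column contains plain strings:
import string
import pandas as pd

df = pd.read_excel("file.xlsx")
translator = str.maketrans('', '', string.punctuation)
# Series.str.translate applies the table element-wise to every row
df['Column Title'] = df['Column Title'].str.translate(translator)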

Related

extract position of char from each row & provide an aggregate of position across a list

I need some Python help with this problem. Would appreciate any assistance! Thanks.
I need an extracted matrix of values enclosed between square brackets. A toy example is below:
The input will be in a .txt file, as below:
AB_1 Q[A]IHY[P]GVA
AB_2 Q[G][C]HY[R]GVA
AB_3 Q[G][C]HY[R]GV[D]
Desired out.txt: the script extracts the index of each character enclosed in square brackets "[]" for every row of the input and aggregates those index positions across the entire list. The aggregated indices are then used to extract those positions from the input file and produce a matrix as below.
Index 2,3,6,9
AB_1 [A],I,[P],A
AB_2 [G],[C],[R],A
AB_3 [G],[C],[R],[D]
Any help would be greatly appreciated! Thanks.
If you want to reduce your table to only those columns in which an entry with square brackets appears, you can go with this:
import re

def transpose(matrix):
    return [[matrix[j][i] for j in range(len(matrix))] for i in range(len(matrix[0]))]

with open("test_table.txt", "r") as f:
    content = f.read()

# Split every line into cells: either a bracketed character or a bare character
rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.strip().split("\n")]
columns = transpose(rows)
# Keep only the columns containing at least one bracketed entry,
# prepending each column's 1-based index
matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)]
matching_rows = transpose(matching_columns)
headline = ["Index {}".format(",".join(matching_rows[0]))]
target_table = headline + ["AB_{0} {1}".format((i + 1), ",".join(line)) for i, line in enumerate(matching_rows[1:])]

with open("out.txt", "w") as f:
    f.write("\n".join(target_table))
First of all you want the content of your .txt file to be represented in arrays. Unfortunately your input data has no separators yet (as in .csv files), so you need to take care of that. To get a string like "Q[A]IHY[P]GVA" split into cells, I would recommend working with regular expressions.
import re
cells = re.findall(r'(\[.\]|.)', "Q[A]IHY[P]GVA")
The pattern within the r'' string matches a single character within square brackets or just a single character. The re.findall() method returns a list of all matching substrings, in this case: ['Q', '[A]', 'I', 'H', 'Y', '[P]', 'G', 'V', 'A']
rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.strip().split("\n")] applies this method to every line in your file. The line.split()[1] leaves out the row label "AB_X ", as it is not useful here.
Having your data sorted in columns is more fitting, because you want to preserve all columns that match a certain condition (contain an entry in brackets). For this you can just transpose rows, which is what the transpose() function does. If you have imported numpy, numpy.transpose(rows) would arguably be the better option.
Next you want to get all columns that satisfy the condition "[" in "".join(column). This is done in one line by: matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)]. Here [str(i + 1)] prepends the 1-based column index that you want to use later.
The rest now is easy: transpose the columns back to rows, relabel the rows, format the row data into strings that fit your desired output format, and then write those strings to the out.txt file.
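For reference, the hand-rolled transpose() and numpy's built-in agree on list-of-lists input; a minimal check, assuming numpy is installed:
import numpy as np

rows = [['Q', '[A]', 'I'], ['Q', '[G]', '[C]']]
print(np.transpose(rows).tolist())
# [['Q', 'Q'], ['[A]', '[G]'], ['I', '[C]']]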

Csv file writing a new row for each letter

import csv

email = 'someone#somemail.com'
password = 'password123'

with open('test.csv', 'a', newline='') as accts:
    b = csv.writer(accts, delimiter=',')
    b.writerow(email)
    b.writerow(password)
I'm trying to append to a csv file with the format email:password on the same row, but every time I run the program it creates a new row for each letter and the password is written under the email. What am I doing wrong?
Output:
s,o,m,e,o,n,e,#,s,o,m,e,m,a,i,l,.,c,o,m
p,a,s,s,w,o,r,d,1,2,3
Desired output:
someone#somemail.com,password123
To Python, a string is a sequence of individual characters, and writerow expects a sequence of column values, so each character ends up in its own column.
Instead, pass a list of the column values:
b.writerow([email, password])
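Putting it together, a corrected version of the snippet (the file name and credentials are just the question's placeholders):
import csv

email = 'someone#somemail.com'
password = 'password123'

with open('test.csv', 'a', newline='') as accts:
    writer = csv.writer(accts, delimiter=',')
    writer.writerow([email, password])  # one list = one row with two columns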

Sorting csv data by a column of numerical values

I have a csv file that has 10 columns and about 7,000 rows. I have to sort the data based on the 4th column (counting 0, 1, 2, 3), which I know is column #3 with 0-based counting. The column has a header and the data in this column is numeric. The largest value in this column is 7548375288960, so that row should be at the top of my results.
My code is below. Something interesting is happening: if I change reverse=True to reverse=False, then the 15 rows printed to the screen are the correct ones based on my manually sorting the csv file in Excel. But when I set reverse=True they are not the correct ones. Below are the first 4 that my print statement puts out:
999016759.26
9989694.0
99841428.0
998313048.0
Here is my code:
import csv
import operator
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    mycsv = csv.reader(theFile)
    sort = sorted(mycsv, key=operator.itemgetter(3), reverse=True)
    for row in islice(sort, 15):
        print(row)
Appreciate any help!
OK, I got this solved. A couple of things:
The data in the column, while only containing numerical values, was in string format. To overcome this I did the following in the function that was generating the csv file.
concatDf["ColumnName"] = concatDf["ColumnName"].astype(float)
This converted all of the strings to floats. Then in my display function I changed the sort line of code to the following:
sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
Then I got a different error, which I realized came from trying to convert the header row from string to float, which is impossible. To overcome that I added the following line:
next(theFile, None)
This is what the function looks like now:
import csv
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    next(theFile, None)  # skip the header row so it is not parsed as a float
    reader = csv.reader(theFile, delimiter=",", quotechar='"')
    sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
    for row in islice(sort, 15):
        print(row)
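For context, the original ordering was lexicographic: strings are compared character by character, so a value starting with '9' sorts above one starting with '7', regardless of length. A minimal snippet reproducing the symptom:
vals = ['999016759.26', '9989694.0', '99841428.0', '998313048.0']
print(sorted(vals, reverse=True))
# ['999016759.26', '9989694.0', '99841428.0', '998313048.0']  (character order)
print(sorted(vals, key=float, reverse=True))
# ['999016759.26', '998313048.0', '99841428.0', '9989694.0']  (numeric order)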

How to strip a string from a datetime stamp?

I am reading the attached Excel file (only an image is attached) using Pandas. There is one row with DateTime stamps in the following format (M- 05.02.2018 13:41:51). I would like to separate/remove the 'M-' from the DateTime in the whole row.
import pandas as pd
df=pd.read_excel('test.xlsx')
df=df.drop([0,1,2,3])
I would then like to use the following code to convert to Datetime:
df.iloc[0]= pd.to_datetime(df.iloc[0], format='%d.%m.%Y %H:%M:%S')
Can someone please help me remove the 'M-' from the complete row?
Thank you.
Excel-file (image)
Use pandas.Series.str.strip to remove 'M-' from the values (note that strip takes a set of characters rather than a prefix, so 'M- ' removes any leading or trailing 'M', '-', or space):
If removing from the rows:
df['Column_Name'] = df['Column_Name'].str.strip('M- ')
If removing from columns or DataFrame headers:
df.columns = df.columns.str.strip('M- ')
You may want Series.str.lstrip instead, to remove only leading characters from the row:
df.iloc[0] = df.iloc[0].str.lstrip('M- ')
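A minimal end-to-end sketch, assuming (as in the question) that the timestamps sit in the first remaining row after dropping rows 0-3:
import pandas as pd

df = pd.read_excel('test.xlsx')
df = df.drop([0, 1, 2, 3])

# Remove the leading 'M- ' marker, then parse the row as datetimes;
# lstrip is safe here because the dates themselves start with a digit
df.iloc[0] = df.iloc[0].str.lstrip('M- ')
df.iloc[0] = pd.to_datetime(df.iloc[0], format='%d.%m.%Y %H:%M:%S')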

Sort excel worksheet using python

I have an excel sheet like this:
I would like to output the data into an excel file like this:
Basically, for the common elements in columns 2, 3, and 4, I want to concatenate the values in the 5th column.
Please suggest how I could do this.
The easiest way to approach an issue like this is to export the spreadsheet to CSV first, to help ensure that it imports correctly into Python.
Using a defaultdict, you can create a dictionary that has unique keys and then iterate through lines adding the final column's values to a list.
Finally you can write it back out to a CSV format:
from collections import defaultdict

results = defaultdict(list)

with open("in_file.csv") as f:
    header = f.readline()
    for line in f:
        cols = line.strip().split(",")
        # The first four columns form the grouping key
        key = ",".join(cols[0:4])
        results[key].append(cols[4])

with open("out_file.csv", "w") as f:
    f.write(header)
    for k, v in results.items():  # iteritems() is Python 2 only; items() works in 3
        line = '{},"{}",\n'.format(k, ", ".join(v))
        f.write(line)
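If pandas is available, the same grouping can be expressed more directly. A sketch only, since the original sheet is shown as an image and the real file and column names are unknown:
import pandas as pd

df = pd.read_excel('in_file.xlsx')
# Group on columns 2-4 and join the 5th column's values within each group
group_cols = list(df.columns[1:4])
value_col = df.columns[4]
out = (df.groupby(group_cols)[value_col]
         .apply(lambda s: ', '.join(s.astype(str)))
         .reset_index())
out.to_excel('out_file.xlsx', index=False)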
