How to strip a string from a datetime stamp? - python-3.x

I am reading the attached Excel file (only an image is attached) using pandas. There is one row of DateTime stamps in the following format (M- 05.02.2018 13:41:51). I would like to remove the 'M-' prefix from every DateTime in that row.
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.drop([0, 1, 2, 3])
I would then like to use the following code to convert to Datetime:
df.iloc[0]= pd.to_datetime(df.iloc[0], format='%d.%m.%Y %H:%M:%S')
Can someone please help me remove the 'M-' from the complete row?
Thank you.

Use pandas.Series.str.strip to remove 'M-':
If removing from the values in a row or column:
df['Column_Name'] = df['Column_Name'].str.strip('M- ')
If removing from the DataFrame's column headers:
df.columns = df.columns.str.strip('M- ')

You may want Series.str.lstrip instead, to remove only the leading characters from the row:
df.iloc[0] = df.iloc[0].str.lstrip('M- ')
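Putting both steps together, here is a minimal sketch (assuming, as in the question, that the timestamps sit in the first row remaining after the drop):
import pandas as pd

df = pd.read_excel('test.xlsx')
df = df.drop([0, 1, 2, 3])
# Strip the leading 'M-' (and any spaces), then parse the timestamps
df.iloc[0] = pd.to_datetime(df.iloc[0].str.lstrip('M- '),
                            format='%d.%m.%Y %H:%M:%S')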

Related

Change pandas date format into another date format?

How do I convert the format below into the result format?
import pandas as pd
date = pd.date_range('2022-01-01', '2022-01-31', freq='H')
Result:
'2022-01-01T01%3A00%3A00',
What is the correct name for the result time format? I have tried the urllib.parse module, but the output did not include the 'T' and it only handled one date at a time.
Thank you !
This is so-called URL encoding (percent-encoding), so we need urllib; notice that %3A is the encoded form of ':'.
import urllib.parse
date.astype(str).map(urllib.parse.quote)
Out[158]:
Index(['2022-01-01%2000%3A00%3A00', '2022-01-01%2001%3A00%3A00',
....
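Note that astype(str) renders the timestamps with a space, which encodes as %20 rather than the 'T' in the question's expected result. A sketch of one way to get the 'T' separator is to go through isoformat() first:
import urllib.parse
import pandas as pd

date = pd.date_range('2022-01-01', '2022-01-31', freq='H')
# isoformat() yields e.g. '2022-01-01T01:00:00'; quote() then encodes ':' as %3A
encoded = [urllib.parse.quote(d.isoformat()) for d in date]
print(encoded[1])  # 2022-01-01T01%3A00%3A00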

Read CSV using pandas

I'm trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')
However, I just need the dataset with its two columns, measure_code and measure_text. As this dataset has the title 'BP measure' on its first line, I also tried to read it like:
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)
But in this case it returns a dataset with just one column, and I'm not able to split it:
>>> df.columns
Index([' measure_code measure_text'], dtype='object')
Any suggestion/idea on a better approach to get this dataset?
It's definitely possible, but the format has a few quirks.
1. As you noted, the column headers start on line 2, so you need skiprows=1.
2. The file is space-separated, not tab-separated.
3. Column values are continued across multiple lines.
Issues 1 and 2 can be fixed using skiprows and sep. Issue 3 is harder, and requires you to preprocess the file a little. For that reason, I used a slightly more flexible way of fetching the file, using the requests library. Once I have the file, I use regular expressions to fix issue 3, and give the file back to pandas.
Here's the code:
import requests
import re
import io
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url).text.replace("\r\n", "\n")
# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)
# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))
# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the header
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")
print(df)

Pandas skip header until key found

I am importing a bunch of data files. The format of the files has changed over the years, accumulating more "header" material without any identifying "comment", which makes it hard to know how many lines to skip.
Is there a way in pandas to skip rows until the desired column names are encountered:
import pandas as pd
import os
my_names = ['A', 'B', 'C']
max_head = 30
my_file = os.path.join(my_file)
f = open(my_file, 'r')
lines = f.readlines()
for i, line in enumerate(lines[:max_head]):
    if line.strip().split() == my_names:
        skiprows = i
a = pd.read_csv(my_file, skiprows=skiprows)
And if not, should there be? Something like:
pd.read_csv(my_file, start_names=my_names)
You can do this in two parts. First find the row where the matching header is, then use that in your pd.read_csv call:
def header(file_name):
    with open(file_name) as f:
        for n, line in enumerate(f):
            if line.startswith("whatever_the_header_name_is"):
                return n
You can now pass the above function into read_csv like this:
pd.read_csv(file_name, header=header(file_name))
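Passing header=n makes pandas treat line n as the header row and ignore everything above it. For the original question, where the header is identified by matching the full list of column names rather than a fixed prefix, a variant might look like this (a sketch; the whitespace separator is an assumption about the file format):
import pandas as pd

my_names = ['A', 'B', 'C']

def find_header(file_name, names, max_head=30):
    # Return the index of the first line whose whitespace-split tokens
    # match the expected column names; give up after max_head lines
    with open(file_name) as f:
        for n, line in enumerate(f):
            if n >= max_head:
                break
            if line.strip().split() == names:
                return n
    raise ValueError('header row not found')

a = pd.read_csv(my_file, sep=r'\s+', header=find_header(my_file, my_names))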

Remove double quotes while printing string in dataframe to text file

I have a dataframe which contains one column with multiple strings. Here is what the data looks like:
Value
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1
There are almost 100,000 such rows in the dataframe. I want to write this data into a text file.
For this, I tried the following:
df.to_csv(filename, header=None, index=None, mode='a')
But I am getting the entire string in quotes when I do this. The output I obtain is:
"EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
But what I want is:
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1 -> No Quotes
I also tried this:
df.to_csv(filename, header=None, index=None, mode='a',
          quoting=csv.QUOTE_NONE)
However, I get an error that an escapechar is required. If I add escapechar='/' to the code, I get '/' in multiple places (but no quotes). I don't want the '/' either.
Is there any way I can remove the quotes while writing to a text file WITHOUT adding any other escape characters?
Based on OP's comment, I believe the real problem is the separator appearing inside the data: with a comma separator, the commas in the values must be escaped. I no longer get unwanted \ characters when using tabs to delimit the csv, since no tab occurs in the data.
import pandas as pd
import csv
df = pd.DataFrame(columns=['col'])
df.loc[0] = "EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
df.to_csv("out.csv", sep="\t", quoting=csv.QUOTE_NONE, quotechar="", escapechar="")
Original Answer:
According to this answer, you need to specify escapechar="\\" to use csv.QUOTE_NONE.
Have you tried:
df.to_csv("out.csv", sep=",", quoting=csv.QUOTE_NONE, quotechar="", escapechar="\\")
I was able to write a df to a csv using a single space as the separator, and got the quotes around strings removed by replacing the existing in-string spaces in the dataframe with non-breaking spaces before writing it as a csv.
df = df.applymap(lambda x: str(x).replace(' ', u"\u00A0"))
df.to_csv(outpath+filename, header=True, index=None, sep=' ', mode='a')
I couldn't use a tab-delimited file for the output I was writing, though that solution also works with additional keywords to df.to_csv(): quoting=csv.QUOTE_NONE, quotechar="", escapechar="".
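Since the column already contains fully formed output lines, another option is to bypass the csv quoting machinery entirely and write the strings directly; a minimal sketch, assuming the column is named 'Value' as in the example:
# Append each string as its own line; no csv quoting or escaping involved
with open(filename, 'a') as f:
    f.write('\n'.join(df['Value'].astype(str)) + '\n')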

How do I take the punctuation off each line of a column of an xlsx file in Python?

I have an excel file (.xlsx) with a column having rows of strings. I used the following code to get the file:
import pandas as pd
df = pd.read_excel("file.xlsx")
db = df['Column Title']
I am removing the punctuation for the first line (row) of the column using this code:
import string
translator = str.maketrans('', '', string.punctuation)
sent_pun = db[0].translate(translator)
I would like to remove the punctuation for each line (until the last row). How would I correctly write this with a loop? Thank you.
Well, given that this code works for one value and produces the right kind of result, you can write it in a loop as follows:
import string

translator = str.maketrans('', '', string.punctuation)
for i in range(len(db)):
    # Apply the same translation table to every row of the column
    db.iloc[i] = str(db.iloc[i]).translate(translator)
Adjust the range as per your need if you only want a subset of the rows.
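A loop works, but pandas can also apply the translation table to the whole column at once with Series.str.translate; a minimal sketch, assuming the setup from the question:
import string
import pandas as pd

df = pd.read_excel("file.xlsx")
translator = str.maketrans('', '', string.punctuation)
# str.translate is applied element-wise to every row of the column
df['Column Title'] = df['Column Title'].astype(str).str.translate(translator)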
