How to create a datetime index from strings in Python? - python-3.x

There are three files with the names file_2018-01-01_01_temp.tif, file_2018-01-01_02_temp.tif and file_2018-01-01_03_temp.tif. I want to list their names as ['2018010101', '2018010102', '2018010103'] in Python.
The code below creates an incorrect list.
import pandas as pd
from glob import glob
from os import path
pattern = '*.tif'
filenames = [path.basename(x) for x in glob(pattern)]
pd.DatetimeIndex([pd.Timestamp(f[5:9]) for f in filenames])
Result:
DatetimeIndex(['2018-01-01', '2018-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)

I think the simplest approach is indexing combined with replace in a list comprehension:
a = [f[5:18].replace('_','').replace('-','') for f in filenames]
print (a)
['2018010101', '2018010102', '2018010103']
Similar with Index.str.replace (pass regex=True, since newer pandas versions default to a literal replacement):
a = pd.Index([f[5:18] for f in filenames]).str.replace(r'\-|_', '', regex=True)
print (a)
Index(['2018010101', '2018010102', '2018010103'], dtype='object')
Or convert values to DatetimeIndex and then use DatetimeIndex.strftime:
a = pd.to_datetime([f[5:18] for f in filenames], format='%Y-%m-%d_%H').strftime('%Y%m%d%H')
print (a)
Index(['2018010101', '2018010102', '2018010103'], dtype='object')
EDIT:
The dtype above is object, but it must be dtype='datetime64[ns]'.
If you need actual datetimes, the formatting has to stay the default; it is not possible to change it:
d = pd.to_datetime([f[5:18] for f in filenames], format='%Y-%m-%d_%H')
print (d)
DatetimeIndex(['2018-01-01 01:00:00', '2018-01-01 02:00:00',
'2018-01-01 03:00:00'],
dtype='datetime64[ns]', freq=None)
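For a quick self-contained check, a sketch using a hard-coded filename list in place of the glob result:
import pandas as pd

# sample names from the question, standing in for the glob output
filenames = ['file_2018-01-01_01_temp.tif',
             'file_2018-01-01_02_temp.tif',
             'file_2018-01-01_03_temp.tif']

d = pd.to_datetime([f[5:18] for f in filenames], format='%Y-%m-%d_%H')
print(d)                       # DatetimeIndex, dtype='datetime64[ns]'
print(d.strftime('%Y%m%d%H'))  # Index(['2018010101', ...], dtype='object')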

Related

How to save values in a row after one loop?

How can I save the values to one row per loop iteration, please?
import numpy as np
A = 5.4654
B = [4.78465, 6.46545, 5.798]
for i in range(2):
    f = open(f'file.txt', 'a')
    np.savetxt(f, np.r_[A, B], fmt='%22.16f')
    f.close()
The output is:
5.4653999999999998
4.7846500000000001
6.4654499999999997
5.7980000000000000
5.4653999999999998
4.7846500000000001
6.4654499999999997
5.7980000000000000
The desired output is:
5.4653999999999998 4.7846500000000001 6.4654499999999997 5.7980000000000000
5.4653999999999998 4.7846500000000001 6.4654499999999997 5.7980000000000000
According to the documentation (the parameter is newline):
newline : str, optional
String or character separating lines.
So, perhaps:
np.savetxt(f, np.r_[A, B], fmt='%22.16f', newline=' ')
f.write('\n')
An alternative might be to reshape the array into a single row, e.g. np.r_[A, B].reshape(1, -1), so np.savetxt writes it on one line.
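A minimal sketch of the reshape variant (assuming the same A and B as above); each iteration appends one space-separated row:
import numpy as np

A = 5.4654
B = [4.78465, 6.46545, 5.798]

with open('file.txt', 'a') as f:
    for i in range(2):
        # reshape the 1-D concatenation into a 2-D row so that savetxt
        # emits all four values on a single line
        np.savetxt(f, np.r_[A, B].reshape(1, -1), fmt='%22.16f')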

For loop and replace part string in paths based on dataframe two columns in Python

I am trying to iterate over the Excel files in a directory with the code below:
import glob
import pandas as pd
files = glob.glob('./*.xlsx')
for file_path in files:
    print(file_path)
Out:
./data\S273-2021-12-09.xlsx
./data\S357-2021-12-09.xlsx
./data\S545-2021-12-09.xlsx
./data\S607-2021-12-09.xlsx
Now I hope to replace S273, S357, etc. based on a dataframe df that maps old_name to new_name:
old_name new_name
0 S273 a
1 S357 b
2 S545 c
3 S607 d
4 S281 e
To convert dataframe to dictionary if necessary: name_dict = dict(zip(df.old_name, df.new_name))
The expected result would look like:
./data\a-2021-12-09.xlsx
./data\b-2021-12-09.xlsx
./data\c-2021-12-09.xlsx
./data\d-2021-12-09.xlsx
How could I achieve that in Python? Sincere thanks in advance.
EDIT:
for file_path in files:
    for key, value in name_dict.items():
        if key in str(file_path):
            new_path = file_path.replace(key, value)
            print(new_path)
The code above works; feel free to share other solutions if possible.
You can split off the directory first with os.path.split, then split the first part of the file name on '-' and map it with dict.get; if there is no match the same value is returned, which is why the second argument is also first:
import os
name_dict = dict(zip(df.old_name, df.new_name))
print (name_dict)
{'S273': 'a', 'S357': 'b', 'S545': 'c', 'S607': 'd', 'S281': 'e'}
#for test
L = './data\S273-2021-12-09.xlsx ./data\S357-2021-12-09.xlsx ./data\S545-2021-12-09.xlsx ./data\S607-2021-12-09.xlsx'
files = L.split()
for file_path in files:
    head, tail = os.path.split(file_path)
    first, last = tail.split('-', 1)
    out = os.path.join(head, f'{name_dict.get(first, first)}-{last}')
    print(out)
./data\a-2021-12-09.xlsx
./data\b-2021-12-09.xlsx
./data\c-2021-12-09.xlsx
./data\d-2021-12-09.xlsx
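If the goal is to rename the files on disk (an assumption; the question only prints the new paths), a hedged follow-up using os.rename with the same splitting logic:
import os

# assumes `files` and `name_dict` are defined as above and the paths exist
for file_path in files:
    head, tail = os.path.split(file_path)
    first, last = tail.split('-', 1)
    new_path = os.path.join(head, f'{name_dict.get(first, first)}-{last}')
    if new_path != file_path:
        os.rename(file_path, new_path)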

How to convert a list with two wildcards into a list of lists using Python?

I obtained a list of all files in a folder using glob:
lista = glob.glob("*.h5")
The list basically contains files with names like:
abc_000000000_000.h5
abc_000000000_001.h5
abc_000000000_002.h5
......
abc_000000000_011.h5
......
abc_000000001_000.h5
abc_000000001_001.h5
abc_000000001_002.h5
....
abc_000000026_000.h5
abc_000000026_001.h5
....
abc_000000027_000.h5
....
abc_000000027_011.h5
which have the format abc_0*_0*.h5. How do I reshape this into a list of lists? The inner list would be ['abc_000000027_0*.h5'] and the outer list would be the sequence over 'abc_000000*', i.e. the first wildcard.
One way to create an input would be:
lista = []
for i in range(115):
    for j in range(14):
        item = "abc_%0.9d_%0.3d" % (i, j)
        lista.append(item)
My attempt (my solution is not nice, rather ugly):
listb = glob.glob("*_011.h5")
then, for each item in listb, split and glob again, for example:
listc = glob.glob("abc_000000027*.h5")
Given:
ls -1
abc_00000001_1.h5
abc_00000001_2.h5
abc_00000001_3.h5
abc_00000002_1.h5
abc_00000002_2.h5
abc_00000002_3.h5
abc_00000003_1.h5
abc_00000003_2.h5
abc_00000003_3.h5
You can use pathlib, itertools.groupby and natural sorting to achieve this:
from pathlib import Path
from itertools import groupby
import re

p = Path('/tmp/t2')

def _k(s):
    s = str(s)
    try:
        return tuple(map(int, re.search(r'_(\d+)_(\d*)', s).groups()))
    except ValueError:
        return (0, 0)

def k1(s):
    return _k(s)

def k2(s):
    return _k(s)[0]

result = []
files = sorted(p.glob('abc_000000*.h5'), key=k1)
for k, g in groupby(files, key=k2):
    result.append(list(map(str, g)))
Which could be simplified to:
def _k(p):
    try:
        return tuple(map(int, p.stem.split('_')[-2:]))
    except ValueError:
        return (0, 0)

files = sorted(p.glob('abc_000000*_*.h5'), key=lambda e: _k(e))
result = [list(map(str, g)) for k, g in groupby(files, key=lambda e: _k(e)[0])]
Result (in either case):
>>> result
[['/tmp/t2/abc_00000001_1.h5', '/tmp/t2/abc_00000001_2.h5', '/tmp/t2/abc_00000001_3.h5'], ['/tmp/t2/abc_00000002_1.h5', '/tmp/t2/abc_00000002_2.h5', '/tmp/t2/abc_00000002_3.h5'], ['/tmp/t2/abc_00000003_1.h5', '/tmp/t2/abc_00000003_2.h5', '/tmp/t2/abc_00000003_3.h5']]
Which easily could be a dict:
>>> {k:list(map(str, g)) for k,g in groupby(files, key=k2)}
{1: ['/tmp/t2/abc_00000001_1.h5', '/tmp/t2/abc_00000001_2.h5', '/tmp/t2/abc_00000001_3.h5'],
2: ['/tmp/t2/abc_00000002_1.h5', '/tmp/t2/abc_00000002_2.h5', '/tmp/t2/abc_00000002_3.h5'],
3: ['/tmp/t2/abc_00000003_1.h5', '/tmp/t2/abc_00000003_2.h5', '/tmp/t2/abc_00000003_3.h5']}
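As an aside, a minimal alternative sketch (assuming zero-padded names as in the question, so plain lexicographic sorting is already in order) that groups with collections.defaultdict instead of groupby:
from collections import defaultdict
from pathlib import Path

groups = defaultdict(list)
for path in sorted(Path('/tmp/t2').glob('abc_*_*.h5')):
    prefix = path.stem.rsplit('_', 1)[0]   # e.g. 'abc_000000027'
    groups[prefix].append(str(path))

result = list(groups.values())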

Finding a specific word in a pandas column, assigning it to a new column and replicating the row

I am trying to find specific words in a pandas column and assign them to a new column; a comment may contain two or more of the words. Once a word is found, I wish to replicate the row, creating one row per matched word.
import pandas as pd
import numpy as np
import re
wizard = pd.read_excel(r'C:\Python\L\Book1.xlsx',
                       sheet_name='Sheet1',
                       header=0)
test_set = {'941', '942',}
test_set2={'MN','OK','33/3305'}
wizard['ZTYPE'] = wizard['Comment'].apply(lambda x: any(i in test_set for i in x.split()))
wizard['ZJURIS']=wizard['Comment'].apply(lambda x: any(i in test_set2 for i in x.split()))
wizard_new = pd.DataFrame(np.repeat(wizard.values,3,axis=0))
wizard_new.columns = wizard.columns
wizard_new.head()
I am getting True and False, however I am unable to split it out.
Above is how the sample data looks. I need to find anything like '33/3305'; the year could be entered as '19' or '2019', the quarter could be entered as 'Q1', '1Q', 'Q 1' or '1 Q', plus the values in my test sets.
ZJURIS = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in wizard.comment()])))
def to_category(x):
    for w in x.lower().split(" "):
        if w in ZJURIS:
            return ZJURIS[w]
    return None
Finally, apply the method on the column and save the result to a new one:
wizard["ZJURIS"] = wizard["comment"].apply(to_category)
I tried the above solution, but it did not work.
Any suggestions on how I can get the code to work?
Sample data.
data={ 'ID':['351362278576','351539320880','351582465214','351609744560','351708198604'],
'BU':['SBS','MAS','NAS','ET','SBS'],
'Comment':['940/941/w2-W3NYSIT/SUI33/3305/2019/1q','OK SUI 2Q19','941 - 3Q2019NJ SIT - 3Q2019NJ SUI/SDI - 3Q2019','IL,SUI,2016Q4,2017Q1,2017Q2','1Q2019 PA 39/5659 39/2476','UT SIT 1Q19-3Q19']
}
df = pd.DataFrame(data)
Based on the sample data set, attached is the desired output.
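Since the question has no working solution yet, here is a hedged sketch (not the asker's code; the token pattern is an assumption) that collects matching tokens with str.findall and replicates rows with DataFrame.explode:
import pandas as pd

data = {'ID': ['351362278576', '351539320880'],
        'Comment': ['940/941/w2-W3NYSIT/SUI33/3305/2019/1q', 'OK SUI 2Q19']}
df = pd.DataFrame(data)

# assumed pattern covering the test sets and the 33/3305-style codes
pattern = r'(941|942|\d{2}/\d{4}|MN|OK)'
df['match'] = df['Comment'].str.findall(pattern)
out = df.explode('match', ignore_index=True)   # one row per matched word
print(out)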

pandas dataframe output needs to be a string instead of a list

I have a requirement that the result value should be a string, but when I calculate the maximum value of the dataframe it gives the result as a list.
import pandas as pd
def answer_one():
    df_copy = [df['# Summer'].idxmax()]
    return (df_copy)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
names_ids = df.index.str.split(r'\s\(')
df.index = names_ids.str[0]  # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3]  # the [1] element is the abbreviation or ID (take first 3 characters from that)
df = df.drop('Totals')
df.head()
answer_one()
But here answer_one() gives me a list as the output and not a string. Can someone help me understand how this can be converted to a string, or how I can get the answer directly from the dataframe as a string? I don't want to convert the list to a string using str(df_copy).
Your first solution would be, as @juanpa.arrivillaga put it, not to wrap it in a list. Your function becomes:
def answer_one():
    df_copy = df['# Summer'].idxmax()
    return (df_copy)
>>> 1
Another thing that you might not be expecting: idxmax() returns the index of the max; perhaps you want to do:
def answer_one():
    df_copy = df['# Summer'].max()
    return (df_copy)
>>> 30
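For contrast, a tiny hedged example (made-up data, not the olympics.csv file) showing the difference between idxmax and max:
import pandas as pd

s = pd.Series({'USA': 30, 'UK': 27, 'France': 28}, name='# Summer')
print(s.idxmax())   # 'USA' -> the index label of the maximum
print(s.max())      # 30    -> the maximum value itself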
Since you don't want to do str(df_copy) you can do df_copy.astype(str) instead.
Here is how I would write your function:
def get_max_as_string(data, column_name):
    """ Return Max Value from a column as a string."""
    return data[column_name].max().astype(str)
get_max_as_string(df, '# Summer')
>>> '30'
