How to remove the time from csv file in python - python-3.x

How to remove the time from the csv file in python
I have a csv file in this format: "SSP_Ac_INVOICE_DISTRIBUTIONS_17022023072701.csv"
I am trying to remove the time that follows 2023. My expected result is SSP_Ac_INVOICE_DISTRIBUTIONS_17022023.csv
I tried to use strptime, but the code below raises an error:
s = "SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv"
temp = dt.datetime.strptime(SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701, '%d%m%Y')
final = temp.strftime('%d-%m-%Y')
print(final)

In the strptime call, you are passing the bare name SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701, which is an undefined identifier, instead of the variable s. The format string is also wrong: the timestamp in your filename has the form %d%m%Y%H%M%S, so you need to include %H%M%S to parse the time as well. Only the timestamp itself, without the prefix or the .csv extension, should be passed to strptime.
The code should look something like this:
import datetime as dt
import os
filename = "SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv"
# Separate the extension, then split off the trailing timestamp
stem, ext = os.path.splitext(filename)
prefix, date_str = stem.rsplit('_', 1)
# Parse the date and time from the filename
date = dt.datetime.strptime(date_str, '%d%m%Y%H%M%S')
# Rebuild the filename with the date only
new_filename = f"{prefix}_{date.strftime('%d%m%Y')}{ext}"
print(new_filename)
This code first separates the extension and splits the rest of the filename at the last underscore to extract the timestamp (otherwise the trailing ".csv" would make strptime fail), then uses strptime to parse the date and time. Finally, it builds the new filename from the unchanged prefix, the date without the time, and the original extension.
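If the date does not need to be validated, the same renaming can be sketched with a regular expression instead of strptime. The helper name strip_time and the pattern below are illustrative assumptions: they presume the name always ends in an underscore, eight date digits, six time digits, and an extension.

```python
import re

# Hypothetical pattern: assumes the name ends in _DDMMYYYYHHMMSS.<ext>
_TIMESTAMP = re.compile(r"_(\d{8})\d{6}(\.[^.]+)$")

def strip_time(filename: str) -> str:
    """Return filename without the trailing HHMMSS; unchanged if the pattern does not match."""
    # Keep the eight date digits (group 1) and the extension (group 2)
    return _TIMESTAMP.sub(r"_\1\2", filename)

print(strip_time("SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv"))
```

Unlike the strptime version, this does not check that the digits form a real date, so it is only appropriate when the filenames are already trusted.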

Related

Saving string column into NetCdf file

I want to save string data into a single column in the NetCDF file using MATLAB, there are no options given for string. Can someone tell me how to save string data into the NetCDF file?
S_rebuilt=["101670";"101670";"101670";"101670"]
nccreate('file_name.nc','S_rebuilt',...
'Dimensions', {'x',size(S_rebuilt,1),'y',size(S_rebuilt,2)},...
'FillValue','disable');
ncwrite('file_name.nc','S_rebuilt',S_rebuilt);
With the netcdf4 format, you can use the string datatype in MATLAB. So, to save the variable S_rebuilt as a string, I suggest this code:
filename = 'file_name.nc'
S_rebuilt = ["101670";"101670";"101670";"101670"]
nccreate(filename,'S_rebuilt',...
'Dimensions', {'nvars',length(S_rebuilt)},...
'Datatype','string','format','netcdf4');
% ----------------------------------------------
ncwrite(filename,'S_rebuilt',S_rebuilt);

What is the appropriate way to take in files that have a filename with a timestamp in it?

What is the appropriate way to take in files that have a filename with a timestamp in it and read properly?
One way I'm thinking of so far is to take these filenames into one single text file to read all at once.
For example, filenames such as
1573449076_1570501819_file1.txt
1573449076_1570501819_file2.txt
1573449076_1570501819_file3.txt
Go into a file named filenames.txt
Then something like
with open('/Documents/filenames.txt', 'r') as f:
    for item in f:
        if item.is_file():
            file_stat = os.stat(item)
            item = item.replace('\n', '')
            print("Fetching {}".format(convert_times(file_stat)))
My question is how would I go about this where I can properly read the names in the text file given that they have timestamps in the actual names? Once figuring that out I can convert them.
If you just want to get the timestamps from the file names, assuming that they all use the same naming convention, you can do so like this:
import glob
import os
from datetime import datetime
# Grab all .txt files in the specified directory
files = glob.glob("<path_to_dir>/*.txt")
for file in files:
    file = os.path.basename(file)
    # Check that it contains an underscore
    if '_' not in file:
        continue
    # Split the file name using the underscore as the delimiter
    stamps = file.split('_')
    # Convert the epoch to a legible string
    start = datetime.fromtimestamp(int(stamps[0])).strftime("%c")
    end = datetime.fromtimestamp(int(stamps[1])).strftime("%c")
    # Consume the data
    print(f"{start} - {end}")
    ...
You'll want to add some error checking and handling; for instance, if the first or second element of the stamps list isn't a parsable int, or the name contains only one underscore, this will fail.
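One way that error handling might look is sketched below. The function name parse_stamps is hypothetical, and the sketch assumes the two leading fields are Unix epoch seconds; it also uses UTC rather than local time so the result does not depend on the machine's time zone.

```python
from datetime import datetime, timezone

def parse_stamps(name: str):
    """Return (start, end) datetimes from '<epoch>_<epoch>_<rest>', or None if malformed."""
    parts = name.split("_")
    if len(parts) < 3:
        # Not enough underscore-separated fields to hold two timestamps
        return None
    try:
        start, end = (datetime.fromtimestamp(int(p), tz=timezone.utc) for p in parts[:2])
    except (ValueError, OverflowError, OSError):
        # Non-numeric field, or a number outside the range the platform supports
        return None
    return start, end

for name in ["1573449076_1570501819_file1.txt", "notes.txt", "abc_123_file.txt"]:
    print(name, "->", parse_stamps(name))
```

Returning None for malformed names lets the caller skip them without wrapping every call in its own try block.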

Using filenames to create variable - PySpark

I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv") which would greatly simplify my code.
You can easily extract it from the filename using the input_file_name() column and some string functions like regexp_extract and substring_index:
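from pyspark.sql.functions import col, input_file_name, regexp_extract, substring_index
df = spark.read.load('my_folder/*.csv', format="csv")
# regexp_extract needs a group index as its third argument; group 1 is the YYYY_WW part just before .csv
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"(\d{4}_\d{1,2})\.csv$", 1))\
    .withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
    .withColumn("sales_week", substring_index(col("year_week"), "_", -1))\
    .drop("year_week")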
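You can try the below:
import glob
import os
from pyspark.sql.functions import lit
listfiles = glob.glob('my_folder/sales_report_*.csv')
for path in listfiles:
    # Drop the directory and extension, then take the last two '_'-separated fields
    stem = os.path.splitext(os.path.basename(path))[0]
    year, week = stem.rsplit('_', 2)[-2:]
    # df holds one file's data per iteration; process or union it before the next pass
    df = spark.read.load(path, format="csv").withColumn("sales_year", lit(year)).withColumn("sales_week", lit(week))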
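The filename parsing itself can be checked without a Spark session. The sketch below is plain Python; the helper name year_and_week is hypothetical, and it assumes every name ends in _YYYY_WW.csv as described in the question.

```python
import os
import re

# Assumes names end in _YYYY_WW.csv, as described in the question
_SUFFIX = re.compile(r"_(\d{4})_(\d{1,2})\.csv$")

def year_and_week(path: str):
    """Return (year, week) as ints, or None when the name does not match."""
    match = _SUFFIX.search(os.path.basename(path))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(year_and_week("my_folder/sales_report_2019_12.csv"))
print(year_and_week("my_folder/cash_flow_2020_3.csv"))
```

Anchoring the pattern at the end of the name keeps digits elsewhere in the path or prefix from being picked up by mistake.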

Extract files from single day- U SQL

I am facing issues with a U-SQL script. I am trying to get the files that were created on the current day from a directory. The file name has the date in yyyyMMdd format. But when I try to extract the data, instead of taking only one day's files I get all the files inside the directory. I am using the script below.
DECLARE @file_set_path string ="/XXXX/Sample_{date:yyyy}{date:MM}{date:dd}{*}.csv";
@searchlog =
    EXTRACT PART_NUMBER string, date DateTime FROM @file_set_path USING Extractors.Tsv(skipFirstNRows:1);
Can someone please help me on this.
You can use the Date property of the DateTime object to compare dates without including the time component, something like this:
DECLARE @file_set_path string ="/Sample_{date:yyyy}{date:MM}{date:dd}{*}.csv";
DECLARE @now DateTime = DateTime.Now;
@searchlog =
    EXTRACT PART_NUMBER string,
            date DateTime
    FROM @file_set_path
    USING Extractors.Csv(skipFirstNRows : 1);
@output =
    SELECT *,
           @now AS now,
           date.Date AS x,
           @now.Date AS y
    FROM @searchlog
    WHERE date.Date == @now.Date;
OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();
NB I noticed you are using the Tsv extractor with Csv files. It may not matter when there is only one column or possibly this is a typo?

Split a string by '_'

I have a number of files in a directory with the following file format:
roll_#_oe_yyyy-mm-dd.csv
where # is an integer and yyyy-mm-dd is a date (for example roll_6_oe_2008-02-12).
I am trying to use the split function so I can return the number on its own. So for example:
roll_6_oe_2008-02-12 would yield 6
and
roll_14_oe_2008-02-12 would yield 14
I have tried :
filename.split("_")
but cannot write the number to a variable. What can I try next?
Supposing that: filename = 'roll_14_oe_2008-02-12'
print(filename.split('_')) prints ['roll', '14', 'oe', '2008-02-12']
The number you want to retrieve is in the 2nd position of the list (index 1), and it is returned as a string:
my_number = filename.split('_')[1]
You could also extract the number using regex:
import re
filename = 'roll_134_oe_2008-02-12'
# Raw string so that \d reaches the regex engine unchanged
number_match = re.match(r"roll_(\d+)", filename)
if number_match:
    print(number_match.group(1))
Working example for both methods: http://www.codeskulptor.org/#user41_jEFOv5N5GN_2.py
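Both methods return the number as a string. If an integer is needed, a small wrapper along these lines could convert and validate it. The function name roll_number is hypothetical, and the sketch assumes the roll_#_ prefix shown in the question.

```python
import re

def roll_number(filename: str) -> int:
    """Return the integer after 'roll_'; raise ValueError if the name does not fit."""
    match = re.match(r"roll_(\d+)_", filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return int(match.group(1))

print(roll_number("roll_6_oe_2008-02-12"))
print(roll_number("roll_14_oe_2008-02-12.csv"))
```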
