Extract files from a single day - U-SQL - Azure

I am facing issues with a U-SQL script. I am trying to get files that were created on the current day from a directory. The file name contains the date in yyyyMMdd format, but when I try to extract the data, instead of getting only that day's files I get all the files in the directory. I am using the script below.
DECLARE @file_set_path string = "/XXXX/Sample_{date:yyyy}{date:MM}{date:dd}{*}.csv";
@searchlog =
    EXTRACT PART_NUMBER string,
            date DateTime
    FROM @file_set_path
    USING Extractors.Tsv(skipFirstNRows : 1);
Can someone please help me with this?

You can use the Date property of the DateTime object to compare dates without including the time component, something like this:
DECLARE @file_set_path string = "/Sample_{date:yyyy}{date:MM}{date:dd}{*}.csv";
DECLARE @now DateTime = DateTime.Now;

@searchlog =
    EXTRACT PART_NUMBER string,
            date DateTime
    FROM @file_set_path
    USING Extractors.Csv(skipFirstNRows : 1);

@output =
    SELECT *,
           @now AS now,
           date.Date AS x,
           @now.Date AS y
    FROM @searchlog
    WHERE date.Date == @now.Date;

OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();
NB: I noticed you are using the Tsv extractor with .csv files. It may not matter when there is only one column, or possibly this is a typo?

Related


How to remove the time from the csv file in python
I have a csv file in this format: "SSP_Ac_INVOICE_DISTRIBUTIONS_17022023072701.csv"
I am trying to remove the time, which comes after 2023. My expectation was SSP_Ac_INVOICE_DISTRIBUTIONS_17022023.csv.
I tried to use strptime but getting below error:
s = "SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv"
temp = dt.datetime.strptime(SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701, '%d%m%Y')
final = temp.strftime('%d-%m-%Y')
print(final)
In the strptime function, you are passing the string 'SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv' instead of the variable s. Also, you are using the wrong format string in strptime. Since the date string in your filename is in the format %d%m%Y%H%M%S, you need to include %H%M%S in the format string to parse the time as well.
The code should look something like this:
import datetime as dt
filename = "SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv"
# Parse the date from the filename (drop the ".csv" extension first)
date_str = filename.split('_')[3].split('.')[0]
date = dt.datetime.strptime(date_str, '%d%m%Y%H%M%S')
# Format the date as required
new_filename = f"{filename.split('_')[0]}_{filename.split('_')[1]}_{filename.split('_')[2]}_{date.strftime('%d%m%Y')}.csv"
print(new_filename)
This code first splits the filename by the underscore character to extract the date string, and then uses strptime to parse the date and time. Finally, it formats the new filename using the date and the other parts of the original filename that were not changed.
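If the underscore count varies across files, a regular-expression variant avoids hard-coding the split index. This is only a sketch, assuming the timestamp is always a 14-digit ddmmYYYYHHMMSS run immediately before the extension:

```python
import datetime as dt
import re

filename = "SSP_AP_INVOICE_DISTRIBUTIONS_17022023072701.csv"

# Look for a 14-digit ddmmYYYYHHMMSS run immediately before ".csv"
match = re.search(r'(\d{14})\.csv$', filename)
if match:
    stamp = dt.datetime.strptime(match.group(1), '%d%m%Y%H%M%S')
    new_filename = filename.replace(match.group(1), stamp.strftime('%d%m%Y'))
    print(new_filename)  # SSP_AP_INVOICE_DISTRIBUTIONS_17022023.csv
```

Parsing with strptime first (rather than just slicing off six digits) also validates that the run really is a date and time.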

Saving string column into NetCdf file

I want to save string data into a single column in a NetCDF file using MATLAB, but no option is given for strings. Can someone tell me how to save string data into a NetCDF file?
S_rebuilt=["101670";"101670";"101670";"101670"]
nccreate('file_name.nc','S_rebuilt',...
'Dimensions', {'x',size(S_rebuilt,1),'y',size(S_rebuilt,2)},...
'FillValue','disable');
ncwrite('file_name.nc','S_rebuilt',S_rebuilt);
Using format netcdf4, one can use the string datatype in MATLAB. So, to save the variable S_rebuilt as a string, I suggest this code:
filename = 'file_name.nc';
S_rebuilt = ["101670"; "101670"; "101670"; "101670"];
nccreate(filename, 'S_rebuilt', ...
    'Dimensions', {'nvars', length(S_rebuilt)}, ...
    'Datatype', 'string', 'Format', 'netcdf4');
% ----------------------------------------------
ncwrite(filename, 'S_rebuilt', S_rebuilt);

What is the appropriate way to take in files that have a filename with a timestamp in it?

What is the appropriate way to take in files that have a filename with a timestamp in it and read properly?
One way I'm thinking of so far is to take these filenames into one single text file to read all at once.
For example, filenames such as
1573449076_1570501819_file1.txt
1573449076_1570501819_file2.txt
1573449076_1570501819_file3.txt
Go into a file named filenames.txt
Then something like
with open('/Documents/filenames.txt', 'r') as f:
    for item in f:
        if item.is_file():
            file_stat = os.stat(item)
            item = item.replace('\n', '')
            print("Fetching {}".format(convert_times(file_stat)))
My question is how would I go about this where I can properly read the names in the text file given that they have timestamps in the actual names? Once figuring that out I can convert them.
If you just want to get the timestamps from the file names, assuming that they all use the same naming convention, you can do so like this:
import glob
import os
from datetime import datetime

# Grab all .txt files in the specified directory
files = glob.glob("<path_to_dir>/*.txt")

for file in files:
    file = os.path.basename(file)
    # Check that it contains an underscore
    if not '_' in file:
        continue
    # Split the file name using the underscore as the delimiter
    stamps = file.split('_')
    # Convert the epoch to a legible string
    start = datetime.fromtimestamp(int(stamps[0])).strftime("%c")
    end = datetime.fromtimestamp(int(stamps[1])).strftime("%c")
    # Consume the data
    print(f"{start} - {end}")
    ...
You'll want to add some error checking and handling; for instance, if the first or second element of the stamps list isn't a parsable int, this will fail.
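That error checking can be sketched by wrapping the int conversion in a try/except. This is a minimal sketch; parse_stamps is a hypothetical helper name, not from the original post:

```python
from datetime import datetime

def parse_stamps(name):
    """Return (start, end) datetimes from 'start_end_rest' names, or None on bad input."""
    stamps = name.split('_')
    if len(stamps) < 2:
        return None
    try:
        start = datetime.fromtimestamp(int(stamps[0]))
        end = datetime.fromtimestamp(int(stamps[1]))
    except ValueError:  # first or second part is not a parsable int
        return None
    return start, end

print(parse_stamps("1573449076_1570501819_file1.txt"))
print(parse_stamps("not_a_timestamp.txt"))  # None
```

Returning None lets the caller skip files that don't follow the naming convention instead of crashing mid-loop.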

Reading excel files in python using dynamic dates in the file and sheet name

I have a situation in which I want to read an Excel file on a daily basis, where the file name is written in the following format: file_name 08.20.2018 xyz.xlsx and is updated daily, with the date changing each day.
The same thing I need to do when this file is being read, I need to extract the data from a sheet whose naming convention also changes daily with the date. An example sheet name is sheet1-08.20.2020-data
How should I achieve this? I am using the following code, but it does not work:
df = pd.read_excel(r'file_name 08.20.2018 xyz.xlsx', sheet_name='sheet1-08.20.2020-data')
How do I update this line of code so that it picks up the data dynamically, with new dates coming in daily? And to be clear, the date will be incremental with no gaps.
You could use pathlib and the datetime module to automate the process :
from pathlib import Path
from datetime import date
import pandas as pd

# assuming you have a directory of files:
folder = Path("directory of files")
# %m and %d are zero-padded, matching names like 08.20.2018
today = date.today().strftime("%m.%d.%Y")
sheetname = f"sheet1-{today}-data"
date_string = f"file_name {today} xyz.xlsx"
xlsx_file = folder.glob(date_string)

# read in the data
df = pd.read_excel(io=next(xlsx_file), sheet_name=sheetname)
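One caveat: next() raises StopIteration when no file matches today's pattern. A small guard turns that into a clearer error; this is a sketch, and find_todays_workbook is a hypothetical helper name:

```python
from pathlib import Path

def find_todays_workbook(folder, pattern):
    """Return the first path matching pattern, or raise a clear error if none does."""
    try:
        return next(Path(folder).glob(pattern))
    except StopIteration:
        raise FileNotFoundError(f"no file matching {pattern!r} in {folder}") from None
```

The returned path can then be passed to pd.read_excel with the computed sheet name.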

Using filenames to create variable - PySpark

I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv") which would greatly simplify my code.
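The right-counting parse described above can be sketched in plain Python before wiring anything into Spark. This is only a sketch; year_week_from_name is a hypothetical helper, and the file names are the examples from the question:

```python
def year_week_from_name(path):
    """Parse the trailing YYYY_WW from names like 'sales_report_2019_12.csv'."""
    stem = path.rsplit('/', 1)[-1]     # drop any directory prefix
    stem = stem[:-len('.csv')]         # drop the extension
    year, week = stem.split('_')[-2:]  # last two underscore-separated parts
    return int(year), int(week)

print(year_week_from_name('my_folder/sales_report_2019_12.csv'))  # (2019, 12)
print(year_week_from_name('my_folder/cash_flow_2020_3.csv'))      # (2020, 3)
```

Counting from the right is what makes this robust to changing prefixes such as sales_report or cash_flow.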
You can easily extract it from the file name using the input_file_name() column and some string functions such as regexp_extract and substring_index:
from pyspark.sql.functions import col, input_file_name, regexp_extract, substring_index

df = spark.read.load('my_folder/*.csv', format="csv")
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0))\
       .withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
       .withColumn("sales_week", substring_index(col("year_week"), "_", -1))\
       .drop("year_week")
You can try the below:
import glob
from pyspark.sql.functions import lit

listfiles = glob.glob('my_folder/sales_report_*.csv')
for file in listfiles:
    # drop the ".csv" extension, then take the last two underscore-separated parts
    year, week = file[:-4].split('_')[-2:]
    df = spark.read.load(file, format="csv") \
             .withColumn("sales_year", lit(year)) \
             .withColumn("sales_week", lit(week))
