Perforce Streams - Importing a stream which imports other streams

When importing a stream, is there a way to have the files from the imported stream's imports pulled into the workspace?
For example:
StreamA
StreamB imports StreamA
StreamC imports StreamB
I would like to know if there is a way for a workspace of StreamC to have the files from StreamC, StreamB and StreamA. From my testing, Perforce will only populate a StreamC workspace with files from StreamC and StreamB. If this is not possible or intentionally not allowed, what is the rationale? Thanks!

It's not possible, because an import operates at the depot path level rather than at the stream level. So if you have:
import //depot/streamB/...
you're not importing all of the files mapped by StreamB's view; you're only mapping the files that live under the named depot path.
There is not presently a way to refer to the files mapped by a stream as a unit -- mostly people "fake it" by using the depot path, but as you've discovered, if the stream uses anything other than the default share ... Paths definition, the two aren't really the same thing.
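As a purely illustrative sketch (the depot paths and view names here are hypothetical), the two stream views might look like this:

StreamB's Paths:
share ...
import from_streamA/... //depot/streamA/...

StreamC's Paths:
share ...
import from_streamB/... //depot/streamB/...

StreamB's workspace view pulls in StreamA's files, but those files still live under //depot/streamA/..., not under //depot/streamB/.... When StreamC imports //depot/streamB/..., it maps only the files physically stored under that depot path, so the files StreamA contributes to StreamB never reach StreamC's workspace.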

Related

Loop through multiple folders and subfolders using Pyspark in Azure Blob container (ADLS Gen2)

I am trying to loop through multiple folders and subfolders in an Azure Blob container and read multiple XML files.
E.g., I have files in YYYY/MM/DD/HH/123.xml format.
Similarly, I have multiple subfolders under each month, day and hour, and multiple XML files at the lowest level.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches which did not give me the intended result. Can you please help me with any ideas for implementing this?
import glob, os

for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)
import os

def list_xml_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):  # code not moving past this for loop, no exception
        for name in files:
            filepath = root + os.sep + name
            if filepath.endswith(".xml"):
                r.append(os.path.join(root, name))
    return r

r = list_xml_files('2022/08/19/08/')
glob is a plain Python function and it won't recognize the blob folder paths directly when your code runs in PySpark; you have to give it the path starting from the root of the mount. Also, make sure to specify recursive=True.
I checked both your glob code and your os code in Databricks and got no results, because both need the absolute path, i.e. the path starting at the root folder.
glob code:
import glob, os

for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me, in Databricks, the root to access is /dbfs (I used CSV files in my repro).
Using os:
The same thing applies: once the walk starts from the mounted root, the blob files are listed from the folders and subfolders.
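The os-based version would look roughly like this (a sketch assuming a hypothetical mount at /dbfs/mnt/mycontainer; adjust the path to your own container):

import os

root_dir = '/dbfs/mnt/mycontainer/2022'  # hypothetical mount path, starting from the root
xml_files = []
for root, dirs, files in os.walk(root_dir):
    for name in files:
        if name.endswith('.xml'):
            xml_files.append(os.path.join(root, name))

print(xml_files)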
I used Databricks for my repro after mounting the storage. Wherever you are trying this code in PySpark, make sure you give the root of the folder in the path, and when using glob, set recursive=True as well.
There is an easier way to solve this problem with PySpark!
The tough part is that all the files have to have the same format. In the Azure Databricks sample directory, there is a /cs100 folder that has a bunch of files that can be read in as text (line by line).
The trick is the reader option called "recursiveFileLookup". It assumes the directories were created by Spark, and you cannot mix and match file types.
I added the name of the input file to the dataframe and, last but not least, converted the dataframe to a temporary view.
Looking at a simple aggregate query, we have 10 unique files; the biggest has a little more than 1M records.
If you need to cherry-pick files from a mixed directory, this method will not work. However, I think that is an organizational cleanup task rather than a reading one.
Last but not least, use the correct formatter to read XML:
spark.read.format("com.databricks.spark.xml")
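Roughly, the pattern looks like this (a sketch that reads the files as plain text, with a hypothetical mount path; for real XML parsing you would swap in the spark-xml format above):

from pyspark.sql.functions import input_file_name

# 'spark' is the SparkSession that Databricks provides; the load path is hypothetical
df = (spark.read.format("text")
      .option("recursiveFileLookup", "true")  # descend into every subfolder
      .load("/mnt/mycontainer/2022/"))

# record which file each row came from, then expose the data to SQL
df = df.withColumn("source_file", input_file_name())
df.createOrReplaceTempView("raw_files")

spark.sql("SELECT source_file, COUNT(*) AS line_count FROM raw_files GROUP BY source_file").show()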

Zip multiple directories and files in memory without storing zip on disk

I have a web application which should combine multiple directories and files into one ZIP file and then send this ZIP file to the end user. The end user then automatically receives a ZIP file which is downloaded.
This means that I shouldn't actually store the ZIP on the server disk and that I can generate it in memory so I can pass it along right away. My initial plan was to go with shutil.make_archive but this will store a ZIP file somewhere on the disk. The second problem with make_archive is that it takes a directory but doesn't seem to contain logic to combine multiple directories into one ZIP. Here's the minimal example code for reference:
import shutil
zipper = shutil.make_archive('/tmp/test123', 'zip', 'foldera')
I then started looking at zipfile, which is pretty powerful and can be used to loop over both directories and files recursively and add them to a ZIP. The answer from Jerr on another topic was a great start. The only problem with zipfile.ZipFile seems to be that it cannot create a ZIP without storing it on disk? The code now looks like this:
import os
import zipfile
import io

def zip_directory(path, ziph):
    # ziph is a zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file),
                       os.path.relpath(os.path.join(root, file),
                                       os.path.join(path, '..')))

def zip_content(dir_list, zip_name):
    zipf = zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED)
    for dir in dir_list:
        zip_directory(dir, zipf)
    zipf.close()

zip_content(['foldera', 'folderb'], 'test.zip')
My folder structure looks like this:
foldera
├── test.txt
└── test2.txt
folderb
├── test.py
├── folderc
└── script.py
However, the issue is that this ZIP is being stored on disk. I don't have any use for storing it, and as I have to generate thousands of ZIPs a day, it would fill up way too much storage.
Is there a way to not store the ZIP and convert the output to a BytesIO, str or other type that I can work with in memory and just dispose of when I'm done?
I've also noticed that a folder without any files in it (for example a folder 'folderd' with nothing in it) will not be added to the ZIP either. Is it possible to add a folder to the ZIP even if it is empty?
As far as I know, there would be no way to create ZIPs without storing them on a disk (unless you figure out some hacky way of saving them to memory, but that would be a memory hog). I noticed you brought up that generating thousands of ZIPs a day would fill up your storage device, but you could simply delete each ZIP after it is sent back to the user. While this would create a file on your server, it would only be temporary and therefore would not require a ton of storage as long as you delete it after it is sent.
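A rough sketch of that create-send-delete pattern (the helper name and the send callback are illustrative, not from the original answer):

import os
import shutil
import tempfile

def send_zip_and_clean_up(source_dir, send):
    # build the archive in a throwaway temp directory
    tmp_dir = tempfile.mkdtemp()
    try:
        zip_path = shutil.make_archive(os.path.join(tmp_dir, 'download'), 'zip', source_dir)
        send(zip_path)  # hand the file to your web framework's response here
    finally:
        shutil.rmtree(tmp_dir)  # remove the ZIP as soon as it has been sent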
use BytesIO:
import zipfile
from io import BytesIO

filestream = BytesIO()
with zipfile.ZipFile(filestream, mode='w', compression=zipfile.ZIP_DEFLATED) as zipf:
    for dir in dir_list:
        zip_directory(dir, zipf)
you didn't ask how to send it, but for the record, the way I send it in my flask code (with send_file imported from flask) is:
filestream.seek(0)
return send_file(filestream, attachment_filename=attachment_filename,
                 as_attachment=True, mimetype='application/zip')

Azure ML SDK DataReference - File Pattern - MANY files

I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
I've got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2), and I want to find all the .json files and use all of them to train with. I could possibly use just the new .json files as well (an interesting option, honestly).
Currently I just have the store mounted to a data lake and available; and it just iterates the mount for the data files and loads them up.
How can I use data references for this instead?
What do data references do for me that mounting time-stamped data does not?
a. From an audit perspective, I have version control, execution time and time-stamped read-only data. Admittedly, doing a replay on this would require additional coding, but it is doable.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. When you get a handle to that folder in your train code, just iterate over it as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step.
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ds = ws.get_default_datastore()
blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)
Then inside your train.py you access the path as:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
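From there, iterating over the mounted folder is plain Python; a rough sketch (the .json handling here is illustrative, not part of the original answer):

import glob
import json
import os

records = []
for path in glob.glob(os.path.join(args.data_folder, '**', '*.json'), recursive=True):
    with open(path) as f:
        records.append(json.load(f))

print('Loaded', len(records), 'telemetry files')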
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, it's pretty easy to schedule it to run regularly. For example:
from azureml.pipeline.core import Schedule, ScheduleRecurrence

pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1", description="Some desc", version="1", continue_on_step_failure=True)

recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter to the pipeline, and then your step can mount the folder and iterate over the JSON files.

How do I create a bucket and multiple subfolders at once with Boto3 in s3?

I'm able to create one directory at the top level using
s3.create_bucket(Bucket=bucket_name)
I want to create a new bucket and subfolders so I have a directory structure like:
-top_level_bucket
  -sub_folder
    -sub_sub_folder
I want to do something like this to create everything at once if not already existent:
path = 'top_level_bucket/sub_folder/sub_sub_folder'
s3.create_bucket(Bucket=path)
Is this possible?
There is no concept of a 'sub-bucket' in Amazon S3.
Amazon S3 is actually a flat object storage service. It does not use directories.
Instead, files are uploaded with a path, eg:
aws s3 cp file.txt s3://my-bucket/bob/files/file.txt
The full name of the object will be: bob/files/file.txt
It looks and behaves like there are directories, but they are not actually there. In fact, you can run the above command and it will automatically 'create' the bob and files directory, but they are not actually there! If you delete the object, those directories will disappear (because they were never actually there!).
Bottom line: Create the bucket, then upload files to wherever you wish within it, even if the 'directories' do not exist. Don't worry about creating a folder structure in advance.
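For illustration with boto3 (the bucket name and key here are hypothetical), creating the bucket once and then writing objects under nested keys is all you need:

import boto3

s3 = boto3.client('s3')

# create the bucket once; outside us-east-1 you must also pass
# CreateBucketConfiguration={'LocationConstraint': '<region>'}
s3.create_bucket(Bucket='top-level-bucket-example')

# the 'folders' exist implicitly as prefixes in the object's key
s3.put_object(Bucket='top-level-bucket-example',
              Key='sub_folder/sub_sub_folder/file.txt',
              Body=b'hello')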

Perforce stream view import path strips file extension

I have 2 streams right now, and in my second stream I want to import certain files from the other stream based on their file extensions.
If I set it up using the following statements:
import from_second_stream/... //second_stream/....xml
import from_second_stream/... //second_stream/....json
It successfully imports all the files in the correct place, but it strips the file extensions.
For instance, I have a file in the second stream in this path:
//second_stream/test/myTest.json
Which should get imported as:
from_second_stream/test/myTest.json
But instead becomes:
from_second_stream/test/myTest
What am I doing wrong?
According to Perforce support:
The path should be:
import from_second_stream/....json //second_stream/....json
However, embedded wildcards are not allowed in stream views, so you will not be able to use that. Instead you need to specify each file in the view:
import from_second_stream/test/myTest.json //second_stream/test/myTest.json
So we ended up having to restructure our build system to accommodate this...
