I have the following situation:
Checkpoint A creates files (either 1 or 2) in folder A_Out.
Checkpoint B takes all the files from A_Out as input and generates files (1 or 2) in B_Out.
I have a conditional rule C which can take input from A or B and produce the final output.
I am using the following code:
import os

checkpoint A:
    output: directory("A_Out")
    # Rest of the logic

def collect_a(wildcards):
    path = checkpoints.A.get(**wildcards).output[0]
    return expand("{path}/{file}.txt",
                  path=path,
                  file=glob_wildcards(os.path.join(path, "{name}.txt")).name)

checkpoint B:
    input: collect_a
    output: directory("B_Out")
    # Rest of the logic

def collect_b(wildcards):
    path = checkpoints.B.get(**wildcards).output[0]
    return expand("{path}/{file}.txt",
                  path=path,
                  file=glob_wildcards(os.path.join(path, "{name}.txt")).name)

def conditional_input(wildcards):
    if condition_A:
        return collect_a(wildcards)
    else:
        return collect_b(wildcards)

rule c:
    input: conditional_input
    # Rest of the logic
In the above case, when condition_A is False, Snakemake evaluates only checkpoint B and does not evaluate checkpoint A. How can I solve this problem? Or is there another, more elegant way?
These cascading checkpoint calls are pretty buggy and have several issues associated with them. If you flesh out your example some more, you can submit an issue on GitHub, assuming you have the most recent version of Snakemake.
As a workaround, you can try adding a temporary signal file to each directory to indicate that the rule has completed. You then need to do your glob_wildcards in the rule code.
Not great, but it could work. Because you have a temp input as a signal, the rules will rerun every time even if the other files are already present; Snakemake doesn't know they are outputs. Another option would be to examine your inputs beforehand to decide whether one or two files will be generated and tie that into your input function logic, bypassing checkpoints.
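A minimal sketch of that signal-file workaround, assuming the directory names from the question; the .done flag names, the plain-rule structure, and the globbing inside the rule bodies are illustrative, not anything Snakemake requires:

import glob

rule A:
    output: touch("A_Out/.done")  # could be wrapped in temp(...) as noted above, at the cost of reruns
    run:
        # Rest of the logic: write 1 or 2 files into A_Out
        ...

rule B:
    input: "A_Out/.done"
    output: touch("B_Out/.done")
    run:
        a_files = glob.glob("A_Out/*.txt")  # glob in the rule code, as described above
        # Rest of the logic: write 1 or 2 files into B_Out
        ...

rule c:
    input: lambda wc: "A_Out/.done" if condition_A else "B_Out/.done"
    run:
        files = glob.glob("A_Out/*.txt" if condition_A else "B_Out/*.txt")
        # Rest of the logic: produce the final output from files
        ...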
Background: The purpose of this script is to take eight very large (~7GB) FASTQ files, subsample each, and concatenate each subsample into one "master" FASTQ file. The resulting file is about 60GB. Each file is subsampled to 120,000,000 lines.
The issue: The basic purpose of this script is to output a huge file. I have print statements and timestamps in my code, so I know it goes through the entire script, processes the input files, and creates the output files. After I see the final print statement, I go to my directory and see that the output file has been generated, it's the correct size, and it was last modified a while ago, yet the script is still running. It then stalls for about 2-3 hours before I can enter anything into my terminal again.
My code is behaving like it gets stuck on the last line of the script even after it's finished creating the output file.
I'm hoping someone might be able to identify what's causing this weird behavior. Below is a dummy version of what my script does:
import random
import itertools

infile1 = "sample1_1.fastq"
inFile2 = "sample1_2.fastq"

with open(infile1, 'r') as file_1:
    f1 = file_1.read()

with open(inFile2, 'r') as file_2:
    f2 = file_2.read()

fastq1 = f1.split('\n')
fastq2 = f2.split('\n')

def subsampleFASTQ(compile1, compile2):
    random.seed(42)
    random_1 = random.sample(compile1, 30000000)
    random.seed(42)
    random_2 = random.sample(compile2, 30000000)
    return random_1, random_2

combo1, combo2 = subsampleFASTQ(fastq1, fastq2)

with open('sampleout_1.fastq', 'w') as out1:
    out1.write('\n'.join(str(i) for i in combo1))

with open('sampleout_2.fastq', 'w') as out2:
    out2.write('\n'.join(str(i) for i in combo2))
My ideas of what it could be:
File size is causing some slowness.
There is some background process running in this script that won't let it finish (but I have no idea how to debug that; any resources would be appreciated).
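Not an answer, but since you asked how to debug this: one hedged way to narrow it down on a Unix-like system is to print a timestamp and the process's peak memory after each stage, so you can see whether the stall happens during the final write, while the huge in-memory lists are being freed, or somewhere else. The report() helper below is illustrative:

import resource
import time

def report(stage):
    # ru_maxrss is the peak resident set size: kilobytes on Linux, bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(time.strftime("%H:%M:%S"), stage, "peak RSS:", peak, flush=True)

report("start")
# ... read and split the FASTQ files ...
report("inputs loaded")
# ... subsample ...
report("subsampled")
# ... write sampleout_1.fastq and sampleout_2.fastq ...
report("outputs written")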
I need to process several thousand small log files.
I opted for Databricks to handle this problem, because it has good parallel computing capabilities and interacts nicely with the Azure Blob storage account where the files are hosted.
After some research, I keep coming across the same snippet of code (in PySpark).
# Getting your list of files with a custom function
list_of_files = get_my_files()
# Create a path_rdd and use a custom udf to parse it
path_rdd = sc.parallelize(list_of_files)
content = path_rdd.map(parse_udf).collect()
Is there a better method to do this? Would you opt for a flatMap if the log files are in CSV format?
Thank you!
My current solution is:
content = sc.wholeTextFiles('/mnt/container/foo/*/*/', numPartitions=XX)
parsed_content = content.flatMap(custom_parser).collect()
I read all the content of the files as a string and keep their filenames.
I then pass this to my parsing function "custom_parser" using a flatMap, where "custom_parser" is defined as
def custom_parser(argv):
    filename, content = argv
    # Apply magic
    return parsed_content  # must be an iterable, since flatMap flattens it
I am currently finishing with a .collect() action, but I will alter this to save the output directly.
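For what it's worth, a hedged sketch of what saving the output directly (instead of collecting it) could look like, assuming custom_parser yields dict-like records; the column handling and the output path are made up for illustration:

from pyspark.sql import Row

content = sc.wholeTextFiles('/mnt/container/foo/*/*/')
parsed = content.flatMap(custom_parser)

# Build a DataFrame from the parsed records and write it out, rather than
# pulling everything back to the driver with collect().
df = parsed.map(lambda rec: Row(**rec)).toDF()
df.write.mode("overwrite").parquet("/mnt/container/parsed_logs")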
I'm pretty new to Databricks, so excuse my ignorance.
I have a Databricks notebook that creates a table to hold data. I'm trying to output the data to a pipe-delimited file using another notebook, which is written in Python. If I use an ORDER BY clause, each record is created in a separate file. If I leave the clause out of the code I get one file, but it's not in order.
The code from the notebook is as follows:
%python
try:
    dfsql = spark.sql("select field_1, field_2, field_3, field_4, field_5, field_6, field_7, field_8, field_9, field_10, field_11, field_12, field_13, field_14, field_15, field_16 from dbsmets1mig02_technical_build.tbl_tech_output_bsmart_update ORDER BY MSN,Sort_Order") #Replace with your SQL
except:
    print("Exception occurred")

if dfsql.count() == 0:
    print("No data rows")
else:
    dfsql.write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/publisheddatasmets1mig/smetsmig1/mmt/bsmart")
Spark creates a file per partition when writing files, so your ORDER BY is creating lots of partitions. Generally you want multiple files, as that means you get more throughput: with one file/partition you are only using one thread, so only one CPU on your workers is active while the others sit idle, which makes it a very expensive way of solving your problem.
You could leave the order by in and coalesce back into a single partition:
dfsql.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/publisheddatasmets1mig/smetsmig1/mmt/bsmart")
Even if you have multiple files you can point your other notebook at the folder and it will read all files in the folder.
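For example, a hedged sketch of reading the whole output folder back in the other notebook (the path is the one used above; the header and delimiter options mirror the write, the rest is an assumption):

# Read every part file Spark wrote into the folder as a single DataFrame.
dfin = (spark.read.format("csv")
        .option("header", "false")
        .option("delimiter", "|")
        .load("/mnt/publisheddatasmets1mig/smetsmig1/mmt/bsmart"))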
To accomplish this I have done something similar to what simon_dmorias suggested. I am not sure whether there is a better way; this doesn't scale very well, but if you are working with a small dataset it will work.
simon_dmorias suggested: df.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/mountone/data/")
This will write a single partition into a directory as /mnt/mountone/data/data-<guid>-.csv, which I believe is not what you are looking for, right? You just want /mnt/mountone/data.csv, similar to the pandas .to_csv function.
Therefore, I first write it to a temporary location on the cluster (not on the mount).
df.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/tmpdir/data")
I then use the dbutils.fs.ls("/tmpdir/data") command to list the directory contents and identify the name of the CSV file that was written to the directory, i.e. /tmpdir/data/data-<guid>-.csv.
Once you have the CSV file name, use the dbutils.fs.cp function to copy the file to a mount location and rename it. This gives you a single file without the directory, which is what I believe you were looking for.
dbutils.fs.cp("/tmpdir/data/data-<guid>-.csv", "/mnt/mountone/data.csv")
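Putting those steps together, a hedged sketch; the part-file matching and the cleanup of the temporary directory are assumptions:

# Write a single partition to a temporary directory on the cluster.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/tmpdir/data")

# Find the one CSV file Spark produced inside that directory.
csv_path = [f.path for f in dbutils.fs.ls("/tmpdir/data") if f.name.endswith(".csv")][0]

# Copy it to the mount under the final name, then remove the temp directory.
dbutils.fs.cp(csv_path, "/mnt/mountone/data.csv")
dbutils.fs.rm("/tmpdir/data", True)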
I am getting a value from an HTTP request and writing it to a CSV file. Every time the program is executed the new values overwrite the existing ones instead of being appended to the CSV. I would like to append the values instead of overwriting them. I am using Regex and XPath extractors to get the values from the HTTP requests and write them to the CSV file.
new File('/Users/ddd/testgui/queueId1.csv').newWriter().withWriter { w ->
    w << vars.get('queueid')
}
So this works for me, on groovysh 2.5.3:
new File('/Users/ddd/testgui/queueId1.csv').newWriter(true).withWriter { w ->
    w << vars.get('queueid')
}
The true in the newWriter is for append == true.
You can do just:
new File('/Users/ddd/testgui/queueId1.csv') << vars.get('queueid')
Be aware that your code will only work reliably when you have 1 thread; with more, you may suffer from a race condition when 2 threads write to the file simultaneously.
If you're going to execute this code with more than 1 virtual user, I would rather recommend going for the Sample Variables functionality.
If you add the next line to the user.properties file:
sample_variables=queueid
and restart JMeter to pick the property up, then the next time you run your test the .jtl results file will have an extra column with the queueid variable value for each thread/request.
If you want to store it in a separate file, go for the Flexible File Writer.
I have a simple program that manipulates some data stored in text files. However, I have to store the name and the password in different files for Python to read.
I was wondering if I could put these two words (the name and the password) on two separate lines of one file and get Python to overwrite just one of the lines, based on which one I choose to overwrite (either the password or the name).
I can get Python to read specific lines with:
linenumber=linecache.getline("example.txt",4)
Ideally I'd like something like this:
linenumber=linecache.writeline("example.txt","Hello",4)
So this would just write "Hello" in "example.txt" only on line 4.
But unfortunately it doesn't seem to be as simple as that. I can get the words stored in separate files, but doing this on a larger scale I'm going to have a lot of text files, all named differently and with different words in them.
If anyone would be able to help, it would be much appreciated!
Thanks, James.
You can try it with the built-in open() function:
def overwrite(filename, newline, linenumber):
    try:
        with open(filename, 'r') as reading:
            lines = reading.readlines()
        lines[linenumber] = newline + '\n'
        with open(filename, 'w') as writing:
            for i in lines:
                writing.write(i)
        return 0
    except:
        return 1  # when reading/writing goes wrong, e.g. no such file
Be careful! It writes all the lines back in a loop, and if an exception occurs mid-write, example.txt may already be blank. You may want to keep all the lines in a list the whole time so you can write them back in the exception handler, or keep a backup of your old files.
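To address that caveat, here is a minimal sketch of the safer pattern of writing to a temporary file first and then swapping it in, so the original file is never left half-written; the helper name overwrite_line is illustrative:

import os
import tempfile

def overwrite_line(filename, newline, linenumber):
    with open(filename, 'r') as f:
        lines = f.readlines()
    lines[linenumber] = newline + '\n'  # note: zero-based, unlike linecache.getline
    # Write the new content to a temp file in the same directory, then
    # atomically replace the original; it stays intact if anything fails.
    dirname = os.path.dirname(os.path.abspath(filename))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, 'w') as tmp:
        tmp.writelines(lines)
    os.replace(tmp_path, filename)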