spliced and unspliced sequence alignment using STAR - sequence-alignment

I am working with single-cell sequencing data and want to run it through RNA velocity (https://www.nature.com/articles/s41586-018-0414-6). For that, I need to map both spliced and unspliced reads. The dataset I am working with is a SMART-seq dataset, without UMIs or barcodes (https://www.embopress.org/doi/full/10.15252/embj.2018100164). Data are provided as a single SRA file per cell, which can be fetched from GEO and unpacked into a single-end fastq file using the SRA Toolkit.
What I aimed to do was map the data twice using STAR. First I wanted to map the reads to exonic regions, then again to intronic regions. To do this, I downloaded the fasta and GTF files for the Ensembl reference genome GRCm38, build 100 (https://www.ensembl.org/Mus_musculus/Info/Index). The GTF file does not itself contain intron features, so I added them in R using the following code:
library(gread) # https://rdrr.io/github/asrinivasan-oa/gread/
library(rtracklayer)
dir <- "/to/reference/genome"
gtf_file <- file.path(dir, "GRCm38build100.gtf")
gtf <- read_format(gtf_file)
gtf.new <- construct_introns(gtf, update = TRUE)[]
export(gtf.new, "GRCm38build100withIntrons.gtf", format = "gtf")
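(As a quick sanity check that the exported GTF really contains intron features, here is a minimal Python sketch run against the file written by export() above:)
# Count feature types (GTF column 3) in the exported annotation;
# 'intron' should now appear alongside 'exon', 'gene', etc.
from collections import Counter

feature_counts = Counter()
with open("GRCm38build100withIntrons.gtf") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue
        feature_counts[line.rstrip("\n").split("\t")[2]] += 1

print(feature_counts.most_common())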
Then, I generated a STAR genome as follows:
STAR \
--runMode genomeGenerate \
--runThreadN 8 \
--sjdbOverhang 50 \
--genomeDir GRCm38build100/50bp/ \
--genomeFastaFiles GRCm38build100/GRCm38build100.fa \
--sjdbGTFfile GRCm38build100/GRCm38build100withIntrons.gtf \
The code for mapping reads against exonic sequences was as follows:
STAR \
--runMode alignReads \
--runThreadN 8 \
--genomeDir GRCm38build100/50bp \
--readFilesIn $R1 \
--outSAMtype None \
--twopassMode Basic \
--sjdbGTFfeatureExon exon \
--quantMode GeneCounts \
--outFileNamePrefix output/${SAMPLE}_exon_
This worked just fine. For unspliced reads, I wanted to do the following:
STAR \
--runMode alignReads \
--runThreadN 8 \
--genomeDir GRCm38build100/50bp \
--readFilesIn $R1 \
--outSAMtype None \
--twopassMode Basic \
--sjdbGTFfeatureExon intron \
--quantMode GeneCounts \
--outFileNamePrefix output/${SAMPLE}_intron_
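(To compare the two runs, the GeneCounts tables can be diffed directly; a minimal pandas sketch, assuming STAR's default ReadsPerGene.out.tab naming from --quantMode GeneCounts and a placeholder sample name:)
# Compare the exon and intron count tables written by --quantMode GeneCounts.
import pandas as pd

sample = "SRRXXXXXXX"  # placeholder sample name
cols = ["gene_id", "unstranded", "strand_fwd", "strand_rev"]
exon = pd.read_csv(f"output/{sample}_exon_ReadsPerGene.out.tab", sep="\t", names=cols)
intron = pd.read_csv(f"output/{sample}_intron_ReadsPerGene.out.tab", sep="\t", names=cols)
print((exon["unstranded"] == intron["unstranded"]).all())  # True means the two runs produced identical counts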
Strangely, this gave me the exact same result as mapping against exonic regions. I'm sure I'm doing something wrong, but I cannot figure out what. Any help would be much appreciated!
Best regards,
Leon

Related

Spark SQL - Regex to find 4 consecutive numbers

I am trying to write a regex check for 4 consecutive digits, checking it as rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true', but I am getting the error below.
My Spark SQL is:
df_email_cleaned=spark.sql("select *,case when postal_code like '%-%' and rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' then split(postal_code, '-')[0] \
when postal_code like '%-%' and rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' then split(postal_code, '-')[1] \
when postal_code rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' and length(postal_code)>2 and length(postal_code)<6 then postal_code \
else '' \
end sd_ps_clean \
from df_email_cleaned_sql ")
I am getting the following error:
pyspark.sql.utils.AnalysisException: cannot resolve '(named_struct('postal_code',
df_email_cleaned_sql.`postal_code`, 'col2', '(?:^|D)(d{4})(?!d)') = 'true')'
due to data type mismatch: differing types in '(named_struct('postal_code', df_email_cleaned_sql.`postal_code`, 'col2', '(?:^|D)(d{4})(?!d)') = 'true')' (struct<postal_code:string,col2:string> and string).; line 1 pos 343;
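(A hedged sketch of what the same query could look like: rlike already returns a boolean in Spark SQL, so the =='true' comparison is unnecessary, and the error message shows the backslashes were stripped (\D became D), which suggests they need extra escaping. Note also that the second CASE branch in the original has the same condition as the first, so as written it can never fire. This is a sketch, not a verified drop-in fix:)
# Same CASE logic using the RLIKE operator and doubled backslashes inside a raw Python string,
# so \D and \d survive string-literal parsing. Table and column names are taken from the question.
df_email_cleaned = spark.sql(r"""
    select *,
           case when postal_code like '%-%' and postal_code rlike '(?:^|\\D)(\\d{4})(?!\\d)'
                  then split(postal_code, '-')[0]
                when postal_code like '%-%' and postal_code rlike '(?:^|\\D)(\\d{4})(?!\\d)'
                  then split(postal_code, '-')[1]
                when postal_code rlike '(?:^|\\D)(\\d{4})(?!\\d)'
                     and length(postal_code) > 2 and length(postal_code) < 6
                  then postal_code
                else ''
           end as sd_ps_clean
    from df_email_cleaned_sql
""")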

Can't create table in YugabyteDB in multi-region deployment with preferred region

[Question posted by a user on YugabyteDB Community Slack]
I deployed a 3-master + 3-tserver cluster across 3 different regions. I am trying to set a preferred region.
Even though each master is in a different region, I had to use https://docs.yugabyte.com/latest/admin/yb-admin/#modify-placement-info so that it appeared in the config.
I then used https://docs.yugabyte.com/latest/admin/yb-admin/#set-preferred-zones to set the preferred zone.
But the tablets are not rebalancing their leaders. Is there anything else to do?
sample yb-tserver.conf
/home/yugabyte/bin/yb-tserver \
--fs_data_dirs=/mnt/disk0 \
--tserver_master_addrs={yb-master-0.yb-masters.yugabyte1.svc.cluster.local:7100,nyzks901i:29600},ldzks449i:29605,sgzks449i:29601 \
--placement_region=nyz-core-prod \
--use_private_ip=never \
--server_broadcast_addresses=nyzks902i:29500 \
--metric_node_name=yb-tserver-0 \
--memory_limit_hard_bytes=3649044480 \
--stderrthreshold=0 --num_cpus=0 \
--undefok=num_cpus,enable_ysql \
--rpc_bind_addresses=yb-tserver-0.yb-tservers.yugabyte1.svc.cluster.local \
--webserver_interface=0.0.0.0 \
--enable_ysql=true \
--pgsql_proxy_bind_address=0.0.0.0:5433 \
--cql_proxy_bind_address=yb-tserver-0.yb-tservers.yugabyte1.svc.cluster.local
sample yb-master.conf
/home/yugabyte/bin/yb-tserver \
--fs_data_dirs=/mnt/disk0 \
--tserver_master_addrs={yb-master-0.yb-masters.yugabyte1.svc.cluster.local:7100,nyzks901i:29600},ldzks449i:29605,sgzks449i:29601 \
--placement_region=nyz-core-prod \
--use_private_ip=never \
--server_broadcast_addresses=nyzks902i:29500 \
--metric_node_name=yb-tserver-0 \
--memory_limit_hard_bytes=3649044480 \
--stderrthreshold=0 \
--num_cpus=0 \
--undefok=num_cpus,enable_ysql \
--rpc_bind_addresses=yb-tserver-0.yb-tservers.yugabyte1.svc.cluster.local \
--webserver_interface=0.0.0.0 \
--enable_ysql=true \
--pgsql_proxy_bind_address=0.0.0.0:5433 \
--cql_proxy_bind_address=yb-tserver-0.yb-tservers.yugabyte1.svc.cluster.local
Even creating a new table results in errors:
cur.execute(
... """
... CREATE TABLE employee (id int PRIMARY KEY,
... name varchar,
... age int,
... language varchar)
... """)
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
psycopg2.errors.InternalError_: Invalid argument: Invalid table definition: Timed out waiting for Table Creation
And the following yb-tserver logs appear after the error above:
W0806 15:00:55.004124 63 catalog_manager.cc:7475] Aborting the current task due to error: Invalid argument (yb/master/catalog_manager.cc:7623): An error occurred while selecting replicas for tablet 0a503b0b9196425c956a8b9939b2c370: Invalid argument (yb/master/catalog_manager.cc:7623): Not enough tablet servers in the requested placements. Need at least 3, have 1: Not enough tablet servers in the requested placements. Need at least 3, have 1
E0806 15:00:55.004171 63 catalog_manager_bg_tasks.cc:142] Error processing pending assignments, aborting the current task: Invalid argument (yb/master/catalog_manager.cc:7623): An error occurred while selecting replicas for tablet 0a503b0b9196425c956a8b9939b2c370: Invalid argument (yb/master/catalog_manager.cc:7623): Not enough tablet servers in the requested placements. Need at least 3, have 1: Not enough tablet servers in the requested placements. Need at least 3, have 1
I0806 15:00:55.139647 2352 ysql_transaction_ddl.cc:46] Verifying Transaction { transaction_id: bb93b749-6b57-41b6-8f50-7461a07dc254 isolation: SNAPSHOT_ISOLATION status_tablet: 4069e18783a747bea31895b3ab6c69f6 priority: 1756854571847405073 start_time: { physical: 1628261964493502 } }
I0806 15:00:55.295994 53 ysql_transaction_ddl.cc:77] TransactionReceived: OK : status: PENDING
status_hybrid_time: 6669361378174894079
propagated_hybrid_time: 6669361378174910464
I0806 15:00:55.296051 53 ysql_transaction_ddl.cc:97] Got Response for { transaction_id: bb93b749-6b57-41b6-8f50-7461a07dc254 isolation: SNAPSHOT_ISOLATION status_tablet: 4069e18783a747bea31895b3ab6c69f6 priority: 1756854571847405073 start_time: { physical: 1628261964493502 } }: status: PENDING
status_hybrid_time: 6669361378174894079
propagated_hybrid_time: 6669361378174910464
W0806 15:00:55.310122 66 master_service_base-internal.h:39] Unknown master error in status: Invalid argument (yb/master/catalog_manager.cc:7623): An error occurred while selecting replicas for tablet 0a503b0b9196425c956a8b9939b2c370: Invalid argument (yb/master/catalog_manager.cc:7623): Not enough tablet servers in the requested placements. Need at least 3, have 1: Not enough tablet servers in the requested placements. Need at least 3, have 1
I0806 15:00:55.496171 2352 ysql_transaction_ddl.cc:46] Verifying Transaction { transaction_id: bb93b749-6b57-41b6-8f50-7461a07dc254 isolation: SNAPSHOT_ISOLATION status_tablet: 4069e18783a747bea31895b3ab6c69f6 priority: 1756854571847405073 start_time: { physical: 1628261964493502 } }
I0806 15:00:55.652542 67 ysql_transaction_ddl.cc:77] TransactionReceived: OK : status: PENDING
In addition to Dorian's answer: please show and validate the full placement info. You can do this in the following way:
The master placement info can be found using curl http://MASTER-ADDRESS:7000/api/v1/masters | jq.
The tserver placement info can be found on http://MASTER-ADDRESS:7000/tablet-servers (via a browser).
The preferred-zones overview can be found using curl http://MASTER-ADDRESS:7000/cluster-config | jq.
That way, it should be easy to see and validate whether everything is set as it should be.
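(If you prefer scripting the check, here is a minimal Python sketch that pulls the same endpoints mentioned above; MASTER-ADDRESS is a placeholder, and the 7000 web port is assumed reachable:)
import urllib.request

master = "MASTER-ADDRESS"  # placeholder, as in the curl commands above

for path in ("/api/v1/masters", "/cluster-config"):
    with urllib.request.urlopen(f"http://{master}:7000{path}") as resp:
        body = resp.read().decode()
    print(f"== {path} ==")
    print(body)  # same payload the curl commands above return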
You also need the --placement_cloud=cloud and --placement_zone=rack1 arguments on your yb-tserver processes (to match what you passed in the modify_placement_info step), not just the --placement_region gflag.
Otherwise, the create-table step doesn't find matching TServers to meet the desired placement for the table.

How to read and repartition a large dataset from one S3 location to another using Spark, s3DistCp & AWS EMR

I am trying to move data in S3 that is partitioned on a date string at rest (source) to another location where it is partitioned at rest (destination) as year=yyyy/month=mm/day=dd/.
While I am able to read the entire source location in Spark and partition it in the destination format in temporary HDFS, s3DistCp fails to copy this from HDFS to S3.
It fails with an OutOfMemory error.
I am trying to write close to 2 million small files (20 KB each).
My s3DistCp is running with the following args:
sudo -H -u hadoop nice -10 bash -c "if hdfs dfs -test -d hdfs:///<source_path>; then /usr/lib/hadoop/bin/hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar -libjars /usr/share/aws/emr/s3-dist-cp/lib/ -Dmapreduce.job.reduces=30 -Dmapreduce.child.java.opts=Xmx2048m --src hdfs:///<source_path> --dest s3a://<destination_path> --s3ServerSideEncryption;fi"
It fails with
[2020-08-06 14:23:36,038] {bash_operator.py:126} INFO - # java.lang.OutOfMemoryError: Java heap space
[2020-08-06 14:23:36,038] {bash_operator.py:126} INFO - # -XX:OnOutOfMemoryError="kill -9 %p"
The EMR cluster I am running this on is:
"master_instance_type": "r5d.8xlarge",
"core_instance_type": "r5.2xlarge",
"core_instance_count": "8",
"task_instance_types": [ "r5.2xlarge","m5.4xlarge"],
"task_instance_count": "1000"
Any suggestions on which s3DistCp configurations I could increase so that it is able to copy this without running out of memory?
I ended up running this iteratively; for the said AWS stack it was able to handle about 300K files in each iteration without OOM.
This is a classic case where you can use the multithreaded scheduling capabilities of Spark by setting spark.scheduler.mode=FAIR and assigning pools.
What you need to do is:
Create your list of partitions beforehand.
Use this list as an iterable.
For each item from this list, trigger one Spark job in a different pool.
There is no need to use a separate s3DistCp.
An example is shown below.
Before doing spark-submit:
# Create a List of all *possible* partitions like this
# Example S3 prefixes :
# s3://my_bucket/my_table/year=2019/month=02/day=20
# ...
# ...
# s3://my_bucket/my_table/year=2020/month=03/day=15
# ...
# ...
# s3://my_bucket/my_table/year=2020/month=09/day=01
# WE SET `TARGET_PREFIX` as:
TARGET_PREFIX="s3://my_bucket/my_table"
# And Create a List ( till Day=nn part)
# By looping twice
# Increase loop numbers if partition is till hour
aws s3 ls "${TARGET_PREFIX}/"|grep PRE|awk '{print $2}'|while read year_part ;
do
full_year_part="${TARGET_PREFIX}/${year_part}";
aws s3 ls ${full_year_part}|grep PRE|awk '{print $2}'|while read month_part;
do
full_month_part=${full_year_part}${month_part};
aws s3 ls ${full_month_part}|grep PRE|awk -v pref=$full_month_part '{print pref$2}';
done;
done
Once done, we run this script and save the result in a file like this:
bash build_year_month_day.sh > s3_<my_table_day_partition>_file.dat
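(If you would rather build the same day-level list in Python, here is a minimal boto3 sketch; the bucket and prefix names are placeholders matching the example prefixes above:)
# Build the year=/month=/day= partition list with boto3 instead of the shell loop.
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my_bucket", "my_table/"  # placeholders

def list_prefixes(prefix):
    # Return the immediate "sub-directories" under a prefix.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            yield cp["Prefix"]

with open("s3_<my_table_day_partition>_file.dat", "w") as out:
    for year in list_prefixes(prefix):
        for month in list_prefixes(year):
            for day in list_prefixes(month):
                out.write(f"s3://{bucket}/{day}\n")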
Now we are ready to run Spark in multithreaded mode.
The Spark code needs two things (other than scheduler.mode=FAIR):
1. Creating an iterator from the file created above (s3_<my_table_day_partition>_file.dat).
2. sc.setLocalProperty
Here is how it is done.
A. We read the file in our Spark app (Python):
year_month_date_index_file = "s3_<my_table_day_partition>_file.dat"
with open(year_month_date_index_file, 'r') as f:
    content = f.read()
content_iter = [(idx, c) for idx, c in enumerate(content.split("\n")) if c]
B. And use a slice of 100 days to fire 100 threads:
# Number of threads can be increased or decreased
import threading
from itertools import islice

strt = 0
stp = 99
while strt < len(content_iter):
    threads_lst = []
    path_slices = islice(content_iter, strt, stp)
    for s3path in path_slices:
        print("PROCESSING FOR PATH {}".format(s3path))
        pool_index = int(s3path[0])  # Spark needs a pool id
        my_addr = s3path[1]
        # Calling `process_in_pool` in each thread; pool_index is a mandatory argument
        agg_by_day_thread = threading.Thread(target=process_in_pool, args=(pool_index, <additional_args>))
        agg_by_day_thread.start()  # start of thread
        threads_lst.append(agg_by_day_thread)
    for process in threads_lst:
        process.join()  # wait for all threads to finish
    strt = stp
    stp += 100
Two things to notice:
path_slices = islice(content_iter, strt, stp) => returns a slice of size (stp - strt).
pool_index = int(s3path[0]) => the index from content_iter; we use this to assign a pool id.
Now the meat of the code:
def process_in_pool(pool_id, <other_arguments>):
    sc.setLocalProperty("spark.scheduler.pool", "pool_id_{}".format(str(int(pool_id) % 100)))
As you can see, we want to restrict the threads to 100 pools, so we set spark.scheduler.pool to pool_index % 100.
Write your actual transformation/action in this process_in_pool() function, and once done, exit the function by freeing that pool:
    ...
    sc.setLocalProperty("spark.scheduler.pool", None)
    return
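(Purely as an illustration of what such a body might contain — the read path, repartition factor and output path below are made up, not the poster's actual job:)
def process_in_pool(pool_id, src_prefix):
    # Hypothetical body: read one day-level prefix, compact it, and write it back out.
    sc.setLocalProperty("spark.scheduler.pool", "pool_id_{}".format(int(pool_id) % 100))
    df = spark.read.parquet(src_prefix)  # one prefix from the list built earlier
    (df.repartition(1)
       .write.mode("overwrite")
       .parquet(src_prefix.replace("my_table", "my_table_repartitioned")))
    sc.setLocalProperty("spark.scheduler.pool", None)  # free the pool when done
    return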
Finally, run your spark-submit like:
spark-submit \
-- Other options \
--conf spark.scheduler.mode=FAIR \
--other options \
my_spark_app.py
If tuned with the correct executors/cores/memory, you should see a huge performance gain.
The same can be done in Scala with concurrent.futures, but that's for another day.

Different output of PySpark code on the same set of data across multiple runs

I have written a PySpark function, but when I run it multiple times it gives me different outputs each time on the same set of input data.
PySpark function:
def give_percentile(plat, metrics, perc):
    df_perc = df_final.filter(df_final.platform.like('%' + plat + '%'))
    df_perc = df_perc.filter(df_final[metrics] != 0)
    percentile_val = df_perc.approxQuantile(metrics, [perc], 0.05)
    if len(percentile_val) > 0:
        percentile_val = float(percentile_val[0])
    else:
        percentile_val = float(0)
    return percentile_val
Calling the function:
df_agg = sqlContext.createDataFrame([Row(platform='iOS',
                                         percentile_page_load_50=give_percentile("iOS", "page_load", 0.5),
                                         percentile_time_diff_50=give_percentile("iOS", "session_duration", 0.5)),
                                     Row(platform='Android',
                                         percentile_page_load_50=give_percentile("Android", "page_load", 0.5),
                                         percentile_time_diff_50=give_percentile("Android", "session_duration", 0.5)),
                                     Row(platform='Web',
                                         percentile_page_load_50=give_percentile("Web", "page_load", 0.5),
                                         percentile_time_diff_50=give_percentile("Web", "session_duration", 0.5))])
Spark job submit:
spark-submit --deploy-mode cluster --executor-cores 4 --executor-memory 12G --driver-cores 4 --driver-memory 12G --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC "path"
We are storing the output of the PySpark code in Parquet format and are creating an Impala table on top of it, as below:
1. select a.percentile_page_load_50, a.percentile_time_diff_50 from Tablename1 a where a.platform = 'Colvalue' and a.dt = '20190501' limit 5;
Table record count = 22093826
Output = 0.62400001287460327
         0.35100001096725464
2. select a.percentile_page_load_50, a.percentile_time_diff_50 from Tablename2 a where a.platform = 'Colvalue' and a.dt = '20190501' limit 5;
Table record count = 22093826
Output = 0.61500000953674316
         0.28499999642372131
3. select a.percentile_page_load_50, a.percentile_time_diff_50 from Tablename3 a where a.platform = 'Colvalue' and a.dt = '20190501' limit 5;
Table record count = 22093826
Output = 0.61799997091293335
         0.27799999713897705
Tablename1, Tablename2 and Tablename3 were created from multiple runs of the PySpark code on the same set of input data, yet the values differ when the code runs in cluster/distributed mode. When we checked on sample data in standalone mode, the values did not change.
So could you please help me here and tell me what is wrong in the above function code, or whether it is some other cluster issue?
The approxQuantile function gives you an approximate result whose accuracy depends on the given relativeError. You set the allowed relativeError to 0.05, which means the result is only deterministic within the following bounds:
"If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the exact rank of x is close to (p * N)." (The "close to" part is why you get different results.)
You have to set relativeError to 0.0 if you need exact quantiles, but this also increases the runtime.
More information can be found in the documentation.
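(A minimal sketch of the difference, reusing the call from the question; the third argument is the relativeError:)
# relativeError = 0.05: approximate rank, may differ between runs on a distributed dataset.
approx_val = df_perc.approxQuantile("page_load", [0.5], 0.05)
# relativeError = 0.0: exact quantile, stable across runs (but more expensive to compute).
exact_val = df_perc.approxQuantile("page_load", [0.5], 0.0)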

Pandas groupby: aggregate only on partial records

I have the following data frame:
id src target duration
001 A C 4
001 B C 3
001 C C 2
002 B D 5
002 C D 2
and I used the following code to do some aggregations, which works fine.
df_new = df.groupby(['id', 'target']) \
           .apply(lambda x: pd.Series({'min_duration': min(x['duration']),
                                       'total_duration': sum(x['duration']),
                                       'all_src': list(x['src'])
                                       })).reset_index()
Now I want to compute the sum only for src != target records. I modified my code like below:
df_new = df.groupby(['id', 'target']) \
           .apply(lambda x: pd.Series({'min_duration': min(x['duration']),
                                       'total_duration': sum(x['duration']),
                                       'total_duration_condition': sum(x['duration']) if x['src'] != x['target'],
                                       'all_src': list(x['src'])
                                       })).reset_index()
But then I got an invalid syntax error on my new line:
'total_duration_condition':sum(x['duration']) if x['src'] != x['target']
I am wondering what the proper way is to do the sum for only part of the records? Thanks!
Try writing your code like below:
df.groupby(['id', 'target']) \
  .apply(lambda x: pd.Series({'min_duration': min(x['duration']),
                              'total_duration': sum(x['duration']),
                              # I changed this part:
                              'total_duration_condition': sum(x['duration'][x['src'] != x['target']]),
                              'all_src': list(x['src'])
                              })).reset_index()
Change the line
'total_duration_condition': sum(x['duration']) if x['src'] != x['target']
to
'total_duration_condition': sum(x['duration'][x['src'] != x['target']])
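(As an alternative sketch, the same result can be obtained with a helper column and named aggregation; the frame below just reproduces the example data from the question:)
# Precompute a masked duration column, then aggregate with groupby.agg.
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "001", "001", "002", "002"],
    "src": ["A", "B", "C", "B", "C"],
    "target": ["C", "C", "C", "D", "D"],
    "duration": [4, 3, 2, 5, 2],
})

df_new = (
    df.assign(duration_ne=df["duration"].where(df["src"] != df["target"], 0))
      .groupby(["id", "target"])
      .agg(min_duration=("duration", "min"),
           total_duration=("duration", "sum"),
           total_duration_condition=("duration_ne", "sum"),
           all_src=("src", list))
      .reset_index()
)
print(df_new)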
