How to overwrite specific partitions in spark/BigQuery/GCP

How to overwrite specific partitions in spark/BigQuery/GCP?
I use this code:
df.write \
.format("bigquery") \
.mode("overwrite") \
.option("table",bq_path) \
.option("temporaryGcsBucket",GCS_BUCKET) \
.option("partitionField",'partition_date') \
.save()
but when the batch is executed, the existing data in the table is deleted.

You can do it like this, setting the datePartition option to the partition you want to overwrite:
df.write.format('bigquery') \
.option("datePartition", "20220810").mode("overwrite") \
.option("partitionField", "processed_date") \
.option('table', "YOUR_TABLE") \
.save()
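As a minimal sketch of the same idea, the options from the answer above can be collected in a dictionary so that the datePartition value is derived from a date instead of being typed by hand. The helper name partition_overwrite_options is hypothetical (it is not part of Spark or the BigQuery connector); the option names and the YYYYMMDD format are taken from the answer.

```python
from datetime import date


def partition_overwrite_options(table, partition_field, day):
    """Hypothetical helper: build the writer options used in the answer above.

    The option names ("table", "partitionField", "datePartition") come from
    the answer; the helper itself is only an illustration.
    """
    return {
        "table": table,
        "partitionField": partition_field,
        # Same YYYYMMDD form as the "20220810" literal in the answer.
        "datePartition": day.strftime("%Y%m%d"),
    }


opts = partition_overwrite_options("YOUR_TABLE", "processed_date", date(2022, 8, 10))
print(opts["datePartition"])  # 20220810

# Assumed usage with an existing DataFrame `df` (not executed here):
# df.write.format("bigquery").options(**opts).mode("overwrite").save()
```

Building the value from a date object avoids typos in the partition string when the job runs for a different day.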

Related

How to Identify This Pyspark Coding Issue - No viable alternative at input

I am trying to run the following code in Azure Synapse PySpark and receive a parsing error. It seems that Synapse does not accept the double brackets. Does anyone know how to fix it?
def curated_report(entity_name):
    sqlstr = "WITH Participant_Name \
        AS (SELECT \
        CASEID, \
        PARTICIPANTID, \
        LASTWRITTEN, \
        PARTICIPANT, \
        FIRSTNAME, \
        MIDDLENAME, \
        LASTNAME \
        FROM (SELECT \
        ab.CASEID, \
        ab.PARTICIPANTID, \
        ab.DYNAMICDATATYPE, \
        ab.DYNAMICEVIDENCEVALUE, \
        ab.LASTWRITTEN \
        FROM a.ev ab \
        INNER JOIN (SELECT \
        PARTICIPANTID, \
        MAX(LASTWRITTEN) AS MAXDATE \
        FROM a.bd \
        where TYPE in ( 'PDC001' ) \
        GROUP BY PARTICIPANTID) cd \
        ON ab.PARTICIPANTID = cd.PARTICIPANTID \
        AND ab.LASTWRITTEN = cd.MAXDATE \
        GROUP BY ab.CASEID, \
        ab.PARTICIPANTID, \
        ab.DYNAMICDATATYPE, \
        ab.DYNAMICEVIDENCEVALUE, \
        ab.LASTWRITTEN) AS SOURCE \
        PIVOT(max(DYNAMICEVIDENCEVALUE) \
        FOR DYNAMICDATATYPE IN (PARTICIPANT, \
        FIRSTNAME, \
        MIDDLENAME,\
        LASTNAME) \
        )AS RESULT) \ <----*this line seems to be causing error*
        SELECT* \
        FROM PARTICIPANT_NAME"
    df = spark.sql(sqlstr)
    return df
*solved.

ParseException:
no viable alternative at input 'WITH Participant_Name AS (SELECT ...

Remove the AS RESULT alias from the query. For a CTE, the name given after WITH is sufficient.
I tried to reproduce this with a similar script and got the same error.
To avoid the error, I removed the AS RESULT alias from the query, and it executed successfully. Note that the corrected code below also quotes the pivot values as string literals.
Corrected code:
sqlstr ="WITH Participant_Name \
AS (SELECT \
CASEID, \
PARTICIPANTID, \
LASTWRITTEN, \
PARTICIPANT, \
FIRSTNAME, \
MIDDLENAME, \
LASTNAME \
FROM (SELECT \
ab.CASEID, \
ab.PARTICIPANTID, \
ab.DYNAMICDATATYPE, \
ab.DYNAMICEVIDENCEVALUE, \
ab.LASTWRITTEN \
FROM enhanced.BDMCASEEVIDENCE ab \
INNER JOIN (SELECT \
PARTICIPANTID, \
MAX(LASTWRITTEN) AS MAXDATE \
FROM enhanced.BDMCASEEVIDENCE \
where EVIDENCETYPE in ( 'PDC0000258' ) \
GROUP BY PARTICIPANTID) cd \
ON ab.PARTICIPANTID = cd.PARTICIPANTID \
AND ab.LASTWRITTEN = cd.MAXDATE \
GROUP BY ab.CASEID, \
ab.PARTICIPANTID, \
ab.DYNAMICDATATYPE, \
ab.DYNAMICEVIDENCEVALUE, \
ab.LASTWRITTEN) AS SOURCE \
PIVOT(max(DYNAMICEVIDENCEVALUE) \
FOR DYNAMICDATATYPE IN ('PARTICIPANT', \
'FIRSTNAME', \
'MIDDLENAME',\
'LASTNAME') \
)) \
SELECT* \
FROM PARTICIPANT_NAME"
df = spark.sql(sqlstr)
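The two changes in the corrected code (no alias after the PIVOT clause, and quoted pivot values) can be sketched with a small query builder. This is only an illustration under assumptions: build_pivot_query and quote_literals are hypothetical helper names, the inner SELECT is shortened, and the table name is passed in by the caller. A triple-quoted string is used so that no backslash continuations are needed.

```python
def quote_literals(values):
    """Render values as a comma-separated list of SQL string literals."""
    for value in values:
        if "'" in value:
            # Keep the sketch simple: no escaping, just refuse such values.
            raise ValueError(f"unsupported quote in pivot value: {value!r}")
    return ", ".join(f"'{value}'" for value in values)


def build_pivot_query(table, pivot_values):
    """Hypothetical builder for a shortened version of the corrected query."""
    in_list = quote_literals(pivot_values)
    return f"""
WITH Participant_Name AS (
    SELECT *
    FROM (SELECT PARTICIPANTID, DYNAMICDATATYPE, DYNAMICEVIDENCEVALUE
          FROM {table}) AS SOURCE
    PIVOT (max(DYNAMICEVIDENCEVALUE)
           FOR DYNAMICDATATYPE IN ({in_list}))
)
SELECT * FROM Participant_Name
"""


query = build_pivot_query("a.ev", ["PARTICIPANT", "FIRSTNAME", "MIDDLENAME", "LASTNAME"])
print(query)

# Assumed usage inside a Spark session (not executed here):
# df = spark.sql(query)
```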

kubernetes config map - syntax error: unterminated quoted string

I am getting this error when trying to attach a shell script as a ConfigMap.
I am not sure what the issue is, because the script works when it is not added to the ConfigMap.
The error is reported on line 58, which does not even exist.
Any help would be really appreciated.
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Values.metadata.name }}-micro
data:
  micro-integrator.sh: |
    #!/bin/sh
    # micro-integrator.sh
    while [ "$status" = "$START_EXIT_STATUS" ]
    do
      $JAVACMD \
        -Xbootclasspath/a:"$CARBON_XBOOTCLASSPATH" \
        $JVM_MEM_OPTS \
        -XX:+HeapDumpOnOutOfMemoryError \
        -XX:HeapDumpPath="$CARBON_HOME/repository/logs/heap-dump.hprof" \
        $JAVA_OPTS \
        -Dcom.sun.management.jmxremote \
        -classpath "$CARBON_CLASSPATH" \
        -Djava.io.tmpdir="$CARBON_HOME/tmp" \
        -Dcatalina.base="$CARBON_HOME/wso2/lib/tomcat" \
        -Dwso2.server.standalone=true \
        -Dcarbon.registry.root=/ \
        -Djava.command="$JAVACMD" \
        -Dqpid.conf="/conf/advanced/" \
        $JAVA_VER_BASED_OPTS \
        -Dcarbon.home="$CARBON_HOME" \
        -Dlogger.server.name="micro-integrator" \
        -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager \
        -Dcarbon.config.dir.path="$CARBON_HOME/conf" \
        -Dcarbon.repository.dir.path="$CARBON_HOME/repository" \
        -Dcarbon.components.dir.path="$CARBON_HOME/wso2/components" \
        -Dcarbon.dropins.dir.path="$CARBON_HOME/dropins" \
        -Dcarbon.external.lib.dir.path="$CARBON_HOME/lib" \
        -Dcarbon.patches.dir.path="$CARBON_HOME/patches" \
        -Dcarbon.internal.lib.dir.path="$CARBON_HOME/wso2/lib" \
        -Dcom.atomikos.icatch.hide_init_file_path=true \
        -Dorg.apache.jasper.compiler.Parser.STRICT_QUOTE_ESCAPING=false \
        -Dorg.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER=true \
        -Dcom.sun.jndi.ldap.connect.pool.authentication=simple \
        -Dcom.sun.jndi.ldap.connect.pool.timeout=3000 \
        -Dorg.terracotta.quartz.skipUpdateCheck=true \
        -Djava.security.egd=file:/dev/./urandom \
        -Dfile.encoding=UTF8 \
        -Djava.net.preferIPv4Stack=true \
        -DNonRegistryMode=true \
        -DNonUserCoreMode=true \
        -Dcom.ibm.cacheLocalHost=true \
        -Dcarbon.use.registry.repo=false \
        -DworkerNode=false \
        -Dorg.apache.cxf.io.CachedOutputStream.Threshold=104857600 \
        -DavoidConfigHashRead=true \
        -Dproperties.file.path=default \
        -DenableReadinessProbe=true \
        -DenableManagementApi=true \
        $NODE_PARAMS \
        -Dorg.apache.activemq.SERIALIZABLE_PACKAGES="*" \
        org.wso2.micro.integrator.bootstrap.Bootstrap $*
      status="$?"
    done

How to create a user to connect to the database

Error:
[FATAL] [DBT-05509] Failed to connect to the specified database (cdb21).
CAUSE: OS Authentication might be disabled for this database (cdb21).
ACTION: Specify a valid sysdba user name and password to connect to the database.
First step (software-only installation):
./runInstaller -silent -responseFile /scratch/app/user/product/21.0.0/dbhome_1/install/response/db_install.rsp \
oracle.install.option=INSTALL_DB_SWONLY \
UNIX_GROUP_NAME=oinstall \
ORACLE_BASE=/scratch/app/user \
INVENTORY_LOCATION=/scratch/app/oraInventory \
SELECTED_LANGUAGES=en \
oracle.install.db.InstallEdition=EE \
oracle.install.db.isCustomInstall=false \
oracle.install.db.OSDBA_GROUP=oinstall \
oracle.install.db.OSBACKUPDBA_GROUP=oinstall \
oracle.install.db.OSDGDBA_GROUP=oinstall \
oracle.install.db.OSKMDBA_GROUP=oinstall \
oracle.install.db.OSRACDBA_GROUP=oinstall \
SECURITY_UPDATES_VIA_MYORACLESUPPORT=false \
DECLINE_SECURITY_UPDATES=true
Second step (create the database):
dbca -silent -createDatabase \
-templateName General_Purpose.dbc \
-gdbname cdb21 \
-sid cdb21 \
-responseFile NO_VALUE \
-characterSet AL32UTF8 \
-sysPassword Welcome1 \
-systemPassword Welcome1 \
-createAsContainerDatabase true \
-numberOfPDBs 1 \
-pdbName pdb21 \
-pdbAdminPassword Welcome1 \
-databaseType MULTIPURPOSE \
-memoryMgmtType auto_sga \
-totalMemory 4096 \
-storageType FS \
-datafileDestination /scratch/oradata/ \
-emConfiguration NONE \
-ignorePreReqs

Start the listener using:
lsnrctl start
Then start the database:
startup

Why doesn't running "python run_squad.py" work?

I want to fine-tune on SQuAD with Hugging Face's run_squad.py, but I ran into the following problems:
1. When I use "--do_train" without "True", as in the following code, there are no models in output_dir after 20 minutes of running:
!python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir models/bert/ \
--data_dir data/squad \
--overwrite_output_dir \
--overwrite_cache \
--do_train \
--train_file train-v2.0.json \
--version_2_with_negative \
--do_lower_case \
--do_eval \
--predict_file dev-v2.0.json \
--per_gpu_train_batch_size 2 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--threads 10 \
--save_steps 5000
2. When I use "--do_train=True", as in the following code, the error message is "run_squad.py: error: argument --do_train: ignored explicit argument 'True'":
!python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir models/bert/ \
--data_dir data/squad \
--overwrite_output_dir \
--overwrite_cache \
--do_train=True \
--train_file train-v2.0.json \
--version_2_with_negative \
--do_lower_case \
--do_eval \
--predict_file dev-v2.0.json \
--per_gpu_train_batch_size 2 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--threads 10 \
--save_steps 5000
3. When I use "--do_train True", as in the following code, the error message is "run_squad.py: error: unrecognized arguments: True":
!python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir models/bert/ \
--data_dir data/squad \
--overwrite_output_dir \
--overwrite_cache \
--do_train True \
--train_file train-v2.0.json \
--version_2_with_negative \
--do_lower_case \
--do_eval \
--predict_file dev-v2.0.json \
--per_gpu_train_batch_size 2 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--threads 10 \
--save_steps 5000
I run the code in Colab with a GPU: Tesla P100-PCIE-16GB
Judging by the running time, I think the code did not go through the training process, but I don't know how to set the parameters so that training runs. What should I do?
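The two error messages in cases 2 and 3 are what Python's argparse reports for a flag declared with action="store_true". Assuming run_squad.py defines --do_train that way (the messages suggest it, but the script itself is not shown here), the behaviour can be reproduced in a small sketch. The parser below is a stand-in, not the script's real parser.

```python
import argparse


def build_parser():
    # Stand-in parser: only the --do_train flag, declared as a boolean switch.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--do_train", action="store_true",
                        help="Enable training (takes no value).")
    return parser


def try_parse(argv):
    """Return the parsed do_train value, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv).do_train
    except SystemExit:
        # argparse prints its error to stderr and exits; treat that as rejection.
        return None


print(try_parse(["--do_train"]))          # True
print(try_parse([]))                      # False
print(try_parse(["--do_train=True"]))     # None (ignored explicit argument 'True')
print(try_parse(["--do_train", "True"]))  # None (unrecognized arguments: True)
```

Under that assumption, the bare "--do_train" in case 1 is the accepted form, so the missing output there would have a different cause than the flag syntax.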

Catalina-Opts with string parameter is not working

On my Linux machine I want to configure Tomcat 8 with the following CATALINA_OPTS:
export CATALINA_OPTS="$CATALINA_OPTS -Dsina.elasticsearch.cluster.nodes=sina-1:9300 -Dsina.elasticsearch.cluster.name=sinasuite-dev -Dsina.rabbitmq.host=sina-1 -Dsina.rabbitmq.port=5672 -Dsina.rabbitmq.user=guest -Dsina.rabbitmq.password=guest -Dsina.images.directory=/home/dev/tmp -Dsina.forms.directory=/home/dev/tmp -Dsina.scheduler.rate=30000 -Dsina.alfresco.url=http://ares:8181/alfresco/api/-default-/public/cmis/versions/1.1/browser -Dsina.alfresco.site=/Sitios/sina-suite-dev/documentLibrary -Dsina.alfresco.repository=-default- -Dsina.alfresco.user=admin -Dsina.alfresco.password=admin -Dsina.cas.server.host=sina-1.alfatecsistemas.es -Dsina.cas.server.port=9444 -Dsina.cas.service.host=sina-1 -Dsina.cas.service.port=9443 -Dsina.cas.service.appname=sina-suite -Dsina.forms.pdf.files.directory=/home/dev/tmp -Dsina.fileupload.size=250000000 -Dsina.farhos.url.login=https://www.detots.com/farhos/token?usuario=%s&clave=%s -Dsina.farhos.url.component=https://www.detots.com/farhos/5/?vista=%s&paciente=%s&episodio=%s&token=%s -Dsina.nurse.profile.id=1 -Dsina.farhos.url.logout=https://www.detots.com/farhos/token?%s"
But when I try to start Tomcat I get these errors:
/home/dev/tomcat/bin/catalina.sh: line 434: -Dsina.farhos.url.logout=https://www.detots.com/farhos/token?%s: No such file or directory
/home/dev/tomcat/bin/catalina.sh: line 434: -Dsina.nurse.profile.id=1: command not found
/home/dev/tomcat/bin/catalina.sh: line 434: -Dsina.farhos.url.component=https://www.detots.com/farhos/5/?vista=%s: No such file or directory
/home/dev/tomcat/bin/catalina.sh: line 434: -Dsina.nurse.profile.id=1: command not found
Please help

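Try surrounding the URLs with single quotes. The errors suggest that the & characters in the unquoted URLs are treated by the shell as command separators when catalina.sh evaluates the options, so the text after each & is run as a separate command.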
export CATALINA_OPTS="$CATALINA_OPTS \
-Dsina.elasticsearch.cluster.nodes=sina-1:9300 \
-Dsina.elasticsearch.cluster.name=sinasuite-dev \
-Dsina.rabbitmq.host=sina-1 \
-Dsina.rabbitmq.port=5672 \
-Dsina.rabbitmq.user=guest \
-Dsina.rabbitmq.password=guest \
-Dsina.images.directory=/home/dev/tmp \
-Dsina.forms.directory=/home/dev/tmp \
-Dsina.scheduler.rate=30000 \
-Dsina.alfresco.url='http://ares:8181/alfresco/api/-default-/public/cmis/versions/1.1/browser' \
-Dsina.alfresco.site=/Sitios/sina-suite-dev/documentLibrary \
-Dsina.alfresco.repository=-default- \
-Dsina.alfresco.user=admin \
-Dsina.alfresco.password=admin \
-Dsina.cas.server.host=sina-1.alfatecsistemas.es \
-Dsina.cas.server.port=9444 \
-Dsina.cas.service.host=sina-1 \
-Dsina.cas.service.port=9443 \
-Dsina.cas.service.appname=sina-suite \
-Dsina.forms.pdf.files.directory=/home/dev/tmp \
-Dsina.fileupload.size=250000000 \
-Dsina.farhos.url.login='https://www.detots.com/farhos/token?usuario=%s&clave=%s' \
-Dsina.farhos.url.component='https://www.detots.com/farhos/5/?vista=%s&paciente=%s&episodio=%s&token=%s' \
-Dsina.nurse.profile.id=1 \
-Dsina.farhos.url.logout='https://www.detots.com/farhos/token?%s'"
FYI, it is recommended to set CATALINA_OPTS in bin/setenv.sh.
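To see why the single quotes matter, the splitting can be approximated with Python's shlex module, which tokenizes text roughly the way a POSIX shell does. This is a sketch, not a reproduction of what catalina.sh does: shell_like_tokens is a hypothetical helper, the example.com URL is a placeholder, and combining punctuation_chars with whitespace_split requires Python 3.8 or later.

```python
import shlex


def shell_like_tokens(text):
    """Hypothetical helper: split text roughly like a POSIX shell would."""
    lexer = shlex.shlex(text, posix=True, punctuation_chars=True)
    # Split on whitespace and shell punctuation such as '&' only.
    lexer.whitespace_split = True
    return list(lexer)


# Placeholder options modelled on the ones in the question.
unquoted = "-Durl=https://example.com/token?usuario=%s&clave=%s -Did=1"
quoted = "-Durl='https://example.com/token?usuario=%s&clave=%s' -Did=1"

for token in shell_like_tokens(unquoted):
    print("unquoted:", token)
for token in shell_like_tokens(quoted):
    print("quoted:  ", token)
```

In the unquoted form the & comes out as its own token, which a shell would read as a command separator. In the quoted form the whole URL stays inside a single option.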
