insert into l1.portal_files_download_dashboard_tracker_final(files_to_download, tax_id, site_code, health_plan, file_type, health_plan_lob, market, dlp_to_download, provider_reporting_option, tat_for_availibility, tat_for_innovaccer_download, tat_to_upload_files_sftp, signature_job_present, file_month, file_name, upload_date, download_status, upload_status)
select distinct b.files_to_download, b.tax_id, b.site_code, b.health_plan, b.file_type, b.health_plan_lob, b.market, b.dlp_to_download, b.provider_reporting_option, b.tat_for_availibility, b.tat_for_innovaccer_download, b.tat_to_upload_sftp, b.signature_job_present, initcap(to_char(now(), 'mon-yyyy')) as file_month, a.file_name, a.upload_date as upload_date, '' as download_status, '' as upload_status from l1.portal_downloaded_files a right join l1.dashboard_files_tracker_csv b on lower(split_part(a.file_name,'.',1)) like '%' || lower(split_part(b.files_to_download,'.',1)) || '%' where tat_for_availibility ilike '%9th of the Month%';
I am facing an error at the end of this query. I am running it from a Python script with psycopg2 (PostgreSQL), using cur.execute(query).
Can someone tell me why I am facing this error?
In PostgreSQL, || is the string-concatenation operator. You should use OR in your join condition; you have also misplaced the LIKE.
Below is the corrected query:
insert into l1.portal_files_download_dashboard_tracker_final(files_to_download, tax_id, site_code, health_plan, file_type, health_plan_lob, market, dlp_to_download, provider_reporting_option, tat_for_availibility, tat_for_innovaccer_download, tat_to_upload_files_sftp, signature_job_present, file_month, file_name, upload_date, download_status, upload_status) select distinct b.files_to_download, b.tax_id, b.site_code, b.health_plan, b.file_type, b.health_plan_lob, b.market, b.dlp_to_download, b.provider_reporting_option, b.tat_for_availibility, b.tat_for_innovaccer_download, b.tat_to_upload_sftp, b.signature_job_present, initcap(to_char(now(), 'mon-yyyy')) as file_month, a.file_name, a.upload_date as upload_date, '' as download_status, '' as upload_status from l1.portal_downloaded_files a right join l1.dashboard_files_tracker_csv b on lower(split_part(a.file_name,'.',1)) like '%' OR lower(split_part(b.files_to_download,'.',1)) like '%' where tat_for_availibility ilike '%9th of the Month%';
Related
Can I please get some assistance: why am I receiving this error and how can I fix it?
ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp8524.111": No space left on device
SQL state: 53100
I have 150 GB of free space on my local disk, and I still get this error saying "no space left on device, SQL state: 53100".
I got this error in pgAdmin 4 version 6.9, on PostgreSQL 12.
I get the error when I run the following query:
INSERT INTO "SIRENE"."SIRENE_ODS"."DIM_CAT_NAF_ENTR_TMP" (
"ID_SIREN",
"ID_NIC",
"CD_DS_CAT_JURI_NIV1_ENTR",
"CD_DS_CAT_JURI_NIV2_ENTR",
"CD_DS_CAT_JURI_NIV4_ENTR",
"CD_DS_SEC_NAF_ENTR",
"CD_DS_DIVISION_NAF_ENTR",
"CD_DS_ACT_PRIN_ENTR"
)
SELECT
ENTR."ID_SIREN" AS ID_SIREN,
ETAB."ID_NIC" AS ID_NIC,
CASE WHEN CJUR3."CD_CAT_NIV1"='' OR CJUR3."CD_CAT_NIV1" IS NULL THEN 'ZZ - Non renseignée' ELSE CONCAT(replace(CJUR3."CD_CAT_NIV1", ' ', ''), ' - ', CJUR3."DS_CAT_NIV1") END AS CD_DS_CAT_JURI_NIV1_ENTR,
CASE WHEN CJUR2."CD_CAT_NIV2"='' OR CJUR2."CD_CAT_NIV2" IS NULL THEN 'ZZ - Non renseignée' ELSE CONCAT(replace(CJUR2."CD_CAT_NIV2", ' ', ''), ' - ', CJUR2."DS_CAT_NIV2") END AS CD_DS_CAT_JURI_NIV2_ENTR,
CASE WHEN CJUR1."CD_CAT_NIV4"='' OR CJUR1."CD_CAT_NIV4" IS NULL THEN 'ZZ - Non renseignée' ELSE CONCAT(replace(CJUR1."CD_CAT_NIV4", ' ', ''), ' - ', CJUR1."DS_CAT_NIV4") END AS CD_DS_CAT_JURI_NIV4_ENTR,
CASE WHEN RSE_ENTR3."CD_SECTION_NAF"='' OR RSE_ENTR3."CD_SECTION_NAF" IS NULL THEN 'ZZ - Non renseignée' ELSE CONCAT(RSE_ENTR3."CD_SECTION_NAF", ' - ', RSE_ENTR3."DS_SECTION_NAF") END AS CD_DS_NAF1_ENTR,
CASE WHEN RSE_ENTR1."CD_DIVISION_NAF"='' OR RSE_ENTR1."CD_DIVISION_NAF" IS NULL THEN 'ZZ - Non renseignée' ELSE CONCAT(RSE_ENTR1."CD_DIVISION_NAF", ' - ', RSE_ENTR1."DS_DIVISION_NAF") END AS CD_DS_NAF2_ENTR,
CASE WHEN RSE_ENTR2."CD_ACT_PRIN_ENTR"='' OR RSE_ENTR2."CD_ACT_PRIN_ENTR" IS NULL THEN 'ZZ - Non renseignée' ELSE CONCAT(RSE_ENTR2."CD_ACT_PRIN_ENTR", ' - ', RSE_ENTR2."DS_ACT_PRIN_ENTR") END AS CD_DS_NAF4_ENTR
FROM "SIRENE"."SIRENE_ODS"."SR_ETAB" ETAB
INNER JOIN "SIRENE"."SIRENE_ODS"."SR_ENTR" ENTR ON (ETAB."ID_SIREN" = ENTR."ID_SIREN")
LEFT JOIN "SIRENE"."SIRENE_REF"."REF_ACT_NAF_ENTR" RSE_ENTR1 ON (SUBSTR(ENTR."CD_NAF4_ENTR",1,2) = RSE_ENTR1."CD_DIVISION_NAF")
LEFT JOIN "SIRENE"."SIRENE_REF"."REF_CAT_JUR_ENTR" CJUR1 ON (ENTR."CD_CAT_NIV4_ENTR" = CJUR1."CD_CAT_NIV4")
LEFT JOIN "SIRENE"."SIRENE_REF"."REF_ACT_NAF_ENTR" RSE_ENTR2 ON (ENTR."CD_NAF4_ENTR" = RSE_ENTR2."CD_ACT_PRIN_ENTR")
LEFT JOIN "SIRENE"."SIRENE_REF"."REF_CAT_JUR_ENTR" CJUR2 ON (SUBSTR(ENTR."CD_CAT_NIV4_ENTR",1,2) = CJUR2."CD_CAT_NIV2")
LEFT JOIN "SIRENE"."SIRENE_REF"."REF_ACT_NAF_ENTR" RSE_ENTR3 ON (ENTR."CD_NAF_NIV1" = RSE_ENTR3."CD_SECTION_NAF")
LEFT JOIN "SIRENE"."SIRENE_REF"."REF_CAT_JUR_ENTR" CJUR3 ON (SUBSTR(ENTR."CD_CAT_NIV4_ENTR",1,1) = CJUR3."CD_CAT_NIV1")
This query is supposed to insert 24 million rows into a new table. It took just 11 minutes on my friend's machine, but for me it ran for more than 4 hours and ended with the error: SQL Error [53100]: could not write to file "base/pgsql_tmp/pgsql_tmp8524.111": No space left on device.
And this happened despite tuning PostgreSQL for performance!
Any help is appreciated.
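One thing worth checking when debugging this: the file in the error message lives under PostgreSQL's temporary-file area, so it is the disk holding the server's data directory (or its temp tablespace) that fills up, which need not be the "local disk" you checked. A sketch of real server settings and views that can help narrow it down (the 50GB value is only an example; changing temp_file_limit may require superuser or SET privilege):

```sql
-- Where temporary files are written (empty means the default location
-- inside the data directory)
SHOW temp_tablespaces;

-- Cap per-session temp-file usage so a runaway sort/hash join fails
-- early instead of filling the disk
SET temp_file_limit = '50GB';

-- How much temp-file space each database has used since the last
-- statistics reset
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database;
```

If temp_bytes is large for your database, the query plan is spilling sorts or hash joins to disk; comparing work_mem and the query plan with your friend's machine would be the next step.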
I need to identify numbers near keywords such as number:, no:, etc.
Here is what I tried:
import re
matchstring="Sales Quote"
string_lst = ['number:', 'No:','no:','number','No : ']
x=""" Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""
documentnumber= re.findall(r"(?:(?<="+matchstring+ '|'.join(string_lst)+r')) [\w\d-]',x,flags=re.IGNORECASE)
print(documentnumber)
Required solution: 36886DJ9, 89745DFD, 7964KL, 879654DF, 9874656LD
Is there any solution?
Actually your solution is very close. You just need some missing parentheses and a check for optional whitespace:
documentnumber = re.findall(r"(?:(?<=" + matchstring + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
However, this won't match the last one (9874656LD) because of the missing whitespace between "sales" and "quote". If you want to build it the same way as the rest of the pattern, replace the lookbehind with a non-capturing group and join the words with \s?:
documentnumber = re.findall(r"(?:(?:" + r"\s?".join(matchstring.split()) + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
Output:
['36886DJ9', '89745DFD', '7964KL', '879654DF', '9874656LD']
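For reference, here is the second variant assembled into a self-contained script (same strings and keyword list as in the question):

```python
import re

matchstring = "Sales Quote"
string_lst = ['number:', 'No:', 'no:', 'number', 'No : ']
x = """ Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""

# "Sales\s?Quote" tolerates the missing space in "salesquote"
pattern = (r"(?:(?:" + r"\s?".join(matchstring.split())
           + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)')
documentnumber = re.findall(pattern, x, re.IGNORECASE)
print(documentnumber)
# ['36886DJ9', '89745DFD', '7964KL', '879654DF', '9874656LD']
```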
I am fairly new to PySpark, and I am trying to do some text pre-processing with it.
I have columns Name and ZipCode in a Spark data frame new_df. The column Name contains values like WILLY:S MALMÖ, EMPORIA, and ZipCode contains values like 123 45, which is a string too. What I want to do is remove characters like :, , etc. and remove the space inside ZipCode.
I tried the following, but nothing seems to work:
new_df = new_df.withColumn('Name', sfn.regexp_replace('Name', r',' , ' '))
new_df = new_df.withColumn('ZipCode', sfn.regexp_replace('ZipCode', r' ' , ''))
I tried other suggestions from SO and other websites too. Nothing seems to work.
Use [,:] to match , or : and replace with a space ' ' in the Name column (inside a character class the | is a literal pipe character, so it isn't needed), and for ZipCode search for whitespace and replace it with the empty string ''.
Example:
new_df.show(10, False)
#+----------------------+-------+
#|Name                  |ZipCode|
#+----------------------+-------+
#|WILLY:S MALMÖ, EMPORIA|123 45 |
#+----------------------+-------+
# needs: from pyspark.sql.functions import regexp_replace
new_df.withColumn('Name', regexp_replace('Name', r'[,:]', ' ')).\
    withColumn('ZipCode', regexp_replace('ZipCode', r' ', '')).\
    show(10, False)
#or
new_df.withColumn('Name', regexp_replace('Name', r'[,:]', ' ')).\
    withColumn('ZipCode', regexp_replace('ZipCode', r'\s+', '')).\
    show(10, False)
#+----------------------+-------+
#|Name                  |ZipCode|
#+----------------------+-------+
#|WILLY S MALMÖ  EMPORIA|12345  |
#+----------------------+-------+
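The same replacements can be sanity-checked outside Spark with Python's re module (regexp_replace uses Java regex, but character classes and \s behave the same way for this case):

```python
import re

name = 'WILLY:S MALMÖ, EMPORIA'
zipcode = '123 45'

# [,:] matches a comma or a colon; each hit becomes a space
cleaned_name = re.sub(r'[,:]', ' ', name)
# \s+ removes any run of whitespace
cleaned_zip = re.sub(r'\s+', '', zipcode)

print(cleaned_name)  # note the double space left where ", " used to be
print(cleaned_zip)
```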
I have a list of strings in Python like this:
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00) from these strings.
So far I have used a for loop like the one below, but it's very time-consuming. Is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
    if files[11] == '_':
        year = files[12:16]
        month = files[17:19]
        day = files[20:22]
        hour = files[23:25]
        minute = files[25:27]
        second = files[27:29]
    else:
        year = files[11:15]
        month = files[16:18]
        day = files[19:21]
        hour = files[22:24]
        minute = files[24:26]
        second = files[26:28]
    tindex = pd.date_range(start='%d-%02d-%02d %02d:%02d:%02d' % (int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd-last '-'; no need for the if-else case):
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
def find_second_last(text, pattern):
    return text.rfind(pattern, 0, text.rfind(pattern))

for files in filesall:
    start = find_second_last(files, '-') - 4  # from the yyyy- part
    timepart = files[start:start + 17].replace("T", " ")
    # insert the two ':'s
    timepart = timepart[:13] + ':' + timepart[13:15] + ':' + timepart[15:]
    # print(timepart)
    tindex = pd.date_range(start=timepart, periods=60, freq='10S')
Instead of hard-coding files[11], go for the last or 2nd-last index of '_' and then apply your code, so you don't have to write the same code twice. Or use a regex to parse the string.
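A sketch of the regex route (the pattern is mine, not from the thread): pull the timestamp out of each name and let strptime do the parsing; the resulting datetime can be passed straight to pd.date_range as before.

```python
import re
from datetime import datetime

filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
            'AM_B0_D10.3_2017-03-17T110000.flac.h5',
            'AM_B0_D4.4_2016-08-05T151000.flac.h5']

stamps = []
for f in filesall:
    # matches e.g. '2016-04-01T010000' regardless of the D-field's width
    m = re.search(r'\d{4}-\d{2}-\d{2}T\d{6}', f)
    stamps.append(datetime.strptime(m.group(), '%Y-%m-%dT%H%M%S'))

print(stamps[-1])  # 2016-08-05 15:10:00
```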
I want to execute this query: filtering the data on the 'Gas Oil/ Diesel Oil - Production' transaction where the year is greater than 2000. First I tried to execute my query with the & operand and vectorized column selection, without an if statement, but it did not work. Then I found the query below, but this time I could not get any output. What do you think is wrong with my query? Thanks...
if all(b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production') and all(b[b['Year'] > 2000]):
    print(b)
else:
    print('did not find any values')
what's wrong with:
b.loc[(b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production') & (b['Year'] >2000)]
?
You can first create a mask with contains and then create the subset using boolean indexing:
print(b[(b['Commodity - Transaction'].str.contains('Gas Oil/ Diesel Oil - Production')) &
        (b['Year'] > 2000)])
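A minimal, self-contained illustration of the boolean-indexing answer (the toy frame below is mine; only the column names come from the question):

```python
import pandas as pd

# Toy stand-in for b with just the two columns from the question
b = pd.DataFrame({
    'Commodity - Transaction': ['Gas Oil/ Diesel Oil - Production',
                                'Gas Oil/ Diesel Oil - Production',
                                'Kerosene - Production'],
    'Year': [1999, 2005, 2010],
})

# Each condition yields a boolean Series; & combines them element-wise,
# which is why the parentheses around each condition are required
mask = (b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production') & (b['Year'] > 2000)
print(b.loc[mask])
```

Unlike the all(...) version in the question, this keeps the matching rows instead of collapsing the conditions to a single True/False.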