My code is
cd /home/XXX/db-new
while read -r line; do
data=$(echo $line | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' | awk '{gsub(/\"/,"")};1' | tr -d \'\" )
d2=$(echo $data | awk -F, '{print $2}')
d3=$(echo $data | awk -F, '{print $3}')
d17=$(echo $data | awk -F, '{print $17}')
d4=$(echo $data | awk -F, '{print $4","$5","$6","$7","$8","$9","$10","$11","$12","$13","$14","$15","$16","$17","$18","$19","$20","$21","$22","$23","$24","$25","$26","$27","$28","$29","$30","$31","$32","$33","$34","$35","$36","$37","$38","$39","$40","$45","$46","$47","$48","$49","$50","$51","$52","$53","$54","$55","$56","$57","$58}')
d1=$d2+$d3
d59=$(echo $d2 | cut -d "." -f 2,3)
d60=$(echo $data | awk -F, '{print $19}' | awk 'BEGIN{FS=OFS=","} {gsub(/[[:punct:] ]/,"",$1)} 1' | sed 's/[^0-9]*//g' )
echo $d1,$d2,$d4,$d59,$d17,$d60 >> abc.csv
done < /home/XXX/db-new/2021-09-04.csv
/home/domainsanalytics/db-new/2021-09-04.csv is very big, so I am including only the first 3 lines:
head -3 /home/domainsanalytics/db-new/2021-09-04.csv
"num","domain_name","query_time","create_date","update_date","expiry_date","domain_registrar_id","domain_registrar_name","domain_registrar_whois","domain_registrar_url","registrant_name","registrant_company","registrant_address","registrant_city","registrant_state","registrant_zip","registrant_country","registrant_email","registrant_phone","registrant_fax","administrative_name","administrative_company","administrative_address","administrative_city","administrative_state","administrative_zip","administrative_country","administrative_email","administrative_phone","administrative_fax","technical_name","technical_company","technical_address","technical_city","technical_state","technical_zip","technical_country","technical_email","technical_phone","technical_fax","billing_name","billing_company","billing_address","billing_city","billing_state","billing_zip","billing_country","billing_email","billing_phone","billing_fax","name_server_1","name_server_2","name_server_3","name_server_4","domain_status_1","domain_status_2","domain_status_3","domain_status_4"
"1","accounting-fwppool.com","2021-09-04 00:53:04","2021-08-10","2021-08-10","2022-08-10","303","PDR Ltd. d/b/a PublicDomainRegistry.com","whois.publicdomainregistry.com","http://www.publicdomainregistry.com","Micael brown","","4941 Maui Cir Huntington Beach, CA 92649","CA","CA","92649","United States","michbrown7654gh#gmail.com","+1.9169136369","","Micael brown","","4941 Maui Cir Huntington Beach, CA 92649","CA","CA","92649","United States","michbrown7654gh#gmail.com","+1.9169136369","","Micael brown","","4941 Maui Cir Huntington Beach, CA 92649","CA","CA","92649","United States","michbrown7654gh#gmail.com","+1.9169136369","","","","","","","","","","","","ns1.verification-hold.suspended-domain.com","ns2.verification-hold.suspended-domain.com","","","clientTransferProhibited","","",""
"2","xjava.com","2021-09-04 00:53:11","2001-03-06","2021-03-12","2022-03-06","472","Dynadot, LLC","whois.dynadot.com","http://www.dynadot.com","Super Privacy Service LTD c/o Dynadot","","PO Box 701","San Mateo","California","94401","United States","xjava.com#superprivacyservice.com","+1.6505854708","","Super Privacy Service LTD c/o Dynadot","","PO Box 701","San Mateo","California","94401","United States","xjava.com#superprivacyservice.com","+1.6505854708","","Super Privacy Service LTD c/o Dynadot","","PO Box 701","San Mateo","California","94401","United States","xjava.com#superprivacyservice.com","+1.6505854708","","","","","","","","","","","","ns1.sedoparking.com","ns2.sedoparking.com","","","clientTransferProhibited","","",""
My code gives me good results, but $d59, $d17 and $d60 come out on a new line.
$d59 is just the TLD,
$d17 is a repeat of the country,
$d60 is the phone number without special characters.
All I want is everything in one row.
My output is
domain_name+query_time domain_name create_date update_date expiry_date domain_registrar_id domain_registrar_name domain_registrar_whois domain_registrar_url registrant_name registrant_company registrant_address registrant_city registrant_state registrant_zip registrant_country registrant_email registrant_phone registrant_fax administrative_name administrative_company administrative_address administrative_city administrative_state administrative_zip administrative_country administrative_email administrative_phone administrative_fax technical_name technical_company technical_address technical_city technical_state technical_zip technical_country technical_email technical_phone technical_fax billing_state billing_zip billing_country billing_email billing_phone billing_fax name_server_1 name_server_2 name_server_3 name_server_4 domain_status_1 domain_status_2 domain_status_3 domain_status_4
domain_name registrant_country
accounting-fwppool.com+2021-09-04 00:53:04 accounting-fwppool.com 10/08/21 10/08/21 10/08/22 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.916913637 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.916913637 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.916913637 ns1.verification-hold.suspended-domain.com ns2.verification-hold.suspended-domain.com clientTransferProhibited
com United States 19169136369
xjava.com+2021-09-04 00:53:11 xjava.com 06/03/01 12/03/21 06/03/22 472 Dynadot LLC whois.dynadot.com http://www.dynadot.com Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.650585471 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.650585471 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.650585471 ns1.sedoparking.com ns2.sedoparking.com clientTransferProhibited
com United States 16505854708
accuratetactics.com+2021-09-04 00:53:14 accuratetactics.com 26/08/20 30/08/21 26/08/21 1660 Domainshype.com Inc. whois.domainshype.com http://www.domainshype.com This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.339222513 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.339222513 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.339222513 1.78183928 dns7.parkpage.foundationapi.com dns8.parkpage.foundationapi.com OK
com United States 13392225132
vej.com+2021-09-04 00:53:16 vej.com 16/09/99 31/08/21 16/09/23 128 DomainRegistry.com Inc. nswhois.domainregistry.com http://www.domainregistry.com Scottcraft Label Co. Scottcraft Label Co. c/o Admin Svcs. PO Box 145 Marlton NJ 8053 United States itadmin#scottcraftlabel.com 1.215870212 IT Admin MS 445 Scottcraft Label Co. c/o Admin Svcs. PO Box 145 Marlton NJ 8053 United States itadmin#scottcraftlabel.com 1.215870212 IT Admin MS 445 Scottcraft Label Co. c/o Admin Svcs. PO Box 145 Marlton NJ 8053 United States itadmin#scottcraftlabel.com 1.215870212 colohost1.domainregistry.com cs03.domainregistry.com clientDeleteProhibited clientTransferProhibited clientUpdateProhibited
com United States 12158702120
accutekware.com+2021-09-04 00:53:24 accutekware.com 26/08/03 26/08/21 26/08/21 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com R Benedict Accutek Systems Inc PO Box 591125 Houston Texas 77259 United States rbeny09#hotmail.com 1.281461701 R Benedict Accutek Systems Inc PO Box 591125 Houston Texas 77259 United States rbeny09#hotmail.com 1.281461701 R Benedict Accutek Systems Inc PO Box 591125 Houston Texas 77259 United States rbeny09#hotmail.com 1.281461701 dns10.parkpage.foundationapi.com dns11.parkpage.foundationapi.com clientTransferProhibited
com United States 12814617007
crmxon.com+2021-09-04 00:53:27 crmxon.com 04/09/20 04/11/20 04/09/21 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com GDPR Masked GDPR Masked GDPR Masked GDPR Masked Newcastleupon Tyne(Cityof) GDPR Masked United Kingdom gdpr-masking#gdpr-masked.com GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked gdpr-masking#gdpr-masked.com GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked gdpr-masking#gdpr-masked.com GDPR Masked GDPR Masked ns1.edagent.com ns2.edagent.com ns3.edagent.com ns4.edagent.com clientTransferProhibited
com United Kingdom
Expected output
domain_name+query_time domain_name create_date update_date expiry_date domain_registrar_id domain_registrar_name domain_registrar_whois domain_registrar_url registrant_name registrant_company registrant_address registrant_city registrant_state registrant_zip registrant_country registrant_email registrant_phone registrant_fax administrative_name administrative_company administrative_address administrative_city administrative_state administrative_zip administrative_country administrative_email administrative_phone administrative_fax technical_name technical_company technical_address technical_city technical_state technical_zip technical_country technical_email technical_phone technical_fax billing_state billing_zip billing_country billing_email billing_phone billing_fax name_server_1 name_server_2 name_server_3 name_server_4 domain_status_1 domain_status_2 domain_status_3 domain_status_4 domain_name registrant_country
accounting-fwppool.com+2021-09-04 00:53:04 accounting-fwppool.com 10/08/21 10/08/21 10/08/22 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.91691364 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.91691364 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.91691364 ns1.verification-hold.suspended-domain.com ns2.verification-hold.suspended-domain.com clientTransferProhibited com United States 1.9169E+10
xjava.com+2021-09-04 00:53:11 xjava.com 06/03/01 12/03/21 06/03/22 472 Dynadot LLC whois.dynadot.com http://www.dynadot.com Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.65058547 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.65058547 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.65058547 ns1.sedoparking.com ns2.sedoparking.com clientTransferProhibited com United States 1.6506E+10
accuratetactics.com+2021-09-04 00:53:14 accuratetactics.com 26/08/20 30/08/21 26/08/21 1660 Domainshype.com Inc. whois.domainshype.com http://www.domainshype.com This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.33922251 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.33922251 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.33922251 1.78183928 dns7.parkpage.foundationapi.com dns8.parkpage.foundationapi.com OK com United States 1.3392E+10
I suggest a single awk script to process all the data.
Starting with this:
script.awk
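BEGIN { FS = "\",\"|\"[[:space:]]*$|^[[:space:]]*\""; OFS = "," }
{
    # The leading quote makes $1 empty, so "num" is $2 and every
    # original column N is field N+1 here.
    # Drop commas inside fields so they cannot break the CSV output.
    for (i = 2; i < NF; i++) gsub(/,/, "", $i);
    arr[1] = $3 "+" $4;   # domain_name+query_time
    arr[2] = $3;          # domain_name
    arr[4] = $5;          # create_date
    # right append to arr[4] fields 6-41 (update_date .. technical_fax)
    for (i = 6; i <= 41; i++) arr[4] = arr[4] "," $i;
    # right append to arr[4] fields 46-59 (billing_state .. domain_status_4)
    for (i = 46; i <= 59; i++) arr[4] = arr[4] "," $i;
    arr[17] = $18;        # registrant_country
    arr[59] = $3;
    # keep only the TLD: remove everything up to the first "."
    sub(/^[^.]*\./, "", arr[59]);
    # keep only the digits of the registrant phone (20th field)
    arr[60] = $20;
    gsub(/[^0-9]/, "", arr[60]);
    # output to stdout
    print arr[1], arr[2], arr[4], arr[59], arr[17], arr[60];
}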
Running:
awk -f script.awk input.csv > output.csv
Did not test since the sample data did not contain numeric values.
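For comparison, here is a minimal Python sketch of the same transformation using the standard csv module, which handles quoted fields and embedded commas for you. The names transform and convert are my own, and it assumes every row has the 58 columns shown in the question; treat it as an illustration, not a drop-in replacement.

```python
import csv
import re


def transform(row):
    """Build one output row from a parsed 58-column input row (0-based list)."""
    # Drop commas inside fields, as the original shell pipeline does.
    clean = [field.replace(",", "") for field in row]
    domain, query_time = clean[1], clean[2]
    kept = clean[3:40] + clean[44:58]  # columns 4-40 and 45-58 (1-based)
    # Everything after the first dot; the whole name if there is no dot.
    tld = domain.split(".", 1)[1] if "." in domain else domain
    phone_digits = re.sub(r"[^0-9]", "", clean[18])  # registrant_phone
    country = clean[16]  # registrant_country
    return [f"{domain}+{query_time}", domain, *kept, tld, country, phone_digits]


def convert(src, dst):
    """Read CSV text from src and write the transformed rows to dst."""
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(transform(row))
```

To use it, open the input and output files with newline="" and pass both file objects to convert. Because each output row is written in a single writerow call, the TLD, country and phone digits end up on the same line as the rest of the record.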
I am trying to remove a custom list of stop words, but it's not working.
desc = pd.DataFrame(description, columns =['description'])
print(desc)
Which gives the following results
description
188693 The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff
11443 According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa...
... ...
2732 The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit...
[9875 rows x 1 columns]
I found the following code here, but it doesn't seem to work
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
desc.assign(new_desc=desc.replace(dict(string={pat: ''}), regex=True))
Which produces the following results
description new_desc
188693 The Kentucky Cannabis Company and Bluegrass He... The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff Ohio County Sheriff
11443 According to new reports from federal authorit... According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou... KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa... The crew of Insight, WCNY's weekly public affa...
... ... ...
2732 The Arkansas Supreme Court on Thursday cleared... The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ... Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres... Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan... Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit... SAN TAN VALLEY — Two men have been charged wit...
9875 rows × 2 columns
As you can see, the stop words weren't removed. Any help you can provide would be greatly appreciated.
Handle the case and simplify the pattern:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join(remove_words)
desc['new_desc'] = desc.description.str.lower().replace(pat,'', regex=True)
description new_desc
0 The Kentucky Cannabis Company and Bluegrass He... the kentucky company and bluegrass he...
1 Ohio County Sheriff ohio county sheriff
2 According to new reports from federal authorit... according to new reports from federal authorit...
3 KANSAS CITY, Mo. (AP)The Chiefs will be mariju... kansas city, mo. (ap)the chiefs will be witho...
4 The crew of Insight, WCNY's weekly public affa... the crew of insight, wcny's weekly public affa...
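If you would rather keep the original capitalisation of the remaining text, one option is a case-insensitive pattern with word boundaries. The sketch below uses only the standard re module; strip_words is a hypothetical helper name, and collapsing the leftover double spaces is my own addition.

```python
import re

remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]

# \b keeps "hemp" from matching inside longer words such as "Hempstead".
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, remove_words)) + r")\b",
    flags=re.IGNORECASE,
)


def strip_words(text):
    """Remove the stop words regardless of case, then tidy the whitespace."""
    cleaned = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

On the DataFrame from the question this could be applied with desc['description'].map(strip_words).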
My dataset looks like this...
State Close Date Probability Highest Prob/State
WA 12/31/2016 50% FALSE
WA 12/19/2016 80% FALSE
WA 10/15/2016 80% TRUE
My objective is to build a formula to populate the right-most column. The formula should assess Close Dates and Probabilities within each state. First, it should select the highest probability, then it should select the nearest close date if there is a tie on probability (as in the example). For that record, it should read "TRUE".
I assume this would include a MAX IF statement but haven't been able to get it to work.
Here is a more robust set of data I'm working with. It may actually be easier to first find the highest probability within each Region then select the minimum (oldest) date if there is a tie on probability. This too will serve my purposes.
Region Forecast Close Date Probability (%)
Okeechobee FL 6/27/2016 90
Okeechobee West FL 7/1/2016 40
Albany GA 3/11/2016 100
Emerald Coast FL 6/30/2016 60
Emerald Coast FL 10/1/2016 40
Cullman_Hartselle TN 4/30/2016 10
North MS 10/1/2016 25
Roanoke VA 8/31/2016 25
Roanoke VA 8/1/2016 40
Gardena CA 6/1/2016 80
Gardena CA 6/1/2016 80
Lomita-Harbor City 6/30/2016 60
Lomita-Harbor City 6/30/2016 0
Lomita-Harbor City 6/30/2016 40
Eastern NC 6/30/2016 60
Northwest NC 9/16/2016 10
Fort Collins_Greeley CO 3/1/2016 100
Northwest OK 6/30/2016 100
Southwest MO 7/29/2016 90
Northern NH-VT 3/1/2016 20
South DE 12/1/2016 0
South DE 12/1/2016 20
Kingston NY 12/30/2016 5
Longview WA 11/30/2016 5
North DE 12/1/2016 20
North DE 12/1/2016 0
Salt Lake City UT 8/31/2016 20
Idaho Panhandle 8/26/2016 0
Bridgeton_Salem NJ 7/1/2016 25
Bridgeton_Salem NJ 7/1/2016 65
Layton_Ogden UT 3/25/2016 5
Central OR 6/30/2016 10
The following Array formula should work:
=(ABS(B2-$F$2)=MIN(IF(($A$2:$A$33=A2)*(C2=MAX(IF($A$2:$A$33=A2,$C$2:$C$33))),ABS($B$2:$B$33-$F$2))))*(C2=MAX(IF($A$2:$A$33=A2,$C$2:$C$33)))>0
Because this is an array formula, confirm it with Ctrl-Shift-Enter when exiting edit mode. If done properly, Excel will put {} around the formula.
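If it helps to see the logic outside Excel, here is a rough Python sketch of what the formula checks for each row: the row must have the highest probability in its state and, among those rows, the close date nearest to a reference date. It assumes $F$2 holds that reference date; flag_best and the sample reference date are my own placeholders.

```python
from datetime import date


def flag_best(rows, reference):
    """rows: list of (state, close_date, probability). Returns a list of bools."""
    flags = []
    for state, close, prob in rows:
        peers = [r for r in rows if r[0] == state]
        # Highest probability within the state.
        top = max(p for _, _, p in peers)
        # Smallest distance to the reference date among the top-probability rows.
        nearest = min(abs((d - reference).days) for _, d, p in peers if p == top)
        flags.append(prob == top and abs((close - reference).days) == nearest)
    return flags


rows = [
    ("WA", date(2016, 12, 31), 50),
    ("WA", date(2016, 12, 19), 80),
    ("WA", date(2016, 10, 15), 80),
]
print(flag_best(rows, date(2016, 9, 1)))
```

With the three WA rows from the question and any reference date before 10/15/2016, only the last row is flagged.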
Edit
Added @tigeravatar's suggestion to avoid volatile functions.
I think this is OK now but needs to be checked against the more complete set of data provided by OP.
It counts:
(1) Any rows with same state but higher probability
(2) Any rows with same state and probability, in the future (or present) and nearer to today's date
(3) Any rows with same state and probability, in the past and nearer to today's date.
If all these are zero, you should have the right one.
=COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,">"&$C2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,"<"&$G$2+IF($B2>=$G$2,DATEDIF($G$2,$B2,"d"),DATEDIF($B2,$G$2,"d")),$B$2:$B$100,">="&$G$2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,">"&$G$2-IF($B2>=$G$2,DATEDIF($G$2,$B2,"d"),DATEDIF($B2,$G$2,"d")),$B$2:$B$100,"<"&$G$2)
=0
If the dates are all in the future, it can be simplified a lot:
=COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,">"&$C2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,"<"&$G$2+DATEDIF($G$2,$B2,"d"))
=0
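The counting idea can be sketched in Python as well, which may make it easier to check against the fuller data set. This assumes $G$2 holds today's date; beaten_by is a hypothetical name, and the two date-based COUNTIFS terms are folded into one absolute-distance comparison.

```python
from datetime import date


def beaten_by(row, rows, today):
    """Count rows in the same state that outrank the given row."""
    state, close, prob = row
    dist = abs((close - today).days)
    # (1) same state, strictly higher probability
    higher = sum(1 for s, _, p in rows if s == state and p > prob)
    # (2) and (3) same state and probability, strictly nearer to today
    nearer = sum(
        1
        for s, d, p in rows
        if s == state and p == prob and abs((d - today).days) < dist
    )
    return higher + nearer


rows = [
    ("WA", date(2016, 12, 31), 50),
    ("WA", date(2016, 12, 19), 80),
    ("WA", date(2016, 10, 15), 80),
]
today = date(2016, 9, 1)
print([beaten_by(r, rows, today) == 0 for r in rows])
```

A row is the one you want when its count is zero. Exact duplicates (same state, date and probability) would all come out as zero, so they would all be flagged.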