My code is
cd /home/XXX/db-new
while read -r line; do
data=$(echo $line | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' | awk '{gsub(/\"/,"")};1' | tr -d \'\" )
d2=$(echo $data | awk -F, '{print $2}')
d3=$(echo $data | awk -F, '{print $3}')
d17=$(echo $data | awk -F, '{print $17}')
d4=$(echo $data | awk -F, '{print $4","$5","$6","$7","$8","$9","$10","$11","$12","$13","$14","$15","$16","$17","$18","$19","$20","$21","$22","$23","$24","$25","$26","$27","$28","$29","$30","$31","$32","$33","$34","$35","$36","$37","$38","$39","$40","$45","$46","$47","$48","$49","$50","$51","$52","$53","$54","$55","$56","$57","$58}')
d1=$d2+$d3
d59=$(echo $d2 | cut -d "." -f 2,3)
d60=$(echo $data | awk -F, '{print $19}' | awk 'BEGIN{FS=OFS=","} {gsub(/[[:punct:] ]/,"",$1)} 1' | sed 's/[^0-9]*//g' )
echo $d1,$d2,$d4,$d59,$d17,$d60 >> abc.csv
done < /home/XXX/db-new/2021-09-04.csv
/home/domainsanalytics/db-new/2021-09-04.csv is very big, so I am showing only the first 3 lines.
head -3 /home/domainsanalytics/db-new/2021-09-04.csv
"num","domain_name","query_time","create_date","update_date","expiry_date","domain_registrar_id","domain_registrar_name","domain_registrar_whois","domain_registrar_url","registrant_name","registrant_company","registrant_address","registrant_city","registrant_state","registrant_zip","registrant_country","registrant_email","registrant_phone","registrant_fax","administrative_name","administrative_company","administrative_address","administrative_city","administrative_state","administrative_zip","administrative_country","administrative_email","administrative_phone","administrative_fax","technical_name","technical_company","technical_address","technical_city","technical_state","technical_zip","technical_country","technical_email","technical_phone","technical_fax","billing_name","billing_company","billing_address","billing_city","billing_state","billing_zip","billing_country","billing_email","billing_phone","billing_fax","name_server_1","name_server_2","name_server_3","name_server_4","domain_status_1","domain_status_2","domain_status_3","domain_status_4"
"1","accounting-fwppool.com","2021-09-04 00:53:04","2021-08-10","2021-08-10","2022-08-10","303","PDR Ltd. d/b/a PublicDomainRegistry.com","whois.publicdomainregistry.com","http://www.publicdomainregistry.com","Micael brown","","4941 Maui Cir Huntington Beach, CA 92649","CA","CA","92649","United States","michbrown7654gh#gmail.com","+1.9169136369","","Micael brown","","4941 Maui Cir Huntington Beach, CA 92649","CA","CA","92649","United States","michbrown7654gh#gmail.com","+1.9169136369","","Micael brown","","4941 Maui Cir Huntington Beach, CA 92649","CA","CA","92649","United States","michbrown7654gh#gmail.com","+1.9169136369","","","","","","","","","","","","ns1.verification-hold.suspended-domain.com","ns2.verification-hold.suspended-domain.com","","","clientTransferProhibited","","",""
"2","xjava.com","2021-09-04 00:53:11","2001-03-06","2021-03-12","2022-03-06","472","Dynadot, LLC","whois.dynadot.com","http://www.dynadot.com","Super Privacy Service LTD c/o Dynadot","","PO Box 701","San Mateo","California","94401","United States","xjava.com#superprivacyservice.com","+1.6505854708","","Super Privacy Service LTD c/o Dynadot","","PO Box 701","San Mateo","California","94401","United States","xjava.com#superprivacyservice.com","+1.6505854708","","Super Privacy Service LTD c/o Dynadot","","PO Box 701","San Mateo","California","94401","United States","xjava.com#superprivacyservice.com","+1.6505854708","","","","","","","","","","","","ns1.sedoparking.com","ns2.sedoparking.com","","","clientTransferProhibited","","",""
My code gives good results, but $59, $17 and $60 come out on a new line.
$59 is just the TLD,
$17 is a reprint of the country,
$60 is the phone number without special characters.
All I want is everything in one row.
My output is
domain_name+query_time domain_name create_date update_date expiry_date domain_registrar_id domain_registrar_name domain_registrar_whois domain_registrar_url registrant_name registrant_company registrant_address registrant_city registrant_state registrant_zip registrant_country registrant_email registrant_phone registrant_fax administrative_name administrative_company administrative_address administrative_city administrative_state administrative_zip administrative_country administrative_email administrative_phone administrative_fax technical_name technical_company technical_address technical_city technical_state technical_zip technical_country technical_email technical_phone technical_fax billing_state billing_zip billing_country billing_email billing_phone billing_fax name_server_1 name_server_2 name_server_3 name_server_4 domain_status_1 domain_status_2 domain_status_3 domain_status_4
domain_name registrant_country
accounting-fwppool.com+2021-09-04 00:53:04 accounting-fwppool.com 10/08/21 10/08/21 10/08/22 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.916913637 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.916913637 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.916913637 ns1.verification-hold.suspended-domain.com ns2.verification-hold.suspended-domain.com clientTransferProhibited
com United States 19169136369
xjava.com+2021-09-04 00:53:11 xjava.com 06/03/01 12/03/21 06/03/22 472 Dynadot LLC whois.dynadot.com http://www.dynadot.com Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.650585471 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.650585471 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.650585471 ns1.sedoparking.com ns2.sedoparking.com clientTransferProhibited
com United States 16505854708
accuratetactics.com+2021-09-04 00:53:14 accuratetactics.com 26/08/20 30/08/21 26/08/21 1660 Domainshype.com Inc. whois.domainshype.com http://www.domainshype.com This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.339222513 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.339222513 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.339222513 1.78183928 dns7.parkpage.foundationapi.com dns8.parkpage.foundationapi.com OK
com United States 13392225132
vej.com+2021-09-04 00:53:16 vej.com 16/09/99 31/08/21 16/09/23 128 DomainRegistry.com Inc. nswhois.domainregistry.com http://www.domainregistry.com Scottcraft Label Co. Scottcraft Label Co. c/o Admin Svcs. PO Box 145 Marlton NJ 8053 United States itadmin#scottcraftlabel.com 1.215870212 IT Admin MS 445 Scottcraft Label Co. c/o Admin Svcs. PO Box 145 Marlton NJ 8053 United States itadmin#scottcraftlabel.com 1.215870212 IT Admin MS 445 Scottcraft Label Co. c/o Admin Svcs. PO Box 145 Marlton NJ 8053 United States itadmin#scottcraftlabel.com 1.215870212 colohost1.domainregistry.com cs03.domainregistry.com clientDeleteProhibited clientTransferProhibited clientUpdateProhibited
com United States 12158702120
accutekware.com+2021-09-04 00:53:24 accutekware.com 26/08/03 26/08/21 26/08/21 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com R Benedict Accutek Systems Inc PO Box 591125 Houston Texas 77259 United States rbeny09#hotmail.com 1.281461701 R Benedict Accutek Systems Inc PO Box 591125 Houston Texas 77259 United States rbeny09#hotmail.com 1.281461701 R Benedict Accutek Systems Inc PO Box 591125 Houston Texas 77259 United States rbeny09#hotmail.com 1.281461701 dns10.parkpage.foundationapi.com dns11.parkpage.foundationapi.com clientTransferProhibited
com United States 12814617007
crmxon.com+2021-09-04 00:53:27 crmxon.com 04/09/20 04/11/20 04/09/21 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com GDPR Masked GDPR Masked GDPR Masked GDPR Masked Newcastleupon Tyne(Cityof) GDPR Masked United Kingdom gdpr-masking#gdpr-masked.com GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked gdpr-masking#gdpr-masked.com GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked GDPR Masked gdpr-masking#gdpr-masked.com GDPR Masked GDPR Masked ns1.edagent.com ns2.edagent.com ns3.edagent.com ns4.edagent.com clientTransferProhibited
com United Kingdom
Expected output
domain_name+query_time domain_name create_date update_date expiry_date domain_registrar_id domain_registrar_name domain_registrar_whois domain_registrar_url registrant_name registrant_company registrant_address registrant_city registrant_state registrant_zip registrant_country registrant_email registrant_phone registrant_fax administrative_name administrative_company administrative_address administrative_city administrative_state administrative_zip administrative_country administrative_email administrative_phone administrative_fax technical_name technical_company technical_address technical_city technical_state technical_zip technical_country technical_email technical_phone technical_fax billing_state billing_zip billing_country billing_email billing_phone billing_fax name_server_1 name_server_2 name_server_3 name_server_4 domain_status_1 domain_status_2 domain_status_3 domain_status_4 domain_name registrant_country
accounting-fwppool.com+2021-09-04 00:53:04 accounting-fwppool.com 10/08/21 10/08/21 10/08/22 303 PDR Ltd. d/b/a PublicDomainRegistry.com whois.publicdomainregistry.com http://www.publicdomainregistry.com Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.91691364 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.91691364 Micael brown 4941 Maui Cir Huntington Beach CA 92649 CA CA 92649 United States michbrown7654gh#gmail.com 1.91691364 ns1.verification-hold.suspended-domain.com ns2.verification-hold.suspended-domain.com clientTransferProhibited com United States 1.9169E+10
xjava.com+2021-09-04 00:53:11 xjava.com 06/03/01 12/03/21 06/03/22 472 Dynadot LLC whois.dynadot.com http://www.dynadot.com Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.65058547 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.65058547 Super Privacy Service LTD c/o Dynadot PO Box 701 San Mateo California 94401 United States xjava.com#superprivacyservice.com 1.65058547 ns1.sedoparking.com ns2.sedoparking.com clientTransferProhibited com United States 1.6506E+10
accuratetactics.com+2021-09-04 00:53:14 accuratetactics.com 26/08/20 30/08/21 26/08/21 1660 Domainshype.com Inc. whois.domainshype.com http://www.domainshype.com This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.33922251 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.33922251 1.78183928 This Domain For Sale Worldwide 339 222 5132 Buydomains.com 738 Main Street #389 Waltham Massachusetts 2451 United States brokerage#buydomains.com 1.33922251 1.78183928 dns7.parkpage.foundationapi.com dns8.parkpage.foundationapi.com OK com United States 1.3392E+10
I suggest a single awk script to process all the data.
Starting with this:
script.awk
BEGIN{FS="\",\"|\"[[:space:]]*$|^[[:space:]]*\""; OFS=","}
{
# the leading quote is a separator, so $1 is empty and num is $2
# drop commas embedded in field values so they cannot split output columns
for (i = 2; i <= NF; i++) gsub(/,/, "", $i);
arr[1] = $3 "+" $4;   # domain_name+query_time
arr[2] = $3;          # domain_name
arr[4] = $5;
# right append to arr[4] fields 6-41
for (i = 6; i <= 41; i++) arr[4] = arr[4] "," $i;
# right append to arr[4] fields 46-59
for (i = 46; i <= 59; i++) arr[4] = arr[4] "," $i;
arr[17] = $18;        # registrant_country
arr[59] = $3;
# keep only the text after the first "." (the TLD)
sub(/^[^.]*\./,"",arr[59]);
# keep only the digits of the phone number (20th field)
arr[60] = $20;
gsub(/[^0-9]/,"",arr[60]);
# output to stdout
print arr[1],arr[2],arr[4],arr[59],arr[17],arr[60];
}
Running:
awk -f script.awk input.csv > output.csv
Did not test since the sample data did not contain numeric values.
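For comparison, here is a minimal Python sketch of the same column selection, assuming the input is a regular quoted CSV with the 58 columns shown in the question. The function names transform and convert are made up for this illustration. Unlike the shell loop in the question, the csv module keeps commas inside values and quotes those fields on output instead of deleting the commas, and the TLD is taken as everything after the first dot.

```python
import csv
import io
import re

def transform(row):
    """Build one output row from one parsed input row (0-based indexes)."""
    domain, query_time = row[1], row[2]
    # Columns 4-40 and 45-58 of the input, counted from 1 as in the question.
    middle = row[3:40] + row[44:58]
    # Everything after the first dot, e.g. "com" for "xjava.com".
    tld = domain.split(".", 1)[1] if "." in domain else ""
    country = row[16]                    # registrant_country
    phone = re.sub(r"\D", "", row[18])   # registrant_phone, digits only
    return [domain + "+" + query_time, domain, *middle, tld, country, phone]

def convert(src, dst):
    """Read quoted CSV from src and write one output line per record to dst."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow(transform(row))
```

To run it on files, open both with newline="" and pass the two file objects to convert, for example open("input.csv", newline="") for reading and open("output.csv", "w", newline="") for writing.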
I am trying to remove a custom list of stop words, but it's not working.
desc = pd.DataFrame(description, columns =['description'])
print(desc)
Which gives the following results
description
188693 The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff
11443 According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa...
... ...
2732 The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit...
[9875 rows x 1 columns]
I found the following code here, but it doesn't seem to work
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
desc.assign(new_desc=desc.replace(dict(string={pat: ''}), regex=True))
Which produces the following results
description new_desc
188693 The Kentucky Cannabis Company and Bluegrass He... The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff Ohio County Sheriff
11443 According to new reports from federal authorit... According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou... KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa... The crew of Insight, WCNY's weekly public affa...
... ... ...
2732 The Arkansas Supreme Court on Thursday cleared... The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ... Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres... Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan... Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit... SAN TAN VALLEY — Two men have been charged wit...
9875 rows × 2 columns
As you can see, the stop words weren't removed. Any help you can provide would be greatly appreciated.
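The nested dict in your replace call targets a column named string, which does not exist in desc, and the pattern is case-sensitive, so "Cannabis" never matches. Lowercase the text first and simplify the pattern (note that without \b it also matches inside longer words):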
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join(remove_words)
desc['new_desc'] = desc.description.str.lower().replace(pat,'', regex=True)
description new_desc
0 The Kentucky Cannabis Company and Bluegrass He... the kentucky company and bluegrass he...
1 Ohio County Sheriff ohio county sheriff
2 According to new reports from federal authorit... according to new reports from federal authorit...
3 KANSAS CITY, Mo. (AP)The Chiefs will be mariju... kansas city, mo. (ap)the chiefs will be witho...
4 The crew of Insight, WCNY's weekly public affa... the crew of insight, wcny's weekly public affa...
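If the original capitalisation should be kept, a case-insensitive pattern with word boundaries is one alternative. The following is a minimal sketch using only the standard re module; the helper name strip_stop_words is made up for this example, and collapsing the leftover double spaces is an extra step that the answer above does not perform.

```python
import re

REMOVE_WORDS = ["marijuana", "cannabis", "hemp", "thc", "cbd"]

# \b restricts the match to whole words; IGNORECASE also matches "Cannabis".
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, REMOVE_WORDS)) + r")\b",
    flags=re.IGNORECASE,
)

def strip_stop_words(text):
    """Remove the listed words and collapse the spaces they leave behind."""
    without = PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without).strip()
```

Assuming the column holds only strings, it can be applied with desc["new_desc"] = desc["description"].map(strip_stop_words).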
I have no idea what I'm doing wrong, but I'm getting an #N/A error.
I have a cell that contains comma-separated addresses. If one of the addresses matches an address in another table, I need a value from another cell in the prior table.
INDEX( GoogleFormResults[B], MATCH("*"&[#Address]&"*", GoogleFormResults[A],0))
Example Address in the Table:
707 W Cesar Chavez Ave LOS ANGELES CA 90067
Example GoogleFormResults[A]
4101 Crenshaw Blvd LOS ANGELES CA 90008, 707 W Cesar Chavez Ave LOS ANGELES CA 90067, 6820 Eastern Ave BELL GARDENS CA 90201, 12270 Paramount Blvd DOWNEY CA 90242, 1399 Artesia Blvd. GARDENA CA 90247, 14441 Inglewood Ave HAWTHORNE CA 90250, 4651 Firestone Blvd SOUTH GATE CA 90280, 5871 Firestone Blvd SOUTH GATE CA 90280, 19503 Normandie Ave TORRANCE CA 90501, 19340 Hawthorne Blvd TORRANCE CA 90503, 22015 Hawthorne Blvd TORRANCE CA 90503, 2601 Skypark Dr TORRANCE CA 90505, 8450 La Palma Ave BUENA PARK CA 90620, 5420 Lapalma Ave LA PALMA CA 90623, 1000 E Imperial Hwy LA HABRA CA 90631, 1340 SOUTH BEACH BLVD LA HABRA CA 90631, 1390 S. Beach Blvd. LA HABRA CA 90631, 14865 Telegraph Rd LA MIRADA CA 90638, 11729 Imperial Hwy NORWALK CA 90650, 8500 Washington Blvd PICO RIVERA CA 90660, 13310 TELEGRAPH ROAD SANTA FE SPRINGS CA 90670, 12540 Beach Blvd. STANTON CA 90680, 12840 Beach Blvd STANTON CA 90680, 12701 TOWNE CENTER D CERRITOS CA 90703, 2770 CARSON STREET LAKEWOOD CA 90712, 12120 Carson Street HAWAIIAN GARDENS CA 90716, 14501 LAKEWOOD BLVD PARAMOUNT CA 90723, 20226 AVALON BLVD. CARSON CA 90746, 151 EAST 5TH STREET LONG BEACH CA 90802, 3705 E. South Street LONG BEACH CA 90805, 7250 Carson Blvd LONG BEACH CA 90808, 7480 Carson Blvd LONG BEACH CA 90808, 6750 Kimball Ave Chino CA 91708, 3943 Grand Ave CHINO CA 91710, 3951 Grand Ave CHINO CA 91710, 1275 N Azusa Ave COVINA CA 91722, 4901 Santa Anita Ave EL MONTE CA 91731, 1425 N Hacienda Blvd LA PUENTE CA 91744, 17150 Gale Ave CITY OF INDUSTRY CA 91745, 17835 E. Gale Ave. HACIENDA HEIGHTS CA 91745, 4155 Wineville Ave MIRA LOMA CA 91752, 4250 Hamner Ave MIRA LOMA CA 91752, 1333 N Mountain Ave ONTARIO CA 91762, 951 N. Milliken Ave. ONTARIO CA 91764, 1180 S Diamond Bar Blvd Diamond Bar CA 91765, 80 RIO RANCHO ROAD POMONA CA 91766, 780 E Arrow Hwy Pomona CA 91767, 1827 WALNUT GROVE BLVD ROSEMEAD CA 91770, 1445 E Foothill Blvd UPLAND CA 91786, 1540 W. 
FOOTHILL BLVD UPLAND CA 91786, 2735 E Eastland Center Dr WEST COVINA CA 91791, 1550 Leucadia Blvd ENCINITAS CA 92024, 1266 E Valley Parkway ESCONDIDO CA 92025, 1330 East Grand Ave ESCONDIDO CA 92027, 1046 Mission Ave OCEANSIDE CA 92054, 2100 VISTA WAY OCEANSIDE CA 92054, 3405 Marron Rd OCEANSIDE CA 92056, 705 COLLEGE BLVD OCEANSIDE CA 92057, 2121 Imperial Ave SAN DIEGO CA 92102, 4840 Shawline St SAN DIEGO CA 92111, 3412 COLLEGE AVE. SAN DIEGO CA 92115, 6336 College Grove Way SAN DIEGO CA 92115, 3382 Murphy Canyon Rd SAN DIEGO CA 92123, 575 Saturn Blvd SAN DIEGO CA 92154, 710 DENNERY ROAD SAN DIEGO CA 92154, 13553-A San Bernardino Avenue FONTANA CA 92334, 4210 EAST HIGHLAND A HIGHLAND CA 92346, 16555 Von Karman Ave IRVINE CA 92606, 26502 TOWNE CENTER DRI FOOTHILL RANCH CA 92610, 71 Technology Dr IRVINE CA 92618, 8230 TALBERT AVENUE HUNTINGTON BEACH CA 92646, 6912 Edinger Ave HUNTINGTON BEACH CA 92647, 21134 Beach Blvd HUNTINGTON BEACH CA 92648, 951 Avenida Pico SAN CLEMENTE CA 92673, 27470 ALICIA PKWY LAGUNA NIGUEL CA 92677, 13331 BEACH BLVD WESTMINSTER CA 92683, 30491 Avenida De Las Flores RANCHO SANTA MARGARITA CA 92688, 3600 W McFadden Ave SANTA ANA CA 92704, 17099 Brookhurst St. FOUNTAIN VALLEY CA 92708, 121 N Beach Ave ANAHEIM CA 92801, 440 N Euclid St ANAHEIM CA 92801, 1120 S Anaheim Blvd ANAHEIM CA 92805, Lemon and Orange thorpe ANAHEIM CA 92817, 2595 EAST IMPERIAL HGWY BREA CA 92821, 629 S. Placentia Ave. FULLERTON CA 92831, 10912 Katella Ave GARDEN GROVE CA 92840, 11822 Gilbert St GARDEN GROVE CA 92841, 2300 NORTH TUSTIN ST ORANGE CA 92865, 479 N McKinley St CORONA CA 92879, 1290 E Ontario Ave CORONA CA 92881, 1375 E. Ontario Ave. CORONA CA 92881, 1560 West 6th St CORONA CA 92882
Here is an example of using FIND to return a value from another column in the row where there is a partial match.
If you execute FIND against the column of the table, it returns an array that holds a #VALUE! error for each cell where the item is not found and a number (the match position) for each cell where it is. LOOKUP with a very large lookup value then picks the last number in that array, and the result vector is the column you want to return.
You should be able to adapt the following to your data:
Given this Table: (note your long list of addresses is in B3)
and with the Address you are searching for in B9, you can use this formula to return the contents of the cell in column D:
=LOOKUP(1E+307,FIND(B9,Table1[Long List of Addresses]),Table1[Cell to Return])
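To make the logic of the formula easier to follow, here is a small Python sketch of what the FIND and LOOKUP combination computes: the value paired with the last cell that contains the search text. The function name lookup_last_match and the sample lists are invented for the illustration, and it returns None where Excel would show #N/A.

```python
def lookup_last_match(needle, haystacks, results):
    """Return the result paired with the last haystack that contains needle."""
    found = None
    for text, result in zip(haystacks, results):
        if needle in text:   # FIND is case-sensitive, and so is "in"
            found = result   # keep scanning so the last match wins
    return found

addresses = [
    "4101 Crenshaw Blvd LOS ANGELES CA 90008, "
    "707 W Cesar Chavez Ave LOS ANGELES CA 90067",
    "6820 Eastern Ave BELL GARDENS CA 90201",
]
values = ["row 1 value", "row 2 value"]
match = lookup_last_match(
    "707 W Cesar Chavez Ave LOS ANGELES CA 90067", addresses, values
)
```

Because the comparison is case-sensitive, "Los Angeles" would not match "LOS ANGELES"; in Excel, SEARCH is the case-insensitive counterpart of FIND.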