I was inspecting the full list of IANA top-level domains and came across some uncommon ones, and some very uncommon ones indeed, particularly the 153 top-level domains starting with XN:
XN--11B4C3D
XN--1CK2E1B
XN--1QQW23A
XN--2SCRJ9C
XN--30RR7Y
XN--3BST00M
What are the domains starting with XN?
Note
Here's some R code to extract the full list of XN domains for exploration:
library(tidyverse)
library(rvest)
domains <- read_html("http://data.iana.org/TLD/tlds-alpha-by-domain.txt") %>%
html_nodes("body") %>%
html_text %>%
str_split("\n") %>%
unlist %>%
as.data.frame %>%
`colnames<-`("tld")
# Starts with XN
domains %>%
filter(substr(tld, 1, 2) == "XN")
# tld
# 1 XN--11B4C3D
# 2 XN--1CK2E1B
# 3 XN--1QQW23A
# 4 XN--2SCRJ9C
# 5 XN--30RR7Y
# 6 XN--3BST00M
# 7 XN--3DS443G
# 8 XN--3E0B707E
# 9 XN--3HCRJ9C
# 10 XN--3OQ18VL8PN36A
# 11 XN--3PXU8K
# 12 XN--42C2D9A
# 13 XN--45BR5CYL
# --- ---
# 146 XN--WGBL6A
# 147 XN--XHQ521B
# 148 XN--XKC2AL3HYE2A
# 149 XN--XKC2DL3A5EE0H
# 150 XN--Y9A3AQ
# 151 XN--YFRO4I67O
# 152 XN--YGBI2AMMX
# 153 XN--ZFR164B
Full set
XN--11B4C3D XN--1CK2E1B XN--1QQW23A XN--2SCRJ9C XN--30RR7Y XN--3BST00M XN--3DS443G XN--3E0B707E XN--3HCRJ9C XN--3OQ18VL8PN36A XN--3PXU8K XN--42C2D9A XN--45BR5CYL XN--45BRJ9C XN--45Q11C XN--4GBRIM XN--54B7FTA0CC XN--55QW42G XN--55QX5D XN--5SU34J936BGSG XN--5TZM5G XN--6FRZ82G XN--6QQ986B3XL XN--80ADXHKS XN--80AO21A XN--80AQECDR1A XN--80ASEHDB XN--80ASWG XN--8Y0A063A XN--90A3AC XN--90AE XN--90AIS XN--9DBQ2A XN--9ET52U XN--9KRT00A XN--B4W605FERD XN--BCK1B9A5DRE4C XN--C1AVG XN--C2BR7G XN--CCK2B3B XN--CCKWCXETD XN--CG4BKI XN--CLCHC0EA0B2G2A9GCD XN--CZR694B XN--CZRS0T XN--CZRU2D XN--D1ACJ3B XN--D1ALF XN--E1A4C XN--ECKVDTC9D XN--EFVY88H XN--FCT429K XN--FHBEI XN--FIQ228C5HS XN--FIQ64B XN--FIQS8S XN--FIQZ9S XN--FJQ720A XN--FLW351E XN--FPCRJ9C3D XN--FZC2C9E2C XN--FZYS8D69UVGM XN--G2XX48C XN--GCKR3F0F XN--GECRJ9C XN--GK3AT1E XN--H2BREG3EVE XN--H2BRJ9C XN--H2BRJ9C8C XN--HXT814E XN--I1B6B1A6A2E XN--IMR513N XN--IO0A7I XN--J1AEF XN--J1AMH XN--J6W193G XN--JLQ480N2RG XN--JLQ61U9W7B XN--JVR189M XN--KCRX77D1X4A XN--KPRW13D XN--KPRY57D XN--KPUT3I XN--L1ACC XN--LGBBAT1AD8J XN--MGB9AWBF XN--MGBA3A3EJT XN--MGBA3A4F16A XN--MGBA7C0BBN0A XN--MGBAAKC7DVF XN--MGBAAM7A8H XN--MGBAB2BD XN--MGBAH1A3HJKRD XN--MGBAI9AZGQP6J XN--MGBAYH7GPA XN--MGBBH1A XN--MGBBH1A71E XN--MGBC0A9AZCG XN--MGBCA7DZDO XN--MGBCPQ6GPA1A XN--MGBERP4A5D4AR XN--MGBGU82A XN--MGBI4ECEXP XN--MGBPL2FH XN--MGBT3DHD XN--MGBTX2B XN--MGBX4CD0AB XN--MIX891F XN--MK1BU44C XN--MXTQ1M XN--NGBC5AZD XN--NGBE9E0A XN--NGBRX XN--NODE XN--NQV7F XN--NQV7FS00EMA XN--NYQY26A XN--O3CW4H XN--OGBPF8FL XN--OTU796D XN--P1ACF XN--P1AI XN--PGBS0DH XN--PSSY2U XN--Q7CE6A XN--Q9JYB4C XN--QCKA1PMC XN--QXA6A XN--QXAM XN--RHQV96G XN--ROVU88B XN--RVC1E0AM3E XN--S9BRJ9C XN--SES554G XN--T60B56A XN--TCKWE XN--TIQ49XQYJ XN--UNUP4Y XN--VERMGENSBERATER-CTB XN--VERMGENSBERATUNG-PWB XN--VHQUV XN--VUQ861B XN--W4R85EL8FHU5DNRA XN--W4RS40L XN--WGBH1C XN--WGBL6A XN--XHQ521B XN--XKC2AL3HYE2A XN--XKC2DL3A5EE0H XN--Y9A3AQ XN--YFRO4I67O XN--YGBI2AMMX XN--ZFR164B
They're Punycode versions of non-ASCII domain names. For example:
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object
domain: ଭାରତ
domain-ace: XN--3HCRJ9C
The XN-- prefix marks a Punycode-encoded label. The Punycode specification (RFC 3492) describes it as follows:
Punycode is a simple and efficient transfer encoding syntax designed for use with Internationalized Domain Names in Applications. It uniquely and reversibly transforms a Unicode string into an ASCII string. ASCII characters in the Unicode string are represented literally, and non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).
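To see the mechanics on a single label, Python's built-in punycode codec can decode the text that follows the "xn--" ACE prefix. A minimal sketch, just as a cross-check of the R decoding shown further down:
# Decode one IANA "XN--" label with Python's built-in punycode codec;
# the codec works on the part after the "xn--" ACE prefix, in lower case.
label = "XN--3HCRJ9C"
print(label[4:].lower().encode("ascii").decode("punycode"))  # ଭାରତ, matching the WHOIS record above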
Here are all 153 punycode domains after decoding:
[1] "कॉम" "セール" "佛山"
[4] "ಭಾರತ" "慈善" "集团"
[7] "在线" "한국" "ଭାରତ"
[10] "点看" "คอม" "ভাৰত"
[13] "ভারত" "八卦" "ישראל"
[16] "موقع" "বাংলা" "公益"
[19] "公司" "香格里拉" "网站"
[22] "移动" "我爱你" "москва"
[25] "қаз" "католик" "онлайн"
[28] "сайт" "联通" "срб"
[31] "бг" "бел" "קום"
[34] "时尚" "微博" "淡马锡"
[37] "ファッション" "орг" "नेट"
[40] "ストア" "アマゾン" "삼성"
[43] "சிங்கப்பூர்" "商标" "商店"
[46] "商城" "дети" "мкд"
[49] "ею" "ポイント" "新闻"
[52] "家電" "كوم" "中文网"
[55] "中信" "中国" "中國"
[58] "娱乐" "谷歌" "భారత్"
[61] "ලංකා" "電訊盈科" "购物"
[64] "クラウド" "ભારત" "通販"
[67] "भारतम्" "भारत" "भारोत"
[70] "网店" "संगठन" "餐厅"
[73] "网络" "ком" "укр"
[76] "香港" "亚马逊" "诺基亚"
[79] "食品" "飞利浦" "台湾"
[82] "台灣" "手机" "мон"
[85] "الجزائر" "عمان" "ارامكو"
[88] "ایران" "العليان" "اتصالات"
[91] "امارات" "بازار" "موريتانيا"
[94] "پاکستان" "الاردن" "بارت"
[97] "بھارت" "المغرب" "ابوظبي"
[100] "البحرين" "السعودية" "ڀارت"
[103] "كاثوليك" "سودان" "همراه"
[106] "عراق" "مليسيا" "澳門"
[109] "닷컴" "政府" "شبكة"
[112] "بيتك" "عرب" "გე"
[115] "机构" "组织机构" "健康"
[118] "ไทย" "سورية" "招聘"
[121] "рус" "рф" "تونس"
[124] "大拿" "ລາວ" "みんな"
[127] "グーグル" "ευ" "ελ"
[130] "世界" "書籍" "ഭാരതം"
[133] "ਭਾਰਤ" "网址" "닷넷"
[136] "コム" "天主教" "游戏"
[139] "VERMöGENSBERATER" "VERMöGENSBERATUNG" "企业"
[142] "信息" "嘉里大酒店" "嘉里"
[145] "مصر" "قطر" "广东"
[148] "இலங்கை" "இந்தியா" "հայ"
[151] "新加坡" "فلسطين" "政务"
And here is the R code used to decode the domains (you could also use an online Punycode converter):
library(tidyverse)
library(rvest)
domains <- read_html("http://data.iana.org/TLD/tlds-alpha-by-domain.txt") %>%
html_nodes("body") %>%
html_text %>%
str_split("\n") %>%
unlist %>%
as.data.frame %>%
`colnames<-`("tld")
punycode_domains <- domains %>%
filter(substr(tld, 1, 2) == "XN") %>%
pull(tld)
# devtools::install_github("hrbrmstr/punycode") # Run once to install
library(punycode)
puny_encode(punycode_domains)
Related
I'm attempting to order some color swatches by "maximum difference" and am getting some odd results. For example, colors 5 and 14 in the list below appear, at least to my (non-colorblind) eye, rather similar, much more so than many of the colors that follow, yet color 14 seems to have a higher minimum ΔE (calculated against all the preceding colors) than many of the colors after it.
Is ΔE considered a reliable way of calculating a "perceptual distance" between colors, or should I be using something else?
In order, the colors shown here are:
> x.hex
[1] "#060186" "#8EF905" "#F5C5F7" "#805200" "#0DE0FE" "#D0135D" "#0B834F" "#FABF74" "#417BEA" "#FA4D01"
[11] "#DC39FC" "#590708" "#919913" "#01DDAE" "#068896" "#D28B8B" "#7C4C8E" "#A3BCE7" "#0C5378" "#F1E11E"
[21] "#A24731" "#495C0D" "#01B633" "#4A30FE" "#BB7D0A" "#680F41" "#C1D597" "#FC75C1" "#A786C7" "#29A4DD"
[31] "#FD0A3D" "#43A99B" "#B16A8D" "#D002A2" "#BA7755" "#FECBB6" "#253467" "#FF9143" "#8A763A" "#5960A6"
[41] "#B79D66" "#70A271"
And the minimum ΔE against the previous colors:
> DeList
[1] 117.25473 69.53788 55.00019 46.90173 38.54371 37.20359 36.32807 35.23608 28.57360 27.10889
[11] 26.77178 25.24130 24.39958 24.24133 22.51525 22.23315 20.50791 19.93881 19.63842 19.45253
[21] 19.31200 19.04087 18.90102 18.64973 18.25806 18.08846 17.55115 17.19687 16.82420 15.35578
[31] 15.17018 14.95605 14.77414 14.67706 14.67703 14.37527 14.16665 14.02716 14.00375 13.90574
[41] 13.84133
I'm calculating ΔE with the R package spacesXYZ, using:
spacesXYZ::DeltaE( lab.matrix[i,], lab.matrix[j,], 2000 )
and calculating the hex code from the LAB matrix using:
x <- lab.matrix[pal.list,] # extract the LAB numbers from the matrix
x.lab <- colorspace::LAB(x) # convert to an LAB colorspace class
(x.hex <- colorspace::hex(x.lab)) # convert to hex string
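For reference, the "maximum difference" ordering I'm computing is essentially greedy farthest-point selection: at each step, pick the color whose minimum ΔE to all previously chosen colors is largest. A rough Python/numpy sketch of the idea, using plain Euclidean (CIE76) distance in Lab instead of ΔE2000 and random colors rather than my actual palette:
import numpy as np

# Made-up Lab colors: each row is (L*, a*, b*)
rng = np.random.default_rng(0)
lab = rng.uniform([0, -80, -80], [100, 80, 80], size=(42, 3))

# Pairwise Euclidean distance in Lab (i.e. CIE76 Delta E)
dist = np.linalg.norm(lab[:, None, :] - lab[None, :, :], axis=-1)

order = [0]            # start from an arbitrary color
min_de = []            # minimum Delta E of each pick vs. all earlier picks
remaining = set(range(1, len(lab)))
while remaining:
    # distance from each remaining color to its nearest already-picked color
    nearest = {i: dist[i, order].min() for i in remaining}
    nxt = max(nearest, key=nearest.get)   # pick the color farthest from the chosen set
    order.append(nxt)
    min_de.append(nearest[nxt])
    remaining.remove(nxt)

print(order[:5], [round(d, 1) for d in min_de[:5]])  # min_de decreases, like DeList above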
I'm trying to find the longest shortest path(s) between 2 counties. I was given 2 .txt files, one with all of the nodes (county ID, population, latitude and longitude, and commuters inside the county) and one with the links (source county, destination county, distance, number of commuters).
01001 43671 32.523283 -86.577176 7871
01003 140415 30.592781 -87.748260 45208
01005 29038 31.856515 -85.331312 8370
01007 20826 33.040054 -87.123243 3199
01009 51024 33.978461 -86.554768 8966
01011 11714 32.098285 -85.704915 2237
01013 21399 31.735884 -86.662232 5708
01015 112249 33.741989 -85.817544 39856
01017 36583 32.891233 -85.288745 9281
01019 23988 34.184158 -85.621930 4645
01021 39593 32.852554 -86.689982 8115
01023 15922 32.027681 -88.257855 3472
01025 27867 31.688155 -87.834164 7705
...
01001 01001 0 7871
01001 01007 76.8615966430995 7
01001 01013 87.9182871130127 37
01001 01015 152.858742124667 5
01001 01021 38.1039665382023 350
01001 01031 140.051395101308 8
01001 01037 57.6726084645634 12
01001 01047 48.517875245493 585
01001 01051 38.9559472915165 741
01001 01053 169.524277177911 5
01001 01059 245.323879285783 7
01001 01065 102.775324022097 2
01001 01073 114.124721221283 142
...
01003 48439 932.019063970525 9
01003 53033 3478.13978129133 11
01003 54081 997.783781484149 10
01005 01005 0.000134258785931453 8370
01005 01011 44.3219329413987 72
01005 01021 168.973302699063 7
...
The first file with the nodes is called "THE_NODES.txt" and the second is "THE_LINKS.txt".
How would I use Python code to find the longest shortest path(s) between any two counties? I assume I start by making a graph of the network, and since the second file has the connections, use 'THE_LINKS.txt' for the edges (I don't know whether the weights should be the distances?). Also, I think these files can only be read as CSV (correct me if I'm wrong), so I can't (or don't know how to) use networkx for this problem.
You can use pandas' read_table function with a whitespace separator to read the .txt files:
import pandas as pd
node = pd.read_table('THE_NODES.txt', sep=r'\s+', header=None)
links = pd.read_table('THE_LINKS.txt', sep=r'\s+', header=None)
Then you need to find the rows for the counties you are interested in (see: How to select rows from a DataFrame based on column values?) and calculate the distances between them.
What have you tried so far? Include that too.
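If you can use networkx after all (the files read fine with pandas, as above), a rough sketch of the whole workflow could look like the following. The column names are my own, and I'm interpreting "longest shortest path" as the pair of counties whose distance-weighted shortest path is largest:
import pandas as pd
import networkx as nx

# Whitespace-delimited link list: source county, destination county, distance, commuters
links = pd.read_table("THE_LINKS.txt", sep=r"\s+", header=None,
                      names=["src", "dst", "distance", "commuters"],
                      dtype={"src": str, "dst": str})   # str keeps the leading zeros of FIPS codes

# Build an undirected graph weighted by distance, skipping self-loops
G = nx.Graph()
for row in links.itertuples(index=False):
    if row.src != row.dst:
        G.add_edge(row.src, row.dst, weight=row.distance)

# All-pairs shortest path lengths (Dijkstra on the distance weights);
# this can be slow for ~3000 counties, so try a subset first
lengths = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))

# The longest shortest path is the pair whose shortest-path distance is largest
src, targets = max(lengths.items(), key=lambda kv: max(kv[1].values()))
dst, d = max(targets.items(), key=lambda kv: kv[1])
print(src, dst, d)
print(nx.shortest_path(G, src, dst, weight="weight"))   # the actual route between them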
Can anyone suggest how to extract tabular data from a PDF using a Python or Java program, for the borderless table below that is present in a PDF file?
This table might be a difficult one for tabula. How about using guess=False, stream=True?
Update: As of tabula-py 1.0.3, guess and stream should work together. There is no need to set guess=False to use the stream or lattice option.
I solved this problem via tabula-py
conda install tabula-py
and
>>> import tabula
>>> area = [70, 30, 750, 570]  # Seems to have to be done manually
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...                         stream=True, multiple_tables=False, area=area, pages="all",
...                         )  # `tabula` doc explains params very well
>>> page2
and I got this result
'pages' argument isn't specified. Will extract only from page 1 by default.
[        ShortTitle                                               Text                                             URL
 0      Arena3Dweb          3D visualisation of multilayered networks                   http://bib.fleming.gr/Arena3D
 1         Aviator        Monitoring the availability of web services         https://www.ccb.uni-saarland.de/aviator
 2        b2bTools  Predictions for protein biophysical features and                    https://bio2byte.be/b2btools/
 3             NaN                                 their conservation                                             NaN
 4         BENZ WS       Four-level Enzyme Commission (EC) number ...                https://benzdb.biocomp.unibo.it/
 ..            ...                                                ...                                             ...
 68 miRTargetLink2               miRNA target gene and target pathway  https://www.ccb.uni-saarland.de/mirtargetlink2
 69            NaN                                           networks                                             NaN
 70      mmCSM-PPI             Effects of multiple point mutations on          http://biosig.unimelb.edu.au/mmcsm ppi
 71            NaN                       protein-protein interactions                                             NaN
 72       ModFOLD8            Quality estimates for 3D protein models       https://www.reading.ac.uk/bioinf/ModFOLD/

 [73 rows x 3 columns]]
This is an iterable object, so you can manipulate it via for row in page2:
Hope it helps.
Tabula-py borderless table extraction:
Tabula-py has a stream option which, when True, detects tables based on the gaps between columns.
from tabula import convert_into
src_pdf = r"src_path"
des_csv = r"des_path"
convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")
I am trying to work with the velox package in R 3.4.1, using the current (0.2.0) velox package version. I want to extract raster pixel values using the VeloxRaster_extract_points functionality and after failures with my own data, I ran the exact code provided on page 19 of the current reference manual. This returned the error shown (below). I have been unable to find any relevant references to this error online. Any suggestions?
Thanks
> ## Make VeloxRaster with two bands
> set.seed(0)
> mat1 <- matrix(rnorm(100), 10, 10)
> mat2 <- matrix(rnorm(100), 10, 10)
> vx <- velox(list(mat1, mat2), extent=c(0,1,0,1), res=c(0.1,0.1),crs="+proj=longlat +datum=WGS84 +no_defs")
> ## Make SpatialPoints
> library(sp)
> library(rgeos)
> coord <- cbind(runif(10), runif(10))
> spoint <- SpatialPoints(coords=coord)
> ## Extract
> vx$extract_points(sp=spoint)
Error in envRefInferField(x, what, getClass(class(x)), selfEnv) :
‘extract_points’ is not a valid field or method name for reference class “VeloxRaster”
When I tried it, it worked fine in my case:
library('velox')
## Make VeloxRaster with two bands
set.seed(0)
mat1 <- matrix(rnorm(100), 10, 10)
mat2 <- matrix(rnorm(100), 10, 10)
vx <- velox(list(mat1, mat2), extent=c(0,1,0,1), res=c(0.1,0.1),
crs="+proj=longlat +datum=WGS84 +no_defs")
## Make SpatialPoints
library(sp)
library(rgeos)
coord <- cbind(runif(10), runif(10))
spoint <- SpatialPoints(coords=coord)
## Extract
vx$extract_points(sp=spoint)
[,1] [,2]
[1,] 0.76359346 -0.4125199
[2,] 0.35872890 0.3178857
[3,] 0.25222345 -1.1195991
[4,] 0.00837096 2.0247614
[5,] 0.77214219 -0.5922254
[6,] 0.00837096 2.0247614
[7,] 1.10096910 0.5989751
[8,] 1.15191175 -0.9558391
[9,] 0.14377148 -1.5236149
[10,] 1.27242932 0.0465803
I think you may need to reinstall the package.
companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv", index_col = 0)
companies.head()
I'm getting the error below; please suggest what approaches should be tried.
"'utf-8' codec can't decode byte 0xb7 in position 7"
Try encoding='latin1' (this worked on macOS):
companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv",
index_col=0,
encoding='latin1')
Downloading the file and opening it in Notepad++ shows it is ANSI-encoded. If you are on a Windows system, this should fix it:
import pandas as pd
url = "http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv"
companies = pd.read_csv(url, index_col = 0, encoding='ansi')
print(companies)
If you are not on Windows, you will need to look into how to convert the ANSI-encoded text into something you can read.
See: https://docs.python.org/3/library/codecs.html#standard-encodings
Output:
                                     Name      Industry  Revenue (USD billions)  Employees
0                                 Walmart        Retail                     482    2200000
1                           Sinopec Group   Oil and gas                     455     358571
2    China National Petroleum Corporation   Oil and gas                     428    1636532
..                                    ...           ...                     ...        ...
47             Hewlett Packard Enterprise   Electronics                     111     302000
48                             Tata Group  Conglomerate                     108     600000
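If neither latin1 nor the Windows ANSI code page happens to be right, another option (my own suggestion, not from the answers above) is to sniff the encoding first, for example with the chardet package:
import urllib.request
import chardet
import pandas as pd

url = "http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv"
raw = urllib.request.urlopen(url).read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(guess)

companies = pd.read_csv(url, index_col=0, encoding=guess["encoding"])
print(companies.head())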