Weird result from DBpedia

I'm trying to query a list of all airports and their IATA code:
PREFIX p: <http://dbpedia.org/property/>
PREFIX o: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?airport ?iata ?name
WHERE {
  ?airport rdf:type o:Airport ;
           p:iata ?iata ;
           p:name ?name
}
ORDER BY ?airport
Executing it looks mostly fine, but there are some weird blocks where airports get assigned the wrong name, such as:
http://dbpedia.org/resource/Prince_Abdul_Majeed_bin_Abdul_Aziz_Domestic_Airport "ULH"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
http://dbpedia.org/resource/Prince_Albert_(Glass_Field)_Airport "YPA"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
http://dbpedia.org/resource/Prince_George_Airport "YXS"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
http://dbpedia.org/resource/Prince_Mohammad_Bin_Abdulaziz_Airport "MED"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
http://dbpedia.org/resource/Prince_Rupert/Seal_Cove_Water_Airport "ZSW"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
http://dbpedia.org/resource/Prince_Rupert_Airport "YPR"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
http://dbpedia.org/resource/Prince_Said_Ibrahim_International_Airport "HAH"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
http://dbpedia.org/resource/Princess_Juliana_International_Airport "SXM"@en "Prince Abdul Majeed bin Abdul Aziz Airport"@en
Besides all having "Prince" in their name, they seem to have nothing in common. Clicking through to the resources also suggests no relation to the names they've been assigned.
What am I doing wrong?
EDIT - found solution:
Removing the ORDER BY ?airport, or changing it to ORDER BY ?iata, fixes the problem.
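For anyone hitting this from code, here is a minimal sketch of running the fixed query (using the SPARQLWrapper Python library and the public http://dbpedia.org/sparql endpoint; any SPARQL client would do just as well):
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# Same query as above, but ordered by ?iata, which avoids the bad results.
sparql.setQuery("""
    PREFIX p: <http://dbpedia.org/property/>
    PREFIX o: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?airport ?iata ?name
    WHERE {
      ?airport a o:Airport ;
               p:iata ?iata ;
               p:name ?name
    }
    ORDER BY ?iata
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["airport"]["value"], row["iata"]["value"], row["name"]["value"])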

The DBpedia ontology (dbpedia-owl) data tends to be a bit cleaner than the older infobox data (dbprop), so I thought you might want to use a query that uses the dbpedia-owl properties:
SELECT ?airport ?iata ?name
WHERE {
  ?airport a dbpedia-owl:Airport ;
           dbpedia-owl:iataLocationIdentifier ?iata ;
           rdfs:label ?name .
  FILTER langMatches( lang( ?name ), "EN" )
}
ORDER BY ?airport
SPARQL Results
The data is somewhat better, but there are still some bizarre results like:
http://dbpedia.org/resource/Prince_Albert_(Glass_Field)_Airport "YPA"@en "Prince Albert (Glass Field) Airport"@en
http://dbpedia.org/resource/Prince_George_Airport "YXS"@en "Prince Albert (Glass Field) Airport"@en
http://dbpedia.org/resource/Prince_Mohammad_Bin_Abdulaziz_Airport "MED"@en "Prince Albert (Glass Field) Airport"@en
http://dbpedia.org/resource/Prince_Rupert/Seal_Cove_Water_Airport "ZSW"@en "Prince Albert (Glass Field) Airport"@en
http://dbpedia.org/resource/Prince_Rupert_Airport "YPR"@en "Prince Albert (Glass Field) Airport"@en
http://dbpedia.org/resource/Prince_Said_Ibrahim_International_Airport "HAH"@en "Prince Albert (Glass Field) Airport"@en
http://dbpedia.org/resource/Princess_Juliana_International_Airport "SXM"@en "Prince Albert (Glass Field) Airport"@en
http://dbpedia.org/resource/Princeton_Airport_(New_Jersey) "PCT"@en "Prince Albert (Glass Field) Airport"@en
In the interest of trying a few different approaches, I also decided to try grouping by the ?airport and ?iata, and then sampling the name:
SELECT ?airport ?iata SAMPLE(?name)
WHERE {
  ?airport a dbpedia-owl:Airport ;
           dbpedia-owl:iataLocationIdentifier ?iata ;
           rdfs:label ?name .
  FILTER langMatches( lang( ?name ), "EN" )
}
GROUP BY ?airport ?iata
ORDER BY ?airport
SPARQL Results
This gets different, but equally strange results, e.g.:
http://dbpedia.org/resource/%22Solidarity%22_Szczecin-Goleni%C3%B3w_Airport "SZZ"@en ""Solidarity" Szczecin-Goleniów Airport"@en
http://dbpedia.org/resource/%C3%81ngel_Albino_Corzo_International_Airport "TGZ"@en ""Solidarity" Szczecin-Goleniów Airport"@en
http://dbpedia.org/resource/%C3%84ngelholm-Helsingborg_Airport "AGH"@en ""Solidarity" Szczecin-Goleniów Airport"@en
http://dbpedia.org/resource/%C3%85lesund_Airport,_Vigra "AES"@en ""Solidarity" Szczecin-Goleniów Airport"@en
http://dbpedia.org/resource/%C3%85re_%C3%96stersund_Airport "OSD"@en ""Solidarity" Szczecin-Goleniów Airport"@en
And yet, if we group by name instead, and select the name and count the number of airports with a given name, we get 1 across the board, but some names appear multiple times!
SELECT COUNT(?airport) ?name
WHERE {
  ?airport a dbpedia-owl:Airport ;
           dbpedia-owl:iataLocationIdentifier ?iata ;
           rdfs:label ?name .
  FILTER langMatches( lang( ?name ), "EN" )
}
GROUP BY ?name
ORDER BY ?name
SPARQL Results
1 "Abraham González International Airport"@en
1 "Abraham González International Airport"@en
...
1 "Prince Albert (Glass Field) Airport"@en
1 "Prince Albert (Glass Field) Airport"@en
1 "Prince Albert (Glass Field) Airport"@en
1 "Prince Albert (Glass Field) Airport"@en
1 "Prince Albert (Glass Field) Airport"@en
1 "Prince Albert (Glass Field) Airport"@en
1 "Prince Albert (Glass Field) Airport"@en
1 "Prince Albert (Glass Field) Airport"@en
This truly is bizarre. It doesn't look like there's anything wrong with your query; rather, something weird is going on over at DBpedia. If you take a look at some of these strange entries, the data that DBpedia shows doesn't match these results. For instance, one of the results from the original query is
http://dbpedia.org/resource/Prince_Mohammad_Bin_Abdulaziz_Airport "MED"@en "Prince Albert (Glass Field) Airport"@en
but if you visit http://dbpedia.org/page/Prince_Mohammad_Bin_Abdulaziz_Airport and search the page for "Albert", you won't find it there.
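To double-check a single resource programmatically, a small sketch along the same lines (again assuming the SPARQLWrapper library and the public endpoint) is to ask for just that resource's labels:
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?name WHERE {
      <http://dbpedia.org/resource/Prince_Mohammad_Bin_Abdulaziz_Airport> rdfs:label ?name .
    }
""")
sparql.setReturnFormat(JSON)
# None of the labels returned here mention "Albert".
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"])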

Related

Split one column into multiple columns by multiple delimiters in Pandas

Given a dataframe as follows:
player score
0 Sergio Agüero Forward — Manchester City 209.98
1 Eden Hazard Midfield — Chelsea 274.04
2 Alexis Sánchez Forward — Arsenal 223.86
3 Yaya Touré Midfield — Manchester City 197.91
4 Angel María Midfield — Manchester United 132.23
How could I split player into three new columns: name, position and team?
player score name position team
0 Sergio Agüero Forward — Manchester City 209.98 Sergio Forward Manchester City
1 Eden Hazard Midfield — Chelsea 274.04 Eden Midfield Chelsea
2 Alexis Sánchez Forward — Arsenal 223.86 Alexis Forward Arsenal
3 Yaya Touré Midfield — Manchester City 197.91 Yaya Midfield Manchester City
4 Angel María Midfield — Manchester United 132.23 Angel Midfield Manchester United
I have considered splitting it into two columns with df[['name_position', 'team']] = df['player'].str.split(pat=' — ', expand=True), then splitting name_position into name and position. But are there any better solutions?
Many thanks.
You can use str.extract as well if you want to do it in one go:
print(df["player"].str.extract(r"(?P<name>.*?)\s.*?\s(?P<position>[A-Za-z]+)\s—\s(?P<team>.*)"))
name position team
0 Sergio Forward Manchester City
1 Eden Midfield Chelsea
2 Alexis Forward Arsenal
3 Yaya Midfield Manchester City
4 Angel Midfield Manchester United
You can split a Python string on whitespace with string.split(). This will break your text up into 'words', and you can then simply access the ones you like, like this:
string = "Sergio Agüero Forward — Manchester City"
parts = string.split()  # ['Sergio', 'Agüero', 'Forward', '—', 'Manchester', 'City']
name = parts[0]
position = parts[2]  # assumes a two-word name, as in the sample row
team = " ".join(parts[4:])  # everything after the em dash
For more complex patterns, you can use regex, which is a powerful string pattern finding tool.
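For instance, a rough sketch with Python's re module (assuming, as in the sample data, that the first word is the first name and the word right before the em dash is the position):
import re

string = "Sergio Agüero Forward — Manchester City"
m = re.match(r"(?P<name>\S+)\s.*?(?P<position>\S+)\s—\s(?P<team>.+)", string)
if m:
    print(m.group("name"), m.group("position"), m.group("team"))
    # Sergio Forward Manchester City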
Hope this helped :)

check amount of time between different rows of data (time) and date and name of employee

I have a df with these columns: ['Name', 'Department', 'Date', 'Time', 'Activity'],
so for example it looks like this:
Acosta, Hirto 225 West 28th Street 9/18/2019 07:25:00 Punch In
Acosta, Hirto 225 West 28th Street 9/18/2019 11:57:00 Punch Out
Acosta, Hirto 225 West 28th Street 9/18/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 06:57:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 12:00:00 Punch Out
Adams, Juan 225 West 28th Street 9/16/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 15:30:00 Punch Out
Adams, Juan 225 West 28th Street 9/18/2019 07:04:00 Punch In
Adams, Juan 225 West 28th Street 9/18/2019 11:57:00 Punch Out
I need to calculate the time between the punch in and the punch out in the same day for the same employee.
So far I've only managed to clean the data, like this:
self.raw_data['Time'] = pd.to_datetime(self.raw_data['Time'], format='%H:%M').dt.time
sorted_db = self.raw_data.sort_values(['Name', 'Date'])
sorted_db = sorted_db[['Name', 'Department', 'Date', 'Time', 'Activity']]
any suggestions will be appreciated
So I found the answer to my problem and wanted to share it.
First I separate the "Punch In" and "Punch Out" times into two columns:
def process_info(self):
    # filter and organize the data
    self.raw_data['in'] = self.raw_data[self.raw_data['Activity'].str.contains('In')]['Time']
    self.raw_data['pre_out'] = self.raw_data[self.raw_data['Activity'].str.contains('Out')]['Time']
Then I sort the data by date and name:
sorted_data = self.raw_data.sort_values(['Date', 'Name'])
After that I use the shift function to pull the next row's time up into an 'out' column, so it sits in parallel with the 'in':
sorted_data['out'] = sorted_data.shift(-1)['Time']
Finally I drop the extra 'out' rows that were created in the first step, by keeping only the rows where 'pre_out' is null:
filtered_data = sorted_data[sorted_data['pre_out'].isnull()]
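For anyone who wants a self-contained version, here is a rough sketch of the same shift idea with made-up sample rows (the column names follow the question; the pairing assumes punches alternate In/Out within each employee and day):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Adams, Juan'] * 4,
    'Date': ['9/16/2019'] * 4,
    'Time': ['06:57:00', '12:00:00', '12:28:00', '15:30:00'],
    'Activity': ['Punch In', 'Punch Out', 'Punch In', 'Punch Out'],
})

# Build full timestamps so durations can be computed by subtraction.
df['ts'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df = df.sort_values(['Name', 'ts'])

# Line up each row with the next timestamp for the same employee and day.
df['next_ts'] = df.groupby(['Name', 'Date'])['ts'].shift(-1)

# Keep only the punch-ins; their 'next_ts' is the matching punch-out
# (a punch-in with no matching punch-out ends up with NaT).
worked = df[df['Activity'] == 'Punch In'].copy()
worked['duration'] = worked['next_ts'] - worked['ts']
print(worked[['Name', 'Date', 'ts', 'duration']])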

Hive can't find partitioned data written by Spark Structured Streaming

I have a spark structured streaming job, writing data to IBM Cloud Object Storage (S3):
dataDf.
    writeStream.
    format("parquet").
    trigger(Trigger.ProcessingTime(trigger_time_ms)).
    option("checkpointLocation", s"${s3Url}/checkpoint").
    option("path", s"${s3Url}/data").
    option("spark.sql.hive.convertMetastoreParquet", false).
    partitionBy("InvoiceYear", "InvoiceMonth", "InvoiceDay", "InvoiceHour").
    start()
I can see the data using the hdfs CLI:
[clsadmin@xxxxx ~]$ hdfs dfs -ls s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0 | head
Found 616 items
-rw-rw-rw- 1 clsadmin clsadmin 38085 2018-09-25 01:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 45874 2018-09-25 00:31 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-28ff873e-8a9c-4128-9188-c7b763c5b4ae.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 5124 2018-09-25 01:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-5f768960-4b29-4bce-8f31-2ca9f0d42cb5.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 40154 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-70abc027-1f88-4259-a223-21c4153e2a85.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 41282 2018-09-25 00:50 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-873a1caa-3ecc-424a-8b7c-0b2dc1885de4.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 41241 2018-09-25 00:40 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-88b617bf-e35c-4f24-acec-274497b1fd31.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 3114 2018-09-25 00:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-deae2a19-1719-4dfa-afb6-33b57f2d73bb.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 38877 2018-09-25 00:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-e07429a2-43dc-4e5b-8fe7-c55ec68783b3.c000.snappy.parquet
-rw-rw-rw- 1 clsadmin clsadmin 39060 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00001-1553da20-14d0-4c06-ae87-45d22914edba.c000.snappy.parquet
However, when I try to query the data:
hive> select * from invoiceitems limit 5;
OK
Time taken: 2.392 seconds
My table DDL looks like this:
CREATE EXTERNAL TABLE `invoiceitems`(
  `invoiceno` int,
  `stockcode` int,
  `description` string,
  `quantity` int,
  `invoicedate` bigint,
  `unitprice` double,
  `customerid` int,
  `country` string,
  `lineno` int,
  `invoicetime` string,
  `storeid` int,
  `transactionid` string,
  `invoicedatestring` string)
PARTITIONED BY (
  `invoiceyear` int,
  `invoicemonth` int,
  `invoiceday` int,
  `invoicehour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data'
I've also tried with the correct case for column/partition names - this doesn't work either.
Any ideas why my query isn't finding the data?
UPDATE 1:
I have tried setting the location to a directory containing the data without partitions and this still doesn't work, so I'm wondering if it is a data formatting issue?
CREATE EXTERNAL TABLE `invoiceitems`(
  `InvoiceNo` int,
  `StockCode` int,
  `Description` string,
  `Quantity` int,
  `InvoiceDate` bigint,
  `UnitPrice` double,
  `CustomerID` int,
  `Country` string,
  `LineNo` int,
  `InvoiceTime` string,
  `StoreID` int,
  `TransactionID` string,
  `InvoiceDateString` string)
PARTITIONED BY (
  `InvoiceYear` int,
  `InvoiceMonth` int,
  `InvoiceDay` int,
  `InvoiceHour` int)
STORED AS PARQUET
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/';
hive> Select * from invoiceitems limit 5;
OK
Time taken: 2.066 seconds
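One way to rule the file format in or out is to inspect the actual schema of a part file and compare it with the DDL; a rough sketch with the pyarrow library, using one of the part files from the listing above after copying it down locally:
import pyarrow.parquet as pq

# Compare these field names/types against the Hive DDL; a mismatch
# (e.g. a column written as string but declared int) could explain empty results.
print(pq.read_schema('part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet'))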
Read from Snappy-compressed Parquet files
The data is in Snappy-compressed Parquet format, e.g.:
s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet
So set the 'PARQUET.COMPRESS'='SNAPPY' table property in the CREATE TABLE DDL statement. Alternatively, you can set parquet.compression=SNAPPY in the "Custom hive-site settings" section in Ambari for either IOP or HDP.
Here is an example of using the table property during a table creation statement in Hive:
hive> CREATE TABLE inv_hive_parquet(
          trans_id int, product varchar(50), trans_dt date
      )
      PARTITIONED BY (year int)
      STORED AS PARQUET
      TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY');
Update partition metadata in the external table
Also, for an external partitioned table, we need to update the partition metadata whenever an external job (a Spark job in this case) writes partitions directly to the data folder, because Hive will not be aware of these partitions unless they are explicitly updated.
That can be done by either:
ALTER TABLE inv_hive_parquet RECOVER PARTITIONS;
//or
MSCK REPAIR TABLE inv_hive_parquet;
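Applied to the invoiceitems table from the question, that would be MSCK REPAIR TABLE invoiceitems;, re-run whenever the streaming job creates new partition directories.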

How to count entries with specific time range in Excel?

Can anyone help me with this brain teaser? :)
I need to count entries by hour and date, and as the list is huge a formula will save my life.
Below is an example of how the data looks.
Thank you in advance for your help!
17/05/2017 00:40
17/05/2017 01:10
17/05/2017 04:30
17/05/2017 05:00
17/05/2017 05:00
17/05/2017 05:05
17/05/2017 05:15
17/05/2017 05:20
17/05/2017 05:20
17/05/2017 05:30
17/05/2017 05:30
17/05/2017 05:30
17/05/2017 05:40
17/05/2017 05:45
17/05/2017 05:45
17/05/2017 05:50
17/05/2017 06:00
17/05/2017 06:00
17/05/2017 06:00
17/05/2017 06:20
17/05/2017 06:25
To do it with your dates and times in one column, use:
=SUMPRODUCT((MOD($A$1:$A$21,1)>=C1)*(MOD($A$1:$A$21,1)<=D1))
Edit: If the date and time is in one column, just use DATA --> Text to Columns, and use SPACE as the delimiter to put them in to two columns. There may be other ways to get your answer, keeping the info in one column, but that would likely be a relatively convoluted/complex formula. Text to Columns allows for quicker analysis.
If your data is in two columns, you can use COUNTIFS():
=COUNTIFS($B$1:$B$21,">="&E1,$B$1:$B$21,"<="&F1)
If your data is in one column, then add the following formula in the column to the right
=REPLACE(REPT(REPT("0",2-LEN(HOUR(A1)))&HOUR(A1),2),3,0,":00 - ")&":55"
and then use a pivot table to count each group.
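For what it's worth, outside Excel the same hourly tally is short in pandas; a rough sketch, assuming a hypothetical entries.csv holding one timestamp per row with no header:
import pandas as pd

# dayfirst=True because the timestamps look like 17/05/2017 00:40.
df = pd.read_csv('entries.csv', names=['ts'], parse_dates=['ts'], dayfirst=True)
print(df.groupby([df['ts'].dt.date, df['ts'].dt.hour]).size())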

Script for converting intermingled cell data to interaction matrix

I have bibliographic data from Web of Science that I need to configure into an interaction matrix (basically a tabulation table of authors working together). However, the cells are configured awkwardly.
1: [Hussain, Raja Azadar; Badshah, Amin] Quaid I Azam Univ, Dept Chem, Coordinat Chem Lab, Islamabad, Pakistan; [Tahir, Muhammad Nawaz] Univ Sargodha, Dept Phys, Sargodha, Punjab, Pakistan; [Tamoor-ul- Hassan; Bano, Asghari] Quaid I Azam Univ, Phytoharmone Lab, Dept Plant Sci, Islamabad, Pakistan
2: [Shahida, Shabnam; Khan, Muhammad Haleem] Univ Azad Jammu & Kashmir, Dept Chem, Muzaffarabad, Ajk, Pakistan; [Ali, Akbar] Pakistan Inst Nucl Sci & Technol, Div Chem, Islamabad, Pakistan
And I need it to look like this:
1: Hussain, Raja Azadar, Quaid I Azam Univ, Dept Chem, Coordinat Chem Lab, Islamabad, Pakistan
1: Badshah, Amin, Quaid I Azam Univ, Dept Chem, Coordinat Chem Lab, Islamabad, Pakistan
1: Tamoor-ul- Hassan, Quaid I Azam Univ, Phytoharmone Lab, Dept Plant Sci, Islamabad, Pakistan
1: Bano, Asghari, Quaid I Azam Univ, Phytoharmone Lab, Dept Plant Sci, Islamabad, Pakistan
2: Shahida, Shabnam, Univ Azad Jammu & Kashmir, Dept Chem, Muzaffarabad, Ajk, Pakistan
2: Khan, Muhammad Haleem, Univ Azad Jammu & Kashmir, Dept Chem, Muzaffarabad, Ajk, Pakistan
2: Ali, Akbar, Pakistan Inst Nucl Sci & Technol, Div Chem, Islamabad, Pakistan
Any help would be greatly appreciated!
Split the cell on `; [` --> arr1
For each element in arr1:
    Split on `]` --> arr2(0) (the bracketed author list) and arr2(1) (the affiliation)
    Split arr2(0) on `;` --> arr3
    For each element in arr3:
        Combine arr3(x) with arr2(1) and put the result in a cell
Loop till done
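If a script is on the table, here is a rough Python sketch of the same splitting logic (assuming every record keeps the "[authors] affiliation; [authors] affiliation" shape shown above):
import re

record = ('[Hussain, Raja Azadar; Badshah, Amin] Quaid I Azam Univ, Dept Chem, '
          'Coordinat Chem Lab, Islamabad, Pakistan; [Tahir, Muhammad Nawaz] '
          'Univ Sargodha, Dept Phys, Sargodha, Punjab, Pakistan')

# Each group is "[author; author; ...] affiliation"; the affiliation runs
# until the next bracketed author list or the end of the record.
for authors, affiliation in re.findall(r'\[([^\]]+)\]\s*([^\[]+)', record):
    affiliation = affiliation.strip().rstrip(';').strip()
    for author in authors.split(';'):
        print(f"{author.strip()}, {affiliation}")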
