Tuesday, February 1, 2011

Semantic Data Acquisition from Wikipedia

Three of us, Anna Cudzich, Sławomir Wałkowski and me (Bartosz Kosarzycki), formed a team and started working on the Semantic Data Acquisition project at PUT (Poznan University of Technology). The idea was simple: get data from Wikipedia and convert it to an ontology (an RDF file). We wanted to benefit from the structure of Wikipedia, such as tables with preformatted data. We focused on just a few basic facts about American inventors:
- name
- date of birth/death
- place of birth/death
- native-born American or immigrant
- inventions

The first step was to download the wiki page content, as described here:
http://kosiara87.blogspot.com/2011/01/wget-downloading-web-page-content.html

Then we:
- parsed the HTML files to get *.out files (text info on inventors) and *.table files (structured HTML table data)
- got rid of all the junk along the way (empty spaces, whitespace characters, etc.)
- wrote a parser that extracts the desired info from the strings of text (*.out) and from *.table, and merges it together. We distinguished word types (nouns, verbs, pronouns, etc.) to get better-quality information.
- created an ontology (wrote an RDF/XML file with the appropriate info)
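
The table-to-RDF part of these steps can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the code we actually wrote: the column headers in COLUMN_TO_PROPERTY, the names TableRows and table_to_rdf, and the numbered "inventorN" resource URIs are all assumptions made up for the example. Only the inv: namespace and the property names (firstname, surname, wasbornin) come from the queries later in this post. It also assumes a flat table whose first row holds the headers.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
INV_NS = "http://www.cs.put.poznan.pl/inventors/#"  # namespace used by the queries in this post
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("inv", INV_NS)

# Hypothetical mapping from table column headers to inv: properties.
COLUMN_TO_PROPERTY = {"First name": "firstname", "Surname": "surname", "Born in": "wasbornin"}


class TableRows(HTMLParser):
    """Collects every table row as a list of whitespace-normalized cell texts."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of spaces/newlines left over from the HTML layout.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def table_to_rdf(html):
    """Treat the first row as headers and emit one rdf:Description per other row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, body = parser.rows[0], parser.rows[1:]
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for i, row in enumerate(body, start=1):
        record = dict(zip(header, row))
        # Hypothetical URI scheme: one numbered resource per table row.
        node = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                             {f"{{{RDF_NS}}}about": f"{INV_NS}inventor{i}"})
        for column, prop in COLUMN_TO_PROPERTY.items():
            if record.get(column):
                ET.SubElement(node, f"{{{INV_NS}}}{prop}").text = record[column]
    return ET.tostring(root, encoding="unicode")
```

Calling table_to_rdf() on the HTML of a single table returns the RDF/XML as a string, which can then be written to a file. Real Wikipedia tables are messier (nested tables, merged cells, footnote markers), so a real parser needs more cleanup than this.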

The resulting file can be loaded with the TWINKLE tool:
http://www.ldodds.com/projects/twinkle/

We used SPARQL to retrieve information from the RDF file. SPARQL is a query language for RDF data, defined by the W3C. More info here:
http://www.w3.org/RDF/

Sample queries in SPARQL:

Just get the surnames of the inventors:
------------------------------------
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT ?surname
WHERE
{ ?x inv:surname ?surname }

Get some more info:
----------------------------
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT ?place ?firstname1 ?surname1 ?firstname2 ?surname2
WHERE
{
?x inv:firstname ?firstname1 .
?x inv:surname ?surname1 .
?x inv:wasbornin ?place .
?y inv:surname ?surname2 .
?y inv:firstname ?firstname2 .
?y inv:wasbornin ?place
}
ORDER BY ASC (?surname1)
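
What the query above does is join the inventors with themselves on the place of birth. A rough Python equivalent, working on plain tuples instead of RDF, may make that easier to see. The function name same_birthplace_pairs and the skip_self flag are made up for this sketch. Note that, as written, the query also pairs every inventor with himself or herself, since nothing forces ?x and ?y to differ; in SPARQL a FILTER (?x != ?y) would remove those rows, and skip_self imitates that by comparing names.

```python
from collections import defaultdict


def same_birthplace_pairs(people, skip_self=False):
    """people: iterable of (firstname, surname, place) tuples.

    Returns rows of (place, firstname1, surname1, firstname2, surname2),
    sorted by surname1, mirroring the SELECT and ORDER BY of the query.
    """
    by_place = defaultdict(list)
    for first, last, place in people:
        by_place[place].append((first, last))

    rows = []
    for place, group in by_place.items():
        for x in group:
            for y in group:
                if skip_self and x == y:
                    continue  # rough stand-in for FILTER (?x != ?y)
                rows.append((place, *x, *y))
    return sorted(rows, key=lambda row: row[2])  # row[2] is surname1
```

With two people born in the same town, the unfiltered version returns four rows for that town (each person paired with both), and the filtered version returns two.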

Count professions:
-------------------------
prefix sparql: <http://www.w3.org/2005/xpath-functions#>
prefix inv: <http://www.cs.put.poznan.pl/inventors/#>
SELECT (COUNT(sparql:lower-case(?profession )) AS ?total)  ?profession
WHERE
{
?x inv:profession ?profession
}
GROUP BY (?profession)
ORDER BY ASC (?total)
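
One thing to keep in mind about this query: lower-case is applied inside COUNT, but the groups are still formed on the original ?profession value, so "Engineer" and "engineer" end up as two separate rows. The small Python sketch below (the name count_professions and the fold_case flag are invented for the example) shows the difference between counting the values as they are and counting them after lower-casing, which is what grouping on the lower-cased value would give.

```python
from collections import Counter


def count_professions(professions, fold_case=False):
    """Return (profession, total) pairs ordered by ascending total.

    fold_case=False groups on the exact strings, like the query above;
    fold_case=True lower-cases first, so spellings that differ only in case merge.
    """
    keys = [p.lower() for p in professions] if fold_case else list(professions)
    # sorted() is stable, so professions with equal totals keep first-seen order.
    return sorted(Counter(keys).items(), key=lambda item: item[1])
```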


The project site at PUT can be found here:
http://semantic.cs.put.poznan.pl/dokuwiki/doku.php


technologies, IDEs and programs used: 
