2011-11-23

Intrinsic property database setup

The goal of this research is to extract taxonomic (isA or instance-of) relations from the Wikipedia category structure by classifying category links according to the intrinsic property of each link. The Java program for the classification algorithm is being developed by Mr. DoungHyun Choi. To evaluate the algorithm, we need to plot the generated taxonomy as a graph and then compare it with the graph of taxonomic relations derived from WordNet.

What has been done?

The intrinsic property database setup completed successfully after several days of running on the server. Overall, six database tables had to be set up before the analysis could proceed:

1. Redirect table

2. Categorylinks table

3. Page table

4. CategorylinksPure (8,638,676 links)

5. Intrinsic_ARTCAT_AHDCW_INTERNAL (70,374,120 articles)

6. Intrinsic_CATCAT_AHDCW_INTERNAL (8,715,355 categories)

The first three were extracted directly from the Wikipedia dump (20111007), while the last three were generated by the program. Extracting and creating the last three tables took several days. There were many challenges, among them a Java version incompatibility and a MySQL database access issue, but they were resolved with the help of some colleagues.

What is next?

I have to study the classified taxonomic relations and then select a library that supports graph visualization of them. I will keep posting updates as the work progresses.
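While the library choice is still open, one low-cost way to look at a sample of the relations might be to write them out in Graphviz DOT format and render them with any DOT-capable tool. The following is only a minimal sketch under stated assumptions: the class name `TaxonomyDotExport`, the `IsA` pair type, and the sample category titles are hypothetical and are not part of the actual classification program or its database schema.

```java
import java.util.Arrays;
import java.util.List;

public class TaxonomyDotExport {

    /** One classified link, read as "child isA parent". Hypothetical type for illustration. */
    static final class IsA {
        final String child;
        final String parent;

        IsA(String child, String parent) {
            this.child = child;
            this.parent = parent;
        }
    }

    /** Escape backslashes and double quotes so a title can be used as a DOT quoted string. */
    static String quote(String title) {
        return "\"" + title.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Build a DOT digraph with one edge per isA relation, pointing from child to parent. */
    static String toDot(List<IsA> relations) {
        StringBuilder sb = new StringBuilder("digraph taxonomy {\n");
        for (IsA r : relations) {
            sb.append("  ")
              .append(quote(r.child))
              .append(" -> ")
              .append(quote(r.parent))
              .append(";\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample relations; real input would come from the classified links.
        List<IsA> sample = Arrays.asList(
                new IsA("Dogs", "Mammals"),
                new IsA("Mammals", "Animals"));
        System.out.print(toDot(sample));
    }
}
```

Saving the output to a file and rendering it would give a quick picture of a small subgraph. With millions of links, only a filtered sample is likely to be readable, so this is better suited to spot checks than to plotting the whole taxonomy.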