Wikipedia Dumps and their installation

I was working on developing an API to tap the tremendous amount of information contained in the world's largest freely available encyclopedia, Wikipedia. A web interface to the API is already running, but retrieving pages over the web is extremely slow, and Wikipedia's policy is “don’t scrape our pages, just download our file dumps”. So we decided to download the Wikipedia dumps, which are freely available for research purposes, and build a local copy of the database.

I hope this post will be helpful to anyone who wants to do the same, i.e., make a local copy of Wikipedia’s database.

Things required

  • First, decide what data you want from Wikipedia and download the corresponding tables. The most commonly wanted data is the page content, which can be obtained by downloading the ‘enwiki-latest-pages-articles.xml’ dump (distributed as a compressed archive). This dump populates three tables: page, revision, and text. The latest Wikipedia dumps are available at Wiki Dumps. Other prominent tables include pagelinks and categorylinks. A complete description of the Wikipedia schema is available at WikiSchema.
  • Wikipedia provides two kinds of dumps: SQL dumps and XML dumps. SQL dumps can be loaded directly into the database with the following command (the same on Windows and Linux):

mysql -u username -ppassword database < filename
Note that there is no space between ‘-p’ and the password. Also make sure you give the complete path of the SQL file.
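
The SQL dumps are typically distributed gzip-compressed, so here is a rough sketch of a helper that loads either a plain or a .gz file. It is only an illustration: the names DB_USER, DB_NAME and load_sql_dump are made up for this post, and it uses a bare ‘-p’ (followed by a space) so that mysql prompts for the password instead of taking it from the command line.

```shell
#!/bin/sh
# Sketch: load one SQL dump into MySQL, decompressing .gz files on the fly.
# DB_USER and DB_NAME are hypothetical placeholders; set them for your setup.
DB_USER="${DB_USER:-username}"
DB_NAME="${DB_NAME:-wikidb}"

load_sql_dump() {
    dump_file="$1"
    if [ ! -f "$dump_file" ]; then
        echo "no such file: $dump_file" >&2
        return 1
    fi
    case "$dump_file" in
        # Compressed dump: stream it into mysql without unpacking to disk.
        *.gz) gunzip -c "$dump_file" | mysql -u "$DB_USER" -p "$DB_NAME" ;;
        # Plain SQL file: same redirection as the one-line command.
        *)    mysql -u "$DB_USER" -p "$DB_NAME" < "$dump_file" ;;
    esac
}

# Example call (the path is only an illustration):
# load_sql_dump /full/path/to/enwiki-latest-pagelinks.sql.gz
```

Streaming the compressed file saves the disk space that a fully unpacked copy would need, which matters for the larger tables.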

  • The XML files, however, can be converted to SQL files using mwdumper, a tool written in Java. An example of using mwdumper is as follows.

java -jar mwdumper.jar enwiki-latest-pages-articles.xml --format=sql:1.5 > enwiki-latest-pages-articles-dump.sql
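
The conversion can take a long time, so it may help to wrap it in a small function that checks the input and cleans up after a failed run. The sketch below is only that, a sketch: MWDUMPER_JAR and xml_to_sql are names invented here, and it assumes the double-dash ‘--format’ spelling that, as far as I know, mwdumper's own documentation uses.

```shell
#!/bin/sh
# Sketch: convert a pages-articles XML dump to SQL with mwdumper.
# MWDUMPER_JAR is a hypothetical variable pointing at the downloaded jar.
MWDUMPER_JAR="${MWDUMPER_JAR:-mwdumper.jar}"

xml_to_sql() {
    xml_file="$1"
    sql_file="$2"
    if [ ! -f "$xml_file" ]; then
        echo "no such file: $xml_file" >&2
        return 1
    fi
    # sql:1.5 asks for SQL matching the page/revision/text tables.
    if ! java -jar "$MWDUMPER_JAR" --format=sql:1.5 "$xml_file" > "$sql_file"; then
        echo "mwdumper failed; removing partial output" >&2
        rm -f "$sql_file"
        return 1
    fi
    # Succeed only if the resulting SQL file is non-empty.
    [ -s "$sql_file" ]
}

# Example call (file names are only an illustration):
# xml_to_sql enwiki-latest-pages-articles.xml enwiki-latest-pages-articles-dump.sql
```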

  • Once we have the SQL files, we cannot load them into the database directly, because the pages-articles dump file contains only a set of INSERT statements. The database schema must therefore be loaded before the pages-articles dump. The schema file is available at WikiSchema. Download the tables.sql file and create the schema as follows.

mysql -u username -ppassword database < tables.sql

  • After loading the schema, you can proceed with loading the SQL files.
  • NOTE:  a good point while uploading pagelinks table.  
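
Putting the order of operations together (schema first, then the data files), a minimal sketch might look like the following. The names DB_USER, DB_NAME, run_sql_file and load_wiki are hypothetical, and the function simply stops at the first file that fails to load.

```shell
#!/bin/sh
# Sketch: load the schema first, then each data dump, in the order given.
# DB_USER and DB_NAME are hypothetical placeholders.
DB_USER="${DB_USER:-username}"
DB_NAME="${DB_NAME:-wikidb}"

run_sql_file() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    # A bare -p makes mysql prompt for the password.
    mysql -u "$DB_USER" -p "$DB_NAME" < "$1"
}

load_wiki() {
    schema_file="$1"
    shift
    # The data dumps contain only INSERT statements, so the tables
    # must exist before any of them is loaded.
    run_sql_file "$schema_file" || return 1
    for dump in "$@"; do
        echo "loading $dump" >&2
        run_sql_file "$dump" || return 1
    done
}

# Example call (file names are only an illustration):
# load_wiki tables.sql enwiki-latest-pages-articles-dump.sql
```

With this sketch mysql asks for the password once per file; putting the credentials in a MySQL option file such as ~/.my.cnf is one way to avoid the repeated prompts.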
     

    ~ by Pradeep Varma on April 16, 2008.
