NAME
wfb2sql - Converts the CIA World Factbook into SQL statements
SYNOPSIS
wfb2sql [options] --database=[db2|postgresql|mysql]
DESCRIPTION
wfb2sql is a Perl script that converts the CIA World Factbook HTML pages into SQL statements.
It was tested with the HTML version of the World Factbook 2003 and the World Factbook 2002 and probably has to be modified to extract data from further versions.
By default, all data that can be interpreted, is included in the SQL statements. If you want to restrict the data to only some information, use the --no... command line options as described below. The data is written to STDOUT, so you can directly pass it to a database client.
In the current version, three database management systems are supported: IBM DB/2, PostgreSQL and mySQL. The mysql schema data produced by wfb2sql uses a very basic SQL language, which, for example, doesn't include foreign keys. At least this schema should work with other DBMSs as well. If you plan to adapt wfb2sql to other SQL dialects, focus on BLOB handling first. Not all databases support import of files in INSERT statements.
Notes for IBM DB/2
The db2 data had to be split into several files, since DB/2 doesn't support file inclusion in INSERT statements without a multimedia extender. Because of that, wfb2sql outputs the schema data to STDOUT and creates extra files with spool data. By default, it tries to create the extra files in the current directory, but you may specify an alternative directory using the --tmp-dir option. Please note, that you have to delete the extra files manually after script execution. To insert the world factbook data into a database, create a new database and run
wfb2sql [your options] | db2 -t
Use the --database-name parameter to tell the script the name of your database. It will automatically add a CONNECT TO <database> statement to the produced sql code.
Notes for MySQL
The flag and map images are loaded into the database via the LOAD_FILE() function. The use of this function requires the FILE privilege. So, please make sure, that the user has sufficient rights to execute this function. Otherwise, MySQL will not report an error and all the blob fields will just contain NULL. To insert the world factbook data into a database, create a database (e.g. factbook) and run
wfb2sql [your options] | mysql <databasename>
The script needs some time, so you might want to turn on the --verbose option.
Notes for PostgreSQL
Please note that you must have Postgres superuser privilege to use server-side lo_import(), which is needed for maps and flags. To insert the world factbook data into a database, create a new database (with createdb) and run
wfb2sql [your options] | psql <databasename>
Configuration file
wfb2sql supports the use of a configuration file called wfb2sqlrc. The script first tries to find the file wfb2sqlrc in the current directory. If it is not found, it tries to read the file .wfb2sqlrc in the users home directory. If both files exist, the latter will be ignored. The file format is similar to standard UN*X configuration files. Comments begin with a # and all behind the # will be ignored. Options are specified by
optionname = value
and whitespaces are silently ignored. If a configuration file is load, these settings will override the defaults. Additional command line arguments will override the configuration file settings. Note that the options in the configuration file don't have leading --. A --no... option might look like this in the configuration file (using the option --noboundaries in this example):
boundaries = no
You can as an alternative use boundaries = yes to explicitly turn on the option boundaries.
OPTIONS
- --database=dialect
- Selects an SQL dialect for the SQL output. Currently supported values are db2, mysql and postgresql.
- --database-name=name
- Adds a 'CONNECT TO name' to the schema definition. This is needed if you want to pipe the output directly into db2. For databases other than db2 this parameter is ignored.
- --data-only
- Output only SQL data and emit schema information. Use this function if you already have created the tables.
- --directory=dir
- Specifies the root directory of the factbook installation. The root directory containts the file index.html and the print/-directory.
- --noboundaries
- Include no boundary information. Otherwise, the borders to surrounding countries are stored in an extra table called boundaries, the total boundaries are stored in the country entity.
- --nocommunications
- Emit data on communication. Otherwise the following fields will be added:
- phone_mainlines
- Number of main lines in use.
- phone_mobile
- Number of cellphones.
- internet_users
- Number of internet users.
- isps
- Number of Internet service providers.
- --nodependent-countries
- Emit dependent country information.
- --noflags
- Emit flag data and schema information.
- --nomaps
- Emit map data and schema information.
- --noorganizations
- Emit data about organizations memberships.
- --noschema
- Do not add a schema name (default: wfb) when printing the SQL CREATE TABLE commands.
- --notext-fields
- Do not include textual fields like background information.
- --schema-name=name
- Specifies an alternate schema name. This applies only to databases which support schemata (DB/2 and PostgreSQL Version 7.3 or higher). If you want to disable the schema names, use the --noschema option. The default schema name is wfb.
- --schema-only
- Prints only SQL schema data. Use this function if you just want to create the SQL tables first.
- --small-flags
- Include only small flags. By default, the large flags will be included in the database. Use the --noflags option to emit all kind of flags.
- --temp-dir=dir
- The db2 creation method requires the use of temporary files. By default they will be placed in the current directory, but you may specify a different directory using this option.
- -v, --verbose
- Prints some more or less useful information to STDERR.
REQUIRES
Perl 5.004, Getopt::Long, Pod::Usage, Options::General
Getopt::Long is available at http://search.cpan.org/author/JV/Getopt-Long-2.33/
Pod::Usage is available at http://search.cpan.org/author/MAREKR/PodParser-1.23/
Options::General is available at http://search.cpan.org/author/TLINDEN/Config-General-2.21/
AUTHOR
Jan Schreiber <mail@jschreiber.com>
COPYRIGHT
COPYRIGHT (C) 2003 Arbeitsbereich Datenbanken und Informationssysteme, Eberhard-Karls-Universitaet Tuebingen, Germany.
BUGS
``Cyprus'' does not contain correct values for some entries since all values consist of two values: on for the ``Greek Cypriot area'' and one for the ``Turkish Cypriot area''. Maybe this data should be handled manually.
Please send bug reports to the author.
SEE ALSO
CIA World Factbook http://www.cia.gov/cia/publications/factbook/
CIA World Factbook 2002 http://www.cia.gov/cia/download2002.htm