MyGenbank

MyGENBANK with mySQL: Rolling your own relational sequence database

BioTeam.Net HOWTO #1

“BioSQL with
MySQL:

Rolling your own
relational sequence database”

 

Additional titles considered:

  • “Oracle DBA not required :)”
  • “Escape from flatfile hell?”

Summary:

A step-by-step diary written by a database novice
describing how to use freely available tools and schemas from the Open
Bioinformatics Foundation to create a MySQL-based relational database
repository for biological sequence and annotation data.

 

Author:

Chris Dagdigian, dag@sonsorol.org, http://BioTeam.net

Version:

1.1 – Last updated May 30, 2002

Revision History:

1.0; Original version – May 2002

1.1; Reader feedback and error corrections; first ‘complete’ revision – May 2002

Acknowledgements:

·Keith Allen,
PhD, Paradigm Genetics, USA

·Mark Lambrecht,
PhD – University of Leuven, Centre of Microbial and Plant Genetics, Belgium

·Adam Witney, St. George’s Hospital Medical School,
University of London, UK

Additional Resources:

Open Bio Database Access website, http://obda.open-bio.org

Open-bio-l mailing list, http://open-bio.org/mailman/listinfo/open-bio-l


Contents:

Background & purpose

Assumptions

     •About the OBDA effort

About BioSQL

About Bioperl-db

Step-by-step guide

1.Installing MySQL the lazy way (via RPM) and configuring

2. Installing perl interfaces to MySQL

3. Installing Bundle::BioPerl via CPAN.pm

4. Installing BioPerl via CPAN.pm

5. Downloading & installing bioperl-db directly from the OBF CVS
repository

6.Downloading biosql-schema directly from the OBF CVS repository

7. Putting it all together: creating a biosql instance within MySQL

8.Loading a test protein dataset: SwissProt release 40

9.Observations

10.Next steps…

 

Background & purpose


While working on an informatics/Bio-IT project at Harvard University’s new Bauer Center for Genomics Research (http://cgr.harvard.edu) I met some people who were interested in storing sequence data into a relational database so they could avoid having to parse or index big flat text files.

Knowing that (a) The Open Bioinformatics Foundation had started work on solution and (b) I am a complete and utter moron when it comes to relational databases I thought that it would be a fun experiment to document my efforts at building such a server starting with a bare-bones RedHat Linux 7.2 server.

This document is essentially a diary of my experiences. It took me roughly a day to get to the point where I could load Swissprot into a biosql database via Jason Stajich’s load_seqdatabase.pl script.

Another reason I wrote this is that as a member of the Open Bioinformatics Foundation board of directors I am well aware that our website(s) are horrific and that our documentation often lags behind the functionality of our code base.

In particular the bioperl website gives little or no mention to “bioperl-db” and the related obda.open-bio.org website is totally devoid of any real information.

In fact- one practically needs to be an existing developer on one of our existing projects to even know that this stuff exists.

So- besides being a fun way to spend a day this is also a way for me to get myself more up to speed with the OBDA website and related work so that I can eventually help contribute to the website and project documentation efforts.

Comments/corrections/additions/flames regarding this document are welcome. Just send them along to me at dag@sonsorol.org

 

Assumptions

 

·This document describes installing a biosql-compliant database on an x86-based server running RedHat 7.2.

·  Different hardware architectures or OS’s may require different install procedures. Specifically the server used for this process was a Dell PowerEdge 6450 system with dual Xeon CPUs, 4GB physical memory and the 2.4.18 linux kernel.

· This document assumes the user has root access to the server system for the purposes of installing software and perl modules.

· This document assumes at least basic knowledge of RPM packages and how to install them

·The process described here includes downloading code from the internet via anonymous CVS which some firewalls may not allow. Explaining the use of CVS-over-SSH is not within the scope of this document.


About the Open Bio Database Access (OBDA) effort

OBDA is standard developed by the Open Bioinformatics Foundation to
standardize the way all of the OBF-affiliated projects access sequence data
resources. This includes local sequence resources (indexed flatfiles), remote
databases (web accessible sequence resources), CORBA servers (implementing the BioCORBA/BSANE specification) and
relational databases using the BioSQL
schema.

The website for the OBDA effort will eventually be http://obda.open-bio.org/ . At this time
though the site is only online in template form. The OBF is actively soliciting
people to take over the website and OBDA documentation efforts.

The primary discussion forum for the OBDA effort is the
“open-bio-l” mailing list. The list signup page is located at: http://open-bio.org/mailman/listinfo/open-bio-l
. The list archives are online at http://open-bio.org/pipermail/open-bio-l/.

 

About BioSQL

The website and primary email discussion list for all
things BioSQL related will be the ones listed above in the OBDA section.

A BioSQL ER diagram in PDF form can be downloaded
directly
(PDF) from the OBF anonymous CVS server. A formatted listing of
the tables described in the schema can be viewed
online directly
(HTML) from the same server.

About BioPerl-db

 

The bioperl-db project was started by Ewan Birney with
major work contributed by Elia Stupka and continued support coming from the
bioperl community.Its purpose is a
standalone sequence database with little external dependencies and tight
integration with the bioperl distribution.
Support for more databases and bindings in java and python by Biojava
and Biopython projects are welcomed and encouraged.All questions and comments should be directed to the bioperl list
<bioperl-l@bioperl.org> and
more information can be found about the related projects at http://bio.perl.org and http://www.open-bio.org.

Bioperl-db is a standalone module located within the CVS
code repository operated by the BioPerl Project. Anonymous CVS access and
tarball downloads are available at http://cvs.bioperl.org.
Look in the bioperl repository for the “bioperl-db” module.



Step by step guide

 

 

Step 1 Install MySQL the
lazy way (via RPM) and configure

 

While logged in as root I installed the following RedHat RPMS:

 

mysqlclient9-3.23.22-6

mysql-3.23.41-1

mysql-server-3.23.41-1

mysql-devel-3.23.41-1

 

The install was verified by issuing the following command:

 

# rpm –qa | grep
mysql

mysqlclient9-3.23.22-6

mysql-3.23.41-1

mysql-server-3.23.41-1

mysql-devel-3.23.41-1

 

MySQL
version warning:

Mark Lambrecht recommends the use of MySQL version 3.23 or later. When trying
to install version 3.22 on an Alpha-based system he received errors relating to
the creation of the “
fattribute_to_feature
table.

 

By default the RPM MySQL install puts sample configuration files
into /usr/share/mysql/

 

The base directory that the RPM MySQL install uses is
/var/lib/mysql

 

The global configuration file is kept at /etc/my.cnf

 

It is important to edit the configuration file at /etc/my.conf BEFORE
starting the database for the first time particularly if you want (like me) to
have the MySQL data files live somewhere other than /var/lib/mysql.

 

In my case I wanted to do two main things to the global config
file, (a) change the default location to the much larger /usr/local/ partition
and (b) incorporate some of the config suggestions found in /usr/share/mysql/
specific to large dedicated database servers.

 

WARNING; BioSQL databases can be very large. It may be wise to double
check the size of your /var partition before continuing on with a totally
default RPM-based install. The 1GB /var partition present on the Dell 6450
server is far too small. As an example:
The swissprot40 database is 307MB in size. When this dataset was loaded
into MySQL the resulting database files and indices consumed roughly 399MB of
disk space.

 

NOTE: Several BioSQL users have found
that very large sequence records will greatly exceed some of the default
settings that MySQL uses. In particular one may need to set the configuration
value “
max_allowed_packet” to a fairly high value (several megabytes at least). If you
encounter any error messages that are similar to: “
DBD::mysql::st execute failed: MySQL server
has gone away…
” than it is likely that max_allowed_packet

needs to be set to a larger value. In an email post to the bioperl-l (http://bioperl.org/mailman/listinfo/bioperl-l)
mailing list Keith Allen covers his experience with this problem in excellent
detail. The full email message can be read at the following URL: http://bioperl.org/pipermail/bioperl-l/2002-May/007987.html

 

 

After revising the value of max_allowed_packet to ‘3M’ due to
feedback from Keith Allen (see above Note) the global configuration file
/etc/my.conf now looks like this:

 

[mysqld]

port=3306

datadir=/usr/local/mysql-database

socket=/usr/local/mysql-database/mysql.sock

skip-locking

set-variable= key_buffer=256M

set-variable=
max_allowed_packet=3M

set-variable=
table_cache=256

set-variable=
sort_buffer=1M

set-variable=
record_buffer=1M

set-variable=
myisam_sort_buffer_size=64M

set-variable=
thread_cache=8

set-variable=
thread_concurrency=4

server-id= 1

tmpdir= /tmp/

 

[mysql.server]

user=mysql

basedir=/usr/lib

 

[safe_mysqld]

err-log=/var/log/mysqld.log

pid-file=/var/run/mysqld/mysqld.pid

 

 

[mysqldump]

quick

set-variable=
max_allowed_packet=16M

 

[mysql]

no-auto-rehash

safe-updates

 

[isamchk]

set-variable= key_buffer=128M

set-variable= sort_buffer=128M

set-variable= read_buffer=2M

set-variable= write_buffer=2M

 

[myisamchk]

set-variable= key_buffer=128M

set-variable= sort_buffer=128M

set-variable= read_buffer=2M

set-variable= write_buffer=2M

 

[mysqlhotcopy]

interactive-timeout

 

 

Just in case I missed anything I decided to add a symbolic link
back to /var/lib/mysql – this turns out to be important because my
configuration file still has “basedir=/var/lib” which is a total mistake.

 

 

# cd /var/lib

# ln -s
/usr/local/mysql-database/ ./mysql

 

Now we can start the MySQL datbase…

 

/etc/rc.d/init.d/mysql
start

 

 

Important! Protect your database by setting a root MySQL password:

 

# mysqladmin -u
root password ‘xxxxxxx’

 

 

After installing, configuring, starting and enabling a root
password it should be possible to logoff as root and re-login as a ‘normal’
system user.

 

To test this I made the switch from root user to my normal ‘dag’
user account and confirmed that I can login as root to the mysql instance:

 

[dag@pe3
scripts]$ mysql -u root -p

Enter password:

Welcome to the
MySQL monitor.Commands end with ; or
\g.

Your MySQL
connection id is 8 to server version: 3.23.41-log

 

Type ‘help;’ or
‘\h’ for help. Type ‘\c’ to clear the buffer.

 

mysql> exit

Bye

 

 

Success! We now have a working MySQL instance.

 

 

 

Step 2 Install the perl interfaces to MySQL

 

In future steps we are going to use perl to help us talk to the
MySQL database. In order to do this we need some additional perl modules
installed. We need:

 

·
DBI:: – The perl generic database interface

·
DBD::MySQL – The perl MySQL database driver module

 

Given the lazy RPM install we previously did with the MySQL
database it makes sense to install the preconfigured RPM versions of these
modules that RedHat provides. The specific modules installed were:

 

·
perl-DBI-1.18-1.i386.rpm

·
perl-DBD-MySQL-1.2216-4.i386.rpm

 

Installation of these modules went smoothly. This was a welcome
relief as given the still traumatic memories of spending days trying to get
DBD::Oracle to install and work cleanly on Compaq Alphaservers and HP-UX
machines.

 

 

Step 3 Install Bundle::BioPerl via CPAN.pm

 

Note: Skip this step if BioPerl is already on your system.

 

Some of the perl scripts needed to help load the database with
sequence data we will be using in future steps are themselves dependent on the
BioPerl distribution (http://bioperl.org/)

for critical functions such as sequence parsing etc.

 

The BioPerl distribution itself has its own dependencies on Perl
modules and other programs that may or may not be found by default on a system.
These external dependencies are not required but if installed they will give
the BioPerl distribution additional functionality and features.

 

The BioPerl people have created a special CPAN package called
“Bundle::BioPerl” that can assist with downloading and automatically installing
all of the CPAN resident modules that BioPerl likes to use. By itself
Bundle::BioPerl is next to useless but when used with CPAN.pm it becomes very
convenient.

 

For the purposes of building a BioSQL database we don’t really
care about all of the various external BioPerl dependencies enough to give
special attention to the issue. The proper ‘lazy’ way then is just to fire up
CPAN.pm and let it do all of the work.

 

People unfamiliar with the CPAN.pm module can read its
documentation by issuing the system command ‘perldoc CPAN’.

 

To fire up the CPAN module in shell mode issue the following
command as root:

 

# perl –MCPAN –e shell

 

If this is the first time that CPAN.pm has been used there will be
an initial setup and configuration phase where CPAN configures itself to the
local network environment. The user is also prompted to select from a list of
available mirror sites.

 

To install Bundle::BioPerl while in the CPAN shell environment:

 

cpan> install Bundle::BioPerl

 

That’s about it. The process may take some time depending on the
speed of the server and available network connection.

 

NOTE:

When attempting this for the first time the installation of
Bundle::BioPerl completely failed because one of the perl modulesimplementing SOAP functionality listed in
the Bundle is apparently only distributed as a .ZIP archive rather than the
more traditional tar.gz format.The
install process failed because CPAN.pm had not been configured to know where a
suitable ‘unzip’ utility could be found. The solution was to exit from the CPAN
shell, manually install the unzip rpm and then restart the CPAN shell.Within the restarted CPAN shell environment
the following command will make CPAN aware of the newly installed unzip
utility:

 

cpan> o conf unzip/usr/bin/unzip

cpan> install Bundle::BioPerl

 

 

Step 4 Install BioPerl via CPAN.pm

 

Note: Skip this step if BioPerl is already on your system.

 

Now that the various helper modules and external dependencies of
BioPerl have been installed it should now be possible to use the same CPAN.pm
shortcut to automatically download and install the full BioPerl distribution.

 

Asking CPAN to install the ‘Bio::Seq’ module is enough to get it
to download and install the full distribution:

 

cpan> install Bio::Seq

 

BioPerl is a very large distribution (possibly the single largest
collection of related perl modules in the entire CPAN repository) – you may
wish to download and manually install the full BioPerl distribution directly
from http://bioperl.org/.

 

 

Step 5 Download and install bioperl-db directly from the OBF CVS
repository

 

Getting the latest and greatest version of the bioperl-db
distribution involves downloading straight from the public source code server
located at http://cvs.open-bio.org/

 

Instructions for downloading code via anonymous CVS are available
on the cvs.open-bio.org website. Your firewall or local network configuration
may not allow anonymous CVS connections to occur. See below for a workaround:

 

How to use the web and avoid anonymous CVS altogether:

If a firewall or network gateway device blocks CVS connection
attempts it should be possible to download code and full distributions directly
from the http://cvs.open-bio.org/
website. The CGI gateway to the CVS repository has a nice feature entitled
“Download Tarball”. When clicked the server will generate a fresh code checkout
and will send it to your browser as a .tar.gz archive file which can be saved
locally to disk. The files may all be named “cvs_root.tar.gz” so beware of
potential file overwrite issues when downloading multiple files or codebases.

 

How to checkout via anonymous:

 

The anonymous CVS password is ‘cvs’

Full instructions are available at http://cvs.open-bio.org

Example session with output:

 

# cvs
-d:pserver:cvs@cvs.open-bio.org:/home/repository/bioperl login

CVS password:

#

# cvs
-d:pserver:cvs@cvs.open-bio.org:/home/repository/bioperl co bioperl-db

cvs server:
Updating bioperl-db

U
bioperl-db/BUGS

U
bioperl-db/Changes

U
bioperl-db/LICENSE

U
bioperl-db/MANIFEST

U
bioperl-db/MANIFEST.SKIP

U
bioperl-db/Makefile.PL

U
bioperl-db/README

cvs server:
Updating bioperl-db/Bio

cvs server:
Updating bioperl-db/Bio/DB

U
bioperl-db/Bio/DB/Annotation.pm

 

<FULL
OUTPUT CUT FOR BREVITY>

 

Installing bioperl-db

 

The bioperl-db distribution is in the standard CPAN-style format
so installation is similar to any other external perl module:

 

Once the distribution is uncompressed and unpacked there are only
three commands necessary:

 

# perl Makefile.PL

# make

# make install

 

NOTE: This install will fail if the DBD:: and DBI::Mysql modules
cannot be found.

 

 

Step 6 Download and install biosql-schema directly from the OBF
CVS repository

 

Reference the above discussion on anonymous CVS, firewalls and the
ability to download tar archives directly from the http://cvs.open-bio.org/ website.

 

The OBDA schema for BioSQL databases lives in its own CVS code
repository because its use spans many of the Open Bio projects and coding
efforts.

 

The repository is called “biosql”, the module is called
“biosql-schema”. The CVS root for the respository can be found at
“/home/repository/biosql”.

 

 

Example anonymous CVS checkout session:

 

[root@pe3]# cvs
-d :pserver:cvs@cvs.open-bio.org:/home/repository/biosql login

Logging in to
:pserver:cvs@cvs.open-bio.org:2401/home/repository/biosql

CVS password:

[root@pe3
biosql]# cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biosql checkout
biosql-schema

cvs server:
Updating biosql-schema

cvs server:
Updating biosql-schema/doc

U
biosql-schema/doc/README

U
biosql-schema/doc/biosql-ERD.pdf

U
biosql-schema/doc/biosql.html

cvs server:
Updating biosql-schema/scripts

U
biosql-schema/scripts/create_mysql_db.pl

U
biosql-schema/scripts/transform_sql.pl

cvs server:
Updating biosql-schema/sql

U
biosql-schema/sql/README

U
biosql-schema/sql/biopipelinedb-mysql.sql

U
biosql-schema/sql/biosql-accelerators-pg.sql

U
biosql-schema/sql/biosqldb-mysql.sql

U
biosql-schema/sql/biosqldb-pg.sql

U
biosql-schema/sql/biosqldb-views-pg.sql

U
biosql-schema/sql/makefile

cvs server: Updating
biosql-schema/sql/ontology

U
biosql-schema/sql/ontology/biosqldb-ontology-mysql.sql

U
biosql-schema/sql/ontology/biosqldb-ontology-pg.sql

cvs server:
Updating biosql-schema/t

[root@pe3
biosql]#

 

 

The files found in the biosql-schema do not need to be installed
or otherwise processed. We are going to use the scripts and documents found in
this module directly.

 

 

Step 7 – Putting it all
together: creating a biosql instance within MySQL

 

Inside the biosql-schema/scripts folder there is a perl script called
“create_mysqldb.pl”. This script is just a helper script that
automates the process of connecting to the datbase and issuingthe SQL statements that cause the proper
database tables to be created. The script reads the actual SQL commands from
the file found in biosql-schema/sql/biosql-mysqldb.sql

 

After looking at how simple this script is I felt the need to
bypass it completely and just perform the steps manually for curiosity sake.

 

Here is the output from creating an initial database named ‘biosql’:

 

 

[dag@pe3]$
mysql -u root -p

Enter password:

Welcome to the
MySQL monitor.Commands end with ; or
\g.

Your MySQL
connection id is 9 to server version: 3.23.41-log

 

Type ‘help;’ or
‘\h’ for help. Type ‘\c’ to clear the buffer.

 

mysql>
create database biosql;

Query OK, 1 row
affected (0.00 sec)

mysql> exit

Bye

 

 

Now that the database is created we need to connect to the
database and issue the SQL commands contained in
biosql-scheme/sql/biosqldb-mysql.sql file:

 

Within the mysql client interface we can use the command
“source” to read in SQL statements from a file.

 

Here is the output from that process:

 

[dag@pe3]$
mysql -u root -p biosql

Enter password:

Welcome to the
MySQL monitor.Commands end with ; or
\g.

Your MySQL
connection id is 11 to server version: 3.23.41-log

 

Type ‘help;’ or
‘\h’ for help. Type ‘\c’ to clear the buffer.

 

mysql>
source sql/biosqldb-mysql.sql

 

Query OK, 0
rows affected (0.00 sec)

Records: 0Duplicates: 0Warnings:

 

<snip – lots of boring output cut
here…>

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 1 row
affected (0.00 sec)

 

Query OK, 0
rows affected (0.00 sec)

mysql>

 

 

Now that we are still connected to datbase ‘biosql’ let us see
what we created by issuing the command “show tables;”

 

 

mysql> show
tables;

+—————————-+

|
Tables_in_biosql|

+—————————-+

|
biodatabase|

| bioentry|

|
bioentry_direct_links|

|
bioentry_qualifier_value|

|
bioentry_reference|

| bioentry_taxa|

|
biosequence|

|
cache_corba_support|

| comment|

| dbxref|

|
dbxref_qualifier_value|

|
location_qualifier_value|

|
ontology_term|

|
reference|

|
remote_seqfeature_name|

|
seqfeature|

|
seqfeature_location|

|
seqfeature_qualifier_value |

|
seqfeature_relationship|

|
seqfeature_source|

| taxa|

+—————————-+

21 rows in set
(0.00 sec)

 

mysql> exit

Bye

 

Excellent! We have just created a BioSQL database within MySQL !

 

 

Step 8 – Loading a test
protein dataset (Swissprot) into the database

 

Before tackling all of GenBank, a smaller dataset is probably a
good idea as a test case to learn about the process of loading up the new
biosql database with sequence data.

 

The dataset selected was SwissProt downloaded from ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/sprot40.dat

 

·
Total uncompressed size: 307MB

·
Apparent number of sequence entries: 101,707

 

 

The database loader script used was the”load_seqdatabase.pl” script from the /scripts/ folder
inside the bioperl-db distribution.

 

The script is pretty basic. There are some internal configuration
settings inside the script that can be manually configured. Each of the
configurable options can also be overridden by command line arguments.

 

The basic syntax for the load_seqdatabase.pl script is as follows:

 

./load_seqdatabase.pl <options>
<dataset identifier> /path/to/data/file(s)

 

Some of the more important options and arguments that can be
hard-coded into the perl script or passed via the command line are:

 

-h (hostname
of the mysql server)

-sqldb (the name of the biosql database
instance that was created)

-dbuser (username to connect to MySQL
with)

-dbpass (MySQL user password)

-format (the format of the sequence
datafile that is going to be read/loaded)

 

One of the key options needed is the “dataset identifier”. This
identifier allows one to store many ‘biodatabases” inside the single BioSQL
instance.

 

Given that we are trying to load release 40 of the Swissprot
dataset a logical identifier could be something like “swiss”, “swissprot”,
“sprot” or “swissprot40” etc.

 

 

Here we go…

 

[dag@pe3
biosql]$ ./load_seqdatabase.pl -host localhost -sqldb biosql -dbuser root
-dbpass XXX -format swiss swissprot40 /n/data/mirrors/swissprot/sprot40.dat

Reading
/n/data/mirrors/swissprot/sprot40.dat

[dag@pe3
biosql]$

 

 

Wow! It worked!

 

 

 

Step 9 – Observations

 

  • Loading
    Swissprot into the MySQL database took 2 hours and 10 minutes. The slow
    speed is caused by the overhead of having to parse each entry into a
    BioPerl sequence object prior to inserting it into the database. There is
    much room for performance tuning; one approach may be to start an
    internet-accessible repository where people can download pre-built
    tab-delimited MySQL data files that can be bulk imported into a
    biosql-compliant database very quickly.

 

  • The
    size of sprot40.dat is 307MB

 

  • After
    swissprot was loaded the cumulative size of all the files inside
    /var/lib/mysql/biosql/ was 399MB

 

 

Step 10 – What
next?

 

This document is a work in progress. Among the things I’d
like to do next:

 

  • Write
    perl scripts to validate the information that was loaded into the database

 

  • Figure
    out how to export/dump the database and see how quickly we can recreate
    the database with these raw files instead of laboriously using BioPerl to
    parse and load objects one at a time. Loading the database is slow and it
    may be cool to package up tab-delimited biosql exports so that others can
    load their own databases much faster.

 

  • Run
    more time tests to see where things can be made faster

 

  • Start
    trying to cram all of GenBank into a biosql database!