MyGENBANK with mySQL: Rolling your own relational sequence database
BioTeam.Net HOWTO #1
“BioSQL with
MySQL:
Rolling your own
relational sequence database”
Additional titles considered:
- “Oracle DBA not required :)”
- “Escape from flatfile hell?”
Summary:
A step-by-step diary written by a database novice
describing how to use freely available tools and schemas from the Open
Bioinformatics Foundation to create a MySQL-based relational database
repository for biological sequence and annotation data.
Author:
Chris Dagdigian, dag@sonsorol.org, http://BioTeam.net
Version:
1.1 – Last updated May 30, 2002
Revision History:
1.0; Original version – May 2002
1.1; Reader feedback and error corrections; first ‘complete’ revision – May 2002
Acknowledgements:
- Open Bioinformatics Foundation, http://www.open-bio.org
- Bauer Center for Genomics Research, Harvard University, http://cgr.harvard.edu
·Keith Allen,
PhD, Paradigm Genetics, USA
·Mark Lambrecht,
PhD – University of Leuven, Centre of Microbial and Plant Genetics, Belgium
·Adam Witney, St. George’s Hospital Medical School,
University of London, UK
Additional Resources:
Open Bio Database Access website, http://obda.open-bio.org
Open-bio-l mailing list, http://open-bio.org/mailman/listinfo/open-bio-l
Contents:
•Background & purpose
•Assumptions
•About the OBDA effort
•About BioSQL
•About Bioperl-db
•Step-by-step guide
1.Installing MySQL the lazy way (via RPM) and configuring
2. Installing perl interfaces to MySQL
3. Installing Bundle::BioPerl via CPAN.pm
4. Installing BioPerl via CPAN.pm
5. Downloading & installing bioperl-db directly from the OBF CVS
repository
6.Downloading biosql-schema directly from the OBF CVS repository
7. Putting it all together: creating a biosql instance within MySQL
8.Loading a test protein dataset: SwissProt release 40
9.Observations
10.Next steps…
Background & purpose
While working on an informatics/Bio-IT project at Harvard University’s new Bauer Center for Genomics Research (http://cgr.harvard.edu) I met some people who were interested in storing sequence data into a relational database so they could avoid having to parse or index big flat text files.
Knowing that (a) The Open Bioinformatics Foundation had started work on solution and (b) I am a complete and utter moron when it comes to relational databases I thought that it would be a fun experiment to document my efforts at building such a server starting with a bare-bones RedHat Linux 7.2 server.
This document is essentially a diary of my experiences. It took me roughly a day to get to the point where I could load Swissprot into a biosql database via Jason Stajich’s load_seqdatabase.pl script.
Another reason I wrote this is that as a member of the Open Bioinformatics Foundation board of directors I am well aware that our website(s) are horrific and that our documentation often lags behind the functionality of our code base.
In particular the bioperl website gives little or no mention to “bioperl-db” and the related obda.open-bio.org website is totally devoid of any real information.
In fact- one practically needs to be an existing developer on one of our existing projects to even know that this stuff exists.
So- besides being a fun way to spend a day this is also a way for me to get myself more up to speed with the OBDA website and related work so that I can eventually help contribute to the website and project documentation efforts.
Comments/corrections/additions/flames regarding this document are welcome. Just send them along to me at dag@sonsorol.org
Assumptions
·This document describes installing a biosql-compliant database on an x86-based server running RedHat 7.2.
· Different hardware architectures or OS’s may require different install procedures. Specifically the server used for this process was a Dell PowerEdge 6450 system with dual Xeon CPUs, 4GB physical memory and the 2.4.18 linux kernel.
· This document assumes the user has root access to the server system for the purposes of installing software and perl modules.
· This document assumes at least basic knowledge of RPM packages and how to install them
·The process described here includes downloading code from the internet via anonymous CVS which some firewalls may not allow. Explaining the use of CVS-over-SSH is not within the scope of this document.
About the Open Bio Database Access (OBDA) effort
OBDA is standard developed by the Open Bioinformatics Foundation to
standardize the way all of the OBF-affiliated projects access sequence data
resources. This includes local sequence resources (indexed flatfiles), remote
databases (web accessible sequence resources), CORBA servers (implementing the BioCORBA/BSANE specification) and
relational databases using the BioSQL
schema.
The website for the OBDA effort will eventually be http://obda.open-bio.org/ . At this time
though the site is only online in template form. The OBF is actively soliciting
people to take over the website and OBDA documentation efforts.
The primary discussion forum for the OBDA effort is the
“open-bio-l” mailing list. The list signup page is located at: http://open-bio.org/mailman/listinfo/open-bio-l
. The list archives are online at http://open-bio.org/pipermail/open-bio-l/.
About BioSQL
The website and primary email discussion list for all
things BioSQL related will be the ones listed above in the OBDA section.
A BioSQL ER diagram in PDF form can be downloaded
directly (PDF) from the OBF anonymous CVS server. A formatted listing of
the tables described in the schema can be viewed
online directly (HTML) from the same server.
About BioPerl-db
The bioperl-db project was started by Ewan Birney with
major work contributed by Elia Stupka and continued support coming from the
bioperl community.Its purpose is a
standalone sequence database with little external dependencies and tight
integration with the bioperl distribution.
Support for more databases and bindings in java and python by Biojava
and Biopython projects are welcomed and encouraged.All questions and comments should be directed to the bioperl list
<bioperl-l@bioperl.org> and
more information can be found about the related projects at http://bio.perl.org and http://www.open-bio.org.
Bioperl-db is a standalone module located within the CVS
code repository operated by the BioPerl Project. Anonymous CVS access and
tarball downloads are available at http://cvs.bioperl.org.
Look in the bioperl repository for the “bioperl-db” module.
Step by step guide
Step 1 Install MySQL the
lazy way (via RPM) and configure
While logged in as root I installed the following RedHat RPMS:
mysqlclient9-3.23.22-6
mysql-3.23.41-1
mysql-server-3.23.41-1
mysql-devel-3.23.41-1
The install was verified by issuing the following command:
# rpm –qa | grep
mysql
mysqlclient9-3.23.22-6
mysql-3.23.41-1
mysql-server-3.23.41-1
mysql-devel-3.23.41-1
MySQL
version warning:
Mark Lambrecht recommends the use of MySQL version 3.23 or later. When trying
to install version 3.22 on an Alpha-based system he received errors relating to
the creation of the “fattribute_to_feature”
table.
By default the RPM MySQL install puts sample configuration files
into /usr/share/mysql/
The base directory that the RPM MySQL install uses is
/var/lib/mysql
The global configuration file is kept at /etc/my.cnf
It is important to edit the configuration file at /etc/my.conf BEFORE
starting the database for the first time particularly if you want (like me) to
have the MySQL data files live somewhere other than /var/lib/mysql.
In my case I wanted to do two main things to the global config
file, (a) change the default location to the much larger /usr/local/ partition
and (b) incorporate some of the config suggestions found in /usr/share/mysql/
specific to large dedicated database servers.
WARNING; BioSQL databases can be very large. It may be wise to double
check the size of your /var partition before continuing on with a totally
default RPM-based install. The 1GB /var partition present on the Dell 6450
server is far too small. As an example:
The swissprot40 database is 307MB in size. When this dataset was loaded
into MySQL the resulting database files and indices consumed roughly 399MB of
disk space.
NOTE: Several BioSQL users have found
that very large sequence records will greatly exceed some of the default
settings that MySQL uses. In particular one may need to set the configuration
value “max_allowed_packet” to a fairly high value (several megabytes at least). If you
encounter any error messages that are similar to: “DBD::mysql::st execute failed: MySQL server
has gone away…” than it is likely that max_allowed_packet
needs to be set to a larger value. In an email post to the bioperl-l (http://bioperl.org/mailman/listinfo/bioperl-l)
mailing list Keith Allen covers his experience with this problem in excellent
detail. The full email message can be read at the following URL: http://bioperl.org/pipermail/bioperl-l/2002-May/007987.html
After revising the value of max_allowed_packet to ‘3M’ due to
feedback from Keith Allen (see above Note) the global configuration file
/etc/my.conf now looks like this:
[mysqld]
port=3306
datadir=/usr/local/mysql-database
socket=/usr/local/mysql-database/mysql.sock
skip-locking
set-variable= key_buffer=256M
set-variable=
max_allowed_packet=3M
set-variable=
table_cache=256
set-variable=
sort_buffer=1M
set-variable=
record_buffer=1M
set-variable=
myisam_sort_buffer_size=64M
set-variable=
thread_cache=8
set-variable=
thread_concurrency=4
server-id= 1
tmpdir= /tmp/
[mysql.server]
user=mysql
basedir=/usr/lib
[safe_mysqld]
err-log=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
[mysqldump]
quick
set-variable=
max_allowed_packet=16M
[mysql]
no-auto-rehash
safe-updates
[isamchk]
set-variable= key_buffer=128M
set-variable= sort_buffer=128M
set-variable= read_buffer=2M
set-variable= write_buffer=2M
[myisamchk]
set-variable= key_buffer=128M
set-variable= sort_buffer=128M
set-variable= read_buffer=2M
set-variable= write_buffer=2M
[mysqlhotcopy]
interactive-timeout
Just in case I missed anything I decided to add a symbolic link
back to /var/lib/mysql – this turns out to be important because my
configuration file still has “basedir=/var/lib” which is a total mistake.
# cd /var/lib
# ln -s
/usr/local/mysql-database/ ./mysql
Now we can start the MySQL datbase…
/etc/rc.d/init.d/mysql
start
Important! Protect your database by setting a root MySQL password:
# mysqladmin -u
root password ‘xxxxxxx’
After installing, configuring, starting and enabling a root
password it should be possible to logoff as root and re-login as a ‘normal’
system user.
To test this I made the switch from root user to my normal ‘dag’
user account and confirmed that I can login as root to the mysql instance:
[dag@pe3
scripts]$ mysql -u root -p
Enter password:
Welcome to the
MySQL monitor.Commands end with ; or
\g.
Your MySQL
connection id is 8 to server version: 3.23.41-log
Type ‘help;’ or
‘\h’ for help. Type ‘\c’ to clear the buffer.
mysql> exit
Bye
Success! We now have a working MySQL instance.
Step 2 Install the perl interfaces to MySQL
In future steps we are going to use perl to help us talk to the
MySQL database. In order to do this we need some additional perl modules
installed. We need:
·
DBI:: – The perl generic database interface
·
DBD::MySQL – The perl MySQL database driver module
Given the lazy RPM install we previously did with the MySQL
database it makes sense to install the preconfigured RPM versions of these
modules that RedHat provides. The specific modules installed were:
·
perl-DBI-1.18-1.i386.rpm
·
perl-DBD-MySQL-1.2216-4.i386.rpm
Installation of these modules went smoothly. This was a welcome
relief as given the still traumatic memories of spending days trying to get
DBD::Oracle to install and work cleanly on Compaq Alphaservers and HP-UX
machines.
Step 3 Install Bundle::BioPerl via CPAN.pm
Note: Skip this step if BioPerl is already on your system.
Some of the perl scripts needed to help load the database with
sequence data we will be using in future steps are themselves dependent on the
BioPerl distribution (http://bioperl.org/)
for critical functions such as sequence parsing etc.
The BioPerl distribution itself has its own dependencies on Perl
modules and other programs that may or may not be found by default on a system.
These external dependencies are not required but if installed they will give
the BioPerl distribution additional functionality and features.
The BioPerl people have created a special CPAN package called
“Bundle::BioPerl” that can assist with downloading and automatically installing
all of the CPAN resident modules that BioPerl likes to use. By itself
Bundle::BioPerl is next to useless but when used with CPAN.pm it becomes very
convenient.
For the purposes of building a BioSQL database we don’t really
care about all of the various external BioPerl dependencies enough to give
special attention to the issue. The proper ‘lazy’ way then is just to fire up
CPAN.pm and let it do all of the work.
People unfamiliar with the CPAN.pm module can read its
documentation by issuing the system command ‘perldoc CPAN’.
To fire up the CPAN module in shell mode issue the following
command as root:
# perl –MCPAN –e shell
If this is the first time that CPAN.pm has been used there will be
an initial setup and configuration phase where CPAN configures itself to the
local network environment. The user is also prompted to select from a list of
available mirror sites.
To install Bundle::BioPerl while in the CPAN shell environment:
cpan> install Bundle::BioPerl
That’s about it. The process may take some time depending on the
speed of the server and available network connection.
NOTE:
When attempting this for the first time the installation of
Bundle::BioPerl completely failed because one of the perl modulesimplementing SOAP functionality listed in
the Bundle is apparently only distributed as a .ZIP archive rather than the
more traditional tar.gz format.The
install process failed because CPAN.pm had not been configured to know where a
suitable ‘unzip’ utility could be found. The solution was to exit from the CPAN
shell, manually install the unzip rpm and then restart the CPAN shell.Within the restarted CPAN shell environment
the following command will make CPAN aware of the newly installed unzip
utility:
cpan> o conf unzip/usr/bin/unzip
cpan> install Bundle::BioPerl
Step 4 Install BioPerl via CPAN.pm
Note: Skip this step if BioPerl is already on your system.
Now that the various helper modules and external dependencies of
BioPerl have been installed it should now be possible to use the same CPAN.pm
shortcut to automatically download and install the full BioPerl distribution.
Asking CPAN to install the ‘Bio::Seq’ module is enough to get it
to download and install the full distribution:
cpan> install Bio::Seq
BioPerl is a very large distribution (possibly the single largest
collection of related perl modules in the entire CPAN repository) – you may
wish to download and manually install the full BioPerl distribution directly
from http://bioperl.org/.
Step 5 Download and install bioperl-db directly from the OBF CVS
repository
Getting the latest and greatest version of the bioperl-db
distribution involves downloading straight from the public source code server
located at http://cvs.open-bio.org/
Instructions for downloading code via anonymous CVS are available
on the cvs.open-bio.org website. Your firewall or local network configuration
may not allow anonymous CVS connections to occur. See below for a workaround:
How to use the web and avoid anonymous CVS altogether:
If a firewall or network gateway device blocks CVS connection
attempts it should be possible to download code and full distributions directly
from the http://cvs.open-bio.org/
website. The CGI gateway to the CVS repository has a nice feature entitled
“Download Tarball”. When clicked the server will generate a fresh code checkout
and will send it to your browser as a .tar.gz archive file which can be saved
locally to disk. The files may all be named “cvs_root.tar.gz” so beware of
potential file overwrite issues when downloading multiple files or codebases.
How to checkout via anonymous:
The anonymous CVS password is ‘cvs’
Full instructions are available at http://cvs.open-bio.org
Example session with output:
# cvs
-d:pserver:cvs@cvs.open-bio.org:/home/repository/bioperl login
CVS password:
#
# cvs
-d:pserver:cvs@cvs.open-bio.org:/home/repository/bioperl co bioperl-db
cvs server:
Updating bioperl-db
U
bioperl-db/BUGS
U
bioperl-db/Changes
U
bioperl-db/LICENSE
U
bioperl-db/MANIFEST
U
bioperl-db/MANIFEST.SKIP
U
bioperl-db/Makefile.PL
U
bioperl-db/README
cvs server:
Updating bioperl-db/Bio
cvs server:
Updating bioperl-db/Bio/DB
U
bioperl-db/Bio/DB/Annotation.pm
<FULL
OUTPUT CUT FOR BREVITY>
Installing bioperl-db
The bioperl-db distribution is in the standard CPAN-style format
so installation is similar to any other external perl module:
Once the distribution is uncompressed and unpacked there are only
three commands necessary:
# perl Makefile.PL
# make
# make install
NOTE: This install will fail if the DBD:: and DBI::Mysql modules
cannot be found.
Step 6 Download and install biosql-schema directly from the OBF
CVS repository
Reference the above discussion on anonymous CVS, firewalls and the
ability to download tar archives directly from the http://cvs.open-bio.org/ website.
The OBDA schema for BioSQL databases lives in its own CVS code
repository because its use spans many of the Open Bio projects and coding
efforts.
The repository is called “biosql”, the module is called
“biosql-schema”. The CVS root for the respository can be found at
“/home/repository/biosql”.
Example anonymous CVS checkout session:
[root@pe3]# cvs
-d :pserver:cvs@cvs.open-bio.org:/home/repository/biosql login
Logging in to
:pserver:cvs@cvs.open-bio.org:2401/home/repository/biosql
CVS password:
[root@pe3
biosql]# cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biosql checkout
biosql-schema
cvs server:
Updating biosql-schema
cvs server:
Updating biosql-schema/doc
U
biosql-schema/doc/README
U
biosql-schema/doc/biosql-ERD.pdf
U
biosql-schema/doc/biosql.html
cvs server:
Updating biosql-schema/scripts
U
biosql-schema/scripts/create_mysql_db.pl
U
biosql-schema/scripts/transform_sql.pl
cvs server:
Updating biosql-schema/sql
U
biosql-schema/sql/README
U
biosql-schema/sql/biopipelinedb-mysql.sql
U
biosql-schema/sql/biosql-accelerators-pg.sql
U
biosql-schema/sql/biosqldb-mysql.sql
U
biosql-schema/sql/biosqldb-pg.sql
U
biosql-schema/sql/biosqldb-views-pg.sql
U
biosql-schema/sql/makefile
cvs server: Updating
biosql-schema/sql/ontology
U
biosql-schema/sql/ontology/biosqldb-ontology-mysql.sql
U
biosql-schema/sql/ontology/biosqldb-ontology-pg.sql
cvs server:
Updating biosql-schema/t
[root@pe3
biosql]#
The files found in the biosql-schema do not need to be installed
or otherwise processed. We are going to use the scripts and documents found in
this module directly.
Step 7 – Putting it all
together: creating a biosql instance within MySQL
Inside the biosql-schema/scripts folder there is a perl script called
“create_mysqldb.pl”. This script is just a helper script that
automates the process of connecting to the datbase and issuingthe SQL statements that cause the proper
database tables to be created. The script reads the actual SQL commands from
the file found in biosql-schema/sql/biosql-mysqldb.sql
After looking at how simple this script is I felt the need to
bypass it completely and just perform the steps manually for curiosity sake.
Here is the output from creating an initial database named ‘biosql’:
[dag@pe3]$
mysql -u root -p
Enter password:
Welcome to the
MySQL monitor.Commands end with ; or
\g.
Your MySQL
connection id is 9 to server version: 3.23.41-log
Type ‘help;’ or
‘\h’ for help. Type ‘\c’ to clear the buffer.
mysql>
create database biosql;
Query OK, 1 row
affected (0.00 sec)
mysql> exit
Bye
Now that the database is created we need to connect to the
database and issue the SQL commands contained in
biosql-scheme/sql/biosqldb-mysql.sql file:
Within the mysql client interface we can use the command
“source” to read in SQL statements from a file.
Here is the output from that process:
[dag@pe3]$
mysql -u root -p biosql
Enter password:
Welcome to the
MySQL monitor.Commands end with ; or
\g.
Your MySQL
connection id is 11 to server version: 3.23.41-log
Type ‘help;’ or
‘\h’ for help. Type ‘\c’ to clear the buffer.
mysql>
source sql/biosqldb-mysql.sql
Query OK, 0
rows affected (0.00 sec)
Records: 0Duplicates: 0Warnings:
<snip – lots of boring output cut
here…>
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 1 row
affected (0.00 sec)
Query OK, 0
rows affected (0.00 sec)
mysql>
Now that we are still connected to datbase ‘biosql’ let us see
what we created by issuing the command “show tables;”
mysql> show
tables;
+—————————-+
|
Tables_in_biosql|
+—————————-+
|
biodatabase|
| bioentry|
|
bioentry_direct_links|
|
bioentry_qualifier_value|
|
bioentry_reference|
| bioentry_taxa|
|
biosequence|
|
cache_corba_support|
| comment|
| dbxref|
|
dbxref_qualifier_value|
|
location_qualifier_value|
|
ontology_term|
|
reference|
|
remote_seqfeature_name|
|
seqfeature|
|
seqfeature_location|
|
seqfeature_qualifier_value |
|
seqfeature_relationship|
|
seqfeature_source|
| taxa|
+—————————-+
21 rows in set
(0.00 sec)
mysql> exit
Bye
Excellent! We have just created a BioSQL database within MySQL !
Step 8 – Loading a test
protein dataset (Swissprot) into the database
Before tackling all of GenBank, a smaller dataset is probably a
good idea as a test case to learn about the process of loading up the new
biosql database with sequence data.
The dataset selected was SwissProt downloaded from ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/sprot40.dat
·
Total uncompressed size: 307MB
·
Apparent number of sequence entries: 101,707
The database loader script used was the”load_seqdatabase.pl” script from the /scripts/ folder
inside the bioperl-db distribution.
The script is pretty basic. There are some internal configuration
settings inside the script that can be manually configured. Each of the
configurable options can also be overridden by command line arguments.
The basic syntax for the load_seqdatabase.pl script is as follows:
./load_seqdatabase.pl <options>
<dataset identifier> /path/to/data/file(s)
Some of the more important options and arguments that can be
hard-coded into the perl script or passed via the command line are:
-h (hostname
of the mysql server)
-sqldb (the name of the biosql database
instance that was created)
-dbuser (username to connect to MySQL
with)
-dbpass (MySQL user password)
-format (the format of the sequence
datafile that is going to be read/loaded)
One of the key options needed is the “dataset identifier”. This
identifier allows one to store many ‘biodatabases” inside the single BioSQL
instance.
Given that we are trying to load release 40 of the Swissprot
dataset a logical identifier could be something like “swiss”, “swissprot”,
“sprot” or “swissprot40” etc.
Here we go…
[dag@pe3
biosql]$ ./load_seqdatabase.pl -host localhost -sqldb biosql -dbuser root
-dbpass XXX -format swiss swissprot40 /n/data/mirrors/swissprot/sprot40.dat
Reading
/n/data/mirrors/swissprot/sprot40.dat
[dag@pe3
biosql]$
Wow! It worked!
Step 9 – Observations
- Loading
Swissprot into the MySQL database took 2 hours and 10 minutes. The slow
speed is caused by the overhead of having to parse each entry into a
BioPerl sequence object prior to inserting it into the database. There is
much room for performance tuning; one approach may be to start an
internet-accessible repository where people can download pre-built
tab-delimited MySQL data files that can be bulk imported into a
biosql-compliant database very quickly.
- The
size of sprot40.dat is 307MB
- After
swissprot was loaded the cumulative size of all the files inside
/var/lib/mysql/biosql/ was 399MB
Step 10 – What
next?
This document is a work in progress. Among the things I’d
like to do next:
- Write
perl scripts to validate the information that was loaded into the database
- Figure
out how to export/dump the database and see how quickly we can recreate
the database with these raw files instead of laboriously using BioPerl to
parse and load objects one at a time. Loading the database is slow and it
may be cool to package up tab-delimited biosql exports so that others can
load their own databases much faster.
- Run
more time tests to see where things can be made faster
- Start
trying to cram all of GenBank into a biosql database!
You must be logged in to post a comment.