Bioinformatics Requirements/Use
_______________________________

Bioinformatics requirements in the context of cloud and grid were
presented to the partners. They are of three types, related to the
data, the management of jobs, and the interfaces between the users and
the resources. Users are scientists and engineers, experts in Biology
and Bioinformatics.

Regarding the data, users should have access from any node of the
cloud to the international databases recording biological resources
such as complete genomes, protein structures and sequences. These
databases are well known to the community, as they are listed in the
annual edition of the scientific journal Nucleic Acids Research. In
2010, there were 1230 of them. One machine needs read-write rights to
keep the biological repository up to date, while the others need
read-only access so that they can query these data sources
intensively. Users should also have access to a shared storage space
where they can put and get the data related to their institutes,
laboratories or projects; such storage services must integrate
high-level security components.

Distributing the computation is also an important requirement, because
Bioinformatics can require very different kinds of jobs depending on
the analysis to perform: for example, multiple alignments of sequences
or genome assembly. On a daily basis, biologists and
bioinformaticians combine multiple pieces of software to analyze their
data, and they mainly access them through Web portals and Web service
interfaces, such as SOAP or RESTful ones. Cloud solutions should
therefore offer high-level interfaces to manage the virtual machines,
and also to access the biological applications deployed on those
virtual machines.

Bioinformatics engineers may not be experts in cloud or grid
technologies. This is an important point that we have identified,
especially in recent years, during the activities of the French
Bioinformatics grid initiative RENABI GRISBI. To help satisfy this
requirement, an integrated solution such as the StratusLab system
should provide the engineers working on a bioinformatics platform with
off-the-shelf solutions to deploy a grid site on their local resources
devoted to Biology, and to update it with new releases. A clear goal
is to have base images of the grid components (UI, SE, CE-WN)
available from the appliance repository; this could be achieved by
using the Claudia system. Bioinformatics administrators should then
only have to supply a few parameters at the contextualization step,
such as host certificates or MA address. Such grid appliances should
be sufficiently validated to go into production. Another issue
specific to bioinformatics platforms relates to their network
organization: public versus private IPs, and existing DHCP services.
Most sites use private networks with NAT and proxies to reach the
Internet, and are already in a steady production state with, for
example, existing DHCP servers and proxies.
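
The contextualization step described above can be illustrated with a
minimal sketch in Python. It only checks that an administrator has
supplied the few expected parameters before a grid appliance is
started. The parameter names (`host_certificate`, `network_mode`), the
allowed network modes and the `check_context` helper are hypothetical;
they are assumptions made for this example and are not actual
StratusLab or Claudia settings.

```python
from typing import Dict, List

# Grid components named in the text; everything else here is hypothetical.
GRID_COMPONENTS = ("UI", "SE", "CE-WN")

# Illustrative parameter names, not actual StratusLab or Claudia keys.
REQUIRED_PARAMETERS = ("host_certificate", "network_mode")
NETWORK_MODES = ("public", "private")


def check_context(component: str, params: Dict[str, str]) -> List[str]:
    """Return a list of problems found in the contextualization parameters."""
    problems = []
    if component not in GRID_COMPONENTS:
        problems.append(f"unknown component: {component}")
    for key in REQUIRED_PARAMETERS:
        # A missing or empty value is reported as a missing parameter.
        if not params.get(key):
            problems.append(f"missing parameter: {key}")
    mode = params.get("network_mode")
    if mode and mode not in NETWORK_MODES:
        problems.append(f"invalid network_mode: {mode}")
    return problems


if __name__ == "__main__":
    # A complete, valid set of parameters yields no problems.
    print(check_context("SE", {"host_certificate": "/path/to/hostcert.pem",
                               "network_mode": "private"}))
    # A missing certificate and an unsupported mode are both reported.
    print(check_context("CE-WN", {"network_mode": "bridged"}))
```

An empty list means the appliance could be started with the supplied
parameters; any other result lists what the administrator still has to
provide.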

Regarding the computations themselves and the worker nodes, there is
not much commercial software involved. The main requirements relate to
software dependencies and to accommodating the very different behavior
of biological applications in terms of CPU and memory. For example,
one bioinformatics program will need 1 CPU and a few MB of memory;
others need many CPUs and little memory, such as 24 cores and 16 GB
for multiple alignment of sequences, or few CPUs and a lot of memory,
such as 2 cores and 96 GB for the assembly step of NGS (Next
Generation Sequencing) applications. Bioinformatics users should be
able to find these different worker node or user interface
configurations at the execution step. Ideally, this should also be
taken into account at the submission and scheduling steps for WNs, but
that will require a strong connection between the cloud service
manager and the grid workload management system.
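
The resource profiles quoted above can be illustrated with a minimal
sketch in Python that selects a worker-node configuration for a job
from its CPU and memory needs. The `WorkerProfile` type, the profile
names and the `pick_profile` helper are hypothetical and are not part
of StratusLab or of any grid middleware; the 512 MB used for the small
profile is an arbitrary stand-in for "a few MB".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WorkerProfile:
    """Hypothetical description of a worker-node configuration."""
    name: str
    cores: int
    memory_mb: int


# Illustrative profiles based on the figures quoted in the text.
# The 512 MB value stands in for "a few MB" and is an assumption.
PROFILES = (
    WorkerProfile("small", cores=1, memory_mb=512),
    WorkerProfile("cpu-intensive", cores=24, memory_mb=16 * 1024),
    WorkerProfile("memory-intensive", cores=2, memory_mb=96 * 1024),
)


def pick_profile(cores: int, memory_mb: int,
                 profiles: Sequence[WorkerProfile] = PROFILES
                 ) -> Optional[WorkerProfile]:
    """Return the least oversized profile that fits the job, or None."""
    candidates = [p for p in profiles
                  if p.cores >= cores and p.memory_mb >= memory_mb]
    if not candidates:
        return None
    # Prefer the profile with the fewest cores, then the least memory.
    return min(candidates, key=lambda p: (p.cores, p.memory_mb))


if __name__ == "__main__":
    print(pick_profile(1, 100))          # small job
    print(pick_profile(16, 8 * 1024))    # multiple alignment of sequences
    print(pick_profile(2, 64 * 1024))    # NGS assembly step
    print(pick_profile(32, 128 * 1024))  # no profile fits
```

Doing this selection at submission time rather than at execution time
is what would require the connection between the cloud service manager
and the grid workload management system mentioned above.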