EUAsiaGrid BioWorkshop
Documentation and Workshop Materials

(NEW!) 5/5: Most presentations have been uploaded. Please check this page regularly for updates.

Porting Biological Problems to the Grid

Introduction to Grid computing

Dr Ludek Matyska
Professor Ludek Matyska will provide an overview of Grid computing and its biomedical applications.

Keynote Presentation on EUAsiaGrid (Professor Ludek Matyska)

Biomedical Applications of Grid Computing (PPT)
(Professor Ludek Matyska on behalf of Healthgrid)

gLITE Grid Computing Hands-on

Dr Marco Fargetta
Dr Giuseppe La Rocca
Professors Marco Fargetta and Giuseppe La Rocca will cover the basic concepts of Grid computing on the EUAsiaGrid and provide hands-on training for participants in using the grid.
 
Grid-Enabled Applications in Bioinformatics
(NEW!) Grid-Enabled Solutions

Dr Marco Fargetta
Dr Giuseppe La Rocca
Jim Ho
Wang HsiKai
Sample JDL files for submitting bioinformatics applications to Grid
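
As an illustration of what such a file can look like, the following minimal sketch renders a basic JDL job description from Python. The attribute names (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox) are standard JDL attributes; the wrapper script and file names are hypothetical placeholders and are not taken from the workshop's sample files.

```python
def jdl_list(items):
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def make_jdl(executable, arguments, inputs, outputs):
    """Render a minimal JDL job description as text."""
    sandbox_out = ["std.out", "std.err"] + list(outputs)
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        # The executable is shipped to the worker node with its input files.
        f"InputSandbox = {jdl_list([executable] + list(inputs))};",
        f"OutputSandbox = {jdl_list(sandbox_out)};",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical wrapper script and file names, for illustration only.
print(make_jdl("run_protdist.sh", "infile.phy", ["infile.phy"], ["outfile"]))
```

A description like this would normally be saved to a .jdl file and submitted with the gLite command-line tools, for example glite-wms-job-submit.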
 
1. Grid enabling phylogenetic inference

Jim Ho
Wang HsiKai
Hu Yong Li
Dr. Chanditha Hapuarachchi
Lee Kim-Sung
Deng Lu
Phylogenetic analysis of HIV proteomes
Hu Yong Li, Dept of Biochemistry, NUS
Powerpoint presentation of problem scenario (HIV)

Currently, there are nearly 200,000 HIV proteomes in our HIV database. We regularly carry out phylogenetic inference on non-grid standalone servers. Migrating to a grid/cloud platform would help us speed up our analysis. The software we use for phylogenetic analysis is the publicly accessible PHYLIP package, in particular the protdist program, and others such as DNAML when we wish to include analysis at the DNA level. If we can grid-enable PHYLIP, which has already been done on many platforms, we can scale up our analysis.
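
To give a sense of scale, the sketch below counts the pairwise comparisons behind a distance matrix and splits the upper triangle of that matrix into tiles that could each be handled by a separate grid job. This is only an assumption about how the work might be divided: protdist itself computes a full matrix from one input file, so running one tile per job would need a wrapper that is not described in the scenario. The tile size is likewise an illustrative choice.

```python
def pair_count(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def upper_triangle_tiles(n, tile):
    """Yield (row_start, row_end, col_start, col_end) tiles covering i <= j."""
    for r in range(0, n, tile):
        for c in range(r, n, tile):
            yield (r, min(r + tile, n), c, min(c + tile, n))


# 200,000 sequences as in the scenario; a tile size of 10,000 is an assumption.
print(pair_count(200_000))
print(len(list(upper_triangle_tiles(200_000, 10_000))))
```

With these figures there are roughly 2 x 10^10 pairs, and a tile size of 10,000 gives 210 independent tiles.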


Chikungunya and Dengue virus DNA sequences using BEAST
Chanditha Hapuarachchi, Lee Kim-Sung, Deng Lu
Environmental Health Institute, National Environment Agency (NEA), Singapore
Powerpoint of problem scenario (BEAST)

Currently, our group is analysing full and partial genome sequences of viruses (mainly vector-borne viruses such as dengue and chikungunya) implicated in common infectious diseases in Singapore. The objective is to understand the molecular epidemiology of these diseases. The information will be used to understand the pattern of spread and importation of these viruses in the country. We often use phylogenetic analyses to infer the genetic relatedness between different groups of viruses. This involves multiple alignment and subsequent analysis of the alignments using a variety of bioinformatics tools.

Problem: One of the tools currently in use is the BEAST package, a publicly accessible, Java-based programme. The package is used to construct trees, to calculate rates of evolution and to understand the population dynamics and spatial distribution of viruses. The programme involves many analytical steps and consumes a large amount of computing power. For example, the analysis of a 12 kb full-genome dataset with 90 sequences takes us approximately a week to complete with the computing power currently available. Most often, the same analysis has to be repeated several times with different parameter settings, which takes weeks to months to complete. The problem becomes even worse when two or more analyses have to be run at the same time. Access to external computing resources that support similar types of analyses may be a good solution to this problem. We hope that this workshop will give us some clues about whether such programme packages can be grid-enabled, which would make our analysis much faster and broader. I have attached a mock dataset to test during the workshop.

2. Grid-enabled parameter sweep for SVM parameter optimization of caspase cleavage site prediction

Jim Ho
Wang HsiKai
Dr Lawrence Wee

Dr Lawrence Wee, SiGN, Singapore
Powerpoint of problem scenario (SVM)

LibSVM is regularly used for our caspase predictions. A typical run takes a minute on a single machine. However, we need to optimize the SVM prediction, and this is done through a parameter sweep whose details depend on the type of SVM prediction used. If we can accelerate the process and run LibSVM many times to optimize the training of the machine learning model, we can enhance our prediction of caspase protein sequences.
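
A parameter sweep of this kind can be sketched as follows. The sketch assumes an RBF kernel with the exponential grid over C and gamma suggested in the LibSVM practical guide, and 5-fold cross-validation; the scenario does not state which kernel or ranges are actually used, and the training file name is hypothetical. Each grid point becomes one svm-train command, and the commands are dealt into batches that could each run as one grid job.

```python
from itertools import product


def svm_grid(log2c=range(-5, 16, 2), log2g=range(-15, 4, 2)):
    """Return (C, gamma) pairs on an exponential grid."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(log2c, log2g)]


def svm_commands(train_file, folds=5):
    """Build one svm-train cross-validation command per grid point."""
    return [
        f"svm-train -c {c!r} -g {g!r} -v {folds} {train_file}"
        for c, g in svm_grid()
    ]


def split_jobs(commands, n_jobs):
    """Deal commands round-robin into n_jobs batches, one batch per job."""
    return [commands[i::n_jobs] for i in range(n_jobs)]


# "caspase_train.txt" is a hypothetical file name.
batches = split_jobs(svm_commands("caspase_train.txt"), 10)
print(len(batches), len(batches[0]))
```

With this grid there are 110 parameter combinations. At about a minute per run that is close to two hours on one machine, or about 11 runs per job when spread over 10 jobs, ignoring scheduling overhead.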

3. Grid Enabling Genome Search to identify T3SS effectors

Jim Ho
Wang HsiKai
Sun Guang Wen
Dr Sun Guang Wen, Department of Biochemistry, NUS
Powerpoint presentation of problem scenario (T3SS)

Currently there are 1,500 records of experimentally verified or suspected T3 effector proteins. Using getorf from the popular EMBOSS sequence analysis package, it is possible to identify 100,000 open reading frames in a typical bacterial genome such as that of a Burkholderia species, of which only 10% may actually be functional. We would like to use BLAST to identify which of these open reading frames code for effector proteins. We wish to carry out this analysis for all bacterial genomes to identify families of effector proteins, starting with Burkholderia. If we can scale up and carry out multiple sequence alignments of similar protein sequences using CLUSTAL, MUSCLE, PROMALS, T-Coffee, etc., we can classify the groups of putative effectors for further analysis. The best group of effectors can be selected for extracting patterns and motifs, and features extracted for the development of Support Vector Machine prediction (e.g. LibSVM, SVMLight) of novel effectors, which we can verify in the laboratory.
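
One common way to spread a search like this over many nodes is to split the open reading frames into chunks and run BLAST on each chunk as a separate job. The sketch below shows only the splitting step, assuming the getorf output is in FASTA format; the chunk size of 1,000 is an illustrative choice and not part of the scenario.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def chunk(records, size):
    """Split records into consecutive chunks of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


example = [">orf1", "ATGAAA", "TTTTAG", ">orf2", "ATGCCCTGA"]
print(chunk(read_fasta(example), 1))
```

At 1,000 records per chunk, the 100,000 open reading frames of one genome would become 100 independent BLAST jobs.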

 
4. Grid-enabled Ligand-Receptor Docking

Jim Ho
Wang HsiKai
Lam Tze Hau
Dr. Heru Suhartanto
Ligand Docking to MHC Class I molecules
Lam Tze Hau, I2R, A*STAR, Singapore
Powerpoint presentation of problem scenario (Autodock)

A single MHC-peptide ligand-receptor docking using Autodock or ICM takes 10 to 15 minutes to complete, depending on the constraints. We typically need to adjust multiple constraints and evaluate the results. We currently have thousands of dockings that we need to carry out for different MHC molecules and different binding peptides. If we can grid-enable this process, we can achieve a significant speed-up and evaluate more MHC molecules and more peptides.
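
Because each docking is independent of the others, the workload can be treated as a list of (MHC, peptide) pairs, and the elapsed time can be estimated from the number of workers. The sketch below is a rough estimate under stated assumptions: the figures of 1,000 dockings and 100 workers are illustrative, 15 minutes is the upper bound quoted above, and queueing and data transfer overheads are ignored.

```python
from itertools import product
from math import ceil


def docking_jobs(mhc_molecules, peptides):
    """Enumerate every (MHC, peptide) pair as one independent docking job."""
    return list(product(mhc_molecules, peptides))


def wall_clock_hours(n_jobs, minutes_per_job, n_workers):
    """Estimate elapsed hours when jobs run in waves of n_workers."""
    waves = ceil(n_jobs / n_workers)
    return waves * minutes_per_job / 60


# Illustrative figures only: 1,000 dockings at 15 minutes each.
print(wall_clock_hours(1000, 15, 1))    # one machine
print(wall_clock_hours(1000, 15, 100))  # 100 grid workers
```

Under these assumptions the same workload drops from 250 hours on one machine to 2.5 hours on 100 workers.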


Docking with Autodock and Molecular Dynamics analysis with Gromacs: Indonesian Herbal Pharmacological screening in silico
Dr Heru Suhartanto, Universitas Indonesia
Powerpoint presentation of problem scenario (Gromacs)
 
5. Grid-enabled Multiple Sequence Alignment

Jim Ho
Wang HsiKai
Lee Hong Kai
Thomas Tay

Enabling Multiple Sequence Comparison by Log-Expectation (MUSCLE)
Lee Hong Kai and Thomas Tay, NUHS Molecular Diagnostic Center, Singapore
Powerpoint presentation of problem scenario (MUSCLE)

Purpose: We need multiple sequence alignment of norovirus from genogroups I, II and IV for effective primer and probe design, phylogenetic analysis, SNP analysis and sequence variability analysis. Multiple sequence alignment tools like ClustalW are very time-consuming, taking about 9 hours to run for about 1,000 sequences. MUSCLE is much faster but is still limited by memory; we are therefore seeking ways to grid-enable MUSCLE to speed up multiple sequence alignment.

Invited Talk
1. Grids or Clouds?

Simone Brunozzi
(NEW!) Presentation slides

Simone Brunozzi, Amazon Web Services Technology Evangelist for APAC, will demystify common beliefs about Cloud Computing, briefly explain its main features, and show examples of how to use Amazon Web Services to run HPC tasks and MapReduce jobs, or more generally to tap into the vast on-demand computing resources provided by Amazon to solve bioinformatics problems and other computational tasks. The talk will cover both technical and economic aspects of Cloud Computing. At the end, attendees will be encouraged to ask questions.

 
Practical Parallel Computing in R

Xie Chao

TW Tan

The R programming environment is frequently used in bioinformatics. For example, the very popular Bioconductor packages for R are widely used by biological researchers and highly cited. This session will consider how to parallelise applications within the R environment at different scales of granularity. Implementations of the same parallelised application are compared across multicore, cluster, grid and cloud computing platforms to illustrate the issues encountered. (30mins)

Calculating a million biostatistical correlations using an R script on:

  1. Multi-core computer
  2. ad hoc Linux cluster
  3. gLITE
  4. Cloud computing instances
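
The multi-core case can be sketched as below. The session itself uses R; this sketch is written in Python purely for illustration and is not the script used in the session. It computes Pearson correlations for all pairs of variables, splitting the pairs into batches that are handed to a pool of worker processes. The worker count and batch size are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlate_pairs(args):
    """Worker: compute correlations for one batch of (i, j) index pairs."""
    data, pairs = args
    return [(i, j, pearson(data[i], data[j])) for i, j in pairs]


def all_correlations(data, workers=4, batch=1000):
    """Correlate every pair of rows in data using a process pool."""
    pairs = list(combinations(range(len(data)), 2))
    batches = [pairs[k:k + batch] for k in range(0, len(pairs), batch)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(correlate_pairs, [(data, b) for b in batches]))
    return [result for part in parts for result in part]
```

About 1,415 variables are enough to produce a million pairs. On platforms that start worker processes by spawning, all_correlations should be called from inside an `if __name__ == "__main__":` guard. The same batching idea carries over to a cluster, grid or cloud, where each batch becomes a job instead of a task in a local process pool.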
Setting up a Biocloud using BioSlax

Mark De Silva

KS Lim

TW Tan
Bioinformatics users need a standardised operating system that provides a controlled and predictable programming environment, so that they do not have to worry about versions of programming languages and applications. BioSlax is designed as a standard BioLinux platform that is easily adapted to Grid or Cloud environments.
  1. Introducing BioSlax (15mins)
  2. Adding New Modules to BioSlax (5mins)
    1. apt-get install
    2. dir2lzm /usr/local/
    3. cp .lzm /modules
    4. Activating new module in BioSlax
  3. Install Xen Hypervisor on your cloud servers (1hr)
    1. Configuration
    2. Remote mounting of database for use by all instances on /mnt/hda1
    3. Controlling the instantiation of your cloud servers
  4. Distributing your jobs to each cloud instance (1hr)
  5. Collecting your data from each cloud instance (30mins)
  6. Discussion of security issues (10mins)
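
The outline does not say how jobs are distributed to the cloud instances or how results are collected. As one simple possibility, the sketch below deals jobs to instances in round-robin order and then merges the per-instance results. The instance and job names are hypothetical.

```python
def assign_round_robin(jobs, instances):
    """Map each instance name to the jobs it should run, dealt in turn."""
    plan = {name: [] for name in instances}
    for k, job in enumerate(jobs):
        plan[instances[k % len(instances)]].append(job)
    return plan


def collect(results_by_instance):
    """Merge per-instance result lists into one list, ordered by instance."""
    merged = []
    for name in sorted(results_by_instance):
        merged.extend(results_by_instance[name])
    return merged


# Hypothetical instance and job names, for illustration only.
print(assign_round_robin(["job1", "job2", "job3"], ["vm1", "vm2"]))
```

Round-robin assignment works best when jobs take similar amounts of time; for uneven jobs, a shared queue that idle instances pull from is a common alternative.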