GCC2015 Talk Abstracts
The deadlines for oral and poster presentations have passed. However, late oral and poster abstracts are still being accepted and will be considered as cancellations occur or space opens up.
- 1 GCC2015 Talk Abstracts
- 2 Oral Presentations
- 2.1 Session 1
- 2.2 Session 2
- 2.2.1 BioJS2Galaxy: Automatic Conversion of BioJS Visualisation Components into Galaxy Plugins
- 2.2.2 Proteomics Visualization in Galaxy
- 2.2.3 Integration and visualization of sequence results across experiments for method development and quality control
- 2.2.4 GSuite Tools – efficiently manage and analyze collections of genomic data
- 2.2.5 Reproducible galaxy: Improved development and administration
- 2.3 Session 3
- 2.4 Session 4
- 2.5 Session 5
- 2.6 Session 6
- 2.6.1 Galaxy Interactive Environments – a new way to interact with your data
- 2.6.2 Opening Galaxy to script execution by everyone
- 2.6.3 Using Galaxy resources from the command line
- 2.6.4 Integrating Galaxy and Tripal: Cyberinfrastructure for the Genome Community Database
- 2.6.5 Simplifying IT for Local Galaxy
- 2.7 Session 7
- 2.7.1 Creating dynamic tools with Galaxy ProTo
- 2.7.2 Beyond Galaxy: portable workflows and tool definitions with the CWL
- 2.7.3 Extending Galaxy’s reach: recent progress towards complete multi-omic data analysis workflows
- 2.7.4 A Genomics Virtual Laboratory in practice
- 2.7.5 IRIDA: A Genomic Epidemiology Platform Built on top of Galaxy
- 2.8 Session 8
- 3 Poster Presentations
- 4 Submit a late abstract
These presentations have been accepted and the authors have confirmed that they will present these topics at GCC2015. They are not currently listed in any particular order.
Modeling molecular heterogeneity between individuals and single cells
The analysis of large-scale expression datasets is frequently compromised by hidden structure between samples. In the context of genetic association studies, this structure can be linked to differences between individuals, which can reflect their genetic makeup (such as population structure) or be traced back to environmental and technical factors.
In this talk, I will discuss statistical methods to reconstruct this structure from the observed data to account for it in genetic analyses.
In the second part of this talk I will extend the introduced class of latent variable models to model biological and technical sources of heterogeneity in single-cell transcriptome datasets. In applications to a T helper cell differentiation study, we show how this model allows for dissecting expression patterns of individual genes and reveals new substructure between cells that is linked to cell differentiation.
I will finish with an outlook on modeling challenges and initial solutions that enable combining multiple omics layers that are profiled in the same set of single cells.
Galaxy as backend for TraIT genotype to phenotype studies
Youri Hoogstrate1, Freek de Bruijn2, Ruslan Forostianov3, Wim van der Linden4
2 VUmc Amsterdam
3 The Hyve NL
The Center for Translational Molecular Medicine (CTMM) Translational Research IT project (TraIT) aims to facilitate an IT infrastructure for translational research, and to enable multi-domain access to clinical, imaging, biobanking and experimental data.
TraIT offers a public Galaxy server (http://galaxy-demo.ctmm-trait.nl/) for general use and a private Galaxy server (http://galaxy.ctmm-trait.nl/) that can be securely used by anyone participating in collaborating biomedical studies. Several Galaxy tools and workflows have been created specifically for CTMM projects, including CGtag, RNA-Seq EdgeR, QDNAseq Copynumber Aberration Tool and iReport.
For the current release of our TraIT platform, -omics results and experimental metadata are integrated in a data warehouse, tranSMART, whilst the analytical workflows are delivered to the end user from TraIT Galaxy. Results from user cohort selection in tranSMART can be analysed using our tranSMART to Galaxy API service, which processes the data on the Galaxy server and returns the resultant output (tables, visuals, etc.) back to tranSMART.
To extend our current TraIT analytical infrastructure to other genotype to phenotype resources we plan to develop, in collaboration with the European Bioinformatics Institute, a generalised European Genome-phenome Archive (EGA) Galaxy connector with functionality similar to the existing Galaxy-European Nucleotide Archive connector. This connectivity with EGA will deliver an “end to end” analytical environment for genotype to phenotype analysis with the TraIT platform (Galaxy & tranSMART).
Enabling large scale Genotype-Tissue Expression studies using Galaxy
Genna Gliner1, Ian McDowell2, Barbara E Engelhardt3
2 Computational Biology and Bioinformatics, Duke University
3 Computer Science Department and Center for Statistics and Machine Learning, Princeton University
The Princeton BEEHIVE Group develops statistical models and methods for high-dimensional genomic data. As part of the Genotype-Tissue Expression (GTEx) consortium, we are involved in processing vast quantities of RNA-sequencing and whole genome sequence data for different types of statistical and functional genomics studies, including cis- and trans-eQTLs, non-coding RNA regulation, and allele specific expression studies. The creation, testing, and deployment of the processing pipelines for each of these different study types require comprehensive analysis of large datasets through a dedicated pipeline used by all members of the group. With the ability to create custom tools and share and modify workflows, Galaxy provides a robust framework to develop this pipeline for use across our lab, but incorporating our diverse set of analysis tools into Galaxy is a non-trivial task.
In this talk we chronicle the evolution of the Princeton BEEHIVE Galaxy Pipeline. We illustrate our vision for a flexible, scalable, and streamlined pipeline using Galaxy for statistical genomics studies. We explore how our pipeline evolved by highlighting how our lab addressed the challenges of tool creation and integration, data processing and organization, and training lab members to use our Galaxy instance.
BioJS2Galaxy: Automatic Conversion of BioJS Visualisation Components into Galaxy Plugins
Sebastian Wilzbach1, Manuel Corpas2
Proteomics Visualization in Galaxy
Thomas McGowan1, James E Johnson1, Ira Cooke2, Pratik D Jagtap3,4, Timothy Griffin3,4
1 Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, Minnesota, United States
2 Life Sciences Computation Centre, La Trobe University, Melbourne, Australia
3 Center for Mass Spectrometry and Proteomics, University of Minnesota, Minneapolis, Minnesota, United States
4 Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota, Minneapolis, Minnesota, United States
The Galaxy-P project incorporated proteomics tools into the Galaxy framework, enabling multi-omics analysis from a single framework. A centralized Galaxy server can provide the computational and storage capacity required to manage the large and diverse datasets and applications required for that analysis, and in addition manage collaborative access.
A researcher may want to investigate results in a proteomics analysis using interactive visualization, for example to verify peptide spectral matches (PSMs). While there are excellent applications for visualizing PSMs, many require downloading large data files to the user’s computer. In addition, many proteomics applications record cross-references to input data using absolute file paths. This limits portability of the resulting output datasets.
The Galaxy Visualization plugin framework provides a lightweight solution for visualization that allows the big data to remain on the server. Visualization of a Galaxy dataset requires a REST-like dataset dataprovider that can respond interactively to client requests.
To achieve interactive access for proteomics data, we added the SQLite datatype to Galaxy along with dataproviders to return results for client queries. We developed Galaxy tools to consolidate proteomics datasets into a SQLite dataset.
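As a rough illustration of the idea (not the Galaxy-P implementation), the sketch below shows how a consolidated SQLite dataset lets a server answer a small, filtered query instead of shipping a whole results file to the client. The table name `psm`, its columns, the example rows, and the `query_psms` helper are all assumptions invented for this example.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with a hypothetical PSM table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE psm (id INTEGER PRIMARY KEY, peptide TEXT, "
        "spectrum TEXT, score REAL)"
    )
    # Made-up rows, used only to exercise the query below.
    conn.executemany(
        "INSERT INTO psm (peptide, spectrum, score) VALUES (?, ?, ?)",
        [
            ("PEPTIDEK", "scan_0001", 0.95),
            ("SAMPLER", "scan_0002", 0.40),
            ("ANOTHERK", "scan_0003", 0.88),
        ],
    )
    return conn


def query_psms(conn, min_score, limit=100):
    """Return PSM rows at or above min_score, best first, as plain dicts."""
    conn.row_factory = sqlite3.Row
    cursor = conn.execute(
        "SELECT peptide, spectrum, score FROM psm "
        "WHERE score >= ? ORDER BY score DESC LIMIT ?",
        (min_score, limit),
    )
    return [dict(row) for row in cursor]


conn = build_example_db()
rows = query_psms(conn, min_score=0.8)
```

Because only the rows matching the filter leave the database, a client such as a tabular view can page and filter interactively while the full dataset stays on the server.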
Our ProtViz Galaxy visualization plugin offers tabular views of the proteomics data from which to inspect and filter results, and visualization components, such as the lorikeet viewer, to analyze individual PSMs. We demonstrate the use of these novel visualization tools in interpreting and filtering results from MS-based proteomics data.
Integration and visualization of sequence results across experiments for method development and quality control
Bradley W. Langhorst1, Erbay Yigit1, Eileen T. Dimalanta1, Theodore B. Davis1
Manual synthesis of results across experiments is error-prone and laborious. We have constructed a Galaxy tool, database, and visualization solution, SeqResults, to capture results and metadata and use them to understand how changes to library preparation affect sequence data. This extensible system includes modules to extract information from BAM files, FASTQ files, coverage BED files, GC bias, and other summary metrics. Results and metadata are acquired in Galaxy and sent to a custom relational database where they can be edited and deleted using a simple web front end. Finally, we have constructed dynamic visualization tools to allow users to select data by date, flowcell, sample name, run type, read group, etc., and compare sequence quality metrics, artifacts, coverage depths, etc. The SeqResults system has captured millions of data points generated from more than 1700 sequence experiments so far and continues to grow.
GSuite Tools – efficiently manage and analyze collections of genomic data
1 University of Oslo
2 University of Silesia
Advances in sequencing technologies provide an abundance of genomic data, often in the form of genomic tracks. There is already a multitude of tools that can handle and analyze single tracks, but not many that allow one to efficiently manage and meaningfully analyze large collections of tracks, even though it is the natural next step in genome analysis. We here propose a simple, extensible tabular format called GSuite (Genomic Suite) for representing collections of datasets, along with a set of tools that allow efficient retrieval of collections of datasets from public repositories like ENCODE, convenient manipulation of each dataset in a collection, as well as novel analyses involving the full collections. The toolkit is openly available at http://hyperbrowser.uio.no/gsuite as an extension to the existing Genomic HyperBrowser, which is powered by Galaxy. Dataset lists in Galaxy provide a similar concept to GSuite, allowing analysis on each track (or track pair) in a list and downloading of all tracks in a list, but have limited features and require manual list compilation in their present form. GSuite Tools provides various forms of automated compilation of GSuite files, provides a simple means to include metadata with each dataset, and provides greater ease of manipulation on both collections and individual datasets. To explore features and potential use cases freely, we have developed GSuite independently of the existing dataset lists functionality in Galaxy, but will work towards a tighter integration or even a merger of the two.
Reproducible galaxy: Improved development and administration
Aarif Mohamed Nazeer Batcha1, Sebastian Schaaf, Guokun Zhang, Sandra Fischer, Ashok Varadharajan, Ulrich Mansmann
Ever struggled to turn the need for reproducibility into reality? The Munich NGS-Fablab wasn’t an exception. Our task was to create an NGS infrastructure dedicated to clinicians and research scholars for performing experimental diagnostic procedures and basic biomedical research. Reproducing results and their working environment is as important in medicine as in other fields. Understanding the issues of reproducibility, and of intra- and inter-compatibility among instances within and outside our institute, we came up with our own shell setup scripts, introduced at GCC2014. Later, we converted our scripts into more elegant Ansible playbooks.
Ansible is an easy-to-script, open-source software platform for the configuration and management of computers; it manages nodes over SSH. From our point of view, Ansible should be regarded as being as revolutionary for administration and development as Galaxy turned out to be for scientific users in a bioinformatics-flavored setting: it relieves users of technical ‘house-keeping’ work and thus enables more sophisticated work. Ansible playbooks have also been used at Galaxy Main. Our Ansible playbooks can set up an orderly and clean working environment within a few minutes, given an ini file and a blank UNIX system. Although developed on SLES, the scripts were modularized so that they can be further developed to work on other Linux systems. We would like to present a short review of the experience we gained and the path towards Ansible scripting, and finally provide answers to the question “what DevOps take home from their daily work?”
Galaxy Community Update (State of the Galaxy)
Anton Nekrutenko1, James Taylor2
2 Johns Hopkins University
A review of what’s happened and what’s coming in the Galaxy Project.
Galaxy and the RNA Bioinformatics Center
Cameron Smith1, Torsten Houwaart1, Björn Grüning1
The recently launched German Network for Bioinformatics Infrastructure aims to provide comprehensive bioinformatics services to users in life sciences research, industry and medicine. Within this network, the RNA Bioinformatics Center (RBC) is responsible for supporting RNA related research in Germany, such as the detection of noncoding RNAs and RNA structure prediction. In this talk we will present the RBC and the RNA workbench in more detail.
The RNA workbench is a ready-to-run, Docker-based Galaxy instance, bundled with a variety of RNA analysis tools, sample data and teaching material. This image has already proven useful as a platform for training users in Galaxy-driven bioinformatics analysis.
Support of RNA research also includes enabling seamless access to diverse data sources. As a first step towards this goal, the RBC has extended the Galaxy documentation by providing a base example for including external databases as Galaxy accessible sources. Our experience with data sources and the communication with different database administrators will be outlined.
The RNA workbench provides a well-documented interface for creating new Galaxy flavours, allowing users to easily include their chosen toolset, define desired indices and provide custom data. We would like to raise awareness of the importance of RNA-related research and to kickstart an RNA-focused Galaxy community.
Data-Driven Science: Advanced Storage Systems for Genomics Analysis
A brief perspective of computational solutions for genomics analysis with an eye towards how the generation and manipulation of genomics data has both enabled and constrained the science. An overview of a few SGI customers and their workflows in the genomics research space is presented. With ever-expanding genomics workflows in mind, we will introduce the SGI UV system with NVMe storage as a tool capable of addressing both present and especially future workflows, enabling the science in ways not possible with other architectures.
Galaxy Tool Shed: Tool Discovery and Repository Management
Martin Čech1, Galaxy Team2
1 Department of Biochemistry and Molecular Biology, PSU, USA, http://galaxyproject.org/
Galaxy uses the Tool Shed (TS) as an App Store-like platform for tool exploration and deployment with support for sharing reproducible workflows. This talk will review the current state of the TS, and recent and upcoming work.
Today the TS contains over 3000 tools in many areas of computational research, and a vibrant community is updating and improving these tools, led by the efforts of the Intergalactic Utilities Commission (IUC).
The Fall 2014 questionnaire identified tool discovery and repository management as priority areas for the TS. We have rewritten search from scratch to allow deployers to easily identify high quality repositories. A review by the IUC, significant traffic, good ratings, and the number of downloads are all indications of high quality repositories, and can also be used to increase these repositories’ visibility. Moreover, it is now possible to search for individual tools directly, rather than just repositories (which may contain multiple tools).
Groups have also been introduced in the TS ecosystem. This feature aims at unified presentation of labs and development teams and their consolidated work. To streamline the process of tool development, authors will soon be able to work on new additions to their repositories in private mode, affording more control over what is visible to users and what is still work in progress.
ReGaTE, Registration of Galaxy Tools in Elixir
Olivia Doppelt-Azeroual1, Fabien Mareuil1, Eric Deveaud1, Matus Kalas2, Hervé Menager1
1 Center of Bioinformatics, Biostatistics and Integrative Biology, Institut Pasteur, Paris, France
2 Computational Biology Unit, University of Bergen, Norway
ReGaTE is a software component enabling the automated publication of Galaxy tools and workflows into the ELIXIR Tools and Data Services Registry (https://elixir-registry.cbs.dtu.dk/#/). This registry is a web portal for the exploration of bioinformatics resources, such as software packages, web services, websites, or reference databases. Through a dedicated interface, its users can search and locate relevant tools and data resources, and bioinformatics resource providers can enhance the visibility of their services. The registration of resources in the registry can be performed either manually, by filling a form on a web user interface, and providing the required description elements, or automatically by using the registry API.
ReGaTE uses the BioBlend API and the Registry API to completely automate the registration of the tools installed on any given Galaxy portal.
Central to the development of this tool is the mapping of the Galaxy datatype system to the EDAM Ontology. EDAM provides a controlled vocabulary for the description of scientific topics, software operations, types of data and data formats and it is used to describe the contents of the ELIXIR Registry. This mapping enables the automation of the registration of Galaxy tools by describing the format of their input and output data in the controlled vocabulary of the registry.
This mapping is being developed in collaboration with members of the Galaxy team, the EDAM ontology and the Common Workflow Language project.
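To make the role of such a mapping concrete, here is a minimal sketch and not ReGaTE's actual code. It translates Galaxy datatype extensions into EDAM format identifiers and collects them into a registry-style description. The `describe_tool` helper, the shape of the returned dictionary, and the tool name are assumptions made for illustration. The four identifiers are a small illustrative subset that should be checked against the current EDAM release.

```python
# Illustrative subset only; verify identifiers against the current EDAM release.
GALAXY_TO_EDAM_FORMAT = {
    "fasta": "format_1929",
    "fastq": "format_1930",
    "bam": "format_2572",
    "bed": "format_3003",
}


def describe_tool(name, input_exts, output_exts):
    """Describe a tool's input and output formats using EDAM identifiers.

    Extensions without a known mapping are listed under "unmapped"
    instead of being guessed. The dictionary layout is hypothetical and
    is not the registry's real schema.
    """
    def translate(exts):
        return [GALAXY_TO_EDAM_FORMAT[e] for e in exts if e in GALAXY_TO_EDAM_FORMAT]

    unmapped = sorted(
        {e for e in list(input_exts) + list(output_exts) if e not in GALAXY_TO_EDAM_FORMAT}
    )
    return {
        "name": name,
        "inputs": translate(input_exts),
        "outputs": translate(output_exts),
        "unmapped": unmapped,
    }


entry = describe_tool("example_aligner", ["fastq", "fasta"], ["bam", "customfmt"])
```

Reporting unmapped extensions separately keeps the automated registration honest: a datatype with no agreed EDAM term is flagged for curation instead of being silently mislabelled.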
ReGaTE is available at http://github.com/bioinfo-center-pasteur-fr/ReGaTE.
François Moreews1, Olivier Sallou2, Yvan le Bras2, Marie Grosjean3, Cyril Monjeaud2, Thomas Darde4, Olivier Collin2, Christophe Blanchet3
2 Genouest Bioinformatics facility – INRIA/IRISA – Rennes, France
3 French Institute of Bioinformatics – CNRS IFB-Core UMS3601 – Gif-sur-Yvette, France
4 INSERM U625 – Rennes France
Nowadays, Docker containers are used to ease application deployment, from command-line tools to cluster management1. This technology has a strong impact in bioinformatics, where specialized software can often require multiple dependencies. It is a long-term preservation solution for legacy and unmaintained tools, and it enables better process isolation in a multi-user environment. Docker is already used with Galaxy as a way to quickly integrate new tools. We have set up a functional prototype of a web registry of Docker images, BioShaDock,2 dedicated to bioinformatics tools and utilities. We created a set of tool descriptors based on Docker images available in our toolshed3. Even if a general purpose registry can be used to hold shared Docker containers, we think that a domain-centric registry, e.g. for the French life science community through a registry linked to the cloud of the French Institute of Bioinformatics (IFB8), would have a significant impact on bioinformatician productivity and help to spread best practices. With a clear open source and domain orientation, it could federate container providers4,5 more easily. It would also be able to include validation and curation to eliminate redundant tools, organize versioning and standardize documentation. Future work will address advanced searching capabilities and possible referencing within the ELIXIR Tools and Data Services Registry6 and in the IFB one (as the ELIXIR French node). We also want to contribute to standardizing containers7 and to evaluate whether benchmarks5 could be produced from a metadata-enriched Docker registry.
2 BioShaDock, a Bioinformatics Shared Docker registry : http://docker-ui.genouest.org
3 GUGGO Galaxy Toolshed : http://toolshed.genouest.org
4 Hexabio Docker repository : http://biodocker.github.io
5 Nucleotid.es, continuous, objective and reproducible evaluation of genome assemblers using docker containers : http://nucleotid.es
6 ELIXIR Tools and Data Services Registry : https://elixir-registry.cbs.dtu.dk
7 Bioboxes, a standard for creating interchangeable bioinformatics software containers : http://bioboxes.org
8 IFB academic Cloud : http://www.france-bioinformatique.fr/?q=en/core/e-infrastructure-team/ifb-cloud
A Galaxy metagenomic workflow for reference-tree based phylogenetic placement (MG-RTPP)
Ambrose Andongabo1*, Ian M. Clark1*, Dariush Rowlands1, Keywan Hassani-Pak1, Penny R. Hirsch1, Elisa Loza1, Andy Neal1*
Background: High-throughput sequencing of environmental nucleic acids is revolutionizing and dramatically expanding our understanding of the diversity and functionality of complex microbial communities. There are a number of tools which allow community structure to be surveyed using metagenomics or meta-transcriptomics at the rRNA level, or by using COG- or KEGG-based functional assignments. However, there are limited complementary approaches to investigate the phylogenetic diversity of functionally important individual genes in large sequence databases.
Results: We have designed a workflow for reference-tree based phylogenetic placement (MG-RTPP) of metagenomics and meta-transcriptomics samples. The inputs to the workflow are unassembled reads, a multiple sequence alignment (MSA) of the genes of interest and large public sequence databases. Reference nucleotide profile hidden Markov models (pHMMs) are built from the MSA and are used as queries. Homologous reads are checked for accuracy before being placed on a reference phylogenetic tree, maximising phylogenetic likelihood. The workflow retains considerable flexibility, allowing for tuning of redundancy in the nucleotide pHMMs used as queries to recover as many true hits as possible.
Conclusions: MG-RTPP facilitates fast interrogation of sequence databases in a flexible and robust fashion. It avoids misidentification of false positives while pHMM tuning allows for maximum recovery of sequences. Phylogenetic placement provides unique visualization approaches which reveal the phylogenetic relationships between environment-derived sequences and sequenced organisms and between samples. The approach complements tools such as QIIME, MG-RAST and MEGAN in allowing interrogation of individual gene abundance and diversity in samples. Keywords: metagenome, metatranscriptome, assembly-free, community analysis, functional genes, phylogeny.
Less Click, More Quick: Unattended Installation of Galaxy’s Built-in Reference Data
Daniel Blankenberg1,2, The Galaxy Team2
1 Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802
Once a Galaxy administrator has completed the configuration of their Galaxy instance as a production server, they will have a well-tuned machine that is capable of many things, but will actually do very little. In order to allow users to perform useful analyses, the administrator will need to install the desired set of tools. While this formidable task can be accomplished using the Galaxy ToolShed, it only solves part of the problem. These tools lack the reference datasets needed to make them really useful. Data Managers allow an administrator to install built-in datasets through a web-based interface. Traditionally, an administrator performs each part manually in a step-wise fashion: obtaining genomes directly from their source repositories and then building indexes on the retrieved genomes. Although this solves many technical hurdles, it is time consuming and repetitive. This no longer needs to be the case.
Here, we demonstrate new enhancements to the Data Manager framework that greatly ease the burden of configuring large amounts of reference data. By harnessing the Galaxy Project’s rsync server, we allow Galaxy administrators to quickly and effortlessly fetch and configure the pre-computed reference data utilized at UseGalaxy.org. Administrators can filter by dbkey or Data Table name, or simply grab it all. Additionally, we provide a set of utilities to streamline the entire process. Using these utilities, an administrator can perform all the steps needed for populating pre-built data, from defining new dbkeys to building any number of mapping indexes or other reference data, with a single command.
Galaxy flavours – shipped by a whale
Björn Grüning1, Eric Rasche2, John Chilton3, Dannon Baker4
1 Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany
2 Center for Phage Technology, Texas A&M University, USA
3 Department of Biochemistry and Molecular Biology, PSU, USA
4 Department of Biology, Johns Hopkins University, USA
For years Galaxy has made advanced bioinformatics software accessible to biologists directly by providing an intuitive web interface to these applications while fostering reproducibility through the automatic creation of re-runnable protocols of each analysis. With the Tool Shed, Galaxy gained a flexible deployment platform enabling identical software installations across Galaxies.
A major hurdle in using Galaxy today is simply finding an instance with the correct set of tools and with enough computational power and storage necessary for a particular analysis. An interesting solution to this challenge (and one required by certain data usage policies) is to move Galaxy and tools to the data instead of the more traditional approach of uploading the data to a remote server running Galaxy.
Galaxy is using Docker to solve this problem in a way that achieves an even greater level of reproducibility by delivering the entire software stack in a container. Each new release of Galaxy is now available as a production-ready Docker container. Additionally, this Docker image can be extended to build personalized Galaxy flavours, with site-specific sets of tools. Examples include a Galaxy Docker flavour containing all necessary tools for RNA-seq analysis, or a genome annotation flavour with the NCBI BLAST suite. These flavours are simple to create and can be easily deployed on Linux, OS X and Windows.
For more traditional Galaxy deployments, tools may now be configured to run securely in Docker containers. Running Galaxy jobs in this fashion provides a much higher degree of security than running them as native processes, allowing both process and data isolation.
Planemo – A Galaxy Tool SDK
John Chilton1, Björn Grüning2, Eric Rasche3, Kyle Ellrott4, Galaxy Team
1 Department of Biochemistry and Molecular Biology, PSU, USA
2 Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany
3 Center for Phage Technology, Texas A&M University, USA
4 Center for Biomolecular Science & Engineering, University of California Santa Cruz, USA
This talk will summarize a year’s worth of focus on making tools more expressive as well as easier to develop and test. These efforts will be highlighted in part through the prism of the new command-line toolkit Planemo.
Last fall the Galaxy team solicited community feedback via questionnaire and identified testing as the largest hurdle in the tool development process. So Planemo was built to vastly simplify tool testing and Galaxy now features many new testing facilities allowing more expressive tests. Additionally, Planemo features “linting” functionality to catch tool and Tool Shed artifact problems before even running tests.
This is but one process improvement of many made over the last year: the Galaxy Intergalactic Utilities Commission and core development team have moved all tool development to GitHub and written a best practice guide for tool development. This talk will discuss the benefits of GitHub as well as Jenkins build scripts leveraging Planemo and Tool Shed API enhancements developed to automate testing and publishing tool repositories en masse to the Tool Shed.
In addition to making current developers more productive, Planemo lowers the barriers to entry. Planemo has great documentation and utilities to bootstrap new best-practice tools quickly using example commands as templates. Developed in part for entrants to the DREAM SMC-Het challenge (which will introduce many new developers to the ecosystem by including a reproducibility focused sub-challenge requiring submission of Galaxy tools and workflows) Planemo virtual appliances package complete development environments for Galaxy tools.
Galaxy Interactive Environments – a new way to interact with your data
Eric Rasche1, Björn Grüning2, John Chilton3, Dannon Baker4
1 Texas A&M University, College Station, Texas, United States
2 University of Freiburg, Germany
3 Penn State University, Pennsylvania, United States
4 Johns Hopkins University, Maryland, United States
Slides, demo (video)
A common complaint leveled at Galaxy by bioinformaticians is that it lacks the flexibility and interactivity of the Unix shell and scripting languages. Heavy use of the command line often results in homemade scripts and a non-portable and non-transparent analysis, which is hard for biologists to understand, and hard for bioinformaticians to reproduce.
Here, we present a new concept in Galaxy called Interactive Environments (IEs). IEs are perfectly suited for bioinformaticians and can offer the missing flexibility of a modern scripting language – even shell access if desired – enabling rapid, iterative, and interactive bioinformatics analysis and software prototyping directly in Galaxy, next to your big data. IEs reduce the barriers that bioinformaticians often encounter while using Galaxy.
We will present one IE in detail that integrates the popular IPython environment in a secure manner in Galaxy. IPython is a platform providing a web-based interactive computing and visualization environment. Galaxy IPython allows Galaxy users to run IPython inside Galaxy and access it via their web browser. Additionally, it extends the default IPython environment by providing easy, secure access to Galaxy, its API, and the user’s data. As Galaxy IPython is deployed on the Galaxy server, it removes the overhead of big-data downloads and uploads during analysis. Galaxy has long been a great platform for bioinformatics education, but Galaxy IPython makes it a great platform for teaching bioinformatics programming as well.
Opening Galaxy to script execution by everyone
Marius van den Beek1, Christophe Antoniewski1
Currently, Galaxy users are limited to tools that are already wrapped and installed in a Galaxy instance. While important in ensuring accessibility and reproducibility, tool wrapping remains a hurdle for users and developers unfamiliar with Galaxy’s tool wrapping process. In addition, complex workflows that involve loops and/or conditions have not been implemented in Galaxy to date.
To circumvent these limitations, we extended Ross Lazarus’ Galaxy Tool Factory into the Docker toolfactory. This tool sends script execution into an isolated Docker container that only has access to the script, the input and the output data. We will demonstrate that the Docker toolfactory opens up the possibility for bioinformaticians to run and store their scripts within a history entry, side by side with their input and output data. Other applications include execution of complex workflows through API scripts that are not possible solely by using Galaxy’s UI, and the possibility to run and store very specific scripts that were required to generate figures in a publication.
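The isolation described here can be pictured with a small sketch. It is not the Docker toolfactory's own code, only one plausible way to assemble a `docker run` invocation that mounts nothing but the script, the input and an output directory, and that disables networking. The helper name `build_docker_command`, the container-side paths under `/work`, the example image and the host paths are all assumptions. The command is only constructed, never executed.

```python
import shlex


def build_docker_command(image, script, input_path, output_dir, interpreter="python"):
    """Assemble a `docker run` argument list exposing only three host paths."""
    return [
        "docker", "run", "--rm",                  # remove the container afterwards
        "--network", "none",                      # no network access inside the job
        "-v", f"{script}:/work/script:ro",        # script, read-only
        "-v", f"{input_path}:/work/input:ro",     # input data, read-only
        "-v", f"{output_dir}:/work/output",       # only writable location
        image,
        interpreter, "/work/script", "/work/input", "/work/output",
    ]


# Hypothetical job layout, for illustration only.
cmd = build_docker_command(
    "python:3", "/tmp/job/script.py", "/tmp/job/input.tsv", "/tmp/job/out"
)
printable = shlex.join(cmd)
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell interpolation of user-supplied paths, which matters when the script itself comes from the user.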
In combination with interactive environments such as IPython and RStudio, our tool makes Galaxy more attractive as a development platform for any bioinformatician or data analyst. It also lowers the barrier to learning to write scripts, as one can still use all the features of Galaxy, such as pre-installed tools, workflows, libraries, and cluster and job management, while focusing on the script. The Docker toolfactory is available in the Test Tool Shed and at https://bitbucket.org/mvdbeek/dockertoolfactory.
Using Galaxy resources from the command line
Clare Sloggett1, Nuwan Goonasekera1, David Powell2, Simon Gladman1, Enis Afgan3, Andrew Lonie1
2 Monash University, Australia
3 Johns Hopkins University, USA
As a part of the Genomics Virtual Laboratory project1, we have built CloudMan-enabled, scalable machine images providing bioinformatics researchers with Galaxy, RStudio, IPython Notebook, and the Linux command line in one server. This allows users to work in different environments, and to move between platforms as appropriate – for instance, seamlessly carrying out parts of an analysis in Galaxy and other parts in RStudio.
To handle the technical challenge of maintaining bioinformatics resources for multiple platforms, we have exploited Galaxy and the Galaxy Toolshed. The Toolshed2 has developed into a comprehensive management interface for bioinformatics tools, with the ability to install the underlying tool dependencies. More recently, Data Managers have been added to the Toolshed, allowing management of reference data and genome indices through Galaxy.
We have implemented a set of scripts which, in part:
- create environment modules3 for bioinformatics tools that have been installed through the Toolshed. This approach allows access to multiple versions of a tool,
- give the ability to mount Galaxy Datasets as appropriately-named files via FUSE, for direct read-only access from the command line, RStudio, or IPython Notebook,
- provide convenient symlinks to Galaxy reference genomes and indices.
In addition, the BioBlend library4 is installed into all GVL machine images, providing programmatic access to the Galaxy workflow engine.
These scripts are run as part of the setup of a GVL instance. They are implemented as an Ansible playbook, allowing them to be easily adapted to other Galaxy servers.
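The first of the scripts listed above, generating environment modules for Toolshed-installed tools, can be illustrated with a short sketch. This is not the GVL playbook itself: the function names, directory layout, and example paths are assumptions made for illustration. Only the modulefile commands (`#%Module1.0`, `module-whatis`, `prepend-path`) are standard Environment Modules syntax.

```python
import tempfile
from pathlib import Path


def render_modulefile(tool, version, install_dir):
    """Return the text of a Tcl modulefile that puts a tool's bin/ on PATH."""
    return "\n".join([
        "#%Module1.0",
        f'module-whatis "{tool} {version} (installed via the Galaxy Toolshed)"',
        f"prepend-path PATH {install_dir}/bin",
        "",
    ])


def write_modulefile(modules_root, tool, version, install_dir):
    """Write <modules_root>/<tool>/<version> so several versions can coexist."""
    target = Path(modules_root) / tool / version
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_modulefile(tool, version, install_dir))
    return target


if __name__ == "__main__":
    # Hypothetical tool name, version, and dependency directory.
    with tempfile.TemporaryDirectory() as root:
        path = write_modulefile(root, "samtools", "0.1.19", "/deps/samtools/0.1.19")
        print(path.read_text())
```

With the modules root on `MODULEPATH`, a user would then select a version with, for example, `module load samtools/0.1.19`.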
2 Blankenberg et al. (2014) Dissemination of scientific software with Galaxy ToolShed. Genome Biology 15: 403
3 Environment modules website: http://modules.sourceforge.net/ ; Furlani, J.L. : Modules: Providing a Flexible User Environment, Proceedings of the Fifth Large Installation Systems Administration Conference (LISA V), pp. 141-152, San Diego, CA, September 30 – October 3, 1991
4 Sloggett, C., Goonasekera, N., and Afgan, E. (2013) BioBlend: automating pipeline analyses within Galaxy and CloudMan. Bioinformatics 29: 1685-1686
Integrating Galaxy and Tripal: Cyberinfrastructure for the Genome Community Database
1 University of Connecticut Department of Ecology and Evolutionary Biology, Storrs, CT 06269, USA
2 Washington State University Department of Horticulture, Pullman, WA 99164, USA
3 Clemson University Department of Electrical & Computer Engineering, Clemson, SC, 29634, USA
4 Clemson University, Clemson Computing and Information Technology, Anderson, SC 29625 USA
5 University of Tennessee Institute of Agriculture Department of Entomology and Plant Pathology, Knoxville, TN 37996, USA
6 Clemson University Department of Genetics & Biochemistry, Clemson, SC, 29634, USA
Model or clade organism databases (i.e. community research databases) enable both basic and applied research by offering curated data, visualization, and analytical tools. Tripal is a widely adopted, open-source toolkit for construction of online genomic and genetic databases. Tripal combines the power of Chado, an open-source database schema, and Drupal, an open-source content management system, to facilitate construction of genomic and genetic websites while allowing complete customization. Advances in sequencing technology create new opportunities and challenges in genomics research for all organisms. Access to, sharing of, and analysis of these large data sets are hindered by transfer speeds, incompatible file formats, and insufficient metadata. The NSF DIBBs-funded Tripal Gateway project (ACI-1443040) aims to address these issues through development of three new Tripal modules that 1) improve data transfer by exploring software defined networking technologies (Tripal SDN module); 2) provide a RESTful web service framework with the goal of cross-database querying (Tripal Exchange module); and 3) integrate with Galaxy to seamlessly provide commonly used analytical workflows to site patrons (Tripal Galaxy module). Development of the Tripal Galaxy module will include the creation of PHP bindings for the Galaxy API (usable outside of Tripal), integration of Galaxy workflows into Tripal, and coordination of data transfer for use in workflows to computational facilities and back to the community database. The Tripal Gateway project will be implemented for the legume, grains, cotton and tree crop communities but will be available for use by any Tripal site.
Simplifying IT for Local Galaxy
Creating dynamic tools with Galaxy ProTo
Morten Johansen1, Sveinung Gundersen2, Abdulrahman Azab2, Eivind Hovig1, Geir Kjetil Sandve1
2 University of Oslo
Creating a Galaxy tool is not straightforward and has limitations. One has to write an XML file defining the inputs and outputs of a tool. This is practical when one has a predefined number of input fields with static options, but becomes complex when the options can change dynamically, and even impossible if the number of input fields can change (e.g. depending on what the user selected in a previous selection box).
The Galaxy Prototyping Tool API (Galaxy ProTo) is a new tool building methodology, introduced by the Genomic HyperBrowser project. Galaxy ProTo is an unofficial alternative for defining Galaxy tools. Instead of XML files, Galaxy ProTo supports defining the user interface of a tool as a Python class. Each input box is defined in a method, which allows its contents to be computed dynamically. For instance, one could read the beginning of an input file and provide dynamic options based on the file contents.
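The idea of defining each input box in a method can be sketched as follows. This is an illustration only and does not use the real Galaxy ProTo API: the class, the `options_for_<box>` naming convention, and the `build_form` helper are all hypothetical. The second input box reads the first line of the chosen file and offers its tab-separated fields as options.

```python
import io


class ColumnSelectTool:
    """Hypothetical tool: the 'column' box depends on the chosen 'table' file."""

    input_boxes = ["table", "column"]  # order in which the boxes are shown

    @staticmethod
    def options_for_table(choices):
        # None stands for "a file chosen by the user" in this sketch.
        return None

    @staticmethod
    def options_for_column(choices):
        # Read only the first line of the chosen file and offer its fields.
        table = choices.get("table")
        if table is None:
            return []
        table.seek(0)
        header = table.readline().rstrip("\n")
        return header.split("\t") if header else []


def build_form(tool, choices):
    """Ask each input box, in order, for its options given the choices so far."""
    return {box: getattr(tool, f"options_for_{box}")(choices) for box in tool.input_boxes}


if __name__ == "__main__":
    chosen = {"table": io.StringIO("chrom\tstart\tend\nchr1\t10\t20\n")}
    print(build_form(ColumnSelectTool, chosen))
```

Because the options are computed by ordinary Python code each time the form is built, they can depend on earlier selections in ways a static XML description cannot express.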
Beyond Galaxy: portable workflows and tool definitions with the CWL
Peter Amstutz1, Nebojša Tijanić2, Stian Soiland-Reyes3, John Kern4, Luka Stojanovic2, Tim Pierce1, John Chilton5, Maxim Mikheev6, Samuel Lampa7, Hervé Ménager8, Scott Frazer9, Venkat S. Malladi10, Michael R. Crusoe11
1 Curoverse Inc.
2 Seven Bridges Genomics, Inc.
3 University of Manchester, School of Computer Science
4 AccuraGen Inc.
5 Penn State University, The Galaxy Project
6 BioDatomics LLC.
7 Uppsala University, Department of Pharmaceutical Biosciences; BILS (Bioinformatics Infrastructure for Life Sciences)
8 Center of Bioinformatics, Biostatistics and Integrative Biology, Institut Pasteur, Paris, France
9 The Broad Institute
10 Stanford University
11 University of California, Davis; School of Veterinary Medicine; Lab for Data Intensive Biology
With Galaxy one gets all the benefits of bioinformatics workflow platforms: provenance tracking, execution and data management, repeatability, and an environment for data exploration and visualization. But what are the options when we want to move to another platform?
To address this, four engineers started working together at the BOSC 2014 Codefest with an initial focus on developing a portable means of representing, sharing, and invoking command line tools and a secondary focus on portable workflow descriptions.
On March 31st, 2015, the group released their second draft of the Common Workflow Language specification. Descriptions are YAML documents that are validated by an Apache Avro schema and can be interpreted as RDF graphs using JSON-LD. The documents are also valid Wf4Ever ‘wfdesc’ descriptions after a simple transformation. Future drafts will include the use of the EDAM ontology to describe the tools, enabling discovery via the ELIXIR tool registry.
Seven Bridges Genomics, the Galaxy Project, and the organization behind Arvados (Curoverse) have started to implement support for the Common Workflow Language, with interest from other projects and organizations like Apache Taverna, BioDatomics and the Broad Institute. Developers on the Galaxy Team are exploring adding CWL tool description support, with plans to add support for CWL workflow descriptions. Tool authors and other community members will benefit as they will only have to describe their tool and workflow interfaces once. This will enable scientists, researchers and other analysts to share their workflows and pipelines in an interoperable yet human-readable manner.
Extending Galaxy’s reach: recent progress towards complete multi-omic data analysis workflows
Timothy J Griffin1, James Johnson1, Getiria Onsongo1, Pratik D Jagtap1, Candace R Guerrero1, Kevin Murray1, Ira Cooke2, Bjoern Gruening3, Lennart Martens4, Marc Vaudel5, Harald Barsnes5
2 La Trobe University, AUSTRALIA
3 University of Freiburg, GERMANY
4 Ghent University, BELGIUM
5 University of Bergen, NORWAY
Integrative analysis of different ‘omic data types, also known as multi-omics, is gaining momentum as a powerful biological discovery tool. Galaxy offers an ideal platform for these types of data analysis applications, which require sophisticated workflow development utilizing disparate tools for different data types (e.g. genomic, transcriptomic, proteomic data). Here, we will present recent progress from our global research team in this area, focusing on proteogenomic applications. Proteogenomics uses genomic and/or transcriptomic data as a template for in silico translation of the protein products that may be encoded, including novel sequences arising from genomic variation (splice isoforms, mutations, frameshifts, etc.). Mass spectrometry (MS)-based proteomics data are matched against these protein sequences, confirming known protein products as well as novel sequences. We have built a unique Galaxy-based workflow offering complete proteogenomic analysis. Our workflow uses well-known Galaxy tools for working with transcriptomic and genomic data (e.g. TopHat, SAMtools, SnpEff) to identify potentially novel protein-coding sequences such as splice isoforms and non-synonymous indels. We are developing the powerful SearchGUI/PeptideShaker platform, implemented in Galaxy, to match proteomics data to the generated protein sequences. This platform enables combined use of several proteomic database searching algorithms to provide more confident matches of data to novel protein sequences, and flexible outputs for further downstream analysis and evaluation to ensure high-confidence reporting of novel protein sequences. Finally, the results are compatible with visualization and interpretation using the popular Integrative Genomics Viewer. We will demonstrate the use of this powerful proteogenomic workflow in the analysis of several biologically relevant datasets.
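The in silico translation step mentioned above can be illustrated with a simplified sketch. It is not part of the workflow described in this abstract: it translates only the three forward reading frames of a nucleotide sequence using the standard genetic code, and the function names and the `min_length` parameter are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(sequence, frame=0):
    """Translate one forward reading frame; unknown codons become 'X'."""
    seq = sequence.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def candidate_peptides(sequence, min_length=2):
    """Split each forward frame at stop codons ('*') and keep the longer pieces."""
    peptides = set()
    for frame in range(3):
        for piece in translate(sequence, frame).split("*"):
            if len(piece) >= min_length:
                peptides.add(piece)
    return peptides


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    print(sorted(candidate_peptides("ATGGCCTAA")))
```

The resulting candidate sequences are the kind of protein database against which mass spectrometry data could then be searched; a realistic pipeline would also handle the reverse strand and variant-aware transcripts.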
A Genomics Virtual Laboratory in practice
Enis Afgan1, Clare Sloggett2, Nuwan Goonasekera2, Igor Manukin3, Derek Benson3, Mark Crowe4, Simon Gladman2, Yousef Kowsar2, Michael Pheasant3, Ron Horst3, Andrew Lonie2
1 Johns Hopkins University, USA
2 University of Melbourne, Australia
3 University of Queensland, Australia
4 Queensland Facility for Advanced Bioinformatics, Australia
Over the last 4 years we have designed and implemented the Genomics Virtual Laboratory (GVL: http://genome.edu.au) as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized Galaxy compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets, and workflow and visualisation options. The platform is flexible in that users can conduct analyses through multiple web-based (Galaxy, RStudio, IPython Notebook) or command-line interfaces, and add/remove compute nodes and data resources as required. Best practice tutorials and protocols provide a path from introductory training to practice. The GVL is available on the OpenStack-based Australian Research Cloud (http://nectar.org.au) and the Amazon Web Services cloud via a dedicated web-based launcher application (http://launch.genome.edu.au).
We now have GVL implementations at major Australian research institutes including the Universities of Queensland, Melbourne, Monash and Western Australia, and the Peter MacCallum Cancer Centre; plus many hundreds of individual launches by researchers and students across the country. We have learned a great deal about the usage patterns of the platform, including scalability, reliability, and accessibility. This presentation will discuss progress on the GVL project, lessons learned in architecting for the cloud, and uptake and usage by the Australian research community.
IRIDA: A Genomic Epidemiology Platform Built on top of Galaxy
Aaron Petkau1, Franklin Bristow1, Thomas Matthews1, Josh Adam1, Philip Mabon1, Eric Enns1, Jennifer Cabral1,2, Joel Thiessen1,2, Cameron Sieffert1, Natalie Knox1, Damion Dooley3, Emma Griffiths5, Geoff Winsor5, Matthew Laird5, Mélanie Courtot3,5, Peter Kruczkiewicz6, Alex Keddy7, Robert G. Beiko7, William Hsiao3,4, Gary Van Domselaar1,2, Fiona Brinkman5
2 University of Manitoba, Winnipeg, Canada
3 BC Public Health Microbiology and Reference Laboratory, Vancouver, Canada
4 University of British Columbia, Vancouver, Canada
5 Simon Fraser University, Burnaby, Canada
6 Laboratory for Foodborne Zoonoses, Lethbridge, Canada
7 Dalhousie University, Halifax, Canada
Whole genome sequencing (WGS) is revolutionizing epidemiological methods for identification and investigation of infectious disease outbreaks. However, the routine use of WGS has been hindered by the complexity of data management and the lack of pipelines supporting quality control and data analysis standards. While an increasing number of pipelines for genomic epidemiology are being developed, each typically has different installation and execution requirements. This makes it difficult to integrate these pipelines into a single genomic epidemiology system.
Galaxy offers a solution by providing a system to integrate, execute, and maintain data analysis pipelines. In addition, Galaxy provides a community of developers who contribute and maintain the bioinformatics tools used for genomic epidemiology. Our project, IRIDA (Integrated Rapid Infectious Disease Analysis), builds a platform for genomic epidemiology on top of Galaxy. IRIDA provides a system for the storage and management of sequencing data and sample metadata, an interface for the execution of data analysis pipelines, and the storage, auditing and visualization of results. Within IRIDA, we provide standard pipelines for genomic epidemiology including SNVPhyl, our SNV (Single Nucleotide Variant) phylogeny pipeline. These pipelines are executed using a Galaxy instance internal to IRIDA, and additional support is provided for exporting genomic sequence data to external Galaxy instances.
By building on top of Galaxy we hope to simplify the process of pipeline integration, to share our pipelines with the bioinformatics community, and to contribute to the development of standards for genomic epidemiology. More information can be found at http://irida.ca.
Building Galaxy Community VM
Ryota Yamanaka1, Tazro Ohta2, Manami Kato3, Hiroyuki Aburatani1
1 Genome Science Division, The University of Tokyo
2 Database Center for Life Science, ROIS
3 Laboratory for Disease Systems Modeling, IMS, RIKEN
Biomedical researchers face two barriers when they start using Galaxy. First, while Galaxy tools and workflows are shared in public repositories, new users can rarely find out how other research institutes use those tools and design workflows. Second, they are often unable to reproduce workflows they have used before: it is difficult for individual researchers or small laboratories to maintain their systems, so their Galaxy environments often become unrecoverable when they change settings or reset their computers. To solve these problems, Galaxy Community Japan holds a monthly meet-up to share our workflows. We also distribute a virtual machine image, on which we have configured Galaxy with the necessary tools and workflows, and publish our practical know-how about these tools and workflows on our website. Users can download and run the virtual machine on their own PC or launch it on AWS, so they can immediately try pre-installed analysis workflows with their own data. The latest version of this virtual machine is running on our public test site, while older versions remain available for download. As a result, users can run the same workflows on different computational infrastructures and always reconstruct the Galaxy environments they have used before. This will also help developers advertise their new tools to potential users. We would like to introduce several newly developed, unique tools on our Galaxy, as well as our experiences in local activities such as Galaxy Workshop Tokyo.
An initiative to federate the galactic community in France: the IFB Galaxy Working Group
Gwendoline ANDRES1, Loraine BRILLET-GUEGUEN1, Christophe CARON1, Alexis DEREEPER2, Sandra DEROZIER3, Olivia DOPPELT-AZEROUAL4, Jean François DUFAYARD5, Franck GIACOMONI6, Olivier INIZAN7, Gildas LE CORGUILLE1, Alban LERMINE8, Valentin LOUX3, Sarah MAMAN9, Fabien MAREUIL4, Misharl MONSOOR1
1 CNRS-UPMC Station Biologique de Roscoff
2 IRD Southgreen Montpellier
3 INRA MaIAGE Jouy en Josas
4 Center of Bioinformatics, Biostatistics and Integrative Biology, Institut Pasteur, Paris, France
5 CIRAD Southgreen Montpellier
6 INRA PFEM Clermont Ferrand
7 INRA URGI Versailles
8 Institut Curie Paris
9 INRA Genotoul/SIGENAE Toulouse
As the Galaxy “tour de france” showed in 2012, the Galaxy platform has aroused great interest in France. Today the platform enjoys great success across the country’s bioinformatics infrastructures. In March 2013, a working group dedicated to Galaxy and supported by the national infrastructure “Institut Français de Bioinformatique” (French Institute of Bioinformatics) was set up: the IFB Galaxy Working Group (IFB GWG). This Working Group has been built upon several national and regional bioinformatics platforms which use and deploy Galaxy (for training sessions, for analysis, …). The IFB has given the Working Group the mission of federating the French Galaxy community (biologists and bioinformaticians). This mission is based on three main actions: animation, training and technology. After two years of activity, we would like to present the most significant results in the animation of the French community: events (Galaxy Days), training sessions (Galaxy4Bioinformatics), and development (Toolshed, Best Practice Guides). We will also present how we are using Galaxy as a hub to build federating projects (IFB/France Genomique & IFB/MetaboHUB) between different communities addressing scientific and technological challenges.
Submit a late abstract
The deadlines for oral and poster presentations have passed. However, late oral and poster abstracts are still being accepted and will be considered as cancellations occur or space opens up.
Abstracts are submitted electronically and should be 250 words or fewer of plain text. See the GCC2014 abstracts list for the broad range of topics presented in 2014.
There will also be an opportunity for lightning talks, which will be solicited during the meeting.
Oral presentations will be 15 or 20 minutes long.
Talks and posters on any topics of interest to the Galaxy community are welcome.
Please Note: By submitting an abstract:
- You agree to make your slides/posters freely available on this web site no later than 15 August 2015.
- If you give an oral presentation, you agree to have your presentation videotaped and made publicly available during and after the conference.