Currently, the target application of PIPE is the processing and analysis of images of gene expression patterns in Drosophila obtained by confocal scanning microscopy.
PIPE is developed as an open system to which new processing and analysis methods can easily be added and into which third-party tools can be integrated. The system architecture is based on the technology of multi-agent systems. PIPE can be run locally or used to organize on-line collaboration of investigators from different laboratories via the Internet. All PIPE code will be available under the GNU Library General Public License (LGPL).
The biological community increasingly recognizes the importance of a quantitative approach to the analysis of gene expression information. The quantification of gene expression will make it possible to apply mathematical and computational methods to reveal fine details of biological processes and to establish new regularities.
Unfortunately, relatively few methods are currently available for extracting quantitative data from images of gene expression patterns and linking these data to other biological information [3]. Little research addresses the problem of management, retrieval and analysis of in situ gene expression data, let alone the design of universal formats for sharing these images between users [4]. Thus the development of software for quantification and analysis of gene expression in situ is an important task for bioinformatics.
Over a period of several years these teams have developed various in-house tools for quantification of segmentation gene expression in Drosophila, construction of a spatio-temporal atlas of gene expression at cellular resolution, and analysis of image information. Within the PIPE project these and many other tools will become publicly available to the scientific community, both via the Internet and for download.
The requirements for such software can be formulated as follows:
An examination of currently used client/server architectures makes it clear that only a multi-tier or multiagent architecture can meet the functional requirements formulated above. An important advantage of multiagent systems is their inherent modularity, which makes them scalable and easily extensible. Moreover, multiagent systems with redundant components (databases, agents, application programs, other services) are robust, highly adaptable to functional extensions, and have high readiness and reactivity. Thus the multiagent architecture satisfies nearly all our requirements for the behavior of a system for image and data processing and analysis, and we therefore decided to use the multiagent approach to design this system.
Figure 1 presents the system architecture. The current configuration of the system includes two servers each containing all system components.
Figure 1. System architecture.
All agents are designed as multithreaded Java HTTP servers that can actively communicate with other agents. The agents exchange messages over HTTP using GET and POST requests, and hence can be used in networks with firewalls and proxy servers. The agents implement complex scenarios of distributed interaction in a heterogeneous environment.
System configuration
The information about agents and their functions is stored in a coordination agent (CA) database. To ensure the actuality of information about system configuration
Each agent selects a counteragent (or a required service) with regard to its availability and load, the counteragent located on the same server being selected first. This means that the interactions of agents are not static and reorganize dynamically.
It is the continuous tracking of the actual system configuration and the dynamic reorganization of agent interactions that give the system its capability to reconfigure. This property makes it possible to extend and modify the functionality of the system while it is in operation, increasing the efficiency of its use.
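The selection rule described above can be illustrated with a minimal sketch. It assumes that the coordination agent database can be read as a list of records; the `AgentInfo` fields, the `select_counteragent` function, and the tie-breaking order (same server first, then lowest load) are hypothetical choices made for illustration, not the actual PIPE implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AgentInfo:
    """One record of a hypothetical coordination agent (CA) registry."""
    name: str
    server: str
    service: str
    available: bool
    load: float  # assumed: fraction of capacity in use, 0.0 to 1.0


def select_counteragent(registry: Iterable[AgentInfo], service: str,
                        local_server: str) -> Optional[AgentInfo]:
    """Pick an available provider of `service`.

    Assumed policy: agents on the caller's own server are preferred,
    and the current load breaks ties among equally local agents.
    """
    candidates = [a for a in registry if a.service == service and a.available]
    if not candidates:
        return None
    # False sorts before True, so agents on the local server come first.
    return min(candidates, key=lambda a: (a.server != local_server, a.load))
```

Because the registry is consulted on every call, an agent that becomes unavailable or overloaded is simply skipped at the next request, which is one way the interactions could reorganize dynamically.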
Applications and program modules
In general, each application program for image processing or data analysis consists of many steps. For example, as described below, prediction of embryo age includes image filtering, segmentation, removal of background signal, Fast Fourier Transform and Principal Component Analysis. Although it is possible to write one program that performs all these steps, it is more convenient to write a separate program for each step and to run these programs sequentially. This approach to program design is known as modular programming. Modular programming becomes particularly valuable when a user needs to control the processing and analysis steps and to visualize intermediate results. Another strong argument in favor of modular programming is the possibility of reusing previously developed code in other applications.
A challenge in implementing a modular programming environment is that each program must be designed with an interface that can talk to the programs before and after it in the chain of calculations. This interface needs to be flexible enough to support communication with different modules in different applications. Our system provides a powerful way to implement such an interface: program modules communicate with each other via agents. An agent can insert data into a database, send it directly to the next program module, and modify configuration files and other auxiliary data if necessary.
We implement program modules in C++ or Java.
A given program module can be used in different applications; modules are grouped into types according to their function.
The application program is constructed by joining program modules and is represented as a directed acyclic graph (DAG). In this graph the nodes, shown as rectangles, are modules, while the edges, displayed as arrows, are data-dependency links, which specify that the output of one module serves as input to another.
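As an illustration of how such a graph of modules could be executed, the following minimal sketch runs each module once all of its inputs are available. The `run_program` function and the dictionary layout are assumptions made for this example; in PIPE the modules are separate programs that communicate via agents, whereas here they are plain Python callables.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Mapping, Sequence


def run_program(modules: Mapping[str, Callable[..., Any]],
                inputs_of: Mapping[str, Sequence[str]]) -> Dict[str, Any]:
    """Execute modules in an order that respects data dependencies.

    `inputs_of[name]` lists the modules whose outputs are passed, in
    that order, as arguments to module `name`. A cycle in the graph
    makes graphlib raise CycleError when the order is computed.
    """
    results: Dict[str, Any] = {}
    for name in TopologicalSorter(inputs_of).static_order():
        args = [results[dep] for dep in inputs_of.get(name, ())]
        results[name] = modules[name](*args)
    return results
```

Keeping every intermediate result in `results` mirrors the requirement that a user be able to inspect the output of each step, not only the final one.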
Figure 2. Construction of a new program.
After a program module has been visually constructed, its parameters are specified. Program modules forming one application program may be located on one server or may be distributed among several server machines.
Figure 3. Selection of program modules.
Besides the construction of new programs, the interface supports program editing and visualization of both final and intermediate results. The Display operator is used to visualize data, while the ImageDisplay operator serves to visualize images. The Graph operator is used to visualize quantitative gene expression data as a graph, table or reconstructed image.
Program execution can be visualized as a graph, which can be saved as a JPEG or MAP file.
Figure 4. The visualization of program execution.
Like that of all other insects, the body of the fruit fly Drosophila is made up of repeated units called segments. Immediately following fertilization and egg deposition, the newly formed zygotic nucleus starts to divide. After 9 rapid and synchronous divisions the nuclei migrate to the periphery of the egg. This begins the 'syncytial blastoderm' stage, which is denoted as stage 4 in the standard nomenclature [5] (Fig. 5). Cleavage cycle 14A (embryonic stage 5) lasts from 130 to 180 minutes after fertilization. At this time the segments are determined, and membrane invagination and cellularization take place [6].
Figure 5. The image of a representative individual embryo at the syncytial blastoderm stage. The embryo is roughly a prolate spheroid and is composed of about 5000 nuclei not separated into distinct cells.
The classical genetics of segmentation is well characterized [7, 8]. The initial determination of the segments is a consequence of the expression of 16 genes, most of which encode transcription factors. Most of these genes are zygotic and are expressed in patterns that become more spatially refined over time. Of particular importance are members of the 'gap' and 'pair-rule' classes of segmentation genes [9, 10].
Image processing and data analysis procedures
Images of gene expression obtained from these embryos serve as the raw material for the quantification of gene expression. Quantitative data are obtained in several steps; for each step, specialized methods for image and data processing have been developed and implemented. As a result, reference data on the expression of segmentation genes at cellular resolution and at each time point are constructed. Images and quantitative gene expression data from individual embryos, as well as reference gene expression data, are used to study the dynamics of formation of gene expression domains, the precision of development and pattern formation, and the mechanisms of segment determination. The relational database known as FlyEx is used to store data from individual embryos, reference data, and data generated by processing and analysis procedures.
We demonstrate the capabilities of our system on several real-life examples of image processing and analysis.
Temporal characterization of embryos
Problem statement
One of the important image processing procedures is the temporal
characterization of embryos.
The problem of embryo age detection arises because the gene
expression data were acquired from fixed embryos whose precise
developmental time was not known. The experimental methods for
determining embryo age are time consuming and relatively
expensive. A pattern recognition method was therefore developed
for the automated assignment of a precise developmental age to an
embryo [11]. The embryos are staged on the basis of the expression
pattern of the pair-rule gene eve, which is highly dynamic during
cleavage cycle 14A. This is accomplished by standardizing the eve
expression pattern against developmental time measured
experimentally.
The method for detection of embryo age requires two additional
image processing procedures as a necessary preliminary step,
namely segmentation of images and removal of non-specific
background signal.
Segmentation of images
The confocal images contain not only
information about gene expression inside the nuclei, but also
nuisance information about null expression in the internuclear
space. We apply a simple segmentation procedure that excludes
the internuclear areas from consideration. First, the image is
filtered by the multi-valued nonlinear filter (MVF) [11] to
eliminate noise and is subjected to image equalization in order
to amplify the details and the contrast range. Then it is
thresholded by the following rule: every pixel whose brightness is
lower than a given threshold is replaced by 0, and every pixel
whose brightness is greater than or equal to the threshold is
replaced by 255. The output of the procedure is a binary image in
which contiguous groups of 'on' pixels, separated from other
groups by 'off' pixels, define the intranuclear regions. Finally,
the numerical data are read off the binary image and presented as
an ASCII table.
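The thresholding rule and the grouping of 'on' pixels into regions can be sketched as follows. This is a minimal illustration under stated assumptions: images are plain lists of rows, the filtering and equalization steps are omitted, and 4-connectivity is assumed for deciding which pixels are contiguous (the text does not specify it). The function names `threshold` and `label_regions` are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Image = List[List[int]]


def threshold(image: Image, level: int) -> Image:
    """Replace pixels below `level` by 0 and all other pixels by 255."""
    return [[255 if px >= level else 0 for px in row] for row in image]


def label_regions(binary: Image) -> Tuple[Image, int]:
    """Label contiguous groups of 'on' pixels (assumed 4-connected).

    Returns a label image (0 = 'off', 1..n = region number) and n.
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 0 or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Each labelled region then stands for one candidate nucleus, for which one row of the output table could be produced.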
Background removal
All the quantitative data at our
disposal are relative values, since in confocal imaging the
maximal fluorescence and background levels were set visually by a
human expert. Because of differences in background level, gene
expression data obtained under different experimental conditions
cannot be directly compared or processed together. The
problem is to bring the data to a unified standard form with
zero background and to remove the distortions of gene expression
patterns caused by the presence of a background signal. Our method
for removal of the background signal is based on the observation
that the expression level of a given gene in an embryo that is a
null mutant for that gene is well fit by a very broad
two-dimensional paraboloid. The background paraboloid is
determined automatically from the areas of wild type embryos in
which the given gene is not expressed, and the whole image is then
normalized by this paraboloid to remove background from the entire
embryo.
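A minimal sketch of the paraboloid fit is given below. It makes several assumptions that the text does not confirm: the background is modelled as z = a + b·x + c·y + d·x² + e·x·y + f·y², the coefficients are found by ordinary least squares on samples taken from non-expressing areas, and "normalization" is taken to mean subtracting the fitted surface and clipping negative values to zero. All function names are hypothetical, and coordinates are assumed to be scaled to a modest range so that the normal equations stay well conditioned.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, intensity)


def _features(x: float, y: float) -> List[float]:
    """Terms of the assumed quadratic background model."""
    return [1.0, x, y, x * x, x * y, y * y]


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique paraboloid")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_paraboloid(samples: Sequence[Point]) -> List[float]:
    """Least-squares coefficients [a, b, c, d, e, f] from background samples."""
    ata = [[0.0] * 6 for _ in range(6)]
    atz = [0.0] * 6
    for x, y, z in samples:
        f = _features(x, y)
        for i in range(6):
            atz[i] += f[i] * z
            for j in range(6):
                ata[i][j] += f[i] * f[j]
    return _solve(ata, atz)


def background_at(coeffs: Sequence[float], x: float, y: float) -> float:
    """Value of the fitted background surface at (x, y)."""
    return sum(c * f for c, f in zip(coeffs, _features(x, y)))


def remove_background(data: Sequence[Point], coeffs: Sequence[float]) -> List[Point]:
    """Assumed normalization: subtract the background and clip at zero."""
    return [(x, y, max(0.0, z - background_at(coeffs, x, y))) for x, y, z in data]
```

Fitting only on non-expressing areas and then evaluating the surface everywhere is what allows the background under the expression domains to be estimated without measuring it directly.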
Age detection
The development of the method can be
subdivided into two major stages: the first is the
extraction of characteristic features from the expression pattern
of the eve gene in embryos of different ages, and the second is
the standardization of this pattern against experimentally
determined developmental time. The developmental time of an embryo
is obtained by experimentally measuring the degree of membrane
invagination and using the standard curve, which gives membrane
invagination as a function of developmental time [12]. Precise
developmental age was experimentally determined for 120 embryos
belonging to cleavage cycle 14A from our dataset. These embryos
were used as a training set for temporal analysis.
The problem of high dimensionality of the feature space often
arises in regression prediction. As the spectral phases are
strongly correlated and hence the feature set is redundant, it
is possible to reduce the dimension of the feature vector by
principal component analysis (PCA). Applying PCA, we construct
a new set of uncorrelated features, which are linear combinations
of the initial variables; when the initial variables are highly
correlated, just the first few new features may hold almost all
the information originally contained in the whole initial set.
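The dimension reduction can be illustrated with a small sketch. It is not the authors' implementation: the leading principal components are found here by power iteration with deflation on the sample covariance matrix, a technique chosen only because it needs no external libraries. The function names are hypothetical, and the fixed start vector is an assumption that could fail in the contrived case where it is exactly orthogonal to a leading component.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def covariance(data: Sequence[Sequence[float]]) -> Tuple[Matrix, List[float]]:
    """Sample covariance matrix and feature means of row-wise data."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, means


def top_components(cov: Matrix, k: int,
                   iters: int = 500) -> List[Tuple[float, List[float]]]:
    """First k (variance, direction) pairs by power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    found = []
    for _component in range(k):
        v = [1.0 / (i + 1) for i in range(d)]  # assumed generic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:  # no variance left in the deflated matrix
                break
            v = [x / norm for x in w]
        variance = sum(v[i] * work[i][j] * v[j]
                       for i in range(d) for j in range(d))
        found.append((variance, v))
        for i in range(d):  # deflation: remove the component just found
            for j in range(d):
                work[i][j] -= variance * v[i] * v[j]
    return found


def project(row: Sequence[float], means: Sequence[float],
            components: Sequence[Tuple[float, List[float]]]) -> List[float]:
    """New uncorrelated features: centered row projected on each direction."""
    return [sum((row[j] - means[j]) * vec[j] for j in range(len(row)))
            for _variance, vec in components]
```

Comparing the returned variances with the trace of the covariance matrix gives the share of information retained, which is how one could decide how many new features to keep.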
Application program
Figure 6 presents the result of executing the program for age
prediction. This program was constructed to visualize the
intermediate results produced by each module. If the result is an
embryo image, this image is rescaled by 30% of its original size
before visualization. The program modules denoted as PAM:one2001,
..., PAM:one2007 perform.
Figure 6. Result of execution of the program for age
prediction.
Constructing a multiple (triple) stained image of an embryo
on-the-fly
Processing of quantitative data on expression of kni in embryos
kw1 and kw10