PIPE - Package for Image ProcEssing and analysis

Goals

Background

Who

The technology

System requirements
System architecture
Applications and program modules
Image server
User interface

PIPE in use

Segmentation in fruit fly
Image processing and data analysis procedures

Real life examples

Temporal characterization of embryos
Problem statement
Application program

Constructing a multiple (triple) stained image of an embryo on-the-fly
Problem statement
Application program

Processing of quantitative data on expression of kni in embryos kw1 and kw10
Problem statement
Application program

Future plans

References

Goals

The goal of PIPE is to develop a software system to support on-line processing and analysis of image information.

Currently the target application of PIPE is the processing and analysis of images of gene expression patterns in Drosophila obtained with a confocal scanning microscope.

PIPE is developed as an open system that makes it easy to add new processing and analysis methods and to integrate third-party tools. The system architecture is based on multi-agent system technology. PIPE can be run locally or used to organize on-line collaboration between investigators from different laboratories via the Internet. All PIPE code will be available under the GNU Library General Public License (LGPL).

Background

Observational information about in situ gene expression in Drosophila is accumulating at a continuously increasing rate. Confocal scanning microscopy combined with fluorescence techniques produces images of gene expression patterns of high quality and resolution. For example, confocal imaging of specimens stained using the multiplex fluorescence in situ hybridization technique makes it possible to detect simultaneously the expression of up to ten genes in one cell [1, 2].

The biological community increasingly recognizes the importance of a quantitative approach to the analysis of gene expression information. Quantification of gene expression will make it possible to apply mathematical and computational methods to reveal fine details of biological processes and to establish new regularities.

Unfortunately, relatively few methods are currently available for extracting quantitative data from images of gene expression patterns and linking these data to other biological information [3]. Little research addresses the management, retrieval and analysis of in situ gene expression data, let alone the design of universal formats for sharing these images between users [4]. Thus the development of software for the quantification and analysis of gene expression in situ is an important task for bioinformatics.

Who

PIPE is developed by members of three teams: the team of Dr Maria Samsonova from the Department of Computational Biology of the St. Petersburg State Polytechnic University (Russia), John Reinitz's lab at Stony Brook University (USA) and Dave Kosman's team at the University of California, San Diego (USA).

Over a period of several years these teams have developed various in-house tools for the quantification of segmentation gene expression in Drosophila, the construction of a spatio-temporal atlas of gene expression at cellular resolution and the analysis of image information. Within the PIPE project these and many other tools will become publicly available to the scientific community, both via the Internet and for download.

The technology

System requirements

Image information about gene expression patterns is currently being analyzed in many laboratories. PIPE aims to support the work of these groups by providing a flexible software system that can be run locally or used to organize on-line collaboration between investigators from different laboratories via the Internet.

The requirements for such software can be formulated as follows:

  • extensibility to deal with
    1. new teams and users,
    2. a continuously growing number of images and data volumes,
    3. the introduction of new processing and analysis methods,
    4. integration with third-party tools;
  • the ability of each team to use its in-house software on data or images acquired by other teams, and to use programs developed by other teams on its own proprietary information;
  • simultaneous access of multiple users to shared data and methods;
  • flexibility in the specification and modification of analysis methods;
  • support for distributed processing and analysis of data;
  • support for autonomous task execution after the connection is dropped, as well as notification about processing results;
  • support for heterogeneous software/hardware platforms as server and client machines;
  • portability across software platforms;
  • access through firewalls and proxy servers;
  • scalability when the number of active computers or agents changes;
  • no need for programming skills or for learning how to install special software libraries and program tools for data processing and analysis;
  • a powerful and friendly user interface, as well as visualization tools;
  • continuous operation while new components are added or old ones are removed;
  • fault tolerance when hardware or software components malfunction;
  • adequate response time and readiness characteristics;
  • a preference for open source software as the basis.

Among the client/server architectures currently in use, only a multi-tier or multiagent architecture can meet the functional requirements formulated above. An important advantage of multiagent systems is their inherent modularity, which makes them scalable and easily extensible. Moreover, multiagent systems with redundant components (databases, agents, application programs, other services) are robust, highly adaptable to functional extensions and have high readiness and reactivity characteristics. Thus the multiagent architecture satisfies nearly all our requirements for the behavior of a system for image and data processing and analysis, and we therefore decided to use the multiagent approach to design this system.

System architecture

Figure 1 presents the system architecture. The current configuration of the system includes two servers, each containing all system components.

Figure 1. System architecture.

All agents are designed as multithreaded Java HTTP servers, which can actively communicate with other agents. The agents exchange messages over HTTP using GET and POST requests, and hence can be used in networks with firewalls and proxy servers. The agents implement complex scenarios of distributed interactions in a heterogeneous environment.
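
The message exchange described above can be illustrated with a minimal sketch. It is written in Python rather than Java purely for brevity and is not the PIPE implementation: the JSON message format, the status reply, and the names AgentHandler, start_agent, agent_url, send_message and query_status are all assumptions made for this example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import ProxyHandler, Request, build_opener

# Talk to localhost directly; a real deployment would configure its proxy here.
_opener = build_opener(ProxyHandler({}))


class AgentHandler(BaseHTTPRequestHandler):
    """Answers GET with a status report and accepts messages via POST."""

    def _reply(self, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Hypothetical status query: agent name and number of stored messages.
        self._reply({"agent": self.server.agent_name,
                     "received": len(self.server.inbox)})

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        message = json.loads(self.rfile.read(length).decode("utf-8"))
        self.server.inbox.append(message)
        self._reply({"status": "accepted"})

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def start_agent(name, port=0):
    """Start a multithreaded agent on localhost; port 0 picks any free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), AgentHandler)
    server.agent_name = name
    server.inbox = []
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def agent_url(server):
    host, port = server.server_address[:2]
    return f"http://{host}:{port}/"


def send_message(url, message):
    """POST a JSON message to another agent and return its JSON reply."""
    data = json.dumps(message).encode("utf-8")
    request = Request(url, data=data,
                      headers={"Content-Type": "application/json"})
    with _opener.open(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def query_status(url):
    """GET the status report of another agent."""
    with _opener.open(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because only plain GET and POST requests are used, such traffic passes through the same firewalls and proxies as ordinary web traffic, which is the property the architecture relies on.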

System configuration

The information about agents and their functions is stored in a coordination agent (CA) database. To keep the information about the system configuration up to date:

  • each agent database stores the list of counteragents and their URLs, the list of functions, a reference to the monitoring program, and load and authorization characteristics;
  • each agent registers with the CAs by reporting its URL, the logical names of the services it executes, and the parameters of its designed load characteristics;
  • each agent notifies the CAs about its scheduled sign-off (e.g., due to a decrease of load on a given service or its modification);
  • all agents update the information about the system configuration by notifying the CAs about their current load;
  • if any agent or service is unavailable, its counteragents notify the CAs about their failure to establish a connection;
  • the functionality of the system as a whole and of each registered service separately is periodically monitored;
  • the CAs notify registered agents about changes in the configuration of the system by sending HTTP messages;
  • the system administrator is notified by e-mail if the system malfunctions.

Each agent selects a counteragent (or a required service) with regard to its availability and load, with a counteragent located on the same server being selected first. This means that the interactions of agents are not static but are reorganized dynamically.

It is the continuous tracking of the actual system configuration and the dynamic reorganization of agent interactions that give the system its capability to reconfigure. This property makes it possible to extend and modify the functionality of the system while it is in operation, which increases the efficiency of its use.
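
The selection rule can be sketched as follows. This is an illustration only: the AgentRecord fields, the 0..1 load scale and the function name select_counteragent are assumptions, not the structure of the actual CA database.

```python
from dataclasses import dataclass


@dataclass
class AgentRecord:
    """One registered agent as a coordination agent might describe it (assumed fields)."""
    url: str
    host: str
    services: frozenset
    load: float          # assumed scale: 0.0 (idle) to 1.0 (saturated)
    available: bool = True


def select_counteragent(records, service, local_host):
    """Pick an available agent offering `service`.

    Agents on the caller's own server are preferred; among equally
    local agents the one with the lowest current load wins.
    Returns None when no available agent offers the service.
    """
    candidates = [r for r in records if r.available and service in r.services]
    if not candidates:
        return None
    # False sorts before True, so same-host agents come first, then lowest load.
    return min(candidates, key=lambda r: (r.host != local_host, r.load))
```

Re-running the selection for every request, instead of caching the choice, is what lets the interaction pattern follow changes in availability and load.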

Applications and program modules

In general each application program for image processing or data analysis consists of many steps. For example, as described below, prediction of an embryo's age includes image filtering, segmentation, removal of background signal, Fast Fourier Transform and Principal Component Analysis. Although it is possible to write one program that performs all these steps, it is more convenient to write a separate program for each step and to run these programs sequentially. This approach to program design is known as modular programming. Modular programming becomes particularly valuable when a user needs to control processing or analysis steps and visualize intermediate results. Another strong argument in favor of modular programming is the possibility of reusing previously developed code in other applications.

A challenge in implementing a modular programming environment is that each program must be designed with an interface that can talk to the programs before and after it in the chain of calculations. This interface needs to be flexible enough to support communication with different modules in different applications. Our system provides a very powerful way to implement such an interface: program modules communicate with each other via agents. An agent can insert data into a database, send them directly to the next program module and, if necessary, modify configuration files and other auxiliary data.

We implement program modules in C++ or Java.

A given program module can be used in different applications; modules are grouped into types according to their function.

Image Server

Image Server (IS) is a distinct program module that performs conversion of image formats and image scaling using the ImageMagick library. It also executes standard operations on images (e.g. contrast enhancement, intensity filtering, combining several images, etc.). In addition, IS participates in the visualization of processed images as JPEGs.

User interface

The user interface supports visual construction of an application program from program modules, program execution and the visualization of both intermediate and final results.

An application program is constructed by joining program modules and forms a directed acyclic graph (DAG). In this graph the nodes, shown as rectangles, are modules, while the edges, displayed as arrows, are data-dependency links, which specify that the output of one module serves as input to others.
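
A small sketch shows how a program given as such a graph can be executed so that every module runs only after the modules it depends on. The function run_program and the three example modules are hypothetical; PIPE itself dispatches modules through agents, which this sketch does not attempt to reproduce.

```python
from graphlib import TopologicalSorter


def run_program(modules, inputs_of):
    """Run every module once, after all modules whose output it consumes.

    modules:   name -> callable taking the list of upstream outputs
    inputs_of: name -> list of upstream module names (the incoming edges)
    """
    results = {}
    # static_order() yields predecessors first and raises CycleError on a cycle.
    for name in TopologicalSorter(inputs_of).static_order():
        upstream = [results[dep] for dep in inputs_of.get(name, [])]
        results[name] = modules[name](upstream)
    return results


# Hypothetical three-step program: load -> smooth -> display.
modules = {
    "load": lambda upstream: [4, 8, 6, 2],
    "smooth": lambda upstream: [(a + b) / 2
                                for a, b in zip(upstream[0], upstream[0][1:])],
    "display": lambda upstream: " ".join(f"{v:.1f}" for v in upstream[0]),
}
inputs_of = {"load": [], "smooth": ["load"], "display": ["smooth"]}
results = run_program(modules, inputs_of)
```

Keeping every intermediate result in the returned dictionary mirrors the interface's ability to display the output of any module, not only the final one.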

Figure 2. Construction of a new program.

After a program module has been visually constructed, its parameters are specified. The program modules forming one application program may be located on one server or distributed among several server machines.

Figure 3. Selection of program modules.

Besides the construction of new programs, the interface supports program editing and the visualization of both final and intermediate results. The Display operator is used to visualize data, while the ImageDisplay operator serves to visualize images. The Graph operator is used to visualize quantitative gene expression data as a graph, table or reconstructed image.

Program execution can be visualized as a graph, which can be saved as a JPEG or MAP file.

Figure 4. The visualization of program execution.

PIPE in use

Currently PIPE runs programs for processing and analysis of images of segmentation gene expression patterns.

Segmentation in fruit fly

As in all other insects, the body of the fruit fly Drosophila is made up of repeated units called segments. Immediately following fertilization and egg deposition, the newly formed zygotic nucleus starts to divide. After 9 rapid and synchronous divisions the nuclei migrate to the periphery of the egg. This begins the 'syncytial blastoderm' stage, which is denoted as stage 4 in the standard nomenclature [5] (Fig. 5). Cleavage cycle 14A (embryonic stage 5) lasts from 130 to 180 minutes after fertilization. At this time the segments are determined, and membrane invagination and cellularization take place [6].

Figure 5. The image of a representative individual embryo at the syncytial blastoderm stage. The embryo is roughly a prolate spheroid and is composed of about 5000 nuclei not separated into distinct cells.

The classical genetics of segmentation is well characterized [7, 8]. The initial determination of the segments is a consequence of expression of 16 genes which are mainly transcription factors. Most of these genes are zygotic and are expressed in patterns that become more spatially refined over time. Of particular importance are members of the 'gap' and 'pair-rule' classes of segmentation genes [9, 10].

Image processing and data analysis procedures

Images of gene expression obtained from these embryos serve as the raw material for the quantification of gene expression. Quantitative data are obtained in several steps; for each step specialized methods for image and data processing have been developed and implemented. As a result, reference data on the expression of segmentation genes at cellular resolution and at each time point are constructed. Images and quantitative gene expression data from individual embryos, as well as reference gene expression data, are used to study the dynamics of formation of gene expression domains, the precision of development and pattern formation, and the mechanisms of segment determination. The relational database known as FlyEx is used to store data from individual embryos, reference data and data generated by processing and analysis procedures.

We will demonstrate the capabilities of our system on several real life examples of image processing and analysis.

Real life examples

Temporal characterization of embryos

Problem statement

One of the important image processing procedures is the temporal characterization of embryos.

The problem of embryo age detection arises because the gene expression data were acquired from fixed embryos, for which the precise developmental time was not known. The experimental methods for detecting embryo age are time consuming and relatively expensive. A pattern recognition method was therefore developed for the automated assignment of a precise developmental age to an embryo [11]. The embryos are staged on the basis of the expression pattern of the pair-rule gene eve, which is highly dynamic during cleavage cycle 14A. This is accomplished by standardizing the eve expression pattern against developmental time measured experimentally.

The method for detecting embryo age requires two additional image processing procedures as a necessary preliminary step, namely segmentation of images and removal of non-specific background signal.

Segmentation of images

The confocal images contain not only information about gene expression inside nuclei, but also nuisance information about null expression in the internuclear space. We apply a simple segmentation procedure that excludes the internuclear areas from consideration. First the image is filtered by the multi-valued nonlinear filter (MVF) [11] to eliminate noise, and subjected to image equalization in order to amplify the details and contrast range. Then it is thresholded by the following rule: every pixel whose brightness is lower than a given threshold is replaced by 0 and every pixel whose brightness is greater than or equal to the threshold is replaced by 255. The output of the procedure is a binary image in which contiguous groups of 'on' pixels, separated from other groups by 'off' pixels, define the intranuclear regions. The numerical data are then read off the binary image and presented as an ASCII table.
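
The thresholding rule and the grouping of 'on' pixels can be sketched in a few lines. The sketch omits MVF filtering and equalization, assumes 4-connectivity for "contiguous", and assumes that the value read off for each region is the mean brightness of the original image; none of these choices is stated in the text, and the function names are hypothetical.

```python
def threshold(image, level):
    """Replace pixels below `level` by 0 and all others by 255."""
    return [[255 if pixel >= level else 0 for pixel in row] for row in image]


def label_regions(binary):
    """Label 4-connected groups of 'on' (255) pixels with 1, 2, 3, ...

    Returns the label grid (0 = 'off') and the number of regions found.
    """
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 255 or labels[r][c] != 0:
                continue
            count += 1
            labels[r][c] = count
            stack = [(r, c)]
            while stack:  # iterative flood fill from the seed pixel
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 255 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        stack.append((ny, nx))
    return labels, count


def region_means(image, labels, count):
    """Mean brightness of the original image inside each labeled region."""
    sums = [0] * (count + 1)
    sizes = [0] * (count + 1)
    for image_row, label_row in zip(image, labels):
        for pixel, label in zip(image_row, label_row):
            sums[label] += pixel
            sizes[label] += 1
    return [sums[i] / sizes[i] for i in range(1, count + 1)]
```

Each entry of the list returned by region_means would correspond to one row of the ASCII table, one row per intranuclear region.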

Background removal

All the quantitative data at our disposal are relative values, because in confocal imaging the maximal fluorescence and background levels were set visually by a human expert. Because of differences in background level, gene expression data obtained under different experimental conditions cannot be directly compared or processed together. The problem is to bring the data to a unified standard form with zero background and to remove the distortions of gene expression patterns caused by the presence of a background signal. Our method for removal of background signal is based on the observation that the expression level measured for a given gene in an embryo that is a null mutant for that gene is well fit by a very broad two-dimensional paraboloid. The background paraboloid is automatically determined from the areas of wild type embryos in which the given gene is not expressed, and the whole image is then normalized by this paraboloid to remove background from the entire embryo.
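
A least-squares fit of a general quadratic surface illustrates how such a background paraboloid might be estimated. This is a sketch under stated assumptions: the six-term quadratic form, the fitting by normal equations, and in particular the normalization by subtraction with clipping at zero are choices made here for illustration, since the text does not give the exact formulas. All function names are hypothetical.

```python
def basis(x, y):
    """Terms of the assumed surface b(x, y) = c0 + c1*x + c2*y + c3*x^2 + c4*x*y + c5*y^2."""
    return [1.0, x, y, x * x, x * y, y * y]


def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine the surface")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_paraboloid(samples):
    """Least-squares fit to (x, y, intensity) samples from non-expressing areas."""
    ata = [[0.0] * 6 for _ in range(6)]
    atb = [0.0] * 6
    for x, y, value in samples:
        phi = basis(x, y)
        for i in range(6):
            atb[i] += phi[i] * value
            for j in range(6):
                ata[i][j] += phi[i] * phi[j]
    return solve_linear(ata, atb)


def background(coeffs, x, y):
    """Evaluate the fitted surface at (x, y)."""
    return sum(c * p for c, p in zip(coeffs, basis(x, y)))


def remove_background(points, coeffs):
    """Assumed normalization: subtract the fitted background, clip at zero."""
    return [(x, y, max(0.0, value - background(coeffs, x, y)))
            for x, y, value in points]
```

Coordinates are best expressed in a small, centered range (for example relative to the embryo's center) before fitting, which keeps the normal equations well conditioned.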

Age detection

The development of the method can be subdivided into two major stages, of which the first is the extraction of characteristic features from the expression pattern of eve gene in embryos of different age, and the second is the standardization of this pattern against experimentally determined developmental time. The developmental time of an embryo is obtained by experimentally measuring the degree of membrane invagination and using the standard curve, which gives membrane invagination as a function of developmental time [12]. Precise developmental age was experimentally determined for 120 embryos belonging to cleavage cycle 14A from our dataset. These embryos were used as a training set for temporal analysis.

  • Extraction of characteristic features. To detect the age of an embryo from its eve expression pattern, it is necessary to represent the pattern by a small number of parameters that characterize well the temporal changes in eve expression domains. It has been shown in our previous study [11] that the frequency domain representation of images may be used to detect the characteristic features that mark the development of expression patterns over time. The Fourier spectrum is extracted from the images by means of the Fast Fourier Transform, and the phases of the low frequency coefficients of the Fourier spectra are used as parameters for the age detection algorithm.

  • Training set. At the next stage we use the group of 120 embryos with precisely determined developmental age as training data for building a regression function in which the characteristic features serve as independent variables. However, the spectral phases cannot be used directly in the regression analysis for two reasons: first, phases are periodic quantities and, second, the number of independent parameters is too large compared with the size of the training set. To remove the periodicity, a standard range of values is defined for each parameter such that the maximal absolute pairwise difference between the values that the parameter takes over the training set is minimized.
    High dimensionality of the feature space is a common problem in regression prediction. Because the spectral phases are strongly correlated, the feature set is redundant, which makes it possible to reduce the dimension of the feature vector by principal component analysis (PCA). Applying PCA we construct a new set of uncorrelated features that are linear combinations of the initial variables; when the initial variables are highly correlated, the first few new features may hold almost all the information originally contained in the whole initial set.

  • Construction of the regression function. Each embryo of the training set is now characterized by a multidimensional vector whose components are the value of developmental age together with the 'new' parameters of gene expression patterns. The regression function for age prediction is created from the training data by applying the Support Vector (SV) regression method [13].

  • Age prediction. To determine the age of a new embryo, the confocal image of the eve expression pattern in this embryo is subjected to the same preprocessing and feature extraction procedures as the members of the training set. The periodicity of the spectral phases is overcome by bringing their values to the standard range defined over the training set. Then PCA is applied to reduce the number of features to that required by the SV regression, and the embryo age is determined using the regression function constructed for the training set. A 95% confidence interval is constructed for the predicted age. Cleavage cycle 14A is divided into 8 temporal classes [14], and the embryo is assigned to one of these classes according to the predicted value.
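
The definition of the standard range for a periodic phase can be made concrete with a short sketch. One way to satisfy the stated criterion, assumed here, is to cut the circle inside the largest gap between the training phases: the remaining values then span the smallest possible interval. The function names are hypothetical, and phases are taken in radians.

```python
import math

TWO_PI = 2.0 * math.pi


def standard_range_start(phases):
    """Return the start of the 2*pi-long standard range for one parameter.

    The circle is cut inside the largest gap between the sorted training
    phases, which minimizes the largest pairwise difference between them.
    """
    wrapped = sorted(p % TWO_PI for p in phases)
    n = len(wrapped)
    best_gap, start = -1.0, wrapped[0]
    for i in range(n):
        following = wrapped[(i + 1) % n]
        # The gap after the last value wraps around to the first one.
        gap = (following - wrapped[i]) % TWO_PI if n > 1 else TWO_PI
        if gap > best_gap:
            best_gap, start = gap, following
    return start


def to_standard_range(phase, start):
    """Map a phase into the interval [start, start + 2*pi)."""
    return start + (phase % TWO_PI - start) % TWO_PI
```

For example, training phases clustered around +pi and -pi differ by almost 2*pi when taken at face value, but after mapping to the standard range they lie within a fraction of a radian of each other. The same start value, stored from the training set, is applied to the phases of a new embryo at prediction time.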

Application program

Figure 6 presents the result of executing the program for age prediction. This program was constructed to visualize the intermediate results produced by each module. If the result is an embryo image, this image is rescaled by 30% of its original size before visualization. The program modules denoted as PAM:one2001, ..., PAM:one2007 perform.

Figure 6. Result of execution of the program for age prediction.

Constructing a multiple (triple) stained image of an embryo on-the-fly

Problem statement

Each gene expression pattern was detected in a single channel of a confocal microscope, and for each channel a single stained image was obtained, displaying the expression pattern of a single gene in a given embryo. The single stained images obtained from an individual embryo can be used to generate on-the-fly an image that displays the expression patterns of all the genes scanned in the embryo. This image is called a multiple stained image. Each expression pattern is displayed in a different color.
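
The idea of a multiple stained image can be sketched as a per-pixel color merge. In PIPE this operation is carried out by the Image Server; the function below is only an illustration, and the assignment of one pure color per channel and the name combine_stains are assumptions.

```python
def combine_stains(channels, colors):
    """Merge grayscale channel images into one RGB image.

    channels: equally sized grayscale images with values 0..255
    colors:   one (r, g, b) tuple per channel giving its display color
    """
    rows, cols = len(channels[0]), len(channels[0][0])
    merged = []
    for r in range(rows):
        merged_row = []
        for c in range(cols):
            pixel = [0, 0, 0]
            for image, color in zip(channels, colors):
                for k in range(3):
                    # Scale the display color by the channel intensity.
                    pixel[k] += image[r][c] * color[k] // 255
            # Saturate where several stains overlap.
            merged_row.append(tuple(min(255, v) for v in pixel))
        merged.append(merged_row)
    return merged
```

With three channels mapped to red, green and blue, nuclei expressing two genes at once appear in the corresponding mixed color, which is what makes overlapping expression domains visible in a single image.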

Application program

Figure 7 presents the result of combining three single stained images of embryo cq7 displaying the expression patterns of the even-skipped, Kruppel and giant genes. The program consists of 7 modules: three display intermediate results and four represent Image Server (IS) calls. The calls Is01Zoom 01cq7.1 eve, Is01Zoom 01cq7.2Kr, and Is01Zoom 01cq7.3gt extract the single stained images from the database and scale them by 10% of the original size for display, while the call Is01Zoom 01cq7 combines the extracted images and scales the resulting multiple stained image by 10% of the original size for display.

Figure 7. Result of combining three single stained images of embryo cq7.

Processing of quantitative data on expression of kni in embryos kw1 and kw10

Problem statement

Quantitative gene expression data are processed in several steps: data normalization, registration and averaging. During the normalization step, quantitative gene expression data are rescaled in order to remove distortions caused by the presence of background signal. To eliminate small individual differences between embryos, the normalized data are then subjected to registration. The registration method is based on the extraction of characteristic features in each image, called ground control points (GCPs), and the application of a coordinate transformation that makes the corresponding GCPs in different images coincide as closely as possible. The GCPs are extracted by quadratic spline approximation.
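
The registration step can be illustrated with a deliberately simple case. The sketch assumes that the GCPs are positions along one axis and that the coordinate transformation is a one-dimensional linear map x' = a*x + b fitted by least squares; the text does not specify the class of transformation, so this is one possible reading. GCP extraction by spline approximation is not shown, and the function names are hypothetical.

```python
def fit_linear_map(source, target):
    """Least-squares a, b minimizing sum((a*s + b - t)**2) over paired GCPs.

    Requires at least two distinct source positions.
    """
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    variance = sum((s - mean_s) ** 2 for s in source)
    covariance = sum((s - mean_s) * (t - mean_t)
                     for s, t in zip(source, target))
    a = covariance / variance
    b = mean_t - a * mean_s
    return a, b


def register(profile, a, b):
    """Apply the fitted map to the coordinate of every (x, value) data point."""
    return [(a * x + b, value) for x, value in profile]
```

The map is fitted on the GCPs only and then applied to all data points of the embryo, so that expression values themselves are left unchanged and only their coordinates move.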

Application program

The program (Figure 8) sequentially performs the removal of background signal, the excision of a 10% strip along the anteroposterior axis of an embryo, and data registration. At each processing step the data from both embryos are combined and displayed as a graph, and the resulting image is rescaled for display.

Figure 8. Result of the removal of background signal, excision of 10% strip and data registration.

Future plans

The current version of PIPE is a prototype of a software system supporting the distributed processing of images. Our immediate goal is to decide how to organize data and modules and to develop a user policy. We will continue to link new application programs to the system. We also plan to integrate our system with several third-party tools (e.g. the Khoros image processing package) and to develop a web-based interface.

References

  1. M. Levsky, S.M. Shenoy, R.C. Pezo, and R.H. Singer (2002). Single-cell gene expression profiling. Science, 297, 836-840.
  2. D. Kosman (2003). Multiplex fluorescent mRNA in situ hybridization. http://www.biology.ucsd.edu/~davek/index.html.
  3. J.R. Swedlow, I. Goldberg, E. Brauner, and P.K. Sorger (2003). Informatics and quantitative analysis in biological imaging. Science, 300, 100-102.
  4. A. Pisarev, E. Poustelnikova, M. Samsonova, and P. Baumann (2003). Mooshka: a system for management of multidimensional gene expression data in situ. Information Technologies, 28, 269-285.
  5. J.A. Campos-Ortega and V. Hartenstein (1985). The embryonic development of Drosophila melanogaster. Springer-Verlag: Berlin.
  6. V.A. Foe and B.M. Alberts (1983). Studies of nuclear and cytoplasmic behaviour during the five mitotic cycles that precede gastrulation in Drosophila embryogenesis. Journal of Cell Science, 61, 31-70.
  7. E. Wieschaus, C. Nusslein-Volhard, and G. Jurgens (1984). Mutations affecting the pattern of the larval cuticle in Drosophila melanogaster. III. Zygotic loci on the X-chromosome and fourth chromosome. Roux's Archives of Developmental Biology, 193, 296-307.
  8. C. Nusslein-Volhard, E. Wieschaus, and H. Kluding (1984). Mutations affecting the pattern of the larval cuticle in Drosophila melanogaster. I. Zygotic loci on the second chromosome. Roux's Archives of Developmental Biology, 193, 267-282.
  9. M. Akam (1987). The molecular basis for metameric pattern in the Drosophila embryo. Development, 101, 1-22.
  10. P.W. Ingham (1988). The molecular genetics of embryonic pattern formation in Drosophila. Nature, 335, 25-34.
  11. E. Myasnikova, A. Samsonova, M. Samsonova, and J. Reinitz (2002). Support vector regression applied to the determination of the developmental age of a Drosophila embryo from its segmentation gene expression patterns. Bioinformatics, 18, S87-S95.
  12. P.T. Merrill, D. Sweeton, and E. Wieschaus (1988). Requirements for autosomal gene activity during precellular stages of Drosophila melanogaster. Development, 104, 495-509.
  13. A. Smola and B. Scholkopf (1998). A tutorial on support vector regression. NeuroCOLT2 Technical Report Series, NC2-TR-1998-030. http://www.neurocolt.com.
  14. E. Myasnikova, A. Samsonova, K. Kozlov, M. Samsonova, and J. Reinitz (2001). Registration of the expression patterns of Drosophila segmentation genes by two independent methods. Bioinformatics, 17(1), 3-12.