
              PHYLIP (Phylogeny Inference Package) Version 3.57c 

                             by Joseph Felsenstein 

                                  July, 1995 

                               COPYRIGHT NOTICE 

(c) Copyright 1986-1995 by Joseph Felsenstein and the University of Washington.

Permission is granted to copy this document provided that no fee is charged for

it and that this copyright notice is not removed. 

                           CONTENTS OF THIS DOCUMENT 

   Copyright notice 

   Contents of this document 

   General description of PHYLIP 

   Contents of this package 

   What the programs do 

   Overview of the input and output formats

      Input File Format

      The Options Menu

      The Output File

      The Tree File 

   The Options and How to Invoke Them

      Options Information in the Input File

      Common Options in the Menu

         The U (User Tree) option

         The G (Global) option

         The J (Jumble) option

         The O (Outgroup) option

         The T (Threshold) option

         The M (multiple data sets) option

         The option to write out the trees into a tree file

         The (0) terminal type option

      Common Options Requiring Information in the Input File

         The Weights option 

   The Algorithm for Constructing Trees

      Local Rearrangements

      Global Rearrangements

      Multiple Jumbles 

   Strategy for Finding the Best Tree 

   A Warning on Interpreting Results 

   Relative Speed of Different Programs and Machines

      Relative speed of the different programs

      Speed with different numbers of species

      Relative speed of different machines

      Published benchmarks 


   General Comments on Adapting the Package to Different Computer Systems

      Compiling the programs

      Using "make"

      Getting PHYLIP onto your microcomputer

      Microsoft Quick C and Microsoft C

      Turbo C++ for PCDOS

      Waterloo C/386

      Think C for Macintosh


      VMS VAX systems

      OpenVMS DEC Alpha systems


      IBM mainframes running CMS

      Other Computer Systems 

   Frequently Asked Questions

     "If I copied PHYLIP from a friend without you knowing, ...?"

     "How do I make a citation to the PHYLIP package ...?"

     "How do I bootstrap? Why has DNABOOT disappeared?"

     "How do I specify a multi-species outgroup? ..."

     "How do I force certain groups to remain monophyletic ...?"

     "How can I reroot one of the trees written out by PHYLIP?"

     "Why doesn't NEIGHBOR read my DNA sequences correctly?"

     "What do I do about deletions and insertions in my sequences?"

     "Why don't your parsimony programs print out branch lengths?"

     "Why can't your programs handle unordered multistate characters?"

     "Where can I get a printed version of the PHYLIP documents"

     "Why have I been dropped from your newsletter mailing list?"

     "How many copies of PHYLIP have been distributed?" 

   Additional Frequently Asked Questions, or:

      "Why didn't it occur to you to..."

     write these programs in Pascal?"

     forget about all those inferior systems and just develop PHYLIP for Unix?"

     write these programs in PROLOG (or Ada, or Modula-2, or Simula, or ...)?"

     include in the package a program to do the Distance Wagner method ... ?

     include in the package ordination methods and more clustering algorithms?"

     include in the package a program to do nucleotide sequence alignment ...?"

     send me the programs over the electronic network I use, BUTTERFLYNET?"

     let me log in to your computer in Seattle and copy the files ....?"

     send me a listing of your program?"

     write a magnetic tape in our computer center's favorite format ....?"

     give us a version of these in FORTRAN?" 

   New Features in Recent Versions 

   Coming Attractions, Future Plans 

   References for the Documentation Files 


   Other phylogeny programs available elsewhere

      Random Cladistics

      Wetzel/Huson programs

      Zharkikh programs 

   How You Can Help Me 

   In case of trouble 

              PHYLIP - Phylogeny Inference Package (version 3.5) 

     This is a FREE package of programs for inferring phylogenies and  carrying

out certain related tasks.  At present it contains 30 programs, which carry out

different algorithms on different kinds of data.  The programs in the package are:


      ---------- Programs for molecular sequence data ----------

  PROTPARS  Protein parsimony          DNAPARS   Parsimony method for DNA

  DNAMOVE   Interactive DNA parsimony  DNAPENNY  Branch and bound for DNA

  DNACOMP   Compatibility for DNA      DNAINVAR  Phylogenetic invariants

  DNAML     Maximum likelihood method  DNAMLK    DNA ML with molecular clock

  DNADIST   Distances from sequences   PROTDIST  Distances from proteins

  RESTML    ML for restriction sites   SEQBOOT   Bootstraps sequence data sets

      ----------- Programs for distance matrix data ------------

  FITCH     Fitch-Margoliash and least-squares methods

  KITSCH    Fitch-Margoliash and least squares methods with evolutionary clock

  NEIGHBOR  Neighbor-joining and UPGMA methods

      -------- Programs for gene frequencies and continuous characters -------

  CONTML    Maximum likelihood method  GENDIST  Computes genetic distances

  CONTRAST  Computes contrasts and correlations for comparative method studies

      ------------- Programs for 0-1 discrete state data -----------

  MIX       Wagner, Camin-Sokal, and mixed parsimony criteria

  MOVE      Interactive Wagner, C-S, mixed parsimony program

  PENNY     Finds all most parsimonious trees by branch-and-bound

  DOLLOP, DOLMOVE, DOLPENNY   same as preceding three programs, but for

     the Dollo and polymorphism parsimony criteria

  CLIQUE    Compatibility method       FACTOR    recode multistate characters

      ---------- Programs for plotting trees and consensus trees -------

  DRAWGRAM  Draws cladograms and phenograms on screens, plotters and printers

  DRAWTREE  Draws unrooted phylogenies on screens, plotters and printers

  CONSENSE  Majority-rule and strict consensus trees

  RETREE    Reroots, changes names and branch lengths, and flips trees 

There is also an Unsupported Division  containing  two  programs,  makeinf  and

ProtML, which were contributed by others and are maintained by their authors. 

The package includes extensive documentation files that provide the information

necessary to use and modify the programs. 

The programs are written in a very standard subset of C,  a  language  that  is

available on most computers (including microcomputers). The programs require no

modifications  to  run  on  most  machines:  for  example  they  work   without

modification  with  Microsoft  C,  Turbo  C,  Think  C,  and on the C compilers

available on Unix and VAX VMS systems.  C source code  is  distributed  in  the

regular  version  of  PHYLIP.  To use it, you must have a C compiler.  A Pascal

version can also be supplied on request.  Precompiled executables are available

for PCDOS, 386 PCDOS, 386 Windows, PowerMacs, and Macintoshes as described below.


NETWORK DISTRIBUTION:   The  package  is  available  by  "anonymous  ftp"  over

electronic networks (including the PCDOS, 386 PCDOS, 386 Windows, and Macintosh

executables) from evolution.genetics.washington.edu.  Contact me

by  electronic  mail  for details or start by fetching file pub/phylip/Read.Me.

European users may (or may not) get faster service from bioss.sari.ac.uk, which

mirrors  our  distribution.   Look in directory pub/phylogeny.  I can also send

the source code and documentation files (but not executables) over  Bitnet/EARN

and  other networks.   The easiest method of network distribution is to use our

World Wide Web site:


DISKETTE DISTRIBUTION:  The  package  is  also  distributed  in  a  variety  of 

microcomputer  diskette  formats.  You should send FORMATTED diskettes, which I

will return with the package written on them.  See below for how many diskettes

to send.  The source code of the programs on the electronic network or magnetic

tape versions may of course also be moved to microcomputers and compiled there. 

PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS,  386  Windows,

386  PCDOS,  and  Macintosh  systems  are  available from me.  Specify the "386

Windows executable version", "386 PCDOS executable version", "PCDOS  executable

version"  or  "Macintosh  executable  version" and send the number of diskettes

indicated below.  Source code sent will be in C unless you specify Pascal. 

HOW MANY DISKETTES TO SEND: The following table shows for different formats how

many diskettes to send, and how many extra diskettes to send for the executables:


  Diskette size     Density   For source code    For executables send

                              and documentation      in addition

  3.5 inch PCDOS     1.44 Mb         1                     3

  5.25 inch PCDOS    1.2 Mb          1                     3

  Macintosh          High density    1                     1

Some other formats are also available.  You MUST tell me EXACTLY which of these

formats  you need.  The diskettes MUST be formatted by you before being sent to

me.  Sending an extra diskette may be helpful. 

POLICIES: The package is distributed free.  It will be written on the diskettes

or tape, which will be mailed back.  They can be sent to: 

                                         Joe Felsenstein

                                         Department of Genetics

                                         University of Washington

Electronic mail addresses:               Box 357360

         joe@genetics.washington.edu     Seattle, Washington 98195-7360, U.S.A. 

                           CONTENTS OF THIS PACKAGE 

     The source code and documentation of the package  consists  of  87  files,

plus  4  more  for the programs in the Unsupported Division.  In the electronic

mail version some of these files may be split into parts, so there may be more.

The  package  is  organized  into  three  major  parts,  the  source  code, the

documentation, and the unsupported programs.  The  documentation  is  organized

hierarchically,  with groups of documentation files for different kinds of data

each preceded by a documentation file for the group as well.  The  "unsupported

division"  of PHYLIP contains programs contributed by others (and not supported

by us) that we feel may be of use to you. 

  Files               Contents

  ----                --------

    1    README          -- describes the contents of the package

    2    main.doc        -- this general documentation file

The Source code

    3    Makefile        -- the "Makefile" for C compilers that have "make"

    4    Makefile.qc     -- the Makefile for Microsoft C and Quick C

    5    Makefile.tc     -- the Makefile for Borland Turbo C and Borland C

    6    phylip.h        -- the PHYLIP "header file"

    7    compile.com     -- a VMS command file to compile all of PHYLIP

    8    vaxfix.c        -- procedures needed to fix VMS printf(" %hd ")

    9    protpars.c      -- parsimony for protein sequence data

   10    dnapars.c       -- DNA parsimony program

   11    dnamove.c       -- interactive DNA parsimony

   12    dnapenny.c      -- branch and bound method for DNA

   13    dnacomp.c       -- DNA compatibility program

   14    dnainvar.c      -- computation of Lake's and Cavender's invariants

   15    dnaml.c         -- DNA maximum likelihood program, part 1

   16    dnaml2.c        -- DNA maximum likelihood program, part 2

   17    dnamlk.c        -- DNA maximum likelihood with molecular clock

   18    dnamlk2.c       -- DNA maximum likelihood with clock, part 2

   19    dnadist.c       -- computes distance matrix from sequences

   20    protdist.c      -- computes distance matrix from sequences

   21    restml.c        -- maximum likelihood for restriction sites

   22    restml2.c       -- maximum likelihood for restriction sites, part 2

   23    seqboot.c       -- makes multiple data sets by bootstrap resampling

   24    fitch.c         -- Fitch-Margoliash and least-squares methods

   25    kitsch.c        -- F-M, L-S methods with evolutionary clock

   26    neighbor.c      -- neighbor-joining and UPGMA methods

   27    contml.c        -- maximum likelihood program

   28    gendist.c       -- computes genetic distances

   29    contrast.c      -- contrasts etc. for comparative method studies

   30    mix.c           -- Wagner, Camin-Sokal parsimony and mixtures, part 1

   31    mix2.c          -- Wagner, Camin-Sokal parsimony and mixtures, part 2

   32    move.c          -- interactive Wagner, Camin-Sokal and mixed parsimony

   33    penny.c         -- finds all most parsimonious trees

   34    dollop.c        -- Dollo and polymorphism parsimony methods

   35    dolmove.c       -- interactive Dollo and polymorphism parsimony

   36    dolpenny.c      -- branch and bound for Dollo, polymorphism

   37    clique.c        -- compatibility program

   38    factor.c        -- recode multistate to binary characters

   39    drawgraphics.h  -- header file for drawgraphics.c

   40    drawgraphics.c  -- routines used in both drawgram.c and drawtree.c

   41    interface.h     -- header for Mac interface

   42    interface.c     -- Mac routines used in Mac interface

   43    drawgram.c      -- makes plots of cladograms, phenograms

   44    drawtree.c      -- makes plots of unrooted phylogenies

   45    font1           -- digitized font (simple sans-serif Roman)

   46    font2           -- digitized font (medium quality sans-serif Roman) 

   47    font3           -- digitized font (high quality serifed Roman)

   48    font4           -- digitized font (medium quality sans-serif Italic)

   49    font5           -- digitized font (high quality serifed Italic)

   50    font6           -- digitized font (Russian Cyrillic)

   51    consense.c      -- majority-rule and strict consensus trees

   52    retree.c        -- reroots, rearranges and changes lengths on trees

The Documentation

   53    sequence.doc    -- documentation for molecular sequence programs

   54    protpars.doc      -- documentation for protpars.c

   55    dnapars.doc       -- documentation for dnapars.c

   56    dnamove.doc       -- documentation for dnamove.c

   57    dnapenny.doc      -- documentation for dnapenny.c

   58    dnacomp.doc       -- documentation for dnacomp.c

   59    dnainvar.doc      -- documentation for dnainvar.c

   60    dnaml.doc         -- documentation for dnaml.c and dnaml2.c

   61    dnamlk.doc        -- documentation for dnamlk.c and dnamlk2.c

   62    dnadist.doc       -- documentation for dnadist.c

   63    protdist.doc      -- documentation for protdist.c

   64    restml.doc        -- documentation for restml.c and restml2.c

   65    seqboot.doc       -- documentation for seqboot.c

   66    distance.doc   -- documentation for distance matrix programs

   67    fitch.doc         -- documentation for fitch.c

   68    kitsch.doc        -- documentation for kitsch.c

   69    neighbor.doc      -- documentation for neighbor.c

   70    contchar.doc   -- documentation for gene frequency

                             and continuous character programs

   71    contml.doc        -- documentation for contml.c

   72    gendist.doc       -- documentation for gendist.c

   73    contrast.doc      -- documentation for contrast.c

   74    discrete.doc    -- documentation for discrete character programs

   75    mix.doc           -- documentation for mix.c

   76    move.doc          -- documentation for move.c

   77    penny.doc         -- documentation for penny.c

   78    dollop.doc        -- documentation for dollop.c

   79    dolmove.doc       -- documentation for dolmove.c

   80    dolpenny.doc      -- documentation for dolpenny.c

   81    clique.doc        -- documentation for clique.c

   82    factor.doc        -- documentation for factor.c

   83    draw.doc       -- documentation for tree plotting programs

   84    drawgram.doc      -- documentation for drawgram.c

   85    drawtree.doc      -- documentation for drawtree.c

   86    consense.doc   -- documentation for consense.c

   87    retree.doc     -- documentation for retree.c

The Unsupported Division

   88    makeinf.doc    -- documentation for makeinf (by Arend Sidow)

   89    makeinf.c      -- C source for makeinf

   90    protml.doc     -- documentation for ProtML (by Adachi and Hasegawa)

   91    protml.pas     -- Pascal source for ProtML 

                             WHAT THE PROGRAMS DO 

Here is a short description  of  each  of  the  programs.   For  more  detailed

discussion you should definitely read the documentation file for the individual

program and the documentation file for the group of programs it is in. 

PROTPARS.  Estimates phylogenies from protein sequences (input using the

   standard one-letter code for amino acids) using the parsimony method, in

   a variant which counts only those nucleotide changes that change the amino

   acid, on the assumption that silent changes are more easily accomplished. 

DNAPARS.  Estimates phylogenies by the parsimony method using nucleic acid

   sequences.  Allows use of the full IUB ambiguity codes, and estimates

   ancestral nucleotide states.  Gaps treated as a fifth nucleotide state. 

DNAMOVE.  Interactive construction of phylogenies from nucleic acid sequences,

   with their evaluation by parsimony and compatibility and the display of

   reconstructed ancestral bases.  This can be used to find parsimony or

   compatibility estimates by hand. 

DNAPENNY.  Finds all most parsimonious phylogenies for nucleic acid sequences

   by branch-and-bound search.  This may not be practical (depending on the

   data) for more than 10 or 11 species. 
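To see why exhaustive search becomes impractical at about this size, one can
count the possible trees.  The sketch below is in Python and is not part of
PHYLIP; the function name is made up for illustration.  It relies on the
standard result that there are (2n-5)!! unrooted bifurcating trees on n
labelled species.  Branch-and-bound avoids examining most of them, but the
totals show how fast the search space grows.

```python
def count_unrooted_trees(n_species):
    """Number of unrooted bifurcating trees on n labelled tips: (2n-5)!!."""
    if n_species < 3:
        return 1
    count = 1
    for k in range(3, 2 * n_species - 4, 2):   # 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 8, 10, 11):
    print(n, count_unrooted_trees(n))
```

Going from 10 to 11 species multiplies the number of trees by 17, which is why
the practical limit depends so strongly on how well the bound prunes the search
for a particular data set.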

DNACOMP.   Estimates phylogenies from nucleic acid sequence data using the

   compatibility criterion, which searches for the largest number of sites

   which could have all states (nucleotides) uniquely evolved on the same

   tree.  Compatibility is particularly appropriate when sites vary greatly in

   their rates of evolution, but we do not know in advance which are the less

   reliable ones. 

DNAINVAR.  For nucleic acid sequence data on four species, computes Lake's and

   Cavender's phylogenetic invariants, which test alternative tree topologies.

   The program also tabulates the frequencies of occurrence of the different

   nucleotide patterns.  Lake's invariants are the method which he calls

   "evolutionary parsimony". 

DNAML.   Estimates phylogenies from nucleotide sequences by maximum

   likelihood.  The model employed allows for unequal expected frequencies of

   the four nucleotides, for unequal rates of transitions and transversions,

   and for different (prespecified) rates of change in different categories of

   sites, with the program inferring which sites have which rates. 

DNAMLK.   Same as DNAML but assumes a molecular clock.  The use of the

   two programs together permits a likelihood ratio test of the

   molecular clock hypothesis to be made. 
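A minimal sketch of that likelihood ratio test, in Python, assuming both
programs were run on the same data and evaluated on the same tree topology.
The function name and the log-likelihood values are hypothetical.  The degrees
of freedom are taken as n-2 for n species, the difference between the 2n-3
branch lengths of an unrooted tree and the n-1 node times of a clocklike one;
treat that as an assumption to check against the DNAMLK documentation file.

```python
def clock_likelihood_ratio(lnl_without_clock, lnl_with_clock, n_species):
    """Return (test statistic, degrees of freedom) for the clock test."""
    statistic = 2.0 * (lnl_without_clock - lnl_with_clock)
    # (2n - 3) free branch lengths without a clock, (n - 1) node times with one.
    degrees_of_freedom = n_species - 2
    return statistic, degrees_of_freedom


# Hypothetical log-likelihoods for 7 species; real values would be read from
# the DNAML and DNAMLK output files.
statistic, df = clock_likelihood_ratio(-1500.0, -1504.5, 7)
print(statistic, df)
```

The statistic is then compared with a chi-square distribution with that many
degrees of freedom; the approximation is asymptotic, so it should be read with
caution for short sequences.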

DNADIST.  Computes four different distances between species from nucleic acid

   sequences.  The distances can then be used in the distance matrix programs.

   The distances are the Jukes-Cantor formula, one based on Kimura's 2-

   parameter method, Jin and Nei's distance which allows for rate variation

   from site to site, and a maximum likelihood method using the model employed

   in DNAML.  The latter method of computing distances can be very slow. 
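For orientation, the first two distances can be illustrated with their
textbook closed forms.  This Python sketch is not DNADIST's code, and DNADIST
may differ in detail (for example in how it treats ambiguity codes or the
transition/transversion ratio).  The function names are made up, and sites with
gaps or ambiguous bases are simply skipped.

```python
import math


def proportions(seq_a, seq_b):
    """Return (P, Q): proportions of transition and transversion differences."""
    purines = {"A", "G"}
    transitions = transversions = sites = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue                        # skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        if (x in purines) == (y in purines):
            transitions += 1                # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance: -(3/4) ln(1 - (4/3) p)."""
    p_ts, p_tv = proportions(seq_a, seq_b)
    return -0.75 * math.log(1.0 - 4.0 * (p_ts + p_tv) / 3.0)


def kimura_2p(seq_a, seq_b):
    """Kimura two-parameter distance: -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q))."""
    p_ts, p_tv = proportions(seq_a, seq_b)
    return -0.5 * math.log((1.0 - 2.0 * p_ts - p_tv) * math.sqrt(1.0 - 2.0 * p_tv))
```

Both formulas fail (the logarithm is undefined) when the sequences are so
different that the argument of the logarithm is zero or negative.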

PROTDIST.  Computes a distance measure for protein sequences, using maximum

   likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983

   approximation to it, or a model based on the genetic code plus a

   constraint on changing to a different category of amino acid.  The

   distances can then be used in the distance matrix programs. 

RESTML.  Estimation of phylogenies by maximum likelihood using restriction

   sites data (not restriction fragments but presence/absence of individual

   sites).  It employs the Jukes-Cantor symmetrical model of nucleotide

   change, which does not allow for differences of rate between transitions

   and transversions.  This program is VERY slow. 

SEQBOOT.  Reads in a data set, and produces multiple data sets from

   it by bootstrap resampling.  Since most programs in the current version of

   the package allow processing of multiple data sets, this can be used

   together with the consensus tree program CONSENSE to do bootstrap (or

   delete-half-jackknife) analyses with most of the methods in this package.

   This program also allows the Archie/Faith technique of permutation of 

   species within characters. 
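The resampling idea can be sketched as follows.  This is an illustration in
Python, not SEQBOOT's own code; the function names and the dictionary
representation of the data set are assumptions.  Each replicate draws as many
sites as the original data set, with replacement, and keeps whole columns
together so that every species receives the same resampled sites.

```python
import random


def bootstrap_replicate(sequences, rng):
    """Resample alignment columns (sites) with replacement."""
    n_sites = len(next(iter(sequences.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in sequences.items()}


def bootstrap_data_sets(sequences, n_replicates, seed=1):
    """Return a list of bootstrap replicates of the given data set."""
    rng = random.Random(seed)
    return [bootstrap_replicate(sequences, rng) for _ in range(n_replicates)]
```

Each replicate would then be analyzed by a tree-building program with its
multiple data sets option, and the resulting trees summarized by CONSENSE.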

FITCH.  Estimates phylogenies from distance matrix data under the "additive

   tree model" according to which the distances are expected to equal the sums

   of branch lengths between the species.  Uses the Fitch-Margoliash criterion

   and some related least squares criteria.  Does not assume an evolutionary

   clock.  This program will be useful with distances computed from DNA

   sequences, with DNA hybridization measurements, and with genetic distances

   computed from gene frequencies. 
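As an illustration of the criterion being minimized, the sketch below (Python,
with hypothetical function names) computes the weighted sum of squares between
observed distances D and the distances d expected from a tree, summing
(D - d)^2 / D^P over pairs of species.  P = 2 gives the Fitch-Margoliash
criterion and P = 0 gives unweighted least squares.  The expected distances are
shown only for the three-species case, where the tree is a star with one
branch per species; FITCH itself also searches over topologies and branch
lengths, which this sketch does not attempt.

```python
def path_lengths_three_species(a, b, c):
    """Expected distances on the three-species tree with branch lengths a, b, c."""
    return [[0.0, a + b, a + c],
            [a + b, 0.0, b + c],
            [a + c, b + c, 0.0]]


def sum_of_squares(observed, expected, power=2.0):
    """Sum over pairs i < j of (D_ij - d_ij)^2 / D_ij^power."""
    total = 0.0
    n = len(observed)
    for i in range(n):
        for j in range(i + 1, n):
            total += (observed[i][j] - expected[i][j]) ** 2 / observed[i][j] ** power
    return total


observed = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
print(sum_of_squares(observed, path_lengths_three_species(1, 2, 3)))   # perfect fit
print(sum_of_squares(observed, path_lengths_three_species(1, 1, 3)))   # worse fit
```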

KITSCH.  Estimates phylogenies from distance matrix data under the

   "ultrametric" model which is the same as the additive tree model except

   that an evolutionary clock is assumed.  The Fitch-Margoliash criterion and

   other least squares criteria are assumed.  This program will be useful with

   distances computed from DNA sequences, with DNA hybridization measurements,

   and with genetic distances computed from gene frequencies. 

NEIGHBOR.  An implementation by Mary Kuhner and John Yamato of Saitou and

   Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage

   clustering) method.  Neighbor Joining is a distance matrix method producing

   an unrooted tree without the assumption of a clock.  UPGMA does assume a

   clock.  The branch lengths are not optimized by the least squares criterion

   but the methods are very fast and thus can handle much larger data sets. 

CONTML.  Estimates phylogenies from gene frequency data by maximum likelihood

   under a model in which all divergence is due to genetic drift in the

   absence of new mutations.  Does not assume a molecular clock.  An

   alternative method of analyzing this data is to compute Nei's genetic

   distance and use one of the distance matrix programs. 

GENDIST.  Computes one of three different genetic distance formulas from gene

   frequency data.  The formulas are Nei's genetic distance, the Cavalli-Sforza

   chord measure, and the genetic distance of Reynolds et al.  The

   former is appropriate for data in which new mutations occur in an infinite

   isoalleles neutral mutation model, the latter two for a model without

   mutation and with pure genetic drift.  The distances are written to a file

   in a format appropriate for input to the distance matrix programs. 
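As a small illustration of the first of these formulas, here is Nei's genetic
distance in Python.  This is a sketch, not GENDIST's code: the function name is
made up, and it assumes that the frequencies of all alleles at each locus are
supplied, none omitted.

```python
import math


def nei_distance(pop1, pop2):
    """Nei's genetic distance between two populations.

    pop1, pop2: lists of loci, each locus a list of allele frequencies,
    with loci and alleles in the same order in both populations.
    """
    cross = sum(p * q for l1, l2 in zip(pop1, pop2) for p, q in zip(l1, l2))
    self1 = sum(p * p for locus in pop1 for p in locus)
    self2 = sum(q * q for locus in pop2 for q in locus)
    return -math.log(cross / math.sqrt(self1 * self2))


# Hypothetical frequencies at two loci with two alleles each.
print(nei_distance([[0.9, 0.1], [0.5, 0.5]], [[0.7, 0.3], [0.4, 0.6]]))
```

The distance is undefined when the two populations share no alleles at all,
because the logarithm's argument is then zero.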

CONTRAST.  Reads a tree from a tree file, and a data set with continuous

   characters data, and produces the independent contrasts for those

   characters, for use in any multivariate statistics package.  Will also

   produce covariances, regressions and correlations between characters for

   those contrasts. 

MIX.   Estimates phylogenies by some parsimony methods for discrete character

   data with two states (0 and 1).  Allows use of the Wagner parsimony method,

   the Camin-Sokal parsimony method, or arbitrary mixtures of these.  Also

   reconstructs ancestral states and allows weighting of characters. 

MOVE.  Interactive construction of phylogenies from discrete character data

   with two states (0 and 1).  Evaluates parsimony and compatibility criteria

   for those phylogenies and displays reconstructed states throughout the

   tree.  This can be used to find parsimony or compatibility estimates by hand.


PENNY.  Finds all most parsimonious phylogenies for discrete-character data

   with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria

   using the branch-and-bound method of exact search.  May be impractical

   (depending on the data) for more than 10-11 species. 

DOLLOP.  Estimates phylogenies by the Dollo or polymorphism parsimony criteria 

   for discrete character data with two states (0 and 1).  Also reconstructs

   ancestral states and allows weighting of characters.  Dollo parsimony is

   particularly appropriate for restriction sites data; with ancestor states

   specified as unknown it may be appropriate for restriction fragments data. 

DOLMOVE.  Interactive construction of phylogenies from discrete character data

   with two states (0 and 1) using the Dollo or polymorphism parsimony

   criteria.  Evaluates parsimony and compatibility criteria for those

   phylogenies and displays reconstructed states throughout the tree.  This

   can be used to find parsimony or compatibility estimates by hand. 

DOLPENNY.  Finds all most parsimonious phylogenies for discrete-character data

   with two states, for the Dollo or polymorphism parsimony criteria using the

   branch-and-bound method of exact search.  May be impractical (depending on

   the data) for more than 10-11 species. 

CLIQUE.  Finds the largest clique of mutually compatible characters, and the

   phylogeny which they recommend, for discrete character data with two

   states.  The largest clique (or all cliques within a given size range of

   the largest one) are found by a very fast branch and bound search method.

   The method does not allow for missing data.  For such cases the T

   (Threshold) option of MIX may be a useful alternative.  Compatibility

   methods are particularly useful when some characters are of poor quality and

   the rest of good quality, but when it is not known in advance which ones

   are which. 
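The notion of mutually compatible characters can be sketched as follows for
two-state characters whose ancestral states are not specified.  This Python
example is an illustration, not CLIQUE's algorithm: it uses the pairwise test
that two binary characters conflict only if all four state combinations occur
among the species, and then finds a largest clique by brute force rather than
by branch and bound.  The function names are hypothetical.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Two binary characters are compatible unless all four state pairs occur."""
    return len(set(zip(char_a, char_b))) < 4


def largest_clique(characters):
    """Brute-force search for a largest set of mutually compatible characters.

    characters: list of strings, one per character, giving the state ('0' or
    '1') of each species in a fixed species order.
    """
    indices = range(len(characters))
    for size in range(len(characters), 0, -1):
        for subset in combinations(indices, size):
            if all(compatible(characters[i], characters[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []


# Three characters scored on five species (made-up data).
print(largest_clique(["00011", "00111", "01010"]))
```

The brute-force search is exponential in the number of characters and is only
meant to make the definition concrete.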

FACTOR.  Takes discrete multistate data with character state trees and

   produces the corresponding data set with two states (0 and 1).  Written by

   Christopher Meacham. 

DRAWGRAM.  Plots rooted phylogenies, cladograms, and phenograms in a

   wide variety of user-controllable formats.  The program is

   interactive and allows previewing of the tree on PC graphics screens,

   and Tektronix or DEC graphics terminals.  Final output can be on

   a laser printer (such as the Apple Laserwriter or HP Laserjet),

   on graphics screens or terminals, on pen plotters (Hewlett-Packard or

   Houston Instruments) or on dot matrix printers capable of graphics

   (Epson, Okidata, Imagewriter, or Toshiba). 

DRAWTREE.  Similar to DRAWGRAM but plots unrooted phylogenies. 

CONSENSE.  Computes consensus trees by the majority-rule consensus tree

   method, which also allows one to easily find the strict consensus tree.

   Does NOT compute the Adams consensus tree.  Trees are input in a tree file

   in standard nested-parenthesis notation, which is produced by many of the

   tree estimation programs in the package when the Y option is invoked.

   This program can be used as the final step in doing bootstrap analyses for

   many of the methods in the package. 
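The counting behind a majority-rule consensus can be sketched in a few lines.
In this Python illustration each tree is assumed to have been reduced already
to the set of species groups it contains (reading the nested-parenthesis tree
file is left out), and the function names are made up.  A group enters the
majority-rule tree if it occurs in more than half of the input trees, and the
strict consensus keeps only the groups found in every tree.

```python
from collections import Counter


def majority_rule(trees):
    """trees: list of trees, each a set of frozensets of species names.

    Returns a dict mapping each group found in more than half of the trees
    to the number of trees containing it.
    """
    counts = Counter(group for tree in trees for group in tree)
    threshold = len(trees) / 2.0
    return {group: n for group, n in counts.items() if n > threshold}


def strict_consensus(trees):
    """Groups present in every tree."""
    counts = Counter(group for tree in trees for group in tree)
    return {group for group, n in counts.items() if n == len(trees)}
```

In a bootstrap analysis the counts returned by the first function correspond
to the number of replicates that support each group.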

RETREE.  Reads in a tree (with branch lengths if necessary) and allows

   you to reroot the tree, to flip branches, to change species names and

   branch lengths, and then write the result out.  Can be used to convert

   between rooted and unrooted trees. 

Programs in the Unsupported Division 

     The Unsupported Division of PHYLIP consists of two programs that may be

useful to you and that have kindly been contributed by their authors.  Those

authors retain full copyright to their programs and

documentation  files.  They are provided in the PHYLIP source code distribution

but have not been provided as executables in the executables distribution.  All 

questions  about  these  programs  should  be  directed to their authors, whose

electronic mail addresses  and  regular  mail  addresses  are  given  in  their

documentation files. 

MAKEINF.  This program by Arend Sidow can be used to translate the output files

from Jotun Hein's popular multiple-sequence alignment program into PHYLIP input

files.  It also allows you to selectively analyze different codon positions and

different  organisms.   The  output  from  other  alignment programs can rather

easily be edited into a form that it will read. 

PROTML.  This large Pascal program from Jun Adachi and Masami Hasegawa  carries

out  maximum  likelihood  estimation of phylogenies from protein sequence data.

It is quite analogous to DNAML, but uses instead of a model for  DNA  evolution

the  PAM  matrix  model  of  Margaret Dayhoff.  Because of the larger number of

states (20 instead of 4) it is necessarily slower than DNAML by a large factor.

However  the  authors  have  adopted  a  different,  and  faster, rearrangement

strategy to search among tree topologies for the best one.  ProtML does not yet

incorporate  the  Categories feature of DNAML and DNAMLK which allows different

rates of evolution at different sites, without the user specifying  in  advance

which  site  has  which  rate  of  evolution.  For support, contact them at the

Internet  addresses  hasegawa@ism.ac.jp  and  adachi@sunmh.ism.ac.jp   at   the

Institute of Statistical Mathematics, Tokyo, Japan. 


                    OVERVIEW OF THE INPUT AND OUTPUT FORMATS 

     When you run most of these programs,  a  menu  will  appear  offering  you

choices  of  the various options available for that program.  The data that the

program reads should be in an input file called (in most cases)  "infile".   If

there is no such file the programs will ask you for the name of the input file.

Below we describe the input file format, and then the menu. 

Input File Format

----- ---- ------ 

     I have tried to adhere to a rather stereotyped input  and  output  format.

For the parsimony, compatibility and maximum likelihood programs, excluding the

distance matrix methods, the simplest version of the input file looks something

like this: 

   6   13







The first line of the input file contains the number of species and the

number of characters, in free format, separated by blanks (not by

commas).  The information for each species follows, starting with a

ten-character species name (which can include punctuation marks and blanks),

and continuing with the characters for that species.  In the

discrete-character, DNA and protein sequence programs the characters are each a

single letter or digit, sometimes separated by blanks.  In

the continuous-characters programs they are real numbers with decimal points,

separated by blanks: 

Latimeria  2.03  3.457  100.2  0.0  -3.7 
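The layout just described can be made concrete with a short sketch.  The Python
functions below are not part of PHYLIP and their names are hypothetical.  They
write and read the simplest form of the file for one-character states: a first
line with the two counts, then one line per species with the name occupying
exactly ten columns.  The species names and data in the usage example are
invented.

```python
def format_infile(data):
    """data: dict mapping species name -> string of one-character states."""
    n_chars = len(next(iter(data.values())))
    lines = ["%5d%5d" % (len(data), n_chars)]
    for name, states in data.items():
        lines.append(name[:10].ljust(10) + states)   # pad or cut the name to ten columns
    return "\n".join(lines) + "\n"


def read_infile(text):
    """Read the simple one-line-per-species form written by format_infile."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_chars = (int(tok) for tok in lines[0].split())
    data = {}
    for line in lines[1:1 + n_species]:
        name = line[:10].strip()                     # the name is the first ten columns
        states = "".join(line[10:].split())          # blanks between characters are ignored
        if len(states) != n_chars:
            raise ValueError("%s: expected %d characters, found %d"
                             % (name, n_chars, len(states)))
        data[name] = states
    return data


print(format_infile({"Alpha": "0110", "B. beta": "1100", "Gammatherium": "0011"}))
```

For the continuous-characters programs the text after the name would instead
be split on blanks and converted to real numbers.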

The conventions about continuing the data  beyond  one  line  per  species  are

different  between  the  molecular  sequence  programs  and  the  others.   The

molecular sequence programs can take the data  in  "aligned"  or  "interleaved"

format,  with  some  lines giving the first part of each of the sequences, then

lines giving the next part of each, and so on.  Thus the sequences  might  look

like this: 

   6   39













Note that in these sequences we have a blank  every  ten  sites  to  make  them

easier  to  read:  any such blanks are allowed.  The blank line which separates

the two groups of lines (the ones containing sites  1-20  and  ones  containing

sites  21-39)  may  or may not be present, but if it is, it should be a line of

zero length and not contain any extra blank characters (this is  because  of  a

limitation  of the current versions of the programs).  It is important that the

number of sites in each group be the same for all species (i.e., it will not be

possible to run the programs successfully if the first species line contains 20

bases, but the first line for the second species contains 21 bases). 
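
The interleaved layout can be sketched in Python as follows.  This is a
hypothetical reader, not the one used by the programs: the sample data are
invented, species names are assumed to appear only in the first group of
lines, and blank separator lines are skipped more leniently than the
programs themselves allow.

```python
NAME_WIDTH = 10


def read_interleaved(text):
    """Read an interleaved sequence file into {species name: sequence}."""
    lines = text.split("\n")
    nspecies, nsites = (int(field) for field in lines[0].split()[:2])
    # Blank separator lines carry no data; this sketch simply skips them.
    body = [line for line in lines[1:] if line.strip()]
    if len(body) % nspecies != 0:
        raise ValueError("each group must contain one line per species")
    names, seqs = [], [""] * nspecies
    for start in range(0, len(body), nspecies):
        for k, line in enumerate(body[start:start + nspecies]):
            if start == 0:
                # Only the first group carries the ten-column names.
                names.append(line[:NAME_WIDTH].rstrip())
                line = line[NAME_WIDTH:]
            seqs[k] += line.replace(" ", "")   # blanks within data are ignored
        # Every species must contribute the same number of sites per group.
        if len(set(len(seq) for seq in seqs)) != 1:
            raise ValueError("a group has unequal numbers of sites")
    if len(seqs[0]) != nsites:
        raise ValueError("expected %d sites, found %d" % (nsites, len(seqs[0])))
    return dict(zip(names, seqs))


# Hypothetical data, invented for this example.
SAMPLE = "\n".join([
    "   3   25",
    "Alpha     AACGTGGCCA AATTTCGGAT",
    "Beta      AAGGTCGCCA AACTTCGAAT",
    "Gamma     CATTTCGTCA CAAGTCGGAT",
    "",
    "CCGTA",
    "CCGTT",
    "CCGTA",
])

sequences = read_interleaved(SAMPLE)
```

The check after each group mirrors the requirement that the number of sites
in a group be the same for all species.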

     Alternatively, an option can be selected to take the data in  "sequential"

format,  with all of the data for the first species, then all of the characters

for the next species, and so on.  This  is  also  the  way  that  the  discrete

characters  programs  and  the  gene  frequencies  and  quantitative characters

programs want to read the data.  They do not allow the "interleaved" format. 

     In the sequential format, the character data can run on to a new  line  at

any  time  (except in a species name or in the case of continuous character and

distance matrix programs where you cannot go to a new line in the middle  of  a

real number).  Thus it is legal to have: 

Archaeopt 001100


or even: 



though note that the FULL ten characters of  the  species  name  MUST  then  be

present:  in  the above case there must be a blank after the "t".  In all cases

it is possible to put internal blanks between any of the character  values,  so

that 


Archaeopt 0011001101 0111011100 

is allowed. 
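
A minimal Python sketch of reading the sequential format is given here as
an illustration.  It is not the programs' own reader; the function name is
invented, and it handles only single-character states (discrete or sequence
data), not real numbers.

```python
NAME_WIDTH = 10


def read_sequential(text):
    """Read sequential-format data into {species name: character string}."""
    lines = text.split("\n")
    nspecies, nchars = (int(field) for field in lines[0].split()[:2])
    rows = iter(line for line in lines[1:] if line.strip())
    data = {}
    for _ in range(nspecies):
        line = next(rows)
        if len(line) < NAME_WIDTH:
            # The full ten columns of the name must be present.
            raise ValueError("species name must fill ten columns: %r" % line)
        name = line[:NAME_WIDTH].rstrip()
        states = line[NAME_WIDTH:].replace(" ", "")
        # The data may run on to further lines until all characters are read.
        while len(states) < nchars:
            states += next(rows).replace(" ", "")
        if len(states) != nchars:
            raise ValueError("too many characters for species %r" % name)
        data[name] = states
    return data


# Hypothetical two-species data set with 13 characters each.
example = "\n".join([
    "   2   13",
    "Archaeopt 001100",
    "1101100",
    "Beta      0011001 101100",
])
print(read_sequential(example))
```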

If you make an error in the input file, the programs  will  often  detect  that

they have been fed an illegal character or illegal numerical value and issue an

error message such as "BAD CHARACTER STATE:", often printing out the bad value,

and  sometimes  the  number  of the species and character in which it occurred.

The program will then stop shortly after.  One of the things which can lead  to

a  bad value is the omission of something earlier in the file, or the insertion

of something superfluous, which cause the reading of the file  to  get  out  of

synchronization.   The program then starts reading things it didn't expect, and

concludes that they are in error.  So if you see this error  message,  you  may

also want to look for the earlier problem that may have led to this. 

     The other major  variation  on  the  input  data  format  is  the  options

information.   Many options are selected using the menu, but a few are selected

by including extra information in the input file.  Some options  are  described

below. 


The Options Menu

--- ------- ---- 

     The menu is straightforward.  It typically looks like this  (this  one  is

for DNAPARS): 

DNA parsimony algorithm, version 3.57c 

Setting for this run:

  U                 Search for best tree?  Yes

  J   Randomize input order of sequences?  No. Use input order

  O                        Outgroup root?  No, use as outgroup species  1

  T              Use Threshold parsimony?  No, use ordinary parsimony

  M           Analyze multiple data sets?  No

  I          Input sequences interleaved?  Yes

  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI

  1    Print out the data at start of run  No

  2  Print indications of progress of run  Yes

  3                        Print out tree  Yes

  4          Print out steps in each site  No

  5  Print sequences at all nodes of tree  No

  6       Write out trees onto tree file?  Yes 

Are these settings correct? (type Y or the letter for one to change) 

If you want to accept the default settings (they are shown in the  above  case)

you  can  simply  type "Y" followed by a carriage-return (Enter) character.  If

you want to change any of the options, you should type the letter shown to  the

left  of  its  entry  in  the  menu.  For example, to set a threshold type "T".

Lower-case letters will also work.  For many of the options  the  program  will

ask for supplementary information, such as the value of the threshold. 

     Note the "Terminal type" entry, which you will  find  on  all  menus.   It

allows  you  to specify which type of terminal your screen is.  The options are

an IBM PC screen, an ANSI standard terminal (such as a DEC VT100), a DEC  VT52-

compatible  terminal,  such as a Zenith Z29, or no terminal type.  Choosing "0"

toggles among these four options in cyclical order, changing each time the  "0"

option is chosen.  If one of them is right for your terminal the screen will be

cleared before the menu is displayed.  If none works the "none"  option  should 

probably be chosen.  Keep in mind that VT-52 compatible terminals can freeze up

if they receive the screen-clearing commands for the  ANSI  standard  terminal!

If  this  is  a problem it may be helpful to recompile the program, setting the

constants near its beginning so that the program starts up with the VT52 option

set. 


     The other numbered options control  which  information  the  program  will

display  on  your  screen  or  on  the  output  files.   The  option  to "Print

indications of progress of run" will show information such as the names of  the

species  as they are successively added to the tree, and the progress of global

rearrangements.  You will usually want to see these  as  reassurance  that  the

program  is running and to help you estimate how long it will take.  But if you

are running the program "in background" as can  be  done  on  multitasking  and

multiuser  systems such as Unix, and do not have the program running in its own

window, you may want to turn this option off so that it does not  disturb  your

use of the computer while the program is running. 

The Output File

--- ------ ---- 

     Most of the programs write their  output  onto  a  file  called  (usually)

"outfile",  and  a  representation  of  the  trees  found  onto  a  file called

(usually) "treefile". 


     The exact contents of the output file vary from  program  to  program  and

also depend on which menu options you have selected.  For many programs, if you

select all possible output information, the output will consist of (1) the name

of  the  program and its version number, (2) the input information printed out,

(3) a series of phylogenies, some with associated  information  indicating  how

much change there was in each character or on each part of the tree.  A typical

rooted tree looks like this: 



        !                            !      +------------------Orang

        !                            +------4

        !                                   !  +---------Gorilla

  +-----3                                   +--6

  !     !                                      !    +---------Chimp

  !     !                                      +----5

--1     !                                           +-----Human

  !     !

  !     +-----------------------------------------------Mouse



The interpretation of the tree is fairly straightforward: it "grows" from  left

to  right.   The  numbers  at the forks are arbitrary and are used (if present)

merely to identify the forks.  In some of the programs asterisks ("*") are used

instead  of  numbers.   For many of the programs the tree produced is unrooted.

It is printed out in nearly the same form, but with a warning message: 

   remember: this is an unrooted tree! 

The warning message ("remember: ...") indicates that this is an  unrooted  tree

(mathematicians  still call this a tree, though some systematists unfortunately

use the term "network".  This conflicts with standard mathematical usage, which

reserves  the  name  "network"  for a completely different kind of graph).  The

root of this tree could be anywhere, say on the  line  leading  immediately  to

Mouse.  As an exercise, see if you can tell whether the following tree is or is 

not a different one from the above: 



   +---------4                                   +------------------Orang

   !         !                            +------3

   !         !                            !      !       +---------Chimp

---6         +----------------------------1      !  +----2

   !                                      !      +--5    +-----Human

   !                                      !         !

   !                                      !         +---------Gorilla

   !                                      !

   !                                      +-------------------Gibbon



   remember: this is an unrooted tree! 

(it is NOT different).  It is IMPORTANT also to realize that the lengths of the

segments  of  the  printed  tree  may  not  be  significant:  some may actually

represent branches of zero length, in the sense that there is no evidence  that

the  branches  are nonzero in length.  Some of the diagrams of trees attempt to

print branches approximately proportional to estimated branch lengths, while in

others  the  lengths are purely conventional and are presented just to make the

topology visible.  You will have to look  closely  at  the  documentation  that

accompanies  each  program  to see what it presents and what is known about the

lengths of the branches on the tree.  The  above  tree  attempts  to  represent

branch  lengths approximately in the diagram.  But even in those cases, some of

the smaller branches are likely to be artificially lengthened to make the  tree

topology clearer.  Here is what a tree from DNAPARS looks like, when no attempt

is made to make  the  lengths  of  branches  in  the  diagram  proportional  to

estimated branch lengths: 



           +--4  +--Chimp

           !  !

        +--3  +-----Gorilla

        !  !

     +--2  +--------Orang

     !  !

  +--1  +-----------Gibbon

  !  !

--6  +--------------Mouse



  remember: this is an unrooted tree! 

     Some of the parsimony programs in the package can print out a table of the

number of steps that different characters (or sites) require on the tree.  This

table may not be obvious at first.  A typical example looks like this: 

steps in each site:

         0   1   2   3   4   5   6   7   8   9


    0!       2   2   2   2   1   1   2   2   1

   10!   1   2   3   1   1   1   1   1   1   2

   20!   1   2   2   1   2   2   1   1   1   2

   30!   1   2   1   1   1   2   1   3   1   1

   40!   1 

The numbers across the top and down the  side  indicate  which  site  is  being

referred  to.   Thus  site  23 is column "3" of row "20" and has 1 step in this

tree. 
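
The row-and-column lookup can be expressed as a short Python sketch.  It
is an illustration only, and it assumes the fixed layout seen in the table
above: a row label ending in "!" followed by ten four-column fields.

```python
def parse_steps_table(rows):
    """Return {site number: steps} from the body rows of a steps table."""
    steps = {}
    for line in rows:
        label, _, cells = line.partition("!")
        base = int(label)                 # 0, 10, 20, ... down the side
        for column in range(10):          # 0 .. 9 across the top
            field = cells[4 * column: 4 * column + 4].strip()
            if field:                     # empty fields (site 0) are skipped
                steps[base + column] = int(field)
    return steps


# The body rows of the table shown above.
ROWS = [
    "    0!       2   2   2   2   1   1   2   2   1",
    "   10!   1   2   3   1   1   1   1   1   1   2",
    "   20!   1   2   2   1   2   2   1   1   1   2",
    "   30!   1   2   1   1   1   2   1   3   1   1",
    "   40!   1",
]

steps = parse_steps_table(ROWS)
print(steps[23])   # row "20", column "3"
```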


The Tree File

--- ---- ---- 

     In output from most programs, a representation of the tree is also written

into  the  tree  file (usually named "treefile").  The tree is specified by the

nested pairs of parentheses, enclosing names and separated by commas.  If there

are any blanks in the names, these must be replaced by the underscore character

"_".  Trailing blanks  in  the  name  may  be  omitted.   The  pattern  of  the

parentheses  indicates  the  pattern  of  the  tree  by  having  each  pair  of

parentheses enclose all the members of a monophyletic group.  The tree file for

the above tree would have its first line look like this: 


In the above tree the first fork separates the lineage  leading  to  Mouse  and

Bovine  from the lineage leading to the rest.  Within the latter group there is

a fork separating Gibbon from the rest, and so on.  The entire tree is enclosed

in  an outermost pair of parentheses.  The tree ends with a semicolon.  In some

programs such as DNAML, FITCH, and CONTML, the tree will be completely unrooted

and  specified  by  a  bottommost  fork  with  a  three-way  split,  with three

"monophyletic" groups separated by two commas: 


The three "monophyletic" groups here are A, (B,C,D),  and  (E,F).   The  single

three-way  split  corresponds to one of the interior nodes of the unrooted tree

(it can be any interior node).  The remaining forks are encountered as you move

out from that first node, and each then appears as a two-way split.  You should

check the documentation files for the particular programs you are using to  see

in  which of these forms you can expect the user tree to be.  Note that many

of the programs that estimate an unrooted tree produce trees in the treefile in

rooted  form!  This is done for reasons of arbitrary internal bookkeeping.  The

placement of the root is arbitrary. 

     For programs estimating branch lengths, these are given in  the  trees  in

the  tree  file as real numbers following a colon, and placed immediately after

the group descended from that branch.  Here  is  a  typical  tree  with  branch





Note that the tree may continue to a new line at any time except in the  middle

of  a  name  or the middle of a branch length, although in trees written to the

tree file this will only be done after a comma. 
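
To make the notation concrete, here is a small recursive parser written in
Python.  It is a sketch under stated assumptions rather than the parser
used by any of the programs: it handles only the subset described here
(names, nested parentheses, commas, optional branch lengths after a colon,
and a final semicolon), and the tree strings are invented examples.

```python
def parse_newick(text):
    """Parse a tree into nested (label, branch_length, children) tuples."""
    s = "".join(text.split())      # a tree may continue onto new lines
    if not s.endswith(";"):
        raise ValueError("tree must end with a semicolon")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                       # consume "("
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                       # consume ")"
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        label = s[start:pos].replace("_", " ")   # "_" stands for a blank
        length = None
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return (label, length, children)

    root = node()
    if pos != len(s) - 1:
        raise ValueError("unexpected text at position %d" % pos)
    return root


def tip_names(node):
    """List the names at the tips of a parsed tree, left to right."""
    label, _, children = node
    if not children:
        return [label]
    return [name for child in children for name in tip_names(child)]


unrooted = parse_newick("(A,(B,C,D),(E,F));")
print(len(unrooted[2]), tip_names(unrooted))
```

A bottommost fork with three children, as in the example string, is how the
completely unrooted form appears once parsed.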

     These representations of trees are a subset of  the  standard  adopted  on

June  24, 1986 at the annual meetings of the Society for the Study of Evolution

at a meeting (the final session in Newick's lobster restaurant  --  hence  its

name  --  the  Newick  standard)  of  an informal committee consisting of Wayne

Maddison (MacClade), David Swofford (PAUP), F. James  Rohlf  (NTSYS-PC),  Chris

Meacham  (COMPROB  and  plotting  programs),  James  Archie  (character  coding

program), William H.E. Day, and me.   This  standard  is  a  generalization  of

PHYLIP's  format, itself based on a well-known representation of trees in terms

of parenthesis patterns which has  been  around  for  almost  a  century.   The 

standard  is now employed by most phylogeny computer programs but unfortunately

has yet to be described in a formal published description. 

                      THE OPTIONS AND HOW TO INVOKE THEM 

     Most of the programs allow  various  options  that  alter  the  amount  of

information  the  program is provided or what it is to do with the information.

Most options are selected in the menu.  However a  few  are  specified  in  the

input file, or require part of their specification to be in the input file. 

Options Information in the Input File

------- ----------- -- --- ----- ---- 

     In such cases, the program is notified that an option has been invoked  by

the  presence of one or more letters after the last number on the first line of

the input file.  These letters may or may not be separated from each  other  by

blanks,  though  it  is usually necessary to separate them from the number by a

blank.  They can be in any order.  Thus to invoke options A and  W,  the  input

file starts with the line: 

   12   20 WA

or: 

   12   20 A W 

The options are described individually in the other documents of this  package.

For  the  options  that require information to be in the input file, additional

information must be provided.  For all but one of these,  this  information  is

provided  by  placing  a  line after the first line of the file, but before the

beginning of the species data.  The first character of that line  should  match

the  option  letter.   These  auxiliary  information lines can be in any order.

Thus if options A and W are both invoked, both of the  following  formats  (and

two others as well) are legal: 

   12   20 AW                            12   20  A W

A         0001111000                  Weights   00112221A0

Weights   00112221A0                  A         0001111000

(then the species information)        (then the species information) 
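
The rules for option letters can be sketched in Python.  This is an
illustration only: the function names are invented, and it assumes that
each in-file option contributes exactly one auxiliary line and that the
letters are separated from the second number by a blank.

```python
def parse_first_line(line):
    """Return (species, characters, option letters) from the first line."""
    fields = line.split()
    nspecies, nchars = int(fields[0]), int(fields[1])
    # Letters may be run together ("AW") or separated by blanks ("A W").
    options = [ch.upper() for ch in "".join(fields[2:])]
    return nspecies, nchars, options


def read_option_lines(lines, options):
    """Map each option letter to the data on its auxiliary line."""
    info = {}
    for line in lines[1:1 + len(options)]:    # these lines may be in any order
        letter = line[0].upper()              # first character names the option
        if letter not in options:
            raise ValueError("unexpected option line: %r" % line)
        info[letter] = line[10:].replace(" ", "")
    return info


lines = [
    "   12   20 AW",
    "A         0001111000",
    "Weights   00112221A0",
]
nspecies, nchars, options = parse_first_line(lines[0])
print(options, read_option_lines(lines, options))
```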

One of the options requires special discussion.  Many of the programs  have  in

their  menu  the option U, which signals that one or more user-defined trees is

to be provided for evaluation.  This "user tree" is supplied in the input  file

(not the tree file), but AFTER the species data, rather than before it. It does

not require any indication to be placed in the first line of the input file, as

do the options that place information before the species data.  After the data,

there is a line containing the number  of  user-defined  trees  being  defined.

Each  user-defined  tree  starts  on a new line.  It is in the same form as the

trees in the tree files mentioned above, namely the Newick (New Hampshire) standard.

Here is an example with one user-defined tree: 

    6   13

Archaeopt 0011001110000



B. virgini1111011101101






     In using the user tree option, check the pattern of parentheses carefully.

The  programs do not always detect whether the tree makes sense, and if it does

not there will probably be a crash (hopefully,  but  not  inevitably,  with  an

error message indicating the nature of the problem). 

Common Options in the Menu

------ ------- -- --- ---- 

     Seven options from the menu, the U (User tree), G (Global), J (Jumble),  O

(Outgroup), T (Threshold), M (multiple data sets), and the tree output options,

are used so widely that it is best to discuss them in this document. 

     (1) The U (User tree) option.  This option  toggles  between  the  default

setting,  which  allows  the  program to search for the best tree, and the User

tree setting, which reads a tree or trees ("user trees") from  the  input  file

and  evaluates  them.   The user trees must follow the other information in the

data set, and be preceded by a line specifying the number of  user  trees  that

are  to  be  evaluated.   Each  user  tree then is given in standard form, each

starting on a new line.  The form that the user trees must take is described in

some  detail  above, under the description of the program output of tree files.

In some cases a program may require that the trees fed in be rooted trees, even

though  the program cannot infer the placement of the root.  In those cases you

can place the root anywhere.  Program RETREE can be  used  to  convert  between

rooted and unrooted trees. 

     (2) The G (Global) option.  In  the programs which construct trees (except

for  NEIGHBOR,  the "...PENNY" programs and CLIQUE, and of course the "...MOVE"

programs where you construct the trees yourself), after all species  have  been

added to the tree a rearrangements phase ensues.  In most of these programs the

rearrangements are automatically global, which in this case means that subtrees

will  be  removed  from  the tree and put back on in all possible ways so as to

have a better chance of  finding  a  better  tree.   Since  this  can  be  time

consuming (it roughly triples the time taken for a run) it is left as an option

in some of the programs, specifically  CONTML,  FITCH,  and  DNAML.   In  these

programs  the  G menu option toggles between the default of local rearrangement

and global rearrangement.  The rearrangements are explained more below. 

     (3) The J (Jumble) option.  In most  of  the  tree  construction  programs

(except  for  the  "...PENNY"  programs  and  CLIQUE), the exact details of the

search of different trees depend on the order of input of  species.   In  these

programs  the J option enables you to tell the program to use a  random  number

generator to choose the input order of species.  This option is toggled on  and

off  by selecting option J in the menu.  The program will then prompt you for a

"seed" for the random number generator.  The seed should be an integer  between

1 and 32767, and should be of the form 4n+1, which means that it must give a remainder

of 1 when divided by 4.  This can be judged by looking at the last  two  digits

of  the  number.  Each different seed leads to a different sequence of addition

of species.  By simply changing the  random  number  seed  and  re-running  the 

programs  one can look for other, and better, trees.  If the seed entered is not

odd, the program will not proceed, but will prompt for another seed. 
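
The conditions on the seed are easy to check before starting a run.  The
Python sketch below is an illustration, not part of the programs, and its
function names are invented.  The last-two-digits shortcut works because
100 is divisible by 4, so only the final two digits affect the remainder.

```python
def seed_problems(seed):
    """Return a list of complaints about a proposed random number seed."""
    problems = []
    if not 1 <= seed <= 32767:
        problems.append("seed should be between 1 and 32767")
    if seed % 2 == 0:
        problems.append("seed must be odd")
    elif seed % 4 != 1:
        problems.append("seed should be of the form 4n+1")
    return problems


def is_4n_plus_1_by_last_digits(seed):
    """Apply the 4n+1 test using only the last two digits of the seed."""
    return (seed % 100) % 4 == 1


for candidate in (4321, 7, 10):
    print(candidate, seed_problems(candidate))
```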

     The Jumble option also causes the program to ask you how  many  times  you

want  to  restart  the  process.   If  you  answer 10, the program will try ten

different orders of species in constructing the trees, and the results  printed

out  will  reflect  this  entire  search process (that is, the best trees found

among all 10 runs will be printed out, not the best trees from each  individual

run). 


     (4) The O (Outgroup) option.  This specifies which species is to  be  used

to  root  the tree by having it become the outgroup.  This option is toggled on

and off by choosing O in the menu.  When it is on, the program will then prompt

for  the number of the outgroup (the species being taken in the numerical order

that they occur in the input file).   Responding  by  typing  "6"  and  then  a

carriage-return  (Enter) character indicates that the sixth species in the data

is the outgroup.  Outgroup-rooting will not  be  attempted  if  the  data  have

already  established a root for the tree from some other consideration, and may

not be if it is a user-defined tree, despite your invoking  the  option.   Thus

programs  such  as  DOLLOP  that  produce  only  rooted  trees do not allow the

Outgroup option.  It is also not available in KITSCH, DNAMLK, or CLIQUE.   When

it  is used, the tree as printed out is still listed as being an unrooted tree,

though the outgroup is connected to the bottommost node so that it is  easy  to

visually convert the tree into rooted form. 

     (5) The T (Threshold) option.  This sets a  threshold  such  that  if  the

number of steps counted in a character is higher than the threshold, it will be

taken to be the threshold value rather than the actual number  of  steps.   The

default  is  a  threshold  so high that it will never be surpassed.  The T menu

option toggles on and off asking the user to supply a threshold.   The  use  of

thresholds  to  obtain methods intermediate between parsimony and compatibility

methods is described in my 1981b paper. When the T  option  is  in  force,  the

program will prompt for the numerical threshold value.  This will be a positive

real number greater than 1.  In programs MIX, MOVE, PENNY,  PROTPARS,  DNAPARS,

DNAMOVE,  and  DNAPENNY, do not use threshold values less than or equal to 1.0,

as they have no meaning and lead to a tree which depends only on considerations

such  as the input order of species and not at all on the character state data!

In programs DOLLOP, DOLMOVE, and DOLPENNY the threshold should never be 0.0  or

less, for the same reason.  The T option is an important and underutilized one:

it is, for example, the only way in this package (except for  program  DNACOMP)

to do a compatibility analysis when there are missing data.   It is a method of

de-weighting characters that evolve rapidly.  I wish more people were aware  of

its properties. 
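
The effect of the threshold on the step count can be written out in a few
lines of Python.  This is a sketch of the rule as stated above, with an
invented function name and invented step counts; it is not taken from any
of the programs.

```python
def threshold_score(steps_per_character, threshold):
    """Total steps when each character counts at most `threshold` steps."""
    if threshold <= 1:
        # At 1.0 or less every varying character would cost the same on
        # every tree, so the data could no longer distinguish among trees.
        raise ValueError("threshold must be greater than 1")
    return sum(min(steps, threshold) for steps in steps_per_character)


steps = [1, 2, 5, 3]                         # hypothetical steps per character
print(threshold_score(steps, float("inf")))  # ordinary parsimony
print(threshold_score(steps, 2.5))           # rapidly evolving characters capped
```

A threshold only slightly above 1 charges nearly the same penalty to every
character that needs more than one step, which is how the method approaches
a compatibility analysis.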

     (6) The M (Multiple data sets) option.  In menu programs  there  is  an  M

menu  option  which allows one to toggle on the multiple data sets option.  The

program will ask you how many data sets it should expect.  The data  sets  have

the  same format as the first data set.  Here is a (very small) input file with

two five-species data sets: 

     5    6

Alpha     CCACCA

Beta      CCAAAA

Gamma     CAACCA

Delta     AACAAC

Epsilon   AACCCA

     5    6

Alpha     CACACA

Beta      CCAACC

Gamma     CAACAC

Delta     GCCTGG

Epsilon   TGCAAT 

The main use of this option will be to  allow  all  of  the  methods  in  these

programs  to  be bootstrapped.  Using the program SEQBOOT one can take any DNA,

protein, restriction sites, or binary character data set and make multiple data

sets  by  bootstrapping.   Trees  can  be produced for all of these using the M

option.  They will be written on the tree output file if that option is left in

force.   Then the program CONSENSE can be used with that tree file as its input

file.  The result is a majority rule consensus tree which can be used  to  make

confidence  intervals.  The present version of the package allows, with the use

of SEQBOOT and CONSENSE and the M option, bootstrapping of many of the  methods

in the package. 
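
The layout of a multiple-data-set file can be illustrated with a short
Python sketch that separates the sets again.  It is not part of the
package; the function name is invented, and it assumes that each species
occupies a single line, as in the small example above.

```python
def split_data_sets(text, ndatasets):
    """Split an input file into a list of {species name: data} mappings."""
    lines = [line for line in text.split("\n") if line.strip()]
    data_sets, pos = [], 0
    for _ in range(ndatasets):
        # Each data set begins with its own species/characters line.
        nspecies = int(lines[pos].split()[0])
        rows = lines[pos + 1: pos + 1 + nspecies]
        data_sets.append(
            {row[:10].rstrip(): row[10:].replace(" ", "") for row in rows}
        )
        pos += 1 + nspecies
    return data_sets


# The two five-species data sets shown above.
TWO_SETS = "\n".join([
    "     5    6",
    "Alpha     CCACCA",
    "Beta      CCAAAA",
    "Gamma     CAACCA",
    "Delta     AACAAC",
    "Epsilon   AACCCA",
    "     5    6",
    "Alpha     CACACA",
    "Beta      CCAACC",
    "Gamma     CAACAC",
    "Delta     GCCTGG",
    "Epsilon   TGCAAT",
])

first, second = split_data_sets(TWO_SETS, 2)
```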

     (7) The option to write out the trees into a tree  file.   This  specifies

that  you  want the program to write out the tree not only on its usual output,

but also onto a file in nested-parenthesis notation (as described above).  This

option  is  sufficiently useful that it is turned on by default in all programs

that allow it.  You can optionally turn it off  if  you  wish,  by  typing  the

appropriate  number  from  the  menu (it varies from program to program).  This

option is useful for creating tree files that can be  directly  read  into  the

plotting programs, the consensus tree program, and can be incorporated into the

input file to specify user-defined trees in many of the other programs. 

     (8) The (0) terminal  type  option.   The  program  will  default  to  one

particular  assumption  about your terminal (except in the case of Macintoshes,

the default will be an ANSI compatible terminal). You can alternatively  select

it to be either an IBM PC, a DEC VT52, or nothing.  This affects the ability of

the programs to clear the  screen  when  they  display  their  menus,  and  the

graphics  characters  used  to  display  trees  in  the programs DNAMOVE, MOVE,

DOLMOVE, and RETREE.  If you are running a PCDOS system and have  the  ANSI.SYS

driver  installed  in your CONFIG.SYS file, you may find that the screen clears

correctly even with the default setting of ANSI. 

Common Options Requiring Information in the Input File

------ ------- --------- ----------- -- --- ----- ---- 

     There are a number of options (Ancestor, Factors, Categories and  Weights)

that  are  specified  in the input file.  Some of them must also be selected in

the menu.  Of these, the Ancestor and  Factors  options  are  specific  to  the

Discrete  Characters  programs  and are described in their group document.  The

Categories option is specific to some of the molecular sequence programs and is

described  in  their group document.  The Weights option is used throughout the

package and is best introduced here. 

     This allows us to specify weights on the individual  characters.   Weights

are invoked by placing a W on the first line of the file.  The weights are then

specified by a line or lines which start with W and then have enough characters

or  blanks  to  complete  the  full length of a species name.  Then they have a

single character (0-9 or A-Z) for each character.  Thus they look like the data 

for a species: 

Weights   0001111001112 

or: 

W         1110000ZZZZZ1 

The weights cause a character to be counted as if it were n characters, where n

is  the  weight.   The  values 0-9 give weights 0 through 9, and the values A-Z

give weights 10 through 35.  By use of the weights  we  can  give  overwhelming

weight to some characters, and drop others from the analysis.  In the molecular

sequence programs only two values of the weights, 0 or 1 are allowed. 
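
The coding of the weights can be made explicit with a short Python sketch.
It is an illustration only; the function names are invented, and the step
counts in the usage line are hypothetical.

```python
import string

# "0".."9" stand for weights 0-9 and "A".."Z" for weights 10-35.
WEIGHT_SYMBOLS = string.digits + string.ascii_uppercase


def decode_weights(line):
    """Turn a weights line into a list of integer weights."""
    symbols = line[10:].replace(" ", "")   # skip the ten name columns
    return [WEIGHT_SYMBOLS.index(symbol) for symbol in symbols]


def weighted_steps(steps, weights):
    """Count each character as if it were `weight` copies of itself."""
    return sum(weight * count for count, weight in zip(steps, weights))


print(decode_weights("W         1110000ZZZZZ1"))
print(weighted_steps([1, 2, 3], [0, 1, 35]))
```

A weight of zero drops a character from the analysis, while "Z" gives it
the largest weight available.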

     Weights can be  used  to  analyze  different  subsets  of  characters  (by

weighting  the  rest  as  zero).   Alternatively,  in  the  discrete characters

programs they can be used to force a certain group to appear on  the  phylogeny

(in  effect confining consideration to only phylogenies containing that group).

This is done by adding an imaginary character that has 1's for the  members  of

the group, and 0's for all the other species.  That imaginary character is then

given the highest weight possible: the result will be that any  phylogeny  that

does  not  contain  that group will be penalized by such a heavy amount that it

will not (except in the most unusual circumstances) be considered.  Of  course,

the  new  character brings extra steps to the tree, but the number of these can

be calculated in advance and subtracted out of the  total  when  reporting  the

results.   This  use  of  weights is an important one, and one sadly ignored by

many users who could profit from it.  In the case  of  molecular  sequences  we

cannot  use  weights this way, so that to force a given group to appear we have

to add a large extra segment of sites to the molecule, with (say) A's for  that

group and C's for every other species. 

                     THE ALGORITHM FOR CONSTRUCTING TREES 


     All of the programs except FACTOR, DNADIST,  GENDIST,  DNAINVAR,  SEQBOOT,

CONTRAST, RETREE, and the plotting and consensus tree programs act to construct

an estimate of a phylogeny.  MOVE, DOLMOVE, and DNAMOVE let  you  construct  it

yourself  by  hand.   All of the rest but NEIGHBOR, the "...PENNY" programs and

CLIQUE make use of a common approach involving  additions  and  rearrangements.

They  are  trying  to  minimize or maximize some quantity over the space of all

possible evolutionary trees.  Each program contains  a  part  that,  given  the

topology  of  the  tree,  evaluates  the  quantity  that  is being minimized or

maximized.  The straightforward approach would be to evaluate all possible tree

topologies one after another and pick the one which, according to the criterion

being used, is best.  This would not be possible for more than a  small  number

of species, since the number of possible tree topologies is enormous.  A review

of the literature on the counting of evolutionary trees will be found in one of my

papers (Felsenstein, 1978a). 

     Since we cannot search all topologies, these programs are  not  guaranteed

to  always find the best tree, although they seem to do quite well in practice.

The strategy they employ is as follows: the species are taken in the  order  in

which they appear in the input file.  The first two (in some programs the first

three) are taken and a tree constructed containing only those.  There  is  only

one  possible  topology  for this tree.  Then the next species is taken, and we

consider where it might be added to the tree.  If the initial tree is  (say)  a

rooted tree with two species and we want the resulting three-species tree to be

a bifurcating tree, there are only three places where we could  add  the  third

species.  Each of these is tried, and each time the resulting tree is evaluated

according to the criterion.  The best one is chosen to be the basis for further 

operations.   Now  we  consider adding the fourth species, again at each of the

five possible places that would result in a bifurcating tree.  Again, the  best

of these is accepted. 
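
The counts of three and five places follow a simple pattern, which the
Python sketch below works out for rooted bifurcating trees.  It is an
illustration only, with invented function names.  A rooted tree with k-1
species has 2(k-1)-2 branches plus the position below the root, giving
2k-3 places for the k-th species.

```python
def rooted_placements(k):
    """Places where the k-th species can join a rooted tree of k-1 species."""
    return 2 * k - 3


def rooted_topologies(n):
    """Number of rooted bifurcating topologies for n species."""
    count = 1                      # two species: only one topology
    for k in range(3, n + 1):
        count *= rooted_placements(k)
    return count


for n in (3, 4, 10, 20):
    print(n, rooted_topologies(n))
```

The product grows so quickly that evaluating every topology is practical
only for a small number of species.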

Local Rearrangements

----- -------------- 

     The process continues in this manner, with one important exception.  After

each species is added, and before the next is added, a number of rearrangements

of the tree are tried, in an effort to improve it.  The algorithms move through

the  tree,  making  all  possible  local  rearrangements  of the tree.  A local

rearrangement involves an internal segment of the tree in the following manner.

Each  internal  segment  of  the tree is of this form (where T1, T2, and T3 are

subtrees -- parts of the tree that can contain further forks and tips): 

           T1      T2       T3

            \      /        /

             \    /        /

              \  /        /

               \/        /

                *       /

                 *     /

                  *   /

                   * /




the segment we are discussing  being  indicated  by  the  asterisks.   A  local

rearrangement  consists of switching the subtrees T1 and T3 or T2 and T3, so as

to obtain one of the following: 

          T3       T2      T1            T1       T3      T2

           \       /       /              \       /       /

            \     /       /                \     /       /

             \   /       /                  \   /       /

              \ /       /                    \ /       /

               \       /                      \       /

                \     /                        \     /

                 \   /                          \   /

                  \ /                            \ /

                   !                              !

                   !                              !

                   !                              ! 

Each time a local rearrangement is successful in finding a better tree, the new

arrangement  is accepted.  The phase of local rearrangements does not end until

the program can traverse the  entire  tree,  attempting  local  rearrangements,

without finding any that improve the tree. 
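     As a minimal sketch of the two alternatives shown  in  the  diagrams
above, assume that a rooted tree is written as nested Python tuples, so that
the segment is ((T1, T2), T3).  The function name is an illustration only and
is not taken from the PHYLIP source code.

```python
def local_rearrangements(segment):
    """Return the two local rearrangements of an internal segment.

    `segment` is assumed to have the form ((T1, T2), T3), where each Ti
    may itself be a nested tuple (a subtree) or a tip name.
    """
    (t1, t2), t3 = segment
    return [
        ((t3, t2), t1),  # T1 and T3 switched
        ((t1, t3), t2),  # T2 and T3 switched
    ]


for alternative in local_rearrangements((("A", "B"), "C")):
    print(alternative)
```

In a search, each alternative would be scored by the criterion in use and kept
only if it is better than the current tree.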

     This strategy of adding species and making local rearrangements will  look

at  about (n-1) times (2n-3) different topologies, though if rearrangements are

frequently successful the number may be larger.  I  have  been  describing  the

strategy when rooted trees are being considered.  For unrooted trees there is a

precisely similar strategy, though the first tree constructed may be  a  three-

species  tree  and the rearrangements may not start until after the addition of

the fifth species. 
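     To give a feel for the figure (n-1) times (2n-3), the sketch  below
evaluates it and compares it with the total number of rooted bifurcating
trees on n labelled species.  That total, the product 1 x 3 x 5 x ... x (2n-3),
is a standard result that is not stated in this section; the function names
are illustrative.

```python
def approx_topologies_examined(n):
    """Approximate number of topologies examined: (n-1) times (2n-3)."""
    return (n - 1) * (2 * n - 3)


def rooted_tree_count(n):
    """Number of rooted bifurcating trees on n labelled tips: (2n-3)!!"""
    count = 1
    for odd in range(3, 2 * n - 2, 2):  # 3, 5, ..., 2n-3
        count *= odd
    return count


for n in (5, 10, 20):
    print(n, approx_topologies_examined(n), rooted_tree_count(n))
```

For 10 species the search looks at roughly 153 topologies out of 34,459,425
possible ones, which is why it cannot guarantee finding the best tree.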

     Though we are not guaranteed to have found the best tree topology, we  are

guaranteed  that  no  nearby  topology (i.e., none accessible by a single local 

rearrangement) is better.  In this sense we have reached a local optimum of our

criterion.   Note that the whole process is dependent on the order in which the

species are present in the input file.  We can try  to  find  a  different  and

better  solution  by  reordering  the species in the input file and running the

program again (or, more easily, by using the  J  option).   If  none  of  these

attempts finds a better solution, then we have some indication that we may have

found the best topology, though we can never be certain of this. 

     Note also that a new topology is never accepted unless it is  better  than

the  previous  one,  so  that  the rearrangement process can never fall into an

endless loop.  This is also the way ties in our criterion are resolved,  namely

by sticking with the tree found first.  However, the tree construction programs

other than CLIQUE, CONTML, FITCH, and DNAML do keep a record of all trees found

that  are  tied with the best one found.  This gives you some immediate idea of

which parts of the tree can be altered without affecting  the  quality  of  the

result. 


Global Rearrangements

------ -------------- 

     A feature of most of the programs, such  as  PROTPARS,  DNAPARS,  DNACOMP,

and others, is global

optimization of the tree.  In four of these (CONTML, FITCH, DNAML  and  DNAMLK)

this  is  an  option, 'G'.  In the others it automatically applies.  When it is

present there is an additional stage to the search  for  the  best  tree.  Each

possible  subtree  is removed from the tree and added back in all

possible places.  This process continues until all subtrees can be removed  and

added  again  without  any  improvement in the tree.  The purpose of this extra

rearrangement is to make it less likely that one or more species get  "stuck"

in  a  suboptimal region of the space of all possible trees.  The use of global

optimization results in approximately a tripling (3x) of the run-time, which is

why I have left it as an option in some of the slower programs. 
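     The removal and re-adding of subtrees can be sketched as follows, again
assuming rooted trees written as nested tuples.  All names here are
hypothetical, and the sketch only enumerates candidate trees; it does not
score them or reproduce the bookkeeping of the actual programs.

```python
def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree):
            yield from subtrees(child, path + (i,))


def prune(tree, path):
    """Remove the subtree at `path`; its sibling takes the parent's place."""
    if len(path) == 1:
        return tree[1 - path[0]]
    children = list(tree)
    children[path[0]] = prune(tree[path[0]], path[1:])
    return tuple(children)


def graft(tree, path, sub):
    """Attach `sub` on the branch leading to the node at `path`."""
    if not path:
        return (tree, sub)
    children = list(tree)
    children[path[0]] = graft(tree[path[0]], path[1:], sub)
    return tuple(children)


def global_rearrangements(tree):
    """Yield every tree obtained by removing one subtree and re-adding it."""
    for path, sub in subtrees(tree):
        if not path:
            continue  # the whole tree cannot be removed from itself
        rest = prune(tree, path)
        for target, _ in subtrees(rest):
            yield graft(rest, target, sub)


candidates = list(global_rearrangements((("A", "B"), "C")))
print(len(candidates))
```

The list includes the starting topology several times, since a subtree can be
put back where it came from.  A real search would score each candidate and
accept it only if it improves the tree.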

     The programs doing global optimization print out  a  dot  "."  after  each

group  is removed and re-added to the tree, to give the user some sign that the

rearrangements are proceeding.  A new line of dots is started  whenever  a  new

round of global rearrangements is started following an improvement in the tree.

On the line before the dots are printed there is printed  a  bar  of  the  form

"!--------------!"  to  show  how  many  dots  to expect.  The dots will not be

printed out at a uniform rate, but the later dots, which represent  removal  of

larger  groups from the tree and trying them consequently in fewer places, will

print out more quickly.  With some compilers each row of dots  is  not  printed

out until it is complete. 

     It should be noted that PENNY, DOLPENNY, DNAPENNY and CLIQUE  use  a  more

sophisticated strategy of "depth-first search" with a "branch and bound" search

method that guarantees that all of the best trees will be found.  In  the  case

of  PENNY,  DOLPENNY  and  DNAPENNY  there  can  be a considerable sacrifice of

computer time if the number of species is greater  than  about  ten:  it  is  a

matter  for you to consider whether it is worth it for you to guarantee finding

all the most parsimonious trees, and that depends on  how  much  free  computer

time  you  have!   CLIQUE  finds all largest cliques, and does so without undue

burning of computer time. 
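     The idea of a depth-first search with branch and bound, which returns
all solutions tied for best, can be illustrated on a toy problem.  The sketch
below is generic and is not the tree search used by the "...PENNY" programs:
the callback names are assumptions, the lower bound must never exceed the
score of any completion of a partial solution, and for a complete solution it
must equal the actual score.

```python
def branch_and_bound(root, children, lower_bound, is_complete):
    """Depth-first branch and bound returning every solution tied for best."""
    best_score = float("inf")
    best = []
    stack = [root]
    while stack:
        node = stack.pop()
        bound = lower_bound(node)
        if bound > best_score:
            continue  # nothing below this node can tie or beat the best so far
        if is_complete(node):
            if bound < best_score:
                best_score, best = bound, [node]
            else:
                best.append(node)  # a tie with the best found so far
            continue
        stack.extend(children(node))
    return best_score, best


# Toy problem: pick one entry from each row so that the sum is smallest.
# A node is the tuple of column indices chosen so far.
rows = [[3, 1], [2, 2], [5, 4]]

score, solutions = branch_and_bound(
    root=(),
    children=lambda node: [node + (j,) for j in range(len(rows[len(node)]))],
    lower_bound=lambda node: sum(rows[i][j] for i, j in enumerate(node)),
    is_complete=lambda node: len(node) == len(rows),
)
print(score, sorted(solutions))
```

Pruning with a strict "greater than" comparison is what keeps tied solutions
from being discarded.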

Multiple Jumbles

-------- ------- 

     As just mentioned, for most of these programs the search  depends  on  the

order  in  which  the  species are entered into the tree.  Using the J (Jumble) 

option you can supply a random number seed which will allow the program to  put

the species in a random order.  A new feature (with version 3.5) is to  allow

this to be done multiple times.  If you tell the program to do it 10 times,  it

will  go  through  the  tree-building  process  10 times, each with a different

random order of adding species.  It will keep a record of the  trees  tied  for

best  over the whole process.  In other words, it does not just record the best

trees from each of the 10 runs, but records the best ones overall.   Of  course

this  is slow, taking 10 times longer than a single run.  But it does give us a

much greater chance of finding all of the  most  parsimonious  trees.   In  the

terminology  of  Maddison (1991) it can find different "islands" of trees.  The

present algorithms do not guarantee us to find all trees in  a  given  "island"

from  a single run, so multiple runs also help explore those "islands" that are

found. 



                      STRATEGY FOR FINDING THE BEST TREE 

     In practice, it is advisable to use the Jumble  option  to  evaluate  many

different  orderings of the input species.  When the programs which have global

branch-swapping as default (such as DNAPARS) are used or when the G  option  is

in effect, I RECOMMEND

THAT IT BE DONE MANY TIMES (AS MANY AS TEN) to use different orderings  of  the

input  species.   When the G (Global rearrangement) option is not being used I

have also found it useful to do multiple Jumbles. 
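     The bookkeeping behind multiple Jumbles can be sketched as below.  The
function build_tree is a hypothetical stand-in for one complete tree-building
run that returns a tree and its score, and the sketch assumes that a lower
score is better, as with parsimony.  It is not the code used by the programs.

```python
import random


def multiple_jumbles(species, build_tree, seed, times=10):
    """Run `build_tree` on `times` random orders; keep the trees tied for best overall."""
    rng = random.Random(seed)  # the seed makes the orders reproducible
    best_score, best_trees = None, []
    for _ in range(times):
        order = list(species)
        rng.shuffle(order)
        tree, score = build_tree(order)
        if best_score is None or score < best_score:
            best_score, best_trees = score, [tree]
        elif score == best_score and tree not in best_trees:
            best_trees.append(tree)
    return best_score, best_trees


def toy_build(order):
    # Hypothetical stand-in for a real run: the "tree" is just the addition
    # order, and the made-up score is the position at which "D" was added.
    return tuple(order), order.index("D")


print(multiple_jumbles(["A", "B", "C", "D"], toy_build, seed=1))
```

Note that the record of best trees is kept across all of the runs, not
separately for each run.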

     People who want a magic "black box" program whose results they do not have

to  question  (or think about) often are upset that these programs give results

that are dependent on the order in which the species are entered in  the  data.

To  me  this  property  is  an  advantage,  for it permits you to try different

searches for better trees, simply by varying the input order  of  species.   If

you  do  not  use  the  multiple Jumble option, but do multiple individual runs

instead, you can easily decide which to pay most attention to  --  the  one  or

ones  that  are  best  according  to  the criterion employed (for example, with

parsimony, the one out of the runs that results in the  tree  with  the  fewest

steps). 


     In practice, in a single run, it usually seems best to  put  species  that

are likely to be sources of confusion in the topology last, as by the time they

are added the arrangement of the earlier species will have  stabilized  into  a

good  configuration,  and  then  the  last few species will be fitted into that

topology.  There will be less chance this way of a poor initial  topology  that

would  affect  all  subsequent  parts  of  the  search.   However, a variety of

arrangements of the input order of species should be tried, as can be  done  if

the  J  option  is  used, and no species should be kept in a fixed place in the

order of input.  Note that the results of the "...PENNY"  programs  and  CLIQUE

are  not sensitive to the input order of species, and NEIGHBOR is only slightly

sensitive to it,  so  that  multiple  Jumbling  is  not  possible  with  those

programs.   Note  also  that  with  global  search,  which  is standard in many

programs and in others is an option,  each  group  (including  each  individual

species)  will  be  removed  and  re-added in all possible positions, so that a

species causing confusion will have more chance of moving  to  a  new  location

than it would without global rearrangement. 


                       A WARNING ON INTERPRETING RESULTS 

     Probably the most important thing to keep in mind while running any of the

parsimony  or  compatibility programs is not to overinterpret the result.  Many

users treat the set of most parsimonious trees  as  if  it  were  a  confidence 

interval.   If  a group appears in all of the most parsimonious trees then they

treat it  as  well  established.   Unfortunately  THE  CONFIDENCE  INTERVAL  ON

PHYLOGENIES APPEARS TO BE MUCH LARGER THAN THE SET OF ALL MOST PARSIMONIOUS

TREES (Felsenstein, 1985b).  Likewise,  variation  of  result  among  different

methods  will  not  be a good indicator of the size of the confidence interval.

Consider a simple data set in which, out of 100 binary characters, 51 recommend

the  rooted  tree  ((A,B),C) and 49 the tree (A,(B,C)).  Many different methods

will all give the same result on such a data set: they will estimate  the  tree

as  ((A,B),C).   Nevertheless  it  is clear that the 51:49 margin by which this

tree is favored is not significantly  different  from  50:50.   So  CONSISTENCY

OF RESULT AMONG DIFFERENT METHODS IS A POOR GUIDE TO STATISTICAL SIGNIFICANCE. 
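     The statement about the 51:49 margin can be checked with a small worked
example.  The sketch below treats the 100 characters as independent coin
tosses between the two trees, which is a simplifying assumption made only for
this illustration, and computes an exact two-sided binomial probability.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value for the hypothesis that each outcome has chance 1/2."""
    expected = trials / 2
    deviation = abs(successes - expected)
    # Count all outcomes at least as far from 50:50 as the one observed.
    extreme = sum(comb(trials, k) for k in range(trials + 1)
                  if abs(k - expected) >= deviation)
    return extreme / 2 ** trials


p = two_sided_binomial_p(51, 100)
print(round(p, 2))  # about 0.92
```

A split at least as uneven as 51:49 arises about 92% of the time under a true
50:50 ratio, so the margin gives no evidence for one tree over the other.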



               RELATIVE SPEED OF DIFFERENT PROGRAMS AND MACHINES 

Relative speed of the different programs

-------- ----- -- --- --------- -------- 

     C compilers differ in efficiency of the code they generate, and some  deal

with  some  features  of  the language better than with others.  Thus a program

which is unusually fast on one computer  may  be  unusually  slow  on  another.

Nevertheless,  as a rough guide to relative execution speeds, I have tested the

programs on three data sets, each of which has 10 species  and  20  characters.

The  first  is  an imaginary one in which all characters are  compatible  ("The

Willi Hennig Memorial Data Set" as J. S. Farris once called it).  The second is

the  binary  recoded  form  of  the  fossil  horses data set of Camin and Sokal

(1965).  The third data set has data that is completely random: 10 species  and

20  characters  with  a 50% chance that each character state is 0 or 1 (or A or

G).  The data sets range from a completely compatible one in which there is  no

homoplasy  (parallelism  or  convergence),  through  the horses data set, which

requires 29 steps where the possible minimum number would be 20, to the  random

data  set,  which  requires  49  steps.   We  can  thus see how this increasing

messiness of the data affects running times. 

     Here are the nucleotide sequence versions of the three data sets: 

   10   20











   10   20











   10   20











     Here are the timings of many of the version 3.5 programs  on  these  three

data sets as run after being compiled by Microsoft Quick C on a 16 MHz  80386SX

computer under PCDOS 5.0.  An 80387 math co-processor was present and was  used

by the compiled code. 

                 Hennigian Data    Horses Data        Random Data 

    PROTPARS         82.83              86.23             148.03

    DNAPARS           5.98               5.66              11.54

    DNAPENNY         46.03              23.51            5305.97

    DNACOMP           7.14               6.43              11.86

    DNAINVAR          0.61               0.66               0.61

    DNAML          1928.99            2069.32            2611.48

    DNAMLK         2247.12            6094.81            4993.00

    DNADIST           3.57               4.50               5.38

    RESTML         6818.34           13422.15           28418.34

    FITCH            35.92              48.61              38.17

    KITSCH           12.42              12.36              13.18

    NEIGHBOR          2.20               2.14               2.903

    CONTML           56.85              57.56              59.15

    GENDIST           1.00               1.00               1.00

    MIX              13.62              14.60              25.92

    PENNY             8.41              21.31            3851.1

    DOLLOP           26.69              26.86              46.30

    DOLPENNY         12.25              56.57           23934.22

    CLIQUE            0.77               0.71               0.77

    FACTOR            0.39               0.44               0.44 

In all cases the programs  were  run  under  the  default  options,  except  as

specified  here.   The data sets used for the discrete characters programs have

0's and 1's instead of A's and C's.  For CONTML the 0's and 1's were made  into

0.0's  and 1.0's and considered as 20 2-allele loci.  For the distance programs

10 x 10 distance matrices were computed from the three data sets.  It does  not

make much sense to benchmark MOVE, DOLMOVE, or DNAMOVE, although when there are

many characters and many species the response time after each alteration of the

tree  should  be  proportional  to the product of the number of species and the

number of characters.  For DNAML and DNAMLK the frequencies of the  four  bases

were set to be equal rather than determined empirically as is the default.  For

RESTML the number of enzymes was set to 1. 

     Several patterns will be apparent from this.  The algorithms (MIX, DOLLOP,

and others)

that use the above-described addition strategy  have  run  times  that  do  not

depend  strongly  on  the messiness of the data.  The only exception to this is

that if a data set such as the Random data requires one extra round  of  global

rearrangements  it  takes longer.  The programs differ greatly in run time: the

likelihood programs RESTML, DNAML and CONTML are quite a bit  slower  than  the

others.  The protein sequence parsimony program, which has to do a considerable 

amount of bookkeeping to keep track of which amino acids  can  mutate  to  each

other, is also relatively slow. 

     Another class of algorithms includes PENNY, DOLPENNY, DNAPENNY and CLIQUE.

These  are  branch-and-bound  methods:  in principle they should have execution

times that rise exponentially with the number of species and/or characters, and

they  might be much more sensitive to messy data.  This is apparent with PENNY,

DOLPENNY, and DNAPENNY, which go from being reasonably fast with clean data  to

very slow with messy data.  DOLPENNY is particularly slow on messy data -- this

is  because  this  algorithm  cannot  make  use  of  some  of  the  lower-bound

calculations that are possible with DNAPENNY and PENNY.  CLIQUE is very fast on

all data sets.  Although in theory it should bog down if the number of  cliques

in the data is very large, that does not happen with random data, which in fact

has few cliques and those small ones.  Apparently the  "worst-case"  data  sets

are much rarer for CLIQUE than for the other branch-and-bound methods. 

     NEIGHBOR is quite fast compared to FITCH and KITSCH, and  should  make  it

possible  to  run  much larger cases, although the results are expected to be a

bit rougher than with those programs. 

Speed with different numbers of species

----- ---- --------- ------- -- ------- 

     How will the speed depend on the number  of  species  and  the  number  of

characters?   For  the  sequential-addition  algorithms,  the  speed  should be

proportional to the cube of the  number  of  species,  and  to  the  number  of

characters.   Thus a case that has, instead of 10 species and 20 characters, 20

species and 50 characters would take 2 x 2 x 2 x 2.5 = 20 times as long.   This

implies  that cases with more than 20 species will be slow, and cases with more

than 40 species  VERY  slow.   This  places  a  premium  on  working  on  small

subproblems rather than just dumping a whole large data set into the programs. 

     An exception to these rules will be some of the DNA programs that  use  an

aliasing  device to save execution time.  In these programs execution time will

not necessarily increase proportional to the number of  sites,  as  sites  that

show  the  same  pattern  of  nucleotides will be detected as identical and the

calculations for them will be done only once,  which  does  not  lead  to  more

execution  time.   This  is  particularly likely to happen with few species and

many sites,  or  with  data  sets  that  have  small  amounts  of  evolutionary

divergence. 
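     The aliasing idea can be sketched as counting how many distinct site
patterns (columns) an alignment contains, so that a calculation is done once
per pattern and weighted by the number of sites showing it.  The alignment
below is made up, and the function is an illustration rather than the code in
the DNA programs.

```python
from collections import Counter


def site_patterns(sequences):
    """Count how often each column (site pattern) occurs in aligned sequences."""
    return Counter(zip(*sequences))


alignment = ["AACCA",   # made-up data: 4 species, 5 sites
             "AACCA",
             "AAGCA",
             "CAGCC"]

patterns = site_patterns(alignment)
print(len(alignment[0]), "sites,", len(patterns), "distinct patterns")
for pattern, count in patterns.items():
    print("".join(pattern), count)
```

With few species or little change in the sequences, many sites share a
pattern, so the number of distinct patterns grows more slowly than the number
of sites.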


     For programs FITCH and KITSCH, the distance matrix is square, so that when

we  double  the number of species we also double the number of "characters", so

that running times will go up as the fourth power  of  the  number  of  species

rather  than the third power.  Thus a 20-species case with FITCH is expected to

run sixteen times more slowly than a 10-species case. 

     For programs like PENNY and CLIQUE the run times will rise faster than the

cube  of  the  number  of species (in fact, they can rise faster than any power

since these algorithms are not guaranteed to  work  in  polynomial  time).   In

practice,  PENNY will frequently bog down above 11 species, while CLIQUE easily

deals with larger numbers. 

     For NEIGHBOR the speed should vary only as the square  of  the  number  of

species, so a case twice as large will take only four times as long.  This will

make it an attractive alternative to FITCH and KITSCH for large data sets. 
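     The scaling rules in this section can be collected into one small
calculation.  The exponents are the ones given in the text (3 for the
sequential-addition programs, 4 for FITCH and KITSCH, 2 for NEIGHBOR); the
function name and its arguments are illustrative.

```python
def relative_time(species_ratio, exponent, character_ratio=1.0):
    """Factor by which run time grows under a power law in the number of species."""
    return species_ratio ** exponent * character_ratio


# 10 species, 20 characters -> 20 species, 50 characters
print(relative_time(2, 3, 2.5))  # sequential addition: 2 x 2 x 2 x 2.5
print(relative_time(2, 4))       # FITCH or KITSCH, species doubled
print(relative_time(2, 2))       # NEIGHBOR, species doubled
```

These factors do not apply to the branch-and-bound programs, whose run times
are not bounded by any power of the number of species.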

     If you are unsure of how long a program will take, try it first on  a  few

species,  then work your way up until you get a feel for the speed and for what

size problems you can afford to run. 

     Execution time  is  not  the  most  important  criterion  for  a  program,

particularly   as  computer  time  gets  much  cheaper  than  your  time  or  a

programmer's time.  With workstations on which background jobs can be  run  all

night,  execution  speed  is not overwhelmingly relevant.  Some of us have been

conditioned by  an  earlier  era  of  computing  to  consider  execution  speed

paramount.   But  ease  of use, ease of adaptation to your computer system, and

ease of modification are much more important in practice, and in these respects

I  think  these programs are adequate.  Only if you are engaged in 1960's style

mainframe computing is minimization of execution time paramount. 

     Nevertheless it would have been nice to have  made  the  programs  faster.

The  present speeds are a compromise between speed and effectiveness: by making

them slower and trying more rearrangements in the trees, or by enumerating  all

possible  trees,  I  could  have made the programs more likely to find the best

tree.  By trying fewer rearrangements I could have speeded them up, but at  the

cost  of  finding  worse  trees.   I could also have speeded them up by writing

critical sections in assembly language, but this would have sacrificed ease  of

distribution  to new computer systems.  There are also some options included in

these programs  that  make  it  harder  to  adopt  some  of  the  economies  of

bookkeeping  that  make  other  programs faster.  However to some extent I have

simply made the  decision  not  to  spend  time  trying  to  speed  up  program

bookkeeping  when  there  were  new  likelihood  and  statistical methods to be

developed. 


Relative speed of different machines

-------- ----- -- --------- -------- 

     It is interesting to compare  different  machines  using  DNAPARS  as  the

standard  task.  One can rate a machine on the DNAPARS benchmark by summing the

times for all three of the data sets.  Here are relative total timings over all

three  data  sets  (done  with  various versions of DNAPARS) for some machines,

taking Microsoft Quick C running under PCDOS on a 16 MHz  80386  clone  as  the

standard.   Pascal benchmarks from version 3.4 of the program are also included

-- they are compared only with each other and their times are  in  parentheses.

This  use  of  two  separate  standards  is  necessary not because of different

languages but because different versions of the  package  are  being  compared.

Thus,  the  "Time"  is  the  ratio  of the Total to that for the 386SX, for the

appropriate standard, so that the Time for the Macintosh  Classic  for  DNAPARS

3.4  on  Think  Pascal 3 is compared to the Time for the 386SX running  DNAPARS

3.4 on Turbo Pascal 6.0, but the Time for the Macintosh Classic running version

3.5  on  Think  C  is compared to the Time for the 386SX running version 3.5 on

Quick C.  The Speed is the reciprocal of the Time. 

  Machine             System     Compiler            Total     Time     Speed

  -------             ------     --------            -----     ----     ----- 

  Toshiba T1100+      PCDOS    Turbo Pascal 3.01A   (269)      7.912      0.126

  Apple Mac Plus      MacOS    Lightspeed Pascal 2  (175.84)   5.172      0.193

  Toshiba T1100+      PCDOS    Turbo Pascal 5.0     (162)      4.765      0.210

  Macintosh Classic   MacOS    Think Pascal 3       (160)      4.706      0.212

  Macintosh Classic   MacOS    Think C                43.0     3.58       0.279

  IBM PS2/60          PCDOS    Turbo Pascal 5.0      (58.76)   1.728      0.579

  80286 (12 MHz)      PCDOS    Turbo Pascal 5.0      (47.09)   1.385      0.722

  Apple Mac IIcx      MacOS    Think Pascal 3        (42)      1.235      0.810

  Apple Mac SE/30     MacOS    Think Pascal 3        (42)      1.235      0.810

  Apple Mac IIcx      MacOS    Lightspeed Pascal 2   (39.84)   1.172      0.853

  Apple Mac IIcx      MacOS    Lightspeed Pascal 2#  (39.69)   1.167      0.857

  Zenith Z386 (16MHz) PCDOS    Turbo Pascal 5.0      (38.27)   1.155      0.866

  Macintosh SE/30     MacOS    Think C                13.6     1.132      0.883

  80386SX (16 MHz)    PCDOS    Turbo Pascal 6.0      (34)      1.0        1.0

  80386SX (16 MHz)    PCDOS    Microsoft Quick C      12.01    1.0        1.0 

  Sequent S-81        DYNIX    Silicon Valley Pascal (13.0)    0.382      2.615

  VAX 11/785          Unix     Berkeley Pascal       (11.9)    0.35       2.857

  80486-33            PCDOS    Turbo Pascal 6.0      (11.46)   0.337      2.967

  Sun 3/60            SunOS    Sun C                   3.93    0.327      3.056

  NeXT Cube (68030)   Mach     Gnu C                   2.608   0.217      4.605

  Sequent S-81        DYNIX    Sequent Symmetry C      2.604   0.217      4.612

  VAXstation 3500     Unix     Berkeley Pascal        (7.3)    0.215      4.658

  Sequent S-81        DYNIX    Berkeley Pascal        (5.6)    0.1647     6.07

  Unisys 7000/40      Unix     Berkeley Pascal        (5.24)   0.1541     6.49

  VAX 8600            VMS      DEC VAX Pascal         (3.96)   0.1165     8.59

  Sun SPARC IPX       SunOS    Gnu C version 2.1       1.28    0.1066     9.383

  VAX 6000-530        VMS      DEC C                   0.858   0.0714    13.998

  VAXstation 4000     VMS      DEC C                   0.809   0.0674    14.845

  IBM RS/6000 540     AIX      XLP Pascal             (2.276)  0.0669    14.94

  NeXTstation(040/25) Mach     Gnu C                   0.75    0.0624    16.013

  Sun SPARC IPX       SunOS    Sun C                   0.68    0.0566    17.662

  486DX (33 MHz)      Linux    Gnu C #                 0.63    0.0525    19.063

  Sun SPARCstation-1+ Unix     Sun Pascal             (1.7)    0.05      20.00

  DECstation 5000/200 Unix     DEC Ultrix C            0.45    0.0375    26.69

  Sun SPARC 1+        SunOS    Sun C                   0.40    0.0333    30.025

  DECstation 3100     Unix     DEC Ultrix RISC Pascal (0.77)   0.0226    44.16

  IBM 3090-300E       AIX      Metaware High C         0.27    0.0225    44.48

  DECstation 5000/125 Unix     DEC Ultrix RISC C       0.267   0.0222    44.98

  DECstation 5000/200 Unix     DEC Ultrix RISC C       0.256   0.0222    44.98

  Sun SPARC 4/50      SunOS    Sun C                   0.249   0.02073   48.23

  DEC 3000/400 AXP    Unix     DEC C                   0.224   0.01865   53.62

  DECstation 5000/240 Unix     DEC Ultrix RISC C       0.1889  0.01573   63.58

  SGI Iris R4000      Unix     SGI C                   0.184   0.01532   65.27

  IBM 3090-300E       VM       Pascal VS              (0.464)  0.0136    73.28

  DECstation 5000/200 Unix     DEC Ultrix RISC Pascal (0.39)   0.0114    87.18 
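     The Time and Speed columns follow directly from the Total column.  The
sketch below recomputes them for two rows of the table, using the two 386SX
totals as standards; the function and constant names are illustrative.

```python
C_STANDARD = 12.01       # 386SX, Microsoft Quick C (version 3.5 benchmark)
PASCAL_STANDARD = 34.0   # 386SX, Turbo Pascal 6.0 (version 3.4 benchmark)


def rate(total, standard_total):
    """Return (Time, Speed): Time is total / standard, Speed is its reciprocal."""
    relative = total / standard_total
    return relative, 1 / relative


t, s = rate(3.93, C_STANDARD)          # Sun 3/60, Sun C
print(round(t, 3), round(s, 3))        # 0.327 3.056

t, s = rate(269, PASCAL_STANDARD)      # Toshiba T1100+, Turbo Pascal 3.01A
print(round(t, 3), round(s, 3))        # 7.912 0.126
```

Totals in parentheses (the Pascal runs) must be divided by the Pascal
standard, and the others by the C standard.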

The Toshiba T1100+ should be exactly as fast as an  8  MHz  PC  clone.   For  a

couple  of  the machines I am not sure that this benchmark is representative of

timings on non-numerical programs in PHYLIP.  This is particularly the case for

the  DEC  3000/400  AXP  (the DEC "Alpha") which is probably quite a bit faster

than indicated here.  The numerical programs benchmark below gives it a  fairer

test.   The  IBM RS/6000 is probably up to ten times faster than shown here: it

may have been ill-served by its Pascal compiler. 

     Note that parallel machines like the Sequent are not  really  as  slow  as

indicated  by  the  data  here,  as these runs did nothing to take advantage of

their parallelism. 

     For a picture of speeds for a more numerically intensive program, here are

benchmarks  using DNAML, with the 16 MHz 386SX with math co-processor active as

the standard.  Numbers are total run times (total user  time  in  the  case  of

Unix) over all three data sets. 


  Machine             System         Compiler       Seconds   Time    Speed

  -------             ---------      --------       -------   ----    ----- 

  386SX 16 MHz          PCDOS   Turbo Pascal 6    (7826)     1.0        1.0

  386SX 16 MHz          PCDOS   Quick C            6549.79   1.0        1.0

  Compudyne 486DX/33    Linux   Gnu C              1599.9    0.2441     4.096

  SUN Sparcstation 1+   SunOS   Sun C              1402.8    0.2142     4.669

  Everex STEP 386/20    PCDOS   Turbo Pascal 5.5  (1440.8)   0.1841     5.432

  486DX/33              PCDOS   Turbo C++          1107.2    0.1690     5.916

  Compudyne 486DX/33    PCDOS   Waterloo C/386     1045.78   0.1597     6.263

  Sun SPARCstation IPX  SunOS   Gnu C               960.2    0.1466     6.821

  NeXTstation(68040/25) Mach    Gnu C               916.6    0.1399     7.146 

  486DX/33              PCDOS   Waterloo C/386      861.0    0.1314     7.607

  Sun SPARCstation IPX  SunOS   Sun C               787.7    0.1203     8.315

  486DX/33              PCDOS   Gnu C               650.9    0.0994    10.063

  VAX 6000-530          VMS     DEC C               637.0    0.0973    10.282

  DECstation 5000/200   Unix    DEC Ultrix RISC C   423.3    0.0646    15.473

  IBM 3090-300E         AIX     Metaware High C     201.8    0.0308    32.46

  Convex C240/1024      Unix    C                   101.6    0.01551   64.47

  DEC 3000/400 AXP      Unix    DEC C                98.29   0.01501   66.64 

     You are invited to send me figures  for  your  machine  for  inclusion  in

future tables.  Use the data sets above and compute the total times for DNAPARS

and for DNAML for the three data sets (setting  the  frequencies  of  the  four

bases  to  0.25  each  for  the  DNAML  runs).  Be sure to tell me the name and

version of your compiler, and the version of PHYLIP you tested. 

Published Benchmarks

--------- ---------- 

     Some of you may have seen the "benchmark" published by Luckow and Pimentel

(1985).   PHYLIP's WAGNER (an immediate ancestor of MIX) did not do well in it,

either in terms of the quality of result or execution speed.  I do not  believe

that  this  was  a fair benchmark.  WAGNER was run only with one order of input

species, not ten as recommended here.  Had it been, perhaps the  shortest  tree

would  have  been  found  more  often.   No  credit was given to PHYLIP in that

article for its free distribution, availability on microcomputers, availability

in  source  code  form, or portability to new computers.  Pimentel's laboratory

commissioned the development  of  a  competing  package,  PHYSYS,  which  is  a

commercial product, and that involvement was not stated in the article. 

     The benchmarks  by  Fink  (1986)  are  fairer,  although  there  are  some

impressions  given  by  that article which do not apply to the present version.

In particular, I have since added to many of the programs the ability  to  save

multiple  equally-parsimonious  trees,  and  have  changed  the outputs so that

reconstruction of states in the hypothetical ancestral nodes  is  much  easier,

thus answering Fink's major criticisms.  I have since eliminated the Metropolis

annealing method algorithms which he criticized.  I disagree with  Fink's  view

of  PHYLIP that one should "be wary of published results from an analysis using

it", as I do not think that a tree slightly longer than the  most  parsimonious

one should be rejected out of hand.  Nor do I agree that "it is really too slow

to use as a teaching tool", as in teaching one uses small data sets  and  speed

is  not of the essence.  Rather, simplicity of user interface is paramount, and

there PHYLIP does very well (as is the ability to run on a variety of computers, in

which  respect  PHYLIP  is  also  superior).   In  fact, it is widely used as a

teaching tool. 

     Nevertheless MIX is undoubtedly not as fast or as sophisticated as PAUP or

Hennig86.   The  present  version  of  PHYLIP  is  closer to its competitors in

quality of result than was the version Fink reviewed. 

     Platnick's (1987) benchmarks concentrated, as did the  other  benchmarkers

(all  of  them  members of the same school of systematists) on parsimony as the

only phylogeny criterion worthy of attention.  He concluded that  PHYLIP  could

be  used effectively, especially if up to ten different input orders of species

were used.  Again, as with the  other  benchmarks,  no  credit  was  given  for

diversity of methods, portability, price, or availability of source code. 

     Platnick's second benchmark paper  (1989)  concentrates  on  Hennig86  and

PAUP,  and  concludes  that  PHYLIP  has not kept up with those programs in its

features.  Again, the review is entirely concerned with parsimony, and only the

barest mention is made of ... (you can complete this sentence). 

     Sanderson's (1990) benchmark paper breaks with the method of the others by

specifying  36  features  of  the packages rated and giving separate ratings in

each.  Like the other benchmark papers it concentrates  almost  exclusively  on

parsimony  as  applied to morphological characters, but does at least give some

credit where credit is due. 

     My own, obviously biased, feeling is that there is a  discrepancy  between

the benchmarkers' projections of how satisfied users of PHYLIP will be, and how

satisfied they actually are.  And that this discrepancy is in PHYLIP's favor. 


     Here are some comments  about  PHYLIP.   Explanatory  material  in  square

brackets is my own: 

   From the pages of Cladistics: 

   "Under no circumstances can we recommend  PHYLIP/WAG  [their  name  for  the

   Wagner parsimony option of MIX]."

                                     Luckow, M. and R. A. Pimentel (1985) 

   "PHYLIP has not proven very effective in implementing parsimony (Luckow  and

   Pimentel, 1985)."

                                     J. Carpenter (1987a) 

   "... PHYLIP.  This is the computer program where every newsletter concerning

   it  is  mostly  bug-catching,  some of which have been put there by previous

   corrections.  As Platnick (1987)  documents,  through  dint  of  much  labor

   useful  results  may  be  attained with this program, but I would suggest an

   easier way: FORMAT b:"

                                     J. Carpenter (1987b) 

   "PHYLIP is bug-infested and both less  effective  and  orders  of  magnitude

   slower than other programs ...."

                                     "T. N. Nayenizgani" [J. S. Farris] (1990) 

   "Hennig86 [by J. S. Farris]  provides  such  substantial  improvements  over

   previously  available programs (for both mainframes and microcomputers) that

   it should now become the tool of choice for practising systematists."

                                     N. Platnick (1989) 

and in the pages of other journals: 

   "The  availability,  within  PHYLIP  of  distance,  compatibility,   maximum

   likelihood,   and   generalized   'invariants'   algorithms   (Cavender  and

   Felsenstein, 1987) sets it  apart  from  other  packages  ....  One  of  the

   strengths of PHYLIP is its documentation ...."

                                     Michael J. Sanderson (1990)

   (Sanderson also criticizes PHYLIP for  slowness  and  inflexibility  of  its

   parsimony algorithms, and compliments other packages on their strengths). 

   "This package of programs has gradually become a basic necessity  to  anyone

   working  seriously  on  various  aspects  of phylogenetic inference .... The

   package includes more programs than any other known phylogeny package.   But

   it  is not just a collection of cladistic and related programs.  The package

   has great value added to the whole, and for this it is unique and of extreme 

   importance  ....  its  various  strengths  are in the great array of methods

   provided ...."

                                     Bernard R. Baum (1989) 

   (see also above under Benchmarks for W. Fink's critical  remarks  (1986)  on

   version 2.8 of PHYLIP). 


     In the sections following you will find instructions on how to  adapt  the

programs  to  different  computers  and compilers.  The programs should compile

without alteration on most versions of C.  They use the "malloc" or "calloc"

library functions to allocate memory, so that the upper limit on how many

species, sites, or characters they can handle is set by the system memory

available to those memory-allocation functions. 

     In the document file for each program,  I  have  supplied  a  small  input

example, and the output it produces, to help you check whether the programs are

running properly. 

     Most of the programs read their data from a file called "infile" and write

their  output  to a file called "outfile" and a tree file to a file "treefile".

If "infile" does not exist the program will prompt you for its name. 

Compiling the programs

--------- --- -------- 

     Many machines that have C compilers, particularly  Unix  systems,  have  a

utility  called  "make"  available  that considerably simplifies the process of

compiling these programs.  I will first discuss how to compile  these  programs

with  "make"  and  then,  after  a  digression  on  how  to  move  PHYLIP  to a

microcomputer, discuss for different individual  systems  how  to  compile  the

programs.   As  we  shall  see  below, for some DOS and Macintosh compilers one

cannot simply use "make" and the standard Makefile. 

Using "make"

----- ------ 

     If your machine has "make" you can place all the programs for the package,

together with the file "Makefile" and the header files "phylip.h" and

"drawgraphics.h", in one directory.  The Makefile and header files are

constructed to detect, for many varieties of C, which one they are dealing

with, and to inform the programs accordingly so that they can (by using

"#ifdef") adapt to the idiosyncrasies of the compiler. 

     To compile all the programs just type:    make all 

     To compile just one program, such as DNAML, type:    make dnaml 

     After a time the  compiler  will  finish  compiling.   The  names  of  the

executables  will  be  the same as the names of the C programs, but without the

".c" suffix.  Thus dnaml.c compiles to make an executable called  "dnaml".   If

object modules ending in ".o" are found in the directory after compilation they

can be removed if you need space. 

Getting PHYLIP onto your microcomputer

------- ------ ---- ---- ------------- 

     C is  widely  available  on  microcomputers,  and  in  any  case  we  also

distribute  executable  versions  for  PCDOS, 386 PCDOS, and Macintosh systems.

Your institution may have an Internet connection, and if so there is probably a

PCDOS  system  or  a  Macintosh somewhere connected directly to it.  Using that

machine you could download the executables and put them directly onto a diskette

for  transfer  to  your  own  machine.   You  can  also  get  the  source code,

documentation,  and  executables  by  sending  me  the  appropriate  number  of

diskettes (see the general information at the start of this document). 

     If you cannot do this, you may be able to transfer the entire package,  in

the form of self-extracting archives (which is one of the ways we distribute it

for microcomputers) to your system using a terminal program with file  transfer

capabilities.  Some users are sufficiently terrified of this prospect that they

prefer to  mail  us  diskettes  and  wait  for  several  weeks.   But  if  your

institution has an Internet connection it is much faster to do it that way.  If

you have a serial port to which a modem can be hooked, you can get  a  terminal

program  and  do  the  transfers  yourself.   For  most  microcomputer systems,

public-domain or  shareware  terminal  programs  are  available,  such  as  the

widely-distributed  KERMIT  and  MODEM  families  of programs.  Most university

computer centers have communications programs (KERMIT or XMODEM) to  "talk"  to

KERMIT, MODEM, or PC-TALK and transfer files to and from it. 

     Thus, if you cannot get from me a disk format readable  by  your  machine,

     you can: 

     (1) Get an account on your mainframe and learn to use its  facilities  for

     "anonymous ftp" (transfer of files over Internet) or electronic mail. 

     (2a) If you are on Internet (or NSFNET) use the "anonymous ftp" method  to

     receive  the  self-extracting  archive  files  (start  by  downloading and

     reading the  file  "pub/phylip/Read.Me"  from  my  system  whose  Internet

     address is evolution.genetics.washington.edu), or 

     (2b) if  your  institution  is  not  on  Internet  but  does  have  Bitnet

     electronic  mail,  you  can request that I send you the PHYLIP source code

     files and documentation as  e-mail  messages  over  BITNET/EARN  (not  the

     executables, however). 

     (3) Make sure the files are saved on your mainframe account (you will need

     about 2.2 Megabytes of space) under appropriate names. 

     (4) Use the file transfer provisions of your terminal program to  transfer

     the  archives  to  your  microcomputer,  or  if  they  came as many e-mail

     messages, to transfer  these  to  your  machine  individually  (most  file

     transfer  programs  can  transfer  many  files with one command) for later

     compilation of the C source. 

     If you cannot read the diskette formats that  I  can  write,  and  if  you

absolutely  INSIST that I distribute the package in this format, please send me

the computer and thirteen diskettes.  I will promptly write the  diskettes  and

return them (but of course I will keep your computer). 

     Now we turn to particular C compilers  and  describe  particular  problems

that may be encountered. 

Microsoft Quick C and Microsoft C

--------- ----- - --- --------- - 

     These comments apply to Microsoft Quick C but may also work with Microsoft

C.   A  Makefile for Microsoft Quick C is included with the source code.  It is

called "Makefile.qc".  If you copy it and call the copy "Makefile" (making sure

to first save the generic Makefile that comes with this package under some name

such as Makefile.old), you should be able to use  "make"  as  described  above,

except  that  it  is  called  "nmake".   Note  that the command you must use to

compile (for example) DNAPARS is "nmake dnapars.exe", not "nmake  dnapars",  as

the program that results is to be called "dnapars.exe" and the Quick C Makefile

is set up that way. 

     To compile individual programs without using the makefile, you need to  do

the  following.   For a non-graphics program use the following command (DOS> is

the PCDOS prompt, so you do not type it): 

DOS> qcl /AH /F 4000 /FPi [source files] 

If the program you are trying to compile  is  a  1-part  source  (for  example,

neighbor  only  has  one  part, neighbor.c) you should replace "[source files]"

with "neighbor.c".  So the command would be: 

DOS> qcl /AH /F 4000 /FPi neighbor.c 

If the program you are trying to compile is a 2-part source (for  example,  mix

has  two  parts,  mix.c and mix2.c) you can replace [source files] with both of

the source files.  Make sure that the first source file in  the  list  has  the

same  name  as the executable file you want.  i.e. use mix.c mix2.c and not the

other way around.  If you reorder them, the  executable  file  will  be  called

"MIX2.EXE".  For mix, the command would be: 

DOS> qcl /AH /F 4000 /FPi mix.c mix2.c 

To compile a graphics program (i.e. drawgram, drawtree) under Quick C without

using the makefile, use one of the following commands:


DOS> qcl /AH /F 4000 /FPi drawgram.c drawgraphics.c graphics.lib [for drawgram]


DOS> qcl /AH /F 4000 /FPi drawtree.c drawgraphics.c graphics.lib [for drawtree] 

Turbo C++ for PCDOS

----- --- --- ----- 

The following instructions are for Turbo C++ but may also work for Turbo C  and

for  Borland C, perhaps with slight modifications.  Under normal situations you

can use the makefile.  The makefile for Turbo C++ is included in the package as

"Makefile.tc". Copy it and call the copy "Makefile" (it would be wise to first

rename the original  "Makefile"  to  "Makefile.old").  Then  to  compile,  say,

DNAPARS, just type: 

make dnapars.exe 

However, if for some reason you want to do it by hand, follow these steps. 


For the non-graphical programs (all those other than DRAWGRAM and DRAWTREE): 

To compile dnapars.c type the following (DOS> is the PCDOS prompt): 

DOS> tcc -mh dnapars.c 

If the source file is sufficiently large to require two sources (for example,

dnaml.c and dnaml2.c), you will need to use both dnaml.c and dnaml2.c. 


DOS> tcc -mh dnaml.c dnaml2.c

DOS> tcc -mh neighbor.c 

If you would like to use the program under the TD debugger, you should

add a "-v" flag as a compiler option: 

DOS> tcc -mh -v restml.c restml2.c 

For the graphical programs (DRAWGRAM and DRAWTREE): 

   First you need to build the "BGI" drivers.  The BGI drivers are included

   with your TURBOC compiler, and should be in the "BGI" directory (this is

   a subdirectory of the main turboc directory).  To do this you need to use

   the "bgiobj" program, also in the BGI directory.  The current version

   of PHYLIP supports the EGA/VGA, CGA, and hercules drivers.  If you have

   modified the sources to take advantage of other drivers, you will have

   to include those as well. 

   To build the BGI drivers: 

   DOS> cd \tc\bgi [this should be replaced with whatever your turboc dir is]

   DOS> bgiobj egavga

   DOS> bgiobj cga

   DOS> bgiobj herc

   This generates the files "EGAVGA.OBJ", "CGA.OBJ", and "HERC.OBJ" in the

   current directory.  Copy these into your main source directory (assumed

   here to be \phylip): 

   DOS> copy EGAVGA.OBJ \phylip [replace this with your source directory]

   DOS> copy CGA.OBJ \phylip

   DOS> copy HERC.OBJ \phylip 

   To compile the program, cd back to your source directory.  You want

   to compile each source file, plus a shared graphics file called

   "drawgraphics.c".  You also want to link it to the newly created BGI

   object files and to the graphics library. 


DOS> tcc -mh drawgram.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib

DOS> tcc -mh drawtree.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib 

   (to compile drawgram and drawtree, respectively) 

   If you want to compile for the TD debugger, add the -v flag as above. 

Watcom C/386

------ ----- 

     Watcom C/386 is the compiler we use to create  the  386  PCDOS  and  386

Windows  versions  of  the  executables.   It  has  a  "make" capability called

"wmake".  We have had problems using this so  the  instructions  here  are  for 

individually compiling programs without wmake. 

     Watcom C/386 is a very flexible compiler  which  can  generate  executable

programs for many different environments.  Following are instructions for using

Watcom C/386 to compile for DOS using the DOS/4GW DOS extender  (included  with

the Watcom distribution) and for Microsoft Windows. 


     To compile a program under Watcom C/386 for the DOS/4GW DOS extender, use

the following (the "DOS>" is the PCDOS prompt, not something you type): 

DOS> wcl386 /l=dos4gw /p /k65520 [source files] 

If the program you are trying to compile  is  a  1-part  source  (for  example,

neighbor  only  has  one part, neighbor.c) you can replace [source files]  with

"neighbor.c".  So the command would be: 

DOS> wcl386 /l=dos4gw /p /k65520 neighbor.c 

If the program you are trying to compile is a 2-part source (for  example,  mix

has  two  parts,  mix.c and mix2.c) you can replace [source files] with both of

the source files.  Make sure that the first source file in  the  list  has  the

same  name  as the executable file you want.  i.e. use mix.c mix2.c and not the

other way around.  If you reorder them, the  executable  file  will  be  called

"MIX2.EXE".  For mix, the command would be: 

DOS> wcl386 /l=dos4gw /p /k65520 mix.c mix2.c 

The resultant executable file will take advantage  of  your  system's  extended

memory and will not be limited to using only the first 640K.  However, it needs

the file "dos4gw.exe" in order to run.  If you want  to  be  able  to  use  the

program  generated,  make sure that this program is somewhere in your path. (To

ensure this you can copy the program into  the  directory  where  the  compiled

program  resides).   This  "dos  extender"  is  bundled  with  the Watcom C/386

compiler and is freely redistributable. 

For Windows: 

To compile a program under Watcom C/386 for Windows use the following: 

DOS> wcl386 /l=win386 /zw /p /k65520 [source files] 

Again, replace [source files] with either the complete program (e.g. neighbor.c)

or both parts of the program (e.g. mix.c mix2.c). 

Once you have compiled the Windows program you are not quite ready to run it

under Windows.  The final step is to link it with the "windows supervisor".

To do this, use the following: 

DOS> wbind [program] -n 

For example, for mix: 

DOS> wbind mix -n 

This will generate [program].exe, an application that will be runnable under

Windows. 


1. Make sure that when you use wbind, \watcom\binw is somewhere in

      your path.  If it is not, you may have to tell wbind explicitly where

      the windows supervisor file is, as in the following example: 

   DOS> wbind mix -n -s c:\watcom\binw\win386.ext

      where you replace c:\watcom\binw\win386.ext with the full path of

      win386.ext on your system. 

2. The draw programs (drawgram, drawtree) currently do not compile

      under Windows.  Compile them for DOS/4GW and use them in a DOS shell

      under Windows. 


Think C for Macintosh

----- - --- --------- 

     For Symantec's Think C compiler (formerly called Lightspeed  C)  a  "make"

utility  is  not  available.  Thus you cannot use the Makefile but must compile

the programs individually.  Here are the steps you should follow to  compile  a

typical program. 

(1) Start up Think C. 

(2) Click on "New project" in the Think C project menu.  You will be  asked  to

enter the name of the project. 

(3) Add the source code for the program to the project.  To add sources to  the

project, you need to click on "add" from the source menu.  You will need to add

the sources from the main program (i.e. "neighbor.c" in the case of  a  program

in  1  part  or "dnaml.c" and "dnaml2.c" in the case of a 2-part program).  You

also need to add "interface.c" (included with the distribution) and two  things

which are included with the Think C compiler.  The first one is "MacTraps", and

is contained within the Think C folder under a directory called "MacLibraries".

The  second  one  is "ANSI", and is contained within the Think C folder under a

directory called "C Libraries". 

(4) Segment the project: After adding each of the sources to the  project,  you

need  to  segment  the project.  This means that every source file is contained

within its own 32K segment.  In order to do this within Think C, you can  click

on a source file name in the Think C project window (the window that lists each

of the sources) and drag it down to the bottom of the source list.   After  you

have done this for each of the source files, a dotted line should appear around

each source file in the project window. 

(5) Set up compile options: The first thing you need to do is set up what  sort

of  project you're compiling, and some of the characteristics of how the memory

is set up.  To do this, select "Set project type" in the  "Project"  menu,  and

make sure it's set up to be an Application with far code and far data. 

Depending on the hardware you will be  running  on,  you  may  want  to  select

different  compilation options.  Most notably, if your machine has a 68881 math

coprocessor, enable the use of the coprocessor by selecting "Options" under the

"Edit" window, selecting "Compiler settings" through the list at the upper left

corner of the display, and then  checking  the  box  next  to  "Generate  68881


(6) Compile the project: select "Make" under the source window.  After this has

completed  (assuming that there were no compile errors), you need to generate a

Mac application.  To do this, select  "Build  Application"  under  the  project

menu.  Select  a  name for the application, and Think C will create a Macintosh

application. 


Although this is more tedious than using a Makefile, Think C  works  very  well

with  the PHYLIP programs and is the compiler we use for creating the Macintosh

executables. 




     I have already mentioned that under Unix you can use the "make" command to

compile  programs.   This  works on all Unix systems.  To compile an individual

program like dnapars.c you can give the command "make dnapars" or alternatively

"cc  dnapars.c -lm".    When compiling programs that come in two parts, such as

dnaml.c and dnaml2.c, you will  have  to  issue  three  commands,  two  compile

commands and one link command: 

cc -c dnaml.c

cc -c dnaml2.c

cc dnaml.o dnaml2.o -lm -o dnaml 

where the first two commands produce the object modules dnaml.o  and  dnaml2.o

and  the  third  command  links them together into an executable that is called

"dnaml". 
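
     The compile-then-link pattern for DNAML generalizes to any program that
comes in several parts.  The following is only a minimal sketch for a
Bourne-type shell and is not part of the PHYLIP distribution: the function
names "run" and "build_program" and the variables DRY_RUN and CC are
assumptions made for illustration.  It compiles each source file to an object
module and then links the modules with the math library.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PHYLIP).

# run COMMAND...: execute the command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_program NAME SOURCE...: compile each source file to an object
# module with "-c", then link all of the modules into an executable NAME.
build_program() {
    name=$1
    shift
    objects=
    for src in "$@"; do
        run "${CC:-cc}" -c "$src" || return 1
        objects="$objects ${src%.c}.o"
    done
    # $objects is deliberately unquoted so that it splits into one word
    # per object module.
    run "${CC:-cc}" $objects -lm -o "$name"
}
```

With these definitions, "build_program dnaml dnaml.c dnaml2.c" issues the two
compile commands and the link command for DNAML, while
"DRY_RUN=1 build_program dnaml dnaml.c dnaml2.c" only prints them.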


     In running the programs, you may sometimes want to put them in  background

so  you  can  proceed with other work.  On systems with a windowing environment

they can be put in their own window, and commands like "nice" used to make them

have lower priority so that they do not interfere with interactive applications

in other windows.  If there is no windowing environment, you will want  to  use

an  ampersand ("&") after the command file name when invoking it to put the job

in the background.  You will have to put all the responses to  the  interactive

menu  of  the program into a file and tell the background job to take its input

from that file. 

     For example: suppose you want to run DNAPARS in background,  taking  its

input  data from a file called sequences.dat, putting its interactive output to

a file called "screenout", and using a file called "input" as the place to store

the interactive input.  The file "input" need only contain two lines: 

sequences.dat

Y 

which is what you would  have  typed  to  run  the  program  interactively,  in

response  to  the program's request for an input file name if it did not find a

file named "infile", and in response to the menu. 

     To run the program in background, you would simply give the command: 

dnapars < input > screenout & 

which runs the program with input responses coming from "input" and interactive

output  being  put  into file "screenout".  The usual output file and tree file

will also be created by this run (keep that in mind as if  you  run  any  other

PHYLIP  program from the same directory while this one is running in background

you may overwrite the output file from one program with that from the other!). 
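
     The overwriting hazard just mentioned can be avoided by giving each
background run its own directory.  The following is only a minimal sketch for
a Bourne-type shell and is not part of PHYLIP: the function name
"run_isolated" and the directory layout are assumptions made for illustration,
and the "Y" response assumes that the program's default menu settings are to
be accepted.

```shell
#!/bin/sh
# Hypothetical helper (not part of PHYLIP).
# run_isolated PROGRAM DATAFILE RUNDIR
# Copies DATAFILE into RUNDIR, writes the interactive responses to
# RUNDIR/input, and starts PROGRAM in background inside RUNDIR, so that
# its "outfile" and "treefile" are created there and nowhere else.
run_isolated() {
    program=$1
    datafile=$2
    rundir=$3
    mkdir -p "$rundir" || return 1
    cp "$datafile" "$rundir/" || return 1
    # First response: the data file name (asked for when "infile" is absent).
    # Second response: "Y", assumed here to accept the menu settings.
    printf '%s\n' "$(basename "$datafile")" 'Y' > "$rundir/input"
    if command -v "$program" >/dev/null 2>&1; then
        ( cd "$rundir" && "$program" < input > screenout 2>&1 ) &
    else
        echo "$program not found; only $rundir/input was written" >&2
    fi
}
```

A call such as "run_isolated dnapars sequences.dat run1" would then leave the
results in the directory "run1".  If it is used inside a script, follow it
with "wait" so that the script does not finish before the background job does.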

     If you wanted to give the program lower priority, so  that  it  would  not

interfere  with  other  work,  and  you  have  Berkeley  Unix  type job control

facilities in your Unix, you can use the "nice" command: 

nice +10 dnapars < input > screenout & 

which lowers the priority of the run.  To also time the run and put the  timing

at the end of "screenout", you can do this: 

nice +10 ( time dnapars < input ) >& screenout & 

which I will not attempt to explain. 

     You may also want to explore putting the interactive output into the  null

file  "/dev/null" so as to not be bothered with it (but then you cannot look at

it to see why something went wrong).  If you have problems with creating output

files  that are too large, you may want to explore carefully the turning off of

options in the programs you run. 

     If you are doing several runs in  one,  as  for  example  when  you  do  a

bootstrap  analysis  using SEQBOOT, DNAPARS (say), and CONSENSE, you can use an

editor to create a "batch file" with these commands: 

seqboot < input1 > screenout

mv outfile infile

dnapars < input2 >> screenout

mv treefile infile

consense < input3 >> screenout 

and then take the file (say "foofile") containing these commands  and  give  it

execute  permission  by  using  the command  "chmod +x foofile" followed by the

command "rehash".  Then the job that foofile describes can be run as  a  single

job  in  background by giving the command "foofile &".  Note that you must also

have the interactive input commands for SEQBOOT (including  the  random  number

seed),  DNAPARS,  and  CONSENSE  in  the separate files "input1", "input2", and

"input3".   With Berkeley-style job control the  "nice"  command  can  be  used

within the batch file "foofile" before each program name to reduce the priority

with which the programs run. 
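
     One weakness of a simple batch file of this kind is that a later program
is still started if an earlier one has failed.  The following is only a
minimal sketch of a more defensive version for a Bourne-type shell: the
function names "run_step" and "bootstrap_run" are assumptions made for
illustration, the POSIX form "nice -n 10" is used rather than the Berkeley C
shell form, and the sketch assumes that a failed step either returns a
nonzero exit status or leaves no "outfile" or "treefile" for "mv" to find.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PHYLIP).

# run_step PROGRAM RESPONSES: run PROGRAM at lowered priority, taking its
# interactive input from the file RESPONSES and appending what it would
# have written to the screen to "screenout".
run_step() {
    nice -n 10 "$1" < "$2" >> screenout 2>&1
}

# bootstrap_run: SEQBOOT, then DNAPARS, then CONSENSE.  The "&&" chain
# stops at the first step that fails, so no program is run on the stale
# output of an earlier one.
bootstrap_run() {
    : > screenout
    run_step seqboot input1 &&
    mv outfile infile &&
    run_step dnapars input2 &&
    mv treefile infile &&
    run_step consense input3
}
```

If these definitions are placed in the batch file, ending the file with the
line "bootstrap_run" runs the whole analysis, and the file can be put in
background with "&" in the same way as before.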

VMS VAX systems

--- --- ------- 

     On the VMS operating system with DEC VAX VMS C the programs  will  compile

without  alteration, except that we have to add some extra routines because the

"%hd" format in printf and fprintf does not work.  These extra routines are  in

the file VAXFIX.C.  The commands for compiling a typical program (DNAPARS) are: 





Once you use this "$ DEFINE" statement during a given interactive session,  you

need  not  repeat  it  again as the symbol "LNK$LIBRARY" is thereafter properly

defined.  The compilation process leaves a file DNAPARS.OBJ in your  directory:

this  can  be  discarded.  The executable program is named DNAPARS.EXE.  To run

the program one then uses the command: 


     The  compiler  defaults  to  the  filenames  "INFILE.",  "OUTFILE.",   and

"TREEFILE.".   If  the  input  file  "INFILE."  does not exist the program will

prompt you to type in its name.  Note that some commands on VMS such  as  "TYPE

OUTFILE"  will  fail  because the name of the file that it will attempt to type 

out will be not "OUTFILE." but "OUTFILE.LIS".  To get it to type the right file

you would have to instead issue the command "TYPE OUTFILE.". 

     Some of the programs come in several pieces that have to be  compiled  and

linked together.  For example, DNAML comes in two pieces, dnaml.c and dnaml2.c.

To compile  them  and  link  the  resulting  object  files  together  into  one

executable, use the commands: 






This will make an executable called DNAML.EXE plus two ".OBJ" files that can be

discarded.   Note that when a LINK command is issued the name of the first file

(in this case DNAML) becomes the name of the ".EXE" file that  is  produced  by

the linker. 

     To make it easier to compile all of the programs on VMS systems,  we  have

supplied  a command file, "compile.com" that will do this.  If you install that

file and issue the command "@compile" it will  compile  all  of  the  programs.

However  it  is  recommended  that  you  also  know how to recompile individual

programs so that they can be altered to your purposes. 

     The programs DRAWGRAM and DRAWTREE both use  routines  in  drawgraphics.c.

To compile (for example) DRAWGRAM, use: 






which  will create a file called DRAWGRAM.EXE, plus two ".OBJ" files.  When you

run  DRAWGRAM  you  must have a font file present in your directory, as well as

the tree file.  If they are not found under their  default  names  the  program

will  prompt  you  for  these.   When  you are using the interactive previewing

feature of DRAWGRAM (or DRAWTREE)  on  a  Tektronix  or  DEC  ReGIS  compatible

terminal, you will want before running the program to have issued the command: 


so that you do not run into trouble from the  VMS  line  length  limit  of  255

characters or the filtering of escape characters. 

     Some later versions of  Digital's  VAX  VMS  operating  system  have  a  C

compiler that no longer needs the VAXFIX patch.  If so, follow the instructions

below for OpenVMS and all will be well. 

OpenVMS DEC Alpha systems

------- --- ----- ------- 

     The OpenVMS operating system on Digital AlphaStations  and  other  Digital

Alpha AXP computers has many of the properties of the VAX VMS systems mentioned

above except for one important difference: it does not need any of the VAXFIX.C

corrections.   Thus  the  programs should be compiled without this.  Remove all

mention of VAXFIX from COMPILE.COM (the lines compiling it and the  linking  of

it).    Also   take   PHYLIP.H   and   comment   out   the   section  in  which 

"vax_printf_is_broken"  is  defined.   Then  the  compilation  should   proceed




     A number of people (F. James Rohlf,  Kent  Fiala,  Shan  Duncan,  and  Ron

DeBry),  succeeded  in various ways in adapting the Pascal version of PHYLIP to

several models of Crays.  Recently Cray has been adopting Unicos, a Unix clone,

as  the operating system for its machines, and this means the Unix instructions

should work for compiling the programs on Crays. 

     However, although the underlying algorithms of most programs, which  treat

sites independently, should be amenable to vector processors, there are details

of the code which might best be changed.  In particular  within  the  innermost

loops  of  the  programs  there  are  often scalar quantities that are used for

temporary bookkeeping.  These quantities, such as sum1, sum2, zz, z1,  yy,  y1,

aa,  bb,  cc,  sum,  and  denom  in  procedure  makenewv  of DNAML (and similar

quantities in procedure nuview) are there  to  minimize  the  number  of  array

references.   For  vectorizing  compilers such as the Cray compilers it will be

better to replace them by arrays so that processing can occur simultaneously. 

IBM Mainframes running CMS

--- ---------- ------- --- 

     The following information applies not only to IBM mainframes, but to  IBM-

compatible  mainframes  such  as Amdahls, Fujitsus, Hitachis, and ICLs when they

run IBM operating systems or IBM-compatible operating  systems.   It  does  not

apply  to  IBM  mainframes running AIX (IBM's version of Unix) as for those one

can simply use the Unix instructions above without modification. 

Because IBM is IBM, it tried to impose the EBCDIC character code on the  world.

There  are  good  arguments  for and against EBCDIC; in any case, the ASCII (or

ISO) code is winning out.  I have chosen to  distribute  PHYLIP  in  the  ASCII

character  code,  as  more  likely  to  be  readable  on  more  machines.  Some

characters in ASCII have no equivalent in EBCDIC and  get  arbitrarily  changed

when  my  ASCII  files  are  read  into  an  EBCDIC machine.  You may find some

characters which look strange when viewed on a 3270 terminal on a  CMS  system,

but we have found none that cause trouble for the compiler. 

     Andrew Keeffe was asked to investigate how to compile  the  C  version  of

PHYLIP on our IBM 3090 system, and here is what he has found. 

     These are the procedures for compiling the phylip package in C on  an  IBM


     These instructions were developed using IBM C/370 on an IBM  3090  running

VM/XA CMS 5.6 Service Level 201. 

     If you fetch  PHYLIP  directly  as  an  ftp  binary  transfer,  getting  a

compressed  tar  archive  file,  as  available from our machine, we do not know

whether there is an "uncompress" and a "tar" utility available on CMS to extract

the files from the archive and translate them from ASCII to EBCDIC.  You should

ask your computer consultants about that.  Alternatively, you could  fetch  the

files to a PCDOS or Unix machine, extract the archives there, and then move the

resulting text files for the source code and documentation to the  CMS  system.

If  you  that,  after establishing the connection between the IBM and the other

host, type will translate the text files properly. 

     CMS prefers the names of files to have a minimum of two parts, called  the

filename  (abbreviated  fn)  and  the filetype (abbreviated ft), separated by a

space.  We have chosen "data" as the filetype, so that "infile" becomes "infile

data", "outfile" becomes "outfile data" and so forth. 

     All commands that you give to the host are shown in UPPER CASE.   You  can

type them in upper or lower case; CMS does not care. 

     Before compiling, give these commands to CMS: 

        SETUP C370


It would make sense to put these  commands  in  your  profile  exec  until  the

compiling and linking is complete. 

To compile a single program, such as dnapars.c: 

        CC DNAPARS 

If there are no errors, the compiler will produce a file with the same filename

and a filetype of 'text', DNAPARS TEXT in this case.  Now give these commands: 



The genmod command generates an executable module file (DNAPARS  MODULE)  which

may  be  invoked by typing its name on the command line.  Use this procedure to

compile all of the phylip programs except dnaml, dnamlk, restml, drawgram,  and

drawtree. 


The source files for dnaml, dnamlk, and restml have been split into two  parts.

To compile one of these programs, give these commands: 

        CC DNAML

        CC DNAML2



Proceed similarly for dnamlk and restml. 

     The draw programs, drawgram and drawtree, both depend on common code which

is  stored in drawgraphics.c and drawgraphics.h.  These names will be truncated

to DRAWGRAP C and DRAWGRAP H on the CMS system.  The contents of the files  are

not affected. 

Compile the drawgraphics code: 

        CC DRAWGRAP 

Compile and link the draw programs: 







If you are having trouble getting the programs running on your machine, contact

me.   If I can't help, I can at least find out whether there is anyone else who

has adapted them to the same machine and put you in touch with them. 

Other Computer Systems

----- -------- ------- 

     As you can see from the  variety  of  different  systems  on  which  these

programs  have  been  successfully  run,  there  are no serious incompatibility

problems with most computer systems.  PHYLIP in various  past  Pascal  versions

has  also  been  compiled on 8080 and Z80 CP/M systems, Apple II systems running

UCSD Pascal, a variety of minicomputer systems such  as  DEC  PDP-11's  and  HP

1000's,  CDC  Cyber  systems,  and  so  on.   We  hope  gradually to accumulate

experience on a wider variety of C compilers.  If you succeed in compiling  the

C  version  of  PHYLIP on a different machine or a different compiler, I would

like to hear the details so that I can include the  instructions  in  a  future

version of this manual. 

                          FREQUENTLY ASKED QUESTIONS 

(1)  "If I copied PHYLIP from a friend without you knowing,  should  I  try  to

keep  you from finding out?".  No.  It is to your advantage and mine for you to

let me know.  If you did not get PHYLIP "officially" from me  or  from  someone

authorized  by me, but copied a friend's version, you are not in my database of

users.   You  probably  also  have  an  old  version  which  has   since   been

substantially  improved  (see  the beginning of this main document file for the

date on which this version was  released).   I  don't  mind  you  "bootlegging"

PHYLIP (it's free anyway, and that saves me the work of writing diskettes), but

you should realize that you may have an outdated version.  You may be  able  to

get  the  latest  version  just  as  quickly over Internet.  You can read about

subsequent bug fixes in the electronic news bulletins the  person  you  got  it

from  may  (or may not) have subscribed to.  It will help both of us if you get

onto my mailing list.  If you are on it, then I will give your  name  to  other

nearby  users  when  they get a new copy, and they are urged to contact you and

update  your  copy.   (I  benefit  by  getting  a  better  feel  for  how  many

distributions  there have been, and having a better mailing list to use to give

other users local people to contact).  Send me  your  name  and  address  (five

lines  maximum), and your phone number, with the number of the version that you

have, plus the type of your computer, operating system, and C compiler, so that

I  can add you to the address list.  Note also the listserver information which

you can get, which provides news about PHYLIP  by  electronic  mail.   This  is

described in the next to last section of this document. 

(2)  "How do I make a citation  to  the  PHYLIP  package  in  the  paper  I  am

writing?"   One way is like this:

   Felsenstein, J.  1993.  PHYLIP (Phylogeny Inference Package) version 3.5c.

      Distributed by the author.  Department of Genetics, University of

      Washington, Seattle.

or if the editor for whom you are writing insists that the citation must be  to

a  printed  publication,  you  could cite a notice for version 3.2 published in


   Felsenstein, J.  1989.  PHYLIP -- Phylogeny Inference Package (Version 3.2).

      Cladistics  5: 164-166.

For a while a printed version of the PHYLIP documentation was available and one

could  cite that.  This is no longer true.  Other than that, this is difficult,

because I have never written a paper announcing  PHYLIP!   My  1985b  paper  in

Evolution (see the References section below) on the bootstrap method contains a

one-paragraph Appendix describing the availability of this  package,  and  that 

can  also  be  cited  as  a  reference  for  the  package, although it has been

distributed since 1980 while the bootstrap paper is 1985.   A paper  on  PHYLIP

is needed mostly to give people something to cite, as word-of-mouth, references

in other people's papers, and electronic newsgroup  postings  have  spread  the

word about PHYLIP's existence quite effectively. 

(3) "How do I bootstrap? Why has  DNABOOT  disappeared?"   DNABOOT,  BOOT,  and

DOLBOOT,  the  previous  parsimony-based  bootstrap programs, have been removed

from the package as there is now a  more  general  way  of  bootstrapping.   It

involves  running  SEQBOOT  to make multiple bootstrapped data sets out of your

one data set, then running one of the tree-making programs  with  the  Multiple

data  sets option to analyze them all, then running CONSENSE to make a majority

rule consensus tree from the resulting tree file.  Read  the  documentation  of

SEQBOOT  to  get  further information.  Before, only parsimony methods could be

bootstrapped.  With this new system almost any of the  tree-making  methods  in

the package can be bootstrapped.  It is somewhat more tedious but you will find

it much more rewarding. 
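
The resampling step at the heart of this procedure can be sketched in a few
lines of Python.  This is only an illustration of drawing sites with
replacement; it is not SEQBOOT's own code, and the function name
bootstrap_sites and the dictionary layout of the alignment are assumptions
made for the example.

```python
import random

def bootstrap_sites(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment maps each species name to its aligned sequence (all of
    the same length).  Sites (columns) are drawn with replacement, and
    the same drawn columns are used for every species.
    """
    nsites = len(next(iter(alignment.values())))
    # Draw nsites column indices with replacement.
    cols = [rng.randrange(nsites) for _ in range(nsites)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}

# A made-up three-species data set, used only to exercise the function.
alignment = {"Alpha": "AACGT", "Beta": "AACGA", "Gamma": "CACGT"}
rng = random.Random(1995)   # fixed seed so that the run is repeatable
replicates = [bootstrap_sites(alignment, rng) for _ in range(100)]
```

Each replicate would then be analyzed as one of the multiple data sets, and
the resulting trees summarized by a consensus tree.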

(4)  "How do I specify a multi-species outgroup with your parsimony  programs?"

It's  not  a  feature  but  is  not too hard to do in many of the programs.  In

parsimony programs like MIX, for which the W (Weights) and A (Ancestral states)

options are available, and weights can be larger than 1, all you need to do is:

  (a) In MIX, make up an extra character with states 0 for  all  the  outgroups

     and  1  for all the ingroups.  If using DNAPARS the ingroup can have (say)

     "G" and the outgroup "A".

  (b) Assign this character an enormous weight (such as Z for 35) using  the  W

     option, all other characters getting weight 1, or whatever weight they had

     before.

  (c) If it is available, use the A (Ancestral states) option to designate that

     for  that  new  character the state found in the outgroup is the ancestral

     state.

  (d) In MIX do not use the O (Outgroup) option.

  (e) After the tree is found, the designated ingroup  should  have  been  held

     together  by the fake character.  The tree will be rooted somewhere in the

     outgroup (the program may or may not have a preference for  one  place  in

     the  outgroup  over  another).  Make sure that you subtract from the total

     number of steps on the tree all steps in the new character. 

     In programs like DNAPARS, you cannot use this method as weights  of  sites

     cannot be greater than 1.  But you can do an analogous trick, by adding  a

     largish number of extra sites to the data, with one nucleotide state ("A")

     for the ingroup and another ("G") for the outgroup.  You will then have to

     use RETREE to manually reroot the tree in the desired place. 
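
Steps (a) and (b) for a MIX-style data set can be sketched as follows.  This
is a hedged illustration, not part of PHYLIP: the function name
add_outgroup_character, the dictionary layout of the data, and the species
names are all assumptions made for the example.  The only facts taken from the
text above are the 0/1 coding of the fake character and the use of the weight
symbol Z for 35.

```python
def add_outgroup_character(data, outgroup, old_weights=None):
    """Append a fake 0/1 character separating outgroup from ingroup.

    data maps species names to strings of 0/1 character states.
    Outgroup species get state 0 and all other species state 1 in the
    new character, which receives the weight symbol "Z" (35).
    """
    nchars = len(next(iter(data.values())))
    if old_weights is None:
        old_weights = "1" * nchars   # every real character has weight 1
    new_data = {name: states + ("0" if name in outgroup else "1")
                for name, states in data.items()}
    return new_data, old_weights + "Z"

# Made-up data: Gamma and Delta are to be the outgroup.
data = {"Alpha": "1100", "Beta": "1110", "Gamma": "0011", "Delta": "0001"}
new_data, weights = add_outgroup_character(data, {"Gamma", "Delta"})
```

The new data and the weights string still have to be written into the input
file in the format that the program expects.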

(5) "How do I force certain groups to remain  monophyletic  in  your  parsimony

programs?"   By  the same method, using multiple fake characters, any number of

groups of species can be forced to be  monophyletic.   In  MOVE,  DOLMOVE,  and

DNAMOVE  you  can  specify  whatever  outgroups  you want without going to this


(6) "How can I reroot one of the trees written out by PHYLIP?"  Use the program

RETREE.  But keep in mind whether the tree inferred by the original program was

already rooted, or whether you are free to reroot it. 

(7) "Why doesn't NEIGHBOR read my DNA sequences correctly?".  Because it  wants

to  have as input a distance matrix, not sequences.  You have to use DNADIST to

make the distance matrix first. 

(8) "What do I do  about  deletions  and  insertions  in  my  sequences?"   The

molecular  sequence  programs  will  accept  sequences  that have gaps (the "-"

character).  They do various things with them,  mostly  not  optimal.   DNAPARS 

counts  "gap"  as  if it were a fifth nucleotide state (in addition to A, C, G,

and T).  Each site counts one change when a  gap  arises  or  disappears.   The

disadvantage  of  this  treatment is that a long gap will be overweighted, with

one event per gapped site.  So a gap of 10 nucleotides will count as  being  as

much  evidence  as  10  single site nucleotide substitutions.  If there are  no

overlapping gaps, one way to correct this is to recode the first  site  in  the

gap  as "-" but make all the others be "?" so the gap only counts as one event.

Other programs such as DNAML and DNADIST count gaps as  equivalent  to  unknown

nucleotides  (or  unknown  amino  acids) on the grounds that we don't know what

would be there if  something  were  there.   This  completely  leaves  out  the

information  from  the presence or absence of the gap itself, but does not bias

the gapped sequence to be close to or far from other gapped or ungapped sequences.
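
The recoding suggested above for DNAPARS (first site of a gap kept as "-", the
remaining sites of that gap changed to "?") can be sketched like this.  The
function name recode_gaps is an assumption made for the example, and the
sketch treats one sequence at a time, so it is appropriate only when gaps in
different sequences do not overlap.

```python
def recode_gaps(seq):
    """Keep the first "-" of each run of gaps and turn the rest into "?"."""
    out = []
    in_gap = False
    for ch in seq:
        if ch == "-":
            # Only the first site of a run stays a gap.
            out.append("?" if in_gap else "-")
            in_gap = True
        else:
            out.append(ch)
            in_gap = False
    return "".join(out)

print(recode_gaps("AC----GT-A"))   # prints AC-???GT-A
```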


(9) "Why don't your parsimony programs  print  out  branch  lengths?"   Because

there  are  problems  defining  the branch lengths.  If you look closely at the

reconstructions of the states of the hypothetical ancestral  nodes  for  almost

any  data  set  and  almost  any  parsimony method you will find some ambiguous

states on those nodes.  There is then usually an ambiguity as to  which  branch

the  change  is  actually  on.  Other parsimony programs resolve this in one or

another arbitrary fashion, sometimes with the user specifying how (for example,

methods  that push the changes up the tree as far as possible or down it as far

as possible).  I have preferred to leave it  to  the  user  to  do  this.   Few

programs  available  from  others  currently  correct  the  branch  lengths for

multiple changes of state that may have overlain each other.  One possible  way

to  get  branch  lengths  with  nucleotide  sequence  data  is to take the tree

topology that you got, use RETREE to convert  it  to  be  unrooted,  prepare  a

distance matrix from your data using DNADIST, and then use FITCH with that tree

as User Tree and see what branch lengths it estimates. 

(10) "Why can't your programs handle unordered multistate  characters?"   Well,

they  can if they are 4-state characters whose states are A, C, G, and T (or U)

because then one can use the DNA sequence parsimony programs.  But  in  general

the discrete characters parsimony programs can only handle two states, 0 and 1.

This is mostly because I have not yet had time to modify them to do so  --  the

modifications would have to be extensive.  Ultimately I hope to get these done,

but in the meantime the best I can do is suggest that you either use one of the

excellent parsimony programs produced by others (PAUP or Hennig86, for example)

or if you have four or fewer states recode your states to look like nucleotides

and use the parsimony programs in the molecular sequence section of PHYLIP. 

(11) "Where can I get a printed version of  the  PHYLIP  documents?"   For  the

moment,  you  can  only  get  a  printed  version by printing it yourself.  For

versions 3.1 to 3.3 a printed version was sold by Christopher Meacham  and  Tom

Duncan,  then  at  the  University Herbarium of the University of California at

Berkeley.  But they have had to discontinue this as it was too much work.   You

should  be  able to print out the documentation files on almost any printer and

make yourself a printed version of whichever of them you need. 

(12) "Why have I been dropped from your newsletter mailing list?"  You haven't.

The  newsletter  was  dropped.  It simply was too hard to mail it out to such a

large mailing list.  The last issue of the newsletter  was  Number  9  in  May,

1987.   I  am  hoping  that  the Listserver News Bulletins will replace the old

PHYLIP Newsletter.  If you have electronic mail access  you  should  definitely

sign  up  for  these  bulletins.  For details see the section on the Listserver

News Bulletins below. 

(13) "How many copies of PHYLIP have been distributed?"  Currently (July, 1995)

I have a bit over 2700 registered installations worldwide.  Of course there are

many more people who have got copies from friends.  PHYLIP is the  most  widely

distributed  phylogeny  package.   PAUP  is  catching  up  in terms of official 

registrations, but PHYLIP is probably far ahead in terms of numbers  of  actual

copies  out  there.  In terms of phylogenies published, however, PAUP is ahead,

but PHYLIP is gaining on it.  In recent years  magnetic  tape  distribution  of

PHYLIP  has declined precipitously, electronic mail distribution is decreasing,

and there has been a slow decrease of diskette distributions.  But all this has

been  more  than  offset  by a huge explosion of distributions by anonymous ftp

over Internet (a rate of about 6 ftp sessions per day, at the moment).  Because

some  people  who  get  the  package  by anonymous ftp forget to register their

copies, it is hard to estimate how many people have got it this way. 


                      "Why didn't it occur to you to ... 

     (1) ... write these programs in Pascal?"  These programs  started  out  in

Pascal  in  1980.   In  1993  we  released  both  Pascal  and  C versions.  All

future versions will be C-only.  I make fewer mistakes in Pascal  and  do  like

the language better than C, but C has overtaken Pascal and Pascal compilers are

starting to be hard to  find  on  some  machines.   Also  C  is  a  bit  better

standardized  which  makes  the  number  of modifications a user has to make to

adapt the programs to their system much less. 

     (2) ... forget about all those inferior systems and  just  develop  PHYLIP

for  Unix?".  This is self-answering, since the same people first said I should

just develop it for Apple II's, then for CP/M Z-80's, then for IBM  PCDOS,  and

now  they're  starting to tell me to just develop it for Macintoshes or for Sun

workstations.  If I had listened to them and done any one  of  these,  I  would

have  had  a  very hard time adapting the package to any of the other ones once

these folks changed their mind! 

     (3) ... write these programs in PROLOG (or Ada, or Modula-2, or SIMULA, or

BCPL,  or  PL/I,  or APL, or LISP)?" These are all languages I have considered.

All have advantages, but they are not really spreading (C is). 

     (4) ... include in the package a program to do the Distance Wagner method,

(or  successive  approximations  character  weighting, or transformation series

analysis)?"  In most cases where I have  not  included  other  methods,  it  is

because  I  decided  that  they had no substantial advantages over methods that

were included (such as the programs FITCH, KITSCH, NEIGHBOR, the  T  option  of

MIX  and DOLLOP, and the "?" ancestral states option of the discrete characters

parsimony programs). 

     (5) ... include in the package  ordination  methods  and  more  clustering

algorithms?"   Because  this  is  NOT  a clustering package, it's a package for

phylogeny estimation.  Those are different tasks with different objectives  and

mostly  different  methods.   Mary Kuhner has, however, included in NEIGHBOR an

option for UPGMA clustering, which will be very similar to KITSCH in results. 

     (6) ... include in  the  package  a  program  to  do  nucleotide  sequence

alignment?"   Well,  yes,  I should have, and this is scheduled to be in future

releases.  But multiple sequence alignment programs, in the era after  Sankoff,

Morel,  and  Cedergren's  1973  classic paper, need to use substantial computer

horsepower to estimate the alignment and the tree together.  So I will be  slow

getting  this  into the package and in the meantime you may want to investigate

ClustalV or TreeAlign. 

     (7) ... send me the programs over  the  electronic  mail  network  I  use,

BUTTERFLYNET?"   Well,  I  am trying to.  Maybe there is a BUTTERFLYNET gateway

hanging off FISHNET, which hangs off HAIRNET, which ...    I  am  connected  to

Internet,  which connects to Bitnet.  I can mail to Bitnet (EARN, NetNorth) and 

to UUCP networks.  Keep in mind that the resulting  files  take  up  about  2.2

Megabytes  and that if you are not going to use them on the machine I send them

to, you will have to download the files to your other machine.   Also  in  some

cases  networks  and  gateways lose or truncate files (these can be up to about

60K long).  So sometimes diskette or tape are  a  better  medium.   I  hope  to

continually  expand  and solidify network distribution.  For a couple of years,

PHYLIP has been available over Internet by "anonymous  ftp"  from  my  machine,

evolution.genetics.washington.edu.  You can start by fetching

file "Read.Me" from directory pub/phylip.  My  electronic  mail  addresses  are

given  at  the  end of this document.  Contact me by electronic mail if you are

interested in getting PHYLIP over your network but cannot get ftp to work. 

     (8) ... let me log in to your computer in Seattle and copy the  files  out

over  a phone line?"  No thanks.  It would cost you for over two hours of long-

distance telephone time, plus a half hour of my time and yours in which  I  had

to explain to you how to log in and do the copying. 

     (9) ... send me a  listing  of  your  program?"   Damn  it,  it's  not  "a

program",  it's 30 programs, in a total of 87 files.  What were you thinking of

doing, having 1800-line programs typed in by slaves at your end?  If  you  were

going  to go to all that trouble why not try network transfer or diskettes?  If

you have these then you can print out all the listings you want to and add them

to  the  huge  stack of printed output in the corner of your office.  (This and

the following two questions,  once  common,  are  finally  disappearing,  I  am

pleased to report). 

     (10) ... write a magnetic tape in our computer  center's  favorite  format

(inverted  Lithuanian EBCDIC at 998 bpi)?"  Because the ANSI standard format is

the most widely used one, and even though your computer center may  pretend  it

can't read a tape written this way, if you sniff around you will find a utility

to read it.  It's just a LOT easier for me to let you do that work.  If I tried

to put the tape into your format, I would probably get it wrong anyway. 

     (11) ... give us a version of these in FORTRAN?"  Because the programs are

FAR  easier  to  write and debug in C or Pascal, and cannot easily be rewritten

into FORTRAN (they make extensive use of recursive calls  and  of  records  and

pointers).  In any case, C is widely available.  If you don't have a C compiler

or don't know how to use it, you are going to have to learn a language  like  C

or Pascal sooner or later, and the sooner the better. 

                        NEW FEATURES IN RECENT VERSIONS 

     Version 3.5 has many new features.  They include: 

1. The programs now exist in C as well as in Pascal.  In  the  future  we  will

support  only the C versions, and as of now will not make any more improvements

in the Pascal version.  It will cease to be distributed with the  next  release

of  PHYLIP.   A  Makefile has been included in the distribution to simplify the

problems of compiling the package.  The existence  of  a  C  compiler  on  most

workstations  means  that we have ceased to directly distribute executables for

workstations, as people can easily create  them  themselves  by  following  our


2. All programs now have had the upper limits on the  numbers  of  species  and

numbers  of  sites  (or characters) removed.  They instead use the "malloc" and

"free" functions of C to try to allocate as much memory as they need.  If  they

fail  to  find  it  they  will complain, and you will have to look for a bigger

machine, or install more memory, or remove other jobs that  are  competing  for

the memory.  We no longer have to guess how large a computer you have and where 

you want to put the tradeoff between species and sites. 

3. The program SEQBOOT has now fully superseded the  former  programs  DNABOOT,

BOOT,  and  DOLBOOT, which have been withdrawn.  SEQBOOT also now can carry out

Archie-Faith permutation of characters across species. 

4. The DNA likelihood programs DNAML and DNAMLK now have a  revised  Categories

option that allows them to cope with rate variation from site to site.  Instead

of the user specifying in advance the rate category of  each  site,  they  need

only  specify  how  many categories there are, what their rates are, what their

relative probabilities are, and how long on average the patches of sites sharing

a single rate are along the molecule.  The program then computes the likelihood

allowing for all of these,  and  adding  up  over  all  possibilities  of  rate

patterns,  without  being  dependent  on assuming that it has inferred rates at

individual sites correctly.  This should go far to address the  criticism  that

maximum likelihood assumes constancy of rate at all sites. 
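
The idea of adding up over all possible rate patterns without enumerating them
can be sketched with a standard "forward" recursion along the sequence.  This
is not DNAML's code.  It assumes a simple model in which, between adjacent
sites, the rate category is kept with probability 1 - lam and redrawn from the
category probabilities with probability lam, with lam taken as one over the
mean patch length.  The per-site likelihoods for each category are assumed to
have been computed elsewhere, and all names are hypothetical.

```python
def likelihood_over_rate_patterns(site_likes, probs, mean_patch):
    """Sum the likelihood over every assignment of categories to sites.

    site_likes[s][c] is the likelihood of the data at site s if that
    site evolves at the rate of category c (computed elsewhere).
    probs[c] is the probability of category c.  Between adjacent sites
    the category is kept with probability 1 - lam and redrawn from
    probs with probability lam, where lam = 1 / mean_patch.
    """
    lam = 1.0 / mean_patch
    ncat = len(probs)
    # forward[c]: likelihood of the sites so far, current site in c
    forward = [probs[c] * site_likes[0][c] for c in range(ncat)]
    for likes in site_likes[1:]:
        total = sum(forward)
        forward = [((1.0 - lam) * forward[c] + lam * probs[c] * total)
                   * likes[c] for c in range(ncat)]
    return sum(forward)

# Two sites, two equally probable categories, mean patch length 2.
example = likelihood_over_rate_patterns([[0.1, 0.2], [0.3, 0.4]],
                                        [0.5, 0.5], 2.0)
```

The work grows only linearly with the number of sites, even though the number
of possible rate patterns grows exponentially.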

5. A new program PROTDIST has been added  to  compute  distance  matrices  from

protein  sequences,  using  several different methods.  This will allow protein

sequence data to be analyzed by distance matrix methods as  well  as  parsimony

methods.

6.  A  new  program,  RETREE,  has  been  added  to  allow  users  easily   and

interactively  to  reroot  trees, flip branches around, change or remove branch

lengths, change species names, and so on. 

7. Programs that estimate a tree with branch lengths can now all read in a user

tree that has branch lengths, and can be told to use these lengths rather than

re-estimating them (this was already possible for DNAML and DNAMLK).  In

addition, the ones that estimate an unrooted tree (DNAML, FITCH, RESTML and

CONTML) can read in a tree with branch lengths on some branches and not on

others, and can be told to hold the ones read in constant while iterating the

rest.  Thus you can, for example, specify that a certain branch must have

length zero. 

8. DRAWTREE and DRAWGRAM can now write out a PICT file that can be read by  the

MacDraw  drawing  program.   They can also write out the file format for the X-

windows drawing program XFIG, and the input format for  the  freely-distributed

ray  tracing  program RAYSHADE (for trees seen in 3 dimensions floating above a

landscape).  In addition they allow fonts to be  specified  for  species  names

when  a  Postscript  printer is being used, and they can also make an X-windows

X-bitmap file.  DRAWTREE has a new option that allows the program  to  (slowly)

calculate  node  positions  so  as  to make them avoid each other better.  Both

programs now, when plotting on raster devices such as dot-matrix printers,  use

round pens to make the lines smoother, and are faster at drawing the lines. 

9. DNADIST now computes its distances much more quickly.  It also  can  compute

the Nei and Jin (1991) distance that allows for rate variation among sites. 

10. The programs that estimate trees by adding species sequentially to  a  tree


DOLLOP) now allow the user to specify that multiple tries will  be  made  with

different input orders of species (using the Jumble option) with only the trees

tied for best overall being reported.  The trees found will be those  that  are

tied  for  best among all of those found by all these runs, not the trees found

as best by each run.  This improves the chances of finding the best tree. 
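
The bookkeeping described here (keeping the trees tied for best over all runs,
rather than the best of each run) can be sketched as follows.  This is an
illustration only.  The function best_over_jumbles and the search argument are
hypothetical; search stands in for one sequential-addition run and is assumed
to return (score, tree) pairs with lower scores being better.

```python
import random

def best_over_jumbles(species, search, njumbles, seed):
    """Run search on several random input orders; keep the overall ties.

    search(order) stands in for one sequential-addition run and must
    return a list of (score, tree) pairs, lower scores being better.
    """
    rng = random.Random(seed)
    best_score = None
    best_trees = []
    for _ in range(njumbles):
        order = list(species)
        rng.shuffle(order)               # a new random input order
        for score, tree in search(order):
            if best_score is None or score < best_score:
                # Strictly better: discard everything kept so far.
                best_score, best_trees = score, [tree]
            elif score == best_score and tree not in best_trees:
                best_trees.append(tree)  # a new tree tied for best
    return best_score, best_trees
```

A tree that was the best of its own run is dropped as soon as any other run
finds a better one.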

11.  A program COALLIKE was added to compute likelihood functions for 4Nu,  the

product  of  4 times the effective population size times the mutation rate, for

samples of genes from a single isolated  population,  where  the  program  read

trees  that had been sampled from the data by bootstrapping followed by maximum 

likelihood.  This method was described by  me  in  a  paper  in  late  1992  in

Genetical  Research.   Subsequent  work by Richard Hudson and our lab has shown

the method to be biased.  It has been withdrawn from  the  package  in  version

3.57.   It  is replaced by a program "coalesce" in a new package, LAMARC, which

is available from our ftp server. 

Version 3.4 also had many new features.  They included: 

1. All programs were given interactive menus which allow the user  to  see  and

alter  option  settings.   The  programs read from a file INFILE and write to a

file OUTFILE, as well as to a treefile TREEFILE.  The  result  should  be  much

easier  for novice users to deal with.  Most of the options which once were set

by altering the input file can now be selected using the  menu.   Only  options

that  require separate information for each character or site, such as Weights,

Ancestors, Factors,  and  the  Categories  option  continued  to  require  that

information be entered into the input file (although user-defined trees are put

there also). 

2. The molecular sequence programs now allowed either interleaved or sequential

sequence input (i.e. sequences put in in "aligned" form or by having all of one

sequence followed by all of another).  The choice is made using the interactive

menu.

3. Three new programs were  added:   NEIGHBOR  carried  out  Saitou  and  Nei's

neighbor-joining  method  for  distance  matrix  data which is much faster than

FITCH and KITSCH and should be able to handle much larger data sets.   It  also

carried out the UPGMA clustering method.  SEQBOOT allowed the user to bootstrap

nucleotide sequence  data  sets,  protein  sequence  data  sets,  or  discrete-

characters  data  sets  and  write  out  to  a file the multiple data sets that

result.  CONTRAST accepted a continuous-characters data set  and  a  series  of

user  trees,  and wrote out the series of contrasts for each character that are

independent under a Brownian motion model of character evolution,  as  well  as

regressions, correlations, and covariances between them. 

4. All of the programs that inferred trees now  accepted  multiple  data  sets.

This  allowed  us  to  use  SEQBOOT  together  with  this  feature  to  analyze

bootstrapped data sets and find different trees  for  the  different  bootstrap

replicates.   Their variation could be summarized by the consensus tree program

CONSENSE. Thus almost everything in this package could now be bootstrapped. 

5. A serious error, which made the DNA likelihood programs and DNADIST give

incorrect results when the Categories option was used with more than one

category of rates, was fixed in version 3.31.  Runs of these programs that used

more than one category before that should be rerun. 

6. Almost all programs now printed out trees in the "phenogram"  form  so  that

they grew left-to-right, rather than in the triangular diagram used before. 

7. The tree-plotting programs DRAWGRAM and DRAWTREE now supported the  Hewlett-

Packard  Laserjet  printers and also could produce output files compatible with

the PC-Paint drawing program.  The code for  placement  of  interior  nodes  in

DRAWGRAM  was corrected, and preview of trees using Tektronix graphics was made

easier by having it clear the screen more often. 

8. The DNA likelihood program DNAML now ran about 60% faster. 

9. The restriction sites likelihood program RESTML now  allowed  for  the  data

arising from digests with multiple enzymes. 

                       COMING ATTRACTIONS, FUTURE PLANS 

     There are some obvious deficiencies in this version.  Some of these  holes

will be filled in the next few releases (3.6, 3.7, etc.).  They include: 

1.  A program to align molecular  sequences  on  a  predefined  User  Tree  may

ultimately be included.  This will allow alignment and phylogeny reconstruction

to proceed iteratively by successive runs of two programs, one  aligning  on  a

tree  and  the  other  finding  a  better tree based on that alignment.  In the

shorter run a simple two-sequence alignment program may be included. 

2.  An interactive "likelihood explorer" for DNA  sequences  will  be  written.

This  will  allow,  either with or without the assumption of a molecular clock,

trees to be varied interactively so that the user can get a  much  better  feel

for the shape of the likelihood surface.  Likelihood will be able to be plotted

against branch lengths for any branch. 

3.  The DNAML and  DNAMLK  programs  will  reinstate  the  previous  Categories

option,  where  the  user  specified  categories of rates of evolution for each

site, while also retaining the present one, which infers them.  The hope is  to

allow  for variation in rate in 1st, 2nd and 3rd positions in a coding sequence

(these being identified by the user) while  also  allowing  for  autocorrelated

rates of evolution in adjacent codons. 

4. If possible we will  find  some  way  of  correcting  for  purine/pyrimidine

richness  variations  among  species,  within  the  framework  of  the  maximum

likelihood programs.  That the maximum likelihood programs do  not  allow  for

base composition variation is their major limitation at the moment. 

5. Inclusion of some kind of protein sequence maximum likelihood program is  an

obvious  need  (right  now  we  have  Adachi  and  Hasegawa's  program  in  the

Unsupported Division). 

6. The Categories option of DNAML and DNAMLK will be generalized to  allow  for

rates  at  sites to gradually change as one moves along the tree, in an attempt

to implement Fitch and Markowitz's (1970) notion of "covarions". 

7. Obviously we need to start thinking about a more visual X windows interface,

but only if that can be used on most systems. 

8.  Program PENNY and its relatives will be improved so as to run faster and find

all most parsimonious trees more quickly. 

9.  A more sophisticated compatibility program should be  included,  if  I  can

find one. 

10.  An "evolutionary clock" version of CONTML will be done, and the  same  may

also be done for RESTML. 

12. We hope gradually to generalize the tree structures  in  the  programs  to

infer multifurcating trees as well as bifurcating ones. 

13. We hope to economize on the size of  the  source  code,  and  enforce  some

standardization  of  it,  by putting frequently used routines in a library from

which they can be linked into various programs.  This  will  enforce  a  rather

complete standardization of our code. 

14. We may decide to gradually move our code to  an  object-oriented  language,

most likely C++.  One could describe the language that version 3.4 was written

in as "Pascal", version 3.5 as "Pascal written in C", version 4.0 as "C written

in C", and maybe version 4.1 as "C++ written in C" and then 4.2 as "C++ written 

in C++".  At least that scenario is one possibility. 

     Much of the  future  development  of  the  package  will  be  in  the  DNA

likelihood  programs  and  the  distance  matrix programs.  This is for several

reasons.  First, I am more interested in those problems.  Second, collection of

molecular  data is increasing rapidly, and those programs have the most promise

for future development for those data. 

                                   REFERENCES 


     In the documentation files that follow I frequently refer to papers in the

literature.   In  order  to  centralize  the  references they are given in this

section.  If you want to find further papers beyond these, my Quarterly  Review

of  Biology review of 1982 and my Annual Review of Genetics review of 1988 list

many further references.  The chapter by David Swofford and Gary  Olsen  (1990)

is also an excellent review of the issues in phylogeny reconstruction. 

Adams, E. N.  1972.  Consensus techniques and the comparison of taxonomic

     trees.  Systematic Zoology  21: 390-397.

Adams, E. N.  1986.  N-trees as nestings: complexity, similarity, and

     consensus.  Journal of Classification  3: 299-317.

Archie, J. W.  1989.  A randomization test for phylogenetic information in

     systematic data.  Systematic Zoology  38: 219-252.

Astolfi, P., K. K. Kidd, and L. L. Cavalli-Sforza.  1981.  A comparison of

     methods of reconstructing evolutionary trees.  Systematic Zoology  30:


Baum, B. R.  1989.  PHYLIP: Phylogeny Inference Package. Version 3.2. (Software

     review).  Quarterly Review of Biology  64: 539-541.

Bron, C., and J. Kerbosch.  1973.  Algorithm 457: Finding all cliques of an

     undirected graph.  Communications of the Association for Computing

     Machinery  16: 575-577.

Camin, J. H., and R. R. Sokal.  1965.  A method for deducing branching

     sequences in phylogeny.  Evolution  19: 311-326.

Carpenter, J.  1987a.  A report on the Society for the Study of Evolution

     workshop "Computer Programs for Inferring Phylogenies".  Cladistics  3:


Carpenter, J.  1987b.  Cladistics of cladists.  Cladistics  3: 363-375.

Cavalli-Sforza, L. L., and A. W. F. Edwards.  1967.  Phylogenetic analysis:

     models and estimation procedures.  Evolution  21: 550-570 (also Amer. J.

     Human Genetics  19: 233-257).

Cavender, J. A. and J. Felsenstein.  1987.  Invariants of phylogenies in a

     simple case with discrete states.  Journal of Classification  4: 57-71.

Churchill, G.A.  1989.  Stochastic models for heterogeneous DNA sequences.

     Bulletin of Mathematical Biology  51: 79-94.

Conn, E. E. and P. K. Stumpf.  1963.  Outlines of Biochemistry.  John Wiley and

     Sons, New York.

Day, W. H. E.  1983.  Computationally difficult parsimony problems in

     phylogenetic systematics.  Journal of Theoretical Biology  103: 429-438.

Dayhoff, M. O.  1979.  Atlas of Protein Sequence and Structure, Volume 5,

     Supplement 3, 1978.  National Biomedical Research Foundation, Washington,


DeBry, R. W. and N. A. Slade.  1985.  Cladistic analysis of restriction

     endonuclease cleavage maps within a maximum-likelihood framework.

     Systematic Zoology  34:  21-34.

Dempster, A. P., N. M. Laird, and D. B. Rubin.  1977.  Maximum likelihood from

     incomplete data via the EM algorithm.  Journal of the Royal Statistical

     Society B  39: 1-38.

Eck, R. V., and M. O. Dayhoff.  1966.  Atlas of Protein Sequence and Structure

     1966.  National Biomedical Research Foundation, Silver Spring, Maryland. 

Edwards, A. W. F., and L. L. Cavalli-Sforza.  1964.  Reconstruction of

     evolutionary trees.  pp. 67-76 in Phenetic and Phylogenetic

     Classification, ed. V. H. Heywood and J. McNeill. Systematics Association

     Volume No. 6. Systematics Association, London.

Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris.  1976a.  A

     mathematical foundation for the analysis of character compatibility.

     Mathematical Biosciences  23: 181-187.

Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris.  1976b.  An algebraic

     analysis of cladistic characters.  Discrete Mathematics  16: 141-147.

Estabrook, G. F., F. R. McMorris, and C. A. Meacham.  1985.  Comparison of

     undirected phylogenetic trees based on subtrees of four evolutionary

     units.  Systematic Zoology  34: 193-200.

Faith, D. P.  1990.  Chance marsupial relationships.  Nature  345: 393-394.

Faith, D. P. and P. S. Cranston.  1991.  Could a cladogram this short have

     arisen by chance alone?: On permutation tests for cladistic structure.

     Cladistics  7: 1-28.

Farris, J. S.  1977.  Phylogenetic analysis under Dollo's Law.  Systematic

     Zoology  26: 77-88.

Farris, J. S.  1978a.  Inferring phylogenetic trees from chromosome inversion

     data.  Systematic Zoology  27: 275-284.

Farris, J. S.  1981.  Distance data in phylogenetic analysis.  pp. 3-23 in

     Advances in Cladistics: Proceedings of the first meeting of the Willi

     Hennig Society, ed. V. A. Funk and D. R. Brooks.  New York Botanical

     Garden, Bronx, New York.

Farris, J. S.  1983.  The logical basis of phylogenetic analysis.  pp. 1-47 in

     Advances in Cladistics, Volume 2, Proceedings of the Second Meeting of the

     Willi Hennig Society.  ed. Norman I. Platnick and V. A. Funk.  Columbia

     University Press, New York.

Farris, J. S.  1985.  Distance data revisited.  Cladistics  1: 67-85.

Farris, J. S.  1986.  Distances and statistics.  Cladistics  2: 144-157.

Farris, J. S. ["T. N. Nayenizgani"].  1990.  The systematics association enters

     its golden years (review of "Prospects in Systematics", ed. D.

     Hawksworth).  Cladistics  6: 307-314.

Felsenstein, J.  1973a.  Maximum likelihood and minimum-steps methods for

     estimating evolutionary trees from data on discrete characters.

     Systematic Zoology  22: 240-249.

Felsenstein, J.  1973b.  Maximum-likelihood estimation of evolutionary trees

     from continuous characters.  Amer. J. Human Genetics  25: 471-492.

Felsenstein, J.  1978a.  The number of evolutionary trees.  Systematic Zoology

     27: 27-33.

Felsenstein, J.  1978b.  Cases in which parsimony and compatibility methods

     will be positively misleading.  Systematic Zoology  27: 401-410.

Felsenstein, J.  1979.  Alternative methods of phylogenetic inference and their

     interrelationship.  Systematic Zoology  28: 49-62.

Felsenstein, J.  1981a.  Evolutionary trees from DNA sequences: a maximum

     likelihood approach.  J. Molecular Evolution  17: 368-376.

Felsenstein, J.  1981b.  A likelihood approach to character weighting and what

     it tells us about parsimony and compatibility.  Biological Journal of the

     Linnean Society  16: 183-196.

Felsenstein, J.  1981c.  Evolutionary trees from gene frequencies and

     quantitative characters: finding maximum likelihood estimates.  Evolution

     35: 1229-1242.

Felsenstein, J.  1982.  Numerical methods for inferring evolutionary trees.

     Quarterly Review of Biology  57: 379-404.

Felsenstein, J.  1983b.  Parsimony in systematics: biological and statistical

     issues.  Annual Review of Ecology and Systematics  14: 313-333.

Felsenstein, J. 1984a.  Distance methods for inferring phylogenies: a

     justification. Evolution  38: 16-24.

Felsenstein, J.  1984b.  The statistical approach to inferring evolutionary

     trees and what it tells us about parsimony and compatibility.  pp. 169-191

     in: Cladistics: Perspectives in the Reconstruction of Evolutionary 

     History, edited by T. Duncan and T. F.  Stuessy.  Columbia University

     Press, New York.

Felsenstein, J.  1985a.  Confidence limits on phylogenies with a molecular

     clock.  Systematic Zoology  34: 152-161.

Felsenstein, J.  1985b.  Confidence limits on phylogenies: an approach using

     the bootstrap.  Evolution  39: 783-791.

Felsenstein, J.  1985c.  Phylogenies from gene frequencies: a statistical

     problem.  Systematic Zoology  34: 300-311.

Felsenstein, J.  1985d.  Phylogenies and the comparative method.  American

     Naturalist  125: 1-12.

Felsenstein, J.  1986.  Distance methods: a reply to Farris.  Cladistics  2:


Felsenstein, J.  and E. Sober.  1986.  Parsimony and likelihood: an exchange.

     Systematic Zoology  35: 617-626.

Felsenstein, J.  1988a.  Phylogenies and quantitative characters.  Annual

     Review of Ecology and Systematics  19: 445-471.

Felsenstein, J.  1988b.  Phylogenies from molecular sequences: inference and

     reliability.   Annual Review of Genetics  22: 521-565.

Felsenstein, J.  1992a.  Estimating effective population size from samples of

     sequences: inefficiency of pairwise and segregating sites as compared to

     phylogenetic estimates.  Genetical Research  59: 139-147.

Felsenstein, J.  1992b.  Phylogenies from restriction sites, a maximum

     likelihood approach.  Evolution  46: 159-173.

Felsenstein, J.  1992c.  Estimating effective population size from samples of

     sequences: a bootstrap Monte Carlo integration approach.  Genetical

     Research, (December issue), in press.

Fink, W. L.  1986.  Microcomputers and phylogenetic analysis.  Science  234:


Fitch, W. M., and E. Margoliash.  1967.  Construction of phylogenetic trees.

     Science  155: 279-284.

Fitch, W. M.  1971.  Toward defining the course of evolution: minimum change

     for a specified tree topology.  Systematic Zoology  20: 406-416.

Fitch, W. M.  1975.  Toward finding the tree of maximum parsimony.  pp. 189-230

     in Proceedings of the Eighth International Conference on Numerical

     Taxonomy, ed. G. F. Estabrook.  W. H. Freeman, San Francisco.

Fitch, W. M. and E. Markowitz.  1970.  An improved method for determining codon

     variability and its application to the rate of fixation of mutations in

     evolution.  Biochemical Genetics  4: 579-593.

George, D. G.,  L. T. Hunt, and W. C. Barker.  1988.  Current methods in

     sequence comparison and analysis.  pp. 127-149 in Macromolecular

     Sequencing and Synthesis, ed. D. H. Schlesinger.  Alan R. Liss, New York.

Gomberg, D.  1966.  "Bayesian" post-diction in an evolution process.

     unpublished manuscript: University of Pavia, Italy.

Graham, R. L., and L. R. Foulds.  1982.  Unlikelihood that minimal phylogenies

     for a realistic biological study can be constructed in reasonable

     computational time.  Mathematical Biosciences  60: 133-142.

Hasegawa, M. and T. Yano.  1984a.  Maximum likelihood method of phylogenetic

     inference from DNA sequence data.  Bulletin of the Biometric Society of

     Japan  No. 5:  1-7.

Hasegawa, M.  and T. Yano.  1984b.  Phylogeny and classification of Hominoidea

     as inferred from DNA sequence data.  Proceedings of the Japan Academy  60

     B: 389-392.

Hasegawa, M., Y. Iida, T. Yano, F. Takaiwa, and M. Iwabuchi.  1985a.

     Phylogenetic relationships among eukaryotic kingdoms as inferred from

     ribosomal RNA sequences.  Journal of Molecular Evolution  22: 32-38.

Hasegawa, M., H. Kishino, and T. Yano.  1985b.  Dating of the human-ape

     splitting by a molecular clock of mitochondrial DNA.  Journal of Molecular

     Evolution  22: 160-174.

Hendy, M. D., and D. Penny.  1982.  Branch and bound algorithms to determine

     minimal evolutionary trees.  Mathematical Biosciences  59: 277-290.

Higgins, D. G. and P. M. Sharp.  1989.  Fast and sensitive multiple sequence 

     alignments on a microcomputer.  Computer Applications in the Biological

     Sciences (CABIOS)  5: 151-153.

Holmquist, R., M. M. Miyamoto, and M. Goodman.  1988.  Higher-primate phylogeny

     -- why can't we decide?  Molecular Biology and Evolution  5: 201-216.

Inger, R. F.  1967.  The development of a phylogeny of frogs. Evolution 21:


Jin, L. and M. Nei.  1990.  Limitations of the evolutionary parsimony method of

     phylogenetic analysis.  Molecular Biology and Evolution  7: 82-102.

Jukes, T. H. and C. R. Cantor.  1969.  Evolution of protein molecules.  pp.

     21-132 in Mammalian Protein Metabolism, ed. H. N. Munro.  Academic Press,

     New York.

Kim, J.  and M. A. Burgman.  1988.  Accuracy of phylogenetic-estimation methods

     using simulated allele-frequency data.  Evolution  42: 596-602.

Kimura, M.  1980.  A simple model for estimating evolutionary rates of base

     substitutions through comparative studies of nucleotide sequences.

     Journal of Molecular Evolution  16: 111-120.

Kimura, M.  1983.  The Neutral Theory of Molecular Evolution.  Cambridge

     University Press, Cambridge.

Kingman, J. F. C.  1982a.  The coalescent.  Stochastic Processes and Their

     Applications  13: 235-248.

Kingman, J. F. C.  1982b.  On the genealogy of large populations.  Journal of

     Applied Probability  19A: 27-43.

Kishino, H. and M. Hasegawa.  1989. Evaluation of the maximum likelihood

     estimate of the evolutionary tree topologies from DNA sequence data, and

     the branching order in Hominoidea.  Journal of Molecular Evolution  29:


Kluge, A. G., and J. S. Farris.  1969.  Quantitative phyletics and the

     evolution of anurans.  Systematic Zoology  18: 1-32.

Lake, J. A.  1987.  A rate-independent technique for analysis of nucleic acid

     sequences: evolutionary parsimony.  Molecular Biology and Evolution  4:


Le Quesne, W. J.  1969.  A method of selection of characters in numerical

     taxonomy.  Systematic Zoology  18: 201-205.

Le Quesne, W. J.  1974.  The uniquely evolved character concept and its

     cladistic application.  Systematic Zoology  23: 513-517.

Lewis, H. R., and C. H. Papadimitriou.  1978.  The efficiency of algorithms.

     Scientific American  238: 96-109 (January issue).

Luckow, M.  and D. Pimentel.  1985.  An empirical comparison of numerical

     Wagner computer programs.  Cladistics  1: 47-66.

Lynch, M.  1990.  Methods for the analysis of comparative data in evolutionary

     biology.  Evolution  45: 1065-1080.

Maddison, D. R.  1991.  The discovery and importance of multiple islands of

     most-parsimonious trees.  Systematic Zoology  40: 315-328.

Margush, T. and F. R. McMorris.  1981.  Consensus n-trees.  Bulletin of

     Mathematical Biology  43: 239-244.

Nelson, G.  1979.  Cladistic analysis and synthesis: principles and

     definitions, with a historical note on Adanson's Familles des Plantes

     (1763-1764).  Systematic Zoology   28: 1-21.

Nei, M.  1972.  Genetic distance between populations.  American Naturalist

     106: 283-292.

Nei, M.  and W.-H. Li.  1979.  Mathematical model for studying genetic

     variation in terms of restriction endonucleases.  Proceedings of the

     National Academy of Sciences, USA  76: 5269-5273.

Page, R. D. M.  1989.  Comments on component-compatibility in historical

     biogeography.  Cladistics  5: 167-182.

Platnick, N.  1987.   An empirical comparison of microcomputer parsimony

     programs.  Cladistics  3: 121-144.

Platnick, N.  1989.  An empirical comparison of microcomputer parsimony

     programs. II.  Cladistics  5: 145-161.

Reynolds, J. B., B. S. Weir, and C. C. Cockerham.  1983.  Estimation of the

     coancestry coefficient: basis for a short-term genetic distance. 

     Genetics  105: 767-779.

Rohlf, F. J.  and M. C. Wooten.  1988.  Evaluation of the restricted maximum

     likelihood method for estimating phylogenetic trees using simulated

     allele-frequency data.  Evolution  42: 581-595.

Saitou, N., and M. Nei.  1987.  The neighbor-joining method: a new method for

     reconstructing phylogenetic trees.  Molecular Biology and Evolution  4:


Sanderson, M. J.  1990.  Flexible phylogeny reconstruction: a review of

     phylogenetic inference packages using parsimony.  Systematic Zoology  39:


Sankoff, D. D., C. Morel, and R. J. Cedergren.  1973.  Evolution of 5S RNA and the

     nonrandomness of base replacement.  Nature New Biology  245: 232-234.

Sokal, R. R. and P. H. A. Sneath.  1963.  Principles of Numerical Taxonomy.  W.

     H. Freeman, San Francisco.

Smouse, P. E. and W.-H. Li.  1987.  Likelihood analysis of mitochondrial

     restriction-cleavage patterns for the human-chimpanzee-gorilla trichotomy.

     Evolution  41: 1162-1176.

Sober, E.  1983a.  Parsimony in systematics: philosophical issues.  Annual

     Review of Ecology and Systematics  14: 335-357.

Sober, E.  1983b.  A likelihood justification of parsimony.  Cladistics  1:


Sober, E.  1988.  Reconstructing the Past: Parsimony, Evolution, and Inference.

     MIT Press, Cambridge, Massachusetts.

Studier, J. A.  and K. J. Keppler.  1988.  A note on the neighbor-joining

     algorithm of Saitou and Nei.  Molecular Biology and Evolution  5: 729-731.

Swofford, D. L. and G. J. Olsen.  1990.  Phylogeny reconstruction.  Chapter 11,

     pages 411-501 in Molecular Systematics, ed. D. M. Hillis and C. Moritz.

     Sinauer Associates, Sunderland, Massachusetts.

Templeton, A. R.  1983.  Phylogenetic inference from restriction endonuclease

     cleavage site maps with particular reference to the evolution of humans

     and the apes. Evolution   37: 221-244.

Thompson, E. A.  1975.  Human Evolutionary Trees.  Cambridge University Press,


Wu, C. F. J.  1986.  Jackknife, bootstrap and other resampling plans in

     regression analysis.  Annals of Statistics   14: 1261-1295. 


     Over the years various granting agencies have contributed to  the  support

of the PHYLIP project (at first without knowing it).  They are: 

Years       Agency                       Grant or Contract Number 

1995-1999   NIH NIGMS                    1 R01 GM51929-01

1992-1995   National Science Foundation  DEB-9207558

1992-1994   NIH NIGMS Shannon Award      2 R55 GM41716-04

1989-1992   NIH NIGMS                    1 R01-GM41716-01

1990-1992   National Science Foundation  BSR-8918333

1987-1990   National Science Foundation  BSR-8614807

1979-1987   U.S. Department of Energy    DE-AM06-76RLO2225 TA DE-AT06-76EV71005 

I am particularly grateful  to  program  administrators  William  Moore,  Irene

Eckstrand, Peter Arzberger, and Conrad Istock, who have gone beyond the call of

duty to make sure that PHYLIP continued. 

     Booby prizes for funding are awarded to: 

(1) The people at the U.S. Department of Energy who, in 1987, decided they were

"not interested in phylogenies", 

(2) The members of the Systematics Panel of NSF who twice (in  1989  and  1992)

positively  recommended that my applications NOT be funded.  I am very grateful

to program director William Moore for courageously  overruling  their  decision

the  first  time.  The current (1992) Systematics Panel can claim no credit for

PHYLIP whatsoever. 

(3) The members of the 1992 Genetics Study Section of NIH who rated my proposal

in the 53rd percentile (I don't know if that's 53rd from the top or the bottom,

but does it matter?), thus denying it funding.  I am, however, grateful to  the

NIGMS  administrators  who  supported  giving  me  a  "Shannon award" partially

funding my work for a period in spite of this rating. 

     The original Camin-Sokal parsimony program and the polymorphism  parsimony

program  were  written  by  me  in 1977 and 1978.  They were Pascal versions of

earlier FORTRAN programs I wrote in 1966 and 1967 using the same  algorithm  to

infer  phylogenies  under  the Camin-Sokal and polymorphism parsimony criteria.

Harvey Motulsky worked for me  as  a  programmer  in  1971  and  wrote  FORTRAN

programs  to  carry  out the Camin-Sokal, Dollo, and polymorphism methods.  But

most of the work on PHYLIP other than my own was  by  Jerry  Shurman  and  Mark

Moehring.   Jerry  Shurman  worked  for me in the summers of 1979 and 1980, and

Mark Moehring worked for me in the  summers  of  1980  and  1981.   Both  wrote

original versions of many of the other programs, based on the original versions

of my Camin-Sokal parsimony program and  POLYM.   These  formed  the  basis  of

Version 1 of the Package, first distributed in October, 1980. 

     Version 2, released in the spring of  1982,  involved  a  fairly  complete

rewrite  by  me  of  many of those programs.  Jerry and Mark are not to be held

responsible for problems arising from use of these  programs.   Hisashi  Horino

has  for version 3.3 reworked some parts of the programs CLIQUE and CONSENSE to

make their output more comprehensible, and has added some  code  to  the  tree-

drawing programs DRAWGRAM and DRAWTREE as well. 

     My part-time programmers Akiko Fuseki, Sean Lamont and Andrew Keeffe  gave

me  substantial  help  with  the  current  release, and their excellent work is

greatly appreciated.  Akiko in particular did much of the hard work  of  adding

new  features  and  changing  old  ones in the 3.4 and 3.5 releases, and Andrew

prepared the Macintosh version, wrote RETREE, and  added  the  ray-tracing  and

PICT  code  to the DRAW programs.  Sean was central to the conversion to C, and

tested it extensively.  My postdoctoral fellow Mary Kuhner  and  her  associate

Jon  Yamato  created  NEIGHBOR, the neighbor-joining and UPGMA program, for the

current release, for which I am also grateful (Naruya Saitou kindly  encouraged

us to use some of the code from his own implementation of this method). 

     I am very grateful to many users for algorithmic  suggestions,  complaints

about  features  (or  lack  of features), and information about the behavior of

their operating systems and compilers.  Among these are: 

   Jim Archie               Timothy Goldsmith        Dan Nickrent

   Mary Barkworth           Rees Griffiths           Trang Nguyen

   Yves Bertheau            George Gutman            Cary O'Donnell

   Vincent Bauchau          Linda Hardison           Steve O'Kane

   Bernard Baum             Gene Hart                Gary Olsen

   Mary Berbee              Masami Hasegawa          John Olsen

   Biff Bermingham          Bill Hatheway            Steve O'Neill

   Yves Bertheau            David Hillis             Greg Orloff

   Pierre Boursot           Richard Holliday         Pekka Pamilo

   Tom Bruns                Eddie Holmes             David Penny

   Tsan Iang Chuang         Kent Holsinger           Norman Platnick 

   Stephen Clark            Dan Hough                Mark Ragan

   Bruce Cochrane           Richard Jensen           Neil Rawlings

   Joel Cracraft            Bo Johansson             Tom Ritch

   Ross Crozier             Quentin Kay              Alistair Robertson

   Mark Dalton              Steve Kelem              Joseph R. Rohrer

   Dan Davison              Kim Cheol-Min            Naruya Saitou

   Ron DeBry                Joseph H. Kirkbride      Kay Schneitz

   Allen Delaney            John Kirsch              Paul Sharp

   Terry Delaney            Andrew Knight            Arend Sidow

   John Devereux            Dennis Knudson           Hans Siegismund

   Tod Distotell            Mary Kuhner              Chuck Smart

   John Doebley             Jan Kwiatowski           Douglas Smith

   Ken Dodds                John LaDuke              Dave Spencer

   Jim Doyle                Lionel Landry            Lisa Steiner

   Guy Drouin               Franz Lang               Per Sundberg

   Shan Duncan              Niels Larsen             Susan Swensen

   Tom Duncan               Jerry Learn              David Swofford

   Robert Eaglen            Rev. Arthur Lee          John Sved

   Scott Edwards            Pierre Legendre          Naoko Takezaki

   Willem Ellis             Jack A.M. Leunissen      Eric Taylor

   Ted Emigh                Andrew Lloyd             Jeff Thorne

   John Endler              Wolfgang Ludwig          Clive Trotman

   Laurent Excoffier        David Maddison           John Turnbull

   James Farmer             Wayne Maddison           Hans Ullitz-Moeller

   David Featherston        George McKay             Michael Vodkin

   Kent Fiala               Brian McMahon            Carl Wadsworth

   Tim Flannery             Christopher Meacham      Ryk Ward

   Vera Ford                Brook Milligan           Daniel Weeks

   Kurt Fristrup            Sanzo Miyazawa           Loni West

   Douglas Futuyma          Janice Moore             George D.F. Wilson

   Michael Garrick          Susumu Nakayama          Thomas K. Wilson

   Don Gilbert              Jean-Marc Neuhaus        M. Zandee

   John Gillespie           Haolin Ni                Eric Zurcher

   Nick Goldman 

My apologies to anyone who has accidentally been left out of this  list.   Keep

making suggestions and you will get on eventually. 

     A growing contribution to this package has been  made  by  others  writing

programs or parts of programs.  Chris Meacham contributed the important program

FACTOR, long demanded by users, and the even more important  ones  PLOTREE  and

PLOTGRAM.  Important parts of the code in DRAWGRAM and DRAWTREE were taken over

from those two programs.  He is thus mostly to  blame  for  all  problems  with

these  programs.   Kent  Fiala  wrote  PROCEDURE reroot to do outgroup-rooting,

which was an essential part of many programs in earlier versions.   Someone  at

the  Western  Australia  Institute  of Technology suggested the name PHYLIP (by

writing it on a magnetic tape as the tape label), but they  all  seem  to  deny

having done so (and I've lost the relevant letter). 

     Arend Sidow contributed makeinf.c to  the  Unsupported  Division  of  this

release,  and  Masami  Hasegawa  and  Jun Adachi contributed ProtML.pas.  Their

generosity is much appreciated. 

     The distribution of the package also owes much to Buz  Wilson  and  Willem

Ellis, who have put a lot of effort into the past distribution of the PCDOS and

Macintosh versions respectively.  Christopher Meacham and Tom Duncan for  three

versions  distributed  a printed version of these documentation files (they are

no longer able to do so), and I am very grateful to  them  for  those  efforts.

William  H.E.  Day  and F. James Rohlf have been very helpful in setting up the

listserver news bulletin service. 

     I also wish to thank the people who have made computer resources available

to  me,  mostly  in  the  loan  of use of microcomputers.  These include Jeremy

Field, Clem Furlong, Rick Garber, Dan Jacobson, Rochelle Kochin, Monty Slatkin,

Jim Archie, Jim Thomas, and George Gilchrist. 

     I should also acknowledge the computers  used  to  develop  this  package:

These  include  a  CDC  6400, two DECSystem 1090s, my trusty old SOL-20, my old

Osborne-1, a VAX 11/780, a VAX 8600, my old   MicroVAX  I,  my  old  DECstation

3100,  my old Toshiba 1100+, and my present mainstays, a DECstation 5000/200, a

DECstation 5000/125, a Compudyne 486DX/33, a Trinity Genesis  386SX,  a  Zenith

Z386  and  a  Mac  Classic.   (One  of  the  reasons we have been successful in

achieving compatibility between different computer systems is that I  have  had

to run them myself under so many different operating systems and compilers). 


     Here are some of the other phylogeny packages that I know about.  Some  of

them  are  available  over  Internet from ftp server machines, or by World Wide

Web.  If you are on Internet you should familiarize yourself  with  the  server

machines  (see entries 6 and 7 below for more information).  Another major list

of phylogeny software is being compiled by David Maddison and Wayne Maddison as

part of their "Tree of Life" project on the World Wide Web.  Its URL is:


It is still very incomplete as of this writing but may be more up-to-date  than

this  listing can be.  The programs listed below include both free and non-free

ones; in some cases I do not know whether a program is free.  I have listed  as

free  those  that  I  knew  were  free;  for  the  others you have to ask their

distributor.   The  list  starts  with  programs  and  packages   to   estimate

phylogenies,  continues  with  alignment-and-phylogeny  programs, and ends with

programs to do other phylogeny-related tasks. 

     1.  David Swofford of the Laboratory of  Molecular  Systematics,  National

Museum of Natural History, Smithsonian Institution, Washington, D.C.  has written

PAUP (which originally meant Phylogenetic Analysis Using  Parsimony).   Version

3.0 was available for Macintoshes.  It is currently not available, but a new

version, to be called PAUP*, will be released by Sinauer Associates, of

Sunderland, Massachusetts, in late 1995 or early 1996.  It will have Macintosh,

DOS, and Unix versions.  It will include parsimony, distance matrix,

invariants, and maximum likelihood methods. 

     PAUP 3.0 was probably the most sophisticated parsimony program, with  many

options  and  close compatibility with MacClade (for which see below).  The new

program will become much broader with the inclusion of more methods.  The price

will  be  in  the  vicinity of $100 US.  Sinauer Associates's e-mail address is


     2.  If you have a Macintosh computer and any  interest  in  discrete-state

parsimony  methods (including DNA and protein parsimony), you should definitely

get MacClade.  It was written by Wayne  Maddison  and  David  Maddison  of  the

University of Arizona.  All distribution is by Sinauer Associates, Sunderland,

Massachusetts 01375, USA.  Their phone number is: (413) 665  3722,  FAX:  (413)

665  7292.   A  disk with program, help file, and example data files, plus book

(which has about 100 pages of intro to phylogenetic theory, and  250  pages  of

program  instructions),  is  $75  U.S. ($40 for the book alone).  Site licenses

are also available.  An earlier and less capable Version 2 (which for example

cannot  read  nucleic  acid  sequences  and  has  fewer  features  for discrete

characters) is also available by anonymous  ftp  from  the  EMBL,  Indiana  and

Houston  molecular  biology  software servers.  Their addresses are given below

under the descriptions of TreeAlign and ClustalV.  MacClade 2.1 will  be  found 

among their Mac software, as a squeezed and then binhexed file. 

     MacClade enables you to use the  mouse-window  interface  to  specify  and

rearrange  phylogenies by hand, and watch the number of character steps and the

distribution of states of a given character on the tree change as  you  do  so.

MacClade  is  positively addictive and will give you a much better feel for the

tree and your data.  It's the closest thing to a phylogeny video  game  that  I

have  seen.   It  has been influential in spurring the inclusion of interaction

and graphics into other phylogeny programs.   (I  have  tried  to  supply  this

functionality  in  PHYLIP  by  incorporating  the  programs  MOVE, DOLMOVE, and

DNAMOVE,  which  act  somewhat  like  MacClade).   MacClade  does  not  have  a

sophisticated  search algorithm to find best trees: it largely relies on you to

do  it  by  hand  (which  is  surprisingly  effective),  with  only   a   local

rearrangement algorithm available to improve on that tree. 

     3.  J. S. Farris has produced Hennig86, a fast parsimony program including

branch-and-bound  search  for  most  parsimonious  trees  and  interactive tree

rearrangement.  Although complete benchmarks have not been published, it is said

to  be faster than Swofford's PAUP; both are a great many times faster than the

parsimony programs in PHYLIP.  The program is distributed in executable  object

code  only  and  costs $50, plus $5 mailing costs ($10 outside of the U.S.).

The user's name should be  stated,  as  copies  are  personalized  as  a  copy-

protection  measure.   It  is  distributed  by  Arnold  Kluge,  Amphibians  and

Reptiles, Museum of  Zoology,  University  of  Michigan,  Ann  Arbor,  Michigan

48109-1079,  U.S.A.  (Arnold.G.Kluge@um.cc.umich.edu)  and by Diana Lipscomb at

George Washington University (BIODL@gwuvm.gwu.edu).  It runs  on  PC-compatible

microcomputers  with  at  least  512K  of  RAM and needs no math coprocessor or

graphics monitor.  It can handle up to 180 taxa and 999 characters. 

     4.  Mark  Siddall,  of  the  Virginia   Institute   of   Marine   Sciences

(mes@vims.edu) has released Random Cladistics, a set of programs that can carry

out bootstrapping, jackknifing, and a variety of kinds  of  permutation  tests,

using Hennig86 to analyze the data.  To use it you must have a copy of Hennig86

(for whose distribution see above).   Random  Cladistics  will  carry  out  the

appropriate  transformations  of  your  data and will call Hennig86 and have it

analyze them, and then it will summarize the  results.   Random  Cladistics  is

available  free by anonymous ftp from zoo.utoronto.ca in directory "pub" (files

random.doc and random.exe). 

     5. J. S. Farris has recently released RNA (Rapid Nucleotide Analysis).  It

features  rapid  bootstrapping.   It is available from Arnold Kluge, Amphibians

and Reptiles, Museum of Zoology, University of Michigan,  Ann  Arbor,  Michigan

48109-1079,  U.S.A.  (Arnold.G.Kluge@um.cc.umich.edu)  and  Diana Lipscomb at

George Washington University (BIODL@gwuvm.gwu.edu) who  may  be  contacted  for

details.  The cost is said to be about $30 US. 

     6. ClaDOS, an interactive program which allows rearrangement of trees  and

their  evaluation,  mapping of characters into them, and more, is available for

DOS systems from Kevin Nixon, L. H. Bailey Hortorium, Cornell  University,  467

Mann  Library,  Ithaca,  New York  14853.  Rumor has it that the cost is in the

vicinity of $55 US. 

     7. MEGA (Molecular Evolutionary Genetic Analysis) has been released

by  Sudhir  Kumar,  Koichiro  Tamura,  and  Masatoshi  Nei  of the Institute of

Molecular  Evolutionary  Genetics,  328   Mueller   Lab,   Pennsylvania   State

University,  University  Park,  Pennsylvania 16802, U.S.A.  It is an executable

program for DOS machines, and is menu-driven with  context-sensitive  help.  It

will  also  run  under Windows in a DOS Window.  It will analyze data from DNA,

RNA and protein sequences, and distance matrices produced from other  kinds  of

data as well.  It will include the Neighbor-Joining distance matrix method, a

branch and bound parsimony method, and bootstrapping.  It will also plot trees

on many kinds of printers.  The program costs $15 (for the documentation).

Inquiries can also be made by mail to Joyce White at the above address or by

electronic mail to imeg@psuvm.psu.edu. 

     8. Yves van de Peer of the University of Antwerp (yvdp@reks.uia.ac.be) has

developed  TREECON  3.0, a program package for analysis of molecular data sets.

It is menu driven and runs on 386 (and higher) DOS systems, and also on Windows

systems.   It  carries out inference of phylogenies by distance matrix methods,

with bootstrapping and a program to draw the trees.  It is written in C and  is

available  free  by  anonymous  ftp from uiam3.uia.ac.be.   It was described in

CABIOS 9: 177-182 (1993).  A fee is asked to defray expenses.  For  information

or  ordering  contact  Van  de  Peer  at  the  above  e-mail  address or at the

Department of Biochemistry, University of Antwerp (UIA), Universiteitsplein  1,

B-2610 Antwerpen, BELGIUM. 

     9. Jun Adachi and Masami Hasegawa  have  written  a  package  MOLPHY  2.2,

carrying  out maximum likelihood inference of phylogenies for either nucleotide

sequences or protein sequences.   Their  protein  sequence  maximum  likelihood

program,  ProtML,  is  a  successor  to  the  one they made available to me for

distribution on a nonsupported basis in PHYLIP, and is much improved over that.

It  is  the  best protein maximum likelihood program available.  The package is

distributed  free  in  C  source  code,  with  documentation,   by   ftp   from


     10. Gary Olsen, of the Department of Microbiology, University of Illinois,

has  developed  a  speeded-up  version  of  my program DNAML coded in C, called

"fastDNAml".  It achieves a number of economies and also is organized  so  that

it  can be run on parallel processors -- he and his co-workers have constructed

trees of very large size on a high-speed parallel processor.  The  program  can

be  compiled  using the "p4" portable parallel processing toolkit.  It can also

be run in ordinary serial mode on workstations where it is faster  than  DNAML.

The C program is available by anonymous ftp from the Ribosomal Database Project

at info.mcs.anl.gov in directory pub/RDP/programs/fastDNAml. 

     11. Ziheng Yang of the Institute of  Molecular  Evolutionary  Genetics  at

Pennsylvania  State  University  (who is soon to be moving to the Department of

Integrative     Biology,     University     of      California,      Berkeley),

(yang@imeg.bio.psu.edu)  has  released  PAML  1.0,  a  program  for the maximum

likelihood analysis of nucleotide or protein sequences (including Hidden Markov

Model  analysis  like  the  features  we  have in DNAML).  It is available as C

source code for Unix systems, and is free by anonymous ftp from  the  molecular

biology  software  servers.   It  will  be  found  on  ftp.bio.indiana.edu, for

example, in directory molbio/evolve. 

     12.  Pablo  Goloboff,  of  the  American   Museum   of   Natural   History

(goloboff@amnh.org),   distributes  PEWEE  and  NONA,  to  carry  out  weighted

parsimony analyses.  The programs run on DOS with versions available  for  both

386-486-Pentium  machines  and  earlier 16-bit machines.  Goloboff's address is

Dept. of Entomology, American Museum of Natural History, Central Park  West  at

79th Street, New York, NY 10024.  His telephone number is 212 769 5619, and fax

number is 212 769 5277. 

     13. Yasuo Ina of  the  National  Institute  of  Genetics,  Mishima,  Japan

(yina@ddbj.nig.ac.jp)  has  developed  ODEN,  a  package  of programs for doing

distance matrix analyses on nucleotide or protein sequences.  It  is  described

in  CABIOS  10:  11-12  (1994).   It  is  available  free by anonymous ftp from

directory pub/oden in bioslave.uio.no as C source code for Unix systems. 

     14. A. Luettke and R. Fuchs have written MacT, a package of  programs  for

Macintoshes that compute distances and compute Neighbor-Joining phylogenies for

them.  The programs work  on  4  through  26  sequences,  and  source  code  in 

Microsoft  QuickBasic is provided as well as compiled executables.  The package

is free and is  available  on  the  molecular  biology  software  servers.   On

ftp.bio.indiana.edu it will be found in directory molbio/mac.  The programs are

described in CABIOS 8: 591-594, 1992. 

     15. Andrey  A.  Zharkikh,   Andrey   Rzhetsky,   and  co-workers   in  the

Institute   of Cytology and Genetics, Siberian Branch of the Russian Academy of

Sciences, Novosibirsk, Russia, Ex-USSR, have produced  VOSTORG,  a  package  of

programs for alignment (both manual and automatic) and inferring phylogenies by

distance methods and parsimony for molecular sequences.  It  runs  on  IBM  PC-

compatibles  and includes some rather fancy graphics. The authors are currently

in the U.S., not in Siberia.  A version of the program  is  available  free  by

anonymous  ftp  from  gsbs18.gs.uth.tmc.edu  in directory pub/zharkikh/vostorg.

The programs are described in a paper by Zharkikh et al. in Gene 101:  251-254

(1991). 

     16. Rainer Wetzel and Daniel Huson have developed a Macintosh program  for

carrying  out  the  "split  decomposition"  method  of  A. Bandelt and A. Dress

(Molecular Phylogenetics 1:  242-252  (1992)).   Contact

huson@mathematik.uni-bielefeld.de for details. 

     17.   James  Lake  distributes  "Evomony",  a  program   for   using   the

"evolutionary parsimony" (invariants) method for inferring phylogenies from DNA

or RNA sequences.  It runs on 286 or higher DOS  systems  with  at  least  500k

bytes of memory. A Macintosh version was also contemplated.  I do not know what

the current distribution arrangements are.  Lake's  address  is  Department  of

Biology, University of California, Los Angeles, California  90024. 

     18.   Walter  Fitch  (Department  of  Ecology  and  Evolutionary  Biology,

University  of  California,  Irvine,  California   92717, U.S.A.) has a package

"Molevol" available  free  (on  receipt  of  an  appropriate  number  of  PCDOS

formatted  floppy disks) with about 20 FORTRAN programs for not only estimating

trees by parsimony and distance methods but doing various  other  manipulations

of  data that might be needed such as format interconversions and searching for

homology and secondary structure.  They are available as FORTRAN source  and/or

as  PCDOS  executables.  The FORTRAN programs will also run on Sun workstations

(and probably others too, I would suspect).  His  electronic  mail  address  is


     19. Pierre Roux and Tim Littlejohn of  the  Informatics  Division  of  the

Organelle  Genome Megasequencing Program at the Universite de Montreal have made

available PARBOOT, a program that takes bootstrap sampled data sets and  splits

them  up,  submitting  each to a different computer, so as to run bootstrapping

quickly on networks of computers.  It is available free as C source code by ftp

from   megasun.bch.umontreal.ca   in  directory  pub/parboot.   It  requires  a

networked  system  of  computers  with  PHYLIP,   a  "perl"  interpreter,   and

appropriate accounts and permissions. 

     20. Andrey Zharkikh of the Genetics Centers at  the  University  of  Texas

Health  Sciences Center in Houston has programs for bootstrapping of nucleotide

sequences, including his innovative double-bootstrap method  for  getting  less

biased   P   values.    They   are   available   free   by   anonymous  ftp  at

gsbs18.gs.uth.tmc.edu/pub/zharkikh/bootstrap                                 or

gsbs18.gs.uth.tmc.edu/pub/zharkikh/bootstrap/double-bootstrap.    The  programs

njbootjc, njbootk2, and  njbootli  implement  methods  based  on  Jukes-Cantor,

Kimura, and Li distances, respectively. 

     21.  David Penny (Department of Botany  and  Zoology,  Massey  University,

Palmerston  North, New Zealand) has been offering for free distribution several

PCDOS programs, one a fast parsimony program, TurboTree.  There  are  also  two

others,   Hadtree   which   computes   expected  frequencies  of  all  possible 

distributions of nucleotides among species, and Great  Deluge,  an  approximate

search  for  the  most parsimonious tree by a quasi-random method.  He tells me

that funding exigencies are such that he may soon have to start  charging  for

these.  His electronic mail address is dpenny@massey.ac.nz. 

     22. Jotun Hein (Institute of Genetics and Ecology, University  of  Aarhus,

8000  Aarhus  C, Denmark) has produced TreeAlign, a multiple sequence alignment

program that builds trees as it aligns DNA or protein  sequences.   It  uses  a

combination  of  distance  matrix and approximate parsimony methods.  TreeAlign

uses too much memory for it to run on PC's (DOS or Mac systems) but  is  really

designed  for  a workstation or mainframe.  It is available by anonymous ftp at

the Indiana, Houston, and EMBL molecular biology software  distribution  sites.

Their    network    addresses    are    respectively:      ftp.bio.indiana.edu,

ftp.bchs.uh.edu, and ftp.ebi.ac.uk.  In the  Indiana  archive  one  must  enter

directory  molbio/align,  in  the  Houston archive it is in directory pub/gene-

server in the directories unix and  vms.   If  you  are  on  Internet  and  use

molecular  data  it is important that you learn to use anonymous ftp and become

familiar with these ftp servers. 

     23. Another multisequence alignment program that  estimates  trees  as  it

aligns multiple sequences is ClustalW.  Currently it is distributed as C source

code, and in Macintosh and DOS executables by its author, Desmond Higgins.   He

is  at  the  European Bioinformatics Institute in Cambridge, England.  ClustalW

successfully compiles and runs on many different workstations.  DOS,  Mac,  and

PowerMac executables are also available. 

     It is a complete rewrite and upgrade of the Clustal and ClustalV packages;

the  first was described by Higgins and Sharp (1989).  New features include the

ability  to  detect  and  read  different  input  formats  (NBRF/PIR,   Fasta,

EMBL/Swissprot);   align  old  alignments;  produce  phylogenetic  trees  after

alignment (Neighbor Joining trees with a  bootstrap  option);  write  different

alignment   formats   (Clustal,  NBRF/PIR,  GCG,  PHYLIP);  full  command  line


     The program is available by anonymous ftp at  the  Indiana,  Houston,  and

EMBL  molecular  biology  distribution  sites.   Their  network  addresses  are

respectively:   ftp.bio.indiana.edu, ftp.bchs.uh.edu,  and  ftp.ebi.ac.uk.   In

the  Indiana  archive  one  must  enter  directory molbio/align, in the Houston

archive it is in directory pub/gene-server in all of the four directories  dos,

Mac,  unix, and vms (I do not know exactly where it is in the EBI machine).  If

you are on Internet and use molecular data it is important that  you  learn  to

use anonymous ftp and become familiar with one or more of these ftp servers. 

     24. Ward Wheeler and David Gladstein have  written  MALIGN,  a  parsimony-

based  alignment  program  for molecular sequences.  It implements the original

suggestion  by  Sankoff,  Morel,  and  Cedergren  (1973)  that  alignment   and

phylogenies could be done at the same time by finding the tree that  minimizes

the total alignment score along  the  tree.   Jotun  Hein's  program  TreeAlign

(mentioned  above) is another, more approximate but probably faster, attempt to

implement the Sankoff-Morel-Cedergren suggestion.   MALIGN  is  available  from

Ward  Wheeler  at the American Museum of Natural History in New York City.  His

email address is wheeler@amnh.org.  It comes in DOS, Mac and SUN versions. 

     25.  Rod Page has written COMPONENT,  a  program  for  PCDOS  systems  for

comparing  cladograms  for  use  in phylogeny and biogeography studies.  It has

many  tree  comparison  and  consensus  methods,  and  far  more  features  for

biogeographic  studies (such as comparing species and area cladograms) than any

other package.  It runs on PCDOS 286  or  386  systems  under  Windows  3.0  or

higher.  Its cost is 40 pounds U.K., and it can be ordered from Liz Timpson at the

Department of Botany, Natural History Museum, London (emt@nhm.ic.ac.uk).  Rod's

e-mail  address  is  rod.page@zoology.oxford.ac.uk.   There  is a review of the 

program in Cladistics 9: 351-353 (1993).  COMPONENT has a World Wide Web  site:

http://evolve.zps.ox.ac.uk/Rod/cpw.html which includes an order form. 

     26. Andrew Purvis  and  Andrew  Rambaut  of  the  Department  of  Zoology,

University  of  Oxford,  England,  have  written  CAIC (Comparative Analysis of

Independent Contrasts).  It  is  a  Macintosh  program  that  carries  out  the

contrasts  method  (like  my CONTRAST) but with some modifications by others to

cope with lack of resolution  of  the  phylogeny.   It  is  available  free  by

anonymous  ftp  from  directory  packages/CAIC  at  evolve.zps.ox.ac.uk.  It is

described in CABIOS 11: 247-251 (1995). 

     27. Joaquin Dopazo at the Centro  Nacional  de  Biotecnologia  in  Madrid,

Spain,  has  written  a  program  ABLE (Analysis of Branch Length Errors) which

implements the method described by Adell and Dopazo in J. Mol. Evol. 38:305-309

(1994).   This  is a form of the parametric bootstrap.  It makes use of PHYLIP.

It  is   available   as   a   DOS   executable   over   World   Wide   Web   at

http://www.cnb.uam.es/www/ximo  or  by  anonymous  ftp  at:  ftp.cnb.uam.es  in

directory software/molevol. 

     28. Kent Fiala, now of SAS Institute, has written a compatibility (clique)

program,  based  on  an  earlier  program written by Kent and George Estabrook.

Christopher Meacham has put the latest version of  CLINCH  (6.2),  with  Kent's

permission,  as  a  self-extracting  DOS  archive available free on Jim Beach's

TAXACOM fileserver, muse.bio.cornell.edu. CLINCH 6.2 and associated  files  can

be  found  by anonymous ftp in /pub/software/clinch as clinch62.exe, which is a

self-extracting archive.  Documentation, sample input and output,  and  FORTRAN

source  code  are  included.   PC-CLINCH  is  probably  the  most sophisticated

compatibility analysis program.  The Taxacom server, by the way, also has other

material related to botanical systematics, including flora information. 

     29.  Christopher  Meacham  (Museum  Informatics  Project,  University   of

California,  Berkeley,  California  94720,  U.S.A.)  produces COMPROB, a Pascal

program to compute probabilities that characters would be compatible at random,

thus  telling  us  which  clique  is "most surprising".  He can be contacted as

meacham@violet.berkeley.edu about receiving a copy.  The program is free. 

     30.  The program MARKOV computes  a  distance  measure  between  pairs  of

nucleotide sequences.  It also constructs phylogenies from these and summarizes

the 4x4 substitution matrices between the pairs of species.   It  uses  a  more

general  model of substitution than used in PHYLIP, the Stationary Markov Model

described in the paper by Saccone et al. in Methods in Enzymology  volume  183,

pages 570-583, 1990.  Bootstrapping is used to analyze the statistical error of

the results.  Output files from CLUSTAL and  PILEUP,  as  well  as  some  other

formats,  can  be used for input, and analysis can be confined to certain codon

positions in coding sequences.  The program is written in FORTRAN and  runs  on

VMS  and  Unix  systems.   It was produced by Dr. Graziano Pesole and Professor

Cecilia Saccone at the University of Bari, Italy, and is available (for  free?)

from  Dr.  Cecilia  Lanave  at CSMME-CNR, Dipartimento di Biochimica e Biologia

Molecolare, Universita` di Bari, via Orabona 4, 70126 Bari, Italy.   Her  phone

number  is 39-80-243305, her fax number is 39-80-243317, and her e-mail address

is lanave@vaxba0.ba.it or mvx36@ibacsata.it. 

     31. J. S. Armstrong, A. J. Gibbs, R. Peakall and G. Weiller, of Australian

National University, Canberra, have produced RAPDistance, a package for DOS and

(presumably) Windows systems for computing distance matrices for RAPD analyses,

for  use  in  various  phylogeny  programs.   RAPDistance  is available free by

anonymous ftp from directory pub/RAPDistance  at  life.anu.edu.au,  or  on  the

World Wide Web at http://life.anu.edu.au/molecular/software/rapd.html. 

     32.  P. R. Reeves and colleagues at  Sydney  University,  Australia,  have

produced  MULTICOMP,  a  program  for computing various distances from sequence 

data.  It is described in a paper  by  Reeves  et al.  in  CABIOS  10:  281-284

(1994).   I  do  not  know  what  computer  systems  it runs on.  Reeves may be

contacted at reeves@angis.su.oz.au for distribution information. 

     33. Ken Rice of the Department of Organismal and Evolutionary  Biology  of

Harvard  University  has  produced  RSVP (restriction site variability program)

which calculates several measures of genetic variability based  on  restriction

map  data.   It  also  produces  Jukes-Cantor  corrected distance matrices with

standard errors from collections  of  restriction  maps.   C  source  code  for

Version   2.08   of   RSVP   is   available   free   by   anonymous  ftp  from:

oeb.harvard.edu/rice    or    you    can    get     it     on     WWW     from:

http://oeb.harvard.edu/~rice.  It runs under Unix. 

     34. J. S.  Farris  and  Mary  Mickevich  earlier  released  a  package  of

phylogeny programs, PHYSYS, which, at about $5,000, was extremely expensive (in

my opinion, which is certainly a biased one).  I  am  not  sure  whether,  from

whom, or under what conditions it is still available. 

     35.  Fujitsu Ltd. ("a $21 billion  global  leader  in  advanced  computer,

telecommunications,  and  electronic devices") sells for $28,000 US a Fujitsu S

family  workstation  complete  with  a   program,   SINCAIDEN,   which   allows

"experimental  researchers,  even  those  unfamiliar  with  such analyses, [to]

easily create phylogenetic trees in their own laboratories."  The program  also

allows  searches  of the major nucleic acid sequence and protein databases (the

ad I saw does not make it clear whether these databases are provided  with  the

workstation).   The  methods  available  are  UPGMA, neighbor-joining, Farris's

(Distance Wagner)  and  the  modified  Farris  distance  matrix  methods.   The

workstation  is  SPARC  compatible  and  runs SunOS.  The SYNCAIDEN program was

developed by the group at the National Institute of Genetics, Japan  under  Dr.

Takashi  Gojobori.   Fujitsu  Ltd. may be contacted at 21-8, Nishi-Shinbashi 3-

chome, Minato-ku, Tokyo 105, Japan (phone 81-3-3437-5111 ext. 2831,  fax  81-3-

5472-4354),  or  in  the  U.S. at Fujitsu America Inc., 3055 Orchard Drive, San

Jose, California 95134-2017 (phone 1-408-432-1300  ext.  5168,  fax  1-408-434-


     36.  MUST, a package of sequence management programs, is distributed on  a

shareware  basis  by  Herve  Philippe,  Laboratoire de Biologie Cellulaire (URA

CNRS 1134 D), Batiment 444, Universite de Paris-Sud, 91405 Orsay cedex, France.

His  e-mail  address  is:  adoutte@frciti51  on Bitnet/EARN.  His phone and fax

numbers  are  respectively  and   MUST  is

available  on  a  shareware  basis  ($100  registration  fee if you do not send

diskettes) and runs on DOS systems  using  DOS  version  3  or  later.   It  is

intended  as complementary to existing phylogeny and alignment programs and can

produce output files in the formats of PHYLIP, PAUP, Hennig86, and CLUSTAL.  It

contains a variety of sequence input, editing, checking, and storage functions,

as well as a sequence editor and a phylogeny plotter.  It also  allows  further

analyses of the results from these phylogeny programs. 

     37.  Steve Smith, formerly of the Harvard Genome Laboratory,  has  written

an  X-Windows interactive sequence editor, GDE (Genetic Data Environment) which

allows the user to edit sequences and align them by hand, and to select subsets

of  sites  and  sequences  and  call  a  variety of analysis programs including

ClustalV and many of the PHYLIP 3.5 programs.  The GDE 2.0 system will  run  on

many  workstations  that  have  the  X  windowing system.  It also includes the

TreeTool tree-plotting program (see below).  GDE 2.0 is free and  is  available

for  anonymous  ftp transfer at the molecular biology software servers, such as

ftp.bio.indiana.edu     in      directory      molbio/unix/GDE,      or      at

megasun.bch.umontreal.ca  in  directory  pub/gde.  At the latter location there

are also Linux binaries, and at both there are Sun binaries. 

     38.  Mike Maciukenas, at the Department of Microbiology of the  University

of  Illinois, has written a wonderful X-windows based interactive tree-plotting

program called TreeTool.  It takes as input a PHYLIP  tree  file,  with  branch

lengths  if  they  are provided, displays the tree in either rooted or unrooted

form on any X-windows screen, and allows the user to modify  the  form  of  the

tree and the placement of nodes and labels.  When the tree is in final form the

user can have it written to a Postscript file and/or printed to  a  Postscript-

compatible  printer.   TreeTool  is  free  as  a C program for X windows and is

available   for   anonymous   ftp   from   ftp.bio.indiana.edu   in   directory

molbio/unix/GDE.   It  is  also  included  in  the  GDE  2.0  sequence analysis

environment mentioned above. 

     39. Manolo Gouy of the University of Lyon, France,  has  produced  NJplot,

which  displays  phylogenies  (input in the standard form) on Macintosh screens

and saves them in PICT files.  It is available free and  can  be  retrieved  by

anonymous  ftp  from  molecular  biology  software servers such as the European

Bioinformatics Institute's server, ftp.ebi.ac.uk,  where  it  is  in  directory


                              HOW YOU CAN HELP ME 

     Simply let me know of any problems you have had adapting the  programs  to

your computer.  I can often make "transparent" changes that, by making the code

avoid the wilder, woolier, and less standard parts of C, not only  help  others

who  have  your machine but even improve the chance of the programs functioning

on new machines.  I  would  like  fairly  detailed  information  on  what  gave

trouble,  on  what  operating  system, compiler and machine, and what had to be

done to make it work.  I will be pleased to  help  do  some  over-the-telephone

trouble-shooting, particularly if I don't pay for the call.  Electronic mail is

a particularly convenient way for me to be asked about  problems,  as  you  can

include  your input and output files so I can see what is going on.  I'd really

like these programs to be able to run with only routine changes  on  ABSOLUTELY

EVERYTHING,  down  to  and  possibly  including  the Amana Touchmatic Radarange

Microwave Oven (which is an Intel 8080 system -- early versions of this package

did run successfully on Intel 8080 systems). 

     I would also like to know timings of programs from the package,  when  run

on the three test input files provided above, for various computer and compiler

combinations, so that I can provide this information in the section  on  speeds

of this document. 

     For the phylogeny plotting programs DRAWGRAM and DRAWTREE,  Chris  Meacham

and  I  are particularly interested in knowing what has to be done to adapt them

for other common plotters, laser printers, and dot matrix printers. 

     You can also be helpful to PHYLIP users in  your  part  of  the  world  by

giving  them  the  latest  version of PHYLIP and helping them with any problems

they may have in getting PHYLIP working on their data. 

     Your help is appreciated.  I am  always  happy  to  hear  suggestions  for

features  and programs that ought to be incorporated in the package, but please

do not be upset if I  turn  out  to  have  already  considered  the  particular

possibility you suggest and decided against it. 

     I would also  like  to  know  of  any  applications  of  PHYLIP  that  get

published:  I  would appreciate receiving a reprint of any paper reporting work

that used PHYLIP. 

                              IN CASE OF TROUBLE 


solve  the  problem,  get  in  touch  with  me.  I am on electronic mail at the

addresses given below.  If you do ask  about  a  problem,  please  specify  the

program name, version of the package, computer and compiler, and be prepared to

send me your data file so I can test the problem.  Also it helps  to  have  the

relevant input and output and documentation file nearby so that we can refer to

it.  I can also be reached by calling me in my office:  (206)-543-0150,  or  at

home: (206)-526-9057 (how's THAT for user support!).  If I cannot be reached at

either place, a message can  be  left  at  the  office  of  the  Department  of

Genetics,  (206)-543-1657  but I prefer strongly that I not call you, as in any

phone consultation the least you can do is pay the phone bill. 

     Particularly if you are in a part of the world distant from  me,  you  may

also  want  to  try  to  get in touch with other users of PHYLIP nearby.  I can

also, if requested, provide a list of nearby users.

                              Joe Felsenstein

                              Department of Genetics

                              University of Washington

                              Box 357360

                              Seattle, Washington 98195-7360, U.S.A. 

Electronic mail addresses (I prefer that you use the Internet address

if possible): 




