TL;DR: FastTree is a method for constructing large phylogenies and for estimating their reliability, instead of storing a distance matrix, that uses sequence profiles of internal nodes in the tree to implement Neighbor-Joining and uses heuristics to quickly identify candidate joins.
Abstract: Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement Neighbor-Joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest neighbor interchanges to reduce the length of the tree. For an alignment with N sequences, L sites, and a different characters, a distance matrix requires O(N2) space and O(N2L) time, but FastTree requires just O(NLa + N) memory and O(Nlog (N)La) time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 h and 2.4 GB of memory. Just computing pairwise Jukes–Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 h and 50 GB of memory. In simulations, FastTree was slightly more accurate than Neighbor-Joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree.
TL;DR: The distance-based redundancy analysis (db-RDA) as mentioned in this paper is a nonparametric multivariate analysis of ecological data using permutation tests that is used to partition the variability in the data according to a complex design or model, as is often required in ecological experiments.
Abstract: Nonparametric multivariate analysis of ecological data using permutation tests has two main challenges: (1) to partition the variability in the data according to a complex design or model, as is often required in ecological experiments, and (2) to base the analysis on a multivariate distance measure (such as the semimetric Bray-Curtis measure) that is reasonable for ecological data sets. Previous nonparametric methods have succeeded in one or other of these areas, but not in both. A recent contribution to Ecological Monographs by Legendre and Anderson, called distance-based redundancy analysis (db-RDA), does achieve both. It does this by calculating principal coordinates and subsequently correcting for negative eigenvalues, if they are present, by adding a constant to squared distances. We show here that such a correction is not necessary. Partitioning can be achieved directly from the distance matrix itself, with no corrections and no eigenanalysis, even if the distance measure used is semimetric. An ecological example is given to show the differences in these statistical methods. Empirical simulations, based on parameters estimated from real ecological species abundance data, showed that db-RDA done on multifactorial designs (using the correction) does not have type 1 error consistent with the significance level chosen for the analysis (i.e., does not provide an exact test), whereas the direct method described and advocated here does.
TL;DR: In this article, the accuracies and efficiencies of three different methods of making phylogenetic trees from gene frequency data were examined by using computer simulation, and the results obtained indicate that in all tree-making methods examined, accuracies of both the topology and branch lengths of a reconstructed tree (rooted tree) are very low when the number of loci used is less than 20 but gradually increase with increasing number of nodes.
Abstract: The accuracies and efficiencies of three different methods of making phylogenetic trees from gene frequency data were examined by using computer simulation. The methods examined are UPGMA, Farris' (1972) method, and Tateno et al.'s (1982) modified Farris method. In the computer simulation eight species (or populations) were assumed to evolve according to a given model tree, and the evolutionary changes of allele frequencies were followed by using the infinite-allele model. At the end of the simulated evolution five genetic distance measures (Nei's standard and minimum distances, Rogers' distance, Cavalli-Sforza's fλ, and the modified Cavalli-Sforza distance) were computed for all pairs of species, and the distance matrix obtained for each distance measure was used for reconstructing a phylogenetic tree. The phylogenetic tree obtained was then compared with the model tree. The results obtained indicate that in all tree-making methods examined the accuracies of both the topology and branch lengths of a reconstructed tree (rooted tree) are very low when the number of loci used is less than 20 but gradually increase with increasing number of loci. When the expected number of gene substitutions (M) for the shortest branch is 0.1 or more per locus and 30 or more loci are used, the topological error as measured by the distortion index (dT) is not great, but the probability of obtaining the correct topology (P) is less than 0.5 even with 60 loci. When M is as small as 0.004, P is substantially lower. In obtaining a good topology (small dT and high P) UPGMA and the modified Farris method generally show a better performance than the Farris method. The poor performance of the Farris method is observed even when Rogers' distance which obeys the triangle inequality is used. The main reason for this seems to be that the Farris method often gives overestimates of branch lengths. For estimating the expected branch lengths of the true tree UPGMA shows the best performance. For this purpose Nei's standard distance gives a better result than the others because of its linear relationship with the number of gene substitutions. Rogers' or Cavalli-Sforza's distance gives a phylogenetic tree in which the parts near the root are condensed and the other parts are elongated. It is recommended that more than 30 loci, including both polymorphic and monomorphic loci, be used for making phylogentic trees. The conclusions from this study seem to apply also to data on nucleotide differences obtained by the restriction enzyme techniques.
TL;DR: In this paper, the authors proposed the use of principal coordinate analysis (PCO) followed by either a canonical discriminant analysis (CDA) or a canonical correlation analysis (CCorA) to provide a flexible and meaningful constrained ordination of ecological species abundance data.
Abstract: A flexible method is needed for constrained ordination on the basis of any distance or dissimilarity measure, which will display a cloud of multivariate points by reference to a specific a priori hypothesis. We suggest the use of principal coordinate analysis (PCO, metric MDS), followed by either a canonical discriminant analysis (CDA, when the hypothesis concerns groups) or a canonical correlation analysis (CCorA, when the hypothesis concerns relationships with environmental or other variables), to provide a flexible and meaningful constrained ordination of ecological species abundance data. Called “CAP” for “Canonical Analysis of Principal coordinates,” this method will allow a constrained ordination to be done on the basis of any distance or dissimilarity measure. We describe CAP in detail, including how it can uncover patterns that are masked in an unconstrained MDS ordination. Canonical tests using permutations are also given, and we show how the method can be used (1) to place a new observation into...
TL;DR: FastTree as mentioned in this paper uses sequence profiles of internal nodes in the tree to implement neighbor-joining and uses heuristics to quickly identify candidate joins, then uses nearest-neighbor interchanges to reduce the length of the tree.
Abstract: Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement neighbor-joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest-neighbor interchanges to reduce the length of the tree. For an alignment with N sequences, L sites, and a different characters, a distance matrix requires O(N^2) space and O(N^2 L) time, but FastTree requires just O( NLa + N sqrt(N) ) memory and O( N sqrt(N) log(N) L a ) time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 hours and 2.4 gigabytes of memory. Just computing pairwise Jukes-Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 hours and 50 gigabytes of memory. In simulations, FastTree was slightly more accurate than neighbor joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree.