<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Contingency, predictability in the evolution of a prokaryotic pangenome on Superphysics</title>
    <link>https://www.superphysics.org/research/sciences/evolution/</link>
    <description>Recent content in Contingency, predictability in the evolution of a prokaryotic pangenome on Superphysics</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 01 Jan 0001 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://www.superphysics.org/research/sciences/evolution/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Contingency, predictability in the evolution of a prokaryotic pangenome</title>
      <link>https://www.superphysics.org/research/sciences/evolution/predictability/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://www.superphysics.org/research/sciences/evolution/predictability/</guid>
      <description>&lt;h3 id=&#34;significance&#34;&gt;Significance&lt;/h3&gt;&#xA;&lt;p&gt;Different strains of the same prokaryotic species often show significant variation in gene content.&lt;/p&gt;&#xA;&lt;p&gt;We do not know whether this variation is due to:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;genetic drift&lt;/li&gt;&#xA;&lt;li&gt;selection&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Under selection, sets of genes would be expected to be gained or lost together, or sequentially, in a consistent and repeatable way.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;We used machine learning to predict the presence or absence of variable genes in a large set of Escherichia coli strains, using other variable genes as predictors.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Results and Discussion</title>
      <link>https://www.superphysics.org/research/sciences/evolution/predictability2/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://www.superphysics.org/research/sciences/evolution/predictability2/</guid>
      <description>&lt;h3 id=&#34;results&#34;&gt;Results&lt;/h3&gt;&#xA;&lt;p&gt;A Substantial Subset of Accessory Genes in E. coli Can Be Predicted Accurately.&lt;/p&gt;&#xA;&lt;!-- The E. coli pangenome inferred from 2,241 genomes in this study contained accessory gene families with 12,840 unique presence–absence patterns (PAPs) that were present in more than 1% and fewer than 99% of genomes and were hence included in this study. &#xA;&#xA;In total, 56,579 gene families were inferred by Panaroo, but 28,774 genes were excluded from the analysis because they were present in more than 99% or fewer than 1% of genomes. These were mostly very rare genes. &#xA;&#xA;Of the remaining 27,805 genes, 19,137 had a presence–absence pattern that was shared by at least one other gene and were hence collapsed into 4,172 presence–absence patterns, in addition to 8,668 genes with unique distributions. &#xA;&#xA;The presence or absence of 3,922 (30.5%) PAPs could be accurately predicted (both F1 scores &gt;= 0.9) in the test set after the Random Forest model had been trained.&#xA;&#xA;Of this accurately predicted set, a total of 2,144 (54.7%) had an associated D-statistic greater than or equal to 0, meaning that they were distributed widely on the tree.  --&gt;&#xA;&lt;!-- The remaining 1,778 PAPs were “clumped” on the tree, and it is therefore more difficult to ascribe causality to their associations, since a simpler explanation is that they were acquired at more or less the same time and have been vertically inherited together ever since.  --&gt;&#xA;&lt;!-- SI Appendix, Fig. S3 shows that although the D score is not directly proportional to the parsimony score, the two correlate strongly, meaning that all 2,144 PAPs with a D score greater than or equal to zero also had a parsimony score of at least eight, though most had a much higher score (SI Appendix, Fig. S3). 
&#xA;&#xA;This means that we have only examined predicted genes that have been gained and/or lost at least eight times across the pangenome, and furthermore, we require that their distribution is widespread and not localised (35).&#xA;&#xA;We focus on this set of 2,144 PAPs because they manifest a broad, patchy distribution across the phylogeny, stemming from a combination of lateral gene transfer and loss, and we can accurately predict their presence or absence based on the other genes present in the genome.&#xA;&#xA;To evaluate whether the presence–absence matrix of the 12,840 uniquely distributed PAPs is any more structured than expected by chance, given the underlying phylogeny and the gene gain and loss rates inferred in this study, we compared results from the original data to those from datasets simulated using the inferred transition rate matrices. Simulated datasets were analysed in the same way as the empirical data. &#xA;&#xA;Treating this as our null hypothesis, we can evaluate the extent to which, even after filtering by the D statistic, the predictability of a gene’s presence or absence can be explained by chance. &#xA;&#xA;In each simulated dataset, several genes pass the F1 score thresholds, but the majority of these can be explained by a low D score and are hence removed from the set of accurately predicted genes. &#xA;&#xA;The proportion of genes that successfully pass both thresholds is between 1.0% and 1.7%, which can be thought of as a false discovery rate. The empirical analysis yielded 16.7% of genes accurately predicted with D &gt;= 0 (SI Appendix, Fig. S4). 
Accordingly, we can reject the hypothesis that the associations observed in our dataset arose solely by chance, or that the pangenome dataset contains no more gene–gene correlation structure than randomly assembled data.&#xA;&#xA;In principle, we would expect the number of accurate predictions to increase with increasing quantity of data, provided that the predictions being made are not artefactual. &#xA;&#xA;Hence, if downsampling the dataset reduces the number of accurate predictions, it is reasonable to infer that adding more data would increase it. Therefore, we carried out a sensitivity analysis on dataset size. &#xA;&#xA;We randomly eliminated 50%, 75%, 90%, and 95% of the genomes in the dataset and then repeated our Random Forest prediction 10 times per dataset. In each case, reducing the number of genomes substantially and significantly reduced the number of PAPs that were accurately predicted, while having a much smaller effect on the total number of PAPs that could be analysed. For example, the average number of accurately predicted PAPs over 10 repeated analyses, after filtering out PAPs with D score &lt; 0, was 1,650/12,642 (13.1%) using 50% of the genomes, compared with 2,144/12,840 (16.7%) in our full analysis. When only 5% of genomes were included, an average of 713/11,644 (6.1%) PAPs were predicted accurately (SI Appendix, Fig. S5). This suggests that predictions would be likely to improve with the addition of more genomes.&#xA;&#xA;The links between the 2,144 predictable PAPs were used to construct a network with 33,426 edges featuring all well-predicted target nodes and their predictors (Fig. 1). This network consisted of 243 connected components ranging in size from 2 to 248 nodes, featuring both coincident and avoidance edges sensu ref. 19. 
By considering only the coincident relationships (33,138 out of 33,426 edges), we found 240 connected components containing between 2 and 244 nodes. Taking only avoidance relationships, 28 connected components were generated, ranging in size from 2 to 22 nodes. As nonunique gene patterns are collapsed into one entity in both the analysis and the presentation of results, some nodes represent multiple genes. Of the well-predicted PAPs, 827 patterns were observed in more than one gene. In total, independent of whether they were well predicted by our Random Forest model or not, 19,137 genes had nonunique PAPs and were collapsed into 4,172 patterns, which were then used both as features for prediction and as patterns to predict. --&gt;&#xA;&lt;p&gt;The Random Forest approach is stochastic, so we repeated the analysis 100 times, each time using a different random split of the data into training and test sets.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
