Friday, 22 April 2016

A Comparison of Big Y Analysis Resources

There are several very useful resources available to us for interpreting the results of our Big Y tests. Here is a brief summary of what we get and what we don't from each resource - all have their Pros and Cons and all add something to the overall interpretation of the results. Because this is a completely new area of science, and we are on the crest of the wave of scientific discovery, the different analyses from the different sources frequently produce different results, which in turn allow us to ask why and refine our methodologies further. This will continuously change over time as we understand more, adjust our SNP declaring criteria, and refine our interpretation of the data. We can expect big changes to take place over the next few years.


FTDNA
The presentation of the Big Y results from FTDNA is currently quite limited. This is not surprising: they were the pioneers in this area, and their first attempt at presenting the data is now outdated. But this is due to change in the near future when they introduce their new Big Y features. We don't yet know what these are, but we can expect exciting developments over the next few months. The march of progress carries on!

Currently we are given a list of close matches and the number and nature of Shared Novel Variants with each match (i.e. shared new SNPs), also Known SNPs that are not shared, and SNPs that are unique when comparing just two specific individuals. FTDNA also places us on their own version of the Haplotree (the human evolutionary tree) which tells us what SNPs lie at branching points above our own particular sub-branch.

Confusion arises from a number of different issues, some of them general points, some of them FTDNA-specific:
  • the separation of new ("Novel") SNPs from those already identified ("Known")
  • FTDNA's high threshold for declaring a SNP means that some SNPs are missed
  • often no SNP names are reported, only SNP positions - you have to go to YBrowse.org or YFULL to get specific information regarding the name of SNPs present at a particular location on the Y chromosome.
However, and most importantly, FTDNA provides the facility to download our raw data (in .vcf, .bed, and .BAM files), which allows us to have the data analysed and interpreted by a host of other resources. Details of how to access and share these files can be found here. The "Download VCF" option will download the .vcf and .bed files (about 1.6 MB) and the "Share BAM" option will allow you to copy a temporary link to your BAM file (which is over 600 MB in size and so is far too big to be sent by email).
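For readers curious about what is inside the downloaded .vcf file, here is a minimal sketch in Python of how the variant rows could be pulled out. It assumes only the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER); the sample rows, positions, and the "REJECTED" filter label are invented for illustration and are not taken from a real Big Y file.

```python
import io

# Hypothetical excerpt in standard VCF layout. Real files are far longer,
# and every value below is invented for illustration.
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chrY\t2650000\t.\tA\tG\t100\tPASS\t.\n"
    "chrY\t2660000\t.\tC\t.\t100\tPASS\t.\n"
    "chrY\t2670000\t.\tT\tC\t12\tREJECTED\t.\n"
)

def passing_variants(handle):
    """Yield (position, ref, alt) for rows that pass the filter and differ from the reference."""
    for line in handle:
        if line.startswith("#"):
            continue  # skip meta-information and the column header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed rows
        _chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        # ALT of "." means no alternate allele was seen at this position
        if flt == "PASS" and alt != ".":
            yield int(pos), ref, alt

variants = list(passing_variants(io.StringIO(SAMPLE_VCF)))
print(variants)
```

To try this on a real download, the same function could be given an open file handle instead of the in-memory sample. Which rows a given analysis treats as reliable is a judgment call, so the "PASS" test here is only one possible criterion.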

My current position on FTDNA's haplotree with details of SNPs tested
(green, positive; red, negative)


YFULL
YFULL gives us a more detailed analysis of BAM files and places us on their Y-Haplotree in relation to other people nearby (i.e. who have undergone NGS [Next Generation Sequencing] testing, like the Big Y). It also identifies our terminal SNP (or SNP block), the SNPs at branching points further upstream (and hence the Shared SNPs we have with our neighbours), and the unique / personal / private SNPs that each member possesses (currently).

Over and above FTDNA's analysis, YFULL tells us the following:
  • how many people on adjacent branches have tested and where they are from
  • SNP names and any "equivalent SNPs" (i.e. exactly the same position on the Y chromosome but alternative name)
  • time estimates for the formation of each SNP (and hence the particular branching point)
  • TMRCA estimates for the people in each sub-branch (with 95% Confidence Intervals)
  • information on each SNP including position on the Y, ancestral and derived values, alternative names, and reference sequence. This information can be supplemented by YBrowse.org
  • easier-to-understand presentation of currently unique (personal) SNPs with an estimate of their "quality"
  • data on about 500 STRs including the majority found in FTDNA's 111 marker panel - this can be helpful in calculating TMRCA estimates and narrows the 95% range around the estimate (compared to TMRCAs based on 111 marker data)

The Gleeson Lineage II portion of the YFULL haplotree


Full Genomes (FGC) Analysis
The FGC analysis of BAM files is comparable to the YFULL analysis but their public Y-Haplotree is not as user-friendly as others and is of limited utility. FGC generates the following reports with the underlying files (processed BAM file, mtDNA, and STRs):
  • A detailed report analysing the called variants
  • A variant genotyping report
  • Haplogroup classification
  • Y-STR report, and 
  • mtDNA report

clarifYDNA Analysis
clarifYDNA will reanalyse your Big Y data for $30 and produce a Y-DNA haplotree report from your results - this will be periodically updated as new data becomes available from other testers. Reports are in sync with a recent version of the ISOGG haplotree, and are able to indicate which aspects of the phylogenetic structure are robust and which are more tenuous - thus they combine aspects of the Y-haplotree that are "established" with those that are "provisional / experimental".

Unfortunately the tree is only available to subscribers and is not available to the public.

Click here for an Example analysis.


Haplogroup Project Administrators 
The Administrators at the Z255 Haplogroup Project, and indeed the Admins of more upstream Haplogroup Projects (e.g. L21, R1b & subclades, etc) are an incredible resource and their respective Yahoo Discussion Groups are great places to post questions and get replies.

John Murphy puts together a regularly updated spreadsheet / haplotree for the Z255 group which has an advantage over YFULL's analysis - it incorporates new SNP discoveries from specific SNP Packs and single SNP testing and not just from NGS testing (Big Y, FGC).

The Gleeson Lineage II portion of John Murphy's spreadsheet


Alex Williamson's "Big Tree"
Alex is one of the most important people in the R1b research community and is a champion of data analysis, interpretation, and (most importantly) presentation. His Big Tree website (www.ytree.net) is a masterful display of complex data in a digestible format. He places us on his haplotree so that we can see our terminal SNP (block), SNPs at upstream branching points, and our neighbours on adjacent branches.

Advantages over YFULL include:
  • The Big Tree gives us our neighbours' names and places of origin, thus making it easier to form an impression of where a particular sub-branch might have formed and if it is specific to a particular surname.
  • Easy to navigate with lots of additional information by simply clicking on a surname or a SNP.
  • His graphics are superb.
  • His Mutation Matrix allows us to see which SNPs are shared and which SNPs are not between us and our closest neighbours. 
  • His presentation of unique (personal) SNPs gives us not only an estimate of "quality" but also the region of the Y-chromosome in which they are found (this can be useful in judging if this is a true SNP and also how easy it would be to test for it in a bespoke SNP Panel)
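
The idea behind a mutation matrix (which SNPs are shared and which are not) can be illustrated with a short Python sketch. This is not how the Big Tree itself is built; it is simply one way to express the comparison using sets, and the tester labels and SNP names are hypothetical.

```python
# Hypothetical testers and SNP names, invented purely for illustration.
derived = {
    "Tester A": {"SNP1", "SNP2", "SNP3"},
    "Tester B": {"SNP1", "SNP2", "SNP4"},
    "Tester C": {"SNP1", "SNP5"},
}

def mutation_matrix(calls):
    """Map each SNP to a row showing which testers carry the derived value."""
    all_snps = sorted(set().union(*calls.values()))
    return {
        snp: {name: snp in snps for name, snps in calls.items()}
        for snp in all_snps
    }

def shared_by_all(calls):
    """SNPs that every tester in the group carries (a candidate shared branch)."""
    return set.intersection(*calls.values())

def private_to(calls, name):
    """SNPs seen in one tester and in nobody else in the group."""
    others = set().union(*(s for n, s in calls.items() if n != name))
    return calls[name] - others

matrix = mutation_matrix(derived)
for snp, row in matrix.items():
    print(snp, ["+" if present else "-" for present in row.values()])
```

In this toy data, SNP1 is shared by all three testers, SNP2 groups Tester A with Tester B, and SNP3, SNP4, and SNP5 are each private to one person (for now).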

The (current) 4 branches of the Gleeson Lineage II on the Big Tree


Nigel McCarthy's Z255 Subgroup
Nigel is another pioneer. He is one of the first people to combine SNP markers and STR markers into a single tree. We are lucky enough in the Gleeson Lineage II group that we are closely related to some of the people in Nigel's McCarthy DNA Project. As a result, Nigel has included us in the Z255 portion of his phylogenetic tree (Group E).

Nigel's own SNP analysis is complementary to the ones above and he too will occasionally discover new SNPs that others have not included in their analyses.

Major advantages over previous analyses include:
  • As well as the SNPs, he also presents STR data and the change in STR values at each branching point
  • He includes people who have not been tested on the Big Y (i.e. anyone with Y-DNA-67 or Y-DNA-111 results). As a consequence, his portion of the haplotree contains more Gleesons from Lineage II than any other tree - 12 members altogether (compared to 9 members on Alex's tree, 9 on John Murphy's, and 6 on the YFULL tree).
  • From Nigel's analysis it is possible to see where Back Mutations and Parallel Mutations occur in the STR markers.


The Gleeson Lineage II members in Nigel McCarthy's Group E of his McCarthy DNA Project


Mike W’s Haplotype Data for R1b-L21
Mike is an administrator of several FTDNA projects and has long been a leader in the genetic genealogy community. He maintains a very comprehensive spreadsheet that can be downloaded from the Links section of the R1b-L21 project Yahoo group, or in a smaller version from the Z255 Yahoo group. This spreadsheet collects the STR values for 67 and 111 markers, and SNPs from Big Y tests or other sources. A user can calculate his genetic distance in relation to the complete database and can infer his haplotype from the most common haplotype of his closest matches. The spreadsheet also calculates the group mode and several statistics required to characterize a particular group.
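
As a rough illustration of the two calculations mentioned here, the Python sketch below computes a genetic distance between two STR haplotypes and a modal haplotype for a group. It is not taken from Mike's spreadsheet. It assumes a simple stepwise count (the sum of absolute differences at each marker), which is only one of several ways genetic distance can be scored, and the kit labels and marker values are invented.

```python
from collections import Counter

# Hypothetical kits with invented values at three STR markers.
haplotypes = {
    "Kit 1": {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    "Kit 2": {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    "Kit 3": {"DYS393": 13, "DYS390": 24, "DYS19": 15},
}

def genetic_distance(a, b):
    """Sum of absolute repeat differences over markers both kits have (assumed stepwise model)."""
    return sum(abs(a[m] - b[m]) for m in a.keys() & b.keys())

def modal_haplotype(group):
    """Most common value at each marker across the group."""
    markers = set().union(*(h.keys() for h in group))
    return {
        m: Counter(h[m] for h in group if m in h).most_common(1)[0][0]
        for m in markers
    }

modal = modal_haplotype(list(haplotypes.values()))
for name, hap in haplotypes.items():
    print(name, "distance from modal:", genetic_distance(hap, modal))
```

Comparing each kit against the group modal, rather than only against other individual kits, gives a quick sense of who sits closest to the presumed ancestral pattern of the group.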


James Kane's SNP Matrix
I am new to James' SNP Matrix but it too is a work of art, a magnum opus, not surprising for a 90 MB spreadsheet. Yet again, James' approach to NGS data analysis offers a fresh perspective and can detect possible/probable SNPs that have not turned up in other analyses. Having multiple analyses and interpretations of the same data is a great advantage - it allows us to see points of agreement and points of difference in the various approaches, and ultimately helps us to question the data more intelligently which in turn will lead to better analysis and interpretation.

James’ matrix compares the SNPs of all the participants while the other methods preselect the relevant SNPs. This capability is very important for identifying potential new SNPs, which can then be checked against other analyses for consistency or disagreement. It is also possible to evaluate unique SNPs, or SNPs that belong to a particular group, at different levels of quality and to see whether they fall within the combBED area. The matrix also provides positions for both build 37 (GRCh37) and build 38 (GRCh38) of the human reference genome.

The matrix workbook requires BAMs for inclusion. The scripts visit each file at every variant location and output the read depth to a very large combined VCF file. The idea is to remove the ambiguity of BED files. The old HTML-based pages did include everything but became unwieldy, and they are being replaced soon.
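
The general idea of visiting every sample at every variant location can be sketched in a few lines of Python. This is not James' actual script: reading the BAM files is left out entirely, and the per-sample data, positions, and kit labels below are hypothetical stand-ins. The point it tries to show is that recording the read depth separates "no reads here" from "reads here, but ancestral".

```python
# Hypothetical per-sample data: position -> (read depth, observed base).
samples = {
    "Kit 1": {100: (30, "G"), 200: (25, "C")},
    "Kit 2": {100: (28, "A"), 300: (12, "T")},
}
# Hypothetical variant locations: position -> (reference base, derived base).
variant_sites = {100: ("A", "G"), 200: ("C", "T"), 300: ("C", "T")}

def call_matrix(samples, variant_sites):
    """Build one row per variant location with (depth, state) for every sample."""
    rows = []
    for pos in sorted(variant_sites):
        ref, alt = variant_sites[pos]
        row = {"pos": pos}
        for name, reads in samples.items():
            depth, base = reads.get(pos, (0, None))
            if depth == 0:
                state = "no call"    # no reads, so nothing can be said
            elif base == alt:
                state = "derived"
            elif base == ref:
                state = "ancestral"
            else:
                state = "other"      # a base matching neither ref nor alt
            row[name] = (depth, state)
        rows.append(row)
    return rows

matrix = call_matrix(samples, variant_sites)
for row in matrix:
    print(row)
```

In this toy example, Kit 2 has no reads at position 200, so it is reported as "no call" rather than being assumed ancestral, while Kit 1 at the same position has 25 reads of the reference base and is reported as ancestral.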

James also has a blog site and an Experimental Y Tree (currently being updated) with SNP names & their equivalents, surnames, places of origin, and TMRCA estimates.

Everyone who has done the Big Y test should send James a link to their BAM file so he can include them in his analysis. This looks at the data from yet another perspective and helps further with the interpretation. It should help clarify the discoveries from other sources and may even identify some additional SNPs.

Below are the instructions and an explanation that James has put together about sending him your BAM file for analysis and what will happen after that:

What’s needed?
A link to your Big Y BAM file using the “Share BAM” button on your Raw Results page.  Let me know if detailed instructions would be helpful.  Please also include this statement in the email:

As the owner or administrator of FTDNA kit#, [YOUR KIT#], I consent to allow analysis of the Y DNA contained in the provided BAM file.  The results of this analysis may be used for the phylogenetic tree of haplogroup R or by independent researchers for scientific purposes.

What will be done?
Your BAM will be downloaded and realigned to GRCh38.  This will allow a new VCF/BED to be created and compared with others.  Results will be included in http://www.it2kane.org/matrix/R-P312.html.  When sufficient analysis is available for the branches, it will be possible to include time to most recent common ancestor estimation based on these results.  The new VCF/BED will be provided to those interested.

What won’t be done?
Unlike the commercial 3rd party analysis, you won’t get mtDNA, STR value estimates, or variant naming any time in the near future.

See below for my data use policy.

Raw Data Policies

In light of the recent dust-up between FTDNA and another 3rd party site, I have codified my data usage policies.
1.     VCF/BED files submitted for analysis are made available for other R-L21 researchers using the R1b-L21(S145) Haplogroup and Subclades Y DNA forum hosted on Yahoo.  This aids researchers to correctly assign variants to their related haplogroups.
2.     Raw BAM data is retained in a password protected cloud storage account.  The project recognizes there is a low probability that files may contain data not actually on the Y chromosome, which may reveal medically relevant information about the participant. BAM files may be individually shared with qualified researchers and analysts only after approval of the sample’s owner.
3.     GRCh38 aligned versions of variant calls and BED coverage generated by the project’s bio-informatics workflow can be shared with researchers without the sample owner’s explicit authorization.
4.     FTDNA kit #’s are displayed for convenience of related surname projects or haplogroups in all reporting.  As the identifier is used to log into the FTDNA account this has security implications for the kit owner.  Project members may request that tree or call matrix reports use an internal project ID instead.
5.     Project members have the right to request that their raw data is removed from reporting at any time, but shared variants in the tree will be retained.


Some Closing Thoughts
This is a new science and we are still trying to get to grips with it. The pithy saying "Many hands make light work" operates quite nicely in this situation. It is only by looking at the data from a variety of different perspectives that we can hope to understand it better, and quickly. So we should be using all of the above utilities to analyse and interpret our Big Y results. With the internet, this process (which previously would have taken decades to complete) can now be accomplished in a matter of years through what is effectively a crowd-sourcing approach - a group of citizen scientists working together toward a common goal and employing the power of the internet to communicate and collaborate effectively.

There is still a lot of testing to be done - we need more people to do the NGS tests (Big Y, FGC tests, etc). And we need clever people to develop more tools for analysis, interpretation, & presentation of the data. But as this critical mass of people tested builds, and as our ability to analyse, interpret, and present the data improves, we will begin to reap greater and greater dividends.

Software packages are being developed to help build combination family trees using SNP data, STR data, and standard genealogy. Already you are able to add DNA markers to your Family Tree on Ancestry. This will advance even further and trees will start to be linked online via downstream SNP markers.

Furthermore, for Irish genealogies at least, we will be able to link some of our family trees to the Ancient Annals and Genealogies, bringing us back to before the time of surnames, back to 900 AD, 800 AD, 700 AD, and even further.

In a few years, when we look back at this time in human history, we will be able to say ...
I was there. 
I contributed to that. 
I helped build the Evolutionary Tree of Mankind. 
And I know exactly where I sit on it.


Maurice Gleeson
German Creamer
Lisa Little
April 2016






