Saturday, 3 December 2016

Analysing my DNA: Crossovers Part 1

Recently I have been working on identifying the "crossover points" in the DNA of a group of four siblings. They are related to me, so the results of their DNA tests will not only help me in my ancestral searches, but also in discovering more about my own DNA and mapping it to specific ancestors. Crossover points are important for anyone who is trying to identify where segments of DNA came from, ie through which ancestors, as they indicate a change in the DNA, from that of one grandparent to that of the other grandparent in that couple.

There are detailed explanations of the processes involved available elsewhere online but these are the basics, for anyone who is new to this. We all have 23 pairs of chromosomes, one of each pair coming from our father and one from our mother. Just as we received them from our parents, so our parents received one of each pair of their chromosomes from each of their parents, any future child's grandparents:

Things aren't usually as simple as that shown above. Meiosis, the process of cell division that produces the egg or sperm and ensures the correct amount of DNA is passed on to the offspring, more commonly involves the two chromosomes in a pair splitting and then recombining in a different way, so that the resulting chromosome that's passed on is a mixture of the two of the parent:

 As the process of recombination is random, children of the same parents will each receive a different combination of the DNA that came to their parents from their grandparents:

Unfortunately, the DNA tests do not phase our results, so we cannot even identify our two separate chromosomes, unless we have other relatives tested. All we have to work with is the raw data and information about where we match other people.

In the case of a parent and child who have both tested, comparison with their matches may sometimes indicate a possible crossover:

Not only does this match share much less DNA with me than with my mother, they also match a group of other people who only match my mother, not me, over the latter part of the segment, from about 122,000,000 to 134,000,000. So it appears that I may have a crossover in my maternal chromosome and did not receive the rest of the segment from the ancestor shared with this match. If I knew which side of my mother's family this match is connected to then, if my mother and I have any shared matches starting after 125,000,000, I would know to concentrate my search for the shared ancestry on the other side of my mother's family.

But comparing a parent and child against their other matches is only likely to reveal a few of the potential crossovers. A better method, available to anyone with a group of three or more siblings tested, is to use the sibling comparisons to identify crossovers. As already indicated, a group of siblings will have received different segments of the grandparents' DNA, but the results are not phased by any of the testing company chromosome browsers:

I have represented in orange the (approximate!) overall matching segments of the children - but, using Gedmatch, it is possible to also identify where two people fully match, ie match on both of their chromosomes, rather than just "half match", ie match on one chromosome. It is this, more complete, pattern of matching - changing between the states of having full, half, or no, matching DNA - which is used in order to identify the points where the DNA "crossed over" from one grandparent's DNA to the other.
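For anyone who finds code easier to follow than prose, here is a minimal Python sketch of the difference between a full match, a half match and no match at a single tested position. It is only my illustration of the idea, not Gedmatch's actual algorithm, and the function name `match_state` is my own invention. Each genotype is the unphased pair of letters the raw data reports, eg "AG".

```python
def match_state(genotype_a: str, genotype_b: str) -> str:
    """Classify one tested position for two people.

    Each genotype is an unphased two-letter string such as "AG".
    Returns "full", "half" or "none".
    """
    a = sorted(genotype_a)
    b = sorted(genotype_b)
    if a == b:
        return "full"   # both letters agree (order is irrelevant, as data is unphased)
    if set(a) & set(b):
        return "half"   # at least one letter in common
    return "none"       # no letter in common


print(match_state("AG", "GA"))  # full
print(match_state("AG", "GG"))  # half
print(match_state("AA", "GG"))  # none
```

A single position proves very little on its own. It is the long, consecutive runs of the same state along the chromosome that make up a fully matching or half matching segment.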

There are probably several methods for doing this but I think most credit goes to Kathy Johnston for her "Visual Phasing" method. There are some very good blog posts about using Kathy's method, which I shall include links to below - it is worth reading several, as we all have different ways of describing what we do. I have now developed a slightly different method of working, which suits me better.

But I am going to start with the smallest chromosome, chromosome 21, and follow Kathy's instructions, to illustrate the basic method.

These are the comparisons between the four siblings at Gedmatch:

This is the key to the Gedmatch colours:

 And these are the figures for the comparisons:

I have kept the figures separate from the individual chromosome images, as I found the crossover lines end up obscuring the figures on the chromosomes with more crossovers.

Despite having looked at comparisons between the siblings in various other formats (eg the FTDNA downloads), it was only when I did these comparisons that I realised Siblings A and B do not match each other on this chromosome.

Which goes to show how we often notice just what is present - not what is missing! 🙂

But, as you can see, even where two siblings do not match each other (ie they have a grey bar along the lower section, not a blue bar) there are still some base pairs showing a half, or even a full, match. There just aren't enough of these matching base pairs in a consecutive sequence for the segment to be regarded as genealogically relevant.

The next step is for the crossover points to be identified. These are the points where there is a change between fully matching and half matching, or half matching and non-matching, ie where the bottom bar changes between blue and grey, or where the top bars change between an area that is consistently green and one which is predominantly yellow, with intermittent green. The former changes are also demonstrated by the figures. Unfortunately, the changes between fully matching and half matching are not specifically identified in any figures at Gedmatch, although Sue Griffith has explained how to obtain a very good estimate of them*. They can also be identified using one of David Pike's tools*.
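The idea of "a point where the state changes" can be sketched in a few lines of Python. This assumes you have already written down, for one pair of siblings, the runs of full, half and no matching as (start, end, state). The function name `crossover_points` is hypothetical and the start and end of the chromosome (0 and 48) are made-up round numbers; the pattern itself is the one C and D show on chromosome 21.

```python
def crossover_points(segments):
    """Return the positions where the matching state changes.

    segments: list of (start, end, state) tuples for one pair of
    siblings, sorted by start, where state is "full", "half" or "none".
    """
    points = []
    for previous, current in zip(segments, segments[1:]):
        if previous[2] != current[2]:
            # The state differs from the run before, so a crossover
            # lies at the start of the current run.
            points.append(current[0])
    return points


# C compared to D: full, then half between "37" and "40", then full again.
c_vs_d = [(0, 37, "full"), (37, 40, "half"), (40, 48, "full")]
print(crossover_points(c_vs_d))  # [37, 40]
```

This only finds the candidate lines for one comparison. The same exercise has to be repeated for every pair of siblings before the crossovers can be allocated.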

Once the crossover points have been identified, they are allocated to particular siblings - a crossover "belongs" to the sibling who shows that change in all of their comparisons. (This isn't always obvious, especially if only using three siblings - sometimes, what looks like a single crossover for one sibling can actually be a double crossover for the two others. Having results available from more than three siblings is an advantage for me.)
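The allocation rule can be written down as a small Python sketch. It assumes that, for one crossover line, you have listed which pairs of siblings change state there. The name `crossover_owners` and the example data are my own, purely for illustration.

```python
def crossover_owners(siblings, changed_pairs):
    """Return the siblings whose every comparison changes at this point.

    siblings: list of sibling labels, eg ["A", "B", "C", "D"].
    changed_pairs: pairs of siblings whose matching state changes
    at the crossover line being examined.
    """
    changed = {frozenset(pair) for pair in changed_pairs}
    owners = []
    for sibling in siblings:
        own_pairs = {frozenset((sibling, other))
                     for other in siblings if other != sibling}
        if own_pairs <= changed:   # all of this sibling's comparisons change
            owners.append(sibling)
    return owners


# A made-up crossover line where every comparison involving C changes.
print(crossover_owners(["A", "B", "C", "D"],
                       [("C", "A"), ("C", "B"), ("C", "D")]))  # ['C']
```

The sketch simply applies the rule as stated. It cannot tell a genuine single crossover apart from the double crossover situation mentioned above, which is where the extra siblings (and some judgement) come in.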

In the comparison between B to D, the matching segment does seem to start before the crossover point indicated in the comparisons between D to A and D to C. I suspect this segment could be being artificially extended through some base pairs that just happen to match on both B and D. Issues like this are things to note for future investigation, as they may be a hint that something is wrong with the identification.

Next, working with just the identified crossover lines in an image, but referring to the comparison diagram and the figures, the phased segments of the grandparents' DNA are constructed, usually starting with a segment where two siblings are fully identical. In order to do this, four colours are chosen to represent the DNA received by the children from the grandparents. Two colours are used for the top grandparent couple and two for the bottom grandparent couple. [Note: if you follow a colour coded genealogy filing system, I would suggest choosing different colours for the chromosome mapping, at least until you are absolutely positive you have identified the correct grandparents' segments. At that point you could change the colours to match your genealogy system, which would then also be a visual clue that the chromosome is "confirmed". If you use those colours prior to such confirmation, you might find yourself becoming confused, as we do not yet know which grandparent couple is represented by which phased chromosome.]

At the start of this chromosome 21, B does not match any of their siblings, whereas the other three are all fully identical to each other, so the colours can be allocated as follows:

Since neither A nor B has any crossovers, their coloured bars can be extended for the full length of the chromosome. D's can also be extended as far as D's crossover line at "40":

 Between "37" and "40", C becomes half identical to all three of the other siblings. We don't know whether the crossover is on the maternal or the paternal chromosomes (and we haven't identified the colours as being for specific grandparents anyway), so we just have to pick one of the colours to change. I have chosen the top chromosome, purple changing to blue.  As this is the only crossover C has, the two bars can then be extended to the end:

At "40", D becomes half matching to A and B, but fully matching to C. The same chromosome that we changed for C therefore needs to change for D, in order to produce the correct pattern of matching, and the other colour can be extended, unchanged, to the end:

At this stage, we don't know which colours represent which grandparents - that can only be identified by comparison to other known relatives. But we can still look at the shared matches between the siblings, to see how those results correlate with the phasing represented here. For example, I would expect there to be no shared matches between A and B at any point on this chromosome, whereas A, C and D should have exactly the same matches prior to the point "37". B, C and D will share some matches after point "40", but not all of them. The ones C and D don't share with B after "40", should be people that match A, as well.
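One way to check that the coloured bars are consistent with the Gedmatch comparisons is to work backwards from the colours to the matching they predict. The Python sketch below does that for chromosome 21. Purple and blue are the colours I used for the top chromosome; "orange" and "green" for the bottom chromosome are just stand-ins chosen for this example, and the function names are hypothetical.

```python
from itertools import combinations

REGIONS = ["start-37", "37-40", "40-end"]

# (top colour, bottom colour) for each sibling in each region.
PHASED = {
    "A": [("purple", "orange")] * 3,
    "B": [("blue", "green")] * 3,
    "C": [("purple", "orange"), ("blue", "orange"), ("blue", "orange")],
    "D": [("purple", "orange"), ("purple", "orange"), ("blue", "orange")],
}


def expected_state(first, second):
    """Matching predicted for two siblings from their colours in one region."""
    shared = (first[0] == second[0]) + (first[1] == second[1])
    return ["none", "half", "full"][shared]


def expected_table(phased):
    """Predicted matching states, region by region, for every pair."""
    table = {}
    for x, y in combinations(sorted(phased), 2):
        table[x + y] = [expected_state(p, q)
                        for p, q in zip(phased[x], phased[y])]
    return table


for pair, states in expected_table(PHASED).items():
    print(pair, dict(zip(REGIONS, states)))
```

If the predicted pattern for any pair disagrees with what Gedmatch shows, either a crossover has been allocated to the wrong sibling or the wrong chromosome has been changed.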

So, in my next post, I will explore that. I'll also describe some of the issues I have come across in this process so far, as well as explain the way I have adapted Kathy's method to my own way of working.

But, if you thought this chromosome was easy to phase, then perhaps you'd like to consider the following set of comparisons:

[PS Having begun to look at the matches the siblings have with their niece and their 1st cousin, as well as the more distant matches, I have found an "anomaly". So, perhaps phasing chromosome 21 isn't so straightforward, after all!]

* Sources and references I have found helpful:

Kathy Johnston - step by step instructions for her method: http://forums.familytreedna.com/showthread.php?t=36812 (make sure you download both the slides and the instructions)
Jason Lee - a blog post detailing Kathy's method: http://dnagenealogy.tumblr.com/post/137722603308/the-use-of-crossover-lines-among-siblings-to
Blaine Bettinger's pdf combining his five posts about the phasing process - http://thegeneticgenealogist.com/wp-content/uploads/2016/11/Visual-Phasing-Bettinger.pdf

Two other bloggers with helpful posts about phasing, including issues such as the way what looks like a single crossover for one sibling can actually be a double crossover for two others:
Ann Raymont - https://dnasleuth.wordpress.com/2016/06/01/chromosome-mapping-with-siblings-part-2/ (and part 1)
Joel Hartley: http://www.jmhartley.com/HBlog/?p=2239

Sue Griffith's post on how to obtain the values for crossovers from FIR to HIR & vice versa: http://www.genealogyjunkie.net/blog/obtaining-fir-boundaries-on-gedmatch-using-the-little-tick-marks

David Pike has a number of free DNA tools, including the "Search for Shared DNA Segments in Two Raw Data Files" which reports single and double matching segments (ie half identical and fully identical): http://www.math.mun.ca/~dapike/FF23utils/pair-comp.php

Friday, 25 November 2016

DNA Update

It has been an "interesting" year on my DNA journey. Ever since I first took an autosomal DNA test with 23andMe in 2010, I have been working on looking for what are known as "triangulating groups" (TGs) in the data. These are groups of people, who all match me over the same segment of DNA and who also all match each other over that same segment. The theory is that shared DNA indicates shared ancestry and, therefore, if a group of people all share the same segment of DNA, it must have come from the same ancestor (at some level - some of the people in the group may share a close ancestor along the line back to the overall shared ancestor.) The theory sounds "right" and logical, and it appears to fit the patterns I can see in the data:

 I liked using 23andMe for this process. It is the only testing company where it is possible to compare the people you match (and are sharing with) to each other and therefore confirm for yourself whether, or not, they form a TG. This is not possible at the other companies I have tested with. At Family Tree DNA (FTDNA), it is only possible to see where someone matches you, and whether they are "in common with" (ie also share some DNA with) any of your other matches. But you then need to ask them where they match the other people, in order to confirm if they actually match those people over the same segment that they match you on. If it is a different segment, so the TG theory went, then you may all be related to each other through different ancestors, since many of us probably have multiple ancestors in common, as we move further back in time. It was said that you could only be sure the DNA was from the same ancestor if you matched on the same segment.
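The check I was doing by hand at 23andMe can be sketched in Python like this. It is a simplified illustration under my own assumptions, not any company's tool: all the segments are taken to be on the same chromosome, the names and positions are made up, and `is_triangulating_group` is a hypothetical function name.

```python
from itertools import combinations


def covers(segment, region):
    """True if a (start, end) segment spans the whole of the region."""
    return segment[0] <= region[0] and segment[1] >= region[1]


def is_triangulating_group(my_matches, pairwise_matches, region):
    """Check whether a set of matches forms a TG over the region.

    my_matches: {name: (start, end)} where each person matches me.
    pairwise_matches: {frozenset({name1, name2}): (start, end)} where
    two of my matches match each other.
    region: the (start, end) segment being tested.
    """
    # Everyone must match me over the region...
    if not all(covers(seg, region) for seg in my_matches.values()):
        return False
    # ...and every pair must also match each other over it.
    for first, second in combinations(my_matches, 2):
        segment = pairwise_matches.get(frozenset((first, second)))
        if segment is None or not covers(segment, region):
            return False
    return True


my_matches = {"Match 1": (10, 25), "Match 2": (12, 30), "Match 3": (11, 24)}
pairwise_matches = {
    frozenset(("Match 1", "Match 2")): (12, 25),
    frozenset(("Match 1", "Match 3")): (11, 24),
    frozenset(("Match 2", "Match 3")): (12, 24),
}
print(is_triangulating_group(my_matches, pairwise_matches, (12, 24)))  # True
```

At FTDNA the `pairwise_matches` part is exactly the information that is missing, which is why each match has to be asked for it.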

Part of the difficulty in identifying the TGs at FTDNA, and why you cannot assume people who match you over what looks to be the same segment, and who are "in common with" each other, actually do match each other in the same place and therefore form a TG, is that these DNA tests do not phase the data, ie they do not split it into the two sides we received from our parents. We all have 23 pairs of chromosomes, one of each pair from our father and one from our mother - but the tests just report the two base pairs (bits of DNA!) we have at particular points along the chromosome. So, whilst it might look as if two people match you over the same segment of DNA, one could be matching you on your maternal side and one could be matching you on your paternal side. In that case, the DNA each shares with you would be from different ancestors, one on each side of your family. If the two people also happened to share another ancestor between them, they would show as "in common with" each other - but you would not all be a TG.

 [The lack of phasing also creates the possibility of "false positives" - people who show as a match but who aren't really, because the computers doing the matching have effectively criss-crossed between the base pairs of each chromosome. This is potentially an issue at both FTDNA and 23andMe, in particular. It isn't thought to be so much of an issue at Ancestry, as Ancestry does a form of phasing of the data. However, I didn't think such false matches were likely to be much of a problem, because I thought that, if a group of people were all triangulating, then the chances of all the comparisons being "computer creations" must be quite slim. I do have some groups of matches where no-one matches each other, despite all apparently matching me over the same segment - so those were the matches I took to be "false positives", as theoretically there can only be a maximum of two non-matching results over any particular segment. A third person must match one of the other two, if the matches are genuine.]
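The "maximum of two" reasoning is simply that I only have two copies of each chromosome, so three genuine matches over the same segment cannot all be on different copies. A tiny Python sketch, with a hypothetical function name, checks every possible way of assigning three matches to my maternal or paternal side:

```python
from itertools import product


def some_pair_shares_a_side(assignment):
    """True if at least two matches have been placed on the same side."""
    return len(set(assignment)) < len(assignment)


# Every way three matches could fall on my maternal or paternal copy.
all_assignments = list(product(["maternal", "paternal"], repeat=3))
print(len(all_assignments))                                         # 8
print(all(some_pair_shares_a_side(a) for a in all_assignments))     # True
```

Two people on the same side should then match each other over that segment, provided both of their matches to me are genuine. If none of the three match each other, at least one of them is likely to be a false positive.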

Although I have more of my relatives tested at FTDNA, having to contact my matches in order to find out how they match others made FTDNA seem less useful to me, especially as many people do not respond to contact. And Ancestry does not give us any tools to analyse where the actual shared DNA is, so the process of finding TGs is impossible there. Therefore, whilst the other companies do have their own advantages, 23andMe was where I did most of my "work". Most of the triangulating groups at 23andMe shared relatively small segments with me (ie between 7cM and 15cM), but I had identified the potential shared ancestry with one of my matches, a 4th Cousin 1x removed, who shared 14cM with me, and I just assumed the relationships for the other matches were likely to be further back in time.

So I was happy with my 23andMe process. I'd even agreed to do a talk for the Guild of One-Name Studies on using autosomal DNA, as I felt confident I knew what I was doing.

But a couple of months later, everything changed. A different theory had developed, partly as a result of statistics produced by Ancestry but also through the work of other scientists. These statistics suggested that it was very unlikely, if not impossible, for several cousins to actually share the same matching segment. Instead of "triangles", we now had "circles" - and suddenly that brought into question exactly what all these "triangulating groups" really are.

The "circle" theory is still based on the fact that shared DNA means shared ancestry - but now the claim was that the shared DNA would be on different segments of the chromosomes, because of the way DNA is transmitted. A parent passes half their DNA to each child, but each child receives a different half, as there is a recombination process between each parent's two chromosomes before one chromosome is passed on to the child. After several generations, there would be quite a variety of smaller segments carried by cousins descended from the same ancestor. So, rather than looking for the TGs, we should be looking for "genetic networks", clusters of people who share DNA with each other in the cluster but not necessarily over the same segments. The existence of the TGs was explained partly by features in the testing process, such as the lack of phasing, but also by the existence of what are called "population segments" - sequences of base pairs that are just very common in particular populations, so everyone has them, even though there are no close ancestors in common.

How does one know the difference between a genealogically significant triangulating segment and a population segment? Or between a group of matches who have received different segments of DNA from a single ancestor and a group of matches who match on different segments that have come to them from a variety of shared ancestors? Surely the companies are taking these factors into account when they predict the matches? Were the results from the companies even reliable?

So many questions - I felt like I was floundering.

My confidence in what I was doing certainly took a dive at that time. It didn't help that I had also uploaded the raw data for my mother and me to another organisation, DNA Land, who claim to be able to impute "missing" (by which I assume they mean "untested") areas of DNA, in order to produce a more complete sequence. Yet the matches they suggested as a result of this process were not only far fewer than I have at the other companies, they also included people who don't appear to match me at any of the other companies. That seems strange, given that I have tested at all three of the main companies. I know only a small number of my matches elsewhere will have uploaded to DNA Land, but the differences still seemed quite significant [ie only three matches, including Mum, for me at DNA Land - compared to the 1888 I currently have at 23andMe, 1146 at FTDNA, and almost 6000 at Ancestry!]

Was this DNA testing all a waste of time (and money!)?

When in doubt - I go back to what I know. Just as I work from the known to the unknown in my normal genealogy, I realised I needed to do that more with my DNA research, as well. A "stab in the dark" may occasionally hit a target but it's just as likely to leave me floundering around in the darkness, following blind alleys.  And that's what looking for shared ancestry just from the TGs felt like.

The statistics from all of the companies indicate that autosomal test relationships can only be predicted reliably for about the first five generations. That is not to say we won't show a match to more distant relatives - it's just that, the more distant the relationship, the more difficult it becomes to predict the level of that relationship, as the range of possibilities increases. A single segment of DNA may be passed on unchanged for many generations. But, in all the test results, I knew my known relatives always showed up as they should do. My mother was definitely my mother (not that I doubted that!) And my father's known relatives all show up as matches at the right levels.

So DNA testing works!

Beating the temptation to run and hide, I gave the talk in August, describing the two theories and commenting that "most of us don't understand enough about the statistics to make definitive claims either way so a combination of the methods seems to be the best approach. Both methods are valid but have caveats, eg small segments often appear to triangulate, but may not be genuine, clusters of people sharing different DNA may be due to having multiple ancestors in common."

Some bloggers do seem to be finding segments that are shared by groups of distant cousins. The problem for many of us in the UK, though, is that often we don't have sufficient "middle-distance" relatives identified (both in our genealogy and in our DNA) to produce the sort of success stories that many in the US seem to be experiencing. For example, I only have 29 fourth cousins in the Ancestry "4th cousins & closer" section, whereas some of the American results I have seen have between 400 and 750 relatives at that level!

But I have had some success in identifying relationships with my matches - I now have the potential shared ancestry identified for 10 of them (and if the 10th is actually correct, it's a big clue as to which of my ancestral lines three other shared matches fit into). So that's a start.

As well as confirming my genealogy & finding new relatives, one of my goals with DNA testing is mapping where my DNA came from. Identifying shared ancestry with my matches is one part of this process and, so far, my chromosome map, mapping DNA received to the relevant "most recent common ancestor" (MRCA), looks like this:

Chromosome 4 shows where a known Parry segment contains within it a Saunders segment:

And this shows how that Saunders segment of DNA appears to have passed down to my Parry grandfather:

Any other matches over the identified segments on the chromosome map should (if the identification is correct) be either a descendant of the same couple, or a descendant of one of their ancestors. 

I think there needs to be a continual checking process, using both DNA and genealogy - for example, having found a genealogical connection to one of my DNA matches at Ancestry, we were then able to confirm, using FTDNA, that the person also matched my mother over the same segment, and that neither my mother, nor I, matched the person's father (both requirements necessary for the genealogy to be correct.) 

Since I have several close relatives tested, it gives me the opportunity to work from the DNA data backwards, rather than just concentrating on those potential triangulating groups of distant relatives. My DNA consists of segments of the DNA of my grandparents, passed to me by each of my parents. The "crossover points", where a segment from one grandparent switches over to a segment from the other grandparent can (sometimes) be identified in our DNA, using the details of how we match close relatives. This is a process I began looking at some years ago, using tools written by David Pike. But now more of my relatives are on Gedmatch, I can use the "Visual phasing" method as explained by Kathy Johnston, which should be a lot easier. 

I have been working on this recently and will post about the process soon (now there's a challenge to myself!)

Friday, 24 June 2016

My Ancestors and their Descendants - my potential DNA Tree

Earlier this year, our ISP informed us that it would no longer support personal web spaces - a poor decision in my view (of course!)

The upside of this is that it will force me to do the web site "re-write" that I set as a goal in 2015.

The downside is that I haven't done it yet, so my Parry Surname Research (Family History and the One-Name Study) site has disappeared.

Theoretically, since the site was written in html and css, it would have been quite easy to just upload all the files elsewhere.  But then there'd be little incentive to get the rewrite done.  And, with the development of the Guild's "Members' Websites Project", it seems an ideal opportunity to separate out any personal family history from the Parry One-Name Study information, and to ensure the long term survival of the ONS data by placing it on the Guild's site.

So that's the plan. And it is in progress (slowly).

But today, frustrated at the loss of my "DNA tree", which I really need to accompany the autosomal DNA project I have set up at Family Tree DNA, I decided to try uploading that here, on Blogger.  It's taken a bit of tweaking of the coding, especially on the page width, which I hope I don't accidentally delete, but at least the information is available again:

My Ancestors and their Descendants - my potential DNA Tree

And now I've been reminded of just how many of my ancestors and their descendants I still need to trace. ☺