Wednesday, 9 August 2017

Shared matches - matches who match both my paternal and maternal lines

This is just a quick post, to show the information I looked at, in order to reply to a question Debbie Kennett asked on the ISOGG DNA-NEWBIE mailing list.  The question was "how many people have double matches in their tree, ie, where a person has a match with both your mother and your father."

Now, I don't have my father tested - he passed away in 2001.  However, I do have all four of his siblings tested, as well as a paternal first cousin of theirs.  So, whilst it's not quite the same, as I know there are still some areas of my chromosomes where none of my Dad's relatives match me, the data should give me a reasonable indication of the overlap between my matches and the two different sides of my family (which, as far as I am aware, are not related to each other).

I had actually noticed this 'matching to both sides' some time ago, when I first started playing about with my FTDNA "in common with" (ICW) data and the Pajek program (which I mentioned in my previous post) just to see what the program did.  I realised then that eleven of my matches seemed to match both sides of my family.  Yesterday, I decided to check the current situation in order to answer Debbie's question.

To do this, I used the DNAGedcom Client app to download my ICW file from FTDNA.  I then extracted all the matches who are in common between me and my six relatives (my mother, Dad's four siblings, and their paternal 1c).  I then used the Pajek program to display the information.  This first image was produced using the options "Energy: Kamada-Kawai: separate components"

The program can display the names associated with each point, but I have obviously removed those for privacy reasons.  It is quite clear that there are two main clusters, with sixteen matches spanning the two groups. I spread those sixteen out manually, to make them more obvious, but it's not very easy to see what is happening within the two groups, so next I tried the options "Energy: Fruchterman Reingold: 3D".  Again, I've straightened out the sixteen matches in the middle and this time allocated reference numbers to them, as well as to my relatives:

In this image, as well as the sixteen matches who link to both the paternal and maternal sides of my family, the clusters of matches for each of my father's relatives are more distinct.

The same information can also be discovered by using a spreadsheet containing all six of the ICW files combined together and creating a pivot table with match names down the rows, and my relatives as the column headings, the table then showing a count of the match names.  By filtering on all those who match my mother, and gradually working through all those who match one or more of my paternal relatives, the full list of people matching both sides of my family can be obtained.

Doing this in a pivot table has the additional benefit that, once the list of people who match both sides is completed, it can be used to pick out the same people from the chromosome browser (CB) file*, so that the actual nature of the matching segments can be examined.
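For anyone who would rather script the pivot-table step than build it in a spreadsheet, the following Python sketch shows one possible approach. The relative labels, match names and row layout are all made up for illustration (they are not the real FTDNA column names), so treat it as a rough outline rather than a finished tool.

```python
from collections import defaultdict

# Hypothetical combined ICW rows: (relative whose ICW file the row came from, match name).
icw_rows = [
    ("Mother", "Match A"),
    ("Mother", "Match B"),
    ("Paternal aunt 1", "Match B"),
    ("Paternal uncle 1", "Match C"),
    ("Paternal cousin", "Match B"),
    ("Mother", "Match D"),
    ("Paternal aunt 2", "Match D"),
]

# Assumed labels for the tested relatives on each side of the family.
MATERNAL = {"Mother"}
PATERNAL = {
    "Paternal aunt 1",
    "Paternal aunt 2",
    "Paternal uncle 1",
    "Paternal uncle 2",
    "Paternal cousin",
}


def matches_on_both_sides(rows):
    """Return the match names that appear with at least one relative on each side."""
    relatives_by_match = defaultdict(set)
    for relative, match in rows:
        relatives_by_match[match].add(relative)
    return sorted(
        match
        for match, relatives in relatives_by_match.items()
        if relatives & MATERNAL and relatives & PATERNAL
    )


print(matches_on_both_sides(icw_rows))
```

The dictionary of sets plays the role of the pivot table: each match name is a row, and the set records which relatives' columns would hold a count.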

I've allocated matches to the maternal or paternal sides of my family on the basis of which of my relatives shares the same segment that the match shares with me.  However, in the cases where I share two segments with a match (M11 and M14) the segments are each shared by different sides of my family, so that it appears I connect to those matches through both the paternal and the maternal sides of my family:

It will be interesting to see if those matches turn out to be genuine!

Pajek Quick Reference sheet

Picking out the CB data for the "Both" people
There are probably several ways of doing this but, as I am sure there are other people in the same situation that I am in, having to learn it as I go along (and relearn it every time I want to do something similar!), these are the details of what I did:
Having cut and pasted the list of people identified as matching both sides into the first column of a new spreadsheet in the CB file, I pasted the following formula into an empty cell alongside the first match in the CB spreadsheet (replacing each bracketed placeholder with the appropriate information):  =VLOOKUP([the cell reference of the Full name column for the first match in the CB spreadsheet],'[the name of the new spreadsheet containing the list of people matching both sides]'!A:A,1,FALSE) , where the A is the column in the new spreadsheet containing the list of names who match both sides, so make sure that list is pasted into the first column, labelled A.  I then used 'Fill down' to copy the formula to all the cells in the CB column.  The result is that each cell shows either #N/A, if that match is not on the "Both" list, or the name of the match, if they are on the "Both" list.  I then used the filter function to show just the rows containing a match name and copied all the CB data for those matches into a new spreadsheet, which I used to create the table where I have allocated the matches to maternal and paternal sides.
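The VLOOKUP-and-filter steps above amount to keeping only the CB rows whose name is on the "Both" list. A minimal Python sketch of the same idea follows; the field names ("name", "chromosome", "start", "end", "cm") and the sample rows are assumptions for illustration and will not match a real chromosome browser download exactly.

```python
# Hypothetical list of people who match both sides of the family.
both_names = {"Match B", "Match D"}

# Hypothetical chromosome browser rows; real files use different column headings.
cb_rows = [
    {"name": "Match A", "chromosome": "1", "start": 1_000_000, "end": 9_000_000, "cm": 8.2},
    {"name": "Match B", "chromosome": "4", "start": 2_500_000, "end": 14_000_000, "cm": 12.6},
    {"name": "Match B", "chromosome": "9", "start": 500_000, "end": 7_000_000, "cm": 9.1},
    {"name": "Match D", "chromosome": "15", "start": 20_000_000, "end": 31_000_000, "cm": 11.4},
]


def cb_rows_for_both(rows, names):
    """Keep only the segment rows belonging to people on the 'Both' list."""
    return [row for row in rows if row["name"] in names]


for row in cb_rows_for_both(cb_rows, both_names):
    print(row["name"], row["chromosome"], row["cm"])
```

Using a set for the names does the same job as the lookup column: a row is kept where the spreadsheet would show the name, and dropped where it would show #N/A.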

Ancestry shared matches and a new connection

This post continues my general theme of looking for strategies to deal with my DNA results - in this case, results from AncestryDNA.

I have 225 pages of matches at Ancestry, which equates to almost 11,250 matches.  I use the DNAGedcom Client app to download the information.  That gives me three files - a list of my matches, a file showing which of the matches are in common with each other (based around fourth cousins and closer only), and details from my matches' trees.  This latter, 'ancestors', file has over 345,000 lines of data in it, which seems a staggering amount to consider dealing with - especially as, unfortunately, most of it is probably not relevant to my connections with my matches, as the majority of them are in the USA and few have traced their connection back to the UK, which is what most of my pedigree information relates to.

Although I do have three Ancestry Hints, which have been helpful, I don't appear in any 'DNA Circles'.  So I've been looking at the "shared matches", to see what clues I can garner from those. Ancestry provides details of my matches that are fourth cousins and closer, and indicates where they share DNA with another of my close matches.   They do also show the more distant matches that are shared matches to the closer cousins - but only by showing the closer match on the more distant match's profile.  Given how many thousands of distant matches I have, I do not check each of their profiles individually to see if they just happen to match a closer cousin.  So the app download makes this feature more useful, by picking up those more distant matches who are in common with the fourth cousins, as well as providing the information in a more convenient (ie spreadsheet) format.

I have 59 matches within the '4th cousins or closer' category and 379 rows in the ICW* file downloaded by the Client app, which, as far as I am aware, includes each individual who connects to one of my '4th cousins or closer' matches.  That's probably not many in comparison to people with colonial US ancestry but I imagine it's about average for those of us in the UK.  And it is enough to do some simple 'network analysis', which I hope might allow me to make more sense of the data.

Let me say here that I don't really know anything about proper network analysis - I think that's complicated computing, with thousands of entries, which produces things like the Genetic Communities.  It involves lots of statistical calculations and terms that I don't even understand the meaning of, let alone know how to use! But most of us are probably capable of using some simple techniques - the basic concept for what I am doing I learnt when studying for a GCSE in psychology, so that's a qualification designed for teenagers. In that course, we were using it to analyse friendship patterns in a class of schoolchildren.  The "sociometric" technique simply consisted of asking each child in a class who their three best friends in the class were.  One then drew a diagram something like the following, where each dot is a person and the arrow shows the direction of the 'choice'.

It occurred to me some years ago that this type of diagram could possibly be used to help analyse genealogical networks and I had hoped to use it in my Parry One-Name Study to try to sort out the potential relationships among the lower gentry of Herefordshire (which contains numerous Parry connections that may, or may not, relate to the same Parry family). I came across a (free!) program* that looked like it would be useful for actually drawing the diagram (although it is easy to do by hand, if there's a lot to draw, a computer obviously does make it easier) but I never managed to get all the pedigrees typed up sufficiently to try it out for my study.  Now, with doing genetic genealogy, it seems to me that the same principle could be used with shared matches.

And so the following diagram shows the connections between my shared matches at Ancestry:

In this image, each red dot represents one of my matches, and the blue lines indicate the other matches that they also match.  I am not using arrows, just lines, as the genetic relationships will be in both directions.
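As a rough illustration of how shared-match pairs could be turned into something a program like Pajek can draw, the sketch below writes a simple undirected network in the plain-text ".net" layout (a "*Vertices" section followed by an "*Edges" section). The pair data is invented, and the layout shown is only the most basic form of the format as I understand it, so it is worth checking against the Pajek manual before relying on it.

```python
def to_pajek_net(pairs):
    """Build the text of a basic undirected Pajek .net file from (name, name) pairs."""
    names = sorted({name for pair in pairs for name in pair})
    # Pajek vertices are numbered from 1.
    index = {name: number for number, name in enumerate(names, start=1)}
    # Store each undirected edge once, lowest vertex number first; skip self-links.
    edges = sorted(
        {tuple(sorted((index[a], index[b]))) for a, b in pairs if a != b}
    )
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{number} "{name}"' for name, number in index.items()]
    lines.append("*Edges")
    lines += [f"{a} {b}" for a, b in edges]
    return "\n".join(lines) + "\n"


# Hypothetical shared-match pairs (each pair means "these two matches also match each other").
shared_pairs = [("Ann", "Bob"), ("Bob", "Ann"), ("Bob", "Cat")]
print(to_pajek_net(shared_pairs))
```

Edges are used rather than directed arcs because, as noted above, the genetic relationships run in both directions.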

As you can see, the matches fall into groups.  Sometimes these are made up of just two or three people who are shared matches with each other.  But there are also some larger groups, one of about 50 connections, and the other with over 150 connections.

It was interesting to see how the data plotted, but how does this help me?

Well, my theory, as you've possibly guessed by now, is that the people in the same group are likely to connect to me (at some level) through the same ancestral line.
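The groups that appear in the diagram are what network people call connected components: sets of matches linked to each other, directly or through a chain of shared matches. A minimal Python sketch of finding them is below. The pair data is hypothetical, and numbering the groups from largest to smallest is just one arbitrary choice, not necessarily how the group numbers in this post were assigned.

```python
from collections import defaultdict, deque


def group_matches(pairs):
    """Number the connected groups of shared matches, largest group first."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first walk to collect everyone reachable from this match.
        queue, members = deque([start]), []
        seen.add(start)
        while queue:
            person = queue.popleft()
            members.append(person)
            for other in sorted(neighbours[person]):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        groups.append(sorted(members))

    groups.sort(key=len, reverse=True)
    return {number: members for number, members in enumerate(groups, start=1)}


# Hypothetical shared-match pairs.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
for number, members in group_matches(pairs).items():
    print(number, members)
```

Being in the same group only suggests a shared ancestral line; it still needs checking against the genealogy.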

So, firstly, I allocated everyone in each group an 'AncestryICW Group Number' (both in the Notes section of my view of their DNA profile on Ancestry and in my spreadsheet) to help me keep track of the Groups.   I also added any information about potential surname connections.  Here's the same diagram, with those numbers added and also some additional symbols based on my family history. (Key in the bottom right corner of image)

As you can see, Group 1 (derived just from the genetic relationships provided by Ancestry) contains two people who share the surname NAYLOR with me. One of these I have discovered the potential connection to, the other currently just has the surname in common with me.

I've also 'starred' one match - over the weekend, I carried out a new download of the shared matches file. There were 32 new rows added since the previous download, which, once charted, increased the size of some of the existing groups and also created a few new ones.  (NB these are not new 'fourth cousins or closer' - these are more distantly related new matches, who just happen to connect to my fourth cousins and closer.  As such, I would not normally have checked them out, among the many new distant matches that keep being added.)

I was just starting to work through them, adding the group numbers to my spreadsheet and checking if the people had trees attached to their account, when I noticed the surname NAYLOR.  Yes, one of the new additional matches in Group 1 also had a NAYLOR in their tree!  It was just one, a NAYLOR female marrying into their SMITH family, with no other information about her except her husband's name, and their child's details.  And the family were in the 'wrong' place in the UK (up in Lancashire, rather than in London) - but obviously I didn't leave it there.

By initially working on the husband of the SMITH child, and then finding him and his wife in the 1939 Register, I was able to obtain her proper birth date (1895, not 1885 as shown on the pedigree). That correction meant that I could then find her in the 1901 and 1911 censuses with her parents - her mother being the NAYLOR by birth. Those censuses gave me sufficient information to get back to the previous generation - who traced back to London and the entries I believe relate to my family in 1841!

All of this still needs confirming properly, especially the early censuses for the family, which I had found some months ago when identifying the other NAYLOR connection, who is in Australia.

But it all looks very promising that my new match and I are fourth cousins through the NAYLOR line.

So, just the process of simply grouping my shared matches, on the basis of who they are in common with, has been sufficient for me to spot a connection that I may not have seen otherwise, since the new match was identified by Ancestry as a more distant 5th-8th cousin, sharing just 9.7cM across 1 DNA segment. Although I understand that there may be other reasons for shared DNA of that quantity, unless I can find other evidence to contradict it, the simplest explanation, that the three matches in Group 1 who all share the NAYLOR surname with me obtained it from a common NAYLOR ancestry, does seem to be logical.

Network analysis program used for drawing chart: Pajek (http://mrvar.fdv.uni-lj.si/pajek/ )  [One day, I hope to learn to use the program properly, as I am sure it could potentially display the DNA information more effectively, taking account of features such as the closeness of relationships etc]

ICW - stands for "in common with" - the term often used for matches who also match someone else you match.

Friday, 4 August 2017

Autosomal DNA Discussions - and some statistics for my kits

There have been some interesting discussions on the mailing lists recently*, which have caused me to look at some statistics for the kits I manage.  On the one hand, there were the, seemingly straightforward, questions concerning the best strategy for dealing with autosomal DNA results, and how to manage the ever increasing influx of new results.  Answers to these questions tend to include the importance of sharing multiple segments and of limiting the minimum length of the segments worked with, as well as focusing on names and locations relevant to one’s own family history.

But, on the other hand, the ongoing debate, predominantly between two people who I regard as genetic genealogy experts, Debbie Kennett and Tim Janzen, shows that things can be far from “straightforward” when dealing with DNA.  Alongside issues of terminology (what do we actually mean when we say “identical by state”, or “identical by descent” etc.), and how far back shared ancestry might be for particular levels of shared DNA (even up to 10 or 20 generations), such discussions often revolve around the problem of “triangulating groups”* (TGs) – what causes them, how relevant they are (or aren't), and the factors that affect them (such as segment size, phasing, haplotype frequency, and the population that’s involved).  

Fundamentally, the problem seems to be that scientific modelling suggests TGs shouldn’t exist, as it’s thought to be “mathematically impossible for so many people to share the same segment by virtue of sharing a single ancestral couple.”* But many people's results seem to indicate that they do exist – so why?

I don’t have the answer to that question, obviously, and I've written before about the two differing theories (at http://notjusttheparrys.blogspot.co.uk/2016/11/dna-update.html).  But two comments in particular struck me, as I realised that I hadn't specifically examined my kits with these issues in mind.  First was Tim’s comment that half identical regions (ie matching segments) that are at least 15 cMs in length and contain at least 2000 SNPs will almost always be "identical by descent" (IBD) and, secondly, Debbie’s comment that, in her experience with UK matches, the only segments that fall into triangulated groups are small segments under 15 cMs, and that we would be better off focusing our attention on matches that share over 15 cMs.

Debbie and I have discussed the numbers of TGs we have before, so I know my results show a few more than hers do, but this has prompted me to take a detailed look at my kits, to see the effect of applying such thresholds.

I began with FTDNA, where I have access to seven UK kits.  The following graph shows the numbers of matches I have with particular “longest segment” lengths, annotated for any known relatives:

These graphs show a group of four siblings and the numbers of matches they each have with particular “longest segment” lengths, annotated for any known relatives:

And finally in this section, graphs for the three other kits I have access to:

The following table summarises how many matches each of the above kits would have to work with, if either a 15cM or a 20cM threshold was applied:

So applying such thresholds would certainly reduce the number of matches regarded as 'relevant' and make working with the results more manageable.

But would we be missing useful information, as demonstrated by the 4c1r matching kit N?
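Counting how many matches survive each threshold is simple enough to script. The sketch below assumes a plain list of "longest segment" values in cM, one per match; the numbers are invented and are not taken from any of the kits discussed here.

```python
def count_at_or_above(longest_segments_cm, thresholds=(15.0, 20.0)):
    """Count the matches whose longest segment meets each threshold (in cM)."""
    return {
        threshold: sum(1 for cm in longest_segments_cm if cm >= threshold)
        for threshold in thresholds
    }


# Hypothetical longest-segment lengths for one kit's matches.
longest_segments = [7.5, 12.0, 15.0, 19.9, 22.3, 41.0]
print(count_at_or_above(longest_segments))
```

Running the same function over each kit's list would give one row of a summary table like the one above.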

I had hoped to produce similar graphs for the two kits I manage at 23andMe but, as the download file doesn't include details of the "longest segment" for matches sharing multiple segments, the following graphs include all segments for those matches my mother and I are sharing with (or who are "Open Sharing").  (I have removed the parent/child segment data to avoid an 'extended tail' in the graphs.)

The "curves" of the 23andMe graphs are much more irregular than those of the FTDNA kits, which could be a feature of the differences in the nature of sharing between the two companies.

But, once again, it is clear that applying thresholds of 15cM or 20cM would dramatically reduce the number of segments left to work with.

As a slight sidetrack, in view of another question on a mailing list, concerning the numbers of matches that don't match parents, I just thought I'd add in a graph to show the numbers of N's segments that are from matches identified as also matching N's mother.

As you can see, the number of non-maternal matches is generally greater than the number of maternal, which possibly indicates that there is some level of false positives in the results.  However, it could also just be a sign that N's paternal side of her family has more matches in the databases - something which is supported by the higher numbers of matches N's paternal relatives have at FTDNA in comparison to N and her mother.   But the important issue is that the segment lengths for the matches identified as maternal do go all the way down to 5cM.  It seems to me therefore, that it would not be easy to distinguish which segments may be false positives (ie people identified as matches who are not genuine matches), based just on maternal/paternal matching.  With such short segment lengths, it is possible that the parents' results are showing false negatives (ie genuine matches not identified as matches in the parent for some reason.)

Back to the original questions.  Of course, segment length isn't the only consideration - Tim's criteria included the numbers of SNPs as well.  The following scattergram shows numbers of SNPs per segment length, with the shaded area being those who would meet the criteria of being "at least 15 cMs in length and containing at least 2000 SNPs".

There are 165 segments (out of 1920) that would meet Tim's criteria for segments that will almost always be IBD (and 32 segments, if the segment length used was 20cM).
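Applying both parts of the criteria together is a simple filter on each segment. The sketch below assumes each segment is a small record with "cm" and "snps" fields; those field names and the sample values are illustrative only and do not come from a real 23andMe or FTDNA file.

```python
def meets_criteria(segment, min_cm=15.0, min_snps=2000):
    """True if the segment is at least min_cm long and has at least min_snps SNPs."""
    return segment["cm"] >= min_cm and segment["snps"] >= min_snps


# Hypothetical segments.
segments = [
    {"name": "Match A", "cm": 27.0, "snps": 1800},  # long, but too few SNPs
    {"name": "Match B", "cm": 16.2, "snps": 2400},
    {"name": "Match C", "cm": 9.8, "snps": 2700},   # enough SNPs, but too short
    {"name": "Match D", "cm": 21.5, "snps": 3100},
]

at_15 = [s["name"] for s in segments if meets_criteria(s)]
at_20 = [s["name"] for s in segments if meets_criteria(s, min_cm=20.0)]
print(len(at_15), "of", len(segments), "segments meet the 15cM / 2000 SNP criteria")
print(len(at_20), "of", len(segments), "segments meet the 20cM / 2000 SNP criteria")
```

Keeping the two thresholds as parameters makes it easy to rerun the count at 20cM, or at any other level.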

The 23andMe graphs for N show an unexpected peak at 27cM, which the scattergram indicates is made up of some segments with fewer than 2000 SNPs.  Closer analysis shows these are predominantly at the start of chromosome 15 and, using the ADSA tool*, it can be seen that all but one of the segments fully triangulate as a non-maternal TG.

Is it a genuine segment (ie descended to all the matches from a shared ancestor)?  The low SNP count might imply not, but the apparent phasing and the fact that it is at the start of the chromosome (where recombination is perhaps less likely), as well as it being over 15cM, may be factors in favour of it being so.

But the honest truth is, I currently don't know - with factors both for and against it, I often think the only way to tell if a segment is genealogically relevant is if one finds a genealogical connection!

So, what about any other triangulating groups I might have?  I started by using the more restrictive thresholds of 20cM and 2000 SNPs.  At these levels, my FTDNA kit showed one TG:

However, three of the matches are clearly related to each other, so the TG actually only consists of three separate ancestral lines (theirs, mine and the fourth match's).  When I added my close relatives in, all four of these matches show as paternal matches.  Reducing the threshold to 15cM (but maintaining the SNP threshold at 2000) picks up another member of the one family, and reducing it to 10cM picks up one other match (10cM, 2700 SNPs), who triangulates with all of the others.
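The first step in looking for TGs at a given threshold, finding matches whose qualifying segments overlap on the same chromosome, can be sketched in Python as below. This is only a candidate finder under stated assumptions: the field names and sample data are made up, overlap is judged purely on start and end positions, and it does not check that the matches also match each other or sit on the same side of the family, both of which are needed before calling something a triangulated group.

```python
from collections import defaultdict


def overlap_candidates(segments, min_cm=15.0, min_snps=2000, min_matches=2):
    """Return (chromosome, match names) for runs of qualifying segments that share a common region."""
    by_chromosome = defaultdict(list)
    for seg in segments:
        if seg["cm"] >= min_cm and seg["snps"] >= min_snps:
            by_chromosome[seg["chromosome"]].append(seg)

    candidates = []
    for chromosome, segs in sorted(by_chromosome.items()):
        segs.sort(key=lambda s: s["start"])
        group, common_end = [], None
        for seg in segs + [None]:  # the trailing None flushes the final group
            if seg is not None and group and seg["start"] < common_end:
                # Still inside the region shared by everyone in the group so far.
                group.append(seg)
                common_end = min(common_end, seg["end"])
                continue
            names = sorted({s["name"] for s in group})
            if len(names) >= min_matches:
                candidates.append((chromosome, names))
            if seg is not None:
                group, common_end = [seg], seg["end"]
    return candidates


# Hypothetical segments shared with the tester.
segments = [
    {"name": "M1", "chromosome": "4", "start": 100, "end": 500, "cm": 21.0, "snps": 2500},
    {"name": "M2", "chromosome": "4", "start": 200, "end": 600, "cm": 22.0, "snps": 3000},
    {"name": "M3", "chromosome": "4", "start": 700, "end": 900, "cm": 25.0, "snps": 2600},
    {"name": "M4", "chromosome": "4", "start": 150, "end": 450, "cm": 9.0, "snps": 2700},
    {"name": "M5", "chromosome": "11", "start": 100, "end": 400, "cm": 30.0, "snps": 4000},
]
print(overlap_candidates(segments, min_cm=20.0))
```

Lowering min_cm in the call brings shorter segments such as M4's into consideration, which mirrors the way reducing the threshold picks up additional matches.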

On two other chromosomes, at 20cM, there is a match who shows as matching my 1c1r so, whilst not creating a TG, these do give me hints as to the relevant ancestral lines there.

In addition to the above TG, reducing the threshold to 15cM produces TGs on ten other chromosomes with my FTDNA kit.  These can be identified as either paternal or maternal based on matching to relatives (who aren't shown, in order to keep the diagrams easy to read):

I do think some of these TGs look "too perfect" - for example, see chromosome 8, where twelve people show identical figures.

Decreasing the thresholds below 15cM increases the numbers of matches in these TGs, as well as producing more TGs, but many look too regular, given the random nature of DNA transmission. The use of matching to close relatives to 'phase' the segments should indicate a genuine matching sequence on one chromosome out of a pair (rather than a "match" being created from criss-crossing between SNPs on the two chromosomes in a pair).  But I do have a nagging suspicion that something may not be right, when all of the matches over any particular segment seem to be on just one chromosome, rather than there being overlapping maternal and paternal TGs - although, occasionally, that pattern of two overlapping TGs can be found, as in this example from chromosome 4:

Moving on to my 23andMe kit, at 20cM and 2000 SNPs that shows two TGs:

Chromosome 4 (maternal TG)

And on the X chromosome, a paternal TG:

I did think there was a third TG, on chromosome 11:

But, on checking the profiles, I discovered the two matches are identical twins, so that means there's just two ancestral lines involved (mine and theirs), and so this doesn't make a TG.

It is also a timely reminder that DNA results should always be analysed in conjunction with the genealogy!

Rerunning the 23andMe data using thresholds of 15cM and 2000 SNPs produces TGs on an additional eleven chromosomes.  This time, I have included details of how my mother matches, since she's the only close relative at 23andMe, so it doesn't complicate the images too much and does make the phasing more obvious.

So, in my results, I do have some TGs above 15cM and 2000 SNPs.  But I am not convinced that they are all valid, based on what they look like in comparison to what I understand about the random nature of DNA transmission.  I do need to work through the groups, to see if there are any obvious explanations for the anomalies and "overly perfect" matching (as in the case of the identical twins above).   There are probably some other investigations I could do with the data, for example, checking for runs of homozygosity (sequences of identical SNPs on both chromosomes), which might be affecting matching.

However, I don't think there's much that I, as an individual test taker, can do to find out about how issues such as endogamy, haplotype frequencies, and population segments (which are some of the possible reasons given for why TGs may not be valid), might affect the validity of the TGs appearing in my results.

But trying to test the validity of the comments in the discussions wasn't the point of this post.  My aim was purely to examine my results in the light of those comments, to see what doing so showed, and I feel that carrying out this analysis has been very useful.  It has been helpful to look at ways to make the numbers of matches more manageable and to think about what information I might lose by doing so.   Focusing on these aspects of my results has also caused me to notice things about the data that I had previously missed.  I'm sure the results will also be helpful as I continue to work on the visual phasing and looking at how the segments shared with matches correlate with what might be predicted from that. There are clearly other aspects of my results that I also need to consider, such as matches sharing multiple segments and the company predictions about relationship levels, which I haven't taken account of here.

But, hopefully, all of this combined will enable me to work out some more effective strategies for dealing with my results - which, of course, must include one of the main things I have been reminded of during this process, which is the importance of checking out the genealogy of my matches!

*Discussion References
Corrinne Curtis, Re: [G] Family Finder Kit  (http://archiver.rootsweb.ancestry.com/th/read/GOONS/2017-07/1500365965)

Three discussions on the ISOGG lists (which are not public so I won't post the links - membership of ISOGG is free though, so please join - see https://isogg.org/ ) The discussions are in the threads [ISOGG] Autosomal Survey, [DNA-NEWBIE] Spreadsheets and new matches, and [DNA-NEWBIE] Re: Single Large Segments

Ian Logan, [DNA] Falsely positive matches of Autosomal results (http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2017-08/1501660749)

About Triangulation - https://isogg.org/wiki/Triangulation

ADSA tool - https://dnagedcom.com/adsa/index.php