Wednesday, 9 August 2017

Shared matches - matches who match both my paternal and maternal lines

This is just a quick post, to show the information I looked at, in order to reply to a question Debbie Kennett asked on the ISOGG DNA-NEWBIE mailing list.  The question was "how many people have double matches in their tree, ie, where a person has a match with both your mother and your father."

Now, I don't have my father tested - he passed away in 2001.  However, I do have all four of his siblings tested, as well as a paternal first cousin of theirs.  So, whilst it's not quite the same, as I know there are still some areas of my chromosomes where none of my Dad's relatives match me, the data should give me a reasonable indication of the overlap between my matches and the two different sides of my family (which, as far as I am aware, are not related to each other).

I had actually noticed this 'matching to both sides' some time ago, when I first started playing about with my FTDNA "in common with" (ICW) data and the Pajek program (which I mentioned in my previous post) just to see what the program did.  I realised then that eleven of my matches seemed to match both sides of my family.  Yesterday, I decided to check the current situation in order to answer Debbie's question.

To do this, I used the DNAGedcom Client app to download my ICW file from FTDNA.  I then extracted all the matches who are in common between me and my six relatives (my mother, Dad's four siblings, and their paternal 1c).  I then used the Pajek program to display the information.  This first image was produced using the options "Energy: Kamada-Kawai: separate components"

The program can display the names associated with each point, but I have obviously removed those for privacy reasons.  It is quite clear that there are two main clusters, with sixteen matches spanning the two groups. I spread those sixteen out manually, to make them more obvious, but it's not very easy to see what is happening within the two groups, so next I tried the options "Energy: Fruchterman Reingold: 3D".  Again, I've straightened out the sixteen matches in the middle and this time allocated reference numbers to them, as well as to my relatives:

In this image, as well as the sixteen matches who link to both the paternal and maternal sides of my family, the clusters of matches for each of my father's relatives are more distinct.
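For anyone wanting to try the same kind of chart, Pajek reads a plain-text network file: a `*Vertices` section listing numbered labels, followed by an `*Edges` section listing pairs of vertex numbers. The following is only a minimal sketch in Python of how a list of "in common with" pairs could be turned into that format. The pair list and the relative labels are made up for illustration, and it does not read the DNAGedcom download directly, as the column layout of that file is not assumed here.

```python
def to_pajek_net(pairs):
    """Return Pajek .net text for an undirected network of (name_a, name_b) pairs."""
    # Number the vertices from 1, in order of first appearance.
    index = {}
    for a, b in pairs:
        for name in (a, b):
            if name not in index:
                index[name] = len(index) + 1

    lines = [f"*Vertices {len(index)}"]
    lines += [f'{i} "{name}"' for name, i in index.items()]
    lines.append("*Edges")  # *Edges are undirected lines; *Arcs would be directed
    seen = set()
    for a, b in pairs:
        edge = tuple(sorted((index[a], index[b])))
        if a != b and edge not in seen:  # skip self-links and repeated links
            seen.add(edge)
            lines.append(f"{edge[0]} {edge[1]}")
    return "\n".join(lines) + "\n"


# Hypothetical ICW pairs: (relative, match in common with me and that relative)
icw_pairs = [
    ("Mother", "M1"),
    ("Sibling 1", "M1"),
    ("Sibling 1", "M2"),
    ("Paternal 1c", "M2"),
    ("M1", "Mother"),  # the same link listed in the other direction
]

net_text = to_pajek_net(icw_pairs)
print(net_text)
```

The printed text can be saved with a `.net` extension and opened in Pajek. Names would need removing or replacing with reference numbers before sharing any image, as was done for the charts here.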

The same information can also be discovered by using a spreadsheet containing all six of the ICW files combined together and creating a pivot table with match names down the rows and my relatives as the column headings, the table then showing a count of the match names.  By filtering on all those who match my mother, and gradually working through all those who match one or more of my paternal relatives, the full list of people matching both sides of my family can be obtained.

Doing this in a pivot table has the additional benefit that, once the list of people who match both sides is completed, it can be used to pick out the same people from the chromosome browser (CB) file*, so that the actual nature of the matching segments can be examined.
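The pivot table approach can also be sketched in a few lines of Python, for those more comfortable with a script than a spreadsheet. This is an illustration only: the rows, the relative labels ("Sibling 1" and so on) and the match names are all hypothetical, standing in for the combined ICW files.

```python
from collections import defaultdict

# Hypothetical combined ICW rows: (relative's kit, name of the shared match)
rows = [
    ("Mother", "Match A"),
    ("Mother", "Match B"),
    ("Sibling 1", "Match B"),
    ("Sibling 3", "Match B"),
    ("Sibling 2", "Match C"),
    ("Paternal 1c", "Match C"),
]

maternal = {"Mother"}
paternal = {"Sibling 1", "Sibling 2", "Sibling 3", "Sibling 4", "Paternal 1c"}

# The "pivot": for each match, the set of relatives they are in common with.
pivot = defaultdict(set)
for relative, match in rows:
    pivot[match].add(relative)

# A match is on the "Both" list if they appear against at least one
# maternal relative and at least one paternal relative.
both_sides = sorted(
    match
    for match, relatives in pivot.items()
    if relatives & maternal and relatives & paternal
)
print(both_sides)
```

With this sample data only "Match B" is in common with both my mother and a paternal relative, so that is the only name on the resulting list.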

I've allocated matches to the maternal or paternal sides of my family on the basis of who shares the same segment as the match does to me.  However, in the cases where I share two segments with a match (M11 and M14) the segments are each shared by different sides of my family, so that it appears I connect to those matches through both the paternal and the maternal sides of my family:

It will be interesting to see if those matches turn out to be genuine!

Pajek Quick Reference sheet

Picking out the CB data for the "Both" people
There are probably several ways of doing this but, as I am sure there are other people in the same situation as me, having to learn it as I go along (and relearn it every time I want to do something similar!), these are the details of what I did:
Having cut and pasted the list of people identified as matching both sides into the first column of a new spreadsheet in the CB file, I pasted the following formula into an empty cell alongside the first match in the CB spreadsheet (replacing the text in square brackets with the appropriate information):  =VLOOKUP([the cell reference of the Full name column for the first match in the CB spreadsheet],'[the name of the new spreadsheet containing the list of people matching both sides]'!A:A,1,FALSE)  Here, A is the column in the new spreadsheet containing the list of names who match both sides, so make sure that list is pasted into the first column, labelled A.  I then used 'Fill down' to copy the formula to all the cells in that column of the CB spreadsheet.  The cells then show either #N/A, if that match is not on the "Both" list, or the name of the match, if they are on the "Both" list.  I then used the filter function to show just the rows containing a match name and copied all the CB data for those matches into a new spreadsheet, which I used to create the table where I have allocated the matches to maternal and paternal sides.
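The VLOOKUP-and-filter steps amount to keeping only those CB rows whose name appears on the "Both" list. A rough Python equivalent is sketched below. The column names other than "Full name", and all of the values, are invented for the example rather than taken from the real chromosome browser download.

```python
# Names identified as matching both sides (hypothetical).
both_list = {"Match B", "Match D"}

# A few hypothetical rows from the chromosome browser (CB) file.
cb_rows = [
    {"Full name": "Match A", "Chromosome": "1", "cM": 8.2},
    {"Full name": "Match B", "Chromosome": "4", "cM": 12.6},
    {"Full name": "Match B", "Chromosome": "9", "cM": 7.4},
    {"Full name": "Match C", "Chromosome": "2", "cM": 10.1},
    {"Full name": "Match D", "Chromosome": "15", "cM": 9.3},
]

# Equivalent of the VLOOKUP plus filter: keep rows whose name is on the list.
both_cb = [row for row in cb_rows if row["Full name"] in both_list]

for row in both_cb:
    print(row["Full name"], row["Chromosome"], row["cM"])
```

As with the spreadsheet version, a match sharing two segments appears on two rows, which is what allows each segment to be allocated separately to the maternal or paternal side.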

Ancestry shared matches and a new connection

This post continues my general theme of looking for strategies to deal with my DNA results - in this case, results from AncestryDNA.

I have 225 pages of matches at Ancestry, which equates to almost 11,250 matches.  I use the DNAGedcom Client app to download the information.  That gives me three files - a list of my matches, a file showing which of the matches are in common with each other (based around fourth cousins and closer only), and details from my matches' trees.  This latter, 'ancestors', file has over 345,000 lines of data in it, which seems a staggering amount to consider dealing with - especially as, unfortunately, most of it is probably not relevant to my connections with my matches, as the majority of them are in the USA and few have traced their connection back to the UK, which is where most of my pedigree information relates to.

Although I do have three Ancestry Hints, which have been helpful, I don't appear in any 'DNA Circles'.  So I've been looking at the "shared matches", to see what clues I can garner from those. Ancestry provides details of my matches that are fourth cousins and closer, and indicates where they share DNA with another of my close matches.   They do also show the more distant matches that are shared matches to the closer cousins - but only by showing the closer match on the more distant match's profile.  Given how many thousands of distant matches I have, I do not check each of their profiles individually to see if they just happen to match a closer cousin.  So the app download makes this feature more useful, by picking up those more distant matches who are in common with the fourth cousins, as well as providing the information in a more convenient (ie spreadsheet) format.

I have 59 matches within the '4th cousins or closer' category and 379 rows in the ICW* file downloaded by the Client app, which, as far as I am aware, includes each individual who connects to one of my '4th cousins or closer' matches.  That's probably not many in comparison to people with colonial US ancestry but I imagine it's about average for those of us in the UK.  And it is enough to do some simple 'network analysis', which I hope might allow me to make more sense of the data.

Let me say here that I don't really know anything about proper network analysis - I think that's complicated computing, with thousands of entries, which produces things like the Genetic Communities.  It involves lots of statistical calculations and terms that I don't even understand the meaning of, let alone know how to use! But most of us are probably capable of using some simple techniques - the basic concept for what I am doing I learnt when studying for a GCSE in psychology, so that's a qualification designed for teenagers. In that course, we were using it to analyse friendship patterns in a class of schoolchildren.  The "sociometric" technique simply consisted of asking each child in a class who their three best friends in the class were.  One then drew a diagram something like the following, where each dot is a person and the arrow shows the direction of the 'choice'.

It occurred to me some years ago that this type of diagram could possibly be used to help analyse genealogical networks and I had hoped to use it in my Parry One-Name Study to try to sort out the potential relationships among the lower gentry of Herefordshire (which contains numerous Parry connections that may, or may not, relate to the same Parry family). I came across a (free!) program* that looked like it would be useful for actually drawing the diagram (although it is easy to do by hand, if there's a lot to draw, a computer obviously does make it easier) but I never managed to get all the pedigrees typed up sufficiently to try it out for my study.  Now, with doing genetic genealogy, it seems to me that the same principle could be used with shared matches.

And so the following diagram shows the connections between my shared matches at Ancestry:

In this image, each red dot represents one of my matches, and the blue lines indicate the other matches that they also match.  I am not using arrows, just lines, as the genetic relationships will be in both directions.

As you can see, the matches fall into groups.  Sometimes these are made up of just two or three people who are shared matches with each other.  But there are also some larger groups, one of about 50 connections, and the other with over 150 connections.

It was interesting to see how the data plotted, but how does this help me?

Well, my theory, as you've possibly guessed by now, is that the people in the same group are likely to connect to me (at some level) through the same ancestral line.
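The groups visible in the chart are what network analysis calls "connected components": sets of matches linked to each other, directly or through a chain of other shared matches. As a sketch only (this is not how Pajek works internally, and the pairs below are invented), the grouping and numbering could be done in Python like this:

```python
from collections import defaultdict


def group_matches(pairs):
    """Group names into connected components; return {group number: set of names}."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    groups = []
    seen = set()
    for start in neighbours:
        if start in seen:
            continue
        # Walk outwards from this person, collecting everyone reachable.
        stack = [start]
        seen.add(start)
        group = set()
        while stack:
            person = stack.pop()
            group.add(person)
            for other in neighbours[person]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(group)

    # Number the groups from 1, largest first.
    groups.sort(key=len, reverse=True)
    return {number: group for number, group in enumerate(groups, start=1)}


# Hypothetical shared-match pairs
shared = [("A", "B"), ("B", "C"), ("D", "E"), ("C", "A")]
icw_groups = group_matches(shared)
for number, members in icw_groups.items():
    print(number, sorted(members))
```

Note that a single shared match is enough to join two otherwise separate clusters into one group, so a large group may still contain more than one ancestral line.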

So, firstly, I allocated everyone in each group an 'AncestryICW Group Number' (both in the Notes section of my view of their DNA profile on Ancestry and in my spreadsheet) to help me keep track of the Groups.   I also added any information about potential surname connections.  Here's the same diagram, with those numbers added and also some additional symbols based on my family history. (Key in the bottom right corner of image)

As you can see, Group 1 (derived just from the genetic relationships provided by Ancestry) contains two people who share the surname NAYLOR with me. For one of these I have discovered the potential connection; the other currently just has the surname in common with me.

I've also 'starred' one match - over the weekend, I carried out a new download of the shared matches file. There were 32 new rows added since the previous download, which, once charted, increased the size of some of the existing groups and also created a few new ones.  (NB these are not new 'fourth cousins or closer' - these are more distantly related new matches, who just happen to connect to my fourth cousins and closer.  As such, I would not normally have checked them out, among the many new distant matches that keep being added.)

I was just starting to work through them, adding the group numbers to my spreadsheet and checking if the people had trees attached to their account, when I noticed the surname NAYLOR.  Yes, one of the new additional matches in Group 1 also had a NAYLOR in their tree!  It was just one, a NAYLOR female marrying into their SMITH family, with no other information about her except her husband's name, and their child's details.  And the family were in the 'wrong' place in the UK (up in Lancashire, rather than in London) - but obviously I didn't leave it there.

By initially working on the husband of the SMITH child, and then finding him and his wife in the 1939 Register, I was able to obtain her proper birth date (1895, not 1885 as shown on the pedigree). That correction meant that I could then find her in the 1901 and 1911 censuses with her parents - her mother being the NAYLOR by birth. Those censuses gave me sufficient information to get back to the previous generation - who traced back to London and the entries I believe relate to my family in 1841!

All of this still needs confirming properly, especially the early censuses for the family, which I had found some months ago when identifying the other NAYLOR connection, who is in Australia.

But it all looks very promising that my new match and I are fourth cousins through the NAYLOR line.

So, just the process of simply grouping my shared matches, on the basis of who they are in common with, has been sufficient for me to spot a connection that I may not have seen otherwise, since the new match was identified by Ancestry as a more distant 5th-8th cousin, sharing just 9.7cM across 1 DNA segment. Although I understand that there may be other reasons for shared DNA of that quantity, unless I can find other evidence to contradict it, the simplest explanation, that the three matches in Group 1 who all share the NAYLOR surname with me obtained it from a common NAYLOR ancestry, does seem to be logical.

Network analysis program used for drawing chart: Pajek (http://mrvar.fdv.uni-lj.si/pajek/ )  [One day, I hope to learn to use the program properly, as I am sure it could potentially display the DNA information more effectively, taking account of features such as the closeness of relationships etc]

ICW - stands for "in common with" - the term often used for matches who also match someone else you match.

Friday, 4 August 2017

Autosomal DNA Discussions - and some statistics for my kits

There have been some interesting discussions on the mailing lists recently*, which have caused me to look at some statistics for the kits I manage.  On the one hand, there were the, seemingly straightforward, questions concerning the best strategy for dealing with autosomal DNA results, and how to manage the ever increasing influx of new results.  Answers to these questions tend to include the importance of sharing multiple segments and of limiting the minimum length of the segments worked with, as well as focusing on names and locations relevant to one’s own family history.

But, on the other hand, the ongoing debate, predominantly between two people who I regard as genetic genealogy experts, Debbie Kennett and Tim Janzen, shows that things can be far from “straightforward” when dealing with DNA.  Alongside issues of terminology (what do we actually mean when we say “identical by state”, or “identical by descent” etc.), and how far back shared ancestry might be for particular levels of shared DNA (even up to 10 or 20 generations), such discussions often revolve around the problem of “triangulating groups”* (TGs) – what causes them, how relevant they are (or aren't), and the factors that affect them (such as segment size, phasing, haplotype frequency, and the population that’s involved).  

Fundamentally, the problem seems to be that scientific modelling suggests TGs shouldn’t exist, as it’s thought to be “mathematically impossible for so many people to share the same segment by virtue of sharing a single ancestral couple.”* But many people's results seem to indicate that they do exist – so why?

I don’t have the answer to that question, obviously, and I've written before about the two differing theories (at http://notjusttheparrys.blogspot.co.uk/2016/11/dna-update.html).  But two comments in particular struck me, as I realised that I hadn't specifically examined my kits with these issues in mind.  First was Tim’s comment that half identical regions (ie matching segments) that are at least 15 cMs in length and contain at least 2000 SNPs will almost always be "identical by descent" (IBD) and, secondly, Debbie’s comment that, in her experience with UK matches, the only segments that fall into triangulated groups are small segments under 15 cMs, and that we would be better off focusing our attention on matches that share over 15 cMs.

Debbie and I have discussed the numbers of TGs we have before, so I know my results show a few more than hers do, but this has prompted me to take a detailed look at my kits, to see the effect of applying such thresholds.

I began with FTDNA, where I have access to seven UK kits.  The following graph shows the numbers of matches I have with particular “longest segment” lengths, annotated for any known relatives:

These graphs show a group of four siblings and the numbers of matches they each have with particular “longest segment” lengths, annotated for any known relatives:

And finally in this section, graphs for the three other kits I have access to:

The following table summarises how many matches each of the above kits would have to work with, if either a 15cM or a 20cM threshold was applied:

So applying such thresholds would certainly reduce the number of matches regarded as 'relevant' and make working with the results more manageable.
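The counting behind such a summary is simple enough to sketch in Python. The "longest segment" values below are invented for illustration and are not taken from any of my kits.

```python
# Hypothetical "longest segment" lengths (in cM), one value per match.
longest_segments = [7.9, 9.7, 11.2, 14.8, 15.0, 18.3, 22.5, 41.0]


def count_at_or_above(values, threshold):
    """Count how many matches have a longest segment of at least `threshold` cM."""
    return sum(1 for value in values if value >= threshold)


summary = {threshold: count_at_or_above(longest_segments, threshold)
           for threshold in (15, 20)}

print("All matches:", len(longest_segments))
for threshold, count in summary.items():
    print(f"At least {threshold} cM:", count)
```

In this made-up list, eight matches drop to four at 15cM and to two at 20cM, which gives a feel for how quickly the numbers fall.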

But would we be missing useful information, as demonstrated by the 4c1r matching kit N?

I had hoped to produce similar graphs for the two kits I manage at 23andMe but, as the download file doesn't include details of the "longest segment" for matches sharing multiple segments, the following graphs include all segments for those matches my mother and I are sharing with (or who are "Open Sharing").  (I have removed the parent/child segment data to avoid an 'extended tail' in the graphs.)

The "curves" of the 23andMe graphs are much more irregular than the FTDNA kits, which could be a feature of the differences in the nature of sharing between the two companies.

But, once again, it is clear that applying thresholds of 15cM or 20cM would dramatically reduce the number of segments left to work with.

As a slight sidetrack, in view of another question on a mailing list, concerning the numbers of matches that don't match parents, I just thought I'd add in a graph to show the numbers of N's segments that are from matches identified as also matching N's mother.

As you can see, the number of non-maternal matches is generally greater than the number of maternal, which possibly indicates that there is some level of false positives in the results.  However, it could also just be a sign that N's paternal side of her family has more matches in the databases - something which is supported by the higher numbers of matches N's paternal relatives have at FTDNA in comparison to N and her mother.   But the important issue is that the segment lengths for the matches identified as maternal do go all the way down to 5cM.  It seems to me therefore, that it would not be easy to distinguish which segments may be false positives (ie people identified as matches who are not genuine matches), based just on maternal/paternal matching.  With such short segment lengths, it is possible that the parents' results are showing false negatives (ie genuine matches not identified as matches in the parent for some reason.)

Back to the original questions.  Of course, segment length isn't the only consideration - Tim's criteria included the numbers of SNPs as well.  The following scattergram shows numbers of SNPs per segment length, with the shaded area being those who would meet the criteria of being "at least 15 cMs in length and containing at least 2000 SNPs".

There are 165 segments (out of 1920) that would meet Tim's criteria to be genuine 100% of the time (and 32 segments, if the segment length used was 20cM.)
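Applying both parts of the criteria together is just a filter on two columns. A minimal Python sketch, using a handful of invented segments rather than the real download, might look like this:

```python
# Hypothetical segments: (length in cM, number of SNPs)
segments = [
    (9.7, 1500),
    (15.2, 2400),
    (16.0, 1800),   # long enough, but too few SNPs
    (21.3, 3100),
    (27.0, 1900),   # long, but still under the SNP threshold
    (12.4, 2600),   # enough SNPs, but too short
]


def meets_criteria(segment, min_cm=15.0, min_snps=2000):
    """True if the segment is at least min_cm long and has at least min_snps SNPs."""
    cm, snps = segment
    return cm >= min_cm and snps >= min_snps


at_15 = [s for s in segments if meets_criteria(s)]
at_20 = [s for s in segments if meets_criteria(s, min_cm=20.0)]

print(len(at_15), "of", len(segments), "segments meet 15 cM and 2000 SNPs")
print(len(at_20), "of", len(segments), "segments meet 20 cM and 2000 SNPs")
```

The example includes segments that pass one test but fail the other, since those are exactly the ones that fall outside the shaded area of the scattergram.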

The 23andMe graphs for N show an unexpected peak at 27cM, which the scattergram indicates is made up of some segments with fewer than 2000 SNPs.  Closer analysis shows these are predominantly at the start of chromosome 15 and, using the ADSA tool*, it can be seen that all but one of the segments fully triangulate as a non-maternal TG.

Is it a genuine segment (ie descended to all the matches from a shared ancestor)?  The low SNP count might imply not, but the apparent phasing and the fact that it is at the start of the chromosome (where recombination is perhaps less likely), as well as it being over 15cM, may be factors in favour of it being so.

But the honest truth is, I currently don't know - with factors both for and against it, I often think the only way to tell if a segment is genealogically relevant is if one finds a genealogical connection!

So, what about any other triangulating groups I might have?  I started by using the more restrictive thresholds of 20cM and 2000 SNPs.  At these levels, my FTDNA kit showed one TG:

However, three of the matches are clearly related to each other, so the TG actually only consists of three separate ancestral lines (theirs, mine and the fourth match's).  When I added my close relatives in, all four of these matches show as paternal matches.  Reducing the threshold to 15cM (but maintaining SNP threshold at 2000) picks up another member of the one family, and reducing it to 10cM picks up one other match (10cM, 2700 SNPs), who triangulates with all of the others.
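For readers wondering what looking for a TG involves, here is a much simplified Python sketch. It is not the method used by the ADSA tool or by any company. It only flags pairs of matches whose segments with me overlap on the same chromosome, pass the thresholds, and who are also "in common with" each other. ICW data alone does not confirm that the two matches share that same segment with each other, so these are candidates to check, nothing more. All names, positions and figures are invented.

```python
from itertools import combinations


def overlap(seg_a, seg_b):
    """Overlap in base pairs of two segments on the same chromosome, else 0."""
    if seg_a["chrom"] != seg_b["chrom"]:
        return 0
    return max(0, min(seg_a["end"], seg_b["end"]) - max(seg_a["start"], seg_b["start"]))


def candidate_triangulations(segments, icw_pairs, min_cm=20.0, min_snps=2000):
    """Return (chromosome, match, match) for overlapping segments of ICW matches."""
    kept = [s for s in segments if s["cm"] >= min_cm and s["snps"] >= min_snps]
    icw = {frozenset(pair) for pair in icw_pairs}
    found = []
    for a, b in combinations(kept, 2):
        if (a["match"] != b["match"]
                and overlap(a, b) > 0
                and frozenset((a["match"], b["match"])) in icw):
            found.append((a["chrom"], a["match"], b["match"]))
    return found


# Hypothetical segments I share with four matches on chromosome 4
my_segments = [
    {"match": "Match X", "chrom": "4", "start": 10_000_000, "end": 40_000_000, "cm": 25.0, "snps": 5000},
    {"match": "Match Y", "chrom": "4", "start": 20_000_000, "end": 50_000_000, "cm": 22.0, "snps": 4500},
    {"match": "Match Z", "chrom": "4", "start": 60_000_000, "end": 90_000_000, "cm": 21.0, "snps": 4000},
    {"match": "Match W", "chrom": "4", "start": 15_000_000, "end": 30_000_000, "cm": 12.0, "snps": 2500},
]
icw_pairs = [("Match X", "Match Y"), ("Match X", "Match Z")]

candidates = candidate_triangulations(my_segments, icw_pairs)
print(candidates)
```

A script like this cannot tell that several matches belong to one family, so the step of checking how many separate ancestral lines are really involved still has to be done by hand.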

On two other chromosomes, at 20cM, there is a match who shows as matching my 1c1r so, whilst not creating a TG, these do give me hints as to the relevant ancestral lines there.

In addition to the above TG, reducing the threshold to 15cM produces TGs on ten other chromosomes with my FTDNA kit.  These can be identified as either paternal or maternal based on matching to relatives (who aren't shown, in order to keep the diagrams easy to read):

I do think some of these TGs look "too perfect" - for example, see chromosome 8, where twelve people show identical figures.

Decreasing the thresholds below 15cM increases the numbers of matches in these TGs, as well as producing more TGs, but many look too regular, given the random nature of DNA transmission. The use of matching to close relatives to 'phase' the segments should indicate a genuine matching sequence on one chromosome out of a pair (rather than a "match" being created from criss-crossing between SNPs on the two chromosomes in a pair).  But I do have a nagging suspicion that something may not be right, when all of the matches over any particular segment seem to be on just one chromosome, rather than there being overlapping maternal and paternal TGs - although, occasionally, that pattern of two overlapping TGs can be found, as in this example from chromosome 4:

Moving on to my 23andMe kit, at 20cM and 2000 SNPs that shows two TGs:

Chromosome 4 (maternal TG)

And on the X chromosome, a paternal TG:

I did think there was a third TG, on chromosome 11:

But, on checking the profiles, I discovered the two matches are identical twins, so that means there's just two ancestral lines involved (mine and theirs), and so this doesn't make a TG.

It is also a timely reminder that DNA results should always be analysed in conjunction with the genealogy!

Rerunning the 23andMe data using thresholds of 15cM and 2000 SNPs produces TGs on an additional eleven chromosomes.  This time, I have included details of how my mother matches, since she's the only close relative at 23andMe, so it doesn't complicate the images too much and does make the phasing more obvious.

So, in my results, I do have some TGs above 15cM and 2000 SNPs.  But I am not convinced that they are all valid, based on what they look like in comparison to what I understand about the random nature of DNA transmission.  I do need to work through the groups, to see if there are any obvious explanations for the anomalies and "overly perfect" matching (as in the case of the identical twins above).   There are probably some other investigations I could do with the data, for example, checking for runs of homozygosity (sequences of identical SNPs on both chromosomes), which might be affecting matching.

However, I don't think there's much that I, as an individual test taker, can do to find out about how issues such as endogamy, haplotype frequencies, and population segments (which are some of the possible reasons given for why TGs may not be valid), might affect the validity of the TGs appearing in my results.

But trying to test the validity of the comments in the discussions wasn't the point of this post.  My aim was purely to examine my results in the light of those comments, to see what doing so showed, and I feel that carrying out this analysis has been very useful.  It has been helpful to look at ways to make the numbers of matches more manageable and to think about what information I might lose by doing so.   Focusing on these aspects of my results has also caused me to notice things about the data that I had previously missed.  I'm sure the results will also be helpful as I continue to work on the visual phasing and looking at how the segments shared with matches correlate with what might be predicted from that. There are clearly other aspects of my results that I also need to consider, such as matches sharing multiple segments and the company predictions about relationship levels, which I haven't taken account of here.

But, hopefully, all of this combined will enable me to work out some more effective strategies for dealing with my results - which, of course, must include one of the main things I have been reminded of during this process, which is the importance of checking out the genealogy of my matches!

*Discussion References
Corrinne Curtis, Re: [G] Family Finder Kit  (http://archiver.rootsweb.ancestry.com/th/read/GOONS/2017-07/1500365965)

Three discussions on the ISOGG lists (which are not public so I won't post the links - membership of ISOGG is free though, so please join - see https://isogg.org/ ) The discussions are in the threads [ISOGG] Autosomal Survey, [DNA-NEWBIE] Spreadsheets and new matches, and [DNA-NEWBIE] Re: Single Large Segments

Ian Logan, [DNA] Falsely positive matches of Autosomal results (http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2017-08/1501660749)

About Triangulation - https://isogg.org/wiki/Triangulation

ADSA tool - https://dnagedcom.com/adsa/index.php

Tuesday, 11 July 2017

AncestryDNA - Genetic Communities

Back in February, when I wrote about my LivingDNA results, I commented on the upcoming release of AncestryDNA's "Genetic Communities" feature, which I'd heard about through others who could see their communities as part of the beta testing.  Unfortunately, general "busy-ness" got in the way of me posting about my own Genetic Communities, when I received them soon after that.  So this is a 'catch up' post.  I'm not going to cover all the details of how the Genetic Communities work - information about that is already available on the blogs of other genetic genealogists, such as Blaine Bettinger* or Debbie Kennett*, or on the Ancestry site itself. In this post I'm just going to focus on my own results and explore how useful (or otherwise) the information might be.

This is from my AncestryDNA Home Page, showing my general ethnicity and also that I am in three of the genetic communities.

Clicking through to view my "genetic ancestry" gives me the details of which communities I am in, and a map showing both the communities and the estimated general ethnicity areas (I only have traces of 'ancestry' from the "three more regions" so they aren't shown in detail.)

There are over 300 Genetic Communities currently available (Blaine Bettinger has provided a pdf of the full list, from a link on his blog), and it is possible to click down from a continental level, to explore what communities have been identified in different regions of the world, by clicking the "view all" button.  However, I find this a bit inconsistent, and potentially "buggy", when trying to explore the regions where I am in a community.

For example, if I look at the "Scots", which I am not part of, all of the communities show separately in white:

But, when I view a region where I am part of a community, I can only see my own community. For example "The Welsh and English West Midlanders" contains three communities:

But I only seem to get shown the one that I am in, when I try to view these:

This is virtually the same view I get when viewing my own Genetic Community, "English in the West Midlands". 

Based on the list provided by Blaine Bettinger, the "Welsh and English West Midlanders" region also contains the "North Walians" and the "South Walians", but I don't seem able to access the view similar to the one I see for the Scots region, showing all three of the communities in the region - although I can (sometimes) see the whole region, if I access it from the drop down on my own genetic communities view above:

For the other two community regions that I am in, the "English Midlanders and Northerners" and the "Southern English", I seem to be in the overall region but not allocated  to a more specific community within that, but again, the only view I can obtain is the same as my personal view, so I cannot see what the three more refined communities in each of these regions are.

I would be interested in seeing what the three regions my Genetic Communities are in look like to someone who is not in them.

Comparison to LivingDNA
Since LivingDNA is the only other company that provides ethnicity estimates in fine detail within the UK, I thought it might be interesting to compare the results from them to my Ancestry Genetic Community regions.  My LivingDNA results have been updated since I wrote about them at http://notjusttheparrys.blogspot.co.uk/2017/02/a-slight-sidetrack-my-livingdna-results.html so, for now, I am including an image from both versions of LivingDNA to compare to AncestryDNA's Genetic Communities. (I will do a more detailed post about the updated LivingDNA results later.)

The three Genetic Communities I am in on Ancestry cover a large area of England, but do not include any of Scotland and only cover the border area of Wales.  In some ways, the earlier version of the LivingDNA results was a better match to the Genetic Communities, as it included down into Devon and Cornwall, and did not include much of Scotland, whereas the updated results no longer show any Devon or Cornish DNA, and now include Aberdeenshire.  However, we are talking about fairly low percentages for these counties.  Both Ancestry and LivingDNA place my main 'ancestry' as being from the West Midlands/Welsh Border areas - which does tie in with my known family history.

So I do feel that both companies are identifying connections to similar areas within the UK and, as the details continue to be refined, potentially the results will be very useful in furthering my family history.

Debbie Kennett has pointed out that, given the current predominance of Americans in the database, the Genetic Communities can help those of us in the UK to filter our match lists so as to focus on the more relevant matches, ie those who do have an identifiable connection to the same UK areas that we have.  However, although the Genetic Communities are created initially from the DNA analysis, with pedigrees then being used to supply historical information that helps to 'identify' the community, it isn't necessary to have a pedigree in order to be in a community, so finding the connections to matches who are in communities will usually involve further research (and, ultimately, might still be impossible in some cases). 

But the very fact that a pedigree isn't required, in order to appear in a community, does make the Genetic Communities a useful feature for anyone who does not know their family history, as it can help to identify some "times and places" for them to explore potential connections to their matches.

So, as confirming my family history and discovering new relatives are my main aims in using DNA, how useful are the communities for finding the connections between my matches and my own family history, beyond the general benefit of narrowing down my match lists? 

The story views on the Genetic Communities help to provide more detail about the places where my matches' ancestors were from.

And also where they went to:

And the connection page indicates some of the surnames that are more prominent in the particular community, as well as indicating my own strength of connection to the Community:

(I love the background photo, by the way - definitely a place with relevance to my family history!)

As you can see, there is overlap between the three communities that I am in.

Just as I am in several communities, so are many of my matches.  The following diagram illustrates the numbers of my matches in each of the overlapping Community groupings:

(For anyone who does the maths, yes, there is an inconsistency between the images, with 23 matches being listed as in the "English in the West Midlands" community, and only 22 shown in my diagram - that's because another person was added in the four days between extracting the community match lists to produce the diagram and then copying the "Your Connection" image above.  Keeping data up to date is not easy!)

Since the "English in the West Midlands" is a subset of the "Welsh and English in the West Midlands", it does seem strange that two of the matches are in the subset but not in the higher level community (but that's just a minor anomaly that I've noticed, rather than something I'm looking into).
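For anyone wanting to reproduce this sort of count, here is a minimal sketch of how the overlap groupings and the "in the subset but not in the higher level community" anomaly could be checked with simple set operations. The match IDs (m01, m02, ...) are made up purely for illustration, and the function names are my own; the real lists would come from the community match lists on Ancestry.

```python
# Hypothetical match IDs, for illustration only; not real data.
communities = {
    "Welsh and English in the West Midlands": {"m01", "m02", "m03", "m04"},
    "English in the West Midlands": {"m02", "m03", "m05", "m06"},
    "English Midlanders and Northerners": {"m03", "m04", "m07"},
}


def overlap_counts(communities):
    """Count matches in each exact combination of communities (the Venn regions)."""
    regions = {}
    all_matches = set().union(*communities.values())
    for match in all_matches:
        # The set of community names this particular match appears in.
        key = frozenset(
            name for name, members in communities.items() if match in members
        )
        regions[key] = regions.get(key, 0) + 1
    return regions


def outside_parent(subset, parent):
    """Matches listed in the sub-community but missing from the higher level one."""
    return subset - parent


regions = overlap_counts(communities)
anomalies = outside_parent(
    communities["English in the West Midlands"],
    communities["Welsh and English in the West Midlands"],
)
```

With real lists, `anomalies` would hold the matches behind the oddity described above, and `regions` gives the numbers for each area of the diagram.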

It seems clear that, at the moment, whilst it is helpful to know these matches have a UK connection, the Communities don't necessarily narrow that down to a particular branch of my family - partly because my genetic matches and I might both be in the same multiple communities but also because, as Blaine points out in his post, just because a match shares a particular community with me, it doesn't mean that this is definitely where the shared ancestry is from.  But the Genetic Communities certainly could be helpful 'pointers' to potential connections and I imagine they will also improve over time, so may eventually even hint at specific family lines, especially when combined with other information from known family history and shared matches.

What about those DNA matches that I have already identified some shared ancestry with - how do the Genetic Communities match up to our shared ancestry? 

Unfortunately, only two of those 'identified matches' appear in the same communities that I am in.  In one case, the match is in three of the communities I am in - the 'Welsh & English West Midlanders', 'English in the West Midlands' and 'English Midlanders and Northerners'.  There is quite an overlap between these three communities anyway, but it is reassuring that our shared ancestry is from around the Bromyard area, in north eastern Herefordshire.  The other match is in both the 'Southern English' and the 'English Midlanders and Northerners'.  In this case, our shared ancestry is in London in the later 1800s and then traces back to Wiltshire by the beginning of that century, so it looks as if the 'Southern English' community may be relevant to this - but, if I didn’t already know the connection, the shared 'English Midlanders and Northerners' could send us looking in the wrong place.

There is one other match who, whilst I don't know exactly how we relate, is known to be related to me on my mother's side, thanks to comparisons at Gedmatch.  They are in both the 'Southern English' and the 'English Midlanders and Northerners', either of which could be relevant to my mother's side of my family.  However, I have noticed that a third match, who is shared between the two of us, is showing as just in the 'Southern English' community, so that may possibly hint at where the shared ancestry is (although that community does take in everything under a line from South Wales to the Wash, so that's hardly narrowing things down :-) )

In another example, I do have a match who is in all four communities that I can see, but is a shared match to someone who is only in one of the four.  So the combination of the Genetic Communities with shared matches may be another topic to explore, to see if it can help indicate the potentially more relevant areas of the country to be researching in. 

However,  this may not be without its problems and may still be misleading to me.  For example, I have a match who shows up in just the 'Southern English' community, but both his profile and a shared match indicate there's likely to be a high level of Welsh ancestry.  Since I assume that I am not seeing any communities that my matches are in, but which I am not in, it's possible that they both share in a Welsh community,  and it's probably more likely that one of my West Midlands ancestors headed into Wales and connects into their trees that way, than the connection being in the south of England.

Shared matches are something I will write about in a separate post soon, so I shall perhaps consider the combined use of these two tools further in that.  I'm certainly grateful to AncestryDNA for the various tools they provide and look forward to future developments.  

I just know that I still have a lot to learn, to be able to work with the tools effectively!


Tuesday, 4 July 2017

A Day Out at UCL

It was with some slight trepidation that I set out last Tuesday morning for the Workshop on “Personal Genetic Testing: Challenges and Benefits in and Beyond the Clinic” at University College London (UCL). PGT covers more than just the ‘direct-to-customer’ DNA tests that we genetic genealogists use and this was clearly going to be an “academic” day. Was it all going to be “over my head”?

But, considering my interest in DNA testing, and with people such as Debbie Kennett involved in the event, I had decided it was worth taking the risk. In the end, whilst I imagine the day might not have appealed to the ‘average’ genetic genealogist, I did find it interesting and useful, even if some of the topics were not directly relevant to me.

Following an uneventful train journey to Euston (would you believe that first class on London Midland was actually the cheapest ticket!), I arrived with plenty of time to spare, so took a few minutes to sit in the garden at the Friends meeting house, which was between Euston and UCL, to enjoy the experience of being in London.

The first talk of the day was by Adam Rutherford.  Although he is well known as an author and presenter, it was actually the first time I’ve heard him speak. I may not fully grasp the issues of “identity politics” but his talk was interesting and informative, and gave me a better understanding of how genetics-related topics I’d previously learnt about, such as Mendelian inheritance from biology lessons at school, and the “Nature vs Nurture” debate from when I was studying psychology, fit into the wider picture of genetics.  I also learnt a new term (‘European Genetic isopoint’ – “the time at which everyone alive is the ancestor of everyone alive today, or no-one”*, which is said to be approximately the tenth century.)  It was enlightening (and slightly shocking) to hear how some topics, which Dr Rutherford described as “non-controversial” to geneticists, can be extremely controversial amongst some members of the general public.  Unfortunately, it would seem that the simplistic understanding many of us possibly have about genetics can lead to undesirable consequences, such as when a concept like the “warrior gene” becomes an accepted excuse for criminal behaviour.

Coffee break was followed by a focus group on the Science of Ancestry Testing, dealing with the tests we take for genealogy.  Rather than the panel members doing presentations, as in the later sessions, this was initiated by the moderator, Mark Thomas, posing some questions about what we actually mean when we use terms such as “ethnicity” and “ancestry”, and whether genetic testing is helping to debunk myths, or whether it is reinforcing them.  The representatives of two testing companies (Dave Nicholson, from LivingDNA, and Mike Mulligan, from Ancestry) made good points about their companies’ activities in education, and about the need to be trusted by their customers (and therefore having a solid scientific basis to their claims).  But there are clearly some differences of opinion with the scientists as to how scientific the simple ‘one-liners’ that often appear in adverts actually are.  And of course, it is often similar, simple one-liners that make the news headlines about DNA testing. There was a good comment from someone to the effect that phrases such as ‘the seven daughters of Eve’ may provide a “clear narrative” but are “scientifically problematic”.

So one “take home point” for me, from this session, was that I should try to be more critical and analytical about the things I read (and write) about genetic genealogy – people place their own interpretations on what they read, based on their own understanding and biases, and even terms such as “ancient ancestry” and “recent ancestry” often have different meanings, for example, when used by a genetic genealogist, as opposed to a population geneticist.  There is a need for clarity about how terms are being used in any particular context, as well as more awareness of the details underlying the headlines.

After the lunch break (when I joined most of the other genetic genealogists for an enjoyable lunch in the nearby Wellcome Institute Cafe), the first afternoon session concerned ethical issues in PGT.  This involved three presentations which were all thought-provoking, for different reasons.  Concepts such as “genomic sovereignty”, and the “forensic microbiome”, have certainly given me a few things to look up since I returned home*.  Whilst I cannot even imagine what it is like to live in a country such as Mexico, where thousands have been killed, or have disappeared, the second presentation, involving the question of the “personal or social” nature of genetic testing was one I could relate to more easily, having considered some of the issues myself when deciding to test at 23andMe (and in asking relatives to also test).  To know, or not to know, that is the question.  I was glad that one conclusion of the study was that people can make ethical decisions, if they have the relevant information.  The third presentation, concerning the ethical issues that arise in the use of DNA when dealing with disaster settings, is one I hope I never need to consider from a personal viewpoint.  Sadly very timely in the light of recent events, this was an insight into the very real challenges, and difficult decisions, faced by those who work in this field and raised many questions about the “Pandora’s box” that the ability to carry out DNA testing has opened.

The next session was a panel presenting social scientific perspectives on PGT and Identity. Unfortunately, I’ve always struggled with the “wordiness” of the social sciences so, for me, this was the least interesting session and reminded me of why I didn’t go into research following my psychology degree.

After another coffee break, there was a useful tutorial on the challenges of security and privacy in genomics.  It’s an important point to remember that, unlike passwords or bank details, there’s no “reset button” for our genomic data, which is why, even at the level of data we genealogists deal with, we should consider carefully what we share about it, and about those we connect to.

The final panel concerned medical and research aspects of PGT.  Again, these were interesting, even though not directly relevant to me.  The first, concerning personalised medicine and whole genome sequencing for genetic diagnosis, again illustrated some of the difficult decisions organisations such as the NHS face, when considering issues such as population screening, where the benefit of potentially discovering a curable disease at an early stage, needs to be weighed against the possibility of discovering other, untreatable, diseases at the same time. The second talk in this panel, and the final one of the day, was an enthusiastic presentation about open-access medical genomics, with particular concentration on the Personal Genome Project UK (PGP-UK). This introduced me to a few more “omics” terms (epigenomics and transcriptomics) to go with ‘genomics’, as well as describing how different types of data access affected ease of research. The PGP has a very intense application procedure, including an exam that even someone with a genetics PhD can fail, if they don’t read the information properly. So participants are very clear about what the project involves, and what open access of their data will mean, before they take part in the project.  I doubt there’ll be any concerns regarding a lack of informed consent in that project!

The day ended with an informal reception, which was another opportunity to catch up with the other genetic genealogists, and to hear their views of the day.

So, to sum up, I enjoyed the day and it opened my eyes to some of the wider issues concerning PGT and it wasn’t (entirely) over my head. I do feel that there is a gap between what the academics are focusing on and the priorities for many genetic genealogists. I imagine that the time some scientists have had to spend ‘debunking’ the more ridiculous claims that have been made regarding genetic identities of groups (such as of the Vikings), has influenced this. There clearly is scope for research into the relationship between DNA testing and identity, or ‘belonging’ – but I suspect that the majority of those testing initially do so from a sense of curiosity, rather than as a way of finding their place in the world or, as one participant put it, an “identity grab”, finding “distinctiveness in a complex world, with fractured identities”.  However, at the moment, there seems to be an overemphasis on the ancestry/ethnicity side of the tests and the claims relating to that aspect, rather than on the other aspects, such as “cousin matching” (in order to confirm researched family history, or to discover unknown parentage), which is a very important aspect for many genealogists who test.  Although I admit to having had a variety of reasons for the specific tests I have taken, or arranged to be taken, over the years, including curiosity and health issues, as well as using it as a tool for my one-name study, it is confirming my family history and finding new relatives that are the priority for me.  

Why does any of this matter?

There are probably several reasons, but here are a couple: I have visited societies that have had a talk by a scientist about DNA testing, who have been left with the impression that direct-to-customer DNA tests are overly expensive and not worth doing.  This concerns me, given that I am trying to encourage the use of such tests for genealogy.  I don’t mind people deciding against testing - I have several relatives who have done that and it is entirely their choice – but I’d like people to be making the decision based on accurate information.   Also, during the day I spoke to at least one person who supported the idea of regulation of the direct-to-customer DNA tests.  This wasn’t the first time that I’d found myself involved in such a conversation, having previously experienced it at WDYTYA. It’s no surprise that the topic of regulation comes up, not just because of the concerns about unscientific “ancestry” claims, but when one considers that there are now companies claiming they can use your DNA to help you with your diet, your exercise, even your wine choice*, it can seem as if the general public might need protecting.

So I think it is important that we continue to engage with the scientists and academic community to ensure that how we are using DNA testing is based on sound scientific principles, and that the way we are using it is then properly understood and represented by those who may, one day, be involved in any potential regulation.  I am very grateful to the other genetic genealogists who attended last week, as I know most of them have a better understanding of these issues than I do.  I’m also grateful to the scientists and staff at UCL, who are enabling ongoing debate about the issues surrounding PGT.  Long may it continue.

And, hopefully, we will all end up better for it.

* Sources, references or other relevant links
Personal Genetic Testing: Challenges and Benefits in and Beyond the Clinic

Genetic Sovereignty - 
Genomic Sovereignty and "The Mexican Genome" - https://ore.exeter.ac.uk/repository/handle/10036/3500
Genomic sovereignty and the African promise: mining the African genome for the benefit of Africa

The increasing use of DNA in other aspects of life:
Diet and fitness – examples of scientific literature I found:
http://www.bmj.com/content/324/7351/1438 (Summary free, main article behind a paywall)
(And a search on google for “DNA diet” will give results from companies aiming to sell you such a test.  Caveat emptor!)
Wine choice

Sunday, 25 June 2017

Analysing my DNA: Crossovers Part 2

This is a continuation from my part 1 post at http://notjusttheparrys.blogspot.co.uk/2016/12/analysing-my-dna-crossovers-part-1.html.  My initial intention for this post was simply to look at the shared matches between the siblings, to see how those results correlate with the phasing of chromosome 21 represented in part 1.  That sounds easy enough but one of the reasons it has taken me so long to post, is that things very rapidly become complicated! 

So, in this post, I will look at the shared matching between the siblings and their three closest relatives - a niece, a first cousin and a third cousin once removed - and how adding the additional relatives caused me to alter my interpretation of how the niece matched. 

This was my starting point from part 1, the four siblings A, B, C, and D, with their chromosomes represented by the four colours:

The parents' chromosomes can then be represented as follows:

And which parts of the parents' chromosomes each of the siblings received like this:

One of the closest matches to the siblings is their niece, daughter of a deceased brother.  Since the brother was never tested, I don't know what crossover points he received from his parents. The niece will only have one chromosome (of each chromosome pair) from her father - but the differences in matching between the niece and the siblings could be as a result of crossovers within each of her father's chromosomes, or between the father's two chromosomes. 

So these are the comparisons between the niece and each of the siblings from Gedmatch, along with a potential crossover point identified at 43:

So immediately there is an issue - the niece matches sibling B up until 43, and, correctly, does not match any of the other siblings until that point.  However, beyond 43, the niece appears to match none of the siblings (based on the grey "match" bar).  But we know that, since siblings A and B do not match each other at all on this chromosome, the two of them, between them, cover all four of the siblings' parents' chromosome 21s.  So, if the niece doesn't match sibling B, then she has to match sibling A at least.  And, looking at the Gedmatch image, it seems quite clear that this is a threshold issue - the niece does actually match the three siblings A, C and D beyond 43.  The match just isn't being picked up as a match by Gedmatch at the default threshold.  Reducing the threshold indicates the niece matches all three siblings A, C and D by 6.9 cM, containing between 1041 and 1045 SNPs.
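To illustrate why a segment like this can go unreported, here is a minimal sketch of a threshold test. It is not Gedmatch's actual code: the `Segment` class and `passes` function are my own names, the 7 cM default is the threshold mentioned in this post, and the 500 SNP minimum is simply an assumed value for the example. The 1041 SNP count is the lower end of the range quoted above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    cm: float   # segment length in centimorgans
    snps: int   # number of SNPs in the segment


def passes(segment, min_cm=7.0, min_snps=500):
    """Return True if the segment would be reported at the given thresholds.

    min_snps=500 is an assumption for illustration, not a documented default.
    """
    return segment.cm >= min_cm and segment.snps >= min_snps


niece_vs_a = Segment("niece vs sibling A, beyond 43", cm=6.9, snps=1041)

reported_by_default = passes(niece_vs_a)          # 6.9 cM is just under 7 cM
reported_when_lowered = passes(niece_vs_a, min_cm=6.0)
```

The segment has plenty of SNPs, so it is only the centimorgan cut-off that hides it at the default setting.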

The initial interpretation of the DNA received by the niece therefore became:

Next, I looked at how the siblings and the niece matched the siblings' paternal first cousin.  The Gedmatch image below was produced using the default threshold, but again, reducing the thresholds slightly indicated a potential matching segment just below the 7cM threshold:

Chr   Start Location   End Location   Centimorgans (cM)
21    14,677,076       22,936,413     18.2
21    22,950,552       33,423,011     15.7
21    34,132,054       37,056,381     6.7
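As a small worked example, the three segments listed above can be read into Python to total the shared centimorgans and to pick out any segment that falls under the 7 cM threshold. This is only a sketch: the field names and the `parse_row` helper are my own, and the rows are typed in by hand rather than read from a Gedmatch download.

```python
# The three chromosome 21 segments from the table above, typed in by hand.
ROWS = """\
21 14,677,076 22,936,413 18.2
21 22,950,552 33,423,011 15.7
21 34,132,054 37,056,381 6.7"""


def parse_row(line):
    """Split one row into chromosome, start, end and cM, removing the commas."""
    chrom, start, end, cm = line.split()
    return {
        "chr": chrom,
        "start": int(start.replace(",", "")),
        "end": int(end.replace(",", "")),
        "cm": float(cm),
    }


segments = [parse_row(line) for line in ROWS.splitlines()]

# Total shared cM across all three segments.
total_cm = sum(seg["cm"] for seg in segments)

# Segments that would not be reported at a 7 cM threshold.
below_threshold = [seg for seg in segments if seg["cm"] < 7.0]

# Physical length of each segment in megabases, for comparison with the cM values.
lengths_mb = [(seg["end"] - seg["start"]) / 1_000_000 for seg in segments]
```

Only the last segment ends up in `below_threshold`, which is the one that needed the reduced threshold to show up.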

The paternal first cousin can only match the siblings through their father's chromosomes.  But, as their father will not have received exactly the same DNA as the first cousin's parent did, there will be some areas where the first cousin does not match any of the siblings.

By comparison to the phasing of the siblings and niece, the first cousin's matching segments were therefore mapped as follows:

(this process also indicated that the "Parent 2" phasing represents the siblings' father's chromosomes.)

So far, so good.

When I downloaded the matching segments for the siblings, in order to start investigating the shared matches, I realised a known relative shared DNA with sibling B on chromosome 21.  The relative is a 3rd cousin once removed (3c1r) and shares from about 17 to 28.  The shared ancestry is on the siblings' paternal side of the family, the same as the 1c is:

But now there's a problem.  This 3c1r does not match any of the other siblings, or the niece, on chromosome 21.  But, at the point where the 3c1r matches B, we have already "used" both of the paternal chromosomes, one for the matching between the first cousin and siblings ACD, the other for the matching between the niece and sibling B.   It's okay that the 1c doesn't match the 3c1r - that actually indicates that the chromosome ACD share with the 1c must be the one the siblings' father received from his mother, the siblings' grandmother, as she is also a common ancestor with the 1c.
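The "we have already used both chromosomes" argument can be written as a simple check: a parent only has two copies of chromosome 21, so three overlapping matches that are all on different copies cannot all sit on that one parent's side. The sketch below is only an illustration of that reasoning. The positions are approximate (the niece's start point of 14 and the 1c range of 15 to 33 are my assumptions for the example), and the names and functions are hypothetical.

```python
from itertools import combinations

# Approximate positions of matches currently assigned to the father's side.
claims = {
    "1c with A, C, D": (15, 33),
    "niece with B": (14, 43),
    "3c1r with B": (17, 28),
}

# Pairs known to be on different copies of the chromosome: A and B share
# nothing on chromosome 21, and the 3c1r matches neither the niece nor the 1c.
distinct = {
    frozenset(pair)
    for pair in [
        ("1c with A, C, D", "niece with B"),
        ("1c with A, C, D", "3c1r with B"),
        ("niece with B", "3c1r with B"),
    ]
}


def common_overlap(intervals):
    """Return the region shared by all intervals, or None if there is none."""
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None


def conflicts(claims, distinct, copies=2):
    """Find regions needing more distinct copies than one parent has."""
    found = []
    for names in combinations(claims, copies + 1):
        all_distinct = all(
            frozenset(pair) in distinct for pair in combinations(names, 2)
        )
        region = common_overlap([claims[name] for name in names])
        if all_distinct and region:
            found.append((names, region))
    return found


problem = conflicts(claims, distinct)

# Moving the niece/B match off the father's side removes the conflict.
without_niece = {k: v for k, v in claims.items() if k != "niece with B"}
resolved = conflicts(without_niece, distinct)
```

Here `problem` reports a clash over the 17 to 28 region, and it disappears once one of the three matches is reassigned, which is the same conclusion reached by hand.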

But, clearly the chromosome the niece shares with sibling B cannot be the other paternal chromosome.  As far as I am aware, there's no other shared ancestry with the 3c1r.  So, let's go back to the matching between the niece and the siblings - where did I go wrong?

Siblings A, C, and D all show a very small area of potentially matching SNPs between 24 and 26 - but it is only 1.5 cM and 365 SNPs.  I don't believe that has any significance, especially as there's no change in matching with sibling B. (The niece only has one relevant chromosome in this comparison - and the kit being used is a "paternal" one that's been phased using her mother's data, so should be fairly accurate.)

So what about the potentially matching segment with sibling C, between 37 and 39?  This is a 4.2 cM segment, containing 743 SNPs - so it is a small segment that, under normal circumstances, when matching to unknown and more distant relatives, should be ignored.

From the sibling phasing, B and C are matching from 37, after C had a crossover, and their matching segment is a "Parent 1" segment.  So, is it possible that the niece's matching should actually be as follows:

The niece is matching B on a Parent 1 chromosome (now known to be maternal).  Sibling C then starts to match both B & the niece at 37, but the niece stops matching C at 39, as the niece has a crossover between the two chromosomes her father had.  If she switches from her father's maternal chromosome to his paternal chromosome, and those are also the two chromosomes sibling B has, that would account for why the niece continues to match B until 43.  At 43 there is then a crossover between the two chromosomes of Parent 2 - which would indicate a crossover in the niece's father, passed on to the niece within the segment from his paternal chromosome.  This interpretation would account for the niece's match to the 1c, between 40 and 43, and explain why she does not match the paternal 3c1r earlier on the chromosome, between 17 and 28.

If that is the situation, then the diagram of the siblings' parents' chromosomes can now be extended to also show the DNA received by their grandchild, the siblings' niece, as well as the potential source for the paternal chromosomes:

Please let me know if you can spot any mistakes in my reasoning.