Kamil Slowikowski
2024-04-22
Table of Contents
Here, we share a single file afnd.tsv (5.99MB) in tab-delimited format with all allele frequencies for 8 HLA genes, 18 KIR genes, 2 MIC genes, and 29 cytokine genes from Allele Frequency Net Database (AFND).
The script allelefrequencies.py automatically downloads allele frequencies from the website.
What is the Allele Frequency Net Database?
The Allele Frequency Net Database (AFND) is a public database which contains frequency information of several immune genes such as Human Leukocyte Antigens (HLA), Killer-cell Immunoglobulin-like Receptors (KIR), Major histocompatibility complex class I chain-related (MIC) genes, and a number of cytokine gene polymorphisms.
The afnd.tsv file looks like this:
d <- fread("afnd.tsv")
head(d)
## group gene allele population indivs_over_n alleles_over_2n n
## 1: hla A A*01:01 Argentina Rosario Toba 15.1 0.0760 86
## 2: hla A A*01:01 Armenia combined Regions 0.1250 100
## 3: hla A A*01:01 Australia Cape York Peninsula Aborigine 0.0530 103
## 4: hla A A*01:01 Australia Groote Eylandt Aborigine 0.0270 75
## 5: hla A A*01:01 Australia New South Wales Caucasian 0.1870 134
## 6: hla A A*01:01 Australia Yuendumu Aborigine 0.0080 191
Definitions:
alleles_over_2n
(Alleles / 2n) Allele Frequency: total number of
copies of the allele in the population sample in three decimal format.
indivs_over_n
(100 * Individuals / n) Percentage of individuals who
have the allele or gene.
n
(Individuals) Number of individuals sampled from the population.
Here are a few examples of how we can use R to analyze these data.
View the largest and smallest populations available in the data:
d %>%
mutate(n = parse_number(n)) %>%
select(population, n) %>%
unique() %>%
arrange(-n)
## population n
## 1: Germany DKMS - German donors 3456066
## 2: USA NMDP European Caucasian 1242890
## 3: USA NMDP African American pop 2 416581
## 4: USA NMDP Mexican or Chicano 261235
## 5: USA NMDP South Asian Indian 185391
## ---
## 1489: Cameroon Sawa 13
## 1490: Paraguay/Argentina Ache NA-DHS_24 (G) 13
## 1491: Malaysia Orang Kanaq Cytokine 11
## 1492: Cameroon Baka Pygmy 10
## 1493: Paraguay/Argentina Guarani NA-DHS_23 (G) 10
Count the number of alleles for each gene:
d %>%
count(group, gene, allele) %>%
count(group, gene) %>%
arrange(-n) %>%
head(15)
## group gene n
## 1: hla B 1979
## 2: hla A 1394
## 3: hla C 1209
## 4: hla DRB1 954
## 5: hla DPB1 384
## 6: hla DQB1 351
## 7: kir 3DL1 90
## 8: mic MICA 69
## 9: kir 3DL3 67
## 10: kir 2DL1 52
## 11: kir 2DL4 35
## 12: mic MICB 34
## 13: hla DQA1 30
## 14: kir 3DL2 30
## 15: kir 2DL5B 24
Sum the allele frequencies for each gene in each population. This allows us to see which populations have a set of allele frequencies that adds up to 100 percent:
d %>%
mutate(alleles_over_2n = parse_number(alleles_over_2n)) %>%
filter(alleles_over_2n > 0) %>%
group_by(group, gene, population) %>%
summarize(sum = sum(alleles_over_2n)) %>%
count(sum == 1)
## `summarise()` has grouped output by 'group', 'gene'. You can override using the `.groups` argument.
## # A tibble: 44 × 4
## # Groups: group, gene [28]
## group gene `sum == 1` n
##
## 1 hla A FALSE 420
## 2 hla A TRUE 18
## 3 hla B FALSE 513
## 4 hla B TRUE 19
## 5 hla C FALSE 323
## 6 hla C TRUE 19
## 7 hla DPA1 FALSE 54
## 8 hla DPA1 TRUE 6
## 9 hla DPB1 FALSE 207
## 10 hla DPB1 TRUE 39
## # ℹ 34 more rows
Plot the frequency of a specific allele in populations with more than 1000 sampled individuals:
my_allele <- "DQB1*02:01"
my_d <- d %>% filter(allele == my_allele) %>%
mutate(
n = parse_number(n),
alleles_over_2n = parse_number(alleles_over_2n)
) %>%
filter(n > 1000) %>%
arrange(-alleles_over_2n)
ggplot(my_d) +
aes(x = alleles_over_2n, y = reorder(population, alleles_over_2n)) +
scale_y_discrete(position = "right") +
geom_colh() +
labs(
x = "Allele Frequency (Alleles / 2N)",
y = NULL,
title = glue("Frequency of {my_allele} across populations"),
caption = "Data from AFND http://allelefrequencies.net"
)
If you use this data, please cite the latest manuscript about Allele Frequency Net Database:
@ARTICLE{Gonzalez-Galarza2020,
title = "{Allele frequency net database (AFND) 2020 update: gold-standard
data classification, open access genotype data and new query
tools}",
author = "Gonzalez-Galarza, Faviel F and McCabe, Antony and Santos, Eduardo
J Melo Dos and Jones, James and Takeshita, Louise and
Ortega-Rivera, Nestor D and Cid-Pavon, Glenda M Del and
Ramsbottom, Kerry and Ghattaoraya, Gurpreet and Alfirevic, Ana
and Middleton, Derek and Jones, Andrew R",
journal = "Nucleic acids research",
volume = 48,
number = "D1",
pages = "D783--D788",
month = jan,
year = 2020,
language = "en",
issn = "0305-1048, 1362-4962",
pmid = "31722398",
doi = "10.1093/nar/gkz1029",
pmc = "PMC7145554"
}
Here are all of the resources I could find that have information about HLA allele frequencies in different populations.
https://github.com/Vaccitech/HLAfreq/
The authors provide xlsx files on this website:
But the frequency information is binned into categories:
There is a tool called HLA-Net that provides a visualization of the CIWD data.
http://tools.iedb.org/population/download
At the IEDB Tools page, we can find a tool called Population Coverage. The authors have downloaded the HLA frequency information from AFND and saved it in a Python pickle file.
https://www.ncbi.nlm.nih.gov/gv/mhc
The dbMHC database and website appears to be discontinued. But an archive of old files is still available via FTP.
https://bioinformatics.bethematchclinical.org/hla-resources/haplotype-frequencies/high-resolution-hla-alleles-and-haplotypes-in-the-us-population/
Thanks to David A. Wells for sharing scrapeAF, which inspired me to work on this project.