irGSEA:基于秩次的单细胞基因集富集分析整合框架
距离上一次先容irGSEA,仍是是两年前了,详见:8种法子可视化你的单细胞基因集打分,现在Seurat皆仍是更新到了V5,假如你不心爱最新版的Seurat包的单细胞理念,恰恰这个irGSEA亦然与时俱进,不单是是更新到了17种基因集结打分算法,况兼是互助V4和V5版块的Seurat责任流,超越的浅易!淌若民众看已矣这个万字长文的先容,仍然是欢欣深化了解irGSEA不错望望文末的交流群哈!
布景许多Functional Class Scoring (FCS)法子,如GSEA, GSVA,PLAGE, addModuleScore, SCSE, Vision, VAM, gficf, pagoda2和Sargent,皆会受数据集组成的影响,数据集组成的狭窄变化将改变细胞的基因集富集分数。
假如将新的单细胞数据集整合到现存数据中,使用这些FCS法子需要从头计较每个细胞的基因集富集分数。这个形式可能是繁琐且资源密集的。
相背,基于单个细胞抒发品级的FCS,如AUCell、UCell、singscore、ssGSEA、JASMINE和Viper,只需要计较新添加的单细胞数据集的富集分数,而无需从头计较整个细胞的基因集富集分数。原因是这些法子生成的富集分数仅依赖于单个细胞水平上的相对基因抒发,与数据集组成无关。因此,这些法子不错知人善任多数的工夫。
注释成果在这里,咱们注释了17种常见的FCS法子:
GSEA 检测排序基因列表顶部或底部的基因集富集进程,该列表是分组后计较排序基因信噪比或排序基因倍数变化得到的;GSVA 推测整个细胞之间每个基因的聚积密度函数的核。 这个过程中需要讨论整个样本,容易受到样本布景信息的影响;PLAGE 对跨细胞的基因抒发矩阵进行尺度化,并索要奇异值领会看成基因集富集分数;Zscore 团聚了基因集合整个基因的抒发,通过细胞间的平均值和尺度差缩放抒发;AddModuleScore需要先计较基因集合整个基因的平均值,再笔据平均值把抒发矩阵切割成多少份,然后从切割后的每一份中当场抽取对照基因(基因集外的基因)看成布景值。因此,在整合不相同本的情况下,即使使用交流基因集为交流细胞打分,也会产生不同的富集评分;SCSE 使用基因集整个基因的归一化的总额来量化基因集富集分数;Vision 使用当场签名的预期均值和方差对基因集富集分数进行 z 归一化从而篡改基因集富集分数;VAM 笔据经典Mahalanobis多元距离从单细胞 RNA 测序数据生成基因集富集分数;Gficf 欺骗通过非负矩阵领会赢得的基因抒发值的潜在因子的信息生物信号;Pagoda2 拟合每个细胞的误差模子,并使用其第一个加权主因素量化基因集富集分数;AUCell 基于单个样本中的基因抒发排行,使用弧线底下积来评估输入基因集是否在单个样本的前5%抒发基因内富集;UCell 基于单个样本的基因抒发排行,使用Mann-Whitney U统计量计较单个样本的基因集富集分数;Singscore 笔据基因抒发品级评估距单个细胞中心的距离。 基因集合的基因笔据单个细胞中的转录本品貌进行排序。 平均品级相对于表面最小值和最大值单独尺度化,以零为中心,然后团聚,所得分数代表基因集的富集分数;ssGSEA 笔据每个细胞的基因抒发品级计较里面和外部基因集之间的造就聚积散布的各异分数。 使用全局抒发谱对各异分数进行尺度化。 尺度化这一步容易受样本组成的影响。JASMINE 笔据在单个细胞中抒发基因中的基因排行和抒发基因中基因集的富集度计较类似平均值。 这两个值均尺度化为 0-1 范畴,并通过平均进行组合,得出基因集的最终富集分数。Viper 通过笔据细胞间基因抒发的排行推论three-tailed计较来推测基因集的富集分数。Sargent 将给定细胞的非零抒发基因从高抒发到低抒发进行排序,并将输入的基因逐细胞抒发矩阵休养为相应的gene-set-by-cell assignment score matrix。 但Sargent 需要计较细胞间的gini-index后,将按gene-set-by-cell assignment score matrix休养为distribution of indexes。责任历程图片
使用AUCell、UCell、singscore、ssgsea、JASMINE 和 viper分别对各个细胞进行评分,得到不同的富集评分矩阵。通过wilcoxon陶冶计较不同的富集评分矩阵中每个细胞亚群各异抒发的基因集。up或down默示该细胞簇内各异基因集的富集进程高于或低于其他簇。单一的基因集富集分析法子不仅只可响应有限的信息,况兼也容易带来误差。咱们期待从多个角度显露复杂的生物知识题,并找到生物知识题中的共性部分。大致地为多种基因集富集分析法子的成果取共同错乱,不仅容易得到少而保守的成果,况兼忽略了富集分析法子中许多的其他信息,举例不同基因集的相对富集进程信息。
咱们但愿主见基因集在大部分富集分析法子中皆是富集且富集进程莫得昭彰各异。因此,咱们通过RobustRankAggreg包中的秩团聚算法(robust rank aggregation, RRA)对各异分析的成果进行评估,筛选出在6种法子中发扬出相似的富集进程的各异基因集。
irGSEA装配1.irGSEA装配(基础树立)仅使用 AUCell, UCell, singscore, ssGSEA, JASMINE和viper
# install packages from CRANcran.packages <- c("aplot", "BiocManager", "data.table", "devtools", "doParallel", "doRNG", "dplyr", "ggfun", "gghalves", "ggplot2", "ggplotify", "ggridges", "ggsci", "irlba", "magrittr", "Matrix", "msigdbr", "pagoda2", "pointr", "purrr", "RcppML", "readr", "reshape2", "reticulate", "rlang", "RMTstat", "RobustRankAggreg", "roxygen2", "Seurat", "SeuratObject", "stringr", "tibble", "tidyr", "tidyselect", "tidytree", "VAM")for (i in cran.packages) { if (!requireNamespace(i, quietly = TRUE)) { install.packages(i, ask = F, update = F) }}# install packages from Bioconductorbioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap", "decoupleR", "fgsea", "ggtree", "GSEABase", "GSVA", "Nebulosa", "scde", "singscore", "SummarizedExperiment", "UCell", "viper","sparseMatrixStats")for (i in bioconductor.packages) { if (!requireNamespace(i, quietly = TRUE)) { install.packages(i, ask = F, update = F) }}# install packages from Githubif (!requireNamespace("irGSEA", quietly = TRUE)) { devtools::install_github("chuiqin/irGSEA", force =T)}2.irGSEA装配(进阶树立)
思使用VISION, gficf, Sargent, ssGSEApy, GSVApy等其他法子(这部分是可选装配)
# VISIONif (!requireNamespace("VISION", quietly = TRUE)) { devtools::install_github("YosefLab/VISION", force =T)}# mdt need rangerif (!requireNamespace("ranger", quietly = TRUE)) { devtools::install_github("imbs-hl/ranger", force =T)}# gficf need RcppML (version > 0.3.7) packageif (!utils::packageVersion("RcppML") > "0.3.7") { message("The version of RcppML should greater than 0.3.7 and install RcppML package from Github") devtools::install_github("zdebruine/RcppML", force =T)}# please first `library(RcppML)` if you want to perform gficfif (!requireNamespace("gficf", quietly = TRUE)) { devtools::install_github("gambalab/gficf", force =T)}# GSVApy and ssGSEApy need SeuratDisk packageif (!requireNamespace("SeuratDisk", quietly = TRUE)) { devtools::install_github("mojaveazure/seurat-disk", force =T)}# sargentif (!requireNamespace("sargent", quietly = TRUE)) { devtools::install_github("Sanofi-Public/PMCB-Sargent", force =T)}# pagoda2 need scde packageif (!requireNamespace("scde", quietly = TRUE)) { devtools::install_github("hms-dbmi/scde", force =T)}# if error1 (functio 'sexp_as_cholmod_sparse' not provided by package 'Matrix')# or error2 (functio 'as_cholmod_sparse' not provided by package 'Matrix') occurs# when you perform pagoda2, please check the version of irlba and Matrix# It's ok when I test as follow:# R 4.2.2 irlba(v 2.3.5.1) Matrix(1.5-3)# R 4.3.1 irlba(v 2.3.5.1) Matrix(1.6-1.1)# R 4.3.2 irlba(v 2.3.5.1) Matrix(1.6-3)#### create conda env# If error (Unable to find conda binary. Is Anaconda installed) occurs, # please perform `reticulate::install_miniconda()`if (! "irGSEA" %in% reticulate::conda_list()$name) { reticulate::conda_create("irGSEA")}# if python package existpython.package <- reticulate::py_list_packages(envname = "irGSEA")$packagerequire.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")for (i in seq_along(require.package)) { if (i %in% python.package) { reticulate::conda_install(envname = "irGSEA", packages = i, pip = T) }}3.国内镜像加快装配
装配github包和pip包有清贫的参考这里,我把github包克隆到了gitee上
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor/")options("repos" = c(CRAN="http://mirrors.cloud.tencent.com/CRAN/"))# install packages from CRANcran.packages <- c("aplot", "BiocManager", "data.table", "devtools", "doParallel", "doRNG", "dplyr", "ggfun", "gghalves", "ggplot2", "ggplotify", "ggridges", "ggsci", "irlba", "magrittr", "Matrix", "msigdbr", "pagoda2", "pointr", "purrr", "RcppML", "readr", "reshape2", "reticulate", "rlang", "RMTstat", "RobustRankAggreg", "roxygen2", "Seurat", "SeuratObject", "stringr", "tibble", "tidyr", "tidyselect", "tidytree", "VAM")for (i in cran.packages) { if (!requireNamespace(i, quietly = TRUE)) { install.packages(i, ask = F, update = F) }}# install packages from Bioconductorbioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap", "decoupleR", "fgsea", "ggtree", "GSEABase", "GSVA", "Nebulosa", "scde", "singscore", "SummarizedExperiment", "UCell", "viper")for (i in bioconductor.packages) { if (!requireNamespace(i, quietly = TRUE)) { install.packages(i, ask = F, update = F) }}# install packages from gitif (!requireNamespace("irGSEA", quietly = TRUE)) { devtools::install_git("https://gitee.com/fan_chuiqin/irGSEA.git", force =T)}# VISIONif (!requireNamespace("VISION", quietly = TRUE)) { devtools::install_git("https://gitee.com/fan_chuiqin/VISION.git", force =T)}# mdt need rangerif (!requireNamespace("ranger", quietly = TRUE)) { devtools::install_git("https://gitee.com/fan_chuiqin/ranger.git", force =T)}# gficf need RcppML (version > 0.3.7) packageif (!utils::packageVersion("RcppML") > "0.3.7") { message("The version of RcppML should greater than 0.3.7 and install RcppML package from Git") devtools::install_git("https://gitee.com/fan_chuiqin/RcppML.git", force =T)}# please first `library(RcppML)` if you want to perform gficfif (!requireNamespace("gficf", quietly = TRUE)) { devtools::install_git("https://gitee.com/fan_chuiqin/gficf.git", force =T)}# GSVApy and ssGSEApy need SeuratDisk packageif (!requireNamespace("SeuratDisk", quietly = TRUE)) { devtools::install_git("https://gitee.com/fan_chuiqin/seurat-disk.git", force =T)}# sargentif (!requireNamespace("sargent", quietly = TRUE)) { devtools::install_git("https://gitee.com/fan_chuiqin/PMCB-Sargent.git", force =T)}# pagoda2 need scde packageif (!requireNamespace("scde", quietly = TRUE)) { devtools::install_git("https://gitee.com/fan_chuiqin/scde.git", force =T)}#### create conda env# If error (Unable to find conda binary. Is Anaconda installed) occurs, # please perform `reticulate::install_miniconda()`if (! "irGSEA" %in% reticulate::conda_list()$name) { reticulate::conda_create("irGSEA")}# if python package existpython.package <- reticulate::py_list_packages(envname = "irGSEA")$packagerequire.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")for (i in require.package) { if (! i %in% python.package) { reticulate::conda_install(envname = "irGSEA", packages = i, pip = T, pip_options = "-i https://pypi.tuna.tsinghua.edu.cn/simple") }}使用教程1.irGSEA守旧Seurat 对象(V5或V4),Assay对象(V5或V4)
# 咱们通过SeuratData包加载示例数据集(正式好的PBMC数据集)看成演示#### Seurat V4对象 ####library(Seurat)library(SeuratData)library(RcppML)library(irGSEA)data("pbmc3k.final")pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", slot = "data", seeds = 123, ncores = 1, min.cells = 3, min.feature = 0, custom = F, geneset = NULL, msigdb = T, species = "Homo sapiens", category = "H", subcategory = NULL, geneid = "symbol", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), aucell.MaxRank = NULL, ucell.MaxRank = NULL, kcdf = 'Gaussian') Assays(pbmc3k.final)>[1] "RNA" "AUCell" "UCell" "singscore" "ssgsea" "JASMINE" "viper" #### Seurat V5对象 ####data("pbmc3k.final")pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)pbmc3k.final2 <- CreateSeuratObject(counts = CreateAssay5Object(GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts")), meta.data = pbmc3k.final[[]])pbmc3k.final2 <- NormalizeData(pbmc3k.final2)pbmc3k.final2 <- irGSEA.score(object = pbmc3k.final2,assay = "RNA", slot = "data", seeds = 123, ncores = 1, min.cells = 3, min.feature = 0, custom = F, geneset = NULL, msigdb = T, species = "Homo sapiens", category = "H", subcategory = NULL, geneid = "symbol", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), aucell.MaxRank = NULL, ucell.MaxRank = NULL, kcdf = 'Gaussian')Assays(pbmc3k.final2)[1] "RNA" "AUCell" "UCell" "singscore" "ssgsea" "JASMINE" "viper" #### Assay5 对象 ####data("pbmc3k.final")pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)pbmc3k.final3 <- CreateAssay5Object(counts = GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts"))pbmc3k.final3 <- NormalizeData(pbmc3k.final3)pbmc3k.final3 <- irGSEA.score(object = pbmc3k.final3,assay = "RNA", slot = "counts", seeds = 123, ncores = 1, min.cells = 3, min.feature = 0, custom = F, geneset = NULL, msigdb = T, species = "Homo sapiens", category = "H", subcategory = NULL, geneid = "symbol", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), aucell.MaxRank = NULL, ucell.MaxRank = NULL, kcdf = 'Gaussian')Assays(pbmc3k.final3)# Assay5对象存放在RNA assay中
成果文献大小统计:
# 看起来Seurat V5和Assay 5如实存放数据会比较小> cat(object.size(pbmc3k.final) / (1024^2), "M")339.955 M> cat(object.size(pbmc3k.final2) / (1024^2), "M")69.33521 M> cat(object.size(pbmc3k.final3) / (1024^2), "M")69.27851 M2.irGSEA守旧的基因集打分法子
测试了不同数据大小下各式评分法子使用50个Hallmark基因集进行打分所需的工夫和内存峰值, 民众笔据我方的电脑和工夫进行酌情选拔;
图片
GSVApy、ssGSEApy 和 viperpy 分别代表 GSVA、ssGSEA 和 viper 的 Python 版块。Python版块的GSVA比R版块的GSVA从简太多工夫了。咱们对singscore、ssGSEA、JASMINE、viper的内存峰值进行了优化。 对于擢升 50000 个细胞的数据集,咱们实施了一种政策,将它们分袂为5000 个细胞/单位进行评分。 天然这不错缓解内存峰值问题,但如实会延迟照拂工夫。
3.irGSEA守旧的基因集打分法子为了浅易用户获取MSigDB数据库中事前界说好的基因集,咱们内置了msigdbr包进行MSigDB的基因集数据的获取。msigdbr包守旧多个物种的基因集获取,以及多种基因形式的抒发矩阵的输入。
①irGSEA通过内置的msigdbr包进行打分
library(Seurat)library(SeuratData)library(RcppML)library(irGSEA)data("pbmc3k.final")#### Hallmark基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", slot = "data", seeds = 123, ncores = 1, min.cells = 3, min.feature = 0, custom = F, geneset = NULL, msigdb = T, species = "Homo sapiens", category = "H", subcategory = NULL, geneid = "symbol", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), aucell.MaxRank = NULL, ucell.MaxRank = NULL, kcdf = 'Gaussian')#### KEGG基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", slot = "data", seeds = 123, ncores = 1, min.cells = 3, min.feature = 0, custom = F, geneset = NULL, msigdb = T, species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG", geneid = "symbol", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), aucell.MaxRank = NULL, ucell.MaxRank = NULL, kcdf = 'Gaussian')#### GO-BP基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", slot = "data", seeds = 123, ncores = 1, min.cells = 3, min.feature = 0, custom = F, geneset = NULL, msigdb = T, species = "Homo sapiens", category = "C5", subcategory = "GO:BP", geneid = "symbol", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), aucell.MaxRank = NULL, ucell.MaxRank = NULL, kcdf = 'Gaussian')
②irGSEA使用最新版的MSigDB基因集进行打分
缺憾的是,msigdbr包内置的MSigDB的版块是MSigDB 2022.1。然则,现在MSigDB数据库仍是更新到了2023.2,包括2023.2.Hs和2023.2.Mm。咱们不错从这个集合(https://data.broadinstitute.org/gsea-msigdb/msigdb/release/)下载2023.2.Hs的gmt文献或者db.zip文献。比拟gmt文献,db.zip文献包含了基因集的刻画,不错用来筛选XX功能相关基因。底下的例子中,我将先容奈何筛选血管生成相关的基因集。
#### work with newest Msigdb ##### https://data.broadinstitute.org/gsea-msigdb/msigdb/release/# In this page, you can download human/mouse gmt file or db.zip file# The db.zip file contains metadata information for the gene set# load librarylibrary(clusterProfiler)library(tidyverse)library(DBI)library(RSQLite)### db.zip #### download zip file and unzip zip filezip_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb_v2023.2.Hs.db.zip"local_zip_path <- "./msigdb_v2023.2.Hs.db.zip"download.file(zip_url, local_zip_path)unzip(local_zip_path, exdir = "./")# code modified by https://rdrr.io/github/cashoes/sear/src/data-raw/1_parse_msigdb_sqlite.rcon <- DBI::dbConnect(RSQLite::SQLite(), dbname = './msigdb_v2023.2.Hs.db')DBI::dbListTables(con)# define tables we want to combinegeneset_db <- dplyr::tbl(con, 'gene_set') # standard_name, collection_namedetails_db <- dplyr::tbl(con, 'gene_set_details') # description_brief, description_fullgeneset_genesymbol_db <- dplyr::tbl(con, 'gene_set_gene_symbol') # meat and potatoesgenesymbol_db <- dplyr::tbl(con, 'gene_symbol') # mapping from ids to gene symbolscollection_db <- dplyr::tbl(con, 'collection') %>% dplyr::select(collection_name, full_name) # collection metadata# join tablesmsigdb <- geneset_db %>% dplyr::left_join(details_db, by = c('id' = 'gene_set_id')) %>% dplyr::left_join(collection_db, by = 'collection_name') %>% dplyr::left_join(geneset_genesymbol_db, by = c('id' = 'gene_set_id')) %>% dplyr::left_join(genesymbol_db, by = c('gene_symbol_id' = 'id')) %>% dplyr::select(collection = collection_name, subcollection = full_name, geneset = standard_name, description = description_brief, symbol) %>% dplyr::as_tibble() # clean upDBI::dbDisconnect(con)unique(msigdb$collection)# [1] "C1" "C2:CGP" "C2:CP:BIOCARTA" # [4] "C2:CP:KEGG_LEGACY" "C2:CP:PID" "C3:MIR:MIRDB" # [7] "C3:MIR:MIR_LEGACY" "C3:TFT:GTRD" "C3:TFT:TFT_LEGACY" # [10] "C4:3CA" "C4:CGN" "C4:CM" # [13] "C6" "C7:IMMUNESIGDB" "C7:VAX" # [16] "C8" "C5:GO:BP" "C5:GO:CC" # [19] "C5:GO:MF" "H" "C5:HPO" # [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME" "C2:CP:WIKIPATHWAYS"# [25] "C2:CP" unique(msigdb$subcollection)# [1] "C1" "C2:CGP" "C2:CP:BIOCARTA" # [4] "C2:CP:KEGG_LEGACY" "C2:CP:PID" "C3:MIR:MIRDB" # [7] "C3:MIR:MIR_LEGACY" "C3:TFT:GTRD" "C3:TFT:TFT_LEGACY" # [10] "C4:3CA" "C4:CGN" "C4:CM" # [13] "C6" "C7:IMMUNESIGDB" "C7:VAX" # [16] "C8" "C5:GO:BP" "C5:GO:CC" # [19] "C5:GO:MF" "H" "C5:HPO" # [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME" "C2:CP:WIKIPATHWAYS"# [25] "C2:CP" # convert to list[Hallmark] required by irGSEA packagemsigdb.h <- msigdb %>% dplyr::filter(collection=="H") %>% dplyr::select(c("geneset", "symbol"))msigdb.h$geneset <- factor(msigdb.h$geneset)msigdb.h <- msigdb.h %>% dplyr::group_split(geneset, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>% purrr::set_names(levels(msigdb.h$geneset))#### Hallmark基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = msigdb.h, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')# convert to list[go bp] required by irGSEA packagemsigdb.go.bp <- msigdb %>% dplyr::filter(collection=="C5:GO:BP") %>% dplyr::select(c("geneset", "symbol"))msigdb.go.bp$geneset <- factor(msigdb.go.bp$geneset)msigdb.go.bp <- msigdb.go.bp %>% dplyr::group_split(geneset, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>% purrr::set_names(levels(msigdb.go.bp$geneset))#### GO-BP基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = msigdb.go.bp, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')# convert to list[KEGG] required by irGSEA packagemsigdb.kegg <- msigdb %>% dplyr::filter(collection=="C2:CP:KEGG_MEDICUS") %>% dplyr::select(c("geneset", "symbol"))msigdb.kegg$geneset <- factor(msigdb.kegg$geneset)msigdb.kegg <- msigdb.kegg %>% dplyr::group_split(geneset, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>% purrr::set_names(levels(msigdb.kegg$geneset))#### KEGG基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = msigdb.kegg, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')# Look for the gene sets associated with angiogenesis from gene sets names and # gene sets descriptionscategory <- c("angiogenesis", "vessel")msigdb.vessel <- list()for (i in category) { # Ignore case matching find.index.description <- stringr::str_detect(msigdb$description, pattern = regex(all_of(i), ignore_case=TRUE)) find.index.name <- stringr::str_detect(msigdb$geneset, pattern = regex(all_of(i), ignore_case=TRUE)) msigdb.vessel[[i]] <- msigdb[find.index.description | find.index.name, ] %>% mutate(category = i) }msigdb.vessel <- do.call(rbind, msigdb.vessel)head(msigdb.vessel)# # A tibble: 6 × 6# collection subcollection geneset description symbol category# <chr> <chr> <chr> <chr> <chr> <chr> # 1 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … HECW1 angioge…# 2 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … JADE2 angioge…# 3 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … SEMA3C angioge…# 4 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … STUB1 angioge…# 5 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … FAH angioge…# 6 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … COL7A1 angioge…length(unique(msigdb.vessel$geneset))# [1] 112# convert gene sets associated with angiogenesis to list # required by irGSEA packagemsigdb.vessel <- msigdb.vessel %>% dplyr::select(c("geneset", "symbol"))msigdb.vessel$geneset <- factor(msigdb.vessel$geneset)msigdb.vessel <- msigdb.vessel %>% dplyr::group_split(geneset, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>% purrr::set_names(levels(msigdb.vessel$geneset))#### 血管生成相关基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = msigdb.vessel, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')### gmt file #### download gmt filegmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb.v2023.2.Hs.symbols.gmt"local_gmt <- "./msigdb.v2023.2.Hs.symbols.gmt"download.file(gmt_url , local_gmt)msigdb <- clusterProfiler::read.gmt("./msigdb.v2023.2.Hs.symbols.gmt")# convert to list[hallmarker] required by irGSEA packagemsigdb.h <- msigdb %>% dplyr::filter(str_detect(term, pattern = regex("HALLMARK_", ignore_case=TRUE)))msigdb.h$term <- factor(msigdb.h$term)msigdb.h <- msigdb.h %>% dplyr::group_split(term, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>% purrr::set_names(levels(msigdb.h$term))#### Hallmark基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = msigdb.h, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')# convert to list[go bp] required by irGSEA packagemsigdb.go.bp <- msigdb %>% dplyr::filter(str_detect(term, pattern = regex("GOBP_", ignore_case=TRUE)))msigdb.go.bp$term <- factor(msigdb.go.bp$term)msigdb.go.bp <- msigdb.go.bp %>% dplyr::group_split(term, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>% purrr::set_names(levels(msigdb.go.bp$term))#### GO-BP基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = msigdb.go.bp, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')# convert to list[KEGG] required by irGSEA packagemsigdb.kegg <- msigdb %>% dplyr::filter(str_detect(term, pattern = regex("KEGG_", ignore_case=TRUE)))msigdb.kegg$term <- factor(msigdb.kegg$term)msigdb.kegg <- msigdb.kegg %>% dplyr::group_split(term, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>% purrr::set_names(levels(msigdb.kegg$term))#### KEGG基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = msigdb.kegg, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')
③irGSEA使用clusterProfiler包同款基因集进行打分
#### work with clusterProfiler package ##### load librarylibrary(clusterProfiler)library(tidyverse)### kegg #### download kegg pathway (human) and write as gson filekk <- clusterProfiler::gson_KEGG(species = "hsa")gson::write.gson(kk, file = "./KEGG_20231128.gson")# read gson filekk2 <- gson::read.gson("./KEGG_20231123.gson")# Convert to a data framekegg.list <- dplyr::left_join(kk2@gsid2name, kk2@gsid2gene, by = "gsid")head(kegg.list)# gsid name gene# 1 hsa01100 Metabolic pathways 10# 2 hsa01100 Metabolic pathways 100# 3 hsa01100 Metabolic pathways 10005# 4 hsa01100 Metabolic pathways 10007# 5 hsa01100 Metabolic pathways 100137049# 6 hsa01100 Metabolic pathways 10020# Convert gene ID to gene symbolgene_name <- clusterProfiler::bitr(kegg.list$gene, fromType = "ENTREZID", toType = "SYMBOL", OrgDb = "org.Hs.eg.db")kegg.list <- dplyr::full_join(kegg.list, gene_name, by = c("gene"="ENTREZID"))# remove NA value if existkegg.list <- kegg.list[complete.cases(kegg.list[, c("gene", "SYMBOL")]), ]head(kegg.list)# gsid name gene SYMBOL# 1 hsa01100 Metabolic pathways 10 NAT2# 2 hsa01100 Metabolic pathways 100 ADA# 3 hsa01100 Metabolic pathways 10005 ACOT8# 4 hsa01100 Metabolic pathways 10007 GNPDA1# 5 hsa01100 Metabolic pathways 100137049 PLA2G4B# 6 hsa01100 Metabolic pathways 10020 GNE# convert to list required by irGSEA packagekegg.list$name <- factor(kegg.list$name)kegg.list <- kegg.list %>% dplyr::group_split(name, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>% purrr::set_names(levels(kegg.list$name))head(kegg.list)### 整理好之后就不错进行KEGG打分#### KEGG基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = kegg.list, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')### go bp #### download go bp (human) and write as gson filego <- clusterProfiler::gson_GO(OrgDb = "org.Hs.eg.db", ont = "BP")gson::write.gson(go, file = "./go_20231128.gson")# read gson filego2 <- gson::read.gson("./go_20231128.gson")# Convert to a data framego.list <- dplyr::left_join(go2@gsid2name, go2@gsid2gene, by = "gsid")head(go.list)# gsid name gene# 1 GO:0000001 mitochondrion inheritance <NA># 2 GO:0000002 mitochondrial genome maintenance 142# 3 GO:0000002 mitochondrial genome maintenance 291# 4 GO:0000002 mitochondrial genome maintenance 1763# 5 GO:0000002 mitochondrial genome maintenance 1890# 6 GO:0000002 mitochondrial genome maintenance 2021# Convert gene ID to gene symbolgo.list <- dplyr::full_join(go.list, go2@gene2name, by = c("gene"="ENTREZID"))# remove NA value if existgo.list <- go.list[complete.cases(go.list[, c("gene", "SYMBOL")]), ]head(go.list)# gsid name gene SYMBOL# 2 GO:0000002 mitochondrial genome maintenance 142 PARP1# 3 GO:0000002 mitochondrial genome maintenance 291 SLC25A4# 4 GO:0000002 mitochondrial genome maintenance 1763 DNA2# 5 GO:0000002 mitochondrial genome maintenance 1890 TYMP# 6 GO:0000002 mitochondrial genome maintenance 2021 ENDOG# 7 GO:0000002 mitochondrial genome maintenance 3980 LIG3# convert to list required by irGSEA packagego.list$name <- factor(go.list$name)go.list <- go.list %>% dplyr::group_split(name, .keep = F) %>% purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>% purrr::set_names(levels(go.list$name))head(go.list)#### GO-BP基因集打分 ####pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data", custom = T, geneset = go.list, method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), kcdf = 'Gaussian')4.irGSEA的齐备历程
library(Seurat)library(SeuratData)library(RcppML)library(irGSEA)data("pbmc3k.final")# 基因集打分pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", slot = "data", seeds = 123, ncores = 1, min.cells = 3, min.feature = 0, custom = F, geneset = NULL, msigdb = T, species = "Homo sapiens", category = "H", subcategory = NULL, geneid = "symbol", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"), aucell.MaxRank = NULL, ucell.MaxRank = NULL, kcdf = 'Gaussian')# 计较各异基因集,并进行RRA# 淌若报错,讨论加句代码options(future.globals.maxSize = 100000 * 1024^5)result.dge <- irGSEA.integrate(object = pbmc3k.final, group.by = "seurat_annotations", method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"))# 稽查RRA识别的在多种打分法子中皆广大认同的各异基因集geneset.show <- result.dge$RRA %>% dplyr::filter(pvalue <= 0.05) %>% dplyr::pull(Name) %>% unique(.)可视化展示1)全局展示
①热图
你还不错把method从'RRA"换成“ssgsea”,展示特定基因集富集分析法子中各异上调或各异下调的基因集;
irGSEA.heatmap.plot <- irGSEA.heatmap(object = result.dge, method = "RRA", top = 50, show.geneset = NULL)irGSEA.heatmap.plot
图片
默许展示前50,你也不错展示你思展示的基因集。举例,我思展示RRA识别的各异基因集。irGSEA.heatmap.plot1 <- irGSEA.heatmap(object = result.dge, method = "RRA", show.geneset = geneset.show)irGSEA.heatmap.plot1
图片
②气泡图
irGSEA.bubble.plot <- irGSEA.bubble(object = result.dge, method = "RRA", show.geneset = geneset.show)irGSEA.bubble.plot
图片
③upset plot
upset图展示了轮廓评估中每个细胞亚群具有统计学意旨各异的基因集的数量,以及不同细胞亚群之间具有错乱的各异基因集数量;
irGSEA.upset.plot <- irGSEA.upset(object = result.dge, method = "RRA", mode = "intersect", upset.width = 20, upset.height = 10, set.degree = 2, pt_size = grid::unit(2, "mm"))irGSEA.upset.plot
图片
左边不同激情的条形图代表不同的细胞亚群;上方的条形图代表具有错乱的各异基因集的数量;中间的气泡图单个点代表单个细胞亚群,多个点连线代表多个细胞亚群取错乱()这里只展示两两取错乱;
④堆叠条形图
堆叠柱状图具体展示每种基因集富集分析法子中每种细胞亚群中上调、下合股莫得统计学各异的基因集数量;
irGSEA.barplot.plot <- irGSEA.barplot(object = result.dge, method = c("AUCell", "UCell", "singscore", "ssgsea", "JASMINE", "viper", "RRA"))irGSEA.barplot.plot
图片
上方的条形代表每个亚群中不同法子中各异的基因数量,红色代表上调的各异基因集,蓝色代表下调的各异基因集;中间的柱形图代表每个亚群中不同法子中上调、下合股莫得统计学意旨的基因集的比例;
2)局部展示①密度散点图
密度散点图将基因集的富集分数和细胞亚群在低维空间的投影衔尾起来,展示了特定基因集在空间上的抒发水平。
scatterplot <- irGSEA.density.scatterplot(object = pbmc3k.final, method = "UCell", show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE", reduction = "umap")scatterplot
图片
其中,激情越黄,密度分数越高,代表富集分数越高;②半小提琴图
半小提琴图同期以小提琴图(左边)和箱线图(右边)进行展示。不同激情代表不同的细胞亚群;
halfvlnplot <- irGSEA.halfvlnplot(object = pbmc3k.final, method = "UCell", show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")halfvlnplot
图片
除此除外,还不错
vlnplot <- irGSEA.vlnplot(object = pbmc3k.final, method = c("AUCell", "UCell", "singscore", "ssgsea", "JASMINE", "viper"), show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")vlnplot
图片
③山峦图山峦图中上方的核密度弧线展示了数据的主要散布,下方的条形编码图展示了细胞亚群具体的数量。不同激情代表不同的细胞亚群,而横坐标代表不同的抒发水平;
ridgeplot <- irGSEA.ridgeplot(object = pbmc3k.final, method = "UCell", show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")ridgeplot
图片
④密度热图
密度热图展示了具体各异基因在不同细胞亚群中的抒发和散布水平。激情越红,代表富集分数越高;
densityheatmap <- irGSEA.densityheatmap(object = pbmc3k.final, method = "UCell", show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")densityheatmap
图片
参考贵寓[1]AUCell: https://doi.org/10.1038/nmeth.4463
[2]UCell: https://doi.org/10.1016/j.csbj.2021.06.043
[3]singscore: https://doi.org/10.1093/nar/gkaa802
[4]ssgsea: https://doi.org/10.1038/nature08460
[5]JASMINE: https://doi.org/10.7554/eLife.71994
[6]VAM: https://doi.org/10.1093/nar/gkaa582
[7]scSE: https://doi.org/10.1093/nar/gkz601
[8]VISION: https://doi.org/10.1038/s41467-019-12235-0
[9]wsum: https://doi.org/10.1093/bioadv/vbac016
[10]wmean: https://doi.org/10.1093/bioadv/vbac016
[11]mdt: https://doi.org/10.1093/bioadv/vbac016
[12]viper: https://doi.org/10.1038/ng.3593
[13]GSVApy: https://doi.org/10.1038/ng.3593
[14]gficf: https://doi.org/10.1093/nargab/lqad024
[15]GSVA: https://doi.org/10.1186/1471-2105-14-7
[16]zscore: https://doi.org/10.1371/journal.pcbi.1000217
[17]plage: https://doi.org/10.1186/1471-2105-6-225
[18]ssGSEApy: https://doi.org/10.1093/bioinformatics/btac757
[19]viperpy: https://doi.org/10.1093/bioinformatics/btac757
[20]AddModuleScore: https://doi.org/10.1126/science.aad0501
[21]pagoda2: https://doi.org/10.1038/nbt.4038
[22]Sargent: https://doi.org/10.1016/j.mex.2023.102196
文末交流群R包斥地是学术性质,亦然公益的,是以交流群并不会诞生门票也不会有二次收费。但是但愿是一些高质地数据分析实战派小伙伴进群交流,不错是对于irGSEA的斥地提议和观点,比如其它 Functional Class Scoring (FCS)法子,其它可视化法子,其它统计学算法的植入。淌若只是是联系于irGSEA的使用法子疑问,不错平直看官方文档即可哈。
本站仅提供存储做事,整个实质均由用户发布,如发现存害或侵权实质,请点击举报。