Lisa

Video tutorial

About

Transcriptional regulators (TRs), including transcription factors and chromatin regulators, play central role in gene regulation, and are essential in many biological processes. To rank TRs given geneset, we develop a computational biology framework, Lisa, to evaluate and rank TRs effect by integrative modelling of chromatin landscapes and cistromes from Cistrome Data Browser (CistromeDB). To run Lisa online application or local command line, the only thing you need is the differential gene set derived from differential gene expression analysis of microarray and RNA-seq. Then you will be able to leverage the power of the most comprehensive TR ChIP-seq, DNase-seq, H3K27ac ChIP-seq database to discover the underlying regulatory TRs.

Lisa combines two models, epigenetics landscape in silico subtraction model and the cistrome regulatory potential model. The first model integrates epigenetics landscape with cistromes through in silico deletion strategy. First, the model will build a chromatin landscape model using the input gene set based on the comprehensive database of DNase-seq, or H3K27ac ChIP-seq. The model selects a subset of the cell type specific chromatin landscape. Then, the model will probe the effect of TRs on a gene in these landscapes by simulating deletions of chromatin signals on TR binding sites. These binding sites can come from the ChIP-seq experiment-derived data in CistromeDB, or the motif scan, or imputed TR binding sites from DREAM challenge. The second model uses only cistromes to model the TR regulatory potential on each gene. Finally, Lisa would assign every binding site profile(ChIP-seq data, motif hit profile or imputed TF binding site) a score showing essentiality of the set of transcriptional regulator binding sites by comparing the effect between all the input genes and a fixed/user-customized background genes, and combined all the statistics from both models for each TR.

Online version

Online Usage

The input fields and options is shown in the input interface(Figure 1). User can input one or two gene set by using RefSeq ids or gene official symbols, usually the two gene sets can be obtained from up and down-regulated gene set from a differential gene expression analysis,and select the species the gene set originates from, for example, here are two gene sets for MYC knockdown in brain tumors: up-regulated gene set, and down-regulated geneset. If one gene set is given, it should be placed in "Gene Set 1", and the TF ranking list of that one gene list would be provided. If two are given, an additional figure of comparison between the two gene set analysis will be generated, a typical scenario would be input up and down regulated gene set from a set of expression profiles, the label for each gene set can be customized optionally. For gene set size, we would recommend 50 to 500 genes, since few genes will lower the predictive power of chromatin profile modelling. To build a model that can best represent the cell type specific chromatin landscape, the number of genes can not be too few. By default, Lisa samples 3000 background genes for comparison, user can also specify customized background gene set. However, because in-silico deleting binding sites are time-consuming step for all the ChIP-seq profile and motif hit profile for all the input genes, the Lisa running time would suffer from the number of genes when it goes beyond 3000. By using the default backgroudn gene set based on TAD and promoter activity, we prepared a set of transcription factor perturbation experiments used as benchmarks in the paper (see in Human/Mouse Gallery). In addition, user can select method based on TR ChIP-seq peak, based on in-silico knockout method, or both. A single run takes 5-10 minutes depending on the size of the gene set, user can optionally set an email to receive the result link.

Figure 1. Lisa input interface

Online output

With two gene sets input, the MYC knockdown demo outputs the page, which lists the following results as in Figure 2:

Figure 2. Lisa result list of MYC knockdown gene set

A pairwise scatterplot for comparison of the two gene set TF rankings with combined p-values based on TF ChIP-seq(Figure 3).

Figure 3. Lisa result based on TF ChIP-seq

A pairwise scatterplot for comparison of the two gene set based on TF motif (Figure 4).

Figure 4. Lisa TF ranking based on motif

NOTE: These two figures can be saved to local disk, The x axis and y axis shows the -log10(p-value) of the TF ranking significance for either gene set. Once user moves mouse over the points, the corresponding TF ChIP-seq sample will be shown by Cistrome ID and TF name, user can see details information about the sample by searching CistromeDB with Cistrome ID.

The TF ranking table is computed by combining Peak-RP method, H3K27ac, DNase-seq in silico deletion of TF ChIP-seq peaks(Figure 5).

Figure 5. Lisa TF ranking for down-regulated gene set based on ChIP-seq

NOTE: Click on each of the p-value for TF ranking, the Cistrome dataset link will be shown below (Figure 6).

Figure 6. TF ranking linked to CistromeDB

The TF ranking table is computed by combining H3K27ac and DNase-seq in silico deletion methods (Figure 7).

Figure 7. TF ranking based on chromatin profile knockout of motif hit site

NOTE: These tables above can be exported to csv or pdf to local disk.

Clicking on the 'Download all results', all the intermediate and final results of Lisa will be downloaded as a zip file. Uncompressing the zip files, all the files are named by using job name and timestamp as the prefix, the result list are shown in the table. Except for the scatterplot, another special output from the online version of Lisa is the gene set enrichment view on WashU genome browser, user can open the links in the MYC_Lisa*browser_link.txt in the downloaded files. We randomly picked arbitrarily 20 genes from the differential gene set to generate the genome tracks for the top ranked TF ChIP-seq, selected chromatin profile ChIP-seq for both gene set 1 (e.g. up regulated) and gene set 2 (e.g. down regulated) gene set (Figure 8).

Figure 8. gene set enrichment view mode on Wash U genome browser

Local command version

Local installation

Lisa can be deployed as a command line version, which is installed using the following commands.

    
   ### Preparation of Anaconda environment
   wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
   bash Miniconda3-latest-Linux-x86_64.sh
   export PATH="${HOME}/miniconda3/bin:$PATH"
   conda create -n lisa python=3.6 && conda config --add channels conda-forge && conda config --add channels bioconda
   conda activate lisa
   
   ### Installation
   conda install -c qinqian lisa

Local usage

The Lisa framework has three pipeline represented with three sub-parsers, model, multiple_fastq2bigwig, and multiple_bigwig2hdf; The TF rank method is run through the model sub-parser.

Run all the methods

   # with fixed background gene set
   time lisa model --method="all" --web=False --new_rp_h5=None --new_count_h5=None --species hg38 --epigenome "['DNase', 'H3K27ac']" --cluster=False --covariates=False --random=True --prefix background_explore --background=dynamic_auto_tad --stat_background_number=1000 --threads 8 test.gene_symbol

   # with customized gene set
   time lisa model --method="all" --web=False --new_rp_h5=None --new_count_h5=None --species hg38 --epigenome "['DNase', 'H3K27ac']" --cluster=False --covariates=False --random=True --prefix background_explore --background=background.gene_symbol --stat_step_background_number=1000 --threads 8 test.gene_symbol

Options: The test.gene_symbol is the gene set user input, which is a text file with one official gene symbol or refseq id per line, mixture of the both is not supported yet, this is used as a prefix for output, so put the file in a directory with written permission.

--epigenome option controls the histone mark, DNase-seq or both to use in the features.

--background option can specify user provided text file containing one gene a row to compare with the target gene set, --background=None to use random selected 3000 genes.

--stat_background_number specify the background genes number for wilcoxon test, usually specify 500-1000, if the --background specify user customized gene set, all of the genes will be used from the text file.

Run only one method

run only the knockout or beta (direct from TF ChIP-seq) method: --method option can control the method to run knockout or beta method or both, use it with the option --web=True to turn off the baseline method.

Local Output

Take AR.symbol input and histone mark H3K27ac as an example. By using the above command, the key results are as follows,

Output File	Content	Instruction
AR.symbol.premodel_records.txt	regulatory potential cluster-based regression performance and coefficients	one row is PRAUC,AUC,lambda, 2nd row is coefficients
AR.symbol_motif99_baseline.csv	baseline method based on motif hit between differential & control gene	one row is motif_id\|TF_name,p-value
AR.symbol_chipseq_baseline.csv	baseline method based on chip-seq peak hit between differential & control gene	one row is CistromeID\|TF_name,p-value
AR.symbol.gene_symbol.DNase.coefs.csv	Regression coefficients for DNase	one row is CistromeID,coefficient,cell type, cell line, tissue
AR.symbol.DNase.lisa_predicted_rp.csv	predicted differential expression log probability	one row is refseq,log probability
AR.symbol.gene_symbol.H3K27ac.coefs.csv	Regression coefficients for H3K27ac	one row is CistromeID,coefficient,cell type, cell line, tissue
AR.symbol.H3K27ac.lisa_predicted_rp.csv	predicted differential expression log probability	one row is refseq,log probability
AR.symbol.H3K27ac.gene_set.performance	differential gene prediction	one row is AUC,PRAUC
AR.symbol.1000.lisa_direct.csv	BETA TF ranks based on 1000 background genes (or user customized genes)	one row is CistromeID\|TF_name,p_value
AR.symbol.3000.lisa_direct.csv	BETA TF ranks based on 3000 background genes (or user customized genes)	one row is CistromeID\|TF_name,p_value
AR.symbol.gene_symbol.DNase.motif99.p_value.csv	motif insilco knockout result for DNase	one row is motif_id\|TF_name,p-value
AR.symbol.DNase.chipseq.p_value.csv	TF ChIP-seq insilco knockout result for DNase	one row is motif_id\|TF_name,p-value
AR.symbol.gene_symbol.H3K27ac.motif99.p_value.csv	motif insilco knockout result for H3K27ac	one row is motif_id\|TF_name,p-value
AR.symbol.H3K27ac.chipseq.p_value.csv	TF ChIP-seq insilco knockout result for H3K27ac	one row is motif_id\|TF_name,p-value
AR.symbol_chipseq_cauchy_combine_raw.csv	Combined ChIP-seq-based TF p values by Cauchy combination test	one row is motif_id\|TF_name,p_value
AR.symbol_chipseq_cauchy_combine_dedup.csv	Combined ChIP-seq-based TF p values removing redundant samples	one row is motif_id\|TF_name,p_value
AR.symbol_chipseq_fisher_combine_raw.csv	Combined ChIP-seq-based TF p values by Fisher's method	one row is motif_id\|TF_name,p_value
AR.symbol_chipseq_fisher_combine_dedup.csv	Combined ChIP-seq-based TF p values removing redundant samples	one row is motif_id\|TF_name,p_value

FAQ

How many datasets are there?

The 1st version of CistromeDB is incorporated, the dataset statistics is as the table below, NOTE: the data used for current Lisa online is updated to CistromeDB (201801).

Table 1. Chromatin profile data and TF/CR ChIP-seq dataset

Category	Human	Mouse
DNase-seq	1,106	361
H3K27ac ChIP-seq	1,544	1,031
TF and CR ChIP-seq	8,471	4,363

What if we swap the gene list from input to background genes?

We recommend the user not to swap the gene lists (swapped Lisa run). Since the gene labels are changed as well, this would render symmetric changes of the sign of model coefficients and deletion-RP, thus this makes the statistics test unchanged. However, the TF rank from the peak-RP method will change to be negatively correlated with the un-swapped running since the peak-RP for a same gene is not affected by gene list swap. In total, the combined p value from both cauchy and fisher combination test will be affected unexpectedly.