Contents

1 1 Introduction

This vignette provides a description of the PMScanR package. Which can be downloaded and loaded by using following functions:

# Installation command - typically not evaluated in a vignette
# User should install before using the vignette
# devtools::install_github("prodakt/PMScanR", force = TRUE)

# Load libraries
library(PMScanR)
library(ggseqlogo)

The PMScanR package provides large-scale identification, analysis and visualization of protein motifs. The package integrates various methods to facilitate motif identification, characterization, and visualization. It includes functions for running PS-Scan, a PROSITE database tool. Additionally, PMScanR supports format conversion to GFF, enhancing downstream analyses such as graphical representation and database integration. The library offers multiple visualization tools, including heatmaps, sequence logos, and pie charts, enabling a deeper understanding of motif distribution and conservation. Through its integration with PROSITE, PMScanR provides access to up-to-date motif data.

Proteins play a crucial role in biological processes, with their functions closely related to structure. Protein functions are often associated with the presence of specific motifs, which are short, sometimes repetitive amino acid sequences essential for distinctive molecular interactions or modifications. Most of the existing bioinformatics tools focus mainly on the identification of known motifs and often do not provide during motif extraction, interactive analysis and visualization tools. Moreover, they do not take into account the effect of single variations on an entire domain or protein motif. These limitations highlight the need for a tool that can automate and scale the analysis. To address this, we have developed PMScanR, an R-based package. Designed to facilitate and automate the prediction and the evaluation of the effect of single amino acid substitutions on the occurrence of protein motifs on a large scale of both motifs and sequences. However, existing tools lacked the capability to perform comparative analysis of multiple motifs across multiple sequences, a gap that PMScanR was particularly developed to fill.

2 2 Data manipulation and overall usage

2.1 2.1 GUI

If the user prefers to perform the analysis using a graphical user interface (GUI), they can simply run the function runPMScanRShiny(). This will launch a Shiny app that opens an interactive window. The window can be used both within R and in a web browser, providing a clickable, user-friendly interface that allows the entire analysis, including visualizations, to be carried out without needing to write code. If you want you can use shiny to use all the features of the library with user friendly GUI helping to follow all the steps which are contained in this library.

# Running Shiny apps is interactive, usually set eval=FALSE in vignettes
# runPMScanRShiny()

2.2 2.2 Command Line

Alternatively, if the user wishes to work directly with the code, the library provides a set of functions to perform the full analysis, including protein motif identification and visualization. This can be done through an R script, where users can execute and customize the analysis programmatically. Each function included in the package is described below, along with an explanation of its purpose and functionality.

2.2.1 2.2.1 List of functions and their description

The following list of commands provides a step-by-step description of the functions that ensure the complete analysis provided by the PMScanR package.

The first step of the analyses is to establish the working environment through the use of functions:

# Setting working directory is user-specific, usually eval=FALSE in vignettes
# setwd("disc:/your/path/to/working/directory")

The next step is to load the data for the analysis (more example input files are given in the repository on Github, and below is one example file just from this repository):

# Load example FASTA file included with the package
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")
# Check if file exists (optional, good practice)
if (!file.exists(fasta_file)) {
  warning("Example FASTA file not found. Vignette examples may fail.")
}

Once the previous steps have been completed, you can move on to the relevant part of the analysis. Below is a description of the function, together with an example of its use.

2.2.1.1 runPsScan()

A function that allows ps_scan to be run from the user’s operating system, they receive an output file containing information about the protein motifs - used for further analysis. This function requires the user to provide values for three arguments: in_file=, out_format=, and out_file=. These are essential for the function to run. If other arguments use their default values, the function first detects the operating system. A message is displayed to the user, asking for confirmation that the detected operating system is correct (yes/no response). If the detected system is incorrect, the user will be prompted to provide the correct value for the OS= argument. The function then proceeds to automatically download necessary files from the PROSITE database to perform the analysis. Users can bypass this automatic download feature by providing their own files, downloaded from the PROSITE website.

# This requires ps_scan to be installed and configured,
# and might download data, so often set eval=FALSE in automated checks/vignettes.
# Set eval=TRUE if you have a small example and ps_scan configured in the build environment.
# runPsScan(in_file = fasta_file, out_format = 'gff', out_file = "results_pfscan.gff")

This function is an automatic form of analysis loading, which searches the sequence for protein motifs present on it, but there is also an alternative option where the user can manually point the path to each of the files and the version of the operation system.

After analysis, which finally searched for protein motifs located on the input file sequence, the two functions below can be used to convert the output files into GFF format:

2.2.1.2 read.prosite()

A function that enables an input file in prosite format to be converted into a GFF format file, which is a more universal format used for analysis and visualisation such as heatmap or pie chart.

## Path to an actual output file (user specific)
# motifs_prosite <- "out_Hb_psa.txt"

2.2.1.3 read.psa()

A function that enables an input file in psa format to be converted into a GFF format file, which is a more universal format used for analysis and visualisation such as heatmap or pie chart. You can point the path to the output file generated by the actual analysis:

# Path to an actual output file (user specific)
# motifs_psa <- "out_Hb_psa.txt"

Or to the file attached in the examples:

# Load example PSA output file included with the package
motifs_psa_file <- system.file("extdata", "out_Hb_psa.txt", package = "PMScanR")

# Check if file exists
if (file.exists(motifs_psa_file)) {
  psaGFF <- read.psa(motifs_psa_file)
  # Optional: display structure or head if needed
  # str(psaGFF)
  # head(psaGFF)
} else {
  warning("Example PSA file not found. Downstream examples may fail.")
  # Create a placeholder empty data frame if needed for vignette to build
  psaGFF <- data.frame(seqid=character(), source=character(), type=character(),
                       start=integer(), end=integer(), score=numeric(),
                       strand=character(), phase=character(), attributes=character(),
                       stringsAsFactors=FALSE)
}

Once the analysis output files have been converted to GFF format, the next step is to create the matrix from this file needed for further visualisation in the form of heatmap generation. That can me make by using:

2.2.1.4 gff2matrix()

Function used when the user receives a file in GFF format. this function enables the creation of a matrix of the occurrence from a GFF file of all the motifs in the entire sequence (or only those selected by the user), which contains 0/1 digits so even for very large-scale analyses it should not take up much disk space and is easy to visualise and analyse in any external program in which tables can be imported as calculation sheets.

# 'psaGFF' is the GFF-like data frame from PSA conversion
# Check if psaGFF has data before proceeding
if (nrow(psaGFF) > 0) {
 mom <- gff2matrix(psaGFF)
 # Display the first few rows of the Motif Occurrence Matrix (MOM)
 head(mom)
} else {
 warning("Input GFF data is empty. Cannot create matrix.")
 mom <- matrix(nrow=0, ncol=0) # Placeholder
}
## Aggregation function missing: defaulting to length
##                 NP_000175.1 NP_000508.1 NP_000549.1 NP_000550.2 NP_001138679.1
## PS00001:125-128           0           0           0           0              0
## PS00001:148-151           0           0           0           0              0
## PS00001:152-155           0           0           0           0              0
## PS00001:177-180           0           0           0           0              0
## PS00001:182-185           0           0           0           0              0
## PS00001:189-192           0           0           0           0             11
##                 NP_001305067.1 NP_001305150.1 NP_001350615.1 NP_071441.1
## PS00001:125-128             10              0              0           0
## PS00001:148-151             10              0              0           0
## PS00001:152-155             10              0              0           0
## PS00001:177-180              0              0              0           0
## PS00001:182-185             10              0              0           0
## PS00001:189-192              0              0              0           0
##                 NP_599030.1
## PS00001:125-128           0
## PS00001:148-151           0
## PS00001:152-155           0
## PS00001:177-180          11
## PS00001:182-185           0
## PS00001:189-192           0

After using this function a heatmap can be generate by using:

2.2.1.5 matrix2hm()

Once the matrix of the occurrence is obtained, this function is used to generate a heatmap, which is one of the options for visualising user data. This function provides the ability to adjust the plot size - this facilitates comparative analysis. By using the plotly library, the plots are interactive, allowing both extensive manipulation of the plot area, but also precise identification of individual results. The code shows an example of usage for creating a heatmap using matrix2hm() - from PSA-converted Matrix: The matrix2hm() function (assumed from PMScanR) generates a heatmap from a motif occurrence matrix.

# This example creates a heatmap from the MOM derived from PSA-converted GFF data ('mom').
# Check if mom has data
if (nrow(mom) > 0 && ncol(mom) > 0) {
  hm1 <- matrix2hm(x = colnames(mom), # 'x' argument specifies columns (motifs)
                   y = row.names(mom), # 'y' argument specifies rows (sequences)
                   input = mom)        # 'input' argument takes the MOM
  # Display the heatmap 'hm1' (might require explicit print for plotly in vignettes)
  print(hm1)
} else {
  warning("Motif matrix is empty. Cannot generate heatmap.")
}

2.2.1.6 matrix2hm_2()

It’s also the function used after receiving the matrix of the occurrence to generate the heatmap, but differing from the above heatmap dimension to allow easier identification of individual variations. By using the plotly library, the plots are interactive, allowing both extensive manipulation of the plot area, but also precise identification of individual results. The matrix2hm_2() function (assumed from PMScanR) is another function for heatmap generation, possibly with a different style or options compared to matrix2hm().

# Example of usage for creating a heatmap using matrix2hm_2() - from PSA-converted Matrix:
# This example uses 'matrix2hm_2()' to visualize the MOM from PSA-converted data ('mom').
# Check if mom has data
if (nrow(mom) > 0 && ncol(mom) > 0) {
  hm2 <- matrix2hm_2(x = colnames(mom),
                     y = row.names(mom),
                     input = mom)
  # Display heatmap 'hm2'
  print(hm2)
} else {
  warning("Motif matrix is empty. Cannot generate heatmap 2.")
}

Heatmap is the first option to visualise the data (shown above), the next option is to generate a seqlogo, after preparing the protein motifs for their generation using the functions described below:

2.2.1.7 extract_segments()

Function used for another type of data visualisation which is a seqlogo generated based on the frequency of occurrence of individual amino acid residues in a selected part of the sequence, that is the defined region in selected sequences. Function used for a fasta format input file. This function extracts the protein motifs present on the sequence thus preparing the file for seqlogo creation.

# Starting position of the region
from_pos <- 10
# Ending position of the region
to_pos <- 20

# Read the FASTA file containing protein sequences (loaded earlier)
# Check if fasta_file path is valid and file exists
if (exists("fasta_file") && file.exists(fasta_file)) {
  seq <- seqinr::read.fasta(file = fasta_file, seqtype = "AA")
  # Extract segments from position 'from_pos' to 'to_pos' for all sequences
  # Ensure the function exists and seq has content
  if (length(seq) > 0) {
     seqShort <- extract_segments(seq = seq, from = from_pos, to = to_pos)
     # Check if seqShort has content before plotting
     if (length(unlist(seqShort)) > 0) {
        ggseqlogo::ggseqlogo(unlist(seqShort), seq_type = "aa")
     } else {
        warning("No segments extracted, cannot generate seqlogo.")
     }
  } else {
     warning("FASTA sequence data is empty.")
  }
} else {
  warning("Example FASTA file path not found or invalid.")
}

2.2.1.8 extract_protein_motifs()

Function used for another type of data visualisation which is a seqlogo generated based on the frequency of occurrence of individual amino acid residues in a selected part of the sequence, that is the defined region in selected sequences. The function used for the input file in PSA format, which extracts the protein motifs present on the sequence in the file, thus preparing them for data visualisation using seqlogo. Example of usage for creating sequence logo for a selected region from PSA:

Firstly start with defining the start and end position of the region you want to visualize as a sequence logo. (Note: Example doesn’t define start/end, it extracts motifs from PSA)

# Check if motifs_psa_file exists
if (exists("motifs_psa_file") && file.exists(motifs_psa_file)) {
  protein_motifs_psa <- extract_protein_motifs(motifs_psa_file)
  # Check results before plotting
  if (length(protein_motifs_psa) >= 1) {
     ggseqlogo::ggseqlogo(protein_motifs_psa[[1]], seq_type = 'aa')
  } else {
     warning("No motifs extracted or result format unexpected (expected list).")
  }
  if (length(protein_motifs_psa) >= 5) {
     ggseqlogo::ggseqlogo(protein_motifs_psa[[5]], seq_type = 'aa')
  } else {
     warning("Fewer than 5 motifs extracted.")
  }
} else {
  warning("Example PSA file path not found or invalid.")
}

2.2.1.9 freqPie()

There is also a visualization that generates a pie chart of the frequency of each motif type from the GFF format file containing information about the motif names and their locations. The pie chart shows the percentage of each motif/protein motif type in the analyzed dataset. This function uses the ggplot2 library. The pie chart allows a quick assessment of motif frequencies. If you want to run a pie chart visualization use the function named freqPie().

# Example usage would require the psaGFF object from earlier
# Check if psaGFF has data
# if (exists("psaGFF") && nrow(psaGFF) > 0) {
#   # Assuming freqPie takes the GFF data frame as input
#   pie_chart <- freqPie(psaGFF)
#   print(pie_chart)
# } else {
#   warning("GFF data is empty. Cannot generate pie chart.")
# }
# Setting eval=FALSE as freqPie function definition/usage isn't fully clear from text

3 References

Appendix

Sigrist C.J.A., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. (2012). New and continuing developments at PROSITE. Nucleic Acids Res.

A Session Information

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] seqinr_4.2-36    ggseqlogo_0.2    PMScanR_0.99.3   BiocStyle_2.37.0
## 
## loaded via a namespace (and not attached):
##  [1] plotly_4.10.4       sass_0.4.10         generics_0.1.4     
##  [4] tidyr_1.3.1         stringi_1.8.7       digest_0.6.37      
##  [7] magrittr_2.0.3      evaluate_1.0.3      grid_4.5.0         
## [10] RColorBrewer_1.1-3  bookdown_0.43       fastmap_1.2.0      
## [13] plyr_1.8.9          jsonlite_2.0.0      tinytex_0.57       
## [16] promises_1.3.3      BiocManager_1.30.25 httr_1.4.7         
## [19] purrr_1.0.4         crosstalk_1.2.1     viridisLite_0.4.2  
## [22] scales_1.4.0        ade4_1.7-23         lazyeval_0.2.2     
## [25] jquerylib_0.1.4     cli_3.6.5           shiny_1.10.0       
## [28] rlang_1.1.6         withr_3.0.2         cachem_1.1.0       
## [31] yaml_2.3.10         tools_4.5.0         reshape2_1.4.4     
## [34] dplyr_1.1.4         ggplot2_3.5.2       httpuv_1.6.16      
## [37] vctrs_0.6.5         R6_2.6.1            mime_0.13          
## [40] magick_2.8.6        lifecycle_1.0.4     stringr_1.5.1      
## [43] fs_1.6.6            shinyFiles_0.9.3    htmlwidgets_1.6.4  
## [46] MASS_7.3-65         pkgconfig_2.0.3     pillar_1.10.2      
## [49] bslib_0.9.0         later_1.4.2         gtable_0.3.6       
## [52] glue_1.8.0          data.table_1.17.4   Rcpp_1.0.14        
## [55] xfun_0.52           tibble_3.2.1        tidyselect_1.2.1   
## [58] knitr_1.50          dichromat_2.0-0.1   xtable_1.8-4       
## [61] farver_2.1.2        bsicons_0.1.2       htmltools_0.5.8.1  
## [64] labeling_0.4.3      rmarkdown_2.29      compiler_4.5.0