FAQ

Data and web-server version

Code version 2024_05_31_v2
Data version 2024_05_31_v2
Code https://github.com/CCB-SB/plsdb
Tested browsers 125.0.6422.141, 127.0.2, 18615.3.12.11.2

Download

All relevant data files can be downloaded here. The provided archive also includes a README with information on the other included files and instructions on how to use the Mash sketches and BLAST database locally.

Publication

For more information about the resource, please read our updated publication and original publication.

Georges P Schmartz, Anna Hartung, Pascal Hirsch, Fabian Kern, Tobias Fehlmann, Rolf Müller, Andreas Keller; PLSDB: advancing a comprehensive database of bacterial plasmids, Nucleic Acids Res., 2021 Nov 25, doi: 10.1093/nar/gkab1111

Valentina Galata, Tobias Fehlmann, Christina Backes, Andreas Keller; PLSDB: a resource of complete bacterial plasmids, Nucleic Acids Res., 2018 Oct 31, doi: 10.1093/nar/gky1050

Please note that some pipeline steps may have changed to resolve issues that arose during data updates. Modifications can include updated versions of the tools and data used for plasmid annotation and minor changes in the processing steps. For more information, see the code repository referenced above, which also contains a file listing the most important changes in the pipeline. Additionally, the download includes a file listing the plasmid records that were removed, added, or changed with respect to the previous version.

Data Pipeline

Data Collection
- Nuccore
  Records are retrieved from the NCBI Nucleotide database (INSDC: DDBJ, EMBL/ENA, GenBank; and RefSeq) with Entrez Direct v.16.2 (data version 2024_05_31_v2) using the following query (from Orlek et al): biomol_genomic[PROP] AND plasmid[FILT] AND (bacteria[FILT] OR archaea[FILT]) NOT complete cds[TITL] NOT gene[TITL] NOT genes[TITL] NOT contig[TITL] NOT scaffold[TITL] NOT whole genome map[TITL] NOT partial sequence[TITL] NOT (partial[TITL] AND plasmid[TITL]) NOT locus[TITL] NOT region[TITL] NOT fragment[TITL] NOT integron[TITL] NOT transposon[TITL] NOT insertion sequence[TITL] NOT insertion element[TITL] NOT phage[TITL] NOT operon[TITL]. The query retrieves sequences that carry plasmid as their location tag and are assigned to bacterial or archaeal organisms, while discarding sequences whose title matches any of the specified NOT terms. The document summary was fetched for each hit and subsequently linked to the corresponding Biosample and Assembly records, when available.
- Biosample
  Records are retrieved from the NCBI Biosample database. Biosample attributes are harmonized by unifying the attribute names according to the Biosample Package guidelines. The list of Biosample attributes currently incorporated into PLSDB can be found here.
- Assembly
  Records are retrieved from the NCBI Assembly database.
- Taxonomy
  Records are retrieved from the NCBI Taxonomy database.
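
As an illustration of how the nuccore query above is composed, the following minimal sketch rebuilds the same query string from a base clause and a list of excluded title terms. The names `BASE_CLAUSE`, `EXCLUDED_TITLE_TERMS`, and `build_nuccore_query` are hypothetical and are not taken from the PLSDB code.

```python
# Sketch only: reassembles the query string quoted in the Nuccore entry.
# All names below are illustrative assumptions, not PLSDB identifiers.

BASE_CLAUSE = (
    "biomol_genomic[PROP] AND plasmid[FILT] "
    "AND (bacteria[FILT] OR archaea[FILT])"
)

# Title terms excluded by the query, in the order they appear in the text.
EXCLUDED_TITLE_TERMS = [
    "complete cds", "gene", "genes", "contig", "scaffold",
    "whole genome map", "partial sequence",
    "(partial[TITL] AND plasmid[TITL])",  # compound clause, kept verbatim
    "locus", "region", "fragment", "integron", "transposon",
    "insertion sequence", "insertion element", "phage", "operon",
]


def build_nuccore_query(base=BASE_CLAUSE, excluded=EXCLUDED_TITLE_TERMS):
    """Join the base clause with one NOT clause per excluded title term."""
    parts = [base]
    for term in excluded:
        # Parenthesised entries already carry their own field tags.
        clause = term if term.startswith("(") else f"{term}[TITL]"
        parts.append(f"NOT {clause}")
    return " ".join(parts)


query = build_nuccore_query()
```

The resulting string could then be passed to an Entrez Direct search against the nuccore database; how PLSDB invokes Entrez Direct is not described here, so that step is left out.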

Plasmid Filtering
- To remove non-plasmid or incomplete plasmid sequences, the regular expression from Orlek et al was used to scan the NUCCORE_Description: contig|\sgene(?!tic|ral|rat|ric)|integron|transposon|scaffold|insertion sequence|insertion element|phage|operon|partial sequence|partial plasmid|region|fragment|locus|complete (?!sequence|genome|plasmid|\.|,)|(?<!complete sequence, )whole genome shotgun|artificial|synthetic|vector.
- If assembly information was available, only records with the 'completeness', 'latest', and 'non-anomalous' assembly status are retained.
- If no completeness tag was associated with the record, only the assembly tag was used, and vice versa. Only non-empty tags were used to remove records.
- Preference for RefSeq records: when prioritizing data sources, RefSeq records take precedence over those from INSDC repositories.
- Inclusion of additional information: records enriched with supplementary details, such as geographical location, Biosample, and Assembly information, are favored.
- Records with a more recent assembly release date, a more recent nucleotide creation date, and the highest coverage are preferred.
- A list of putative chromosomal candidates was created by an in silico rMLST analysis, i.e. searching for the 53 rps genes downloaded from PubMLST, using Mash distances (Mash version 2.3; mash sketch -S 123 -k 21 -s 1000 -i; mash dist -d 0.00123693).
- The candidates were further compared against a local chromosomal subset of the NCBI Nucleotide database, retrieved using the query: (Bacteria OR Archaea) NOT plasmid [FILT] NOT complete cds[TITL] NOT gene[TITL] NOT genes[TITL] NOT contig[TITL] NOT scaffold[TITL] NOT whole genome map[TITL] NOT partial sequence[TITL] NOT locus[TITL] NOT region[TITL] NOT fragment[TITL] NOT integron[TITL] NOT transposon[TITL] NOT insertion sequence[TITL] NOT insertion element[TITL] NOT phage[TITL] NOT operon[TITL] NOT whole genome shotgun[TITL] NOT assembly[TITL]. Data available up to 2024_05_31_v2 were included.
- To further screen for false positives, only candidates with more than 5 unique rps genes and at least one hit in the chromosomal dataset with at least 99% identity and 80% query coverage were considered chromosomal sequences and excluded from the plasmid collection (BLAST version 2.15.0; blastn -task megablast -perc_identity 99 -qcov_hsp_perc 80 -evalue 0.05 -max_target_seqs 10 -max_hsp 10).
- To detect and discard biologically implausible plasmids, plasmids with outlier values of GC content or log10 sequence length were submitted to manual inspection. Values were considered outliers according to the interquartile range criterion (value < Q0.25 - 1.5*IQR or value > Q0.75 + 1.5*IQR).
- Candidate anomalous plasmids with a total absence of plasmid typing (including replicon, relaxase, mating pair formation, origin of transfer, and pMLST) and a total absence of annotated genetic elements (i.e. genes, insertion sequences) were suppressed.
- Additionally, candidate anomalous plasmids with a sequence length < 1 kbp or > 4 Mbp were suppressed.
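
Several of the filtering rules above can be expressed compactly in code. The sketch below is not the PLSDB implementation. It shows the description regex, the rule for putative chromosomal sequences, the interquartile range criterion, and the length bounds under stated assumptions: the regex is applied case-sensitively as written, quartiles are computed with Python's inclusive method, and "4MB" is read as 4,000,000 bp. All function names are hypothetical.

```python
import re
import statistics

# Regex quoted in the text (from Orlek et al). Case-sensitive matching is an
# assumption; the page does not state which flags were used.
DESCRIPTION_FILTER = re.compile(
    r"contig|\sgene(?!tic|ral|rat|ric)|integron|transposon|scaffold|"
    r"insertion sequence|insertion element|phage|operon|partial sequence|"
    r"partial plasmid|region|fragment|locus|"
    r"complete (?!sequence|genome|plasmid|\.|,)|"
    r"(?<!complete sequence, )whole genome shotgun|artificial|synthetic|vector"
)


def passes_description_filter(description):
    """True if the nuccore description matches none of the excluded patterns."""
    return DESCRIPTION_FILTER.search(description) is None


def is_putative_chromosome(n_unique_rps_genes, chromosomal_hits):
    """Apply the rule: more than 5 unique rps genes and at least one hit
    with >= 99% identity and >= 80% query coverage.
    chromosomal_hits is an iterable of (percent_identity, query_coverage)."""
    has_hit = any(p >= 99 and q >= 80 for p, q in chromosomal_hits)
    return n_unique_rps_genes > 5 and has_hit


def iqr_outlier_flags(values):
    """Flag values below Q0.25 - 1.5*IQR or above Q0.75 + 1.5*IQR.
    The quartile method ('inclusive') is an assumption. For sequence
    length, pass log10-transformed lengths as described in the text."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < low or v > high for v in values]


def length_out_of_bounds(length_bp, min_bp=1_000, max_bp=4_000_000):
    """True for sequences shorter than 1 kbp or longer than 4 Mbp (assumed units)."""
    return length_bp < min_bp or length_bp > max_bp
```

In the pipeline described above, outlier flags only nominate plasmids for manual inspection; they do not remove records by themselves.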

Ecosystem and Disease Identification
Each Biosample record linked to a plasmid entry provides a rich source of information for ecosystem and disease inference.

Ecosystems are inferred from the following Biosample attributes: host, host_taxid, host_common_name, host_animal_breed, animal_env, local_class, soil_type, metagenome_source, samp_mat_type, source_type, host_tissue_sampled, tissue, host_body_habitat, isolation_source, env_medium, env_broad_scale, env_local_scale. Each record is automatically scanned to identify potential host-associated and environmental ecosystems. Host-associated ecosystems are determined using NCBI Taxonomy and ETE4 (version 4.1.0-beta).

Diseases were classified according to Disease and Symptom Ontologies.

Following automatic classification, the identified diseases and ecosystems undergo manual curation to validate and refine the classification. When a sample is associated with multiple habitats or diseases, all corresponding habitats and diseases are retained.

Geographical information
Geographical information for the plasmid collection was retrieved from the BioSample location name (geo_loc_name) or coordinates (lat_lon), if available. When both attributes were provided, coordinates were preferred. Attributes were geocoded with OpenStreetMap through the Geopy API and further compared with Bing Maps. Manual inspection was performed when discrepancies between OpenStreetMap and Bing Maps arose.
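
The preference for coordinates over location names can be sketched as follows. This is an illustrative example rather than the PLSDB code: it assumes lat_lon values follow the BioSample convention of decimal degrees with hemisphere letters (for example "38.98 N 77.11 W"), and the set of placeholder values treated as missing is an assumption. The function names are hypothetical.

```python
import re

# BioSample lat_lon convention: "<lat> N|S <lon> E|W" in decimal degrees.
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)

# Placeholder strings treated as "no value" (assumed list, not from PLSDB).
MISSING = {"", "missing", "not collected", "not applicable", "na", "n/a"}


def parse_lat_lon(value):
    """Return (latitude, longitude) as signed floats, or None if unparsable."""
    match = _LAT_LON.match(value or "")
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    lat = float(lat) * (-1 if ns.upper() == "S" else 1)
    lon = float(lon) * (-1 if ew.upper() == "W" else 1)
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


def location_query(attributes):
    """Pick what to geocode: coordinates first, then the location name."""
    coords = parse_lat_lon(attributes.get("lat_lon"))
    if coords is not None:
        return ("coordinates", coords)
    name = (attributes.get("geo_loc_name") or "").strip()
    if name.lower() not in MISSING:
        return ("name", name)
    return None
```

The returned tuple could then be handed to a geocoder, for instance reverse geocoding for coordinates and forward geocoding for names with geopy's Nominatim class; the exact calls used by PLSDB are not documented here.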

Plasmid Annotation

The data processing pipeline makes use of the PubMLST website developed by Keith Jolley (Jolley & Maiden 2010, BMC Bioinformatics, 11:595) and sited at the University of Oxford; the development of that website was funded by the Wellcome Trust.

Included tools for sequence search

Mash (Mash paper, Mash screen pre-print, repository)
Version used by the web-server: 2.3
CMD: mash sketch -S 123 -k 21 -s 1000{individually} -o {query_msh_noext} {query_fa} && mash dist {plasmids_msh} {query_msh} -v {max_pvalue} -d {max_dist} > {output}
CMD: mash screen {plasmids_msh} {query_fa} -v {max_pvalue} -i {min_ident}{winner_takes_all} > {output}
BLASTn (official website)
Version used by the web-server: 2.14.1
CMD:
- blastn -query {query_fa} -task blastn -db {plasmids_db} -out {output} -evalue 1 -perc_identity {min_ident} -qcov_hsp_perc {min_cov} -outfmt '6 qseqid sseqid qstart qend sstart send evalue bitscore pident qcovs qcovhsp'
- tblastn -query {query_fa} -task tblastn -db {plasmids_db} -out {output} -evalue 1 -qcov_hsp_perc {min_cov} -outfmt '6 qseqid sseqid qstart qend sstart send evalue bitscore pident qcovs qcovhsp'
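
For local use of the BLAST database, the blastn command template above can be filled in and its tabular output parsed as in the following sketch. The column list is copied from the -outfmt string shown above; the function names and the choice to return plain dictionaries are assumptions made for illustration.

```python
# Column order taken from the -outfmt string in the blastn command above.
BLAST_COLUMNS = (
    "qseqid sseqid qstart qend sstart send "
    "evalue bitscore pident qcovs qcovhsp"
).split()

_INT_FIELDS = {"qstart", "qend", "sstart", "send"}
_FLOAT_FIELDS = {"evalue", "bitscore", "pident", "qcovs", "qcovhsp"}


def blastn_argv(query_fa, plasmids_db, output, min_ident, min_cov):
    """Fill the blastn template as an argument list (no shell quoting needed)."""
    return [
        "blastn", "-query", query_fa, "-task", "blastn",
        "-db", plasmids_db, "-out", output, "-evalue", "1",
        "-perc_identity", str(min_ident),
        "-qcov_hsp_perc", str(min_cov),
        "-outfmt", "6 " + " ".join(BLAST_COLUMNS),
    ]


def parse_blast_table(lines):
    """Parse tab-separated BLAST output lines into a list of dictionaries."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        for key in _INT_FIELDS:
            hit[key] = int(hit[key])
        for key in _FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits
```

The argument list can be run with `subprocess.run(argv, check=True)` when BLAST+ is installed, after which the output file can be opened and passed to `parse_blast_table`.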

Used libraries

Bootstrap v.5.1.3
Font Awesome icons
Highcharts
Krona
underscore.js
jquery.auto-complete.js
Kablammo

Support

If you encounter any problems or have questions about the data, feel free to open an issue. When reporting an error, please specify the web server version used, the error message, and a description of your request (e.g. job ID and input data).

Disclaimer

This website is provided "as is" without any warranties.

Uploaded data

If you submit data for sequence search, we assume that you have the right to upload this data to this web server. The uploaded sequences are stored only temporarily and are deleted upon job completion.
