Fred2.IO Module

IO.ADBAdapter

IO.EnsemblAdapter

IO.FileReader

Fred2.IO.FileReader.read_annovar_exonic(annovar_file, gene_filter=None, experimentalDesig=None)

Reads a gene-based ANNOVAR output file, generates Variant objects containing all annotated Transcript ids, and outputs a list of Variant objects.

Parameters:
  • annovar_file (str) – The path to the ANNOVAR file
  • gene_filter (list(str)) – A list of gene names of interest (only variants associated with these genes are generated)
Returns:

List of fully annotated Variant objects (Fred2.Core.Variant.Variant)

Return type:

list(Variant)

Fred2.IO.FileReader.read_fasta(files, in_type=<class 'Fred2.Core.Peptide.Peptide'>, id_position=1)

Generator function:

Reads one or more peptide, protein or RNA sequences from one or more FASTA files. The user needs to specify the correct type of the underlying sequences. It can be one of: Peptide, Protein or Transcript (for RNA).

Parameters:
  • files (list(str) or str) – A file name or a list of file names to read in
  • in_type (Peptide or Transcript or Protein) – The type to read in
  • id_position (int) – The position of the id within the FASTA header, counted in fields separated by |

Returns:

A list of objects of the specified sequence type, derived from the sequences in the FASTA file.

Return type:

(list(in_type))

Raises:

ValueError – if a file is not readable
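
The id_position parameter is easiest to see on a concrete header. The sketch below is not Fred2's implementation. It assumes that the header is split on "|" and that id_position is a zero-based index into the resulting fields, so that the default of 1 would pick the accession in a UniProt-style header. The helper name extract_id and the fallback to the whole header are hypothetical.

```python
def extract_id(header, id_position=1):
    """Return the '|'-separated header field at id_position.

    Assumption: id_position is a zero-based index into the fields.
    """
    text = header.lstrip(">")
    fields = text.split("|")
    try:
        return fields[id_position]
    except IndexError:
        # Assumption: fall back to the full header if the field is missing.
        return text


print(extract_id(">sp|P04637|P53_HUMAN"))
print(extract_id(">my_peptide"))
```

If the real reader counts fields differently, only the index arithmetic in this sketch changes; the idea of selecting one field of the header as the object's id stays the same.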

Fred2.IO.FileReader.read_lines(files, in_type=<class 'Fred2.Core.Peptide.Peptide'>)

Generator function:

Reads a sequence directly from each line. The user needs to manually specify the correct type of the underlying data. It can be one of: Peptide, Protein, Transcript or Allele.

Parameters:
  • files (list(str) or str) – A file name or a list of file names to read in
  • in_type – The type to read in

Returns:

A list of the specified objects

Return type:

(list(in_type))

Raises:

IOError – if a file is not readable

Fred2.IO.FileReader.read_vcf(vcf_file, gene_filter=None, experimentalDesig=None)

Reads a VCF v4.0 or v4.1 file, generates Variant objects containing all annotated Transcript ids, and outputs a list of Variant objects. Only variants with one of the following labels are considered by the reader, and variants labeled as synonymous are not integrated into any variant: filter_variants = ['missense_variant', 'frameshift_variant', 'stop_gained', 'missense_variant&splice_region_variant', 'synonymous_variant', 'inframe_deletion', 'inframe_insertion']

Parameters:
  • vcf_file (str) – The path to the VCF file
  • gene_filter (list(str)) – A list of gene names of interest (only variants associated with these genes are generated)
Returns:

List of fully annotated Variant objects (Fred2.Core.Variant.Variant)

Return type:

Tuple of (list(Variant), list(transcript_ids))
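
The filtering described for read_vcf can be sketched as a small predicate. This is an illustration of the documented rules only, not the reader's actual code: the function name keep_annotation, its arguments, and the sample annotations are hypothetical, and it assumes gene_filter is applied as a simple membership test on the gene name.

```python
# Consequence labels listed in the read_vcf description.
FILTER_VARIANTS = {
    "missense_variant",
    "frameshift_variant",
    "stop_gained",
    "missense_variant&splice_region_variant",
    "synonymous_variant",
    "inframe_deletion",
    "inframe_insertion",
}


def keep_annotation(consequence, gene, gene_filter=None):
    """Decide whether one annotation would contribute to a Variant."""
    if consequence not in FILTER_VARIANTS:
        return False
    if consequence == "synonymous_variant":
        # Recognised by the reader, but not integrated into any variant.
        return False
    if gene_filter and gene not in gene_filter:
        # Assumption: gene_filter is a plain list of gene names.
        return False
    return True


# Hypothetical (consequence, gene) pairs for illustration.
annotations = [
    ("missense_variant", "TP53"),
    ("synonymous_variant", "TP53"),
    ("intron_variant", "TP53"),
    ("stop_gained", "BRCA1"),
]
kept = [a for a in annotations if keep_annotation(a[0], a[1], gene_filter=["TP53"])]
print(kept)
```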

IO.MartsAdapter

class Fred2.IO.MartsAdapter.MartsAdapter(usr=None, host=None, pwd=None, db=None, biomart=None)

Bases: Fred2.IO.ADBAdapter.ADBAdapter

get_all_variant_gene(locations, _db='hsapiens_gene_ensembl', _dataset='gene_ensembl_config')

Fetches the important db ids and names for given chromosomal location

Parameters:
  • chrom (int) – Integer value of the chromosome in question
  • start (int) – Integer value of the variation start position on given chromosome
  • stop (int) – Integer value of the variation stop position on given chromosome
Returns:

The respective gene name, i.e. the first one reported

get_all_variant_ids(**kwargs)

Fetches the important db ids and names for the given gene _or_ chromosomal location. The former is recommended. The result is a list of dicts with one of the three combinations:

  • ‘Ensembl Gene ID’, ‘Ensembl Transcript ID’, ‘Ensembl Protein ID’
  • ‘RefSeq Protein ID [e.g. NP_001005353]’, ‘RefSeq mRNA [e.g. NM_001195597]’, first triplet
  • ‘RefSeq Predicted Protein ID [e.g. XP_001720922]’, ‘RefSeq mRNA predicted [e.g. XM_001125684]’, first triplet
Parameters:
  • 'locations' – list of locations as triplets of integer values representing (chrom, start, stop)
  • 'genes' – list of genes as string value of the genes of variation
Returns:

The list of dicts of entries with transcript and protein ids (either NM+NP or XM+XP)
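
Because each result dict follows one of the three documented key combinations, calling code typically has to check which one it received. The following sketch shows one way to do that using the key names listed above. The function classify_entry and the placeholder values are hypothetical, and the sketch assumes "first triplet" means the three Ensembl keys are present in every entry.

```python
ENSEMBL_KEYS = ("Ensembl Gene ID", "Ensembl Transcript ID", "Ensembl Protein ID")
REFSEQ_KEYS = (
    "RefSeq Protein ID [e.g. NP_001005353]",
    "RefSeq mRNA [e.g. NM_001195597]",
)
PREDICTED_KEYS = (
    "RefSeq Predicted Protein ID [e.g. XP_001720922]",
    "RefSeq mRNA predicted [e.g. XM_001125684]",
)


def classify_entry(entry):
    """Name the documented key combination that an entry matches, or None."""
    if not all(key in entry for key in ENSEMBL_KEYS):
        return None
    if all(key in entry for key in REFSEQ_KEYS):
        return "refseq"
    if all(key in entry for key in PREDICTED_KEYS):
        return "refseq_predicted"
    return "ensembl"


# Placeholder values, not real database identifiers.
example = {
    "Ensembl Gene ID": "GENE_ID",
    "Ensembl Transcript ID": "TRANSCRIPT_ID",
    "Ensembl Protein ID": "PROTEIN_ID",
    "RefSeq Protein ID [e.g. NP_001005353]": "NP_ID",
    "RefSeq mRNA [e.g. NM_001195597]": "NM_ID",
}
print(classify_entry(example))
```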

get_ensembl_ids_from_id(gene_id, **kwargs)

Returns a list of gene-transcript-protein ids from some sort of id

Parameters:
  • gene_id (str) – The id to be queried
  • type (EIdentifierTypes()) – Assumes given ID from type found in list of EIdentifierTypes(), default is gene name
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

A list of dicts containing information about the corresponding (linked) entries.

Return type:

list(dict)

get_gene_by_position(chrom, start, stop, **kwargs)

Fetches the gene name for given chromosomal location

Parameters:
  • chrom (int) – Integer value of the chromosome in question
  • start (int) – Integer value of the variation start position on given chromosome
  • stop (int) – Integer value of the variation stop position on given chromosome
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

The respective gene name, i.e. the first one reported

Return type:

str

get_product_sequence(product_id, **kwargs)

Fetches product (i.e. protein) sequence for the given id

Parameters:
  • product_id (str) – The id to be queried
  • type (EIdentifierTypes()) – Assumes given ID from type found in EIdentifierTypes(), default is ensembl_peptide_id
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

The requested sequence

Return type:

str

get_transcript_information(transcript_id, **kwargs)

Fetches transcript sequence, gene name and strand information for the given id

Parameters:
  • transcript_id (str) – The id to be queried
  • type (EIdentifierTypes()) – Assumes given ID from type found in EIdentifierTypes(), default is ensembl_transcript_id
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

Dictionary of the requested keys as in EAdapterFields.ENUM

Return type:

dict

get_transcript_information_from_protein_id(product_id, **kwargs)

Fetches transcript sequence for the given id

Parameters:
  • product_id (str) – The id to be queried
  • type (EIdentifierTypes()) – Assumes given ID from type found in EIdentifierTypes(), default is ensembl_peptide_id
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

List of dictionaries of the requested sequence, the respective strand and the associated gene name

Return type:

list(dict)

get_transcript_position(transcript_id, start, stop, **kwargs)

If no transcript position is available for a variant, it can be retrieved, provided the mart has the transcripts connected to the CDS and the exon positions.

Parameters:
  • transcript_id (str) – The id to be queried
  • start (int) – First genomic position to be mapped
  • stop (int) – Last genomic position to be mapped
  • type (EIdentifierTypes()) – Assumes given ID from type found in EIdentifierTypes(), default is ensembl_transcript_id
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

A tuple of the mapped positions start, stop

Return type:

tuple(int, int)

get_transcript_sequence(transcript_id, **kwargs)

Fetches transcript sequence for the given id

Parameters:
  • transcript_id (str) – The id to be queried
  • type (EIdentifierTypes()) – Assumes given ID from type found in EIdentifierTypes(), default is ensembl_transcript_id
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

The requested sequence

Return type:

str

get_variant_id_from_protein_id(transcript_id, **kwargs)

Returns all information needed to instantiate a variation

Parameters:
  • transcript_id (str) – The id to be queried
  • type (EIdentifierTypes()) – Assumes given ID from type found in EIdentifierTypes(), default is ensembl_transcript_id
  • _db (str) – Can override MartsAdapter default db (“hsapiens_gene_ensembl”)
  • _dataset (str) – Specifies the query dbs dataset if default is not wanted (“gene_ensembl_config”)
Returns:

A list of dicts containing all information needed for a variant initialization

Return type:

list(dict)

get_variant_ids(**kwargs)

Fetches the important db ids and names for the given gene _or_ chromosomal location. The former is recommended. The result is a list of dicts with one of the three combinations:

  • ‘Ensembl Gene ID’, ‘Ensembl Transcript ID’, ‘Ensembl Protein ID’
  • ‘RefSeq Protein ID [e.g. NP_001005353]’, ‘RefSeq mRNA [e.g. NM_001195597]’, first triplet
  • ‘RefSeq Predicted Protein ID [e.g. XP_001720922]’, ‘RefSeq mRNA predicted [e.g. XM_001125684]’, first triplet
Parameters:
  • 'chrom' – integer value of the chromosome in question
  • 'start' – integer value of the variation start position on given chromosome
  • 'stop' – integer value of the variation stop position on given chromosome
  • 'gene' – string value of the gene of variation
  • 'transcript_id' – string value of the transcript id of variation
Returns:

The list of dicts of entries with transcript and protein ids (either NM+NP or XM+XP)

IO.RefSeqAdapter

Deprecated since version 1.0.

class Fred2.IO.RefSeqAdapter.RefSeqAdapter(**kwargs)

Bases: Fred2.IO.ADBAdapter.ADBAdapter

get_product_sequence(product_refseq, **kwargs)

Fetches product sequence for the given id

Parameters:
  • product_refseq (str) – Given refseq id
Returns:

List of dictionaries of the requested sequence, the respective strand and the associated gene name

Return type:

list(dict)

get_transcript_information(transcript_refseq, **kwargs)

Fetches transcript sequence for the given id

Parameters:
  • transcript_id (str) – The transcript ID as string
  • type – The given id is of this type, as found in EIdentifierTypes(). It is to be documented if an ADBAdapter implementation overrides these types.
Returns:

List of dictionaries of the requested sequence, the respective strand and the associated gene name

Return type:

list(dict)

get_transcript_sequence(transcript_refseq, **kwargs)

Fetches transcript sequence for the given id

Parameters:
  • transcript_refseq
Returns:

List of dictionaries of the requested sequence, the respective strand and the associated gene name

load(filename)

IO.UniProtAdapter

Deprecated since version 1.0.

class Fred2.IO.UniProtAdapter.UniProtDB(**kwargs)

exists(seq)

Fast check whether the given sequence exists (as a subsequence) in the UniProtDB object's collection of sequences.

Parameters:
  • seq – The subsequence to be searched for
Returns:

True if it is found somewhere, False otherwise

read_seqs(sequence_file)

Reads sequences from UniProt files (.dat or .fasta) or from lists or dicts of BioPython SeqRecords and makes them available for fast search. This function can also be used to append sequences.

Parameters:
  • sequence_file – UniProt files (.dat or .fasta)

search(seq)

Searches for the first occurrence of the given sequence(s) in the UniProtDB object's collection, returning for each sequence the front part of the FASTA header of the first occurrence.

Parameters:
  • seq – A string interpreted as a single sequence, or a list (of str) interpreted as a collection of sequences
Returns:

A dictionary mapping the given sequences to lists of ids (‘null’ if n/a)

search_all(seq)

Searches for all occurrences of the given sequence(s) in the UniProtDB object's collection, returning for each sequence the front part of the FASTA header of all occurrences.

Parameters:
  • seq – A string interpreted as a single sequence, or a list (of str) interpreted as a collection of sequences
Returns:

A dictionary mapping the given sequences to lists of ids (‘null’ if n/a)

write_seqs(name)

Writes all FASTA entries in the current object into one FASTA file

Parameters:
  • name – The complete path, including the file name, where the FASTA file is going to be written