A list of puns related to "Gene Ontology"
Hi everyone,
I have a list of UniProtKB accessions (around 900) along with gene IDs in a CSV file. It is not RNA-Seq data. Is it possible to fetch UniProt GO terms in R? I need to plot GO terms for the set of genes. Also, these 900 UniProtKB accessions are distributed across 20,000 genes, so each accession has its own frequency. Kindly let me know if you have any suggestions. Thank you.
I downloaded the dataset goa_human.gaf from the Gene Ontology Consortium. The description in the file and on the website says that this dataset has been "filtered in order to reduce redundancy". What does that mean?
I did find the release pipeline and the Annotation QC checks: http://wiki.geneontology.org/index.php/Release_Pipeline#Annotation_QC_checks And this pointed me to the GO rules: https://github.com/geneontology/go-site/blob/master/metadata/rules/README.md
There are so many rules here that I don't fully understand them all. However, they look more or less like quality-assurance filters to make sure the data is clean and usable. What worries me is a systematic removal of GO terms that I did not anticipate. My goal is to use these GO annotations to measure protein functional similarity (based on information content and Jaccard similarity; see the Funsim measures section of this paper). I'd like to ensure that a chunk of the data wasn't removed for a totally different purpose (e.g., a frequency cutoff that suits a different task such as gene set enrichment analysis).
Any insights?
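For anyone wanting to check what actually survives in the file before computing similarities, here is a minimal sketch of reading GAF rows into per-protein GO term sets and taking a Jaccard similarity. It assumes the standard tab-separated GAF layout (column 2 = object ID, column 4 = qualifier, column 5 = GO ID); the function names and the toy rows are made up for illustration and are not taken from goa_human.gaf.

```python
from collections import defaultdict


def parse_gaf(lines):
    """Map each object ID (GAF column 2) to its set of GO IDs (column 5)."""
    annotations = defaultdict(set)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a complete annotation row
        if "NOT" in cols[3].split("|"):
            continue  # negated annotations say the term does NOT apply
        annotations[cols[1]].add(cols[4])
    return annotations


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy rows with invented identifiers (illustration only).
toy_gaf = [
    "!gaf-version: 2.2",
    "UniProtKB\tPROT_A\tGENEA\tenables\tGO:0000001\tREF:1\tIDA",
    "UniProtKB\tPROT_A\tGENEA\tinvolved_in\tGO:0000002\tREF:1\tIEA",
    "UniProtKB\tPROT_B\tGENEB\tinvolved_in\tGO:0000002\tREF:2\tIDA",
    "UniProtKB\tPROT_B\tGENEB\tNOT|enables\tGO:0000001\tREF:2\tIDA",
    "UniProtKB\tPROT_B\tGENEB\tlocated_in\tGO:0000003\tREF:2\tIDA",
]
ann = parse_gaf(toy_gaf)
similarity = jaccard(ann["PROT_A"], ann["PROT_B"])
print(similarity)
```

Running the same parser over the real file and counting rows per evidence code or per term would be one way to see whether any class of annotation looks suspiciously absent.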
More info
Website description (http://current.geneontology.org/products/pages/downloads.html):
> Filtered Files
>
> These files are taxon-specific and reflect the work of specific projects, primarily the model organisms database groups, to provide comprehensive, non-redundant annotation files for their organism. All the files in this table have been filtered using the annotation file QC pipeline. A major component to the filtering is the requirement that particular taxon IDs can only be included within the association files provided by specific projects; the current list of authoritative groups and major model organisms can be found below.
Data downloaded:
Homo sapiens
EBI Gene Ontology Annotation Database (goa) protein 543477 goa_human.gaf (gzip)
Data dictionary: http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/
Data header:
!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2021-10-27T15:09
!
!Header from source association file:
!=================================
!
!generated-by: GOC
!
!date-generated: 2021-10-27T04:08
!
!Header from goa_human source association file:
!=================================
!
!The set of protein accessions included in this file is based on UniProt reference proteomes, which provide on
Any suggestions on tools (standalone or associated with Python/R/etc.) that take a FASTA input and return gene ontologies for the domains that are present in the sequence?
I have tried to work with HMMER2GO, but there's a Perl module issue I haven't been able to overcome. I'm trying to use the Docker version, but I'm pretty inexperienced with Docker so I'm unsure how to feed the Docker container with an input string (rather than running the image directly, which opens an interactive shell in this case). I'm getting the string through Python, and want to pass it to HMMER2GO via its Docker image, but no luck.
I'm trying to avoid using BLAST2GO, since I only want to use open-source + free command line tools.
Hi everyone,
I'm newish to R and I just finished a differential-expression analysis in R of my LFQ-based proteomics dataset. I ended up with a data frame containing the significantly expressed genes (in this case from yeast), their UniProt IDs, p-values, Log2(FC), etc. I would like, however, to add some annotations to my analysis (GOMF, GOCC, GOBP, KEGG, etc.).
Which R package would you recommend for adding these types of annotations based on the UniProt IDs?
Thanks a lot :)
Hi all,
I am new to bioinformatics and am currently learning how to use GSEA.
Background: I analyzed my RNA-Seq results using DESeq2, and am now learning to perform GSEA. For my project, in broad terms, I have samples from sick patients and healthy patients. My plan was to perform GSEA to identify enriched gene sets in the sick patients, and then perform Leading Edge analysis to view genes that are present across many of the enriched gene sets. I am particularly interested in gene sets/genes having to do with immune responses. The MSigDB I used was BP of GO.
As I am working with this program for the first time, I am stumped about two things:
1. The GSEA analysis came back with "888 gene sets are significantly enriched at nominal pvalue < 1%" and "1815 gene sets are significant at FDR < 25%", so I am a little overwhelmed by the volume of data. In this type of analysis, is it sufficient to look at and work with, for example, the top 50 enriched gene sets and continue my Leading Edge analysis with those? My guess is no, and that this would lead me to miss potentially interesting results.
2. Looking at my top 20 enriched gene sets, for example, there are a number of gene sets that pertain to my experiment and what I am interested in, such as those having to do with pattern recognition receptor signaling, TLR signaling, antigen processing, etc. In addition, there are also highly enriched gene sets such as Golgi vesicle transport, ER to Golgi mediated transport, and vesicle targeting, which I am less interested in as they have less to do with immune responses. Is there a method to filter my GO results for ones having to do with immune response, and perform Leading Edge analysis on that filtered subset?
I guess what it boils down to is - I am overwhelmed with the # of enriched gene sets and volume of data, and am unsure of where to go next in my analysis!
I would appreciate any help/suggestions/advice! I hope my question was clear - I am still new to bioinformatics and am not always certain about the terminology and stuff :)
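One way to restrict enriched terms to a branch such as immune response is to collect every descendant of a chosen parent term and keep only the enriched terms in that set. Below is a minimal sketch of that idea. The hierarchy and term IDs are invented for illustration; a real parent-to-child map would have to be built from the ontology itself (for example from the `is_a` relations in an OBO file), and mapping MSigDB gene set names back to GO IDs is assumed to be done separately.

```python
def descendants(term, children):
    """Return `term` plus every term reachable by following child links."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:  # a DAG can reach one term by several paths
                seen.add(child)
                stack.append(child)
    return seen


# Toy hierarchy: parent -> direct children. These IDs are invented, not real GO IDs.
children = {
    "T:immune_response": ["T:innate_immune_response", "T:antigen_processing"],
    "T:innate_immune_response": ["T:tlr_signaling"],
    "T:transport": ["T:golgi_vesicle_transport"],
}

# Hypothetical enriched terms, in the order the enrichment tool reported them.
enriched = ["T:tlr_signaling", "T:golgi_vesicle_transport", "T:antigen_processing"]

immune_terms = descendants("T:immune_response", children)
immune_hits = [t for t in enriched if t in immune_terms]
print(immune_hits)
```

The filtered list keeps the original ranking, so it can be passed on to whatever follow-up step is used for the leading-edge comparison.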
What databases would you recommend for gene ontology? How would you form a comprehensible report on expressed gene functions?
I've been looking around at various packages (GO.db, topGO, etc.), but it's not clear to me if this is possible in them. Do you know of an easy way to do this?
------------
*Edit* For future readers: I ended up using ontology/pathway annotations from the msigdbr package. It doesn't have the full Gene Ontology, but it's enough for me. I tried the NCBI "gene2go" file, but it includes associations from all species and I didn't feel like doing the additional filtering steps required. Also, I believe it only includes annotations at ontology "stems", so you have to trace the hierarchy up if you want to annotate more general terms. AnnotationDbi (e.g. "org.Hs.eg.db") works well, but each search takes a few seconds, which is too slow for testing many terms at once.
Hello,
I have a list of genes and I want to pull out and analyze just the transcription factors.
Is there a tool that I can use to just give broad ontologies to a set of genes? Such as "transcription factor", "kinase", etc.
The GO and KEGG categories are too specific.
Hi folks,
I am a master's student trying to figure out gene ontology on the fly. My thesis defense is in about a month. I have been floundering for a bit and am turning to the glory of Reddit to aid me. I have no practical skills when it comes to genetic research so be gentle. I am working with differential expression data procured using next-gen RNAseq. The following is how I approached analyzing and interpreting the DE data.
What I would like to know is 1) are the tools/approach I used to make the ontology sound? If yes great, if not, where did I go wrong. 2) What exactly am I looking at once I have an ancestor chart? What inferences can be made or not made?
Thanks
https://preview.redd.it/4gecml24g2e51.png?width=1194&format=png&auto=webp&s=b6f6da6faf25bfb1ea26063e202116a0b07834c5
I have an assignment question about this. We created a ROC plot to compare PSI-BLAST results against the GO database and looked at how many true and false positive hits were present in our PSI-BLAST results. I think the answer is no: we can't use the GO database as a 'gold standard', but could we use it to double-check and point out certain false positives in PSI-BLAST results? I found a few papers about increasing PSI-BLAST accuracy and none mention GO; however, they do mention using literature and even Google to increase accuracy, so shouldn't we be able to use the GO database for this purpose as well? Could someone elaborate with a comprehensive explanation?
Thanks in advance!
Hi, I'm a very inexperienced high school intern and honestly I don't really understand how DAVID works, but I need to use it to get some gene annotation data. I submitted two miRNA target prediction Ensembl transcript ID lists (and nearly all transcripts correspond to unique genes), each about 210 entries long. They have 196 transcripts in common, so I expected to get very similar GOTERM results. However, the top ranked/lowest p value results for the first list are 'osteoblast differentiation', 'spliceosome', 'protein kinase binding' while the top results for the other are 'membrane fusion', 'vesicle', 'endoplasmic reticulum'.
As I scroll down I guess there are more GO results that are the same between the two lists, but the difference is pretty significant and especially I wonder why the top results are so dissimilar. Do those ~10 different transcripts make a huge impact or did I do something wrong? Thanks for any help!
Hello everybody, how are you doing? I've seen a couple of questions here about GO but I couldn't find what I need, sorry if this is indeed a duplicated post.
I've been asked to perform a GO enrichment analysis on a couple of RNA-seq libraries of strawberry. I have the list of genes with their GO terms and the background (the total genes in the genome, according to the genome annotation website), also with GO terms. What I'm trying to do (I think; I've never done this before) is a Fisher test. I found several online tools that perform something like this, but all of them have pre-loaded species, and I would like to load my own background data. Another issue is the names of the genes. The IDs are awful, and there's no way to get something like an Entrez or GenBank ID. The only server I found that could work is agriGO, but it only worked once with my data, and never again.
Do you know any other tool that could do the trick? Any Python or R package (I'm more familiar with Python)?
Thanks in advance!
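For a custom background with arbitrary gene IDs, the test itself is small enough to write by hand. The sketch below computes, for each GO term, the hypergeometric upper-tail probability, which is the same number a one-sided Fisher's exact test for over-representation gives. It uses only the standard library; the function names and toy gene IDs are invented for illustration, and no multiple-testing correction is applied, which a real analysis would need.

```python
from collections import Counter
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def enrich(study_genes, gene2go):
    """Return (term, p-value) pairs, smallest p first.

    gene2go maps every background gene to its set of GO terms.
    """
    background = set(gene2go)
    study = set(study_genes) & background  # ignore genes missing from the background
    N, n = len(background), len(study)
    term_bg = Counter(t for g in background for t in gene2go[g])
    term_study = Counter(t for g in study for t in gene2go[g])
    results = {
        term: hypergeom_upper_tail(k, n, term_bg[term], N)
        for term, k in term_study.items()
    }
    return sorted(results.items(), key=lambda item: item[1])


# Toy background of 10 genes with invented IDs and terms.
gene2go = {f"g{i}": {"TERM_ALL"} for i in range(10)}
for g in ("g0", "g1", "g2"):
    gene2go[g].add("TERM_RARE")

ranked_terms = enrich(["g0", "g1", "g2"], gene2go)
print(ranked_terms)
```

Because the gene IDs are only used as dictionary keys, awkward genome-specific identifiers work as long as the study list and the background use the same ones.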
I want to find transcription factor-related GO terms. How do I do it? I have no idea where to look or how to curate them manually.
Hello guys,
I've recently performed a gene set enrichment analysis for 4 lists of genes that I have, using DAVID.
I got the enriched terms at my desired p-value threshold (< 0.05), but now I don't know how to interpret those results.
Can someone help me with this? How do I interpret the individual lists, and how do I compare across the different lists?
Hey,
I want to get all sequences from Swiss-Prot and the corresponding GO IDs with a particular GO evidence code. My first approach was to download the FASTA and request the txt entry for every UniProt ID, but this was very slow. My next approach was to create a custom table with the IDs and download it; however, I can't find an option anywhere to include evidence codes in the table, which makes it useless.
I'd really appreciate your help
Nuraxx
Hello all, I've written a script to search an entire downloaded NCBI transcriptome for mRNA that has the recognition site for the specific sequence of interest. The R script is still running and hopefully will return an Excel spreadsheet of mRNAs that contain the recognition sequence. From here, I would like to acquire GO terms to bin the identified genes for downstream comparisons among two other species (Venn diagrams, heat maps, all that jazz).
I am NEW to coding and it was quite the journey to get this far, so consider my skill level in your literature and resource suggestions. I'd appreciate any insight; most of the GO analysis tutorials I have seen have been for enrichment analysis. I'm just looking for the basic sorting and bonus visualizations. Thanks in advance!
I've used the R package topGO to get the GO terms for my genes of interest. However, I end up with 50+ terms at low p-values, and many of them seem very similar. I was hoping for help finding a good way to reduce my GO terms.
Revigo seems like a decent option, but I was wondering if there are other methods that don't require me to copy and paste into a web app.
Thanks!
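One scriptable alternative is a greedy filter on gene overlap: walk the terms from most to least significant and keep a term only if its gene set does not overlap too much with any term already kept. This is a simpler technique than the semantic-similarity clustering Revigo performs, so treat it as a rough sketch. The function names, the 0.5 cutoff, and the toy data are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reduce_terms(term_genes, pvalues, max_overlap=0.5):
    """Greedily keep the most significant terms whose gene sets are distinct.

    term_genes: term -> set of significant genes annotated to it
    pvalues:    term -> enrichment p-value
    """
    kept = []
    for term in sorted(term_genes, key=lambda t: pvalues[t]):
        genes = term_genes[term]
        if all(jaccard(genes, term_genes[k]) <= max_overlap for k in kept):
            kept.append(term)
    return kept


# Toy input with invented term names and gene IDs.
term_genes = {
    "term_A": {"g1", "g2", "g3", "g4"},
    "term_B": {"g1", "g2", "g3"},  # nearly the same genes as term_A
    "term_C": {"g7", "g8"},
}
pvalues = {"term_A": 0.001, "term_B": 0.01, "term_C": 0.02}

representative = reduce_terms(term_genes, pvalues)
print(representative)
```

Lowering `max_overlap` collapses more aggressively; raising it keeps more near-duplicates.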
Hello everybody,
I have used a few different tools for GO analysis (GOseq, GAGE, DAVID) but find it hard to choose between them. As far as I can tell, there is no preferred tool in the community.
What do you use and why?
Thanks!
Hello everybody,
So here's my problem: we have a nice RNA-seq dataset, many significant genes, etc. And of course, we want to run those genes through GO analysis.
But instead of feeding GO with my raw list of genes, I'd like an algorithm where I can also supply a p-value and/or fold change to give more weight to some genes.
I don't know if it exists or if it's even possible, and I'm sorry, but I've kind of fallen behind in bioinformatics!
Thank you !
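Rank-based methods in the style of GSEA do roughly this: every gene stays in the analysis, and each gene's score (for example a signed log fold change) weights its contribution to a running sum. The sketch below computes only that weighted running-sum statistic for one gene set; it does not estimate significance, which normally comes from permutations. The function name, the `weight` parameter, and the toy scores are assumptions for illustration.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum statistic for one gene set.

    ranked: list of (gene, score) pairs sorted by score, highest first.
    Genes in the set push the sum up in proportion to |score| ** weight;
    genes outside the set push it down by a constant step.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_total = sum(abs(score) ** weight for (_, score), hit in zip(ranked, hits) if hit)
    if hit_total == 0 or n_miss == 0:
        return 0.0  # statistic is undefined without both hits and misses
    running, best = 0.0, 0.0
    for (_, score), hit in zip(ranked, hits):
        running += abs(score) ** weight / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Toy ranking with invented genes and scores.
ranked = [("a", 3.0), ("b", 2.0), ("c", -1.0), ("d", -2.0)]
top_score = enrichment_score(ranked, {"a", "b"})
bottom_score = enrichment_score(ranked, {"c", "d"})
print(top_score, bottom_score)
```

A positive value means the set's genes cluster near the top of the ranking, a negative value means they cluster near the bottom.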
Hi everyone,
I'm not sure if this will be of any use to anyone here, but I've just released an R package named goldi: Gene Ontology Label Discernment and Identification. The package arose through a collaboration which attempted to identify Gene Ontology terms in journal articles in various fields in order to compare frequencies and over expressed terms. The package (hopefully) provides an easy to use syntax for searching a given article or abstract for Gene Ontology molecular function terms, or any other list of terms which you could provide.
It's a very basic concept, though it appears to work pretty well. We define an acceptance function which is the number of words needed to be present in a sentence for the algorithm to say the particular term is present. We apply typical NLP quality control on all of the input to reduce type I and type II error. I've also coded certain parts of the algorithm in Rcpp to improve the performance, and it seems to scale quite well. The package also has the ability to integrate in synonyms, and has some convenience functions for common analyses. I've provided descriptions of these in the vignettes.
Again, I'm not sure if this will be useful to anyone, but just in case it is:
I'm submitting to CRAN but haven't had a chance to do it yet with other obligations. If you do try it, please feel free to give any feedback (positive or negative) and bug reports.
All the best!
For example, would it be possible to search UniProt for all known enzymes with acidic products?
I'm doing a type of analysis where I'm quantifying reads all along the genome, not just in annotated parts. As such, I've split every part of the genome into unique, non-overlapping windows, which don't correspond to known gene boundaries. Using a Bioconductor/DESeq pipeline, I've been able to identify the windows that were most significantly differentially expressed by treating each window as if it were a gene. By manual inspection, it seems that some of these significant windows overlap with known gene annotations, while others correspond to un-annotated (likely regulatory) regions.
I'd like to conduct some sort of gene ontology analysis going forward, to see whether windows overlapping with annotated genes tend to have functional similarities. The problem I'm facing now is that most standard pipelines assume your gene names correspond to existing "standard" gene names. However, in my analysis, the window names simply uniquely identify the chromosome, strand, and window number. Based on this information, I can construct a GRanges object to map my windows to real locations in the genome.
From my list of significantly differentially expressed windows, I've constructed corresponding GRanges. Now, how do I map these ranges to existing genes within the annotation databases so I can perform GO analysis?
Sorry if this question doesn't make much sense. I'll answer whatever I can in the comments.
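The core of the mapping step is an interval-overlap test between windows and gene annotations on the same chromosome. Here is a minimal sketch of that logic on plain tuples; in Bioconductor the equivalent operation on GRanges objects is what `findOverlaps` from GenomicRanges is for. The tuple layout, function name, and coordinates are invented for illustration, strand is ignored, and the brute-force loop would be too slow for a whole genome without an index.

```python
def overlapping_genes(windows, genes):
    """Map each window name to the genes it overlaps.

    windows, genes: lists of (name, chrom, start, end) with closed coordinates.
    Two intervals overlap when each one starts before the other ends.
    """
    hits = {}
    for w_name, w_chrom, w_start, w_end in windows:
        hits[w_name] = [
            g_name
            for g_name, g_chrom, g_start, g_end in genes
            if g_chrom == w_chrom and g_start <= w_end and w_start <= g_end
        ]
    return hits


# Toy coordinates with invented names.
windows = [
    ("chr1_plus_w1", "chr1", 1, 1000),
    ("chr1_plus_w2", "chr1", 1001, 2000),
    ("chr2_plus_w1", "chr2", 1, 1000),
]
genes = [
    ("geneA", "chr1", 900, 1500),   # spans the boundary between w1 and w2
    ("geneB", "chr1", 5000, 6000),  # overlaps no window
    ("geneC", "chr2", 1000, 1200),  # touches the last base of chr2 w1
]

window_to_genes = overlapping_genes(windows, genes)
print(window_to_genes)
```

Windows that map to an empty list are the un-annotated ones; the gene names collected for the rest can then go into a standard GO workflow.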
Does anybody know of good software or a way to easily create editable GO term graphs? I have two annotated species transcriptomes with gene ontology terms, and we used WEGO and Blast2GO to create the graphs, which just generate a downloadable JPG. But my PI wants editable graphs.
Anybody got any ideas?