international online community for dna barcoding professionals
We need to gather some thoughts on the details of how we want to deal with ITS barcodes. Below I post a first draft, started by Keith Seifert, of the conditions we want to consider for submission. Please comment.
Definition of an acceptable Fungal ITS barcode
Database administrators will need a formal, somewhat mechanical definition of what constitutes an ITS barcode of acceptable quality. Note that by the strict definition of a barcode in the CBOL standard, barcode sequences must already be accompanied by vouchers, online metadata, and sequence traces, so this need not be restated.
Note: For this discussion, we are talking about true Fungi only. However, the Oomycete people are welcome to clone this and adjust it to their own organisms, because they also would like to use the ITS as a barcode.
A 9th criterion from the first draft, on how to deal with barcodes including introns, was deleted after some discussion with people who know more than I do.
Some explanations …
1. A simple standardized title will simplify Boolean searches, rather than having all the variants we see now.
3. These values need to be considered carefully. The barcode standard says 500 bp, but I believe that a significant number of fungi have a shorter ITS, and we need to build this into the definition.
4 and 5. These are my opinions. If we can define these positions more precisely, we should do so.
6. Again, my opinion that because the function of the sequence is a barcode, we might be able to liberate people from having to annotate the different parts as features.
7. The 75% comes from the barcode criteria for Cox1, and something like this needs to be in our definition so that database administrators can be certain that the sequence is actually an ITS sequence. With Cox1 they do this by translating into protein, which we can’t do, but we can serve the same purpose with the 5.8S. Henrik Nilsson has pointed out a few problems with my original guess… is this any better now? I added the text in bold italics as an alternative or an addition to a possible list of fungi. Does this make sense?
8. This comes from the Cox 1 criteria. I suppose it is okay.
Benjamin Stielow at CBS also mentioned the need for automated chimera checking. Is this an issue for ITS in an identification database, or is it mostly relevant to DNA amplifications from mixed specimens?
Thanks, Conrad and Keith, for your work. I especially like 5, which makes life much easier and took long discussions in the past. On 7, I would prefer the bold and italic version, because it would spare us discussions about which reference species would be the correct ones. Actually, I do not want to test all 'my' species against the proposed taxa and check whether I reach 75% or not... However, a technical approach would be to list the type species of all fungal classes.
To Benjamin's remark: I don't really see chimeras as a problem when working with axenic strains or herbarium material. They are a big issue when working with environmental samples. But to build the barcodes, we will use a specimen-based approach. Thus, I don't see barcode labels on environmental sequences in the near future, although I accept that we will use ITS to re-identify taxa in different studies using next-generation sequencing approaches.
Thanks for starting this discussion and for posting the draft. Points 2-6 sound fine to me as they are. A few comments:
Add 1. Who will be giving this title to the sequence? The sequence submitter? Genbank? Or is this definition meant for GenBank as a guide whether a sequence will apply for the barcode flag? If the submitter is using this title, there will be a situation where we have sequences called ‘barcode’ with flag and others called barcode without flag (because they do not meet all of the barcode requirements after all). In addition, there are (heritage) sequences not called ‘barcode’ that might qualify for the flag. Though it sounds confusing, not necessarily a bad situation.
Alternatively, the obligatory title is ‘fungal ITS’ only and the word barcode is added by GenBank only if the requirements are met. Unless the latter is superfluous, because there will be the barcode flag added by GenBank anyway.
Add 8. I think we should ditch point 8. For barcode users who cannot or do not want to go back to look at the trace files, one could consider a check box "70% coverage? Yes/no" (preferably automated), or an automatically calculated bidirectional coverage, derived from the submitted traces.
Depending on the basecaller and the basecaller settings used, the same raw data could come out with a lot of 'n' or with mostly unambiguous base calls. Thus, the 70% coverage criterion could be met or failed with the same data. Though it is possible to reanalyse data with a different basecaller, I do not expect that a standard basecaller will be used to reanalyse the submitted traces. Therefore this value is only an approximation anyway.
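To make the basecaller point concrete, here is a minimal sketch of how an automated bidirectional coverage value might be calculated. It assumes (this is not from the draft criteria) that the forward and reverse reads have already been placed on the same coordinates as equal-length strings, with '-' where a read does not reach and 'N' for ambiguous calls. The function name is hypothetical.

```python
# Sketch only: the input representation and function name are assumptions.
UNAMBIGUOUS = set("ACGT")


def bidirectional_coverage(forward: str, reverse: str) -> float:
    """Fraction of positions with an unambiguous call in BOTH reads.

    Both strings must have the same length; '-' marks positions a read
    does not cover and 'N' marks ambiguous base calls.
    """
    if len(forward) != len(reverse):
        raise ValueError("reads must be aligned to the same length")
    if not forward:
        return 0.0
    both = sum(
        1
        for f, r in zip(forward.upper(), reverse.upper())
        if f in UNAMBIGUOUS and r in UNAMBIGUOUS
    )
    return both / len(forward)


# Same hypothetical raw data, two basecaller settings for the forward read.
reverse_read = "ACGTACGTAC"
strict_forward = "ACGTNNNNAC"   # conservative settings: four 'N' calls
lenient_forward = "ACGTACGTAC"  # permissive settings: no 'N' calls

print(bidirectional_coverage(strict_forward, reverse_read))   # 0.6
print(bidirectional_coverage(lenient_forward, reverse_read))  # 1.0
```

With a 70% threshold, the first call would fail and the second would pass, although both come from the same toy data.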
The other reason why I am writing this: COI data does not normally suffer from infrageneric length differences. The ITS does; at least in many yeast strains and basidiomycetes, minor infrageneric length differences are a regular occurrence and often species specific. Some ITS sequences also suffer from repetitive motifs that do not allow sequencing through to the other end, and even if the trace is full length, then only with double or triple peaks or worse. Even cloning does not always help. Again, the occurrence of such motifs is species specific. I think we should not exclude these sequences and, by extension, these taxa from the body of barcode reference data.
The inclusion of quality assessment values (i.e. phred values) in the quality assessment does not necessarily help, because technically perfect sequences will still get a bad quality value for the mixed base calls.
Cloning can ease some of the problems mentioned under 8. On the other hand, there is a considerable chance of getting atypical sequences, which can only be identified as such when a certain number of cloned sequences are compared. As a minimum requirement, one could imagine that cloned sequences should be indicated as such. Or should clones be banned? This latter position was voiced in discussions of COI requirements during past meetings.
Chimeras can be an issue if complete ITS sequences were pieced together from separate ITS1 and ITS2 fragments. This is often a necessary practice when old specimens (type specimens!!!) are sequenced.
Best wishes, Ursula
Regarding the deleted 9th criterion, "barcoding including introns": in my opinion, we should include a requirement that the intron positions must be annotated. In Ascomycotina, there are many examples where two ITS sequences (with and without introns) are obtained from the same DNA isolation.
On points 4 and 5: are these intended to be minimum standards or strict limits? Would a longer sequence still meet the barcode criteria?
Alignment of sequences would certainly be easier if these were strict limits; however, enforcing these limits will make it harder to upgrade existing GenBank sequences to barcode status by simply adding the trace files.
Could the barcode be part of a longer sequence (say, including all of the 28S) with the boundaries just annotated?
There are a lot of issues brought up by different people in the discussion, so thanks for that.
What will happen at BOLD is that they will create an informatics pipeline that checks sequences submitted as ITS sequences to be sure that they meet the criteria. PHRED values are already part of the quality check pipeline that they have. To the extent that mycologists decide to use BOLD, it is important to define these criteria so that they are consistent. Bevan's point that longer ITS might not be bounded by the two motifs indicated in points 4 and 5 does bring out a flaw. I suggest changing 5 to:
5. Barcode normally ends with the first 5 bases of the nuclear large ribosomal subunit, which in Saccharomyces cerevisiae are GTTTG, but the functional barcode excludes these bases; barcodes longer than 750 bp need not have the 3' motif provided the 5' motif is included.
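A rough sketch of how a pipeline might apply this wording of point 5 follows. Several things are assumptions on my part: the motif is matched exactly against the S. cerevisiae GTTTG (the word "normally" suggests other taxa may differ), the 750 bp is measured on the submitted sequence, and the 5' motif check from point 4 is done elsewhere and passed in as a flag. The function name is hypothetical.

```python
# Sketch only; names and the exact-match rule are assumptions.
LSU_START = "GTTTG"   # first 5 bases of the LSU in S. cerevisiae (point 5)
LONG_BARCODE_BP = 750  # above this length the 3' motif may be absent


def check_three_prime(seq: str, has_five_prime_motif: bool):
    """Return (accepted, functional_barcode) under the proposed point 5.

    has_five_prime_motif is the result of the separate point 4 check,
    which is not implemented here.
    """
    seq = seq.upper()
    if seq.endswith(LSU_START):
        # The motif marks the end, but is not part of the functional barcode.
        return True, seq[: -len(LSU_START)]
    if len(seq) > LONG_BARCODE_BP and has_five_prime_motif:
        # Long sequence without the 3' motif, anchored by the 5' motif.
        return True, seq
    return False, seq


ok, barcode = check_three_prime("ACGT" * 10 + "GTTTG", has_five_prime_motif=True)
print(ok, len(barcode))  # True 40
```

A sequence that is both shorter than 750 bp and lacking the 3' motif is rejected in this sketch; whether that is what we want for the short ITS mentioned under point 3 is still open.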
Regarding Dominik's concern about point 7, I agree that we don't want to bother with this as barcode generators, but BOLD would just include this as a quality check. As long as it is acceptable to BOLD, which should be able to automatically BLAST the 5.8S itself and decide if it is hitting a fungus, I suggest the following version for 7.
7. 75% of the bases of the 5.8S must normally align to the reference sequences of Saccharomyces cerevisiae or any other species accepted as a member of the Kingdom Fungi by MycoBank or Index Fungorum.
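For illustration, the 75% test in point 7 could be computed from pairwise alignments roughly as below. This is only a sketch under assumptions: the alignments are supplied as gapped strings of equal length (for example parsed from BLAST output), "align" is read as "query base placed opposite a reference base", with an optional stricter identity reading, and the function names are hypothetical. The sequences in the example are toy strings, not real 5.8S.

```python
# Sketch only; the reading of "align" and all names are assumptions.
def fraction_aligned(query_aln: str, ref_aln: str, require_identity: bool = False) -> float:
    """Fraction of query bases that sit opposite a reference base.

    Both arguments are rows of one pairwise alignment, with '-' as gap.
    With require_identity=True only identical bases are counted.
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("alignment rows must have equal length")
    query_bases = 0
    aligned = 0
    for q, r in zip(query_aln.upper(), ref_aln.upper()):
        if q == "-":
            continue  # gap in the query: not a query base
        query_bases += 1
        if r == "-":
            continue  # query base with no reference partner
        if require_identity and q != r:
            continue
        aligned += 1
    return aligned / query_bases if query_bases else 0.0


def passes_point_7(alignments, threshold: float = 0.75) -> bool:
    """True if the query reaches the threshold against ANY reference.

    alignments is a list of (query_aln, ref_aln) pairs, one per
    accepted fungal reference sequence.
    """
    return any(fraction_aligned(q, r) >= threshold for q, r in alignments)


print(fraction_aligned("ACGT-ACGT", "ACGTTAC--"))  # 0.75
print(passes_point_7([("ACGT-ACGT", "ACGTTAC--")]))  # True
```

Taking the best hit over any accepted fungal reference, rather than a fixed list, matches the intent of not having to argue about which reference species is the correct one.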
Concerning the deleted number 9 on introns, I cannot decide what is sensible here. We had some discussion earlier whether to include the introns as part of the barcode or not, or whether they should be edited out. I agree with Maria that they can be annotated, but I don't think this really is relevant to quality checking that these guidelines are meant to be used for.
Let's keep this discussion going for a few more days, then we will put up a revised version for final commenting sometime towards the end of February.