<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent posts to Discussion</title><link>https://sourceforge.net/p/bbmap/discussion/</link><description>Recent posts to Discussion</description><atom:link href="https://sourceforge.net/p/bbmap/discussion/feed.rss" rel="self" type="application/rss+xml"/><language>en</language><lastBuildDate>Wed, 05 Feb 2025 21:16:04 -0000</lastBuildDate><item><title>filterbyname</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/3daaa5762d/?limit=25#cf10</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Does filterbyname remove all reads in the fastq files that match a read name in the txt file? Are duplicates in the input files an issue?&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Samantha M Zarnick</dc:creator><pubDate>Wed, 05 Feb 2025 21:16:04 -0000</pubDate><guid>https://sourceforge.netc20ec0a407f35874b25972f5644105559afee34b</guid></item><item><title>suggested bbmerge improvements</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/b838879b5e/?limit=25#1981</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I was looking for a java-based alternative to either cutadapt or ngmerge, and bbmerge seems to be able to do much of what both other tools can do (primarily I was looking for fixed read trimming, quality trimming, and paired read dovetail trimming). A few suggested improvements:&lt;br/&gt;
* Allow simultaneous fixed trimming and q-trimming. Right now it seems like q-trimming supersedes fixed trimming if both are specified, regardless of which end you want to trim in each case; but there are situations where e.g. one might want to trim a fixed number of bases from the 5' end and quality-trim the 3' end. Additionally, one might want to trim &lt;em&gt;at least&lt;/em&gt; X bases from one end, but perform additional quality trimming beyond that if necessary. I'd recommend just implementing an order of operations, e.g. fixed trimming &amp;gt; quality trimming &amp;gt; adapter trimming &amp;gt; merging or dovetail trimming (similar to what cutadapt uses).&lt;br/&gt;
* Enable an option for q-trimming that properly handles 2-color chemistry (where 3' G bases may represent null cycles beyond the end of the template but nevertheless have high quality scores, so conventional trimming algorithms do not trim them) - cutadapt implemented something like this a while back.  Also relevant for quality trimming in BBMap.&lt;br/&gt;
* Add the ability to put fixed trimmed sequences or trimmed adapter sequences in the fastq comment as a properly formatted SAM tag (this is purely just a nice-to-have and maybe beyond scope, but cutadapt allows a lot of both read and read-name editing like this that really enhances its flexibility for downstream applications).&lt;/p&gt;
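To illustrate the 2-color-chemistry point above: a minimal sketch in plain Python (not BBMerge or cutadapt code; the function name and threshold are made up, and cutadapt's actual --nextseq-trim instead treats trailing G bases as low-quality during quality trimming). It simply drops a sufficiently long trailing G run regardless of its reported quality:

```python
def trim_poly_g(seq, qual, min_run=10):
    # 2-color chemistry artifact: cycles past the template end read as
    # high-confidence G, so quality-based trimming never removes them.
    # Here we drop a long trailing G run outright, ignoring its scores.
    i = len(seq)
    while i > 0 and seq[i - 1] in "Gg":
        i -= 1
    if len(seq) - i >= min_run:
        return seq[:i], qual[:i]
    return seq, qual

# A 12-base trailing G run is removed despite its Q40 ('I') scores
print(trim_poly_g("ACGTACGT" + "G" * 12, "I" * 20))  # ('ACGTACGT', 'IIIIIIII')
```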
&lt;p&gt;Thanks!&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Eric Boyden</dc:creator><pubDate>Wed, 12 Jun 2024 21:29:52 -0000</pubDate><guid>https://sourceforge.net4ea0b2d18a6445baf7c34ff295b0a4a2509ad294</guid></item><item><title>suggested improvements</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/a6bf092e5d/?limit=25#699a</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I was looking for a java-based global short-read aligner and this software performs excellently, with lots of neat features that many other aligners lack. In descending order of priority, here are some ideas for improvement:&lt;br/&gt;
* Update SAM spec to v1.6 (currently 1.3/1.4), primarily to produce an updated @HD line to indicate query-grouped output (GO:query). Many downstream tools (e.g. some fgbio tools) rely on this and won't work unless it is explicitly specified, and re-sorting just to add this annotation is time-consuming. (Also, according to the SAM spec, =/X CIGAR strings were added in v1.3, not v1.4: &lt;a href="https://samtools.github.io/hts-specs/SAMv1.pdf" rel="nofollow"&gt;https://samtools.github.io/hts-specs/SAMv1.pdf&lt;/a&gt;; this option probably shouldn't be tied to the SAM spec version at all but should be renamed.)&lt;br/&gt;
* &lt;code&gt;killbadpairs&lt;/code&gt; and &lt;code&gt;requirecorrectstrand&lt;/code&gt; should prevent paired reads from mapping to different chromosomes/contigs; currently such reads are passed through since they're technically not on "opposite" strands.&lt;br/&gt;
* Add functionality to move fastq read comments (everything after the first whitespace) into a SAM tag, possibly as an alternative/add-on to &lt;code&gt;trimreaddescriptions&lt;/code&gt;.  This feature is offered by several other aligners, including BWA, Bowtie2, MiniMap2, and SNAP.&lt;br/&gt;
* Add support for ubam input, so that realigning bams doesn't require reverting them back to fastqs (which may also be complicated to do without losing ubam metadata, although adding the ability to move fastq comments to sam tags will help).&lt;br/&gt;
Also FWIW, the default &lt;code&gt;minid=0.76&lt;/code&gt; seems to be a bit too sensitive for human PE150 data, and allows a fairly high rate of nonspecific alignments of junk reads in NTCs, at least with our data. Raising this value to 0.85 cleaned everything up, with output similar to that of other common aligners (BWA, Bowtie2).  Unsurprisingly, this error rate is also more consistent with Bowtie2's default error tolerance in global mode.&lt;br/&gt;
Thanks!&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Eric Boyden</dc:creator><pubDate>Wed, 12 Jun 2024 21:13:12 -0000</pubDate><guid>https://sourceforge.netc3754746501e23006d841df4622866875fd2c068</guid></item><item><title>GC content for paired-reads</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/03c7258449/?limit=25#e1c4</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi, I recently read on biostars &lt;span&gt;[https://www.biostars.org/p/9546248/]&lt;/span&gt; that I could use your tools to determine the GC content of my reads. My reads are paired reads though, and I wanted to adjust this to determine the GC content for each chromosome. I was able to manually split my bam file into the various chromosomes but am unsure how best to use reformat and stats on these files. When I tried to tell reformat that it was paired, and provide two output file names, it didn't work. It runs properly if I say they are unpaired, but I'm worried that if I say the reads are unpaired, that will change the answer for the overall chromosome GC content. Thoughts? Thank you in advance!&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Priscilla Glenn</dc:creator><pubDate>Wed, 06 Dec 2023 19:47:00 -0000</pubDate><guid>https://sourceforge.net2398aace3e9d4f66f00ecf3c9aac5807144e2de2</guid></item><item><title>finding references (phiX174 and adapters) for bbduk</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/2738b79bda/?limit=25#3af2</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hello,&lt;/p&gt;
&lt;p&gt;I am currently trying to make a more reproducible workflow (removing weird paths to files mostly) and I am having trouble figuring out the shortcut that can be used for phiX174 as a reference for BBDuk.&lt;/p&gt;
&lt;p&gt;It looks like the adapters file can be called like this: "ref=adapters", eliminating the extra path before it. Is there a similar way to write the phiX174_ill.ref.fa.gz file?&lt;/p&gt;
&lt;p&gt;Thanks!&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jay Osvatic</dc:creator><pubDate>Mon, 20 Mar 2023 11:49:13 -0000</pubDate><guid>https://sourceforge.net6ddcdfecdb11b159dd5763f57a9d4e578dee955b</guid></item><item><title>Quality filtering on minimum average quality</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/e3cdab2085/?limit=25#d46b</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Seems like it is done with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nv"&gt;math&lt;/span&gt;.&lt;span class="nv"&gt;log&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;sum&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;[&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;ord&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nv"&gt;c&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;line4&lt;/span&gt;]&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;len&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;line4&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;,&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;rather than just&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;sum&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;[&lt;span class="nv"&gt;ord&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nv"&gt;c&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;line4&lt;/span&gt;]&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;len&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;line4&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
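To make the difference concrete, here is the same comparison as a runnable sketch (plain Python; the example read is made up). Averaging in probability space lets a single very low-quality base drag the result far below the naive Phred average:

```python
import math

def avg_quality_prob(qual_string):
    # bbduk-style: convert each Phred+33 char to an error probability,
    # average the probabilities, then convert back to a Phred score
    probs = [10 ** (-(ord(c) - 33) / 10) for c in qual_string]
    return -10 * math.log10(sum(probs) / len(probs))

def avg_quality_naive(qual_string):
    # naive: arithmetic mean of the Phred scores themselves
    return sum(ord(c) - 33 for c in qual_string) / len(qual_string)

# Hypothetical read: 99 bases at Q40 ('I') plus a single Q2 ('#') base
line4 = "I" * 99 + "#"
print(avg_quality_naive(line4))  # approx 39.6
print(avg_quality_prob(line4))   # approx 21.9 -- the one bad base dominates
```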

&lt;p&gt;So, it's the quality score of the average probability.&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jake</dc:creator><pubDate>Wed, 21 Sep 2022 18:02:45 -0000</pubDate><guid>https://sourceforge.net3eb15986ecf3fa2c7047de72ce303f009c3d6430</guid></item><item><title>Quality filtering on minimum average quality</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/e3cdab2085/?limit=25#edf3</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I've been using bbduk for quality filtering and I pretty much just took it for granted. However, I've been looking at it because I usually use minavequality=15, which removes about 15-20% of our reads; minavequality=30 removes almost 100% of them.&lt;/p&gt;
&lt;p&gt;However when I do the quality averaging by hand, almost all of my reads have an average quality above 30. I've been taking the ASCII values of the quality scores and subtracting 33, summing them up and dividing by the length. &lt;/p&gt;
&lt;p&gt;How is bbduk computing the average quality?&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jake</dc:creator><pubDate>Fri, 16 Sep 2022 22:16:32 -0000</pubDate><guid>https://sourceforge.net2f277afec509f16136f82f7e50b00dacc7bb32d6</guid></item><item><title>Java issue</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/3c86b3b2c7/?limit=25#6c6b</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I am trying to run bbsplit:&lt;br/&gt;
bbsplit.sh build=1 threads=12 ref_x=$ref_Human ref_y=$ref_Mouse path=$indexded_ref&lt;/p&gt;
&lt;p&gt;I am running into the following error:&lt;br/&gt;
Error: Could not find or load main class align2.BBSplitter&lt;br/&gt;
Caused by: java.lang.ClassNotFoundException: align2.BBSplitter&lt;br/&gt;
srun: error: b001: task 0: Exited with exit code 1&lt;/p&gt;
&lt;p&gt;Is there a way to solve this issue?&lt;/p&gt;
&lt;p&gt;Thank you.&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Muku</dc:creator><pubDate>Wed, 03 Aug 2022 19:40:30 -0000</pubDate><guid>https://sourceforge.net893722121fe5a29d12c5bafbf581a5a0d287c40e</guid></item><item><title>Gene coverage detection output from pileup.sh</title><link>https://sourceforge.net/p/bbmap/discussion/general/thread/8390587524/?limit=25#8d37</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;I am currently using pileup.sh to determine the coverage of genes in my metagenomes by comparing my read mapping alignment file to the gene predictions of the contigs identified by Prodigal. I was wondering if anyone could tell me the output format of the gene_coverage.tmp file, and whether it has the same headers as the contig_coverage.tmp files? I am not sure if I should use the avgDepth value or the depthSum for my coverage value when I go to normalize these counts to counts per million (CPM).&lt;/p&gt;
&lt;p&gt;I am adding a screenshot that shows the column headers in this gene_coverage tmp file. Thanks for your help and for making such a great set of tools!&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Hannah Freund</dc:creator><pubDate>Mon, 20 Dec 2021 17:04:07 -0000</pubDate><guid>https://sourceforge.net126ca574b62ab34d6aea5e6fc2f1f108e2e030e4</guid></item></channel></rss>