The Human Pangenome Reference Consortium asked the scientific community to submit assemblies for the HG002 assembly comparision project.
Background: The Human Reference Pangenome Consortium has organized a collection of HG002 (NA24385) data to guide the development of new assembly and evaluation strategies for our reference production efforts.
We accepted assembly submissions for evaluation until March 16, 2020.
Details below. Please contact Karen Miga (firstname.lastname@example.org) and Erich Jarvis (email@example.com) for details about submitted assemblies.
The HPRC HG002-Data Freeze v1.0 is described here:
Participating groups are recommended to use a subset of the overall data release (Downsampled data):
Please note that we have included four flow cells of PacBio HiFi data (2x 15kb and 2x 20kb) to match our anticipated read length and quality in current production. However, teams are welcome to use 4x 15kb HiFi data if their method performance is improved with read libraries of similar length.
We are requesting assembly submissions using the downsampled data, since it will most closely resemble our production data types. However, we also welcome additional assembly efforts using any combination of data from full HG002 Data Freeze v1.0.
Acceptable submission formats for HG002
Submitted assemblies utilized the following formats. No more than one assembly per group was submitted, although investigators were allowed to submit more than one of the following formats.
(1) Primary unphased assembly: A single fasta file (no base qualities) in which each contig / scaffold is unphased and the assembly is expected to be approximately the haploid size of the genome. Contigs / scaffolds must have unique names, defined by the first word in the the header line of each sequence record. e.g. a header line ‘>chr_1 other stuff\n’ would have the name ‘chr_1’ which must be unique.
(2) Primary + Secondary assembly: Two fasta files, one as in (1), the second a series of alternate contigs/scaffolds representing alternates haplotypes where the assembler has an alternative, but no attempt is made to specify the complete chromosome phasing. To identify the correspondence between the sequences in the primary assembly file and the secondary assembly file please append to chromosome names the following tags:
“pri” for a primary sequence.
“alt” for a secondary sequence.
Thus for a sequence named “chr_1” in the primary file with an alternative representation in the secondary file you would include a header line (minimally) “>chr_1pri\n” for the sequence in the primary file and a header line (minimally) “>chr_1alt\n” for the sequence in the secondary file. Using these tags will help us keep track should we choose to merge files, and maintain a unique sequence naming space.
(3) Phased assembly: two fastas, with each contig/scaffold having a corresponding contig/scaffold in the other file, and with the expectation that they represent the two haplotypes at that location. Combined file size is expected to be approximately the size of the diploid genome. This format should be especially useful for fully phased chromosome scale assemblies.
For this format please use suffix tags “hap1” and “hap2”, in analogy to the naming scheme in (2), where the sequence names in first first file will be appended “hap1” and the sequence names in the second file will be appended with “hap2”. Where parental information was used to phase the genomes you are encouraged to submit paternal chromosomes as “hap1” and maternal chromosomes as “hap2”. For submissions that do not use parental information it is understood that the partition into “hap1” and “hap2” is arbitrary.
In addition to these three formats were also able to replace fasta format files with fastq files, allowing you to specify base qualities for the assembly.
Investigators also submitted an additional GFA file of assemblies in addition to one or more of the above, allowing us to experimentally evaluate using GFA files. If investigators submitted a GFA they were encouraged to use the same naming conventions outlined above if possible.
Assembly quality considerations: Asked to remove trailing Ns, adapter/vector contaminations, and keep Mitochondrial DNA separated.