HG002

The Human Pangenome Reference Consortium welcomes you to submit an assembly for HG002. Below are specific instructions for Assembly Submissions and Quality Evaluations and as well as links to the HG002 data.

Background: The Human Reference Pangenome Consortium has organized a collection of HG002 (NA24385) data to guide the development of new assembly and evaluation strategies for our reference production efforts.

We aaccepted assembly submissions for evaluation until March 16, 2020.

Details below.  Please contact Karen Miga (khmiga@soe.ucsc.edu) for instructions on uploading your submission.

The HPRC HG002-Data Freeze v1.0 is described here: 

https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0

Participating groups are recommended to use a subset of the overall data release (Downsampled data):

https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0#hg002-data-freeze-v10-recommended-downsampled-data-mix

Please note that we have included four flow cells of PacBio HiFi data (2x 15kb and 2x 20kb) to match our anticipated read length and quality in current production. However, teams are welcome to use 4x 15kb HiFi data if their method performance is improved with read libraries of similar length.

https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/15kb

We are requesting assembly submissions using the downsampled data, since it will most closely resemble our production data types. However, we also welcome additional assembly efforts using any combination of data from full HG002 Data Freeze v1.0.

Acceptable submission formats for HG002

Please submit your assembly using the following formats. Please submit no more than one assembly per group, although you can submit it using more than one of the following formats.

(1) Primary unphased assembly: A single fasta file (no base qualities) in which each contig / scaffold is unphased and the assembly is expected to be approximately the haploid size of the genome. Contigs / scaffolds must have unique names, defined by the first word in the the header line of each sequence record. e.g. a header line ‘>chr_1 other stuff\n’ would have the name ‘chr_1’ which must be unique.

(2) Primary + Secondary assembly: Two fasta files, one as in (1), the second a series of alternate contigs/scaffolds representing alternates haplotypes where the assembler has an alternative, but no attempt is made to specify the complete chromosome phasing. To identify the correspondence between the sequences in the primary assembly file and the secondary assembly file please append to chromosome names the following tags:

“pri” for a primary sequence.

“alt” for a secondary sequence.

Thus for a sequence named “chr_1” in the primary file with an alternative representation in the secondary file you would include a header line (minimally) “>chr_1pri\n” for the sequence in the primary file and a header line (minimally) “>chr_1alt\n” for the sequence in the secondary file. Using these tags will help us keep track should we choose to merge files, and maintain a unique sequence naming space. 

(3) Phased assembly: two fastas, with each contig/scaffold having a corresponding contig/scaffold in the other file, and with the expectation that they represent the two haplotypes at that location. Combined file size is expected to be approximately the size of the diploid genome. This format should be especially useful for fully phased chromosome scale assemblies.

For this format please use suffix tags “hap1” and “hap2”, in analogy to the naming scheme in (2), where the sequence names in first first file will be appended “hap1” and the sequence names in the second file will be appended with “hap2”. Where parental information was used to phase the genomes you are encouraged to submit paternal chromosomes as “hap1” and maternal chromosomes as “hap2”. For submissions that do not use parental information it is understood that the partition into “hap1” and “hap2” is arbitrary.

In addition to these three formats you may also replace fasta format files with fastq files, allowing you to specify base quailities for the assembly.

You may also submit an additional GFA file of your assembly in addition to one or more of the above, allowing us to experimentally evaluate using GFA files. If you submit a GFA you are encouraged to use the same naming conventions outlined above if possible.

Assembly quality considerations: Please aim to remove trailing Ns, adapter/vector contaminations, and keep Mitochondrial DNA separated.

HG002-Data Freeze v1.0 https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/