Correcting Errors in DNA Storage

Citation

Sima, Jin (2022) Correcting Errors in DNA Storage. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/kdph-6z71. https://resolver.caltech.edu/CaltechTHESIS:06062022-185003284

Abstract

DNA-based storage has potentially unprecedented advantages of high information density and long duration, and is one of the promising techniques to meet the ever-growing demands to keep data in the future. As noise and errors are present in almost every procedure during reading, writing, and storing of information in DNA storage systems, error correction is inevitable to guarantee reliable data storage in DNA. Moreover, it is often required that error correction is done in an efficient manner to reduce the cost and time needed for reading and writing data. Due to the technology constraints and physical limitations, error correction in DNA-based storage poses the following challenges that differ from those in traditional digital data transmission and storage systems.

1. A combination of deletion, insertion, and substitution errors present. The goal is to construct efficient codes correcting these errors. While substitution errors are special cases of deletion and insertion errors, and are well studied under the current theory and practice frameworks, deletion and insertion errors are much more difficult to deal with, and less understanding was gained for deletion and insertion errors.

2. Error correction is over an unordered set of strings, rather than over a single string, which can be regarded as a set of ordered strings. The latter, which includes the above deletion/insertion coding problem, is commonly studied for current digital communication and storage systems. Our goal is to extend the deletion/insertion correction capability for a single string to a set of unordered strings.

3. The decoder observes multiple noisy copies of every coded string. The problem is to deduce a set of strings (or a single string) from a collection of their noisy samples, also studied as the population recovery (or trace reconstruction for a single string) problem. The problem is well answered with substitution errors only and becomes elusive with the introduction of deletion and insertion errors.

This thesis tries to address the above challenges. For the first challenge, we proposed binary codes correct any constant number of deletions and/or insertions with order-wise optimal redundancy, which made a step toward a solution to a longstanding open problem introduced by Levenshtein in 1960s. We also extended it to different settings, in particular, non-binary deletin/insertion correcting codes suitable for DNA storage applications.

For the second challenge, we established lower and upper bounds on the optimal redundancy of codes correcting any number of substitution, deletion, and insertion errors and found that the redundancy needed for coding over an unordered set of strings is order-wise the same as that needed for coding over a ordered set of strings. Using our results for the first challenge, we proposed codes correcting any constant number of deletion/insertion errors with order-wise optimal redundancy under some parametter settings.

For the third challenge, we studied the problem of trace reconstruction, which asks the number of noisy samples needed to reconstruct a single string. While there is a exponential gap between upper and lower bounds on sample complexities in general, we showed that a polynomial number of samples suffice, given a reference string that is within constant edit distance from the target string.

Apart from dealing with the above challenges, we investigated error correction for multi-head racetrack memory applications. The problem can be considered as correcting any constant number of deletions/insertions in a single string with multiple noisy copies, with the help of coding. Different from the settings we considered above in the trace reconstruction problem, where noisy copies are independent given the target string, in racetrack memory, the noisy copies are correlated, and the number of errors is small compared to the trace reconstruction problem. We derived a lower bound on redundancy and proposed a code correcting any number of deletions/insertions with order-wise optimal redundancy.

Item Type:

Thesis (Dissertation (Ph.D.))

Subject Keywords:

DNA Storage; Error Correction Codes

Degree Grantor:

California Institute of Technology

Division:

Engineering and Applied Science

Major Option:

Electrical Engineering

Awards:

Charles and Ellen Wilts Prize, 2022.

Thesis Availability:

Public (worldwide access)

Research Advisor(s):

Bruck, Jehoshua

Thesis Committee:

Kostina, Victoria (chair)
Hassibi, Babak
Raviv, Netanel
Bruck, Jehoshua

Defense Date:

25 May 2022

Record Number:

CaltechTHESIS:06062022-185003284

Persistent URL:

https://resolver.caltech.edu/CaltechTHESIS:06062022-185003284

DOI:

10.7907/kdph-6z71

Related URLs:

URL	URL Type	Description
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9369384	Publisher	Chapter V
https://ieeexplore.ieee.org/ielaam/18/9039774/8886525-aam.pdf	Publisher	Chapter II
https://ieeexplore.ieee.org/document/8849750	Publisher	Chapter III
https://ieeexplore.ieee.org/document/9518244	Publisher	Chapter VIII
https://ieeexplore.ieee.org/document/9174447	Publisher	appeared in part of Chapter VI
https://ieeexplore.ieee.org/document/9174241	Publisher	appeared in part of Chapter IV
https://ieeexplore.ieee.org/document/9173986	Publisher	appeared in part of Chapter IV
https://ieeexplore.ieee.org/document/9174009	Publisher	appeared in part of Chapter IV
https://ieeexplore.ieee.org/document/8849783	Publisher	Chapter VII

ORCID:

Author	ORCID
Sima, Jin	0000-0003-4588-9790

Default Usage Policy:

No commercial reproduction, distribution, display or performance rights in this work are provided.

ID Code:

14951

Collection:

CaltechTHESIS

Deposited By:

Jin Sima

Deposited On:

07 Jun 2022 21:48

Last Modified:

04 Aug 2022 00:08

Thesis Files

PDF - Final Version
See Usage Policy.
1MB

Repository Staff Only: item control page