Talks

  • Succinct Indices for Range Queries with applications to Orthogonal Range Maxima
    Rajeev Raman
    We consider the problem of preprocessing $N$ points in 2D, each endowed with a priority, to answer the following queries: given an axis-parallel rectangle, determine the point with the largest priority in the rectangle. Using the ideas of the \emph{effective entropy} of range maxima queries and \emph{succinct indices} for range maxima queries, we obtain a structure that uses $O(N)$ words and answers the above query in $O(\log N \log \log N)$ time. This is a direct improvement of Chazelle’s result from 1985 \cite{Chazelle88} for this problem: Chazelle required $O(N/\epsilon)$ words to answer queries in $O((\log N)^{1+\epsilon})$ time for any constant $\epsilon > 0$.

    [Joint work with Arash Farzan and Ian Munro.]

  • A Faster Grammar-Based Self-Index
    Travis Gagie
    To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77.  In this paper we show how, given a straight-line program with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can store a self-index for $S$ in $O(r + z \log \log n)$ space such that, given a pattern $P[1..m]$, we can list the $occ$ occurrences of $P$ in $S$ in $O(m^2 + occ \log \log n)$ time.  If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the $O(m^2)$ term to $O(m \log m)$.  All previous self-indexes are either larger or slower in the worst case.

    [Joint work with Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich and Simon J. Puglisi.]
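    For readers unfamiliar with straight-line programs: an SLP is a grammar in which every nonterminal derives exactly one string. A minimal sketch (the rule names here are my own, purely illustrative) of expanding an SLP back into the text it represents:

```python
def expand(rules, symbol):
    """Expand an SLP symbol; symbols absent from `rules` are terminals."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])

# S -> XX, X -> ab derives the string "abab"
rules = {"S": ["X", "X"], "X": ["a", "b"]}
# expand(rules, "S") == "abab"
```

    A grammar with $r$ rules can derive a string of length exponential in $r$, which is why grammar-based self-indexes can be so much smaller than the text.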

  • Time-Space Trade-Offs for Longest Common Extensions
    Benjamin Sach
    We revisit the longest common extension (LCE) problem, that is, preprocess a string $T$ into a compact data structure that supports fast LCE queries. An LCE query takes a pair $(i,j)$ of indices in $T$ and returns the length of the longest common prefix of the suffixes of $T$ starting at positions $i$ and $j$. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let $n$ be the length of $T$. Given a parameter $\tau$, $1 \leq \tau \leq n$, we show how to achieve either $O(n/\sqrt{\tau})$ space and $O(\tau)$ query time, or $O(n/\tau)$ space and $O(\tau \log({|\mathrm{LCE}(i,j)|}/{\tau}))$ query time, where $|\mathrm{LCE}(i,j)|$ denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when $\tau=1$ or $\tau=n$. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. Finally, we also present an efficient technique to reduce LCE queries on two strings to one string.
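    As a point of reference for the trade-off, here is the naive LCE query with no preprocessing at all, a sketch corresponding roughly to the constant-extra-space end of the spectrum; it runs in time proportional to the answer:

```python
def lce(T, i, j):
    """Length of the longest common prefix of the suffixes T[i:] and T[j:]."""
    k = 0
    n = len(T)
    while i + k < n and j + k < n and T[i + k] == T[j + k]:
        k += 1
    return k

# lce("banana", 1, 3) == 3, since "anana" and "ana" share the prefix "ana"
```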
  • Regular Expression Matching: History, Status, and Challenges
    Philip Bille
    In this talk I’ll give a concise survey of the classical regular expression matching problem. I’ll discuss the main algorithmic techniques for worst-case efficient solutions and present a selection of open problems.
  • Cache-Oblivious Implicit Predecessor Dictionaries with the Working-Set Property
    Casper Kejlberg-Rasmussen
    In this paper we present an implicit dynamic dictionary with the working-set property, supporting insert(e) and delete(e) in O(log n) time, predecessor(e) in O(log l_{p(e)}) time, successor(e) in O(log l_{s(e)}) time and search(e) in O(log min(l_{p(e)},l_{e}, l_{s(e)})) time, where n is the number of elements stored in the dictionary, l_{e} is the number of distinct elements searched for since element e was last searched for and p(e) and s(e) are the predecessor and successor of e, respectively. The time-bounds are all worst-case. The dictionary stores the elements in an array of size n using no additional space. In the cache-oblivious model the log is base B and the cache-obliviousness is due to our black box use of an existing cache-oblivious implicit dictionary. This is the first implicit dictionary supporting predecessor and successor searches in the working-set bound. Previous implicit structures required O(log n) time.
  • Tree Compression with Top Trees
    Oren Weimann
    I will describe a new compression scheme for labeled trees based on top trees. Our approach is to replace the input tree with another tree that is in essence equivalent but is more compressible. The idea of transforming the input into a more compressible one is analogous to the Burrows-Wheeler transform for strings.
    Our new compression scheme is the first to take advantage of internal repeats in the tree (as opposed to the classical DAG compression that only exploits rooted subtree repeats) while also supporting fast navigational queries directly on the compressed representation.
    We show that the new compression scheme achieves close to optimal worst-case compression, can compress exponentially better than DAG compression, is never much worse than DAG compression, and supports navigational queries in logarithmic time.
    [Joint work with Philip Bille, Inge Li Gortz, and Gad M. Landau.]
  • Partial Persistence for Dynamic Planar Range Maxima Reporting
    Kostas Tsakalidis
    In this talk I will present a generic technique to obtain fully dynamic data structures
    that support deletions in $O(insertion)$ complexity. In particular, I will show how to
    augment a dynamic search tree by use of partially persistent secondary structures.
    A persistent data structure is a dynamic data structure that remembers its versions
    as update operations are performed on it. It is called partially persistent when only
    the latest version is updatable, while the previous versions can only be queried.
    In [ICALP’11] we apply the technique to the problem of maintaining the maximal points (points that are not dominated by any other point) of a planar point set, under insertions and deletions of points, and obtain improved deletion time over previous efforts for the pointer machine and the RAM models.
    [Joint work with Gerth Stølting Brodal]
  • Local Exact Pattern Matching for Non-fixed RNA Structures
    Mika Amit
    Detecting local common sequence-structure regions of RNAs is a biologically meaningful problem. By detecting such regions, biologists are able to identify functional similarity between the inspected molecules.
    We developed dynamic programming algorithms for finding common structure-sequence patterns between two RNAs.
    The RNAs are given by their sequence and a set of potential base pairs with associated probabilities. In contrast to prior work, which matches fixed structures, we support the \textit{arc breaking} edit operation; this allows matching only a subset of the given base pairs.
    We present an $O(n^3)$ algorithm for local exact pattern matching between two nested RNAs, and an $O(n^3\log n)$ algorithm for one nested RNA and one bounded-unlimited RNA. In addition, an algorithm for approximate pattern matching is introduced that, for two given nested RNAs and a number $k$, finds the maximal local pattern matching score between the two RNAs with at most $k$ mismatches in $O(n^3k^2)$ time.
    Finally, we present an $O(n^3)$ algorithm for finding the most similar subforest between two nested RNAs.
    [Joint work with Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Möhl, Christina Schmiedl, Sebastian Will]
  • Detecting Approximate Periodic Patterns in Sub-Cubic Time
    Avivit Levy 
    Given $\epsilon\in[0, 1)$, the {\em Relative Error Periodic Pattern
    Problem (REPP)} is the following:\\
    {\sl INPUT:} An $n$-long sequence $S$ of numbers $s_i\in \mathbb{N}$
    in increasing order.\\
    {\sl OUTPUT:} The longest $\epsilon$-relative error periodic pattern,
    i.e.,\ the longest subsequence $s_{i_1}, s_{i_2},\ldots, s_{i_k}$ of
    $S$, for which there exists a number $p$ such that the absolute
    difference between any two consecutive numbers in the subsequence is
    at least $p$ and at most $p(1 + \epsilon)$.\\
    The best known algorithm for this problem has $O(n^3)$ time
    complexity. This bound is too high for large inputs in practice. In
    this paper we give a new algorithm for finding the longest relative
    error periodic pattern (the REPP problem). Our algorithm is based on a
    new method that considers another parameter of the input
    sequence. Though our method is not guaranteed to work in sub-cubic
    time for all inputs, we prove its \emph{worst case} guarantee using
    this new parameter of the input. This enables us to prove that our
    algorithm works in sub-cubic time on inputs for which the best known
    algorithm works in $O(n^3)$ time. In fact, in many practical
    situations our algorithm works in $\tilde{O}(n^2)$ time.
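    To make the output condition concrete, here is a small checker (the function and parameter names are mine, not from the paper) that verifies a candidate subsequence is an $\epsilon$-relative error periodic pattern with approximate period $p$:

```python
def is_periodic_pattern(subseq, p, eps):
    """Every consecutive gap must lie in the interval [p, p*(1+eps)]."""
    return all(p <= b - a <= p * (1 + eps)
               for a, b in zip(subseq, subseq[1:]))

# gaps 10, 11, 10 all lie in [10, 11], so this is a valid pattern
# for p = 10 and eps = 0.1
assert is_periodic_pattern([3, 13, 24, 34], p=10, eps=0.1)
```

    The REPP problem asks for the longest such subsequence over all choices of $p$, which is what makes the naive search cubic.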
  • RNA Tree Comparisons Via Unrooted Unordered Mappings
    Nimrod Milo
    RNA secondary structures are often represented as trees, motivating the application of tree comparison-based computational tools for the detection of structural similarities and common motifs. We generalize some current approaches for the alignments between RNA trees to also consider unordered unrooted mappings. This is motivated by the drive to extend RNA structural comparisons to accommodate additional possible evolutionary phenomena (e.g. segment insertions, translocations and reversals) as well as other new variations in structural conservation which were not considered before.
    The problem of unrooted unordered tree edit distance is known to be MAX-SNP hard. Hence, we focus here on a constrained tree comparison problem, based on a homeomorphic tree mapping. The new problem variant is a generalization of several previous approaches, extending them to support both the unrooting of the trees in the alignment and the unordering of their branch mappings. This problem variant is fully symmetric, allowing deletions from both of the aligned trees and thus eliminating the “text-pattern” restriction which was required in a previous related work. We propose a new algorithm for this problem, which applies to several modes, including global or local alignment of trees and supporting both ordered and unordered, rooted and unrooted subtree mappings. Our new algorithm maintains the cubic time complexity of the previous, less general variants of the problem. Specifically, our time bound is $O(n_T n_S + L_T L_S \min(d_T, d_S))$, where $T$ and $S$ are the input trees and $n_T$, $L_T$ and $d_T$ are the number of nodes, the number of leaves and the maximal degree of $T$, respectively.
    We implemented the algorithm as a graphical software tool (web interface available at http://www.cs.bgu.ac.il/~negevcb/FRUUT), which computes and displays similarities between secondary structures of RNAs given as input, and applied it in a preliminary experiment in which we ran all-against-all pairwise alignments of RNase P RNA family members. Our results demonstrate that unrooted unordered alignment of RNA trees can expose structural similarity between some members of the family more thoroughly than the traditional rooted ordered alignment approaches.
    [Joint work with Shay Zakov, Erez Katzenelson, Eitan Bachmat , Yefim Dinitz and Michal Ziv-Ukelson.]
  • Gene bi-targeting by viral and human miRNAs
    Isana Veksler-Lublinsky
    MicroRNAs (miRNAs) are an abundant class of small noncoding RNAs that can affect gene expression by post-transcriptional regulation of mRNAs. They play important roles in several biological processes (e.g., development and cell cycle regulation). Some viral organisms also encode miRNAs, a fact that contributes to the complex interactions between viruses and their hosts. A need arises to understand the functional relationship between viral and host miRNAs and their effect on viral and host genes. Our approach to meet this challenge is to identify modules in which viral and host miRNAs cooperatively regulate host gene expression. We present a method to identify groups of viral and host miRNAs that cooperate in post-transcriptional gene regulation, together with their target genes that are involved in similar biological processes. We call these groups (genes and miRNAs of human and viral origin) modules. The modules are found by a new two-stage procedure, which we call bi-targeting, presented in this talk. The stages are (i) a new and efficient target prediction, and (ii) a new method for clustering objects of three different data types. The presented algorithm, which makes use of diverse biological data, is demonstrated to be an efficient approach for finding bi-targeting modules of viral and human miRNAs. These modules can contribute to a better understanding of viral-host interactions and the role that miRNAs play in them.

    [Joint work with Yonat Shemer-Avni, Klara Kedem and Michal Ziv-Ukelson]
  • Maximal-exponent Repeats
    Maxime Crochemore
    The exponent of a string is the quotient of its length over its
    smallest period. The exponent and the period of a string can be computed
    in time proportional to its length.
    We design an algorithm to compute the maximal exponent of factors of an
    overlap-free string. It runs in linear time on a fixed-size alphabet, while a
    naive solution to the problem would run in cubic time.
    The solution for strings that are not overlap-free derives from algorithms to compute
    all maximal repetitions, also called runs, occurring in the string.
    We show there is a linear number of maximal-exponent repeats in an overlap-free
    string. The algorithm can locate them all in linear time.
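    Computing the exponent of a whole string is itself a simple linear-time task: the smallest period equals the length minus the longest border (longest proper prefix that is also a suffix), which the classical KMP failure function provides. A sketch; the hard part of the talk is maximizing this quantity over all factors without paying cubic time:

```python
def exponent(s):
    """len(s) divided by the smallest period, via the KMP failure function."""
    n = len(s)
    fail = [0] * (n + 1)  # fail[i] = length of longest border of s[:i]
    k = 0
    for i in range(1, n):
        while k and s[i] != s[k]:
            k = fail[k]
        if s[i] == s[k]:
            k += 1
        fail[i + 1] = k
    return n / (n - fail[n])  # smallest period = n - longest border

# "abab" has period 2 and exponent 2.0; "abaab" has period 3 and exponent 5/3
```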
  • Exact Pattern Matching for RNA Structure Ensembles
    Sebastian Will
    ExpaRNA’s core algorithm computes, for two fixed RNA structures,
    a maximal non-overlapping set of maximal exact matchings. We
    introduce an algorithm ExpaRNA-P that solves the lifted problem of
    finding such sets of exact matchings in entire Boltzmann-distributed
    structure ensembles of two RNAs. Due to a novel kind of structural
    sparsification, the new algorithm maintains the time and space
    complexity of the algorithm for fixed input structures. Furthermore,
    we generalized the chaining algorithm of ExpaRNA in order to
    compute a compatible subset of ExpaRNA-P’s exact matchings. We
    show that ExpaRNA-P outperforms ExpaRNA in
    Bralibase 2.1 benchmarks, where we pass the chained exact
    matchings as anchor constraints to the RNA alignment tool
    LocARNA. Compared to LocARNA, this novel approach shows
    similar accuracy but is six times faster.
    [Joint work with Mika Amit, Rolf Backofen, Steffen Heyne, Gad M.
    Landau, Mathias Möhl, Christina Schmiedl]
  • Recent Progress on The Multi-State Perfect Phylogeny Problem
    Dan Gusfield
    The Multi-State Perfect Phylogeny Problem is a natural extension
    of the Binary Perfect Phylogeny (or Compatibility) Problem, allowing
    evolutionary characters to take on more than two states. The basic
    problem is to determine whether multi-state data can be derived on
    a multi-state perfect phylogeny. In population genetics terminology,
    the binary problem is to determine if data fits the “infinite-sites”
    model, while the multi-state problem is to determine if data fits
    the “infinite-alleles” model. The problem is of interest due to
    prior elegant mathematical and algorithmic results,
    and due to non-binary, but integer, data that is increasingly becoming
    available. In 1975, Buneman showed how to view the multi-state perfect
    phylogeny problem as a problem of triangulating non-chordal graphs,
    but that result has not been widely exploited as a computational tool,
    despite a robust and continuing literature on the problem of triangulating
    non-chordal graphs. In this talk, I discuss our recent work on
    exploiting the chordal graph approach (and briefly, two other approaches)
    to study multi-state perfect
    phylogeny problems. I will talk about several sets of results. The first set of results concerns problems that extend the utility
    of the multi-state perfect phylogeny model: The Missing Data (MD)
    Problem, where some entries in the input are missing and the question
    is whether (bounded) values for the missing data can be imputed so
    that the resulting data has a multi-state perfect phylogeny; The
    Character-Removal (CR) Problem, where we want to minimize the
    number of characters to remove from the data so that the resulting
    data has a multi-state perfect phylogeny. Using results on minimal
    triangulation of non-chordal graphs, we establish new necessary and
    sufficient conditions for the existence of a perfect phylogeny with
    (or without) missing data. This leads to solutions of the MD and
    CR problems using integer linear programming. The second set of results concerns generalization of the famous
    “four-gametes test” that characterizes the compatibility of binary
    data. Such a generalization was sought, but not found, in the 1970’s
    and 1980’s, and has been open since then. The generalization
    establishes that 3-state data has a perfect phylogeny if and only
    if every subset of three characters has a 3-state perfect phylogeny.
    We also characterize the patterns in the data that prohibit a 3-state
    perfect phylogeny. In 3-states, there are four obstructing patterns,
    rather than just one (four gametes) in the case of two states. We mention
    recent progress (by Fumei Lam) on generalizing this result to an arbitrary
    number of states. The third set of results concerns reductions of the multi-state problem
    to established techniques. We demonstrate that the 3-state problem
    reduces to the well-known and efficiently solvable 2-SAT problem. This builds
    on the efficient solution to the 3-state problem due to Dress and Steel. We next
    demonstrate that the general multi-state problem reduces to the binary problem
    with missing data, a problem that was discussed in a prior Newton Phylogeny meeting.
    Then, using tools developed for the binary case with missing data, we see how this
    approach solves the MD and CR problems. Finally, we show how to speed up
    triangulation methods by applying specification ideas.
    [Joint work with Kristian Stevens, R. Gysel, Y. Wu, F. Lam, S. Sridhar.]
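    The binary four-gametes test mentioned above is simple to state in code: a 0/1 character matrix (rows are taxa, columns are characters) admits a binary perfect phylogeny iff no pair of columns exhibits all four patterns 00, 01, 10, 11. A minimal sketch:

```python
from itertools import combinations

def four_gametes_ok(matrix):
    """True iff no column pair contains all four gametes 00, 01, 10, 11."""
    cols = list(zip(*matrix))
    return all(len(set(zip(a, b))) < 4
               for a, b in combinations(cols, 2))

# rows (1,0), (0,1), (1,1), (0,0) realize all four gametes: incompatible
assert not four_gametes_ok([(1, 0), (0, 1), (1, 1), (0, 0)])
```

    The 3-state characterization in the talk replaces this single obstruction with four obstructing patterns.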
  • Local Search for String Problems: Brute Force is Essentially Optimal
    Danny Hermelin
    Local search is a universal algorithmic approach for coping with
    computationally hard optimization problems. When analyzing heuristics of
    this type, one is interested in the amount of time required to compute a
    so-called “improvement step”, and the number of such improvement steps
    that are required in order to solve the problem at hand. In this talk I
    will examine the first question when applied to classical NP-hard string
    problems. I will show that for the Closest String problem in a natural
    local search formulation, one cannot compute an improvement step in
    n^{o(k)} time (under standard complexity-theoretic assumptions), while
    simple brute force allows this in n^{O(k)} time. Similar results hold also
    for Longest Common Supersequence, Shortest Common Superstring, and
    Shortest Common Supersequence.
    [Joint work with Jiong Guo and Christian Komusiewicz.]
  • Online Sorted Range Reporting
    Alex Lopez-Ortiz
    [Joint work with Gerth Brodal and Rolf Fagerberg.]
  • Succinct Posets
    Ian Munro
    We describe an algorithm for compressing a partially ordered set, or poset, so that it occupies space matching the information theory lower bound (to within lower order terms), in the worst case. Using this algorithm, we design a succinct data structure for representing a poset that, given two elements, can report whether one precedes the other in constant time. This is equivalent to succinctly representing the transitive closure graph of the poset, and we note that the same method can also be used to succinctly represent the transitive reduction graph. For an $n$ element poset, the data structure occupies $n^2/4 + o(n^2)$ bits, in the worst case, which is roughly half the space occupied by an upper triangular matrix. Furthermore, a slight extension to this data structure yields a succinct oracle for reachability in arbitrary directed graphs. Thus, using roughly a quarter of the space required to represent an arbitrary directed graph, reachability queries can be supported in constant time.
    [Joint work with Pat Nicholson.]
  • Pattern Matching in Multiple Streams
    Raphael Clifford
    I will introduce the problem of pattern matching in multiple
    streams. In this model, one symbol arrives at a time and is associated
    with one of s streaming texts. The task at each time step is to report
    if there is a new match between a fixed pattern of length m and a
    newly updated stream. As is usual in the streaming context, the goal
    is to use as little space as possible while still reporting matches
    quickly. We give almost matching upper and lower space bounds for
    three distinct pattern matching problems. For exact matching we show
    that the problem can be solved in constant time per arriving symbol
    and O(m+s) words of space. For the k-mismatch and k-difference
    problems we give O(k) time solutions that require O(m+ks) words of
    space. In all three cases we also give space lower bounds which show
    our methods are optimal up to a single logarithmic factor. Finally I
    will set out a number of open problems related to this new model for
    pattern matching.
    [Joint work with Markus Jalsenius, Ely Porat and Benjamin Sach.]
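    One natural way to see where an O(m + s)-word bound for exact matching could come from (a sketch of the obvious baseline, not necessarily the authors' construction): share a single KMP automaton of the pattern across all streams and keep just one automaton state per stream.

```python
class MultiStreamMatcher:
    def __init__(self, pattern, num_streams):
        self.p = pattern
        m = len(pattern)
        # KMP failure function: O(m) words, shared by all streams
        fail = [0] * (m + 1)
        k = 0
        for i in range(1, m):
            while k and pattern[i] != pattern[k]:
                k = fail[k]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i + 1] = k
        self.fail = fail
        # one automaton state per stream: O(s) words
        self.state = [0] * num_streams

    def feed(self, stream, c):
        """Process one symbol of one stream; True iff a match ends here."""
        k = self.state[stream]
        while k and c != self.p[k]:
            k = self.fail[k]
        if c == self.p[k]:
            k += 1
        matched = (k == len(self.p))
        if matched:
            k = self.fail[k]
        self.state[stream] = k
        return matched
```

    Note the caveat: the failure-link loop is only amortized constant time within each stream, whereas the result in the talk achieves worst-case constant time per arriving symbol.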
  • Fast Arc-Annotated Subsequence Matching in Linear Space
    Inge Li Gørtz
    An arc-annotated string is a string of characters, called bases, augmented with a set of pairs, called arcs, each connecting two bases. Given arc-annotated strings P and Q, the arc-preserving subsequence problem is to determine if P can be obtained from Q by deleting bases from Q. Whenever a base is deleted, any arc with an endpoint in that base is also deleted. Arc-annotated strings where the arcs are “nested” are a natural model of RNA molecules that captures both the primary and secondary structure of these molecules. The arc-preserving subsequence problem for nested arc-annotated strings is a basic primitive for investigating the function of RNA molecules. Gramm et al. [ACM Trans. Algorithms 2006] gave an algorithm for this problem using O(nm) time and space, where m and n are the lengths of P and Q, respectively. In this paper we present a new algorithm using O(nm) time and O(n + m) space, thereby matching the previous time bound while significantly reducing the space from quadratic to linear. This is essential for processing large RNA molecules, where space is likely to be a bottleneck. To obtain our result we introduce several novel ideas which may be of independent interest for related problems on arc-annotated strings.
    [Joint work with Philip Bille.]
  • Two Dimensional Range Minimum Queries and Fibonacci Lattices
    Moshe Lewenstein
    Given a matrix of size N, two dimensional range minimum
    queries (2D-RMQs) ask for the position of the minimum element in a
    rectangular range within the matrix. We study trade-offs between the
    query time and the additional space of data structures supporting
    2D-RMQs. As a novel technique, we use the discrepancy properties of
    Fibonacci lattices to achieve an indexing data structure of size
    O(N/c + c (log c)^{O(1)}) bits additional space with
    O(c log c (log log c)^2) query time. Also, when the entries of an
    input matrix are from a fixed size alphabet, we show that the query
    time can be improved to O(c log c log log c).
    [Joint work with Gerth Stølting Brodal, Pooya Davoodi, Rajeev Raman and Srinivasa Rao.]
  • Efficient Computation of Sparse Transforms
    Oren Kapah
    Efficient handling of sparse data is a key challenge in Computer Science.
    Binary convolutions, such as the Fast Fourier Transform or the Walsh
    Transform, are a useful tool in many applications and can be computed efficiently.
    In the last decade, several problems required efficient solution of
    sparse binary convolutions. Both randomized and deterministic
    algorithms were developed for efficiently computing the sparse FFT.

    The key operation in all these algorithms was length reduction. The
    sparse data is mapped into small vectors that preserve the convolution
    result. The reduction method used to-date was the modulo function
    since it preserves location (of the “1” bits) up to cyclic shift.
    To date there is no known efficient algorithm for computing the sparse
    Walsh Transform. Since the modulo function does not preserve the Walsh
    transform a new method for length reduction is needed.

    In this paper we present such a new method: polynomials.
    This method enables the development of an efficient algorithm for
    computing the binary sparse Walsh Transform. To our knowledge, this is
    the first such algorithm. We also show that this method allows a
    faster deterministic computation of the sparse FFT than currently
    known in the literature.
    [Joint work with A. Amir, E. Porat and A. Rothschild.]
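    The modulo length reduction for the FFT case can be illustrated directly: reducing indices mod p turns convolution of the long sparse vectors into cyclic convolution of length-p vectors, because indices add under convolution and (i + j) mod p = ((i mod p) + (j mod p)) mod p. A sketch with naive convolutions, just to exhibit the preserved structure:

```python
def reduce_mod(support, p):
    """Dense length-p image of a sparse 0/1 vector given by its support set."""
    v = [0] * p
    for i in support:
        v[i % p] += 1
    return v

def cyclic_convolve(a, b):
    p = len(a)
    out = [0] * p
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[(i + j) % p] += x * y
    return out

# convolving the reductions agrees with reducing the full convolution
A, B, p = {0, 5}, {3, 9}, 7
full = [i + j for i in A for j in B]  # support of the full convolution
assert cyclic_convolve(reduce_mod(A, p), reduce_mod(B, p)) == reduce_mod(full, p)
```

    Roughly speaking, the Walsh transform corresponds to convolution under XOR of indices, and (i XOR j) mod p generally differs from ((i mod p) XOR (j mod p)) mod p, which is why this modulo reduction fails there and a different reduction is needed.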

  • Binary Search Trees for Compression
    John Iacono
    Suppose you are receiving a string one character at a time from an
    alphabet of size n and wish to output a compressed representation of
    it. One method is to search for each character in a binary search tree
    and output the search path to the character. This ancient method has
    several benefits: it uses a number of words of space linear in the
    alphabet size and compresses in time linear in the number of bits of
    the output string. What are the types of patterns that will compress
    well in such a scheme? What is the best choice of binary search tree?
    What patterns will not compress well and are better served through
    other schemes?
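    A minimal sketch of the idea (the encoding details here are my own, chosen to keep the code prefix-free: at each node emit 0 for "stop here", 10 for "go left", 11 for "go right"). Characters near the root get short codewords, so a tree that places frequent characters high compresses well:

```python
def search_path_codes(alphabet):
    """Codeword per symbol = its search path in a balanced BST over the alphabet."""
    codes = {}
    def walk(lo, hi, path):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        codes[alphabet[mid]] = path + "0"   # comparison says "equal": stop
        walk(lo, mid - 1, path + "10")      # go left of this node
        walk(mid + 1, hi, path + "11")      # go right of this node
    walk(0, len(alphabet) - 1, "")
    return codes

def compress(text, codes):
    return "".join(codes[c] for c in text)

# over "abcd": the root "b" gets "0", "a" gets "100", "d" gets "11110"
```

    Replacing the balanced tree with a self-adjusting one is exactly the kind of choice the questions above are probing.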
  • Configurations and Minority in the String Consensus Problem
    Haim Paryenty
    The Closest String Problem is defined as follows. Let $S$ be
    a set of $k$ strings $\{s_1,\dots,s_k\}$, each of length
    $\ell$. The goal is to find a string $\hat{s}$ such that the maximum Hamming
    distance of $\hat{s}$ from each of the strings is minimized.
    We denote this distance by $d$. The string $\hat{s}$ is called
    a consensus string. In this paper we present two main algorithms:
    the {\tt Configuration} algorithm, with $O(k^2 \ell ^ k)$ running time,
    and the {\tt Minority} algorithm.

    Despite the great effort to obtain
    efficient algorithms for this problem, an algorithm with the natural
    running time of $O(\ell ^ k)$ was not known. In this paper we close
    this gap.

    Our result means that algorithms solving the closest string problem in
    times $O(\ell^2), O(\ell^3), O(\ell^4)$ and $O(\ell^5)$ exist for the
    cases of $k=2,3,4$ and $5$, respectively. It is known that, in fact,
    the cases of $k=2,3,$ and $4$ can be solved in linear time. No
    efficient algorithm is currently known for the case of $k=5$. We prove
    two lemmas, the {\em unit square lemma} and the {\em minority lemma}
    that exploit surprising properties of the closest string problem and enable
    constructing the closest string in a sequential fashion.
    These lemmas, with some additional ideas, give an $O(\ell^2)$ algorithm for computing a closest string
    of $5$ binary strings.
    [Joint work with A. Amir and L. Roditty.]
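    For intuition, here is the trivial exponential baseline (enumerating all $|\Sigma|^\ell$ candidates; this is only an illustration of the objective, not the paper's configuration or minority technique):

```python
from itertools import product

def radius(candidate, strings):
    """Max Hamming distance from the candidate to any input string."""
    return max(sum(a != b for a, b in zip(candidate, s)) for s in strings)

def closest_string_naive(strings, alphabet="01"):
    """Enumerate every candidate string; feasible only for tiny lengths."""
    l = len(strings[0])
    return min(("".join(c) for c in product(alphabet, repeat=l)),
               key=lambda c: radius(c, strings))

# for {"000", "011", "001"} the consensus "001" achieves radius d = 1
```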