# Talks

• Succinct Indices for Range Queries with applications to Orthogonal Range Maxima
Rajeev Raman
We consider the problem of preprocessing $N$ points in 2D, each endowed with a priority, to answer the following queries: given an axis-parallel rectangle, determine the point with the largest priority in the rectangle. Using the ideas of the \emph{effective entropy} of range maxima queries and \emph{succinct indices} for range maxima queries, we obtain a structure that uses $O(N)$ words and answers the above query in $O(\log N \log \log N)$ time. This is a direct improvement of Chazelle’s result from 1985 \cite{Chazelle88} for this problem: Chazelle required $O(N/\epsilon)$ words to answer queries in $O((\log N)^{1+\epsilon})$ time for any constant $\epsilon > 0$.

[Joint work with Arash Farzan and Ian Munro.]
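
As a point of reference for the bounds above, a minimal brute-force baseline (the function name is illustrative, not from the paper) answers the same query by scanning all points, taking $O(N)$ time per query, compared to the structure's $O(\log N \log \log N)$ query time in $O(N)$ words:

```python
def range_maximum(points, x1, x2, y1, y2):
    """Return the (x, y, priority) point of largest priority inside
    the axis-parallel rectangle [x1, x2] x [y1, y2], or None if the
    rectangle contains no point. Linear scan: O(N) per query."""
    best = None
    for (x, y, pri) in points:
        if x1 <= x <= x2 and y1 <= y <= y2:
            if best is None or pri > best[2]:
                best = (x, y, pri)
    return best

pts = [(1, 5, 10), (2, 2, 40), (4, 4, 30), (6, 1, 20)]
print(range_maximum(pts, 1, 5, 1, 5))  # (2, 2, 40)
```
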

• A Faster Grammar-Based Self-Index
Travis Gagie
To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77.  In this paper we show how, given a straight-line program with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can store a self-index for $S$ in $O(r + z \log \log n)$ space such that, given a pattern $P[1..m]$, we can list the $\mathrm{occ}$ occurrences of $P$ in $S$ in $O(m^2 + \mathrm{occ} \log \log n)$ time.  If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the $O(m^2)$ term to $O(m \log m)$.  All previous self-indexes are either larger or slower in the worst case.

[Joint work with Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich and Simon J. Puglisi.]
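
For readers unfamiliar with straight-line programs: an SLP is a context-free grammar in which each nonterminal has exactly one rule and therefore derives exactly one string. The sketch below (the rule encoding, a dict mapping a symbol to either a terminal character or a pair of symbols, is a hypothetical format for illustration) expands an SLP to the string it derives:

```python
def expand(rules, symbol, memo=None):
    """Expand an SLP symbol to the unique string it derives.
    Memoization ensures each rule is expanded only once."""
    if memo is None:
        memo = {}
    if symbol in memo:
        return memo[symbol]
    rhs = rules[symbol]
    if isinstance(rhs, str):      # terminal rule, e.g. X -> 'a'
        result = rhs
    else:                         # binary rule, e.g. X -> Y Z
        left, right = rhs
        result = expand(rules, left, memo) + expand(rules, right, memo)
    memo[symbol] = result
    return result

# r = 5 rules deriving the string "abaab" (n = 5):
rules = {1: 'a', 2: 'b', 3: (1, 2), 4: (3, 1), 5: (4, 3)}
print(expand(rules, 5))  # abaab
```

Because each rule can double the derived length, an SLP with $r$ rules can derive a string of length exponential in $r$, which is why SLP-based self-indexes can be far smaller than the text they index.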

• Time-Space Trade-Offs for Longest Common Extensions
Benjamin Sach
We revisit the longest common extension (LCE) problem, that is, preprocess a string $T$ into a compact data structure that supports fast LCE queries. An LCE query takes a pair $(i,j)$ of indices in $T$ and returns the length of the longest common prefix of the suffixes of $T$ starting at positions $i$ and $j$. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let $n$ be the length of $T$. Given a parameter $\tau$, $1 \leq \tau \leq n$, we show how to achieve either $O(n/\sqrt{\tau})$ space and $O(\tau)$ query time, or $O(n/\tau)$ space and $O(\tau \log(|\mathrm{LCE}(i,j)|/\tau))$ query time, where $|\mathrm{LCE}(i,j)|$ denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when $\tau=1$ or $\tau=n$. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. Finally, we also present an efficient technique to reduce LCE queries on two strings to one string.

• Regular Expression Matching: History, Status, and Challenges
Philip Bille
In this talk I’ll give a concise survey of the classical regular expression matching problem. I’ll discuss the main algorithmic techniques for worst-case efficient solutions and present a selection of open problems.
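
One of the classical techniques for regular expression matching is Thompson's simulation of the pattern's nondeterministic finite automaton: keep the set of active states and advance it once per text character, for $O(nm)$ worst-case time on a text of length $n$ and pattern of length $m$. A minimal sketch, with a hand-built automaton for the regex (a|b)*abb (the automaton encoding is illustrative only):

```python
def eps_closure(states, eps):
    """All states reachable from `states` via epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def nfa_match(text, start, accept, delta, eps):
    """State-set simulation: does the NFA accept the whole text?"""
    current = eps_closure({start}, eps)
    for ch in text:
        moved = {t for s in current for t in delta.get((s, ch), ())}
        current = eps_closure(moved, eps)
    return accept in current

# NFA for (a|b)*abb: state 0 loops on a/b, then the suffix abb
# is read through states 1, 2, 3 (accepting).
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}, (2, 'b'): {3}}
eps = {}
print(nfa_match("ababb", 0, 3, delta, eps))  # True
print(nfa_match("abab", 0, 3, delta, eps))   # False
```
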
• Cache-Oblivious Implicit Predecessor Dictionaries with the Working-Set Property
Casper Kejlberg-Rasmussen
In this paper we present an implicit dynamic dictionary with the working-set property, supporting insert($e$) and delete($e$) in $O(\log n)$ time, predecessor($e$) in $O(\log l_{p(e)})$ time, successor($e$) in $O(\log l_{s(e)})$ time and search($e$) in $O(\log \min(l_{p(e)}, l_{e}, l_{s(e)}))$ time, where $n$ is the number of elements stored in the dictionary, $l_{e}$ is the number of distinct elements searched for since element $e$ was last searched for, and $p(e)$ and $s(e)$ are the predecessor and successor of $e$, respectively. The time bounds are all worst-case. The dictionary stores the elements in an array of size $n$ using no additional space. In the cache-oblivious model the $\log$ is base $B$, and the cache-obliviousness is due to our black-box use of an existing cache-oblivious implicit dictionary. This is the first implicit dictionary supporting predecessor and successor searches in the working-set bound; previous implicit structures required $O(\log n)$ time.

• Tree Compression with Top Trees
Oren Weimann
I will describe a new compression scheme for labeled trees based on top trees. Our approach is to replace the input tree with another tree that is in essence equivalent but is more compressible. The idea of transforming the input into a more compressible one is analogous to the Burrows-Wheeler transform for strings. Our new compression scheme is the first to take advantage of internal repeats in the tree (as opposed to the classical DAG compression that only exploits rooted subtree repeats) while also supporting fast navigational queries directly on the compressed representation. We show that the new compression scheme achieves close to optimal worst-case compression, can compress exponentially better than DAG compression, is never much worse than DAG compression, and supports navigational queries in logarithmic time.

[Joint work with Philip Bille, Inge Li Gørtz, and Gad M. Landau.]
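
For contrast with the top-tree approach, the classical DAG compression mentioned above can be sketched in a few lines: identical rooted subtrees are hash-consed so each distinct subtree is stored exactly once (the `(label, [children])` tree encoding is hypothetical, for illustration):

```python
def dag_compress(tree, table=None):
    """Return a canonical id for `tree`; identical rooted subtrees
    receive the same id, so `table` holds one entry per distinct
    subtree (the nodes of the compressed DAG)."""
    if table is None:
        table = {}
    label, children = tree
    key = (label, tuple(dag_compress(c, table) for c in children))
    if key not in table:
        table[key] = len(table)   # fresh id for a new distinct subtree
    return table[key]

# A tree whose subtree b(a, a) is repeated under the root:
t = ('r', [('b', [('a', []), ('a', [])]),
           ('b', [('a', []), ('a', [])])])
table = {}
dag_compress(t, table)
print(len(table))  # 3 distinct subtrees: a, b(a, a), and the root
```

Note that this sharing only catches repeats that are whole rooted subtrees; the internal (e.g. path-shaped) repeats exploited by top-tree compression are invisible to it.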
• Partial Persistence for Dynamic Planar Range Maxima Reporting
Kostas Tsakalidis
In this talk I will present a generic technique to obtain fully dynamic data structures that support deletions in $O(\mathrm{insertion})$ complexity. In particular, I will show how to augment a dynamic search tree by use of partially persistent secondary structures. A persistent data structure is a dynamic data structure that remembers its versions as update operations are performed on it. It is called partially persistent when only the latest version is updatable, while the previous versions can only be queried. In [ICALP’11] we apply the technique to the problem of maintaining the maximal points (points that are not dominated by any other point) of a planar pointset, under insertions and deletions of points, and obtain improved deletion time over previous efforts for the pointer machine and the RAM models.

[Joint work with Gerth Stølting Brodal.]

• Local Exact Pattern Matching for Non-fixed RNA Structures
Mika Amit
Detecting local common sequence-structure regions of RNAs is a biologically meaningful problem. By detecting such regions, biologists are able to identify functional similarity between the inspected molecules. We developed dynamic programming algorithms for finding common structure-sequence patterns between two RNAs. The RNAs are given by their sequence and a set of potential base pairs with associated probabilities. In contrast to prior work, which matches fixed structures, we support the \textit{arc breaking} edit operation; this allows matching only a subset of the given base pairs. We present an $O(n^3)$ algorithm for local exact pattern matching between two nested RNAs, and an $O(n^3 \log n)$ algorithm for one nested RNA and one bounded-unlimited RNA. In addition, an algorithm for approximate pattern matching is introduced that, for two given nested RNAs and a number $k$, finds the maximal local pattern matching score between the two RNAs with at most $k$ mismatches in $O(n^3 k^2)$ time.
Finally, we present an $O(n^3)$ algorithm for finding the most similar subforest between two nested RNAs.

[Joint work with Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Möhl, Christina Schmiedl and Sebastian Will.]

• Detecting Approximate Periodic Patterns in Sub-Cubic Time
Avivit Levy
Given $\epsilon\in[0, 1)$, the {\em Relative Error Periodic Pattern Problem (REPP)} is the following:\\
{\sl INPUT:} An $n$-long sequence $S$ of numbers $s_i\in \mathbb{N}$ in increasing order.\\
{\sl OUTPUT:} The longest $\epsilon$-relative error periodic pattern, i.e., the longest subsequence $s_{i_1}, s_{i_2},\ldots, s_{i_k}$ of $S$, for which there exists a number $p$ such that the absolute difference between any two consecutive numbers in the subsequence is at least $p$ and at most $p(1 + \epsilon)$.\\
The best known algorithm for this problem has $O(n^3)$ time complexity. This bound is too high for large inputs in practice. In this paper we give a new algorithm for finding the longest relative error periodic pattern (the REPP problem). Our algorithm is based on a new method that considers another parameter of the input sequence. Though our method is not guaranteed to work in sub-cubic time for all inputs, we prove its \emph{worst case} guarantee using this new parameter of the input. This enables us to prove that our algorithm works in sub-cubic time on inputs for which the best known algorithm works in $O(n^3)$ time. In fact, in many practical situations our algorithm works in $\tilde{O}(n^2)$ time.

• RNA Tree Comparisons Via Unrooted Unordered Mappings
Nimrod Milo
RNA secondary structures are often represented as trees, motivating the application of tree comparison-based computational tools for the detection of structural similarities and common motifs. We generalize some current approaches for the alignment of RNA trees to also consider unordered unrooted mappings.
This is motivated by the drive to extend RNA structural comparisons to accommodate additional possible evolutionary phenomena (e.g. segment insertions, translocations and reversals) as well as other new variations in structural conservation which were not considered before. The problem of unrooted unordered tree edit distance is known to be MAX-SNP hard. Hence, we focus here on a constrained tree comparison problem, based on a homeomorphic tree mapping. The new problem variant is a generalization of several previous approaches, extending them to support both the unrooting of the trees in the alignment and the unordering of their branch mappings. This problem variant is fully symmetric, allowing deletions from both of the aligned trees and thus eliminating the “text-pattern” restriction which was required in previous related work. We propose a new algorithm for this problem, which applies to several modes, including global or local alignment of trees, and supporting both ordered and unordered, rooted and unrooted subtree mappings. Our new algorithm maintains the cubic time complexity of the previous, less general variants of the problem. Specifically, our time bound is $O(n_T n_S + L_T L_S \min(d_T, d_S))$, where $T$ and $S$ are the input trees and $n_T$, $L_T$ and $d_T$ are the number of nodes, the number of leaves and the maximal degree of $T$, respectively. We implemented the algorithm as a graphical software tool (web interface available at http://www.cs.bgu.ac.il/~negevcb/FRUUT), which computes and displays similarities between secondary structures of RNA given as input, and employed it in a preliminary experiment in which we ran all-against-all pairwise alignments of RNase P RNA family members. Our results demonstrate that applying unrooted unordered alignment to RNA trees can expose structural similarity between some members of the family more thoroughly than the traditional rooted ordered alignment approaches.
[Joint work with Shay Zakov, Erez Katzenelson, Eitan Bachmat, Yefim Dinitz and Michal Ziv-Ukelson.]

• Gene bi-targeting by viral and human miRNAs
Isana Veksler-Lublinsky
MicroRNAs (miRNAs) are an abundant class of small noncoding RNAs that can affect gene expression by post-transcriptional regulation of mRNAs. They play important roles in several biological processes (e.g., development and cell cycle regulation). Some viral organisms also encode miRNAs, a fact that contributes to the complex interactions between viruses and their hosts. A need arises to understand the functional relationship between viral and host miRNAs and their effect on viral and host genes. Our approach to meet this challenge is to identify modules where viral and host miRNAs cooperatively regulate host gene expression. We present a method to identify groups of viral and host miRNAs that cooperate in post-transcriptional gene regulation, together with their target genes that are involved in similar biological processes. We call these groups (genes and miRNAs of human and viral origin) modules. The modules are found by a new two-stage procedure, which we call bi-targeting, presented in this talk. The stages are (i) a new and efficient target prediction, and (ii) a new method for clustering objects of three different data types. The presented algorithm, which makes use of diverse biological data, is demonstrated to be an efficient approach for finding bi-targeting modules of viral and human miRNAs. These modules can contribute to a better understanding of viral-host interactions and the role that miRNAs play in them.

[Joint work with Yonat Shemer-Avni, Klara Kedem and Michal Ziv-Ukelson.]

• Maximal-exponent Repeats
Maxime Crochemore
The exponent of a string is the quotient of its length over its smallest period. The exponent and the period of a string can be computed in time proportional to its length. We design an algorithm to compute the maximal exponent of factors of an overlap-free string.
It runs in linear time on a fixed-size alphabet, while a naive solution would run in cubic time. The solution for strings that are not overlap-free derives from algorithms that compute all maximal repetitions, also called runs, occurring in the string. We show that there is a linear number of maximal-exponent repeats in an overlap-free string, and the algorithm can locate them all in linear time.

• Exact Pattern Matching for RNA Structure Ensembles
Sebastian Will
ExpaRNA’s core algorithm computes, for two fixed RNA structures, a maximal non-overlapping set of maximal exact matchings. We introduce an algorithm, ExpaRNA-P, that solves the lifted problem of finding such sets of exact matchings in entire Boltzmann-distributed structure ensembles of two RNAs. Due to a novel kind of structural sparsification, the new algorithm maintains the time and space complexity of the algorithm for fixed input structures. Furthermore, we generalized the chaining algorithm of ExpaRNA in order to compute a compatible subset of ExpaRNA-P’s exact matchings. We show that ExpaRNA-P outperforms ExpaRNA in Bralibase 2.1 benchmarks, where we pass the chained exact matchings as anchor constraints to the RNA alignment tool LocARNA. Compared to LocARNA, this novel approach shows similar accuracy but is six times faster.

[Joint work with Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Möhl and Christina Schmiedl.]

• Recent Progress on the Multi-State Perfect Phylogeny Problem
Dan Gusfield
The Multi-State Perfect Phylogeny Problem is a natural extension of the Binary Perfect Phylogeny (or Compatibility) Problem, allowing evolutionary characters to take on more than two states. The basic problem is to determine whether multi-state data can be derived on a multi-state perfect phylogeny. In population genetics terminology, the binary problem is to determine if data fits the “infinite-sites” model, while the multi-state problem is to determine if data fits the “infinite-alleles” model.
The problem is of interest due to prior elegant mathematical and algorithmic results, and due to non-binary, but integer, data that is increasingly becoming available. In 1975, Buneman showed how to view the multi-state perfect phylogeny problem as a problem of triangulating non-chordal graphs, but that result has not been widely exploited as a computational tool, despite a robust and continuing literature on the problem of triangulating non-chordal graphs. In this talk, I discuss our recent work on exploiting the chordal graph approach (and, briefly, two other approaches) to study multi-state perfect phylogeny problems. I will talk about several sets of results.

The first set of results concerns problems that extend the utility of the multi-state perfect phylogeny model: the Missing Data (MD) Problem, where some entries in the input are missing and the question is whether (bounded) values for the missing data can be imputed so that the resulting data has a multi-state perfect phylogeny; and the Character-Removal (CR) Problem, where we want to minimize the number of characters to remove from the data so that the resulting data has a multi-state perfect phylogeny. Using results on minimal triangulation of non-chordal graphs, we establish new necessary and sufficient conditions for the existence of a perfect phylogeny with (or without) missing data. This leads to solutions of the MD and CR problems using integer linear programming.

The second set of results concerns generalization of the famous “four-gametes test” that characterizes the compatibility of binary data. Such a generalization was sought, but not found, in the 1970s and 1980s, and has been open since then. The generalization establishes that 3-state data has a perfect phylogeny if and only if every subset of three characters has a 3-state perfect phylogeny. We also characterize the patterns in the data that prohibit a 3-state perfect phylogeny.
For 3-state data there are four obstructing patterns, rather than just one (the four gametes) in the case of two states. We mention recent progress (by Fumei Lam) on generalizing this result to an arbitrary number of states.

The third set of results concerns reductions of the multi-state problem to established techniques. We demonstrate that the 3-state problem reduces to the well-known and efficiently solvable 2-SAT problem. This builds on the efficient solution to the 3-state problem due to Dress and Steel. We next demonstrate that the general multi-state problem reduces to the binary problem with missing data, a problem that was discussed in a prior Newton Phylogeny meeting. Then, using tools developed for the binary case with missing data, we see how this approach solves the MD and CR problems. Finally, we show how to speed up triangulation methods by applying specification ideas.

[Joint work with Kristian Stevens, R. Gysel, Y. Wu, F. Lam and S. Sridhar.]

• Local Search for String Problems: Brute Force is Essentially Optimal
Danny Hermelin
Local search is a universal algorithmic approach for coping with computationally hard optimization problems. When analyzing heuristics of this type, one is interested in the amount of time required to compute a so-called “improvement step”, and the number of such improvement steps that are required in order to solve the problem at hand. In this talk I will examine the first question when applied to classical NP-hard string problems. I will show that for the Closest String problem in a natural local search formulation, one cannot compute an improvement step in $n^{o(k)}$ time (under standard complexity-theoretic assumptions), while simple brute force allows this in $n^{O(k)}$ time. Similar results hold also for Longest Common Supersequence, Shortest Common Superstring, and Shortest Common Supersequence.

[Joint work with Jiong Guo and Christian Komusiewicz.]
• Online Sorted Range Reporting
Alex Lopez-Ortiz
[Joint work with Gerth Brodal and Rolf Fagerberg.]

• Succinct Posets
Ian Munro
We describe an algorithm for compressing a partially ordered set, or poset, so that it occupies space matching the information-theoretic lower bound (to within lower-order terms) in the worst case. Using this algorithm, we design a succinct data structure for representing a poset that, given two elements, can report whether one precedes the other in constant time. This is equivalent to succinctly representing the transitive closure graph of the poset, and we note that the same method can also be used to succinctly represent the transitive reduction graph. For an $n$-element poset, the data structure occupies $n^2/4 + o(n^2)$ bits in the worst case, which is roughly half the space occupied by an upper triangular matrix. Furthermore, a slight extension to this data structure yields a succinct oracle for reachability in arbitrary directed graphs. Thus, using roughly a quarter of the space required to represent an arbitrary directed graph, reachability queries can be supported in constant time.

[Joint work with Pat Nicholson.]

• Pattern Matching in Multiple Streams
Raphael Clifford
I will introduce the problem of pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of $s$ streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length $m$ and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and $O(m+s)$ words of space. For the $k$-mismatch and $k$-difference problems we give $O(k)$ time solutions that require $O(m+ks)$ words of space.
In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally, I will set out a number of open problems related to this new model for pattern matching.

[Joint work with Markus Jalsenius, Ely Porat and Benjamin Sach.]

• Fast Arc-Annotated Subsequence Matching in Linear Space
Inge Li Gørtz
An arc-annotated string is a string of characters, called bases, augmented with a set of pairs, called arcs, each connecting two bases. Given arc-annotated strings $P$ and $Q$, the arc-preserving subsequence problem is to determine if $P$ can be obtained from $Q$ by deleting bases from $Q$. Whenever a base is deleted, any arc with an endpoint in that base is also deleted. Arc-annotated strings where the arcs are “nested” are a natural model of RNA molecules that captures both the primary and secondary structure of these. The arc-preserving subsequence problem for nested arc-annotated strings is a basic primitive for investigating the function of RNA molecules. Gramm et al. [ACM Trans. Algorithms 2006] gave an algorithm for this problem using $O(nm)$ time and space, where $m$ and $n$ are the lengths of $P$ and $Q$, respectively. In this paper we present a new algorithm using $O(nm)$ time and $O(n + m)$ space, thereby matching the previous time bound while significantly reducing the space from a quadratic term to linear. This is essential for processing large RNA molecules, where space is likely to be a bottleneck. To obtain our result we introduce several novel ideas which may be of independent interest for related problems on arc-annotated strings.

[Joint work with Philip Bille.]

• Two Dimensional Range Minimum Queries and Fibonacci Lattices
Moshe Lewenstein
Given a matrix of size $N$, two dimensional range minimum queries (2D-RMQs) ask for the position of the minimum element in a rectangular range within the matrix. We study trade-offs between the query time and the additional space of data structures supporting 2D-RMQs.
As a novel technique, we use the discrepancy properties of Fibonacci lattices to achieve an indexing data structure of size $O(N/c + c(\log c)^{O(1)})$ bits of additional space with $O(c \log c (\log \log c)^2)$ query time. Also, when the entries of the input matrix are from a fixed-size alphabet, we show that the query time can be improved to $O(c \log c \log \log c)$.

[Joint work with Gerth Stølting Brodal, Pooya Davoodi, Rajeev Raman and Srinivasa Rao.]

• Efficient Computation of Sparse Transforms
Oren Kapah
Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as the Fast Fourier Transform or the Walsh Transform, are a useful tool in many applications and are efficiently solved. In the last decade, several problems have required efficient solution of sparse binary convolutions. Both randomized and deterministic algorithms were developed for efficiently computing the sparse FFT. The key operation in all these algorithms was length reduction: the sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to date was the modulo function, since it preserves location (of the “1” bits) up to cyclic shift. To date there is no known efficient algorithm for computing the sparse Walsh Transform. Since the modulo function does not preserve the Walsh transform, a new method for length reduction is needed. In this paper we present such a new method, based on polynomials. This method enables the development of an efficient algorithm for computing the binary sparse Walsh Transform; to our knowledge, this is the first such algorithm. We also show that this method allows a faster deterministic computation of the sparse FFT than currently known in the literature.

[Joint work with A. Amir, E. Porat and A. Rothschild.]

• Binary Search Trees for Compression
John Iacono
Suppose you are receiving a string one character at a time from an alphabet of size $n$ and wish to output a compressed representation of it.
One method is to search for each character in a binary search tree and output the search path to the character. This ancient method has several benefits: it uses a number of words of space linear in the alphabet size and compresses in time linear in the number of bits of the output string. What are the types of patterns that will compress well in such a scheme? What is the best choice of binary search tree? What patterns will not compress well and are better served through other schemes?

• Configurations and Minority in the String Consensus Problem
Haim Paryenty
The Closest String Problem is defined as follows: given a set $S$ of $k$ strings $\{s_1,\dots, s_k\}$, each of length $\ell$, find a string $\hat{s}$ such that the maximum Hamming distance of $\hat{s}$ from each of the strings is minimized. We denote this distance by $d$; the string $\hat{s}$ is called a consensus string. In this paper we present two main algorithms for this problem: the {\tt Configuration} algorithm, with $O(k^2 \ell^k)$ running time, and the {\tt Minority} algorithm. Despite the great effort to obtain efficient algorithms for this problem, an algorithm with the natural running time of $O(\ell^k)$ was not known; in this paper we close this gap. Our result means that algorithms solving the closest string problem in times $O(\ell^2)$, $O(\ell^3)$, $O(\ell^4)$ and $O(\ell^5)$ exist for the cases of $k=2,3,4$ and $5$, respectively. It is known that, in fact, the cases of $k=2,3$ and $4$ can be solved in linear time; no efficient algorithm was previously known for the case of $k=5$. We prove two lemmas, the {\em unit square lemma} and the {\em minority lemma}, that exploit surprising properties of the closest string problem and enable constructing the closest string in a sequential fashion. These lemmas, with some additional ideas, give an $O(\ell^2)$ algorithm for computing a closest string of $5$ binary strings.
[Joint work with A. Amir and L. Roditty.]
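
To make the objective above concrete, here is a brute-force baseline for binary strings: try every candidate of length $\ell$ and keep the one minimizing the maximum Hamming distance to the $k$ input strings. It is exponential in $\ell$, so it only illustrates the objective that the Configuration and Minority algorithms optimize efficiently; the function names are ours, not the paper's.

```python
from itertools import product

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def closest_string(strings):
    """Exhaustively find a binary consensus string minimizing the
    maximum Hamming distance d to the input strings. O(2^l * k * l)."""
    l = len(strings[0])
    best, best_d = None, l + 1
    for cand in product('01', repeat=l):
        cand = ''.join(cand)
        d = max(hamming(cand, s) for s in strings)
        if d < best_d:
            best, best_d = cand, d
    return best, best_d

s, d = closest_string(['0000', '0011', '0101'])
print(d)  # 1
```
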