<?xml version="1.0" encoding="UTF-8"?>
      <rss version="2.0"
          xmlns:content="http://purl.org/rss/1.0/modules/content/"
          xmlns:wfw="http://wellformedweb.org/CommentAPI/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:atom="http://www.w3.org/2005/Atom"
          xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
          xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
          xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/">
        <channel>
          <title>EventQL Blog</title>
          <atom:link href="https://eventql.io/blog/feed.xml" rel="self" type="application/rss+xml" />
          <link>https://eventql.io/blog</link>
          <description>A blog about massively parallel database internals</description>
          <language>en-US</language>
          
        <item>
          <title>EventQL v0.4.0 released</title>
          <link>https://eventql.io/blog/eventql-v0-4-0-released</link>
          <guid isPermaLink="false">https://eventql.io/blog/eventql-v0-4-0-released</guid>
          <enclosure url="https://eventql.io/blog/eventql-v0-4-0-released/eventql_0_4_0.png" length="68095" type="image/png"/>
          <description><![CDATA[ <p><i>
EventQL 0.4.0 is out and is ready for deployment. This release contains a lot of
new features and fixes since 0.3, and is a recommended upgrade for all 0.3 users.
</i></p>

<p><img src="https://eventql.io/blog/eventql-v0-4-0-released/eventql_0_4_0.png" /></p>

<p>The new year is here and so is the next release of EventQL! New features and
fixes include:</p>

<ul>
<li><p>Added a native transport layer (a framed TCP-based protocol). See
doc/internals/protocol.txt for the specification</p></li>
<li><p>Added a C client driver library (libevqlclient)</p></li>
<li><p>Added a new table config option that sets a finite partition size
(called a &quot;partition size hint&quot; in the user-facing documentation). A table
with a finite partition size initially has zero partitions and
adds (finite) partitions as it receives inserts for the respective
chunks of the keyspace (without splitting the new partition from a larger
partition)</p></li>
<li><p>Added the DROP TABLE statement</p></li>
<li><p>Added a thread-local storage and retrieval mechanism for the current
session. This also adds a new requirement that all operations must be
executed from within a valid database thread.</p></li>
<li><p>Added support for loading configuration files using the &#39;-c&#39; flag and for
setting/overriding individual configuration parameters using the &#39;-C&#39; flag
in the EventQL client (evql) binary</p></li>
<li><p>Added the following new configuration options: server.c2s_io_timeout,
server.c2s_idle_timeout, server.s2s_io_timeout, server.s2s_idle_timeout,
server.heartbeat_interval, cluster.allowed_hosts, server.http_io_timeout,
server.query_progress_rate_limit, cluster.allow_anonymous,
cluster.allow_drop_table, server.query_max_concurrent_shards,
server.query_max_concurrent_shards_per_host,
server.query_failed_shard_policy, client.timeout, server.disk_capacity,
server.load_limit_{soft,hard}, server.partitions_loading_limit_{soft,hard},
server.loadinfo_publish_interval, server.noalloc,
server.s2s_pool_max_connections, server.s2s_pool_max_connections_per_host,
server.s2s_pool_linger_timeout, cluster.allow_create_database,
server.replication_threads_max</p></li>
<li><p>Added the evqlslap benchmark and load testing tool</p></li>
<li><p>Added caching for metadata lookups</p></li>
<li><p>Added internal (server-to-server) connection pooling for native TCP
connections</p></li>
<li><p>Added a new &quot;user defined partitioning&quot; mode that can be enabled by setting
user_defined_partitioning=true on a table and allows explicitly specifying
the partition key for each row</p></li>
<li><p>Added a new monitor thread that publishes disk usage and other load
information to the coordination service</p></li>
<li><p>Added new frames to the native protocol: ACK, INSERT, REPL_INSERT,
META_GETFILE{_RESULT}, META_CREATEFILE, META_PERFORMOP{_RESULT},
META_DISCOVER{_RESULT}, META_LISTPARTITIONS{_RESULT},
META_FINDPARTITION{_RESULT}</p></li>
<li><p>Added the DESCRIBE PARTITIONS table statement to obtain information about
the partitions of a table</p></li>
<li><p>Added the USE statement to switch databases</p></li>
<li><p>Added new sql functions: usleep, fnv32</p></li>
<li><p>Added the CREATE DATABASE statement</p></li>
<li><p>Added the CLUSTER SHOW SERVERS statement and the cluster-list command to
obtain information about the servers of the current cluster.</p></li>
<li><p>Added the table-import command to the evqlctl util</p></li>
<li><p>Added a basic ruby driver (drivers/ruby)</p></li>
<li><p>Changed the parallel GROUP BY operation to use the new native protocol and
to schedule all remote partial group bys from a single thread using
asynchronous I/O (currently via poll()).</p></li>
<li><p>Changed the evql client to use the new native protocol</p></li>
<li><p>Changed the connection acceptor code to read the first byte of an incoming
connection and switch protocols (currently HTTP and native) based on this
first byte</p></li>
<li><p>Changed the http server code to execute incoming requests from within a
database thread (previously this was a generic thread pool)</p></li>
<li><p>Changed the internal cluster auth to require all internal hosts to be
whitelisted using the cluster.allowed_hosts option. In standalone mode,
the whitelist is defaulted to 0.0.0.0/0</p></li>
<li><p>Changed the HTTP transport to run fully within the database thread (i.e.
removed the dependency on an external event loop) and added http timeouts</p></li>
<li><p>Changed the SQL engine to use the new native transport</p></li>
<li><p>Changed the zookeeper backend to explicitly reconnect on session timeouts
and not rely on the built-in reconnect mechanism</p></li>
<li><p>Changed the DESCRIBE TABLE statement to return a new &quot;encoding&quot; column that
contains information about the column&#39;s encoding settings</p></li>
<li><p>Changed the server allocator to assign new partitions to servers using
a weighted random scheme, taking into account the load factor and the
number of available bytes on disk for each server</p></li>
<li><p>Changed all metadata operations to use the native protocol</p></li>
<li><p>Removed the obsolete &quot;sha1_tokens&quot; field from the cluster/server config</p></li>
<li><p>Removed the PCRE dependency</p></li>
</ul>

<p>We hope you enjoy working with the new release and please let us know of any
issues.</p>

<p>-- The EventQL Team</p>
]]></description>
          <author>paul@eventql.io (Paul Asmuth)</author>
          <pubDate>Tue, 10 Jan 2017 18:00:00 -0000</pubDate>
        </item>
      
        <item>
          <title>Dividing Infinity - Distributed Partitioning Schemes</title>
          <link>https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes</link>
          <guid isPermaLink="false">https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes</guid>
          <enclosure url="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/circular_keyspace_servers.png" length="72087" type="image/png"/>
          <description><![CDATA[ <p><i>
This is the second post in a series discussing the architecture and
implementation of massively parallel databases, such as Vertica <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot0">[0]</a>,
BigQuery <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot1">[1]</a> or EventQL <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot2">[2]</a>. The target audience is
software and system engineers with an interest in databases and distributed systems.
</i></p>

<p>In <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/">the last post</a> we saw that in order
to execute interactive queries on a large data set we have to split the data up
into smaller partitions and put each partition on its own server. This way we
can utilize the combined processing power of all servers to answer the query
rapidly.</p>

<p>The problem we&#39;ll discuss today is how exactly we&#39;re going to split up a given
dataset into partitions and distribute them among servers.</p>

<p>Say we have a table containing a large number of rows, a couple billion or so.
The total size of the table is around 100TB. Our task is to distribute the rows
uniformly among 20 servers, i.e. put roughly 5TB of the table on each server.</p>

<p>Of course, solving that task is trivial: We read in our 100TB source table,
write the first 5TB of rows to the first server, the next 5TB to the second server
and so on.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/parallel_query.png" /></p>

<p>While this simplistic scheme works well for a static dataset, we&#39;ll have to be
more clever if we are to implement an entire database that supports adding
and modifying rows.</p>

<p>Why? Consider this: If we want to modify a row in our naively partitioned table,
we first have to figure out on which server we have put the row when splitting
the table into pieces. Now, to find the row, we have to search through all rows
until we hit the correct one. In the worst case we would have to examine <em>all</em>
rows on <em>all</em> servers to find any single row - the whole 100TB of data.</p>

<p>In more technical terms: Locating a row has <em>linear complexity</em> <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot3">[3]</a>: Finding
the row in a table containing one thousand rows would take one thousand times
longer than finding the row in a table containing just one row. It gets slower
and slower as we add more rows.</p>

<p>Clearly, our simplistic solution will not scale: We need a more efficient way to
figure out on which server a given row is stored.</p>

<p><hr /></p>

<p>To quickly tell the location of a specific row, we could store an index file
somewhere that records the location of each row. We could then do a
quick lookup into our index to find the correct server instead of searching
through all the data.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/location_index.png" /></p>

<p>Sadly, this doesn&#39;t solve the problem. The index file would still have
linear complexity, although this time it would be linear in space rather than
in time: If we were storing a bazillion rows in our table, our index file would
also have a bazillion rows.</p>

<p>Essentially, we would just be rephrasing the problem statement from
<em>&quot;How do we partition a large table?&quot;</em> to <em>&quot;How do we partition a large index
file?&quot;.</em></p>

<p><hr /></p>

<p>We&#39;ll have to come up with a partitioning scheme that allows us to find rows
with less than linear complexity. I.e. we have to find an algorithm that can
correctly compute the location of any single row, but doesn&#39;t get slower and
slower as we add more rows to the table.</p>

<h2>Modulo Hashing</h2>

<p>One such algorithm is called <em>modulo hashing</em>. The good thing about modulo
hashing is that it&#39;s not only very efficient but also extremely simple to
implement.</p>

<p>If we want to partition our input table using modulo hashing, we first have to
assign an identifier <code>ID</code> to every row in the table. This <code>ID</code> is usually
derived from the row itself, for example by designating one of the table&#39;s columns
as the primary key. For illustration purposes, we will use numeric identifiers,
but the same works with strings. <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot4">[4]</a></p>

<p>The only piece of information that modulo hashing keeps is a single variable <code>N</code>.
This variable <code>N</code> contains the number of servers among which the table should
be partitioned.</p>

<p>Now, to figure out on which server a given row belongs, we simply compute the
remainder of the division of the row&#39;s <code>ID</code> by <code>N</code>. This use of the modulo
operation is also where the algorithm gets its name.</p>

<pre><code>find_row(row_id) {
  server_id := row_id % N;
  return server_id;
}
</code></pre>

<p>If, for example, we wanted to locate the row with <code>ID=123</code> in a table partitioned
among 8 servers (<code>N=8</code>), the row would be stored on server number 3
(<code>123 % 8 = 3</code>).</p>

<p>Of course, this is assuming we have also used the algorithm to decide on
which server to put each row while loading the input table in the first place.</p>

<p>Modulo hashing works out so that every possible <code>ID</code> is consistently mapped
to a single server. The distribution of the rows among servers will be
approximately uniform, i.e. every server will get roughly the same number of
rows. <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot5">[5]</a></p>

<p><hr /></p>

<p>The modulo hashing algorithm is a huge improvement over our naive approach as it
is constant in space and time: Locating a row takes the same amount of time
regardless of the total number of rows in the table. And the only pieces of
information we need to tell where a row goes are the row&#39;s <code>ID</code> and the total
number of servers <code>N</code>, which will only take a few bytes to store. Not bad.</p>

<p>So are we done? I&#39;m afraid we&#39;re not. </p>

<p>One thing modulo hashing can&#39;t handle well is a growing dataset. If we&#39;re
continually adding more rows to the table, we will eventually have to add more
capacity and increase the number of servers <code>N</code>. However, once we do that the
locations of almost all of the rows would change, since changing <code>N</code> also changes
the result of all modulo operations.</p>
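
<p>To get a feeling for how many rows are affected, here is a minimal Python sketch that counts the rows whose server changes when <code>N</code> grows from 8 to 9. The helper names <code>server_for</code> and <code>count_moved</code> and the range of row identifiers are made up for illustration:</p>

<pre><code>def server_for(row_id, n):
    # Modulo hashing: the remainder of row_id divided by the server count n.
    return row_id % n


def count_moved(row_ids, old_n, new_n):
    # Count the rows whose server changes when the server count changes.
    return sum(1 for row_id in row_ids
               if server_for(row_id, old_n) != server_for(row_id, new_n))


row_ids = range(7200)  # hypothetical row identifiers
moved = count_moved(row_ids, 8, 9)
print(moved, "of", len(row_ids), "rows change servers")
</code></pre>

<p>In this example 6400 of the 7200 rows, roughly 89%, end up on a different server after adding just one server.</p>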

<p>This means that in order to increase the number of servers <code>N</code> and still keep the
rows where they belong, we would have to copy every single row in the table to
its new location every time we add or remove a server.</p>

<p>That&#39;s not exactly ideal: Even if we did not care about the massive overhead of
copying every single row, we would eventually reach a point where our table grows
faster than we can rebalance it.</p>

<h2>Consistent Hashing</h2>

<p>Consistent hashing is a more involved version of modulo hashing. The main
improvement of consistent hashing is that it allows us to add and remove servers
without affecting the locations of all rows.</p>

<p>At the heart of the consistent hashing algorithm is a so-called <em>circular
keyspace</em>. Before we discuss what exactly that means let&#39;s define the term
<em>keyspace</em>:</p>

<p>As with modulo hashing we need to assign an identifier <code>ID</code> to each row. Now,
the keyspace is the range of all valid <code>ID</code> values. If the <code>ID</code> is numeric, the
keyspace is the range from negative infinity to positive infinity.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace_only.png" />
<div class="img_sub">A keyspace and the positions of three row identifiers within the keyspace.</div></p>

<p>Within the keyspace, identifiers are well-ordered <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot6">[6]</a>. That means each <code>ID</code> has a
<em>successor</em> and a <em>predecessor</em>. The successor of an <code>ID</code> is the <code>ID</code> that goes
immediately after it and the predecessor is the one that goes immediately before it.</p>

<p>In the illustration above, the successor of green (<code>ID=123</code>) is red (<code>ID=856</code>), the
successor of red (<code>ID=856</code>) is yellow (<code>ID=923</code>) and so on.</p>

<p>But what is the successor of yellow (<code>ID=923</code>)? Does it have one? The answer is
not entirely clear. However, we will later have to come up with a successor for each
possible position in the keyspace, so we will have to define what the successor
of the last position in our keyspace is.</p>

<p>Imagine we glued one end of our keyspace to the other end:</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/circular_keyspace.png" />
<div class="img_sub">A circular keyspace and the positions of three row identifiers within the keyspace.</div></p>

<p>Now we can clearly say that, in clockwise order, yellow&#39;s (<code>ID=923</code>) successor
is green (<code>ID=123</code>). Also, it finally looks like a circular keyspace!</p>

<p><hr /></p>

<p>Back to consistent hashing. Like we did with modulo hashing, we will choose
an initial number of servers <code>N</code>. For each of the <code>N</code> servers, we will put a marker
at a random position in the circular keyspace.</p>

<p>To decide on which server a given row <code>ID</code> belongs, we first locate the position
of the <code>ID</code> in the circular keyspace and then search clockwise for the next
server marker. In other words: each row goes onto the server whose marker
immediately succeeds the row-id&#39;s position in the keyspace.</p>

<p>The illustration below shows a circular keyspace with eight server markers. In
the illustration, the succeeding server marker for row <code>123</code> is <code>server 1</code>,
while the succeeding server marker for rows <code>856</code> and <code>923</code> is <code>server 3</code>.</p>

<p>So row <code>123</code> gets stored on <code>server 1</code> while rows <code>856</code> and <code>923</code> get stored on
<code>server 3</code>.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/circular_keyspace_servers.png" />
<div class="img_sub">A circular keyspace with three row identifiers and eight server markers.</div></p>
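
<p>The lookup described above can be sketched in a few lines of Python. This is only an illustration: the keyspace of 0 to 999 and all marker positions are invented (chosen so that the results agree with the illustration), and row identifiers are assumed to already lie within that keyspace:</p>

<pre><code>import bisect

# Hypothetical marker positions in a circular keyspace of 0..999.
MARKERS = {
    200: "server 1", 310: "server 2", 950: "server 3", 420: "server 4",
    530: "server 5", 640: "server 6", 750: "server 7", 60: "server 8",
}
POSITIONS = sorted(MARKERS)


def find_server(row_id):
    # Index of the first marker at or after the row position. The modulo
    # wraps around to the first marker when the row lies past the last one.
    index = bisect.bisect_left(POSITIONS, row_id) % len(POSITIONS)
    position = POSITIONS[index]
    return MARKERS[position]


for row_id in (123, 856, 923, 980):
    print(row_id, find_server(row_id))
</code></pre>

<p>Rows <code>123</code>, <code>856</code> and <code>923</code> land on the same servers as in the illustration, and row <code>980</code> wraps around the end of the keyspace to the marker at position 60.</p>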

<p>As with the modulo hashing scheme, we still end up with a uniform distribution
of rows among servers <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot7">[7]</a> and can quickly locate each row. The only
information we have to store is the position of each server&#39;s marker in the keyspace.</p>

<p>Additionally, any server marker that we add or remove will only affect the rows
immediately between it and the previous marker: The locations of all other rows
remain consistent, hence the name consistent hashing.</p>

<p>This means we can add or remove servers and only have to move a small subset of the
rows into their new locations. The fraction of rows that needs to be moved is roughly <code>1/N</code>
where <code>N</code> is the number of servers. So as we add more servers, the percentage of
rows that need to be moved actually gets smaller.</p>
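
<p>As a small sanity check, the following Python sketch adds one marker to a ring and lists the rows that change servers. The marker positions and the keyspace of 0 to 999 are invented for this example, and <code>owner</code> is a hypothetical helper:</p>

<pre><code>import bisect


def owner(positions, row_id):
    # positions is a sorted list of marker positions. A row belongs to the
    # first marker at or after it, wrapping around at the end of the keyspace.
    return positions[bisect.bisect_left(positions, row_id) % len(positions)]


before = [100, 300, 500, 700, 900]
after = sorted(before + [400])  # add one server marker at position 400

moved = [row_id for row_id in range(1000)
         if owner(before, row_id) != owner(after, row_id)]
print(min(moved), max(moved), len(moved))  # prints "301 400 100"
</code></pre>

<p>Only the rows between the previous marker (300) and the new marker (400) move. Every other row stays where it was.</p>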

<p><hr /></p>

<p>Can we still do better? It depends on your use case. Consistent hashing is
successfully employed in a number of popular key/value databases <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot8">[8]</a> such as DynamoDB <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot9">[9]</a>,
Cassandra <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot10">[10]</a> or memcache <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot11">[11]</a>. Nevertheless, here are two things that we could
improve about consistent hashing:</p>

<p>Firstly, consistent hashing only supports an exact lookup operation. That is, we
can only find a row quickly if we already know its <code>ID</code>. If we want to find the
locations of all rows in a range of <code>IDs</code>, for example all rows with an <code>ID</code>
between <code>100</code> and <code>200</code>, we&#39;re back to scanning the full table.</p>

<p>Because of this, consistent hashing is particularly well suited for key/value
databases where range scans are not required but less so for OLAP <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot12">[12]</a> systems
like EventQL.</p>

<p>The other possible improvement is that we still have to copy roughly <code>1/N</code> of the
table&#39;s rows after changing the number of servers. If we had 100TB on 20 servers,
that would mean we&#39;re - realistically - still copying at least 15TB for every
server addition <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot13">[13]</a>. Not too bad, but still a lot of overhead network traffic.</p>

<h2>The BigTable Algorithm</h2>

<p>The last algorithm we will look at today is best known for its publication in
Google&#39;s BigTable paper <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot14">[14]</a>.</p>

<p>The bigtable algorithm takes a completely different approach to the problem,
but it also starts by defining a keyspace. Except this time it&#39;s not a 
circular keyspace, but a linear one.</p>

<p>The illustration below shows a bigtable keyspace and the position of three rows
with the identifiers <code>123</code>, <code>856</code> and <code>923</code> within the keyspace.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace_only.png" /></p>

<p>The next thing bigtable does is to split up the keyspace into a number of
<em>partitions</em> that are defined by their start and end positions, i.e. by the
lowest and highest row identifiers that will still be contained in the partition.</p>

<p>The illustration below shows a keyspace that is split into five partitions
A-E. In the illustration, the row with <code>ID=123</code> goes into partition <code>B</code> and
the rows with <code>ID=856</code> and <code>ID=923</code> both go into partition <code>D</code>.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace.png" />
<div class="img_sub">Three record identifiers are mapped onto five partitions</div></p>
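
<p>Finding the partition for a row is then a search over the sorted partition boundaries. Here is a minimal Python sketch. The split points are invented (picked so that the results match the illustration), and each partition is assumed to include its lower boundary:</p>

<pre><code>import bisect

# Hypothetical boundaries: A ends below 100, B starts at 100, E starts at 950.
SPLIT_POINTS = [100, 400, 800, 950]
PARTITIONS = ["A", "B", "C", "D", "E"]


def find_partition(row_id):
    # bisect_right counts the split points that are at or below row_id,
    # which is the index of the partition containing the row.
    return PARTITIONS[bisect.bisect_right(SPLIT_POINTS, row_id)]


for row_id in (123, 856, 923):
    print(row_id, find_partition(row_id))
</code></pre>

<p>The lookup only needs the list of boundaries, so its cost depends on the number of partitions and not on the number of rows.</p>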

<p>Now, the clever bit about the bigtable algorithm is how it comes up with the
partition boundaries. To see why it&#39;s clever we have to understand why we can&#39;t
simply divide the keyspace into equal parts without knowing the exact distribution
of the input data:</p>

<p>One reason for that is that if you split up the range from negative to positive
infinity into a list of discrete partitions, you end up with an infinite number
of partitions.</p>

<p>The other reason is that it&#39;s highly likely that the row identifiers will all
be piled up in a small area of the keyspace. After all, the identifiers might
be user-supplied so we can&#39;t necessarily guarantee anything about their distribution.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace_distribution.png" />
<div class="img_sub">Realistic distribution of record identifiers in the keyspace</div></p>

<p>So it could be that even though we have split up the keyspace into a large number
of partitions, all rows actually end up in the same partition. And we can&#39;t
solve the problem by making the partitions infinitesimally small either - that
would be like going back to keeping track of every row&#39;s location individually,
just a bit worse.</p>

<p><hr /></p>

<p>Here&#39;s how bigtable solves the problem: Initially the table starts out with a
single partition that covers the whole keyspace - from negative infinity to
positive infinity.</p>

<p>As soon as this first partition has become too large, it will be split in two.
The split point will be chosen so that it divides the data in the partition
into two roughly equal halves. This continues recursively as partitions become too
large. At a basic level, it&#39;s an application of the classic divide and conquer
principle <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot15">[15]</a>.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/partition_split.png" />
<div class="img_sub">Partition D is splitting into partitions D1 and D2</div></p>
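
<p>The splitting step can be sketched as follows. This is a toy model in Python and not how BigTable or EventQL implement it: partitions are plain in-memory lists, and <code>MAX_ROWS</code>, <code>insert</code> and the inserted row identifiers are all hypothetical:</p>

<pre><code>import bisect

MAX_ROWS = 4  # hypothetical size limit, tiny so the example splits quickly


def insert(split_points, partitions, row_id):
    # Locate the partition whose key range contains row_id and add the row.
    index = bisect.bisect_right(split_points, row_id)
    rows = partitions[index]
    bisect.insort(rows, row_id)
    if len(rows) > MAX_ROWS:
        # Split at the median row so both halves hold about the same data.
        middle = len(rows) // 2
        split_points.insert(index, rows[middle])
        partitions[index:index + 1] = [rows[:middle], rows[middle:]]


split_points = []  # no boundaries yet
partitions = [[]]  # a single partition covering the whole keyspace
for row_id in [856, 123, 923, 500, 870, 910, 880]:
    insert(split_points, partitions, row_id)
print(split_points)  # prints [856, 880]
print(partitions)    # prints [[123, 500], [856, 870], [880, 910, 923]]
</code></pre>

<p>Note how both split points end up in the crowded upper part of the keyspace, because that is where most of the example rows are.</p>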

<p>This way, you always end up with a number of partitions that are roughly equal
in size, even though the distribution of row identifiers in the keyspace is
initially unknown.</p>

<p>And since each partition is defined in terms of its lowest and highest contained
row identifier, we can easily implement efficient range scans: To find all rows
in a given range of identifiers, we only have to scan the partitions with
overlapping ranges.</p>
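
<p>As a rough Python sketch of such a range scan, again with invented split points and with the hypothetical helper <code>partitions_for_range</code>, selecting the partitions to scan could look like this:</p>

<pre><code>import bisect

SPLIT_POINTS = [100, 400, 800, 950]  # hypothetical partition boundaries
PARTITIONS = ["A", "B", "C", "D", "E"]


def partitions_for_range(low, high):
    # All partitions from the one containing low to the one containing high.
    first = bisect.bisect_right(SPLIT_POINTS, low)
    last = bisect.bisect_right(SPLIT_POINTS, high)
    return PARTITIONS[first:last + 1]


print(partitions_for_range(100, 200))  # prints ['B']
print(partitions_for_range(50, 450))   # prints ['A', 'B', 'C']
</code></pre>

<p>With these boundaries, the scan for all rows with an <code>ID</code> between <code>100</code> and <code>200</code> only has to touch partition <code>B</code>.</p>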

<p>Lastly, the bigtable scheme does require a second allocation layer to assign
partitions to servers that we didn&#39;t discuss here. Suffice it to say that this
second allocation layer allows us to add new servers to a cluster without
physically moving a single row. Of course, we still need to move around some rows
every time we split a partition.</p>

<h2>Conclusion</h2>

<p>So is the bigtable algorithm really &quot;better&quot; than consistent hashing? Again, it
depends on the use case.</p>

<p>The upsides of the bigtable scheme are that it supports range scans and that we
can add capacity to a cluster without copying rows. The major downside is that
implementing the algorithm in a masterless system requires a fair amount of
code and synchronization.</p>

<p>For EventQL, we still chose the bigtable algorithm as the clear winner. After
reading this post, go have a look at the debug interface of EventQL where you
can see the <em>partition map</em> for a given table. Hopefully it will make a lot more
sense now:</p>

<p><a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/partition_map.png" target="_blank"><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/partition_map_small.png" /></a>
<div class="img_sub">Partition map for an EventQL table with a DATETIME primary key</div></p>

<p>That&#39;s all for today. In the next post we will discuss how to handle streaming
updates on columnar files. You can subscribe to email updates for upcoming posts
or the RSS feed in the sidebar.</p>

<div class="footnotes">
  <p>
    <a id="foot0"></a>
    [0] Andrew Lamb et al. (2012) The Vertica Analytic Database: C-Store 7 Years Later
    (The 38th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf</a>
  </p>
  <p>
    <a id="foot1"></a>
    [1] Sergey Melnik et al. (2010) Dremel: Interactive Analysis of Web-Scale Datasets
    (The 36th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf">http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf</a>
  </p>
  <p>
    <a id="foot2"></a>
    [2] EventQL (2016) An open-source SQL database for large-scale event analytics &mdash;
    <a target="_blank" href="http://eventql.io">http://eventql.io</a>
  </p>
  <p>
    <a id="foot3"></a>
    [3] Complexity Theory on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Big_O_notation">https://en.wikipedia.org/wiki/Big_O_notation</a>
  </p>
  <p>
    <a id="foot4"></a>
    [4] The usual way to do this is to first run the string through a hash function
    and then use (a subset of) the output of the hash function as a numeric identifier.
  </p>
  <p>
    <a id="foot5"></a>
    [5] To ensure uniform distribution, pre-process the identifiers with a suitable
    hash function (i.e. a hash function that guarantees uniform distribution of
    its outputs such as <a target="_blank" href="https://en.wikipedia.org/wiki/MurmurHash">MurmurHash</a>).
  </p>
  <p>
    <a id="foot6"></a>
    [6] Well-order on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Well-order">https://en.wikipedia.org/wiki/Well-order</a>
  </p>
  <p>
    <a id="foot7"></a>
    [7] Practical systems will apply a hash function to the row ID and assign
    more than one marker per server. Marker values are chosen randomly in usually
    a 160 bit keyspace. This works out so that the data is almost guaranteed to
    be uniformly distributed among servers.
  </p>
  <p>
    <a id="foot8"></a>
    [8] Key-value database on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Key-value_database">https://en.wikipedia.org/wiki/Key-value_database</a>
  </p>
  <p>
    <a id="foot9"></a>
    [9] Giuseppe DeCandia et al. (2007) Dynamo: Amazon’s Highly Available Key-value Store
    (21st ACM Symposium on Operating Systems Principles) &mdash;
    <a target="_blank" href="http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf</a>
  </p>
  <p>
    <a id="foot10"></a>
    [10] Avinash Lakshman, Prashant Malik (2009) Cassandra - A Decentralized Structured Storage System
    (ACM SIGOPS Operating Systems Review archive Volume 44 Issue 2) &mdash;
    <a target="_blank" href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf</a>
  </p>
  <p>
    <a id="foot11"></a>
    [11] memcached - a distributed memory object caching system &mdash;
    <a target="_blank" href="https://memcached.org/">https://memcached.org/</a>
  </p>
  <p>
    <a id="foot12"></a>
    [12] OLAP on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Online_analytical_processing">https://en.wikipedia.org/wiki/Online_analytical_processing</a>
  </p>
  <p>
    <a id="foot13"></a>
    [13] Assuming a replication factor of 3. In practice, replication factor and
    overhead combined may result in the data being copied 6-9 times.
  </p>
  <p>
    <a id="foot14"></a>
    [14] Fay Chang et al. (2006) Bigtable: A Distributed Storage System for Structured Data
    (7th USENIX Symposium on Operating Systems Design and Implementation) &mdash;
    <a target="_blank" href="http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf</a>
  </p>
  <p>
    <a id="foot15"></a>
    [15] Divide and conquer on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Divide_and_conquer">https://en.wikipedia.org/wiki/Divide_and_conquer</a>
  </p>
</div>

<h4>Next up in the series:</h4>

<ul>
<li><a href="https://eventql.io/blog/parallel-io-and-columnar-storage/">Part 1: Parallel I/O and Columnar Storage</a></li>
<li><a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/">Part 2: Dividing Infinity - Distributed Partitioning Schemes</a></li>
</ul>
]]></description>
          <content:encoded><![CDATA[ <p><i>
This is the second post in a series discussing the architecture and
implementation of massively parallel databases, such as Vertica <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot0">[0]</a>,
BigQuery <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot1">[1]</a> or EventQL <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot2">[2]</a>. The target audience is
software and system engineers with an interest in databases and distributed systems.
</i></p>

<p>In <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/">the last post</a> we saw that in order
to execute interactive queries on a large data set we have to split the data up
into smaller partitions and put each partition on its own server. This way we
can utilize the combined processing power of all servers to answer the query
rapidly.</p>

<p>The problem we&#39;ll discuss today is how exactly we&#39;re going to split up a given
dataset into partitions and distribute them among servers.</p>

<p>Say we have a table containing a large number of rows, a couple billion or so.
The total size of the table is around 100TB. Our task is to distribute the rows
uniformly among 20 servers, i.e. put roughly 5TB of the table on each server.</p>

<p>Of course, solving that task is trivial: We read in our 100TB source table,
write the first 5TB of rows to the first server, the next 5TB to the second server
and so on.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/parallel_query.png" /></p>

<p>While this simplistic scheme works well for a static dataset, we&#39;ll have to be
more clever if we are to implement an entire database that supports adding
and modifying rows.</p>

<p>Why? Consider this: If we want to modify a row in our naively partitioned table,
we first have to figure out on which server we have put the row when splitting
the table into pieces. Now, to find the row, we have to search through all rows
until we hit the correct one. In the worst case we would have to examine <em>all</em>
rows on <em>all</em> servers to find any single row - the whole 100TB of data.</p>

<p>In more technical terms: Locating a row has <em>linear complexity</em> <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot3">[3]</a>: Finding
the row in a table containing one thousand rows would take one thousand times
longer than finding the row in a table containing just one row. It gets slower
and slower as we add more rows.</p>

<p>Clearly, our simplistic solution will not scale: We need a more efficient way to
figure out on which server a given row is stored.</p>

<p><hr /></p>

<p>To quickly tell the location of a specific row, we could store an index file
somewhere that records the location of each row. We could then do a
quick lookup into our index to find the correct server instead of searching
through all the data.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/location_index.png" /></p>

<p>Sadly, this doesn&#39;t solve the problem. The index file would still have
linear complexity, although this time it wouldn&#39;t be linear in time, but in
space: If we were storing a bazillion rows in our table, our index file would
also have a bazillion rows.</p>

<p>Essentially, we would just be rephrasing the problem statement from
<em>&quot;How do we partition a large table?&quot;</em> to <em>&quot;How do we partition a large index
file?&quot;.</em></p>

<p><hr /></p>

<p>We&#39;ll have to come up with a partitioning scheme that allows us to find rows
with less than linear complexity. I.e. we have to find an algorithm that can
correctly compute the location of any single row, but doesn&#39;t get slower and
slower as we add more rows to the table.</p>

<h2>Modulo Hashing</h2>

<p>One such algorithm is called <em>modulo hashing</em>. The good thing about modulo
hashing is that it&#39;s not only very efficient but also extremely simple to
implement.</p>

<p>If we want to partition our input table using modulo hashing, we first have to
assign an identifier <code>ID</code> to every row in the table. This <code>ID</code> is usually
derived from the row itself, for example by designating one of the table&#39;s columns
as the primary key. For illustration purposes, we will use numeric identifiers,
but the same works with strings. <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot4">[4]</a></p>

<p>The only piece of information that modulo hashing keeps is a single variable <code>N</code>.
This variable <code>N</code> contains the number of servers among which the table should
be partitioned.</p>

<p>Now, to figure out on which server a given row belongs, we simply compute the
remainder of the division of the row&#39;s <code>ID</code> by <code>N</code>. This use of the modulo
operation is also where the algorithm derives its name from.</p>

<pre><code>find_row(row_id) {
  server_id := row_id % N;
  return server_id;
}
</code></pre>

<p>If, for example, we wanted to locate the row with <code>ID=123</code> in a table partitioned
among 8 servers (<code>N=8</code>), the row would be stored on server number 3
(<code>123 % 8 = 3</code>).</p>

<p>Of course, this is assuming we have also used the algorithm to decide on
which server to put each row while loading the input table in the first place.</p>
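<p>As a minimal sketch, here is the same lookup in Python. The function name <code>find_server</code> and the table of 1000 sequential row IDs are assumptions made for illustration; they are not taken from any particular database.</p>

```python
def find_server(row_id: int, num_servers: int) -> int:
    """Return the index of the server that stores the row with this ID."""
    # For a positive divisor, Python's % never returns a negative number,
    # so negative IDs map to a valid server index as well.
    return row_id % num_servers


N = 8  # number of servers, as in the example above

print(find_server(123, N))  # 3

# Distribute 1000 rows with sequential IDs and count the rows per server.
rows_per_server = [0] * N
for row_id in range(1000):
    rows_per_server[find_server(row_id, N)] += 1

print(rows_per_server)  # [125, 125, 125, 125, 125, 125, 125, 125]
```

<p>The same function is used for placing a row and for finding it again later, which is why no per-row bookkeeping is needed.</p>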

<p>Modulo hashing works out so that every possible <code>ID</code> is consistently mapped
to a single server. The distribution of the rows among servers will be
approximately uniform, i.e. every server will get roughly the same number of
rows. <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot5">[5]</a></p>

<p><hr /></p>

<p>The modulo hashing algorithm is a huge improvement over our naive approach as it
is constant in space and time: Locating a row takes the same amount of time
regardless of the total number of rows in the table. And the only pieces of
information we need to tell where a row goes are the row&#39;s <code>ID</code> and the total
number of servers <code>N</code>, which will only take a few bytes to store. Not bad.</p>

<p>So are we done? I&#39;m afraid we&#39;re not.</p>

<p>One thing modulo hashing can&#39;t handle well is a growing dataset. If we&#39;re
continually adding more rows to the table, we will eventually have to add more
capacity and increase the number of servers <code>N</code>. However, once we do that the
locations of almost all of the rows would change, since changing <code>N</code> also changes
the result of all modulo operations.</p>

<p>This means that in order to increase the number of servers <code>N</code> and still keep the
rows where they belong, we would have to copy every single row in the table to
its new location every time we add or remove a server.</p>

<p>That&#39;s not exactly ideal: Even if we did not care about the massive overhead of
copying every single row, we would eventually reach a point where our table grows
faster than we can rebalance it.</p>
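<p>To get a feeling for how many rows move, the following sketch counts the rows whose server changes when a hypothetical cluster grows from 8 to 9 servers. The table of 72,000 sequential IDs is an assumption, chosen only so that the numbers come out evenly.</p>

```python
def find_server(row_id: int, num_servers: int) -> int:
    return row_id % num_servers


def fraction_moved(row_ids, old_n: int, new_n: int) -> float:
    """Fraction of rows that end up on a different server after resizing."""
    moved = sum(
        1 for r in row_ids
        if find_server(r, old_n) != find_server(r, new_n)
    )
    return moved / len(row_ids)


row_ids = range(72_000)  # hypothetical table with sequential IDs

# 8 out of every 9 rows change servers, i.e. roughly 0.89.
print(fraction_moved(row_ids, 8, 9))
```

<p>Only the rows whose <code>ID</code> leaves the same remainder for both values of <code>N</code> stay in place, which is one row in nine in this example.</p>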

<h2>Consistent Hashing</h2>

<p>Consistent hashing is a more involved version of modulo hashing. The main
improvement of consistent hashing is that it allows us to add and remove servers
without affecting the locations of all rows.</p>

<p>At the heart of the consistent hashing algorithm is a so-called <em>circular
keyspace</em>. Before we discuss what exactly that means let&#39;s define the term
<em>keyspace</em>:</p>

<p>As with modulo hashing we need to assign an identifier <code>ID</code> to each row. Now,
the keyspace is the range of all valid <code>ID</code> values. If the <code>ID</code> is numeric, the
keyspace is the range from negative infinity to positive infinity.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace_only.png" />
<div class="img_sub">A keyspace and the positions of three row identifiers within the keyspace.</div></p>

<p>Within the keyspace, identifiers are well-ordered <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot6">[6]</a>. That means each <code>ID</code> has a
<em>successor</em> and a <em>predecessor</em>. The successor of an <code>ID</code> is the <code>ID</code> that goes
immediately after it and the predecessor is the one that goes immediately before it.</p>

<p>In the illustration above, the successor of green (<code>ID=123</code>) is red (<code>ID=856</code>), the
successor of red (<code>ID=856</code>) is yellow (<code>ID=923</code>) and so on.</p>

<p>But what is the successor of yellow (<code>ID=923</code>)? Does it have one? The answer is
not entirely clear. However, we will later have to come up with a successor for each
possible position in the keyspace, so we will have to define what the successor
of the last position in our keyspace is.</p>

<p>Imagine we glued one end of our keyspace to the other end:</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/circular_keyspace.png" />
<div class="img_sub">A circular keyspace and the positions of three row identifiers within the keyspace.</div></p>

<p>Now we can clearly say that, in clockwise order, yellow&#39;s (<code>ID=923</code>) successor
is green (<code>ID=123</code>). Also, it finally looks like a circular keyspace!</p>

<p><hr /></p>

<p>Back to consistent hashing. Like we did with modulo hashing, we will choose
an initial number of servers <code>N</code>. For each of the <code>N</code> servers, we will put a marker
at a random position in the circular keyspace.</p>

<p>To decide on which server a given row <code>ID</code> belongs, we first locate the position
of the <code>ID</code> in the circular keyspace and then search clockwise for the next
server marker. In other words: each row goes onto the server whose marker
immediately succeeds the row-id&#39;s position in the keyspace.</p>

<p>The illustration below shows a circular keyspace with eight server markers. In
the illustration, the succeeding server marker for row <code>123</code> is <code>server 1</code>,
while the succeeding server marker for rows <code>856</code> and <code>923</code> is <code>server 3</code>.</p>

<p>So row <code>123</code> gets stored on <code>server 1</code> while rows <code>856</code> and <code>923</code> get stored on
<code>server 3</code>.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/circular_keyspace_servers.png" />
<div class="img_sub">A circular keyspace with three row identifiers and eight server markers.</div></p>

<p>As with the modulo hashing scheme, we still end up with a uniform distribution
of rows among servers <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot7">[7]</a> and can quickly locate each row. The only
information we have to store are the positions of each server&#39;s marker in the keyspace.</p>

<p>Additionally, any server marker that we add or remove will only affect the rows
immediately between it and the previous marker: The locations of all other rows
remain consistent, hence the name consistent hashing.</p>

<p>This means we can add or remove servers and only have to move a small subset of the
rows into their new locations. The ratio of rows that needs to be moved is <code>1/N</code>
where <code>N</code> is the number of servers. So as we add more servers, the percentage of
rows that need to be moved actually gets smaller.</p>
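<p>The ring can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of any of the systems mentioned here: the class name <code>Ring</code>, the server names and the choice of 64 markers per server are hypothetical, and marker positions are derived from SHA-1 hashes of the server name instead of being drawn at random so that the example is reproducible. As described in footnote [7], row IDs are hashed into a 160 bit keyspace as well.</p>

```python
import bisect
import hashlib


def position(key: str) -> int:
    """Map a key to a position in a 160-bit circular keyspace via SHA-1."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")


class Ring:
    def __init__(self, servers, markers_per_server=64):
        self.markers_per_server = markers_per_server
        self.markers = []  # sorted list of (position, server) tuples
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        # Several markers per server smooth out the distribution.
        for i in range(self.markers_per_server):
            bisect.insort(self.markers, (position(f"{server}#{i}"), server))

    def find_server(self, row_id):
        # Find the first marker clockwise from the row's position. The
        # modulo wraps around from the end of the keyspace to the start.
        idx = bisect.bisect_right(self.markers, (position(str(row_id)), ""))
        return self.markers[idx % len(self.markers)][1]


ring = Ring([f"server{i}" for i in range(8)])
before = {r: ring.find_server(r) for r in range(10_000)}

ring.add_server("server8")
after = {r: ring.find_server(r) for r in range(10_000)}

moved = sum(1 for r in before if before[r] != after[r])
print(moved / len(before))  # should be close to 1/9 rather than 8/9
```

<p>Every row that moves in this sketch moves onto the newly added server; rows never shuffle between the existing servers.</p>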

<p><hr /></p>

<p>Can we still do better? It depends on your use case. Consistent hashing is
successfully employed in a number of popular key/value databases <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot8">[8]</a> such as Dynamo <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot9">[9]</a>,
Cassandra <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot10">[10]</a> or memcached <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot11">[11]</a>. Nevertheless, here are two things that we could
improve about consistent hashing:</p>

<p>Firstly, consistent hashing only supports an exact lookup operation. That is, we
can only find a row quickly if we already know its <code>ID</code>. If we want to find the
locations of all rows in a range of <code>IDs</code>, for example all rows with an <code>ID</code>
between <code>100</code> and <code>200</code>, we&#39;re back to scanning the full table.</p>

<p>Because of this, consistent hashing is particularly well suited for key/value
databases where range scans are not required but less so for OLAP <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot12">[12]</a> systems
like EventQL.</p>

<p>The other possible improvement is that we still have to copy roughly <code>1/N</code> of the
table&#39;s rows after changing the number of servers. If we had 100TB on 20 servers,
that would mean we&#39;re, realistically, still copying at least 15TB for every
server addition <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot13">[13]</a>. Not too bad, but still a lot of overhead network traffic.</p>

<h2>The BigTable Algorithm</h2>

<p>The last algorithm we will look at today is best known for its publication in
Google&#39;s BigTable paper <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot14">[14]</a>.</p>

<p>The bigtable algorithm takes a completely different approach to the problem,
but it also starts by defining a keyspace. Except this time it&#39;s not a 
circular keyspace, but a linear one.</p>

<p>The illustration below shows a bigtable keyspace and the position of three rows
with the identifiers <code>123</code>, <code>856</code> and <code>923</code> within the keyspace.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace_only.png" /></p>

<p>The next thing bigtable does is to split up the keyspace into a number of
<em>partitions</em> that are defined by their start and end positions, i.e. by the
lowest and highest row identifiers that will still be contained in the partition.</p>

<p>The illustration below shows a keyspace that is split into five partitions
A-E. In the illustration, the row with <code>ID=123</code> goes into partition <code>B</code> and
the rows with <code>ID=856</code> and <code>ID=923</code> both go into partition <code>D</code>.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace.png" />
<div class="img_sub">Three record identifiers are mapped onto five partitions</div></p>

<p>Now, the clever bit about the bigtable algorithm is how it comes up with the
partition boundaries. To see why it&#39;s clever we have to understand why we can&#39;t
simply divide the keyspace into equal parts without knowing the exact distribution
of the input data:</p>

<p>One reason for that is that if you split up the range from negative to positive
infinity into a list of discrete partitions, you end up with an infinite number
of partitions.</p>

<p>The other reason is that it&#39;s highly likely that the row identifiers will all
be piled up in a small area of the keyspace. After all, the identifiers might
be user-supplied so we can&#39;t necessarily guarantee anything about their distribution.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/keyspace_distribution.png" />
<div class="img_sub">Realistic distribution of record identifiers in the keyspace</div></p>

<p>So it could be that even though we have split up the keyspace into a large number
of partitions, all rows actually end up in the same partition. And we can&#39;t
solve the problem by making the partitions infinitesimally small either - that
would be like going back to keeping track of every row&#39;s location individually,
just a bit worse.</p>

<p><hr /></p>

<p>Here&#39;s how bigtable solves the problem: Initially the table starts out with a
single partition that covers the whole keyspace - from negative infinity to
positive infinity.</p>

<p>As soon as this first partition has become too large, it will be split in two.
The split point will be chosen so that it divides the data in the partition
into two roughly equal halves. This continues recursively as partitions become too
large. At a basic level, it&#39;s an application of the classic divide and conquer
principle <a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/#foot15">[15]</a>.</p>

<p><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/partition_split.png" />
<div class="img_sub">Partition D is splitting into partitions D1 and D2</div></p>

<p>This way, you always end up with a number of partitions that are roughly equal
in size, even though the distribution of row identifiers in the keyspace is
initially unknown.</p>
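<p>The splitting rule can be sketched as follows. This is a simplified, single-process illustration and not the code of Bigtable or EventQL: the class name <code>PartitionMap</code> and the tiny split threshold <code>MAX_ROWS</code> are hypothetical, rows are counted instead of measured in bytes, and row IDs are assumed to be unique.</p>

```python
import bisect

MAX_ROWS = 4  # hypothetical split threshold, kept tiny for illustration


class PartitionMap:
    def __init__(self):
        # Start with one partition that covers the whole keyspace. A
        # partition is identified by its inclusive lower bound; its upper
        # bound is the lower bound of the next partition.
        self.starts = [float("-inf")]
        self.rows = [[]]  # rows[i] holds the sorted IDs of partition i

    def find_partition(self, row_id):
        return bisect.bisect_right(self.starts, row_id) - 1

    def insert(self, row_id):
        i = self.find_partition(row_id)
        bisect.insort(self.rows[i], row_id)
        if len(self.rows[i]) > MAX_ROWS:
            self.split(i)

    def split(self, i):
        rows = self.rows[i]
        mid = len(rows) // 2
        # The median row ID becomes the lower bound of the new upper half.
        self.starts.insert(i + 1, rows[mid])
        self.rows[i:i + 1] = [rows[:mid], rows[mid:]]


pm = PartitionMap()
for row_id in range(100, 110):  # IDs piled up in a small area
    pm.insert(row_id)

print(pm.starts)                  # [-inf, 102, 104, 106]
print([len(r) for r in pm.rows])  # [2, 2, 2, 4]
```

<p>Even though all identifiers sit in a tiny area of the keyspace, no partition ever holds more than <code>MAX_ROWS</code> rows after an insert completes.</p>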

<p>And since each partition is defined in terms of its lowest and highest contained
row identifier, we can easily implement efficient range scans: To find all rows
in a given range of identifiers, we only have to scan the partitions with
overlapping ranges.</p>
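<p>As a sketch of such a range scan, assume a partition map for the five partitions A to E from the earlier illustration. The boundary values below are made up for the example; only the assignment of rows <code>123</code>, <code>856</code> and <code>923</code> to partitions B and D follows the illustration.</p>

```python
import bisect

# Hypothetical partition map: the sorted lower bounds of all partitions.
# Partition i covers [starts[i], starts[i + 1]); the last one is unbounded.
starts = [float("-inf"), 100, 500, 800, 1000]
names = ["A", "B", "C", "D", "E"]


def find_partition(row_id):
    """Exact lookup: the partition whose range contains row_id."""
    return names[bisect.bisect_right(starts, row_id) - 1]


def partitions_for_range(low, high):
    """Partitions that may contain IDs in the inclusive range [low, high]."""
    first = bisect.bisect_right(starts, low) - 1
    last = bisect.bisect_right(starts, high) - 1
    return names[first:last + 1]


print(find_partition(123))             # B
print(find_partition(856))             # D
print(partitions_for_range(100, 200))  # ['B']
print(partitions_for_range(450, 900))  # ['B', 'C', 'D']
```

<p>Both lookups are binary searches over the partition boundaries, so their cost grows with the number of partitions and not with the number of rows.</p>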

<p>Lastly, the bigtable scheme does require a second allocation layer to assign
partitions to servers that we didn&#39;t discuss here. Suffice to say that this
second allocation layer allows us to add new servers to a cluster without
physically moving a single row. Of course, we still need to move around some rows
every time we split a partition.</p>

<h2>Conclusion</h2>

<p>So is the bigtable algorithm really &quot;better&quot; than consistent hashing? Again, it
depends on the use case.</p>

<p>The upsides of the bigtable scheme are that it supports range scans and that we
can add capacity to a cluster without copying rows. The major downside is that
implementing the algorithm in a masterless system requires a fair amount of
code and synchronization.</p>

<p>For EventQL, we still chose the bigtable algorithm as the clear winner. After
reading this post, go have a look at the debug interface of EventQL where you
can see the <em>partition map</em> for a given table. Hopefully it will make a lot more
sense now:</p>

<p><a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/partition_map.png" target="_blank"><img src="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/partition_map_small.png" /></a>
<div class="img_sub">Partition map for an EventQL table with a DATETIME primary key</div></p>

<p>That&#39;s all for today. In the next post we will discuss how to handle streaming
updates on columnar files. You can subscribe to email updates for upcoming posts
or the RSS feed in the sidebar.</p>

<div class="footnotes">
  <p>
    <a id="foot0"></a>
    [0] Andrew Lamb et al. (2012) The Vertica Analytic Database: C-Store 7 Years Later
    (The 38th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf</a>
  </p>
  <p>
    <a id="foot1"></a>
    [1] Sergey Melnik et al. (2010) Dremel: Interactive Analysis of Web-Scale Datasets
    (The 36th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf">http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf</a>
  </p>
  <p>
    <a id="foot2"></a>
    [2] EventQL (2016) An open-source SQL database for large-scale event analytics &mdash;
    <a target="_blank" href="http://eventql.io">http://eventql.io</a>
  </p>
  <p>
    <a id="foot3"></a>
    [3] Complexity Theory on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Big_O_notation">https://en.wikipedia.org/wiki/Big_O_notation</a>
  </p>
  <p>
    <a id="foot4"></a>
    [4] The usual way to do this is to first run the string through a hash function
    and then use (a subset of) the output of the hash function as a numeric identifier.
  </p>
  <p>
    <a id="foot5"></a>
    [5] To ensure uniform distribution, pre-process the identifiers with a suitable
    hash function (i.e. a hash function that guarantees uniform distribution of
    its outputs such as <a target="_blank" href="https://en.wikipedia.org/wiki/MurmurHash">MurmurHash</a>).
  </p>
  <p>
    <a id="foot6"></a>
    [6] Well-order on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Well-order">https://en.wikipedia.org/wiki/Well-order</a>
  </p>
  <p>
    <a id="foot7"></a>
    [7] Practical systems will apply a hash function to the row ID and assign
    more than one marker per server. Marker values are chosen randomly in usually
    a 160 bit keyspace. This works out so that the data is almost guaranteed to
    be uniformly distributed among servers.
  </p>
  <p>
    <a id="foot8"></a>
    [8] Key-value database on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Key-value_database">https://en.wikipedia.org/wiki/Key-value_database</a>
  </p>
  <p>
    <a id="foot9"></a>
    [9] Giuseppe DeCandia et al. (2007) Dynamo: Amazon’s Highly Available Key-value Store
    (21st ACM Symposium on Operating Systems Principles) &mdash;
    <a target="_blank" href="http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf</a>
  </p>
  <p>
    <a id="foot10"></a>
    [10] Avinash Lakshman, Prashant Malik (2009) Cassandra - A Decentralized Structured Storage System
    (ACM SIGOPS Operating Systems Review archive Volume 44 Issue 2) &mdash;
    <a target="_blank" href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf</a>
  </p>
  <p>
    <a id="foot11"></a>
    [11] memcached - a distributed memory object caching system &mdash;
    <a target="_blank" href="https://memcached.org/">https://memcached.org/</a>
  </p>
  <p>
    <a id="foot12"></a>
    [12] OLAP on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Online_analytical_processing">https://en.wikipedia.org/wiki/Online_analytical_processing</a>
  </p>
  <p>
    <a id="foot13"></a>
    [13] Assuming a replication factor of 3. In practice, replication factor and
    overhead combined may result in the data being copied 6-9 times.
  </p>
  <p>
    <a id="foot14"></a>
    [14] Fay Chang et al. (2006) Bigtable: A Distributed Storage System for Structured Data
    (7th USENIX Symposium on Operating Systems Design and Implementation) &mdash;
    <a target="_blank" href="http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf</a>
  </p>
  <p>
    <a id="foot15"></a>
    [15] Divide and conquer on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Divide_and_conquer">https://en.wikipedia.org/wiki/Divide_and_conquer</a>
  </p>
</div>

<h4>Next up in the series:</h4>

<ul>
<li><a href="https://eventql.io/blog/parallel-io-and-columnar-storage/">Part 1: Parallel I/O and Columnar Storage</a></li>
<li><a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/">Part 2: Dividing Infinity - Distributed Partitioning Schemes</a></li>
</ul>
]]></content:encoded>
          <author>paul@eventql.io (Paul Asmuth)</author>
          <pubDate>Mon, 15 Aug 2016 16:00:00 -0000</pubDate>
        </item>
      
        <item>
          <title>Parallel I/O and Columnar Storage</title>
          <link>https://eventql.io/blog/parallel-io-and-columnar-storage</link>
          <guid isPermaLink="false">https://eventql.io/blog/parallel-io-and-columnar-storage</guid>
          <enclosure url="https://eventql.io/blog/parallel-io-and-columnar-storage/columnar_storage.png" length="55300" type="image/png"/>
          <description><![CDATA[ <p><i>
Welcome to our new blog! Today&#39;s article is the first in a series of posts
discussing the architecture and implementation of massively parallel databases
such as Vertica <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot0">[0]</a>, BigQuery <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot1">[1]</a> and EventQL <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot2">[2]</a>.
</i></p>

<p><i>
We begin with a high-level overview of the system, while follow-up posts will discuss
specific components in more detail. The target audience is software and systems engineers
with an interest in databases and distributed systems.
</i></p>

<h3>The Challenge</h3>

<p>Let&#39;s start off with a challenge: We&#39;re given a table with 100TB of web tracking
data collected over a period of a few days. Here is a small sample of rows from
the table:</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/tracking_data.png" alt="Tracking Data" /></p>

<p>Our goal is to answer queries like <i>&quot;How many people visited the page &#39;/account/signup&#39;
in the last 10 days?&quot;</i>. There are only two rules: The query is not known beforehand
and we must answer it in less than a second. Here is the same example query in SQL:</p>

<pre><code>SELECT count(1)
FROM tracking_data
WHERE url = &#39;/account/signup&#39; AND time &gt; time_at(&quot;-10d&quot;);
</code></pre>

<p>If that doesn&#39;t sound like an interesting challenge yet, consider this quick
back-of-the-envelope calculation: Assuming the average hard disk can read roughly
200MB per second (sequentially), loading 100TB from disk will take 500k disk-seconds
or about 138 disk hours.</p>

<p>Oops. 138 hours is almost 6 days. A long way from &quot;less than a second&quot;. We&#39;re
almost six orders of magnitude off, and that&#39;s just the time to read the data from
disk, before we have even started to process any of it.</p>

<p><hr /></p>

<p>How can we still solve the challenge? We can&#39;t make a single disk go any faster
but we can break up the data set into smaller pieces, put each piece on its own
disk and read all pieces from all disks in parallel. If we distributed the data
over 500k individual disks we could read our whole data set in one second.</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/parallel_query.png" alt="Parallel query" /></p>

<p>There is only one small problem with that scheme &mdash; half a million disks
cost <em>a lot</em> of money.</p>

<p>So even if we use a lot of machines, reading the full data set from disk
in less than a second is utterly out of reach.</p>

<p><hr /></p>

<p>Can we still solve the challenge? Yes, but I&#39;m afraid we&#39;ll have to cheat &mdash;
maybe we can answer the query without actually reading all the data from disk.</p>

<p>If we could come up with an algorithm which computes the query result after
reading only 0.01% of the data (or 10GB) from disk, then we could return an answer
in one second using just fifty disks. Fifty disks could be hosted in a dozen
servers. Finally, that sounds reasonable!</p>

<p>But how do we compute the answer for the full dataset while reading only 0.01% of
the data from disk? One approach would be to use sampling and probabilistic algorithms.
However, sampling would only give us <i>approximately</i> correct results. Approximate
results are great for some use cases but range from not so great to unworkable for others.</p>

<p>There is another trick we can use to minimize the amount of data to be read from
disk while always giving correct results. The technique is called &quot;column-oriented&quot;
or &quot;columnar&quot; storage.</p>

<h3>Columnar Storage</h3>

<p>To really understand the benefit of columnar storage for data analytics we first
have to look at how regular <em>row-oriented</em> databases store tables on disk:</p>

<p>It turns out they do it pretty much like you&#39;d expect them to. They basically
keep a file somewhere which contains all the rows in the table. One row after
another. Hence the name &quot;row-oriented&quot;.</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/row_oriented.png" alt="Row-oriented file" />
<div class="img_sub">A row-oriented file conceptually looks something like this</div></p>

<p>What&#39;s problematic about row-oriented databases is that to execute a query, they
always have to read the full rows from disk. Even if just a small part of each
row is required to answer the query, the database still has to read every row
in full. <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot3">[3]</a></p>

<p>The not completely intuitive reason for this is that hard disks are only fast if
you read a file sequentially, i.e. only if you read one byte after another.
Jumping around within a file performs very poorly. So poorly in fact, that reading
whole rows is practically always <em>faster</em> than reading partial rows. <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot4">[4]</a></p>

<p>Consider the following example query which calculates the number of page views per
minute:</p>

<pre><code>SELECT date_trunc(&quot;1min&quot;, time), count(1)
FROM tracking_data
GROUP BY date_trunc(&quot;1min&quot;, time);
</code></pre>

<p>To compute the answer, we only need to know the value of the <code>time</code> column of
each row. We&#39;re not interested in the <code>session_id</code>, <code>url</code> or any of the potentially
hundreds of other columns of the table.</p>

<p>Ideally, we would only load the <code>time</code> column of each row from disk when
executing the query. But we just saw that row-oriented storage can&#39;t do that
efficiently. We always have to read the full rows no matter what.</p>

<p>Depending on the table and query, we could be spending 99% of the time reading
data from disk which we&#39;re not going to need to answer the query.</p>

<p><hr /></p>

<p>This is the problem column-oriented storage tries to solve. The basic idea is that
instead of storing one row after another, we can break up the row into the individual
columns and then store one column after another.</p>

<p>If, for example, our table contained one thousand rows, each with the three columns
<code>time</code>, <code>session_id</code> and <code>url</code>, we would first store an array of a thousand <code>time</code>
values, then another array of a thousand <code>session_id</code> values and finally an array of
a thousand <code>url</code> values.</p>

<p>You might have come across the concept before under a different name: What we&#39;re
doing is basically vectorization <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot5">[5]</a>.</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/columnar_storage.png" alt="Columnar storage" /></p>

<p>Storing each column separately has two desirable properties. The first and most
obvious one is that it allows us to also read each column separately &mdash; we don&#39;t
have to load all the extraneous columns from disk anymore.</p>

<p>The second and less obvious upside of columnar storage is that we can compress
the data very efficiently. The compression further reduces the number of bytes
we actually have to fetch from disk to read a column.</p>

<p>To see why compression in columnar files can be very significant, imagine our
table had a fourth <code>is_customer</code> column. Sadly, we don&#39;t have any customers yet
so the field is always <code>false</code>.</p>

<p>In a row-oriented database, storing the <code>is_customer</code> field for one million rows
would take at least 1MB (one byte per boolean). In a columnar database we can store
all one million values in a single byte &mdash; a 1000000x improvement. <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot6">[6]</a></p>

<p><hr /></p>

<p>Lastly, it should be noted that columnar storage also has a big downside: It&#39;s
less efficient to perform updates on columnar files than on row-oriented files.
So you probably won&#39;t see the classical OLTP databases like MySQL
switching to columnar any time soon. Still, for analytical queries on large data
sets columnar is clearly the way to go.</p>

<h3>Are we there yet?</h3>

<p>Looks like we have finally put together a scheme which will allow us to execute a
SQL query on 100TB of data in less than a second, even though just reading the
data from disk would have taken many days.</p>

<p>Let&#39;s recapitulate our approach: We&#39;re going to split up the data into many small
pieces, then distribute the pieces among a dozen or so machines. On each machine,
we will cheat by storing the rows in columnar format and only actually reading a
small subset of the compressed data to answer the query.</p>

<p>Of course, we have only scratched the surface of the problem so far. In the
<a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/">next post</a> we
will discuss how exactly we&#39;re going to split up the data into pieces.</p>

<p>You can subscribe to email updates for upcoming posts or to the RSS feed in the
sidebar.</p>

<div class="footnotes">
  <p>
    <a id="foot0"></a>
    [0] Andrew Lamb et al. (2012) The Vertica Analytic Database: C-Store 7 Years Later
    (The 38th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf</a>
  </p>
  <p>
    <a id="foot1"></a>
    [1] Sergey Melnik et al. (2010) Dremel: Interactive Analysis of Web-Scale Datasets
    (The 36th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf">http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf</a>
  </p>
  <p>
    <a id="foot2"></a>
    [2] EventQL (2016) An open-source SQL database for large-scale event analytics &mdash;
    <a target="_blank" href="http://eventql.io">http://eventql.io</a>
  </p>
  <p>
    <a id="foot3"></a>
    [3] Yes, you can put an index on the column. It turns out that indexes in
    traditional row-oriented databases are fundamentally columnar representations
    of the data. Another way to look at it would be that a columnar table behaves
    like a traditional table with an automatic index on all columns.
  </p>
  <p>
    <a id="foot4"></a>
    [4] Of course this is an oversimplification. However, even with buffering and
    speculative read-ahead you won't realistically be faster than reading the rows
    in full.
  </p>
  <p>
    <a id="foot5"></a>
    [5] "Vectorization" on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Vectorization">https://en.wikipedia.org/wiki/Vectorization</a>
  </p>
  <p>
    <a id="foot6"></a>
    [6] We'll reveal how later in this series. If you can't wait until then, check
    out this excellent paper on the topic [1].
  </p>
</div>

<h4>Next up in the series:</h4>

<ul>
<li><a href="https://eventql.io/blog/parallel-io-and-columnar-storage/">Part 1: Parallel I/O and Columnar Storage</a></li>
<li><a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/">Part 2: Dividing Infinity - Distributed Partitioning Schemes</a></li>
</ul>
]]></description>
          <content:encoded><![CDATA[ <p><i>
Welcome to our new blog! Today&#39;s article is the first in a series of posts
discussing the architecture and implementation of massively parallel databases
such as Vertica <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot0">[0]</a>, BigQuery <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot1">[1]</a> and EventQL <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot2">[2]</a>.
</i></p>

<p><i>
We begin with a high-level overview of the system, while follow-up posts will discuss
specific components in more detail. The target audience is software and systems engineers
with an interest in databases and distributed systems.
</i></p>

<h3>The Challenge</h3>

<p>Let&#39;s start off with a challenge: We&#39;re given a table with 100TB of web tracking
data collected over a period of a few days. Here is a small sample of rows from
the table:</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/tracking_data.png" alt="Tracking Data" /></p>

<p>Our goal is to answer queries like <i>&quot;How many people visited the page &#39;/account/signup&#39;
in the last 10 days?&quot;</i>. There are only two rules: The query is not known beforehand
and we must answer it in less than a second. Here is the same example query in SQL:</p>

<pre><code>SELECT count(1)
FROM tracking_data
WHERE url = &#39;/account/signup&#39; AND time &gt; time_at(&quot;-10d&quot;);
</code></pre>

<p>If that doesn&#39;t sound like an interesting challenge yet, consider this quick
back-of-the-envelope calculation: Assuming the average hard disk can read roughly
200MB per second (sequentially), loading 100TB from disk will take 500k disk-seconds
or about 138 disk hours.</p>
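
<p>The same estimate as a few lines of Python, for readers who want to plug in
their own numbers. This is only a sketch of the arithmetic above: the constant
names are made up for illustration and the 200MB/s figure is the rough assumption
from the text, not a measured value.</p>

```python
# Back-of-the-envelope: time to read the whole table from one disk.
TABLE_BYTES = 100 * 10**12        # 100TB, as in the challenge
DISK_BYTES_PER_SEC = 200 * 10**6  # ~200MB/s sequential read (assumption)

disk_seconds = TABLE_BYTES / DISK_BYTES_PER_SEC
disk_hours = disk_seconds / 3600
disk_days = disk_hours / 24

print(f"{disk_seconds:,.0f} disk-seconds")
print(f"{disk_hours:.1f} disk-hours")
print(f"{disk_days:.1f} days on a single disk")
```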

<p>Oops. 138 hours is almost 6 days. A long way from &quot;less than a second&quot;. We&#39;re
almost six orders of magnitude off, and that&#39;s just the time to read the data from
disk, before we have even started to process any of it.</p>

<p><hr /></p>

<p>How can we still solve the challenge? We can&#39;t make a single disk go any faster
but we can break up the data set into smaller pieces, put each piece on its own
disk and read all pieces from all disks in parallel. If we distributed the data
over 500k individual disks we could read our whole data set in one second.</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/parallel_query.png" alt="Parallel query" /></p>

<p>There is only one small problem with that scheme &mdash; half a million disks
cost <em>a lot</em> of money.</p>

<p>So even if we use a lot of machines, reading the full data set from disk
in less than a second is utterly out of reach.</p>

<p><hr /></p>

<p>Can we still solve the challenge? Yes, but I&#39;m afraid we&#39;ll have to cheat &mdash;
maybe we can answer the query without actually reading all the data from disk.</p>

<p>If we could come up with an algorithm which computes the query result after
reading only 0.01% of the data (or 10GB) from disk, then we could return an answer
in one second using just fifty disks. Fifty disks could be hosted in a dozen
servers. Finally, that sounds reasonable!</p>
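
<p>To make the disk counts concrete, here is a small sketch that computes how many
disks have to read in parallel to hit a deadline. The helper <code>disks_needed</code>
is a hypothetical name used only for this illustration, and the per-disk throughput
is again the rough 200MB/s assumed earlier.</p>

```python
TABLE_BYTES = 100 * 10**12        # 100TB
DISK_BYTES_PER_SEC = 200 * 10**6  # ~200MB/s per disk (assumption)

def disks_needed(bytes_to_read, deadline_seconds=1):
    """Number of disks that must read in parallel to finish within the deadline."""
    per_disk = DISK_BYTES_PER_SEC * deadline_seconds
    return -(-bytes_to_read // per_disk)  # integer ceiling division

full_scan = disks_needed(TABLE_BYTES)            # read everything
tiny_scan = disks_needed(TABLE_BYTES // 10_000)  # read 0.01% of the table (10GB)

print(full_scan)
print(tiny_scan)
```

<p>Reading ten thousand times less data needs ten thousand times fewer disks, which
is the whole point of the &quot;cheat&quot;.</p>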

<p>But how do we compute the answer for the full dataset while reading only 0.01% of
the data from disk? One approach would be to use sampling and probabilistic algorithms.
However, sampling would only give us <i>approximately</i> correct results. Approximate
results are great for some use cases but range from not so great to unworkable for others.</p>

<p>There is another trick we can use to minimize the amount of data to be read from
disk while always giving correct results. The technique is called &quot;column-oriented&quot;
or &quot;columnar&quot; storage.</p>

<h3>Columnar Storage</h3>

<p>To really understand the benefit of columnar storage for data analytics we first
have to look at how regular <em>row-oriented</em> databases store tables on disk:</p>

<p>It turns out they do it pretty much like you&#39;d expect them to. They basically
keep a file somewhere which contains all the rows in the table. One row after
another. Hence the name &quot;row-oriented&quot;.</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/row_oriented.png" alt="Row-oriented file" />
<div class="img_sub">A row-oriented file conceptually looks something like this</div></p>

<p>What&#39;s problematic about row-oriented databases is that to execute a query, they
always have to read the full rows from disk. Even if just a small part of each
row is required to answer the query, the database still has to read every row
in full. <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot3">[3]</a></p>

<p>The not completely intuitive reason for this is that hard disks are only fast if
you read a file sequentially, i.e. only if you read one byte after another.
Jumping around within a file performs very poorly. So poorly in fact, that reading
whole rows is practically always <em>faster</em> than reading partial rows. <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot4">[4]</a></p>

<p>Consider the following example query which calculates the number of page views per
minute:</p>

<pre><code>SELECT date_trunc(&quot;1min&quot;, time), count(1)
FROM tracking_data
GROUP BY date_trunc(&quot;1min&quot;, time);
</code></pre>

<p>To compute the answer, we only need to know the value of the <code>time</code> column of
each row. We&#39;re not interested in the <code>session_id</code>, <code>url</code> or any of the potentially
hundreds of other columns of the table.</p>

<p>Ideally, we would only load the <code>time</code> column of each row from disk when
executing the query. But we just saw that row-oriented storage can&#39;t do that
efficiently. We always have to read the full rows no matter what.</p>

<p>Depending on the table and query, we could be spending 99% of the time reading
data from disk which we&#39;re not going to need to answer the query.</p>

<p><hr /></p>

<p>This is the problem column-oriented storage tries to solve. The basic idea is that
instead of storing one row after another, we can break up the row into the individual
columns and then store one column after another.</p>

<p>If, for example, our table contained one thousand rows, each with the three columns
<code>time</code>, <code>session_id</code> and <code>url</code>, we would first store an array of a thousand <code>time</code>
values, then another array of a thousand <code>session_id</code> values and finally an array of
a thousand <code>url</code> values.</p>
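
<p>As a rough in-memory illustration of that layout (not how EventQL or any other
database actually lays out its files), the sketch below turns a handful of rows
into one array per column and then answers the page-views-per-minute question by
touching only the <code>time</code> array. The sample values and the helper
<code>to_columns</code> are invented for this example.</p>

```python
# Three hypothetical sample rows; the values are made up for illustration.
rows = [
    {"time": 1471273200, "session_id": "s1", "url": "/"},
    {"time": 1471273215, "session_id": "s2", "url": "/account/signup"},
    {"time": 1471273290, "session_id": "s1", "url": "/pricing"},
]

def to_columns(rows):
    """Turn a list of row dicts into one array of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Page views per minute: only the `time` array is read,
# `session_id` and `url` are never touched.
views_per_minute = {}
for t in columns["time"]:
    minute = t - t % 60  # truncate the unix timestamp to the full minute
    views_per_minute[minute] = views_per_minute.get(minute, 0) + 1

print(columns["time"])
print(views_per_minute)
```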

<p>You might have come across the concept before under a different name: What we&#39;re
doing is basically vectorization <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot5">[5]</a>.</p>

<p><img src="https://eventql.io/blog/parallel-io-and-columnar-storage/columnar_storage.png" alt="Columnar storage" /></p>

<p>Storing each column separately has two desirable properties. The first and most
obvious one is that it allows us to also read each column separately &mdash; we don&#39;t
have to load all the extraneous columns from disk anymore.</p>

<p>The second and less obvious upside of columnar storage is that we can compress
the data very efficiently. The compression further reduces the number of bytes
we actually have to fetch from disk to read a column.</p>

<p>To see why compression in columnar files can be very significant, imagine our
table had a fourth <code>is_customer</code> column. Sadly, we don&#39;t have any customers yet
so the field is always <code>false</code>.</p>

<p>In a row-oriented database, storing the <code>is_customer</code> field for one million rows
would take at least 1MB (one byte per boolean). In a columnar database we can store
all one million values in a single byte &mdash; a 1000000x improvement. <a href="https://eventql.io/blog/parallel-io-and-columnar-storage/#foot6">[6]</a></p>
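
<p>The actual encoding is left for a later post, but one common technique that
gives this kind of saving is run-length encoding, sketched below. This is an
assumption made for illustration and not necessarily the scheme EventQL uses;
the function names are hypothetical. Note that a real encoded run still needs a
few bytes for the value and the count, so the single byte above is best read as
an order-of-magnitude figure.</p>

```python
def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original list."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

is_customer = [False] * 1_000_000  # the column from the example above
encoded = rle_encode(is_customer)

print(encoded)
print(len(encoded), "run instead of", len(is_customer), "values")
```

<p>This only works because all values of one column sit next to each other. In a
row-oriented file the <code>is_customer</code> values are interleaved with the
other fields of each row, so there are no long runs to collapse.</p>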

<p><hr /></p>

<p>Lastly, it should be noted that columnar storage also has a big downside: It&#39;s
less efficient to perform updates on columnar files than on row-oriented files.
So you probably won&#39;t see the classical OLTP databases like MySQL
switching to columnar any time soon. Still, for analytical queries on large data
sets columnar is clearly the way to go.</p>

<h3>Are we there yet?</h3>

<p>Looks like we have finally put together a scheme which will allow us to execute a
SQL query on 100TB of data in less than a second, even though just reading the
data from disk would have taken many days.</p>

<p>Let&#39;s recapitulate our approach: We&#39;re going to split up the data into many small
pieces, then distribute the pieces among a dozen or so machines. On each machine,
we will cheat by storing the rows in columnar format and only actually reading a
small subset of the compressed data to answer the query.</p>

<p>Of course, we have only scratched the surface of the problem so far. In the
<a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/">next post</a> we
will discuss how exactly we&#39;re going to split up the data into pieces.</p>

<p>You can subscribe to email updates for upcoming posts or to the RSS feed in the
sidebar.</p>

<div class="footnotes">
  <p>
    <a id="foot0"></a>
    [0] Andrew Lamb et al. (2012) The Vertica Analytic Database: C-Store 7 Years Later
    (The 38th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf</a>
  </p>
  <p>
    <a id="foot1"></a>
    [1] Sergey Melnik et al. (2010) Dremel: Interactive Analysis of Web-Scale Datasets
    (The 36th International Conference on Very Large Data Bases) &mdash;
    <a target="_blank" href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf">http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf</a>
  </p>
  <p>
    <a id="foot2"></a>
    [2] EventQL (2016) An open-source SQL database for large-scale event analytics &mdash;
    <a target="_blank" href="http://eventql.io">http://eventql.io</a>
  </p>
  <p>
    <a id="foot3"></a>
    [3] Yes, you can put an index on the column. It turns out that indexes in
    traditional row-oriented databases are fundamentally columnar representations
    of the data. Another way to look at it would be that a columnar table behaves
    like a traditional table with an automatic index on all columns.
  </p>
  <p>
    <a id="foot4"></a>
    [4] Of course this is an oversimplification. However, even with buffering and
    speculative read-ahead you won't realistically be faster than reading the rows
    in full.
  </p>
  <p>
    <a id="foot5"></a>
    [5] "Vectorization" on Wikipedia &mdash;
    <a target="_blank" href="https://en.wikipedia.org/wiki/Vectorization">https://en.wikipedia.org/wiki/Vectorization</a>
  </p>
  <p>
    <a id="foot6"></a>
    [6] We'll reveal how later in this series. If you can't wait until then, check
    out this excellent paper on the topic [1].
  </p>
</div>

<h4>Next up in the series:</h4>

<ul>
<li><a href="https://eventql.io/blog/parallel-io-and-columnar-storage/">Part 1: Parallel I/O and Columnar Storage</a></li>
<li><a href="https://eventql.io/blog/dividing-infinity-distributed-partitioning-schemes/">Part 2: Dividing Infinity - Distributed Partitioning Schemes</a></li>
</ul>
]]></content:encoded>
          <author>paul@eventql.io (Paul Asmuth)</author>
          <pubDate>Mon, 15 Aug 2016 15:00:00 -0000</pubDate>
        </item>
      
        </channel>
      </rss>
    