<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" 
               "http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd" [
  <!ENTITY mathml "http://www.w3.org/1998/Math/MathML">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<link rel="stylesheet" type="text/css" href="../../../comsci/css/s011.css"/>
<link rel='SHORTCUT ICON' href='../../../comsci/images/FrameHome.ico' />
<style type="text/css">
</style>
<title> Bucket Sort
</title>
</head>

<body>


<div class="message_right">
  Created 2005-09-16 &nbsp; Modified 
<!--UPDATE_DATE_MODIFIED-->
<!--UPDATE_DATE_BEGIN-->
2007-02-18
<br/>
<a href="mailto:chelton.evans@yahoo.com">Chelton Evans</a>
<!--UPDATE_DATE_END-->
</div>

<h1> 
<a href="geom.html"> <img alt="proj" src="../../../comsci/images/compgeom.png" /></a>
Bucket Sort
<a href="../../../../index.html">
<img alt="home" src="../../../comsci/images/Frame.gif" /> </a>
</h1>

<p>
<a href="#Intro">Intro</a><br/>
<a href="#Implementation">Implementation</a><br/>
<a href="#Bucket_Sort_with_Linked_Lists">
  Bucket Sort with Linked Lists</a><br/>
<a href="#Hybrid_Bucket_Sort">
  Hybrid Bucket Sort</a><br/>
<a href="#Hash_Tables">Hash Tables</a><br/>
<a href="#Integer_Sorting">
  Integer Sorting</a><br/>
<a href="../../../misc/proj/bucket/doc.html">C++ bucket</a>

</p>


<div class="float25">
<h2>Intro</h2>

<p>Bucket sort distributes the elements into buckets and sorts
 in expected O(n) time.</p>


<p>Bucket sort, also called bin sort, generalizes to radix sort.</p>

<p>However, bucket sort can also be generalized with linked
 lists so that it sorts continuous data, not just discrete keys,
 and still runs in expected O(n) time.  Radix sort, by contrast,
 is discrete.
</p>

<a id="Implementation"></a>
<h2> Implementation </h2>

<p>Assume the points are randomly (uniformly) distributed over an
 interval of known length.  Divide the interval into n containers
 for the n points.  The container a point lies in can then be
 calculated directly from its value, and this calculation takes
 constant time.
</p>

<p class="equ">
Let 
<math xmlns="&mathml;">
  <mi>v</mi>
  <msub>
    <mi></mi>
    <mrow><mi>i</mi></mrow>
  </msub>
</math>
 be a vector of n elements. <br/>

<math xmlns="&mathml;">
  <mi>x</mi>
  <msub>
    <mi></mi>
    <mrow><mi>0</mi></mrow>
  </msub>
</math>
 be the least and 
<math xmlns="&mathml;">
  <mi>x</mi>
  <msub>
    <mi></mi>
    <mrow><mi>1</mi></mrow>
  </msub>
</math>

 be the greatest element. <br/>

<math xmlns="&mathml;">
  <mi>f</mi>
  <mo>(</mo>
  <mi>x</mi>
  <mo>)</mo>
  <mo>=</mo>
  <mo>&lfloor;</mo>
  <mfrac>
    <mrow>
  <mi>x</mi>
  <mo>-</mo>
  <mi>x</mi>
  <msub>
    <mi></mi>
    <mrow><mi>0</mi></mrow>
  </msub>

    </mrow>
    <mrow>
  <mi>x</mi>
  <msub>
    <mi></mi>
    <mrow><mi>1</mi></mrow>
  </msub>
  <mo>-</mo>
  <mi>x</mi>
  <msub>
    <mi></mi>
    <mrow><mi>0</mi></mrow>
  </msub>
    </mrow>
  </mfrac>
  <mi>n</mi>
  <mo>&rfloor;</mo>
  
</math>

</p>


<p class="equ">
Let 

<math xmlns="&mathml;">
  <mi>c</mi>
  <msub>
    <mi></mi>
    <mrow><mi>i</mi></mrow>
  </msub>
</math>
 be an array of counters, one for each of the 
<math xmlns="&mathml;">
  <mi>n</mi>
</math>
 intervals.
<br/>



<math xmlns="&mathml;">
  <mi>for</mi>
  <mi>i</mi>
  <mo>=</mo>
  <mi>0</mi>
  <mo>..</mo>
  <mi>n</mi>
  <mo>-</mo>
  <mi>1</mi>
</math>
<br/>

 &nbsp; &nbsp; 
 &nbsp; &nbsp; 
  
<math xmlns="&mathml;">
  <mo>++</mo>
  <mi>c</mi>
  <msub>
    <mi></mi>
    <mrow>
      <mi>f</mi>
      <mo>(</mo>

  <mi>v</mi>
  <msub>
    <mi></mi>
    <mrow><mi>i</mi></mrow>
  </msub>

      <mo>)</mo>
    </mrow>
  </msub>
</math>



</p>



<p class="equ">
Let 

<math xmlns="&mathml;">
  <mi>cp</mi>
  <msub>
    <mi></mi>
    <mrow><mi>i</mi></mrow>
  </msub>
</math>
 be the start position of the i-th interval in the output. <br/> 

<math xmlns="&mathml;">
  <mi>s</mi>
  <mo>=</mo>
  <mi>0</mi>
</math>

 &nbsp; &nbsp; 

<math xmlns="&mathml;">
  <mi>cp</mi>
  <msub>
    <mi></mi>
    <mrow><mi>0</mi></mrow>
  </msub>
  <mo>=</mo>
  <mi>0</mi>
</math>



<br/>

<math xmlns="&mathml;">
  <mi>for</mi>
  <mi>i</mi>
  <mo>=</mo>
  <mn>1</mn>
  <mo>..</mo>
  <mi>n</mi>
  <mo>-</mo>
  <mi>1</mi>
</math>
<br/>

 &nbsp; &nbsp; 
 &nbsp; &nbsp; 
  
<math xmlns="&mathml;">
  <mi>s</mi>
  <mo>+=</mo>
  <mi>c</mi>
  <msub>
    <mi></mi>
    <mrow><mi>i</mi><mo>-</mo><mi>1</mi></mrow>
  </msub>
</math>

<br/>

 &nbsp; &nbsp; 
 &nbsp; &nbsp; 
  
<math xmlns="&mathml;">
  <mi>cp</mi>
  <msub>
    <mi></mi>
    <mrow><mi>i</mi></mrow>
  </msub>
  <mo>=</mo>
  <mi>s</mi>
</math>

</p>

<p>Now iterate through v, placing each element at the position that
 the start counter cp of its bucket points to.  After an element is
 placed, that counter is incremented.  In this way all the elements
 end up in their correct buckets.
</p>
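<p>The counting, start position and placement steps above can be
 sketched in C++ roughly as follows.  This is a minimal illustration
 and not the cell.h code.  The names bucket_index and bucket_sort are
 hypothetical.  Two details are assumptions: the index is clamped to
 n-1, because the formula alone gives n for the greatest element, and
 a final insertion sort pass orders the few elements that share a
 bucket.
</p>
<pre><![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// f(x) = floor((x - x0) / (x1 - x0) * n), clamped to n - 1 (assumption)
// so that the greatest element x1 still lands in the last bucket.
std::size_t bucket_index(double x, double x0, double x1, std::size_t n) {
    assert(n > 0);
    if (x1 <= x0) return 0;  // all elements equal: everything is in bucket 0
    std::size_t b = static_cast<std::size_t>(
        (x - x0) / (x1 - x0) * static_cast<double>(n));
    return std::min(b, n - 1);
}

std::vector<double> bucket_sort(const std::vector<double>& v) {
    const std::size_t n = v.size();
    if (n < 2) return v;
    auto mm = std::minmax_element(v.begin(), v.end());
    const double x0 = *mm.first;   // least element
    const double x1 = *mm.second;  // greatest element

    // c[i]: number of elements that fall in bucket i.
    std::vector<std::size_t> c(n, 0);
    for (double x : v) ++c[bucket_index(x, x0, x1, n)];

    // cp[i]: start position of bucket i in the output (cp[0] = 0).
    std::vector<std::size_t> cp(n, 0);
    for (std::size_t i = 1; i < n; ++i) cp[i] = cp[i - 1] + c[i - 1];

    // Place each element at its bucket's current position, then advance it.
    std::vector<double> out(n);
    for (double x : v) {
        const std::size_t b = bucket_index(x, x0, x1, n);
        out[cp[b]++] = x;
    }

    // Assumption: finish with an insertion sort. Elements are already in the
    // right bucket, so each one moves only a short distance on average.
    for (std::size_t i = 1; i < n; ++i) {
        const double key = out[i];
        std::size_t j = i;
        while (j > 0 && out[j - 1] > key) {
            out[j] = out[j - 1];
            --j;
        }
        out[j] = key;
    }
    return out;
}
```
]]></pre>
<p>The sketch copies into a second array for clarity.  A cache
 conscious version would probably reuse preallocated storage instead.
</p>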

<p> I implemented this in a crude fashion; see 
  <a href="../../../misc/proj/nbody/doc.xml">C++ nbody</a>, in
 the cell.h file. It was designed to sit in cache,
 so I avoided the STL.
</p>

</div>

<div class="float25">
<a id="Bucket_Sort_with_Linked_Lists"></a>
<h2>Bucket Sort with Linked Lists</h2>

<p>
This is like a hash table.  In fact it is a hash table with
 the bucket index as the key.  All you do is insert each element in
 order into its bucket's list, which is O(k) for a bucket holding k
 elements. 
</p>

<p>But here is where it gets good. If k is bounded by a constant that
 does not depend on n, then each insertion takes constant time.  In
 practice the maximum bucket length is small.  In this situation you
 can expect O(n) total insertion time into the buckets.
</p>

<p>For those of you who have not heard of hashed arrays, they are the
 fastest data structure by far because they have constant expected
 access and retrieval time, which is magic for algorithms.  
 Their weakness is that the data is unordered.
</p>

<p>To show just how good they are: I had a school assignment with a
 dictionary that took around 25 minutes of execution time with
 balanced trees.  The same run took 30 seconds with a hashed data
 structure and otherwise the same code.  The moral of the story is
 this - it does not matter how badly you wrote your code if it
 executes in constant time.
</p> 

<p>Back to the topic: the bucket sort with linked lists is easy to implement
 and can sort continuous data in expected O(n) time.
</p>
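<p>A minimal C++ sketch of the idea, assuming one std::forward_list
 per bucket and the same clamped index function as in the
 Implementation section.  The function name list_bucket_sort is
 hypothetical and this is not the code from the linked project.
</p>
<pre><![CDATA[
```cpp
#include <algorithm>
#include <cstddef>
#include <forward_list>
#include <vector>

std::vector<double> list_bucket_sort(const std::vector<double>& v) {
    const std::size_t n = v.size();
    if (n < 2) return v;
    auto mm = std::minmax_element(v.begin(), v.end());
    const double x0 = *mm.first;   // least element
    const double x1 = *mm.second;  // greatest element
    if (x0 == x1) return v;        // all elements equal: already sorted

    // One singly linked list per bucket, n buckets for n elements.
    std::vector<std::forward_list<double>> buckets(n);
    for (double x : v) {
        // Index of the bucket, clamped so that x1 maps to the last bucket.
        const std::size_t b = std::min(
            static_cast<std::size_t>((x - x0) / (x1 - x0) * static_cast<double>(n)),
            n - 1);
        // Ordered insertion: walk the list until the next value is larger.
        // This is the O(k) step, where k is the length of the bucket.
        std::forward_list<double>& list = buckets[b];
        auto prev = list.before_begin();
        for (auto it = list.begin(); it != list.end() && *it <= x; ++it) prev = it;
        list.insert_after(prev, x);
    }

    // The buckets are ordered and each list is ordered, so reading them
    // in sequence gives the sorted result.
    std::vector<double> out;
    out.reserve(n);
    for (const auto& list : buckets)
        for (double x : list) out.push_back(x);
    return out;
}
```
]]></pre>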

<p>For unfriendly data you could add more buckets and change the index
 function. </p>

<p>I implemented this and experimentally verified the O(n) relationship.
  See <a href="../../../misc/proj/bucket/doc.html">C++ bucket</a>.
</p>

</div>

<div class="spacer" />

<div class="float25">
<a id="Hybrid_Bucket_Sort"></a>
<h2>Hybrid Bucket Sort</h2>

<p>The problem with the bucket sort is that if the data
 is unfriendly it becomes O(n^2) instead of O(n).
 Now in practice increasing the number of buckets is really
 good for most applications, and you can change the hash 
 function yourself.  But to the theoretical mathematician and
 computer scientist there is a problem - an algorithm which 
 can become O(n^2) is unacceptable.
</p>

<p>There are situations where O(n^2) can be made acceptable by
 reordering the data in a random way.  Unfortunately this does nothing
 for the bucket sort, as the sorter's complexity is independent of 
 the order of the data it sorts.  
</p>

<p>However, it is possible to combine a bucket sort with another sorter
 which uses a balanced tree for O(nlogn).  So when the data starts
 behaving in an unfriendly way, use the second, default sorter.
</p>

<p>The worst case was detected by having a maximum number of links
 in a bucket.  When this was exceeded the whole bucket was inserted
 into the default sorter. As this is itself a linear operation the
 sorting complexity is not changed, but is the sum of a linear insert
 and the complexity of inserting into the default sorter.
</p>

<p>This strategy gives the expected complexity of the sorter as
 O(n) and the worst case complexity as O(nlogn).
</p>

<p>Here is the maths in the worst case. 
Let k be the maximum number of links.
 Inserting into a bucket is O(k), but since k is
 much smaller than n and independent of n, O(k)=O(1).<br/>
 Iterating through the n data elements and inserting each into either the bucket list
 or the default sorter gives, expressed as an equation of complexities, <br/>
 O(n)(O(k)+O(logn))<br/>
=O(n)(O(1)+O(logn))<br/>
=O(nlogn)
</p>
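<p>The hybrid strategy might be sketched in C++ as below.  This is an
 illustration under assumptions, not the code of the linked project:
 std::multiset stands in for the balanced tree sorter, the names
 hybrid_bucket_sort and max_links are hypothetical, a bucket that has
 overflowed sends all later elements straight to the tree, and the two
 sorted sequences are combined with a linear merge at the end.
</p>
<pre><![CDATA[
```cpp
#include <algorithm>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <set>
#include <vector>

std::vector<double> hybrid_bucket_sort(const std::vector<double>& v,
                                       std::size_t max_links) {
    const std::size_t n = v.size();
    if (n < 2) return v;
    auto mm = std::minmax_element(v.begin(), v.end());
    const double x0 = *mm.first;
    const double x1 = *mm.second;
    if (x0 == x1) return v;  // all elements equal: already sorted

    struct Bucket {
        std::forward_list<double> list;  // kept in ascending order
        std::size_t links = 0;           // current length of the list
        bool overflowed = false;         // true once moved to the tree
    };
    std::vector<Bucket> buckets(n);
    std::multiset<double> fallback;  // default O(log n) per insert sorter

    for (double x : v) {
        const std::size_t idx = std::min(
            static_cast<std::size_t>((x - x0) / (x1 - x0) * static_cast<double>(n)),
            n - 1);
        Bucket& b = buckets[idx];
        if (b.overflowed) {            // unfriendly bucket: use the tree
            fallback.insert(x);
            continue;
        }
        if (b.links == max_links) {    // limit reached: move the whole bucket
            fallback.insert(b.list.begin(), b.list.end());
            fallback.insert(x);
            b.list.clear();
            b.links = 0;
            b.overflowed = true;
            continue;
        }
        // Ordered insertion into a short list, at most max_links steps.
        auto prev = b.list.before_begin();
        for (auto it = b.list.begin(); it != b.list.end() && *it <= x; ++it) prev = it;
        b.list.insert_after(prev, x);
        ++b.links;
    }

    // Concatenating the buckets gives one sorted run, the tree gives another.
    std::vector<double> from_buckets;
    from_buckets.reserve(n);
    for (const Bucket& b : buckets)
        for (double x : b.list) from_buckets.push_back(x);

    std::vector<double> out;
    out.reserve(n);
    std::merge(from_buckets.begin(), from_buckets.end(),
               fallback.begin(), fallback.end(), std::back_inserter(out));
    return out;
}
```
]]></pre>
<p>With clustered input, for example many values near 0 and a single
 value of 1000, almost everything hashes to the first bucket and ends
 up in the tree, which is the O(nlogn) worst case described above.
</p>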

<p>
See <a href="../../../misc/proj/bucket/doc.html">C++ bucket</a>.
</p>


</div>

<div class="float25">
<a id="Hash_Tables"></a>
<h2>Hash Tables</h2>

<p>These generally use buckets too as this solves the collision
 problem and memory is cheap.
</p>

<p>
Inserting and accessing n items should take O(n) time in total. On my
 old P3 I observed O(n) insertion and access for up to around
 2,000,000 integers.  Here I made the number of buckets in the hash
 table equal to the number of items that the table will hold.
</p>

<p>They can also be used as maps.  For example, take a string and
 a counter as the data type, with the hash on the string, and
 insert string and counter pairs into the hash table.
 To update a counter, look up the entry using the same string with an
 empty counter, then alter the stored counter through the retrieved pointer.
</p>
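<p>As a small illustration of the map usage, here is a sketch that
 uses std::unordered_map from the C++ standard library as the hash
 table.  The function name count_words is hypothetical.  The insert
 call offers the string with an empty counter and returns an iterator
 to the stored pair, whether it was new or already present, and the
 counter is then altered through that iterator.
</p>
<pre><![CDATA[
```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::unordered_map<std::string, int> count_words(
        const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> counts;
    // Ask for enough buckets to hold words.size() entries without rehashing.
    counts.reserve(words.size());
    for (const std::string& w : words) {
        // Offer (w, 0). If w is already stored, nothing is inserted and
        // result.first points at the existing entry.
        auto result = counts.insert(std::make_pair(w, 0));
        ++result.first->second;  // alter the counter in place
    }
    return counts;
}
```
]]></pre>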

<p>Hash tables depend on hash functions.  If you have some idea
 of your application's data distribution you could build a 
 hash function to suit it.  This is more critical for the 
 bucket sort, which hashes and sorts too.  Generally people have
 really good hash functions built from bit operators. There are heaps
 on the net.
</p>

</div>

<div class="float25">
<a id="Integer_Sorting"></a>
<h2>Integer Sorting</h2>

<p>I am concerned with sorting unique integer sequences.</p>

<p>Here is an algorithm for sorting integers in a range.</p>

<p>
Suppose that I wish to sort k integers using an array of length n.
 Further suppose that each integer i satisfies 0 &lt; i &lt; n.
 The array element at index i is marked 0 if the integer i is not
 present, else 1.
</p>

<p>Sorting then consists of iterating over the k integers and assigning
 a 1 to the corresponding array element.  Reading the array in index
 order then gives the integers in sorted order.
</p>
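<p>A minimal sketch of this marking scheme in C++.  The name
 sort_unique_in_range is hypothetical, and the sketch also accepts the
 value 0, which the description above excludes.
</p>
<pre><![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts unique integers that all lie in [0, n) by marking an array of length n.
std::vector<std::size_t> sort_unique_in_range(
        const std::vector<std::size_t>& keys, std::size_t n) {
    std::vector<char> present(n, 0);  // 0: not present, 1: present
    for (std::size_t key : keys) {
        assert(key < n);              // precondition: key is inside the range
        present[key] = 1;
    }
    // Scanning the marks in index order yields the keys in ascending order.
    std::vector<std::size_t> out;
    out.reserve(keys.size());
    for (std::size_t i = 0; i < n; ++i)
        if (present[i]) out.push_back(i);
    return out;
}
```
]]></pre>
<p>The marking loop is O(k) but the final scan is O(n), which is why
 the size of k relative to n matters.
</p>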

<p>While this sort is linear, O(n), for small k we have
 O(klog(k)) &lt; O(n).
 So a general O(klog(k)) sorter could be more effective in these
 situations.
</p>

<p>Alternatively use the bucket sort with linked lists and insert
 linearly in the correct order. Since the linked lists are small 
 insertion is constant in time.
</p>

<p>If the distribution is not random 
 then make it random. For example, 
 I have a data structure with a lot of consecutive integers
 which represent points.  These are just some of the points
 in a global points container. By relabeling the points by means
 of a random shuffle of all the points in the global container,
 the data structure now holds a random distribution of integers
 for its points, and this is suitable for the bucket sort with
 linked lists.
</p>
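<p>The relabeling could be sketched as follows, assuming the points
 are identified by indices 0..count-1 into the global container.  The
 names random_relabeling and relabel are hypothetical, and a fixed
 seed is used only to make the example repeatable.
</p>
<pre><![CDATA[
```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Builds a random permutation of 0..count-1; point i gets new label label[i].
std::vector<std::size_t> random_relabeling(std::size_t count, unsigned seed) {
    std::vector<std::size_t> label(count);
    std::iota(label.begin(), label.end(), std::size_t{0});  // 0, 1, 2, ...
    std::mt19937 gen(seed);
    std::shuffle(label.begin(), label.end(), gen);
    return label;
}

// Applies the relabeling to the ids held by one data structure.
std::vector<std::size_t> relabel(const std::vector<std::size_t>& ids,
                                 const std::vector<std::size_t>& label) {
    std::vector<std::size_t> out;
    out.reserve(ids.size());
    for (std::size_t id : ids) out.push_back(label.at(id));  // at() checks the range
    return out;
}
```
]]></pre>
<p>A run of consecutive ids such as 10, 11, 12 is mapped to labels
 scattered over the whole range, so they no longer crowd into
 neighbouring buckets.
</p>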



</div>





</body>
</html>


