[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[gnuastro-commits] master 509d899: Library (kdtree): corrected visualiza
From: |
Mohammad Akhlaghi |
Subject: |
[gnuastro-commits] master 509d899: Library (kdtree): corrected visualization in book, refactored library |
Date: |
Tue, 8 Dec 2020 19:59:01 -0500 (EST) |
branch: master
commit 509d89935a53dee1aa8392b75144eba3787947e3
Author: Sachin Kumar Singh <sachinkumarsingh092@gmail.com>
Commit: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Library (kdtree): corrected visualization in book, refactored library
The visualized output of the description k-d tree part of the book didn't
correspond to the actual output k-d tree from 'gal_kdtree_create': while
the input coordinates were the same as those in Wikipedia, the root node in
our run would be different from Wikipedia. So when a user would run our
code, and get a different k-d tree they would be confused.
With this commit, the output of 'gal_kdtree_create' is now used in the
visualization at the start of the k-d tree library part of the book and a
footnote has been added to explain the small difference with the k-d tree
from Wikipedia. The examples with the two k-d tree functions have also
become much more realistic (with the k-d tree root written as a FITS
keyword and later read).
Also, the partition making subroutine within the k-d tree creation function
is now separated from 'kdtree_median_find' and made into a seperate
function ('kdtree_make_partition'). Comments are improves a bit too.
---
doc/gnuastro.texi | 117 ++++++++++++++++++++++++--------
lib/fits.c | 2 +-
lib/kdtree.c | 195 +++++++++++++++++++++++++++++++-----------------------
3 files changed, 202 insertions(+), 112 deletions(-)
diff --git a/doc/gnuastro.texi b/doc/gnuastro.texi
index e7560e1..c27a9ab 100644
--- a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ -25488,18 +25488,21 @@ Below you can see the simple tree for 2D points from
Wikipedia.
The input point coordinates are represented as two input @code{gal_data_t}s
(@code{X} and @code{Y}, where @code{X->next=Y} and @code{Y->next=NULL}).
If you had three dimensional points, you could define an extra
@code{gal_data_t} such that @code{Y->next=Z} and @code{Z->next=NULL}.
The output is always a list of two @code{gal_data_t}s, where the first one
contains the index of the left sub-tree in the input, and the second one, the
index of the right subtree.
-The index of the root node (@code{2} in the case below) is also returned as a
single number.
+The index of the root node (@code{0} in the case below@footnote{This example
input table is the same as the example in Wikipedia (as of December 2020).
+However, on the Wikipedia output, the root node is (7,2), not (5,4).
+The difference is primarily because there are 6 rows and the median element of
an even number of elements can vary by integer calculation strategies.
+Here we use 0-based indexes for finding median and round to the smaller
integer.}) is also returned as a single number.
@example
-INDEX INPUT OUTPUT K-D Tree
-(as guide) X --> Y LEFT --> RIGHT (visualized)
----------- ------- -------------- ------------------
-0 5 4 1 4 (7,2)
-1 2 3 BLANK BLANK / \
-2 7 2 0 3 (5,4) \
-3 9 6 5 BLANK / \ (9,6)
-4 4 7 BLANK BLANK (2,3) (4,7) /
-5 8 1 BLANK BLANK (8,1)
+INDEX INPUT OUTPUT K-D Tree
+(as guide) X --> Y LEFT --> RIGHT (visualized)
+---------- ------- -------------- ------------------
+0 5 4 1 2 (5,4)
+1 2 3 BLANK 4 / \
+2 7 2 5 3 (2,3) \
+3 9 6 BLANK BLANK \ (7,2)
+4 4 7 BLANK BLANK (4,7) / \
+5 8 1 BLANK BLANK (8,1) (9,6)
@end example
This format is therefore scalable to any number of dimensions: the number of
dimensions are determined from the number of nodes in the input list of
@code{gal_data_t}s (for example, using @code{gal_list_data_number}).
@@ -25514,6 +25517,20 @@ The first dataset contains the indexes of left and
right nodes of the subtrees f
The index of the root node is written into the memory that @code{root} points
to.
@code{coords_raw} is the list of the input points (one @code{gal_data_t} per
dimension, see above).
+For example, assume you have the simple set of points below (from the
visualized example at the start of this section) in a plain-text file called
@file{coordinates.txt}:
+
+@example
+$ cat coordinates.txt
+5 4
+2 3
+7 2
+9 6
+4 7
+8 1
+@end example
+
+With the program below, you can calculate the kd-tree, and write it in a FITS
file (while keeping the root index as a FITS keyword inside of it).
+
@example
#include <stdio.h>
#include <gnuastro/table.h>
@@ -25522,10 +25539,16 @@ The index of the root node is written into the memory
that @code{root} points to
int
main (void)
@{
- size_t root;
gal_data_t *input, *kdtree;
- char kdtreefile[]="kd-tree.txt"; /* Inputs and outputs can */
- char inputfile[]="coordinates.txt"; /* also be FITS tables. */
+ char kdtreefile[]="kd-tree.fits";
+ char inputfile[]="coordinates.txt";
+
+ /* To write the root within the saved file. */
+ size_t root;
+ char *unit="index";
+ char *keyname="KDTROOT";
+ gal_fits_list_key_t *keylist=NULL;
+ char *comment="k-d tree root index (counting from 0).";
/* Read the input table. Note: this assumes the table only
* contains your input point coordinates (one column for each
@@ -25533,17 +25556,19 @@ main (void)
* for each point, you can specify which columns to read by
* name or number, see the documentation of 'gal_table_read'. */
input=gal_table_read(inputfile, "1", NULL, NULL,
- GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);
+ GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);
/* Construct a k-d tree. The index of root is stored in `root` */
kdtree=gal_kdtree_create(input, &root);
- /* Write output to a file. */
- gal_table_write(kdtree, NULL, NULL, GAL_TABLE_FORMAT_BFITS,
- kdtreefile, "kdtree", 0);
-
- /* Report k-d tree root. */
- printf("Root row of '%s': %zu\n", kdtreefile, root);
+ /* Write the k-d tree to a file and write root index and input
+ * name as FITS keywords ('gal_table_write' frees 'keylist').*/
+ gal_fits_key_list_title_add(&keylist, "k-d tree parameters", 0);
+ gal_fits_key_write_filename("KDTIN", inputfile, &keylist, 0);
+ gal_fits_key_list_add_end(&keylist, GAL_TYPE_SIZE_T, keyname, 0,
+ &root, 0, comment, 0, unit, 0);
+ gal_table_write(kdtree, &keylist, NULL, GAL_TABLE_FORMAT_BFITS,
+ kdtreefile, "kdtree", 0);
/* Clean up and return. */
gal_list_data_free(input);
@@ -25551,15 +25576,26 @@ main (void)
return EXIT_SUCCESS;
@}
@end example
+
+You can inspect the saved k-d tree FITS table with Gnuastro's @ref{Table}
(first command below), and you can see the keywords containing the root index
with @ref{Fits} (second command below):
+
+@example
+asttable kd-tree.fits
+astfits kd-tree.fits -h1
+@end example
+
@end deftypefun
@deftypefun size_t gal_kdtree_nearest_neighbour (gal_data_t
@code{*coords_raw}, gal_data_t @code{*kdtree}, size_t @code{root}, double
@code{*point}, double @code{*least_dist})
Returns the index of the nearest input point to the query point (@code{point},
assumed to be an array with same number of elements as @code{gal_data_t}s in
@code{coords_raw}).
The distance between the query point and its nearest neighbor is stored in the
space that @code{least_dist} points to.
-This search is efficient due to the constantly checking for the presence of
possible best points in other branches.
+This search is efficient due to the constant checking for the presence of
possible best points in other branches.
If it isn't possible for the other branch to have a better nearest neighbor,
that branch is not searched.
-For example the function below reads the input, and the k-d tree file created
in the example of @code{gal_kdtree_create} (so you won't have to re-create the
k-d tree every time) and finds the input point that is closest to a given query
point.
+As an example, let's use the k-d tree that was created in the example of
@code{gal_kdtree_create} (above) and find the nearest row to a given coordinate
(@code{point}).
+This will be a very common scenario, especially in large and multi-dimensional
datasets where the k-d tree creation can take long and you don't want to
re-create the k-d tree every time.
+In the @code{gal_kdtree_create} example output, we also wrote the k-d tree
root index as a FITS keyword (@code{KDTROOT}), so after loading the two table
data (input coordinates and k-d tree), we'll read the root from the FITS
keyword.
+This is a very simple example, but the scalability is clear: for example it is
trivial to parallelize (see @ref{Library demo - multi-threaded operation}).
@example
#include <stdio.h>
@@ -25569,31 +25605,54 @@ For example the function below reads the input, and
the k-d tree file created in
int
main (void)
@{
- double least_dist;
- size_t root=KDTREE_ROOT;
+ /* INPUT: desired point. */
+ double point[2]=@{8.9,5.9@};
+
+ /* Same as example in description of 'gal_kdtree_create'. */
gal_data_t *input, *kdtree;
- char kdtreefile[]="kd-tree.txt"; /* Inputs and outputs can */
- char inputfile[]="coordinates.txt"; /* also be FITS tables. */
+ char kdtreefile[]="kd-tree.fits";
+ char inputfile[]="coordinates.txt";
- double point[2]=@{9,8@}; /* assuming a 2-d tree. */
- size_t nearest_index;
+ /* Processing variables of this function. */
+ char kdtreehdu[]="1";
+ double *in_x, *in_y, least_dist;
+ size_t root, nkeys=1, nearest_index;
+ gal_data_t *rkey, *keysll=gal_data_array_calloc(nkeys);
/* Read the input coordinates, see comments in example of
* 'gal_kdtree_create' for more. */
input=gal_table_read(inputfile, "1", NULL, NULL,
GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);
- /* Read the k-d tree (created before). */
+ /* Read the k-d tree contents (created before). */
kdtree=gal_table_read(kdtreefile, "1", NULL, NULL,
GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);
+ /* Read the k-d tree root index from the header keyword.
+ * See example in description of 'gal_fits_key_read_from_ptr'.*/
+ keysll[0].name="KDTROOT";
+ keysll[0].type=GAL_TYPE_SIZE_T;
+ gal_fits_key_read(kdtreefile, kdtreehdu, keysll, 0, 0);
+ keysll[0].name=NULL; /* Since we didn't allocate it. */
+ rkey=gal_data_copy_to_new_type(&keysll[0], GAL_TYPE_SIZE_T);
+ root=((size_t *)(rkey->array))[0];
+
/* Find the nearest neighbour of the point. */
nearest_index=gal_kdtree_nearest_neighbour(input, kdtree, root,
point, &least_dist);
+ /* Print the results. */
+ in_x=input->array;
+ in_y=input->next->array;
+ printf("(%g, %g): nearest is (%g, %g), with a distance of %g\n",
+ point[0], point[1], in_x[nearest_index],
+ in_y[nearest_index], least_dist);
+
/* Clean up and return. */
+ gal_data_free(rkey);
gal_list_data_free(input);
gal_list_data_free(kdtree);
+ gal_data_array_free(keysll, nkeys, 1);
return EXIT_SUCCESS;
@}
@end example
diff --git a/lib/fits.c b/lib/fits.c
index 89daf98..023ecff 100644
--- a/lib/fits.c
+++ b/lib/fits.c
@@ -1376,7 +1376,7 @@ gal_fits_key_list_add_end(gal_fits_list_key_t **list,
uint8_t type,
}
else /* List is empty */
{
- newnode->next=*list;
+ newnode->next=NULL;
*list=newnode;
}
}
diff --git a/lib/kdtree.c b/lib/kdtree.c
index edeff89..9bdbe93 100644
--- a/lib/kdtree.c
+++ b/lib/kdtree.c
@@ -60,10 +60,12 @@ along with Gnuastro. If not, see
<http://www.gnu.org/licenses/>.
/* Main structure to keep kd-tree parameters. */
struct kdtree_params
{
- size_t ndim;
- size_t *input_row;
- gal_data_t **coords;
- uint32_t *left, *right;
+ size_t ndim; /* Number of dimentions in the nodes. */
+ size_t *input_row; /* The indexes of the input table. */
+ gal_data_t **coords; /* The input coordinates array. */
+ uint32_t *left, *right; /* The indexes of the left and right nodes. */
+
+ /* The values of the left and right columns. */
gal_data_t *left_col, *right_col;
};
@@ -83,7 +85,6 @@ kdtree_node_swap(struct kdtree_params *p, size_t node1,
size_t node2)
/* No need to swap same node. */
if(node1==node2) return;
- // printf("left = %u, right = %u\n", tmp_left, tmp_right);
p->left[node1]=p->left[node2];
p->right[node1]=p->right[node2];
p->input_row[node1]=p->input_row[node2];
@@ -97,11 +98,10 @@ kdtree_node_swap(struct kdtree_params *p, size_t node1,
size_t node2)
-/* Return the distance between 2 given nodes. This distance is equivalent
- to the radius of the hypersphere having `node` as the center.
+/* Return the distance between 2 given nodes. The distance is equivalent
+ to the radius of the hypersphere having node as its center.
- Return:
- Radial distace from given point to the node.
+ Return: Radial distace from given point to the node.
*/
static double
kdtree_distance_find(struct kdtree_params *p, size_t node,
@@ -145,7 +145,7 @@ kdtree_distance_find(struct kdtree_params *p, size_t node,
/****************************************************************
******** Preperations and Cleanup *******
****************************************************************/
-/* */
+/* Initialise the kdtree_params structure and do sanity checks. */
static void
kdtree_prepare(struct kdtree_params *p, gal_data_t *coords_raw)
{
@@ -167,7 +167,7 @@ kdtree_prepare(struct kdtree_params *p, gal_data_t
*coords_raw)
if(tmp->type == GAL_TYPE_FLOAT64)
p->coords[i]=tmp;
else
- p->coords[i]=gal_data_copy_to_new_type (tmp, GAL_TYPE_FLOAT64);
+ p->coords[i]=gal_data_copy_to_new_type(tmp, GAL_TYPE_FLOAT64);
/* Go to the next column list. */
tmp=tmp->next;
@@ -252,8 +252,8 @@ kdtree_cleanup(struct kdtree_params *p, gal_data_t
*coords_raw)
gal_data_t *tmp;
/* Clean up. */
- tmp=coords_raw;
- for(i=0; i<p->ndim; ++i)
+ tmp = coords_raw;
+ for(i = 0; i<p->ndim; ++i)
{
if(p->coords[i]!=tmp) gal_data_free(p->coords[i]);
tmp=tmp->next;
@@ -286,93 +286,111 @@ kdtree_cleanup(struct kdtree_params *p, gal_data_t
*coords_raw)
/****************************************************************
******** Create KD-Tree *******
****************************************************************/
-/* Find the median to seperate the hyperspace. Instead of randomly chossing
- the media-point, we use `quickselect alogorithm` to find median in
- average complexity of O(n). This also makes nodes partially sorted w.r.t
- each axis. See `https://en.wikipedia.org/wiki/Quickselect` for
- pseudocode and more information of the algorithm.
+/* Divide the array into two parts, values more than that of k'th node
+ and values less than k'th node.
+
+ Return: Index of the node whose value is greater than all
+ the nodes before it.
+*/
+static size_t
+kdtree_make_partition(struct kdtree_params *p, size_t node_left,
+ size_t node_right, size_t node_k,
+ double *coordinate)
+{
+ /* store_index is the index before which all values are smaller than
+ the value of k'th node. */
+ size_t i, store_index;
+ double k_node_value = coordinate[p->input_row[node_k]];
+
+ /* Move the k'th node to the right. */
+ kdtree_node_swap(p, node_k, node_right);
+
+ /* Move all nodes smaller than k'th node to its left and check
+ the number of elements smaller than the value present at the
+ k'th index. */
+ store_index = node_left;
+ for(i = node_left; i < node_right; ++i)
+ if(coordinate[p->input_row[i]] < k_node_value)
+ {
+ /* Move i'th node to the left side of the k'th index. */
+ kdtree_node_swap(p, store_index, i);
+
+ /* Prepare the place of next smaller node. */
+ store_index++;
+ }
+
+ /* Place k'th node after all the nodes that have lesser value
+ than it, as it was moved to the right initially. */
+ kdtree_node_swap(p, node_right, store_index);
+
+ /* Return the store_index. */
+ return store_index;
+}
+
+
+
- Return : median point(node) that splits the hyperspace. */
+
+/* Find the median node of the current axis. Instead of randomly
+ choosing the median node, we use `quickselect alogorithm` to
+ find median node in linear time between the left and right node.
+ This also makes the values in the current axis partially sorted.
+
+ See `https://en.wikipedia.org/wiki/Quickselect`
+ for pseudocode and more details of the algorithm.
+
+ Return: Median node between the given left and right nodes.
+*/
static size_t
kdtree_median_find(struct kdtree_params *p, size_t node_left,
size_t node_right, double *coordinate)
{
- double pivot_value;
- size_t i, store_i, node_pivot, node_k;
+ size_t node_pivot, node_median;
/* False state, this is a programming error. */
- if (node_right < node_left)
+ if(node_right < node_left)
error(EXIT_FAILURE, 0, "%s: a bug! Please contact us to fix "
"the problem! For some reason, the node_right (%zu) is "
"smaller than node_left (%zu)", __func__, node_right,
node_left);
/* If the two nodes are the same, just return the node. */
- if (node_right == node_left)
+ if(node_right == node_left)
error(EXIT_FAILURE, 0, "%s: a bug! Please contact us to fix "
"the problem! For some reason, the node_right (%zu) is "
"equal to node_left (%zu)", __func__, node_right, node_left);
- /* The middle value (here used as pivot) between node_left
- and node_right. */
- node_k=node_left+(node_right-node_left)/2;
+ /* The required median node between left and right node. */
+ node_median = node_left+(node_right-node_left)/2;
- /* Loop until the median (for the current axis) is returned (and in
- the mean-while, the underlying indexs are sorted). */
+ /* Loop until the median of the current axis is returned. */
while(1)
{
- /* In every run, 'node_left' and 'node_right' change, so the
- pivot index and pivot value change. */
- node_pivot = node_left+(node_right-node_left)/2;
- pivot_value = coordinate[p->input_row[node_pivot]];
-
- /* Move the pivot_value to the right/larger end. */
- kdtree_node_swap(p, node_pivot, node_right);
-
- /* Move all nodes smaller than pivot value to the left/smaller
- side of the nodes. */
- store_i=node_left;
- for (i = node_left; i < node_right; ++i)
- if (coordinate[p->input_row[i]] < pivot_value)
- {
- /* Move ith-node to the left/smaller side,
- and increment store_i */
- kdtree_node_swap(p, store_i, i);
-
- /* Prepare the place of next smaller node. */
- store_i++;
- }
-
- /* Move pivot, to be just after of all the nodes that are less
- than it (because pivot was moved to 'node_right'). */
- kdtree_node_swap(p, node_right, store_i);
-
- /* Set node_pivot to be the store_i. */
- node_pivot=store_i;
-
- /* If median is found, break the loop and return the node_k. */
- if (node_k == node_pivot) break;
-
- /* Change the left or right node based on the position of node_pivot
- so we can continue the search/sort in the loop. */
- if (node_k < node_pivot) node_right = node_pivot - 1;
- else node_left = node_pivot + 1;
+ /* Pivot node acts as a reference for the distance from the desired
+ (here median) node. */
+ node_pivot = kdtree_make_partition(p, node_left, node_right,
+ node_median, coordinate);
+ /* If median is found, break the loop and return median node. */
+ if(node_median == node_pivot) break;
+
+ /* Change the left or right node based on the position of
+ the pivot node with respect to the required median node. */
+ if(node_median < node_pivot) node_right = node_pivot - 1;
+ else node_left = node_pivot + 1;
}
-
- /* Return the pivot node. */
- return node_pivot;
+ /* Return the median node. */
+ return node_median;
}
-
/* Make a kd-tree from a given set of points. For tree construction, a
median point is selected for each axis and the left and right branches
- are made by comparing points based on that axis.
+ are recursively created by comparing points in that axis.
- Return : a balanced kd-tree.
+ Return : Indexes of the nodes in the kd-tree.
*/
static uint32_t
kdtree_fill_subtrees(struct kdtree_params *p, size_t node_left,
@@ -390,11 +408,10 @@ kdtree_fill_subtrees(struct kdtree_params *p, size_t
node_left,
if(node_left==node_right) return p->input_row[node_left];
/* Find the median node. */
- node_median=kdtree_median_find(p, node_left, node_right,
- p->coords[axis]->array);
+ node_median = kdtree_median_find(p, node_left, node_right,
+ p->coords[axis]->array);
/* node_median == 0 : We are in the lowest node (leaf) so no need
- to continue seachin recursively.
When we only have 2 nodes and the median is equal to the left,
its the end of the subtree.
*/
@@ -411,7 +428,8 @@ kdtree_fill_subtrees(struct kdtree_params *p, size_t
node_left,
But node right can never be equal to node median.
So we don't check for it.*/
p->right[node_median] = kdtree_fill_subtrees(p, node_median+1,
- node_right, depth+1);
+ node_right,
+ depth+1);
/* All subtrees have been parsed, return the node. */
return p->input_row[node_median];
@@ -443,8 +461,8 @@ gal_kdtree_create(gal_data_t *coords_raw, size_t *root)
/* Do a reverse permutation to sort the indexes
back in the input order. */
- gal_permutation_apply_inverse (p.left_col, p.input_row);
- gal_permutation_apply_inverse (p.right_col, p.input_row);
+ gal_permutation_apply_inverse(p.left_col, p.input_row);
+ gal_permutation_apply_inverse(p.right_col, p.input_row);
/* Free and clean up */
kdtree_cleanup(&p, coords_raw);
@@ -475,9 +493,13 @@ gal_kdtree_create(gal_data_t *coords_raw, size_t *root)
/****************************************************************
******** Nearest-Neighbour Search *******
****************************************************************/
-/* Find the nearest neighbour of the `point`. See
- `https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search` for
- more information. */
+/* This is a helper function which finds the nearest neighbour of
+ the given point in a kdtree. It calculates the least distance
+ from the point, and the index of that nearest node (out_nn).
+
+ See `https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search`
+ for more information.
+*/
static void
kdtree_nearest_neighbour(struct kdtree_params *p, uint32_t node_current,
double *point, double *least_dist,
@@ -505,7 +527,7 @@ kdtree_nearest_neighbour(struct kdtree_params *p, uint32_t
node_current,
*out_nn = node_current;
}
- /* If exact match found(least distance 0), return it. */
+ /* If exact match found (least distance 0), return it. */
if(*least_dist==0.0f) return;
/* Recursively search in subtrees. */
@@ -534,8 +556,12 @@ kdtree_nearest_neighbour(struct kdtree_params *p, uint32_t
node_current,
-/* High-level function used to find the nearest neighbour of from a given
- point. Returns the index of the nearest neighbour in the kd-tree. */
+/* High-level function used to find the nearest neighbour of a given
+ point in a kd-tree. It calculates the least distance of the point
+ from the nearest node and returns the index of that node.
+
+ Return: The index of the nearest neighbour node in the kd-tree.
+*/
size_t
gal_kdtree_nearest_neighbour(gal_data_t *coords_raw, gal_data_t *kdtree,
size_t root, double *point,
@@ -552,6 +578,11 @@ gal_kdtree_nearest_neighbour(gal_data_t *coords_raw,
gal_data_t *kdtree,
/* Use the low-level function to find th nearest neighbour. */
kdtree_nearest_neighbour(&p, root, point, least_dist, &out_nn, 0);
+ /* least_dist is the square of the distance between the nearest
+ neighbour and the point (used to improve processing).
+ Square root of that is the actual distance. */
+ *least_dist = sqrt(*least_dist);
+
/* For a check
printf("%s: root=%zu, out_nn=%zu, least_dis=%f\n",
__func__, root, out_nn, least_dist);
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- [gnuastro-commits] master 509d899: Library (kdtree): corrected visualization in book, refactored library,
Mohammad Akhlaghi <=