Avati, Amar,
Amar, Anand S, and I had a discussion on this comment, and here is an answer to your queries as I see it. Let me know if I am missing something here.
(FYI, this is not an NFS Ganesha requirement, as Ganesha will only do a single lookup, and preserve only a single object handle, per filesystem object in its cache.)
Currently a glfs_object is an opaque pointer to an object (it is a _handle_ to the object). The object itself contains a ref'd inode, which is the actual pointer to the object.
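To make this concrete, here is a minimal sketch of what such a handle might wrap, based purely on the description above (the actual struct in the patch may carry more fields):

    /* Sketch only: the real struct glfs_object may differ.
     * inode_t comes from the libglusterfs headers. */
    struct glfs_object {
            inode_t *inode;   /* ref'd inode: the actual pointer to the object */
    };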
1) Similarities and differences between object handles and fds
The intention behind multiple object handles is in line with multiple fds per file: an application using the library is free to look up (and/or create, or the equivalents thereof) and acquire as many object handles as it wants for a particular object, and can hence determine the lifetime of each such handle in its own view. So, in essence, one thread can hold an object handle to perform, say, attribute-related operations, while another thread has the same object looked up to perform I/O.
Where object handles depart from the notion of fds is when an unlink is performed. POSIX defines that open fds remain _open_ for activities on the file, so the life of an fd, and of the actual object it points to, lasts until the fd is closed. In the case of object handles, though, the moment any handle is used to unlink the object (which, BTW, is done using the parent object handle and the name of the child), all handles pointing to the object remain valid pointers, but operations on them will return ENOENT, as the actual object has since been unlinked and removed by the underlying filesystem.
This departure from fds is valid in my view: the handle points to an object that has since been removed, and there are no semantics that require the object to be preserved for further operations merely because a handle still references it.
So, in essence, every object handle returned by the API has to be closed for its life to end; additionally, if the object it points to is removed from the underlying filesystem, the handle points to an entry that no longer exists, and operations using it return ENOENT.
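Here is a short sketch of that lifecycle, using the glfs_h_* names proposed in this patch series (the exact signatures are assumed from the RFC and may differ):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include "glfs.h"
    #include "glfs-handles.h"

    void handle_lifecycle(struct glfs *fs, struct glfs_object *parent)
    {
            struct stat st;

            /* Two independent lookups yield two handles to the same object. */
            struct glfs_object *h1 = glfs_h_lookupat(fs, parent, "a.txt", &st);
            struct glfs_object *h2 = glfs_h_lookupat(fs, parent, "a.txt", &st);

            /* Unlink is done via the parent handle plus the child's name. */
            glfs_h_unlink(fs, parent, "a.txt");

            /* Both handles are still valid pointers, but operations on them
             * now fail with ENOENT, unlike an open fd, which POSIX keeps
             * usable until close(). */
            if (glfs_h_stat(fs, h1, &st) == -1 && errno == ENOENT)
                    printf("h1: object is gone\n");

            /* Each handle handed out must still be closed to end its life. */
            glfs_h_close(h1);
            glfs_h_close(h2);
    }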
2) The issue/benefit of returning the same object handle across multiple lookups
If we have a 1:1 relationship of object handles (i.e. struct glfs_object) to inodes, then the caller gets the same pointer on every lookup of the same object. Handing out multiple handles, from the caller's point of view, then boils down to handing out ref-counted glfs_object(s) for the same inode.
Memory footprint aside, this still does not make the object live past its unlink time. The pointer handed out remains valid until the last reference is dropped (i.e. the last object handle is closed), at which point the glfs_object can be destroyed.
So again, however many handles were handed out for the same inode, each of them has to be closed, and so on.
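Purely to illustrate, a hypothetical sketch of that ref counting (none of these helper names exist in the patch):

    #include <stdlib.h>
    /* inode_t, inode_unref come from the libglusterfs headers. */

    /* Hypothetical 1:1 variant: one glfs_object per inode, ref counted. */
    struct glfs_object {
            inode_t *inode;
            int      refcount;   /* a real implementation would guard this with a lock */
    };

    /* Every lookup of the same inode returns the same pointer, ref'd. */
    static struct glfs_object *object_get(struct glfs_object *obj)
    {
            obj->refcount++;
            return obj;
    }

    /* glfs_h_close() would drop one ref; the object is destroyed only when
     * the last handle handed out has been closed. */
    static void object_put(struct glfs_object *obj)
    {
            if (--obj->refcount == 0) {
                    inode_unref(obj->inode);
                    free(obj);
            }
    }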
3) Graph switches
In the case of graph switches, handles that are used in operations after the switch get refreshed with an inode from the new graph, assuming we have an N:1 object-to-inode relationship.
In the 1:1 case this refresh is done only once, but is there some multi-thread safety that needs to be in place? I think this is already covered by the glfs_resolve_inode implementation, as suggested earlier, but it is worth checking.
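A rough sketch of that refresh-on-use idea (illustrative only: lookup_by_gfid is a hypothetical helper, and the lock stands in for whatever serialization glfs_resolve_inode actually uses; assumes the libglusterfs headers for inode_t/xlator_t):

    #include <pthread.h>

    /* Hypothetical helper: find an object by GFID in the new graph. */
    extern inode_t *lookup_by_gfid(xlator_t *subvol, uuid_t gfid);

    static pthread_mutex_t refresh_lock = PTHREAD_MUTEX_INITIALIZER;

    inode_t *resolve_object_inode(xlator_t *active_subvol,
                                  struct glfs_object *object)
    {
            inode_t *inode = NULL;

            pthread_mutex_lock(&refresh_lock);
            if (object->inode->table->xl == active_subvol) {
                    /* The handle already belongs to the active graph. */
                    inode = inode_ref(object->inode);
            } else {
                    /* A graph switch happened since the handle was created:
                     * find the same object (by GFID) in the new graph and
                     * swap the ref'd inode held by the handle. */
                    inode = lookup_by_gfid(active_subvol, object->inode->gfid);
                    if (inode) {
                            inode_unref(object->inode);
                            object->inode = inode_ref(inode);
                    }
            }
            pthread_mutex_unlock(&refresh_lock);

            return inode;
    }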
4) Renames
In the case of renames, the inode remains the same; hence all handed-out object handles remain valid and will operate on the right object.
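For example (again using the proposed glfs_h_* names; glfs_h_rename's signature is assumed):

    /* Sketch: a rename leaves held handles valid, since the inode (and
     * its GFID) is unchanged. */
    void rename_example(struct glfs *fs, struct glfs_object *dir)
    {
            struct stat st;
            struct glfs_object *h = glfs_h_lookupat(fs, dir, "a.txt", &st);

            glfs_h_rename(fs, dir, "a.txt", dir, "b.txt");

            /* The handle still resolves to the same inode, so this succeeds. */
            glfs_h_stat(fs, h, &st);
            glfs_h_close(h);
    }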
5) Unlinks and recreation of the same _named_ object in the background
For example, an application gets a handle for an object, say one named "a.txt", and in the background (or via another application/client) this file is deleted and recreated.
Operations on the held handle will return ENOENT, as the GFID of the previously held object differs from that of the newly created one, even though the names are the same. This seems like the right behaviour, and it does not change between a 1:1 and an N:1 object-handle-to-inode mapping.
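A sketch of that scenario (same assumed API as above):

    #include <errno.h>
    #include <stdio.h>

    void recreate_example(struct glfs *fs, struct glfs_object *parent)
    {
            struct stat st;
            struct glfs_object *h = glfs_h_lookupat(fs, parent, "a.txt", &st);

            /* Meanwhile another client deletes and recreates "a.txt"; the
             * new file has the same name but a fresh GFID. */

            /* The held handle still refers to the old GFID, so operations
             * on it fail with ENOENT rather than silently hitting the new
             * file, under both the 1:1 and the N:1 mapping. */
            if (glfs_h_stat(fs, h, &st) == -1 && errno == ENOENT)
                    fprintf(stderr, "stale handle: a.txt was recreated\n");

            glfs_h_close(h);
    }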
So, bottom line, I see object handles as behaving like fds, with the difference noted above. Having them in a 1:1 relationship or in an N:1 relationship does not seem to be an issue from what I understand. What am I missing here?
Shyam
From: "Anand Avati" <address@hidden>
To: "Shyamsundar Ranganathan" <address@hidden>
Cc: "Gluster Devel" <address@hidden>
Sent: Monday, September 30, 2013 10:35:05 AM
Subject: Re: RFC/Review: libgfapi object handle based extensions
I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t?
Looks like for every lookup() we are creating a new glfs_object, even if the inode was looked up before (and is in the cache) and had a glfs_object created for it in the recent past.
We need a stronger, clearer relationship between the two. It is probably necessary for a glfs_object to represent multiple inode_t's at different points in time, depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call.
Avati