Wiki Lens/Structured Data

Motivation

A Wiki is a great way to collaboratively edit unstructured data. However, many things in the world are described by structured data. For example, a book may have a title, author, and publisher; a restaurant may have cuisine, cost, address, and phone number. Let's call these things fields. Fields could be kept on an unstructured Wiki page, for example as a list in which each field name is separated from its value by a colon.

However, there are several advantages to offering explicit support for fields shared by all pages in a particular category:

  1. Consistency of fields across pages in a category. This might include:

    1. Upon page creation, the right fields are pre-inserted
    2. Upon page editing, fields are adjusted (missing ones added, superfluous ones ignored or marked)
    3. Upon page display, fields are adjusted (missing ones added, superfluous ones ignored or marked)
    4. Audit tools to find pages with missing or superfluous fields
  2. Easier data entry of fields. This might include:

    1. Upon page creation, the right fields are pre-inserted
    2. Appropriate user interface widgets for each field. A text field can have a text box, a single-valued constrained field could have a drop-down, and a multi-valued constrained field could have a series of checkboxes.
    3. Validation, rejecting field values that are inappropriate for the field.
  3. Better search.

    1. Searching only given fields. For example, searching only the "average entree price" field of a restaurant. Without this feature, searching all page text for "10" (an example average price) might return copious useless results.
    2. Displaying fields in search results. For example, a book search could show title and author in the search results, so the person browsing can decide whether to click on the page for more detail.
  4. Better user interface for displaying fields. This might include:

    1. Consistent display from page to page in the category.
    2. Colors or formatting applied to field names and values.

Examples

Some web sites describe things with structure. See for example:

And some that are Wiki-like (with editing and recent changes) that have structured data:

Possible solutions

Fields stored in-page invisibly

Take the structured data and put it in the page text. Before editing, take it out. After saving, put it back in. (JohnRiedl: Are you sure you really want to take it out and put it back in? Might be nice just to have the fields there in the text ... with the caveat that arbitrary editing can lead to bad results. I suppose you'd have to have drop-downs, etc., for the fields that support that. Hmmm.)

This is nice because we may use the current versioning mechanism for page history, reverting hacked pages, emailing page-change notifications, etc. etc.

It is not nice because searching page fields requires loading the page text, which is inefficient.

Fields stored in-page as semantic relations

Structured data is entered with the popular relation::link and attribute:=value syntax. PhpWiki and MediaWiki know the syntax, and manage the enriched links automatically.

This is nice because
  • it will be more easily accepted upstream, because it doesn't require Smarty,
  • it is easily searched with FullTextSearch and the special PhpWiki:SemanticSearch (currently WikiLens fields are not even full-text searched),
  • it offers solutions to more advanced problems, like SubCategory, subclassing, defining list relationships, and so on,
  • and it doesn't require a fixed structure as with current Categories.
It is not nice because
  • the editing problem is not solved,
  • and it does not enforce strictly structured data.

For the editing problem one can extend the edit action to provide widgets for the fields, or add some clever JavaScript (moacdropdown within the textarea) which understands the new syntax and autofills after := and ::. --ReiniUrban2
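As a rough illustration of the syntax, the following Python sketch pulls relation::link and attribute:=value pairs out of page text with a regular expression. The pattern and the name extract_semantic_pairs are assumptions made for this example only; this is not the parser used by PhpWiki or MediaWiki, which handle much richer link and value syntax.

```python
import re

# Simplified assumption: a name is a run of word characters and a value is a
# run of non-space characters. Real wiki parsers accept far more than this.
PAIR_RE = re.compile(r"(?P<name>\w+)(?P<op>::|:=)(?P<value>[^\s\]]+)")


def extract_semantic_pairs(text):
    """Return (relations, attributes) found in the page text."""
    relations, attributes = [], []
    for match in PAIR_RE.finditer(text):
        pair = (match.group("name"), match.group("value"))
        if match.group("op") == "::":
            relations.append(pair)   # relation::link
        else:
            attributes.append(pair)  # attribute:=value
    return relations, attributes


page = "Located in neighborhood::Uptown with cost:=$$ entrees."
relations, attributes = extract_semantic_pairs(page)
print(relations)
print(attributes)
```

A field-aware search could then match against the extracted pairs rather than against the whole page text.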

Fields stored in specifically modeled database schema

For a book category with title and author fields, have a database table called "book", with columns "title" and "author".

This is nice because it is the direct and obvious mapping to a database.

It is not nice because the data is not obviously versionable or editable. Also, the database tables have to change as the fields change. This is awkward.

Fields stored in a generically modeled database schema

Have a database table called "field" with columns like "pageid", "pageversion", "fieldname", "fieldvalue".

For performance, one might have "int_field", "float_field", "text_field" tables, so the "fieldvalue" could be a specific type that is easily indexed.

This is nice because the database tables never have to change, and it can be versioned.

It is not nice because code everywhere in PhpWiki has to change to include this data in pages. For example, displaying, editing, diffing, and notifying of changes all have to be changed. This is expensive.
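To make the generic schema concrete, here is a small sketch using Python's built-in sqlite3 module. The column names come from the description above; the column types, the primary key, the helper fields_for, and the sample rows are assumptions for illustration, not a proposed PhpWiki schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE field (
        pageid      INTEGER NOT NULL,
        pageversion INTEGER NOT NULL,
        fieldname   TEXT    NOT NULL,
        fieldvalue  TEXT,
        PRIMARY KEY (pageid, pageversion, fieldname)
    )
    """
)

# Made-up sample data: one page with two versions.
rows = [
    (1, 1, "title", "A Sample Book"),
    (1, 1, "author", "A. Author"),
    (1, 2, "title", "A Sample Book, 2nd ed."),  # version 2 changes the title
    (1, 2, "author", "A. Author"),
]
conn.executemany("INSERT INTO field VALUES (?, ?, ?, ?)", rows)


def fields_for(pageid, pageversion):
    """Return the fields of one page version as a dict."""
    cursor = conn.execute(
        "SELECT fieldname, fieldvalue FROM field "
        "WHERE pageid = ? AND pageversion = ?",
        (pageid, pageversion),
    )
    return dict(cursor.fetchall())


print(fields_for(1, 1))
print(fields_for(1, 2))
```

Adding a new field to a category needs no schema change here, only new rows, which is the property this option is chosen for.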

Other functional goals

  1. Allow end-user editing of which fields are in a category
  2. Allow light-weight page creation. That is, the CreatePage plugin can display the fields of a page, and allow you to fill them in and click 'create' without visiting the page editor.

Other Technical goals

  1. Design an API for page fields whose callers do not know whether the structured data is stored in-page or by some other means. The advantage is that the underlying storage could be changed or migrated while the calling code remains the same.
  2. Nice to have: allow for future plug-in of new field types for editing and display, possibly searching later.
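The first technical goal could look roughly like the sketch below. FieldStore, InMemoryFieldStore, and describe are hypothetical names invented for this example; the point is only that calling code depends on the interface and never on where the fields are stored.

```python
from abc import ABC, abstractmethod


class FieldStore(ABC):
    """Hypothetical interface: callers never see where fields live."""

    @abstractmethod
    def get_fields(self, page_name):
        """Return a dict of field name to value for the page."""

    @abstractmethod
    def set_fields(self, page_name, fields):
        """Replace the stored fields of the page."""


class InMemoryFieldStore(FieldStore):
    """Stand-in backend. An in-page or database backend would
    implement the same two methods."""

    def __init__(self):
        self._data = {}

    def get_fields(self, page_name):
        return dict(self._data.get(page_name, {}))

    def set_fields(self, page_name, fields):
        self._data[page_name] = dict(fields)


def describe(store, page_name):
    """Calling code that depends only on the FieldStore interface."""
    fields = store.get_fields(page_name)
    return ", ".join(f"{name}={value}" for name, value in sorted(fields.items()))


store = InMemoryFieldStore()
store.set_fields("SomeRestaurant", {"cost": "$$", "neighborhood": "Uptown"})
print(describe(store, "SomeRestaurant"))
```

Migrating storage would then mean writing another FieldStore subclass and copying data across, with describe and similar callers left untouched.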

Recommended solution

We choose "Fields stored in-page invisibly", the first option above.

Here is more detail on how this would work.

Finding page category

For category-specific page fields, a page must know which category it is in (if any).

First, have an API that gives the primary category of a page if it has one, or null.

The primary category C of a page A is the first link on A to a category page C.

A category page is one that has the CategoryPage plugin on it.
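A minimal sketch of the primary-category lookup, assuming the caller already has the page's links in document order and a set of known category pages (how that set is obtained is left open here). The function and parameter names are made up for the example.

```python
def primary_category(page_links, category_pages):
    """Return the first linked page that is a category page, or None.

    page_links: link targets in the order they appear on the page.
    category_pages: names of pages known to carry the CategoryPage plugin.
    """
    for target in page_links:
        if target in category_pages:
            return target
    return None


categories = {"CategoryRestaurant", "CategoryBook"}
links = ["HomePage", "CategoryRestaurant", "CategoryBook"]
print(primary_category(links, categories))
print(primary_category(["HomePage"], categories))
```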

Alternatively, one can imagine a "category" checkbox while editing a page. This might be more useful, but where to store this information?

One can also imagine caching somewhere the list of category pages, generated by (perhaps) adding or removing a page P to/from the list at "Save" time for P.

A page can actually be in several categories (simply refer to multiple category pages). However, for this first pass, we will not take fields from multiple categories, only the primary one.

TonyLam: "primary"? First-listed? Largest?

One can also imagine that there could be an inheritance mechanism among categories. That is, Movie has title, director, and synopsis, while AmericanMovie inherits those fields from Movie, but adds Motion Picture Association of America (MPAA) rating (PG, R, etc.). However, for this first pass, we will forego any notion of inheritance.

Specifying category fields

We must have some way to describe what fields a given category has.

Here are the attributes of a field:

  • name. A name suitable for reference in code, so probably no spaces, no special characters.
  • displayName. A name suitable for display, so special characters allowed.
  • storagetype. number, shortText, longText. This is for determining how to represent the values at the storage level.
  • display. ? Some sort of function...
  • validation. A function that checks if a given value is okay for this field.
  • description. What is this field intended to be?
  • isRequired. True if this field must be filled in. This may be used for display (bold, red or * the thing), or validation (can't save the page without filling this in).
  • position. This is a number indicating where a field goes on the page.
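In code, these attributes might be carried by a small record type such as the one below. This is only a sketch: the Python-style attribute names, the defaults, and the 80-character limit in the sample validation function are assumptions, and the display attribute is left as an optional function because its shape is still open.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FieldSpec:
    name: str                      # code-friendly identifier
    display_name: str              # label shown to users
    storage_type: str              # "number", "shortText" or "longText"
    description: str = ""
    is_required: bool = False
    position: int = 0              # ordering on the page
    display: Optional[Callable[[str], str]] = None
    validation: Optional[Callable[[str], bool]] = None


hours = FieldSpec(
    name="hours",
    display_name="Hours",
    storage_type="shortText",
    description="Hours of operation",
    position=3,
    validation=lambda value: len(value) <= 80,  # assumed limit for short text
)
print(hours.display_name, hours.is_required)
```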

Storing field specification

The field specification should be put somewhere that can be edited by Wiki users, to satisfy our functional goal of end-user editing of category fields. I propose storing it in a "fields" subpage of the category page. So for a category "Book", the fields, if there are any, are specified on a "Book/fields" page.

The storage format should be human-readable, easy to parse, and be able to represent fairly arbitrary fields (in other words, no problems with special characters). This might suggest XML, but it seems so un-Wikilike.

One possibility is a list of name-value pairs as follows.

! Field specification

* field
   * name: neighborhood
   * displayName: Neighborhood
   * type: singleValueConstrained
         * Uptown
         * Northeast
         * Downtown
   * description: The neighborhood
   * isRequired: no
   * position: 1

* field
   * name: cost
   * displayName: Cost
   * type: singleValueConstrained
         * $
         * $$
         * $$$
         * $$$$
   * description: Average entree cost
   * isRequired: yes
   * position: 2

* field
   * name: hours
   * displayName: Hours
   * type: shortText
   * description: Hours of operation
   * isRequired: no
   * position: 3

! End field specification

Sadly, this is not purely name-value pairs. It has some structure, hence is complicated to parse and understand.

Another possibility, more verbose but more uniform, is a "nested lists" format as follows.

! Field specification

* field
   * name
      * neighborhood
   * displayName
      * Neighborhood
   * type
      * singleValueConstrained
         * Uptown
         * Northeast
         * Downtown
   * description
      * The neighborhood
   * isRequired
      * no
   * position
      * 1

* field
   * name
      * cost
   * displayName
      * Cost
   * type
      * singleValueConstrained
         * $
         * $$
         * $$$
         * $$$$
   * description
      * Average entree cost
   * isRequired
      * yes
   * position
      * 2

* field
   * name
      * hours
   * displayName
      * Hours
   * type
      * shortText
   * description
      * Hours of operation
   * isRequired
      * no
   * position
      * 3

! End field specification

The fields area is started with a heading "! Field specification" and ended by "! End field specification".

Nesting indication is rigid: 3 spaces + "* " for every level.
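The rigid nesting rule makes the "nested lists" format straightforward to parse. The sketch below is one way to do it, under the assumption that every item is exactly 3 spaces per level followed by "* "; the function names and the choice to collect constrained values under a "values" key are inventions of this example.

```python
import re

START, END = "! Field specification", "! End field specification"
ITEM_RE = re.compile(r"^((?:   )*)\* (.*)$")


def parse_nested_list(text):
    """Parse the block between START and END into (label, children) nodes."""
    root = ("root", [])
    stack = [root]  # stack[d] is the parent for items at depth d
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == START:
            inside = True
            continue
        if stripped == END:
            break
        if not inside or not stripped:
            continue
        match = ITEM_RE.match(line.rstrip())
        if match is None:
            raise ValueError(f"bad indentation: {line!r}")
        depth = len(match.group(1)) // 3
        if depth >= len(stack):
            raise ValueError(f"item nested too deeply: {line!r}")
        node = (match.group(2), [])
        stack[depth][1].append(node)
        del stack[depth + 1:]
        stack.append(node)
    return root[1]


def to_field_dicts(nodes):
    """Turn 'field' nodes into dicts; constrained values go under 'values'."""
    fields = []
    for label, attrs in nodes:
        if label != "field":
            continue
        field = {}
        for attr_name, attr_children in attrs:
            if not attr_children:
                continue
            value_label, value_children = attr_children[0]
            field[attr_name] = value_label
            if value_children:
                field["values"] = [child for child, _ in value_children]
        fields.append(field)
    return fields


SAMPLE = """\
! Field specification

* field
   * name
      * cost
   * type
      * singleValueConstrained
         * $
         * $$
   * isRequired
      * yes

! End field specification
"""
print(to_field_dicts(parse_nested_list(SAMPLE)))
```

Because indentation is checked strictly, a hand-edited spec with a stray space fails loudly instead of being silently misread.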

TonyLam: The constrained values should probably also be name-value pairs to allow for a level of indirection. Otherwise, it gets kind of annoying if someone wants to change the wording of one of the allowed values (e.g. from "Downtown" to "Downtown, Minneapolis"). Also, we might consider a "short form" and "long form" for the constrained values -- maybe "$" really wants to appear as "$: 5-10 dollars" when a user is adding a restaurant.

Storing field values in a page

Once the fields have been specified on the category/fields page, their values for a particular page must be stored in that page. I propose a page preamble: if a page has fields, they are at the top, in the first section, started with "! Fields" and ended with "! End fields".

I propose that whatever format is used to store the field specification also be used to store the field values.

EXAMPLE SHOULD GO HERE
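One possible shape for such a preamble, sketched in Python. Since the storage format is still undecided, this assumes the simple "* name: value" list form with single-valued fields only; join_page and split_page are hypothetical helpers that put the preamble in when saving and take it out before editing.

```python
START, END = "! Fields", "! End fields"


def join_page(fields, body):
    """Put the fields preamble in front of the unstructured body."""
    if not fields:
        return body
    lines = [START, ""]
    lines += [f"* {name}: {value}" for name, value in fields.items()]
    lines += ["", END, ""]
    return "\n".join(lines) + "\n" + body


def split_page(text):
    """Return (fields, body); a page without a preamble has no fields."""
    lines = text.split("\n")
    stripped = [line.strip() for line in lines]
    if stripped[0] != START or END not in stripped:
        return {}, text
    end = stripped.index(END)
    fields = {}
    for line in stripped[1:end]:
        if line.startswith("* ") and ": " in line:
            name, value = line[2:].split(": ", 1)
            fields[name] = value
    body = "\n".join(lines[end + 1:])
    return fields, body.lstrip("\n")


page = join_page({"cost": "$$", "hours": "11am-10pm"}, "Great pasta.\nCash only.")
print(page)
print(split_page(page))
```

Because the preamble is ordinary page text, the existing versioning, diffing, and change notification apply to it without modification.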

Displaying page fields

  1. Find the "category" of a page.
  2. If it has one, read the field specification (possibly cache).
  3. Read the fields from the page, if they exist
  4. Show them at the top of the page, field by field, using a per-type display function. If the field doesn't exist in the page, display a clear blank or unselected state.
  5. Show the 'unstructured' part of the page (the way a page is currently displayed, with no structured data).

One can imagine wanting to control fairly precisely what the visual display looks like. One can imagine category-level look-and-feel, e.g., HTML with "field" macros that allows rich specification of what the page should look like. One could also imagine that perhaps fields could be referred to in the unstructured part of the page, for fun. For this first pass, we skip both of these things and stick to a fairly simple display.

Editing page fields

  1. Find the "category" of a page.
  2. If it has one, read the field specification (possibly cache).
  3. Read the fields from the page, if they exist
  4. Show them at the top of the page, field by field, using a per-type display-edit-widget function. If the field doesn't exist in the page, display a clear blank or unselected state.
  5. Show the 'unstructured' part of the page (the way a page is currently edited, with no structured data).

This is VERY similar to display, but displaying a different thing (edit widgets instead of display widgets). There is probably a nice way to use the same "display fields" code, but give it a different set of widgets ("edit" widgets instead of "display" widgets).
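The idea of one rendering routine with interchangeable widget sets might look like the sketch below. The field specs are plain dicts using the attribute names from the examples above; the widget functions, the two lookup tables, and the HTML they emit are assumptions made for illustration.

```python
from html import escape


def display_text(spec, value):
    """Read-only widget: label plus escaped value (blank if missing)."""
    return f"<b>{escape(spec['displayName'])}:</b> {escape(value)}"


def edit_text(spec, value):
    """Edit widget for free text."""
    name, label = escape(spec["name"]), escape(spec["displayName"])
    return f"<label>{label} <input name='{name}' value='{escape(value)}'></label>"


def edit_dropdown(spec, value):
    """Edit widget for a single-valued constrained field."""
    name, label = escape(spec["name"]), escape(spec["displayName"])
    options = "".join(
        f"<option{' selected' if choice == value else ''}>{escape(choice)}</option>"
        for choice in spec["values"]
    )
    return f"<label>{label} <select name='{name}'>{options}</select></label>"


# Same field types, two widget sets.
DISPLAY_WIDGETS = {"shortText": display_text, "singleValueConstrained": display_text}
EDIT_WIDGETS = {"shortText": edit_text, "singleValueConstrained": edit_dropdown}


def render_fields(specs, values, widgets):
    """Render every field in position order with the given widget set."""
    rendered = []
    for spec in sorted(specs, key=lambda s: s["position"]):
        widget = widgets[spec["type"]]
        rendered.append(widget(spec, values.get(spec["name"], "")))
    return "\n".join(rendered)


specs = [
    {"name": "hours", "displayName": "Hours", "type": "shortText", "position": 3},
    {"name": "cost", "displayName": "Cost", "type": "singleValueConstrained",
     "values": ["$", "$$", "$$$"], "position": 2},
]
values = {"cost": "$$"}  # "hours" is missing and renders blank
print(render_fields(specs, values, DISPLAY_WIDGETS))
print(render_fields(specs, values, EDIT_WIDGETS))
```

Supporting several candidate widgets per type would mean keying the tables on something finer than the type alone, for example a per-field widget hint.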

TonyLam: Some field types might have several possibly-appropriate widgets -- dropdown vs radio buttons, select vs checkboxes, single-widget vs multiple-widget, anything vs Javascript/DHTML-driven-interface. Also, widget layout and parameters are an issue. One size does not fit all. I ran into a lot of these issues a long while back when building the survey portion of the ML2 experimental framework, which was basically a very stripped-down version of surveymonkey.

There are further actions, though, for editing.

  1. Support preview.
  2. Validate during the "Save" and return to the edit window if fields are invalid (including wrong data for the type, or a required field missing).
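Save-time validation could be sketched as follows. Field specs are plain dicts here, and the rules shown (required check, membership in the constrained values, numeric check for a "number" type) are assumptions about what "invalid" would mean; validate_fields is a hypothetical helper.

```python
def validate_fields(specs, values):
    """Return error messages; an empty list means the page may be saved."""
    errors = []
    for spec in specs:
        value = values.get(spec["name"], "").strip()
        if not value:
            if spec.get("isRequired"):
                errors.append(f"{spec['displayName']} is required")
            continue
        allowed = spec.get("values")
        if allowed is not None and value not in allowed:
            errors.append(
                f"{spec['displayName']} must be one of: {', '.join(allowed)}"
            )
        elif spec["type"] == "number":
            try:
                float(value)
            except ValueError:
                errors.append(f"{spec['displayName']} must be a number")
    return errors


specs = [
    {"name": "cost", "displayName": "Cost", "type": "singleValueConstrained",
     "values": ["$", "$$", "$$$", "$$$$"], "isRequired": True},
    {"name": "seats", "displayName": "Seats", "type": "number"},
]
print(validate_fields(specs, {}))
print(validate_fields(specs, {"cost": "cheap", "seats": "many"}))
print(validate_fields(specs, {"cost": "$$", "seats": "40"}))
```

A non-empty result would send the user back to the edit window with the messages shown next to the offending widgets.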

TonyLam: If the framework assumes fields are one-to-one with user inputs here, one possible issue is fields whose value is a function of several user inputs. One example is a date field where we would like users to have separate input widgets for month, day, and year. Another is a field where the user is given some pre-defined choices but can also enter his or her own value (which might be useful for the food type field in the restaurant recommender). This can be worked around with silly JS tricks à la ML3 advanced search, but then fields might need to know about that too.

Editing field specifications

It would be nice if we could use the field editor to also edit field specifications. Not sure if this is possible. One can imagine a "category" category that has fields of its own like "name", "displayName", etc., which would let us use the field editing module on field specs.

However, fields and field specs are similar, but not the same. In particular, fields are all name-value pairs (though for multi-valued constrained fields a "value" has multiple parts), while the field specification has more structure. Also, in the spec case one must be able to add new entries, which is not needed in the field case.

If not, we can edit specs using the existing unstructured editor.

TonyLam: We, yes. Anyone else, no?

Decided issues

  1. What happens to the fields if a page category changes or is deleted? All fields get cleared. One can always visit old versions of the page to get the fields.
  2. What happens if you visit an old version of the page which refers to fields no longer on the category page? These fields are displayed as generic text fields.
  3. What happens if required fields are added to a category page? One is not allowed to edit the new page until the required field is filled in.
  4. What happens if category fields are changed so that all pages in that category have invalid fields? The field is now not displayed, and the next time a particular page is edited, the field is deleted from the page data. If the field was changed (as opposed to deleted), one must re-enter the value.
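One reading of decisions 3 and 4 is sketched below: stored values that no longer fit the category's specification are set aside for removal at the next edit, and required fields without a usable value are reported so that saving can be blocked. The function reconcile, the dict-based specs, and the treatment of constrained values are all assumptions of this example.

```python
def reconcile(specs, stored):
    """Split stored values into kept and dropped, and list the
    required fields that still lack a usable value."""
    by_name = {spec["name"]: spec for spec in specs}
    kept, dropped = {}, {}
    for name, value in stored.items():
        spec = by_name.get(name)
        allowed = spec.get("values") if spec else None
        if spec is None or (allowed is not None and value not in allowed):
            dropped[name] = value  # not displayed; removed at the next edit
        else:
            kept[name] = value
    missing = [
        name for name, spec in by_name.items()
        if spec.get("isRequired") and name not in kept
    ]
    return kept, dropped, missing


specs = [
    {"name": "cost", "values": ["$", "$$", "$$$", "$$$$"], "isRequired": True},
    {"name": "hours"},
]
stored = {"cost": "cheap", "hours": "9-5", "parking": "yes"}
print(reconcile(specs, stored))
```

The same function could back an audit tool that lists pages with missing or superfluous fields.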

Undecided issues

  1. Is declaring a page a category by having a CategoryPage plugin on it somewhere too awkward? In particular, I imagine performance being pretty bad if the code has to scan through all linked pages looking for the CategoryPage plugin all the time.
  2. The field specification storage format is undecided.
  3. The notion of field type is slightly confused. Is it "basic types only" like int? Is a more category-specific field like "zipcode" a type? What about "ASIN" (Amazon id) that could be used for external searching? Is it about display or validation, or both?
  4. Should a field specification have isIdentifier? Identifiers seem very special. They must be unique. For internal ones, they are assigned by the system. Often there is an external one as well, though. For example, imdbId for a movie.
  5. Can we use the field editor to edit field specs?