Format Field Refactoring

From GCD
Jump to navigation Jump to search

Background

In Nov, 2012 the GCD-Policy list agreed by consensus to split the current series level Format field into it's currently defined five fields.

Format currently is a semi-colon delimited format specified as "[color]; [dimensions]; [paper stock]; [binding]; [publishing format]" however since this is a free text field which has been manually maintained the refactoring from a single text field into five individual fields has a number of challenges, which include:

  1. Variable number of fields: For many series there are more or less that five fields, some have a single value, others have as many as eleven. In some cases other delimiters were used (eg: commas or spaces) and in other cases semi-colons were used to further sub-divide a field (eg: "newsstand: magazine (#1-10: 27.5 cm high; 20.7 cm wide; #11-25: 26.7 cm high; 20.4 cm wide" is a single dimension value but is four semi-colon delimited fields)
  2. Missing fields: not all series have all five fields defined
  3. Ordering of fields: the first fields is not always color etc
  4. Typos: many, many typos in fields
  5. Data Inconsistency: Many ways of entering the same data (eg: US, U.S., us or saddle-stitch, saddle-stitched, saddlestitched)
  6. Extraneous data: Data is contained in the Format field which doesn't belong in any of the five defined fields

The tech team is moving ahead with implementing the new separate data fields while in parallel the policy list is determine how these new fields should be defined.

There are a number of approaches to populating the new fields:

  1. Do nothing: make the old format field read-only and individuals will populate the new fields manually over time. The challenge is that due to the large number of series and limited number of indexers this becomes a special project and could take years to complete.
  2. Blindly auto-assign: a simple script could parse the data using the semi-colon delimiter and auto-populate the five new fields. The challenge is that due to the data inconsistency problems identified above there are going to be a lot of bad values
  3. Intelligently auto-assign: create a more complex script to parse the data and populate the new fields. This is what I'm recommending.

Design

My design goal is fairly straight forward, to take the existing data in the format field, make some simple corrections and then populate the new fields. Since the full definition of the new fields is still being defined I'm not attempting to fully define the new fields but to populate them with their existing values. Doing this now before the new fields are finalized serves to establish a bit of a baseline that should make it easier to move to the newer definition when defined but also it provides data on what occupies these existing fields today to inform the disussion. Eventually I'd expect that the ultimate definition of these new fields will be a narrowly defined set of pre-determined values with an escape value to allow some special cases. Another migration script could be written to eventually move to that new format.

The final version of this script should be similar to those in /apps/gcd/migration/ however the initial approach is a brute force perl script working off a current MySQL data dump to determine the necessary text transformations and categorizations needed. If we are willing to modify the mySQL data structures directly then the resulting data could be manually imported as a one time action, however I understand this isn't the recommended process.

The design is fairly straightforward and brute force:

  • traverse the series table
    • extract the format field data
    • separate the format field data into individual fields based on the semi-colon delimiter
      • TODO: additionally determine if the resulting fields need to be further sub-divided based on a secondary delimiter
    • for each resulting field
      • correct any typos in the field
      • normalize any special values as defined
      • categorize the fields as one of the new five fields
        • If there is extraneous data mark that to be appended to the notes field
        • I'm still evaluating if there is any data which should be discarded
      • if there are multiple values for a single field, then concatenate them into a single field with a comma delimiter
    • output the resulting new series table values:
      • five new field values, some of which may still be empty
      • updated notes field as necessary
      • currently this is just a text file dump
  • while parsing the data record some additional metrics:
    • total number of series processed
    • total number of unique fields found
    • how the data is categorized, eg: how many different color, paper, dimensions, binding, pubformat values are there
    • how many series have a value for each of the five fields

Considerations for new Field rules

Just a few thoughts on items that need to be considered when defining the new field descriptions

  • General
    • Language - GCD needs a general i18n policy as most text fields have multiple language data values which is a general site/data challenge. However this also applies to English language variations eg: color vs colour.
    • handling series variants or changes in format. lots of notes in the data about specific issues only
  • Color
    • delimiters for multiple colors
    • how to encode "black and white" and "black and white with one additional color" (eg: black, white, red vs b&w with red etc)
    • what is a color comic? - two color vs four color vs full color vs standard color etc
  • Paper
    • using the word "paper" in the field eg: "baxter" vs "baxter paper"
    • cardboard vs cardstock
  • Dimensions
    • use of units (inches vs cm vs std page sizes (A4))
    • separate fields for "length vs width" vs single text field and all the variations on "x" or "by" and how to describe the value (length, width, len, high, tall etc)
  • Binding
  • Pubformat

Rules

Typos

  • simple replacement rules to fix typos
    • there are some transformations here, eg: dust cover to dustjacket that I'll clean up

"bi-montly" --> "bi-monthly"
"book shelf" --> "bookshelf"
"(carboard|carbook|cardtock|cardtsock|carstock|carstock)" --> "cardboard"
"(clotbound|clothsbound|colthsbound)" --> "clothbound"
"collecton" --> "collection"
"coloes" --> "color "
"colot" --> "color"
"(diimensions|dmensions|dimentions)" --> "dimensions"
"delux " --> "deluxe "
"digestformaat" --> "digest format"
"(duckjacket|dust-jacket|dustjacket|dustjackets|dust cover)" --> "dust jacket"
"exterioir" --> "exterior"
"fomat" --> "format"
"(interiior |interio |interor )" --> "interior "
"(glosy|gossy)" --> "glossy"
"graphiv" --> "graphic"
"isue " --> "issue "
"jucket" --> "jacket"
"limtied " --> "limited "
"limuted" --> "limited"
"(magazineformaat|magazinformat)" --> "magazine format"
" matt " --> " matte"
"monochroom" --> "monochrome"
"monthy " --> "monthly "
"newpaper" --> "newspaper"
"newsprin " --> "newsprint "
"(newprint|news print|newspirnt|newspront)" --> "newsprint"
"newstand" --> "newsstand"
"(papoerback|paper back|papreback|paperpack|paprback)" --> "paperback"
"occassional " --> "occasional "
"(on-shot|one- shot|one--shot|one=shot|one shot.*)" --> "one-shot"
"pocketbok" --> "pocketbook"
"(quartery|quaterly)" --> "quarterly"
"atandard" --> "standard"
"standar " --> "standard "
"twleve" --> "twelve"
"vairant" --> "variant"
"varioius" --> "various"

Transformations

Most of these transformation rules are fairly basic and fix a few typos along the way

  • use the " (0x22) for all representations of inch

"\s*inches" --> "\""
"\s*inch" --> "\""
"\s*imches" --> "\""
" inces" --> "\""
"(\'\'|\x92+|\x94)" --> """

  • use "cm" for all centimeter values

"centimetres" --> "cm"
" cms." --> "cm"

  • standardize on "U.S."

"U\s*S\s*" --> "U.S."

  • standardize on "ongoing"

"on-going" --> "ongoing"
"^ngoing" --> "on-going"
"(on going|on-goin)" --> "on-going"

  • standardize on "color" - which goes against my Canadian education but color is more prevalent in the data
    • I'm going to remove this rule so that both colour and color are handled in my next update

"(colour|clour)" --> "color"

  • the rest are fairly self-explanatory. All of the variations on standard or saddle-stitched where amazing!

"miniseries" --> "mini-series"

"approx\.* " --> "approximately "
"aproximately " --> "approximately"

"baxter color comic" --> "baxter paper color comic"
"baxter paper comic\." --> "baxter paper color comic"
"baxter paperstock" --> "baxter paper"
"baxter stock paper" --> "baxter paper"

"(back and white|blac \& white|black \& whte|black and whiter|black and whtie|black\s*\&\s*white|black-and-white|black-white|blcak \& white|balck \& white)" --> "black and white"
" lack and white" --> " black and white"
"black and whit " --> "black and white "
"black \& wite with color covers" --> "black and white with color covers"
"black an white interior pages" --> "black and white interior pages"
"black\,red" --> "black\, red"
"blac\, white\, and red" --> "black\, white and red"

"greytones" --> "grey tones"

"oneshot" --> "one-shot"

"card\sstock" --> "card\-stock"

"(collect edition|collected edition\.|collected edtion)" --> "collected edition"

"colorcover" --> "color cover"

"(full-color |fulll color )" --> "full color "
"full colo glossyr" --> "full color glossy"

"cover \& interior" --> "cover and interior"

"squarebond" --> "squarebound"

"slipcase" --> "slip case"

"(harccover|harcover|hardover)" --> "hardcover"
"hardcover,b" --> "hardcover, b"
"hardcover,d" --> "hardcover, d"

"paperstock" --> "paper-stock"

"(min-series|miniseries|miniseries\.)" --> "mini-series"

"mangaformaat" --> "manga format"

"maxiseries" --> "maxi-series"

"(minicomics|minicomic)" --> "mini-comic"

"(prestigeformat|pestige format|prestiege format)" --> "prestige format"

"(pefectbound|perfect binding)" --> "perfect bound"

"(printed on one side of page only|printed on onse side of each page only.)" --> "printed on one side of each page only)"

"(published erratically|published infrequently|published irregularily)" --> "published irregularly"

"(sofftcover|soft-cover|soft covers|softover)" --> "soft cover"

"(sqaure bound|squarebond|squareboound|squarebounb|square bound|squarebound|squareboun\,)" --> "square-bound"
"squareboun " --> "square-bound"

"(sandard|stanard|standaard|standaards|standared|standart|standerd|standrad|stanrad|stasndard|stndard|syandard)" --> "standard"
" standar " --> " standard "
"standardblack" --> "standard black"
"standardsize" --> "standard size"
"staped" --> "stapled"

"(moden|modenr|moden|modersn|modrn|monder|mordern)" --> "modern"
"moder " --> "modern "
"modern ius" --> "modern U.S."

"(usa size|usa suze|ussize)" --> "U.S. size"

"(trade paprback|trade paprback|tradepaper back)" --> "tradepaperback"

"(daddle-stiched|saddel-stitched|sadde-stitched|saddle bound|saddle stapled|saddle stiched|saddle stitched|saddle -stitched|saddle\" --> "stitched|saddlebound|saddle-bound|saddle-sitched|saddles-stitched|saddle-sticthed|saddle-stiiched|saddle-stiotched|saddlestitch binding|saddlestitched|saddle--stitched|saddle-stitchedl|saddle-stitches|saddle-stitchted|saddle-stitiched|saddle-stotched|saddle-sttched|saddle-sttiched|saddlie-stitched|saddl-stitched|sadle stitched|sadle-stitched|sadlle-stitched|sddle-stitched|staddle-stitched)" --> "saddle-stitched"
"(saddle stitch |saddle-stitch )" --> "saddle-stitched "
"^saddle-stitch$" --> "saddle-stitched"
"^saddle stitch$" --> "saddle-stitched"
"^saddle-stiched$" --> "saddle-stitched"

"web\s*comic" --> "web-comic"

Categorization

These rules are used to determine which of the resulting fields should be populated. The regex values are very brute force and will be cleaned up and optimized down the road. Whether I created a new rule or an OR condition in an existing rule for a word match was very arbitrary and really was just dependent on when I found the case. Most of the rules are just looking for code word matches and occasionally check for location in the field.

Color

/^color|duo(\-|\s)(color|tone|chrome)/
/^black|^blanc|^blue|blanco|^bianco|black and white|color |full color|^red(,|\s)*|^yellow/
/^3-d|^b(\/|\&)(n|w)/
/(sort\/hvit|^zwart)/

Dimensions

/(size |tall |width |height|tabloid|^dimensio|^regular comic|^regular modern|graphic novel)/
/standard (silver|u\.*s\.*|golden|modern|bronze|comic)/
/(high$|wide$|tall$)/
/(^(a|b)(4|5)|^album)/
/din a/
/^cm\.*\s*\d+/
/(standard modern|taschenbuch|paperback|^magazine|regular size|silver age|regular size|dvd)/

  • A bunch of rules to try to check for numbers and units

/cm\.*$/
/(cm|mm|in.+|\")\s*(x|by|\xD7)/
/(wide|high|h)\s*(x|by|\xD7)/
/cm (long|wide)/
/\d+x\d+mm$/
/^\(*\d+[\.\,\']*\d*\s*(x|by|\xD7)/
/^\(*\d+\s*\d+\/*\d*\s*(x|by|\xD7)/
/\d+\s*\d*(\.|\/)*\d*(\"|\')*\s*\x/
/\d+(\s|\-)\d\/\d/


Paper

/(newsprint|bond |^bond|glossy |baxter |^printed |^slick |white (matte|paper)|card-*stock|cardboard|coated)/
/^\d+\s*page/

Binding

/(pages$|sider$|^hand(\-| )*stapled|^square-bound|various bindings)/
/saddle|softcover|hardcover|digest|^bound|perfect(-|\s)*bound|prestige|^rebound|bookshelf|deluxe format/

Pub Format

/(ongoing|one-shot|series|collected edition|collected volume|collection|standard color comic|irregular|promotional|^published|^quarterly|^weekly|web-comic collection)/
/(bi-monthly|bi-weekly|direct sales|direct-sale)/

Notes

/(with 3-d|with ads|wraparound cover|various formats|^various|^varies|^variable|^varied)/)

Discard

Data that is garbage and is discarded.

  • none yet

Other

  • No additional special rules yet but I expect there to be a few

04 DEC 12 Status

First report back to the tech list. I plan on doing this weekly and hopefully wrapping up this project before the end of the year.

My goal with this first update is to get some feedback to make sure that this process is going to be useful to the group before I dive back in for round 2.

Current Status

  • Using the 01-Dec-2012 MySQL datadump I created the first iteration of this brute force script using perl
  • The challenge really isn't the programming is the data evaluation and hence I used perl out of convenience. The current version of the script is very brute force and is simply a set of string transforms and checks. I'll clean this up down the road.
  • The set of transforms and categorization checks are listed above
  • I set a goal to get to < 2000 unknowns and stopped there for this week. In reviewing the results there are still a number of errors in the rules which I need to go back and fix
  • There are two major remaining issues to address:
    • handling the further subdividing of a field based on a secondary delimiter
    • tightening the categorization checks which are currently execution order specific and as a result some fields which match multiple rules are going in the wrong final bucket. I just need to tighten the rules
  • I've avoided many of the foreign language entries but will tackle them next
  • I also want to double check that I'd not inadvertently changing character sets for the foreign language encoding

Current Stats

Metric Value Notes
Total number of series 67244
# of series with N fields based only on semi-colon delimiter
1 6919
2 6964
3 6624
4 18145
5 18316
6 8096
7 1922
8 226
9 15
10 12
11 2
12 3
Total number of unique fields 11187
Field category count distribution
Unknown 1919
Color 1237
Dimensions 5436
Paper 739
Binding 892
PubFormat 756
Notes 208
Series Field Count Distribution
Unknown 20380
Color 40614
Dimensions 43852
Paper 12775
Binding 38280
PubFormat 21980
Notes 626

Next Steps

  • continue to plow through the data with another update posted next weekend 09 Dec 2012

Current Data

  • 121204 Format Refactoring.zip - all the data is in a single zip file for convenience. Hopefully there won't be any access issues as I hosted this on my google drive account.
    • six separate text files with all of the individual values for each field: fields-binding.txt, fields-color.txt, fields-dimensions.txt, fields-notes.txt, fields-paper.txt, fields-pubformat.txt, fields-unknown.txt
    • fields-log.txt - text file showing how all field values were transformed either to fix a typo or to normalize the data
    • fields-stats.txt - text file with the stats above
    • format.txt - tab separated text file containing
      • gcd_series.id - unique series id from gcd_series table
      • gcd_series.sort_name - use sort_name from gcd_series table
      • gcd_series.format - original format field data from gcd_series table
      • color - new data values
      • dimensions - new data values
      • paper - new data values
      • binding - new data values
      • pubformat - new data values
      • notes - new data value really this is just the data that will be appended to the current notes value. I prefaced each of the uncategorized fields with TODO:
    • gcd-format.pl - the really ugly perl script that does the work