CSCI 403 Database Management

Spring 2018

Project 6: Convert

(Back to Projects)

Purpose



Overview


For this project, you will be creating a music database along the lines of MusicBrainz (musicbrainz.org) or Discogs (www.discogs.com). These databases track an enormous amount of metadata on commercial and non-commercial music releases on various media. Your database will be considerably simpler, focusing just on albums (generically, ignoring media-specific variations), artists, and labels.

To see the data you will be working with, take a look at the table public.project6 in the course database. You will see there is a considerable amount of data on various artists and their albums, but of course, it is just a drop in the bucket compared to what the above websites manage. (The data, by the way, comes from the MusicBrainz database. I performed queries on their database to gather a limited dataset in the format I wanted, so any data errors should likely be ascribed to my querying and not to the MusicBrainz project.)

This project is similar to project 3 in that you are asked to create some tables and populate them with data from another table. However, the emphasis in project 3 was on basic SQL with a focus on SELECT queries. In this project the emphasis is on schema creation and data migration, and the fine details are left up to you. You are provided with an entity-relationship diagram (ERD) and asked to create a schema based on the ERD. Finally, you need to extract data for your schema from the raw data provided.

You will be asked to provide several SQL queries, not so much to test your skill at constructing SELECT queries (although practice is always good!), but to better illuminate for us your database schema.


Details


This project requires the creation of SQL DDL to produce the required schema, and SQL queries to populate the schema tables from the project data. All of the tasks below should be done strictly via queries executed on the database, e.g., using psql or SQuirreL. That is, although you could query the database and do some of the tasks within a client programming environment, you are required for this project to do all of the work on the database (server) side, using pure SQL. The deliverable for this project is a SQL script (a text file containing your queries - traditionally with a .sql extension) that constructs, populates, and tests your schema. However, unlike previous projects, you may use multiple queries to achieve the end goals - you may even create intermediate tables or views, if you wish.

Build your SQL script carefully and test as you go. You can create a single script which drops tables and recreates them (note you can use something like DROP TABLE IF EXISTS foo CASCADE; in your script. This will drop foo and any objects dependent on it, but won't give an error if foo doesn't currently exist).

Please comment your SQL script (anything between /* and */ will be ignored by the SQL processor) so that we can figure out what is happening if something goes wrong when we test!

Part I: Constructing your database schema

See the end of this document for the ERD and detailed information about the data items you will be working with. Your job is to turn this ERD into a database schema, complete with tables, primary keys, foreign keys, and anything else you feel is necessary. There is no one correct answer here; an ERD can give rise to different schemas, although some similar structure is expected. Please follow the algorithm presented in your book and in class to convert the ERD to a schema as closely as possible.

As presented, the ERD suggests you should use ID attributes as primary keys for albums, artists, and labels. This is generally good practice, since (a) natural key values may not exist due to the possibility of non-unique names (e.g., more than one artist or album of the same name) and (b) it can be cumbersome and costly in terms of space to use complex or text keys as foreign keys. Typically ID fields are just integers. In the PostgreSQL database, the easiest way to generate appropriate unique values in these fields is to declare them as type "serial". When you insert into the tables, insert into all columns except the ID column; the ID values will be generated sequentially by the database on your behalf.

You will need to decide what to name your tables and attributes in the tables, although naming them the same as the entities and attributes in the ERD is not a bad first choice. You will have to add additional attributes for foreign keys; generally these should be named like "foreignentity_id", where you are referencing the ID attribute of the foreign entity. You can name things any way you like, as long as they are semantically reasonable.

Part II: Populating your database

You next need to work out how to extract the data you need in your tables from the table public.project6. This table contains a very flattened (de-normalized - we'll talk about normalization in a future lecture) view of data about albums and artists, so there is a lot of redundancy. Working out how to eliminate the redundancy appropriately (hint: DISTINCT) and populate your tables correctly will be a challenge.

Here are some hints to get you started:

Part III: Querying your database

At the end of your SQL script, please include queries to answer each of the following questions. The purpose here isn't so much to test your SELECT skills as to let us see how your database is structured. These queries may also be a good test of your database (as well as of your skills!), so you may want to construct these as you do part II above and use them to test portions of your database as they come online.

Together with the questions below, I'm showing the answers I received when I went through this assignment. If your answer does not match mine, don't immediately assume it is an error (in your approach or mine), but spend a little time trying to work out what might cause the difference.

Questions and answers:

  1. Get all members of The Who and their begin/end years with the group ordered by their starting year and name. My solution looks like:
              name          | begin_year | end_year 
    ------------------------+------------+----------
     John Entwistle         |       1964 |     2002
     Keith Moon             |       1964 |     1978
     Pete Townshend         |       1964 |         
     Roger Daltrey          |       1964 |         
     Kenney Jones           |       1978 |     1985
     John "Rabbit" Bundrick |       1979 |     2011
    (6 rows)
    
  2. Get all groups that Chris Thile has been a part of:
          name      
    ----------------
     Punch Brothers
     Nickel Creek
    (2 rows)
    
  3. Get the album titles, years, and labels for all the albums put out by the band Talking Heads, ordered by year:
                    title                | year |    label     
    -------------------------------------+------+--------------
     Talking Heads: 77                   | 1977 | Philips
     More Songs About Buildings and Food | 1978 | Sire Records
     Remain in Light                     | 1980 | Sire Records
     Speaking in Tongues                 | 1983 | Sire Records
     Little Creatures                    | 1985 | Sire Records
     True Stories                        | 1986 | EMI
     Naked                               | 1988 | Fly
    (7 rows)
    
  4. Get all albums (album, year, artist, and label) that Chris Thile has performed on, ordered by year:
                    title                | year |     artist     |       label        
    -------------------------------------+------+----------------+--------------------
     Nickel Creek                        | 2000 | Nickel Creek   | Sugar Hill Records
     This Side                           | 2002 | Nickel Creek   | Sugar Hill Records
     Into the Cauldron                   | 2003 | Chris Thile    | Sugar Hill Records
     Why Should the Fire Die?            | 2005 | Nickel Creek   | Sugar Hill Records
     How to Grow a Woman From the Ground | 2006 | Chris Thile    | Sugar Hill Records
     Punch                               | 2008 | Punch Brothers | Nonesuch
     A Dotted Line                       | 2014 | Nickel Creek   | Nonesuch
    (7 rows)
    
  5. Get all albums (artist, album, year) in the 'electronica' genre ordered by year, artist:
          name      |         title          | year 
    ----------------+------------------------+------
     Massive Attack | Blue Lines             | 1991
     Underworld     | Dubnobasswithmyheadman | 1993
     Massive Attack | Protection             | 1994
     Fluke          | Risotto                | 1997
     Massive Attack | Mezzanine              | 1998
     Amon Tobin     | Supermodified          | 2000
     Massive Attack | 100th Window           | 2003
    (7 rows)
    
  6. Get all the tracks on Led Zeppelin's Houses of the Holy in order by track number:
                name             | number 
    -----------------------------+--------
     The Song Remains the Same   | A1
     The Rain Song               | A2
     Over the Hills and Far Away | A3
     The Crunge                  | A4
     Dancing Days                | B1
     D'yer Mak'er                | B2
     No Quarter                  | B3
     The Ocean                   | B4
    (8 rows)
    
  7. Get all genres that James Taylor has performed in:
      genre  
    ---------
     rock
     folk
     jazz
     country
    (4 rows)
    
  8. Get all albums published by a label headquartered in Hollywood:
        name    |             title               | year |    name     
    ------------+---------------------------------+------+-------------
     Supertramp | Breakfast in America            | 1979 | A&M Records
     Supertramp | Crisis? What Crisis?            | 1975 | A&M Records
     Supertramp | Crime of the Century            | 1974 | A&M Records
    (3 rows)
    

Grading:


Part I: 5 points for each element on ERD correctly translated to a schema element 115 points
Part II/III: 10 points for each query we can successfully run against your database 80 points
README 5 points
Total: 200 points

Documentation:


README:

  1. Include names of all people who helped/collaborated as per the syllabus
  2. Describe the challenges you encountered and how you surmounted them
  3. What did you like/dislike about the assignment?
  4. How long did you spend on this assignment?

Submit a zip file on Canvas containing:


ERD and Data Dictionary


ERD:

A printable version is available here. A detailed description of the entities, attributes, and relationships (i.e., a "data dictionary") is provided below. Because many of the attribute names are not unique in the ERD, I've prepended their entity names and a dot to make clear which attribute I'm describing.
ElementDescription
ArtistAn entity describing a person or group recording music albums.
Artist.IDUnique identifier for an artist.
Artist.NameThe name of the person or group, e.g., "Led Zeppelin".
Artist.TypeThis is an indicator of whether the artist is a person or a group. Groups may have associated member artists, whereas persons may not.
is member ofA relationship between artist groups and their members. A group can have 1 or more members who are also artists (persons). A person can be a member of 0 or more groups. Where known, the membership years of the person are also indicated by relationship attributes (described below).
is member of.Begin YearThe year that the group member in the relationship began in the group.
is member of.End YearThe year that the group member in the relationship stopped being in the group.
LabelAn entity describing a record label (a publisher of albums).
Label.IDUnique identifier for a record label.
Label.NameThe label name, e.g., "Decca Records".
Label.LocationThe headquarters location of the record label.
AlbumAn entity describing a single music album. This is an abstraction in that it tries to capture the notion of a single release by the artist of a collection of tracks under a specific album title. In real life, releases may be made at different times in different countries, even by different labels and with different tracks. The intent in this database is to capture information about the initial release as a proxy for all subsequent releases of the same title.
Album.IDUnique identifier for an album.
Album.TitleThe name assigned by the artist to this album.
Album.YearThe year the album was released.
Album.GenreA genre that the album might be considered to be a part of. An album might be associated with multiple genres. The genre values allowed in this database are 'folk','rock','country','jazz', and 'electronica'.
releasedThe relationship between artists and their albums. An album has exactly one principal artist (the group or person who released the album, not individual members of a group). An artist may release 0 or more albums.
publishedThe relationship between labels and albums. An album has exactly one label, and a label can publish 0 or more albums.
TrackAn entity representing a single track (song) on an album.
Track.NameThe name of the track or song. Note that this is marked as a partial key, suggesting that track names are unique per album. This assumption will work for the data provided, but in fact, does not work in a larger context - songs may be repeated on albums.
Track.NumberA number or string indicating the position of the track in relation to the rest of the album.
is song onThe relationship between albums and their tracks. An album has 1 or more tracks, whereas each track is only on 1 album.