Compare the Schools. Part 2. Loading data.

This is the second part of the “Compare the Schools” series, where I will show you how to create a simple data loader that will take an input in the CSV format and will populate MongoDB collections.

Data model:

CTS_DB_schema

We will have two collections – ‘school’ and ‘result’ with one-to-many relationships between them.

We will use different data sources to populate the fields.

Also, we will calculate the Rankins Score (ranking_score) and, the Weighted Moving Average  of the Ranking Score (ranking_score_wma), and, finally, the Rank (rank) for that school for a given year – we store all that in the Result collection too. It is redundant, however, since we will have, one new data load per year, it would make sense to calculate ranks at that time.

Let’s start with the first data source – VCAA Post Compulsory Completion and Achievement Information. The data is in the PDF format. I have tried few converters, however, the only one that did work was Tabula. The conversion process is rather simple and is well covered in the Tabula’s documentation.

Here is directory structure for the code.


├── README.md
├── dataload.go
├── dataload_test.go
├── load.sh
└── models
 └── models.go

models.go defines the School and the Result structs:


type (

//Result ...
Result struct {
Year int `json:"year" bson:"year"`
  ...
  PercentCompletionVCE int `json:"percent_completion_vce" bson:"percent_completion_vce" csv:"percent_completion_vce"`
  ...
  MedianVCEScore int `json:"median_vce_score" bson:"median_vce_score" csv:"median_vce_score"`
  PercentScore40AndOver float32 `json:"percent_score_40_and_over" bson:"percent_score_40_and_over" csv:"percent_score_40_and_over"`
  School bson.ObjectId `json:"school" bson:"school_id" csv:"school_id"`
  RankingScore float32 `json:"ranking_score" bson:"ranking_score"`
  RankingScoreWMA float32 `json:"ranking_score_wma" bson:"ranking_score_wma"`
  Rank int `json:"rank" bson:"rank"`
}
//School ...
School struct {
  Name string `json:"school_name" bson:"name" csv:"name"`
  Adult bool `json:"adult" csv:"adult"`
  Small bool `json:"small" csv:"small"`
  InterBcl bool `json:"inter_bcl" csv:"inter_bcl"`
  EduSector string `json:"edu_sector" csv:"edu_sector"`
  Type string `json:"type" csv:"type"`
  Address string `json:"address" csv:"address"`
  Locality string `json:"locality" csv:"locality"`
  Postcode string `json:"postcode" csv:"postcode"`
  State string `json:"state" csv:"state"`
  Id bson.ObjectId `json:"id" bson:"_id,omitempty"`
}
)

To calculate the ranking score I decided to follow ideas of Chris Tofallis (Add or Multiply? A Tutorial on Ranking and Choosing with Multiple Criteria. For now I am using Percent Completion VCE, Percent Score of 40 and over and Median VCE Score. To avoid getting zero score, which would be the case if any of these are equal to 0, I am shifting all values by 1. Therefore the formula is:

CodeCogsEqn

The weighted moving average is generally calculated by this formula:

CodeCogsEqn-3

where R is a Ranking Score, W is a weight for a period t.

The idea behind calculating WMA is to smooth year-to-year fluctuations of the results, while giving more weight to the most recent ones.

Practically we will calculate WMA for last 3 years. The weight for the current year will be 3, for the previous one – 2 and for the one before – 1. Therefore:

CodeCogsEqn-4

Here is the code.

//CalculateRankingScore is a method to calculate absolute ranking score
// for a given school at a given year
func (r Result) CalculateRankingScore() (score float32) {
 score = float32((r.MedianVCEScore+1)*(r.PercentCompletionVCE+1)) * (r.PercentScore40AndOver + float32(1))
 return
}

//CalculateRankingScoreWMA is a method to calculate a weighted moving average of the
//ranking score over number of years
func (r Result) CalculateRankingScoreWMA(ss []float32) (result float32) {
 var numerator float32
 var denominator int
 for i, v := range ss {
   numerator = numerator + float32(i+1)*v
   denominator = denominator + i + 1
 }
 result = numerator / float32(denominator)
 return
}

Note that we do not explicitly pass the number of years to the CalculateRankingScoreWMA method – instead, we feed it with a slice of ranking scores. This will allow us to calculate the WMA even if the number of scores is less than 3 years – for example, if the school is new, or we calculating this value for the first two years.

Now, that we have our models defined and data converted into CSV, we can load it into the MongoDB collections.

Here is the problem though. VCAA data format is not consistent – there are more columns added in the recent years. Also, we want our loader to be universal. Therefore our CSV rows will contain some, but not all model data. Also, they might contain rows we are not interested in – that are not in our models.

How to deal with that?

The answer is – first we will create a map of CSV columns to the models’ fields.

Suppose our CSV header looks like this:


name,small,locality,adult,inter_bcl,median_vce_score,percent_score_40_and_over,percent_completion_vce

For our School model we want to know that in that CSV file we have Name attribute at the column 0, Small – at column 1,  Locality – at column 2, Adult – at 3, InterBcl (which, by the way, is International Baccalaureate) – at 4.

For the Result model, we just have PercentScore40AndOver at column 5 and PercentCompletionVCE at column 6.

Okay, first, we define a HeaderSlice type as a slice of strings.

type headerSlice []string

Next, we write columnNumByHeader method of the headerSlice. It takes a string for the header name and returns a column number and ‘true’ if found, or ‘-1’ and false – if not.

func (hs headerSlice) columnNumByHeader(title string) (int, bool) {
	for i, v := range hs {
		if v == title {
			return i, true
		}
	}
	return -1, false
}

Finally, we write mapStructToCSVColumns function. It accepts a headerSlice and a reference to a struct via an interface and returns a map, i.e. {Name: 0, Small: 1, Locality: 2, Adult: 3, InterBcl: 4}

func mapStructToCSVColumns(hs headerSlice, strct interface{}) map[string]int {
	m := make(map[string]int)
	t := reflect.TypeOf(strct).Elem()
	for i := 0; i < t.NumField(); i++ {
		tag := t.Field(i).Tag.Get("csv")
		fieldName := t.Field(i).Name
		colnumber, found := hs.columnNumByHeader(tag)
		if found {
			m[fieldName] = colnumber
		}
	}
	return m
}

This function uses reflect package to get fields names by the csv tag. reflect.TypeOf(strct).Elem() returns a pointer to the struct ‘strct’. Now we can iterate through the fields, getting the value of the csv tag and the field name. Then cvs tag is passed to the columnNumByHeader method of the headerSlice. If the corresponding column name is found, the number of that column is added to the resulting map. The reason we use interface type here is to be able to use this function with both School and Result structs.   Now, that we have this map, we can populate an instance of a struct. For that we use populateStructFromBody function.


func populateStructFromBody(row []string, m map[string]int, strct interface{}, debug bool) (interface{}, error) {
	if debug {
		log.Println(row)
	}
	t := reflect.TypeOf(strct)
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return nil, errors.New("Input param is not a struct")
	}
	rr := reflect.New(t).Elem()

	for k, v := range m {
		if debug {
			log.Println(k)
			log.Println(row[v])
		}
		re := regexp.MustCompile(`\^|-|I\/D|N\/A|&amp;amp;amp;amp;amp;amp;lt;4|&amp;amp;amp;amp;amp;amp;lt; 4`)
		row[v] = re.ReplaceAllString(row[v], "")
		if debug {
			log.Println("after conversion")
			log.Println(row[v])
		}
		f := rr.FieldByName(k)
		switch f.Type().String() {
		case "int":
			switch row[v] {
			case "":
				f.SetInt(0)
			default:
				intval, err := strconv.ParseInt(row[v], 10, 0)
				if err != nil {
					return nil, errors.New("Cant convert to Int")
				}
				f.SetInt(intval)
			}
		case "float32":
			switch row[v] {
			case "":
				f.SetFloat(0)
			default:
				floatval, err := strconv.ParseFloat(row[v], 32)
				if err != nil {
					return nil, errors.New("Can't convert to Float")
				}
				f.SetFloat(floatval)
			}
		case "bool":
			switch row[v] {
			case "A":
				f.SetBool(true)
			case "*":
				f.SetBool(true)
			case "Y":
				f.SetBool(true)
			default:
				f.SetBool(false)
			}
		case "string":
			row[v] = strings.Title(strings.ToLower(row[v]))
			f.SetString(row[v])
		default:
			return nil, errors.New("Something is wrong...")
		}
	}
	return rr.Interface(), nil
}

We pass CVS row data (as a slice), a map of a struct fields to the CSV columns and a reference to a struct. With reflect.TypeOf(strct) we are getting pointer to a struct, which we can use to create an instance (reflect.New(t).Elem()). The return value rr is, itself is a pointer to a type.

Then we go through the content of a map, created with mapStructToCSVColumns. Knowing the field name, we are getting its type. This is needed to convert CSV strings as necessary. While doing that we also do some translating (i.e. ‘>4’ to an empty value, ‘A’ to the boolean ‘true’, etc). Having value with the correct type we can now populate that field.

The last non-main function is readCSV. It takes a filename and returns slice of slices.

func readCSV(f string) ([][]string, error) {
	var err error
	file, err := os.Open(f)
	check("Can't open file", err)
	defer file.Close()
	reader := csv.NewReader(file)
	dat, err := reader.ReadAll()
	return dat, err
}

Now we are ready to write the main function.

Let’s go through it bit y bit:

	filePtr := flag.String("in", "", "csv file to load")
	yearPtr := flag.Int("year", 0, "year of the results")
	mongoPtr := flag.String("mongo", "mongodb://localhost", "mongodb connection string")
	databasePtr := flag.String("database", "", "database name")
	debugPtr := flag.Bool("debug", false, "[false]|true")
	wmaPeriod := 3
	flag.Parse()
	if *filePtr == "" || *yearPtr == 0 {
		flag.PrintDefaults()
		os.Exit(1)
	}
	var err error
	dat, err := readCSV(*filePtr)
	check("Can not read csv file", err)
	header := headerSlice(dat[0])
	if *debugPtr {
		log.Println(header)
	}
	body := dat[1:]

Here we do the following:

  • use ‘flag’ package to define command line parameters – ‘in’, ‘year’, ‘mongo’, ‘database’ and ‘debug’;
  • set WMA period to 3 years;
  • load CSV file with the name (full path) passed by the ‘in’ parameter value.
  • use the first row to set header slice
  • assign the rest of the rows to the ‘body’ variable.

Next, we create structs to the CVS columns maps for both our models:

schoolStructFieldToColumnsMap := mapStructToCSVColumns(header, &amp;models.School{})
resultStructFieldToColumnsMap := mapStructToCSVColumns(header, &amp;models.Result{})

Now we are ready to connect to the database…


s, err := mgo.Dial(*mongoPtr)
	sc := s.DB(*databasePtr).C("schools")
	rc := s.DB(*databasePtr).C("results")

… and go through each of the CVS rows, populate structs and upsert them to the database.

for _, row := range body {
	bar.Increment()
	if *debugPtr {
		log.Println(row)
	}
	interfaceSchool, _ := populateStructFromBody(row, schoolStructFieldToColumnsMap, &amp;amp;amp;amp;amp;amp;models.School{}, *debugPtr)
	interfaceResult, _ := populateStructFromBody(row, resultStructFieldToColumnsMap, &amp;amp;amp;amp;amp;amp;models.Result{}, *debugPtr)
	school := interfaceSchool.(models.School)
	result := interfaceResult.(models.Result)
	query := bson.M{"name": school.Name, "locality": school.Locality}
	_, err = sc.Upsert(query, bson.M{"$set": school})
	check("Can not upsert school", err)

	var sch models.School
	err = sc.Find(query).One(&amp;amp;amp;sch)
	check("Can not find a school", err)

	result.School = sch.Id
	result.Year = *yearPtr
	result.RankingScore = result.CalculateRankingScore()
// calculating Ranking Score WMA
	var Scores []float32
	var rr []models.Result
	err = rc.Find(bson.M{"school_id": result.School}).Sort("year").Limit(wmaPeriod - 1).All(&amp;amp;amp;amp;amp;amp;rr)
	check("Can not perform the find query", err)
	for _, r := range rr {
		Scores = append(Scores, r.RankingScore)
	}
	Scores = append(Scores, result.RankingScore)
	result.RankingScoreWMA = result.CalculateRankingScoreWMA(Scores)

	_, err = rc.Upsert(bson.M{"school_id": result.School, "year": result.Year}, bson.M{"$set": result})
	check("Can not upsert the result", err)
}

And the last bit is to calculate the actual School ranks. Later I will move this functionality to the front-end, so that I can have ranks for filtered lists (i.e. only government schools).

</pre>
<pre>//Calculate school ranks
var rank int
rank = 0
var previousRankingScore float32
previousRankingScore = 0.00
var rrr []models.Result
err = rc.Find(bson.M{"year": *yearPtr}).Sort("-ranking_score_wma").All(&rrr)
check("Can not get results from the database.", err)

for i, r := range rrr {
	if r.RankingScoreWMA != previousRankingScore {
		rank = i + 1
	}
	log.Printf("School ID: %v, Ranking score: %v, Rank: %v", r.School, r.RankingScoreWMA, rank)
	err = rc.Update(bson.M{"school_id": r.School, "year": r.Year}, bson.M{"$set": bson.M{"rank": rank}})
	previousRankingScore = r.RankingScoreWMA
}</pre>
<pre>

Here is the full code for the dataload.go and some tests.

Advertisements
Compare the Schools. Part 2. Loading data.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s