Bulk updating YAML/CSV files with Bash

by @stackptr on August 18, 2022

At Freckle we maintain hundreds of thousands of questions to engage students and help them practice their skills. These questions are spread across many CSV and YAML assets. On occasion we need to make bulk edits to these assets, sometimes requiring multiple assets and associations between them.

We’ve used many tools to do this: Excel, Python, Ruby, Haskell, JavaScript, etc. In this post we explore editing YAML and CSV files in bulk, using Bash, so we can complete these tasks before the heat death of the universe.

Let’s explore an example scenario. We want to bulk replace standard identifiers with skill identifiers in question assets.

Bash will save the day

A typical asset uses YAML to define the content, an identifier for the question, and the standard identifier it is linked to:

- id: question-id
  standard_id: FR-SYS-1-A.1.c
  text: ...

mappings.csv is a lengthy list of all the standards, each paired with the skill that can replace it:

standard_id,skill_id
FR-SYS-1-A.1.a,2a5abd24-56f0-47c0-95bf-4384db773921
FR-SYS-1-A.1.b,6b67007a-e512-4b0b-9b02-df26366c38fc
FR-SYS-1-A.1.c,edf5dc7e-c5da-4fbf-9cf1-de8396537204
FR-SYS-1-A.2.a,b7fb5844-23fa-4271-9192-48d22d4b942b

The script should therefore update the asset by replacing the standard_id field with a skill_id field using the appropriate mapping:

- id: question-id
  skill_id: edf5dc7e-c5da-4fbf-9cf1-de8396537204
  text: ...

We can imagine transforming mappings.csv into a list of sed commands that update the identifiers. However, this solution would mean running hundreds of commands against each file, which is slow and unnecessary: question assets are logically grouped such that only a few mappings are needed to update any one asset.
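For concreteness, that naive approach might look like the following sketch (a hypothetical illustration using the sample data above), which turns every CSV row into a sed expression regardless of whether any asset needs it:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the naive approach: one sed expression per
# mapping, to be applied to every file.
set -euo pipefail
cd "$(mktemp -d)"

# Sample data, matching the article's mappings.csv
cat > mappings.csv <<'EOF'
standard_id,skill_id
FR-SYS-1-A.1.a,2a5abd24-56f0-47c0-95bf-4384db773921
FR-SYS-1-A.1.c,edf5dc7e-c5da-4fbf-9cf1-de8396537204
EOF

# Turn each CSV row into a sed substitution, escaping the dots
tail -n +2 mappings.csv |
  while IFS=, read -r standard skill; do
    printf 's/standard_id: %s/skill_id: %s/\n' "${standard//./\\.}" "$skill"
  done > all_substitutions.sed

cat all_substitutions.sed
# sed -f all_substitutions.sed -i -- *.yaml  # every expression, every file
```

With hundreds of mappings, every one of those expressions would be tried against every line of every asset.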

Therefore, an efficient script will do the following:

1. Read in the mappings CSV.
2. Build an associative array from standard id to skill id.
3. Process each asset file, applying only the substitutions that file needs.

Reading CSVs the UNIX way

While there exist general-purpose CSV parsing tools to handle the myriad possibilities in this space, this mappings.csv file is easily read using the UNIX toolchain.

First, cut is used to extract a field at some location by tokenizing a line with a delimiter. For example, given $LINE as the first line of CSV data in the mappings.csv:

❯ echo "$LINE" | cut -d ',' -f1
FR-SYS-1-A.1.a

❯ echo "$LINE" | cut -d ',' -f2
2a5abd24-56f0-47c0-95bf-4384db773921

The -d flag sets the delimiter (here a comma), and -f takes the number of the field to extract.

cat mappings.csv could be piped to cut to read in and extract the contents of the CSV, but this would include the header on the first line. Therefore, tail -n +2 is used instead to discard that unnecessary line and pipe the rest to cut.

tail -n +2 mappings.csv | cut -d ',' -f1 will print the contents of the first column of each line. Command substitution allows this output to be written to a variable for future use:

csv_file="mappings.csv"
arr_standard_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f1) )
arr_skill_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f2) )
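A quick sanity check of those arrays, recreating the sample mappings.csv from above (note that the unquoted command substitution relies on word splitting, which is safe here because the fields contain no whitespace):

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)"

# Recreate the sample mappings.csv from above
cat > mappings.csv <<'EOF'
standard_id,skill_id
FR-SYS-1-A.1.a,2a5abd24-56f0-47c0-95bf-4384db773921
FR-SYS-1-A.1.b,6b67007a-e512-4b0b-9b02-df26366c38fc
EOF

csv_file="mappings.csv"
arr_standard_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f1) )
arr_skill_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f2) )

echo "${#arr_standard_ids[@]}"   # → 2 (number of data rows)
echo "${arr_standard_ids[0]}"    # → FR-SYS-1-A.1.a
echo "${arr_skill_ids[1]}"       # → 6b67007a-e512-4b0b-9b02-df26366c38fc
```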

Lookup tables using associative arrays

Both arr_standard_ids and arr_skill_ids, after reading in mappings.csv, are one-dimensional array variables. Some operations on an $array in Bash include ${array[0]} to access the first element, ${array[@]} to expand all elements, ${#array[@]} to get the length, and ${!array[@]} to list the indices.

One-dimensional arrays are not very ergonomic for lookups, so the next step is to build a map-like structure where a known standard identifier is mapped to a skill identifier.

To build this map, an associative array is declared and the elements of the previously read arrays are zipped into key/value pairs:

declare -A map_standard_skill
for i in "${!arr_standard_ids[@]}"
do
  map_standard_skill["${arr_standard_ids[$i]}"]="${arr_skill_ids[$i]}"
done

A value in an associative array is looked up in the same way as in a one-dimensional array, except the indices can be arbitrary strings:

standard_id="FR-SYS-1-A.1.c"
echo "${map_standard_skill[$standard_id]}"

# => edf5dc7e-c5da-4fbf-9cf1-de8396537204

3 Rs: Reading, ‘riting, and regex

Now that the lookup table is built, the script will read some input asset file to build a list of replacements. For example, if an asset has defined standard_id: FR-SYS-1-A.1.c, the script should run the following replacement:

s/standard_id: FR-SYS-1-A\.1\.c/skill_id: edf5dc7e-c5da-4fbf-9cf1-de8396537204/

(Note the necessity to escape special characters such as the dots; ignore this at your own peril.)

A process_file function will take a filename as its argument, reading the file to build the list of replacements before writing them back to the file with sed:

process_file () {
  local file
  file="$1"

  declare -a substitutions

  while read -r line
  do
    # Do something to add to substitutions
  done < "$file"

  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}


for file in *.yaml; do
  process_file "$file"
done

Bash provides a =~ operator to compare a string to an extended regular expression. If the regular expression has capture groups, each captured group is available in the BASH_REMATCH array. This lets us search for a line with a standard_id YAML key and capture its value in order to look up the skill_id:

process_file () {
  local file
  file="$1"
  declare -a substitutions

  while read -r line
  do
    if [[ $line =~ ^standard_id:\ (.*)$ ]]
    then
      lookup="${BASH_REMATCH[1]}" # standard_id value
      skill_id="${map_standard_skill[$lookup]}"
      find="$line"
      replace="skill_id: $skill_id"
      substitutions+=("s/$find/$replace/")
    fi
  done < "$file"

  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}
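The =~ match above can be checked in isolation. One subtlety: read -r with the default IFS strips the leading YAML indentation, which is why the pattern can anchor on ^standard_id: even though the line is indented in the file (sed still matches, since the substitution pattern is a substring of the indented line):

```shell
#!/usr/bin/env bash
# Standalone check of the =~ capture used in process_file
line="standard_id: FR-SYS-1-A.1.c"

if [[ $line =~ ^standard_id:\ (.*)$ ]]; then
  echo "${BASH_REMATCH[1]}"   # → FR-SYS-1-A.1.c
fi
```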

To our peril, we’ve ignored the need to escape special characters, which is remedied through an escape function and command substitution:

escape () {
  echo "${1//\./\\.}"
}

process_file () {
  local file
  file="$1"
  declare -a substitutions

  while read -r line
  do
    if [[ $line =~ ^standard_id:\ (.*)$ ]]
    then
      lookup="${BASH_REMATCH[1]}" # standard_id value
      skill_id="${map_standard_skill[$lookup]}"
      find=$(escape "$line")
      replace=$(escape "skill_id: $skill_id")
      substitutions+=("s/$find/$replace/")
    fi
  done < "$file"

  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}

for file in *.yaml; do
  process_file "$file"
done
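A quick check that escape produces a sed-safe pattern:

```shell
#!/usr/bin/env bash
# escape turns each literal dot into an escaped dot for sed
escape () {
  echo "${1//\./\\.}"
}

escape "standard_id: FR-SYS-1-A.1.c"   # → standard_id: FR-SYS-1-A\.1\.c
```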

Putting it together

Now each step of the script has been worked through in isolation. We can glue the pieces together into the final product:

#!/usr/bin/env bash

# 1. Read in mappings CSV:
csv_file="mappings.csv"
arr_standard_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f1) )
arr_skill_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f2) )

# 2. Build associative array of standard id to skill id
declare -A map_standard_skill
for i in "${!arr_standard_ids[@]}"
do
  map_standard_skill["${arr_standard_ids[$i]}"]="${arr_skill_ids[$i]}"
done


# Convenience functions for processing
escape () {
  echo "${1//\./\\.}"
}

# 3. Process an asset $file
process_file () {
  local file
  file="$1"
  declare -a substitutions

  while read -r line
  do
    if [[ $line =~ ^standard_id:\ (.*)$ ]]
    then
      lookup="${BASH_REMATCH[1]}" # standard_id value
      skill_id="${map_standard_skill[$lookup]}"
      find=$(escape "$line")
      replace=$(escape "skill_id: $skill_id")
      substitutions+=("s/$find/$replace/")
    fi
  done < "$file"

  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}

for file in *.yaml; do
  process_file "$file"
done
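To see the whole thing in action, here is a self-contained run against the sample data from above (file names and contents are illustrative, and this assumes GNU sed's -i flag; BSD sed wants -i ''):

```shell
#!/usr/bin/env bash
# End-to-end demo of the final script against sample data.
set -e
cd "$(mktemp -d)"

cat > mappings.csv <<'EOF'
standard_id,skill_id
FR-SYS-1-A.1.a,2a5abd24-56f0-47c0-95bf-4384db773921
FR-SYS-1-A.1.c,edf5dc7e-c5da-4fbf-9cf1-de8396537204
EOF

cat > questions.yaml <<'EOF'
- id: question-id
  standard_id: FR-SYS-1-A.1.c
  text: ...
EOF

# 1. Read in mappings CSV
csv_file="mappings.csv"
arr_standard_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f1) )
arr_skill_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f2) )

# 2. Build associative array of standard id to skill id
declare -A map_standard_skill
for i in "${!arr_standard_ids[@]}"; do
  map_standard_skill["${arr_standard_ids[$i]}"]="${arr_skill_ids[$i]}"
done

escape () {
  echo "${1//\./\\.}"
}

# 3. Process an asset $file
process_file () {
  local file
  file="$1"
  declare -a substitutions

  while read -r line; do
    if [[ $line =~ ^standard_id:\ (.*)$ ]]; then
      lookup="${BASH_REMATCH[1]}"
      skill_id="${map_standard_skill[$lookup]}"
      find=$(escape "$line")
      replace=$(escape "skill_id: $skill_id")
      substitutions+=("s/$find/$replace/")
    fi
  done < "$file"

  for sub in "${substitutions[@]}"; do
    sed -i "$sub" "$file"
  done
}

for file in *.yaml; do
  process_file "$file"
done

cat questions.yaml
# → - id: question-id
# →   skill_id: edf5dc7e-c5da-4fbf-9cf1-de8396537204
# →   text: ...
```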

YAML and CSV assets are flexible and portable, but there is not always one clear method to make large changes to them. Bash scripts are similarly flexible and portable, making them a useful option when the need arises for bulk updates. Relying on the UNIX toolchain allows these scripts to be simple, composable, and stable. And who doesn’t love hacking together a little Bash script now and again!?