Bulk updating YAML/CSV files with Bash
by @stackptr on August 18, 2022
At Freckle we maintain hundreds of thousands of questions to engage students and help them practice their skills. These questions are spread across many CSV and YAML assets. On occasion we need to make bulk edits to these assets, sometimes requiring multiple assets and associations between them.
We’ve used many tools to do this: Excel, Python, Ruby, Haskell, JavaScript, etc. In this post we explore editing YAML and CSV files in bulk, using Bash, so we can complete these tasks before the heat death of the universe.
Let’s explore an example scenario. We want to bulk replace standard identifiers with skill identifiers in question assets.
Bash will save the day
A typical asset uses YAML to define the content, an identifier for the question, and the standard identifier it is linked to:
- id: question-id
  standard_id: FR-SYS-1-A.1.c
  text: ...
mappings.csv is a lengthy list of all the standards, with the associated skills that can replace them:
standard_id,skill_id
FR-SYS-1-A.1.a,2a5abd24-56f0-47c0-95bf-4384db773921
FR-SYS-1-A.1.b,6b67007a-e512-4b0b-9b02-df26366c38fc
FR-SYS-1-A.1.c,edf5dc7e-c5da-4fbf-9cf1-de8396537204
FR-SYS-1-A.2.a,b7fb5844-23fa-4271-9192-48d22d4b942b
The script should therefore update the asset by replacing the standard_id field with a skill_id field, using the appropriate mapping:
- id: question-id
  skill_id: edf5dc7e-c5da-4fbf-9cf1-de8396537204
  text: ...
We can imagine taking mappings.csv and deriving from it a list of sed commands to update the identifiers. However, this would mean running hundreds of commands against each file, which is slow and unnecessary: question assets are logically grouped such that only a few mappings are needed to update any one asset.
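For reference, that naive approach might look something like the sketch below: one sed substitution per CSV row, run against every asset. (The mappings.csv and question.yaml written here are hypothetical one-row stand-ins for the real assets, and GNU sed's -i flag is assumed.)

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical minimal versions of the real assets:
cat > mappings.csv <<'EOF'
standard_id,skill_id
FR-SYS-1-A.1.c,edf5dc7e-c5da-4fbf-9cf1-de8396537204
EOF
cat > question.yaml <<'EOF'
- id: question-id
  standard_id: FR-SYS-1-A.1.c
  text: ...
EOF

# Naive approach: every row of mappings.csv becomes a sed substitution,
# and every substitution is run against every asset file.
while IFS=',' read -r standard_id skill_id
do
  for file in *.yaml
  do
    sed -i "s/standard_id: $standard_id/skill_id: $skill_id/" "$file"
  done
done < <(tail -n +2 mappings.csv)

grep skill_id question.yaml
```

This works, but with hundreds of mappings and hundreds of assets it rewrites each file hundreds of times, which is exactly the cost the rest of the post avoids.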
Therefore, an efficient script will do the following:
- Read in mappings.csv and create a lookup table from standard_id to skill_id
- Process each asset file to determine the specific standard_id values to replace
- Look up the skill_id for each replacement and update the asset file
Reading CSVs the UNIX way
While there exist general-purpose CSV parsing tools to handle the myriad possibilities in this space, this mappings.csv file is easily read using the UNIX toolchain.
First, cut is used to extract a field at some position by tokenizing a line on a delimiter. For example, given $LINE as the first line of CSV data in mappings.csv:
❯ echo "$LINE" | cut -d ',' -f1
FR-SYS-1-A.1.a
❯ echo "$LINE" | cut -d ',' -f2
2a5abd24-56f0-47c0-95bf-4384db773921
The -d flag sets the delimiter (here a comma), and -f takes the number of the field to extract.
cat mappings.csv could be piped to cut to read in and extract the contents of the CSV, but this would include the header on the first line. Therefore, tail -n +2 is used instead to discard that line and pipe the rest to cut. tail -n +2 mappings.csv | cut -d ',' -f1 prints the contents of the first column of each line. Using command substitution, this output can be written to a variable for future use:
csv_file="mappings.csv"
arr_standard_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f1) )
arr_skill_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f2) )
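As a quick sanity check, the two arrays read this way should be parallel and equal in length. A minimal sketch, recreating the four-row mappings.csv shown above:

```shell
#!/usr/bin/env bash
# Recreate the sample CSV from the post:
cat > mappings.csv <<'EOF'
standard_id,skill_id
FR-SYS-1-A.1.a,2a5abd24-56f0-47c0-95bf-4384db773921
FR-SYS-1-A.1.b,6b67007a-e512-4b0b-9b02-df26366c38fc
FR-SYS-1-A.1.c,edf5dc7e-c5da-4fbf-9cf1-de8396537204
FR-SYS-1-A.2.a,b7fb5844-23fa-4271-9192-48d22d4b942b
EOF

# Read each column into an array; the unquoted command substitution
# relies on word splitting, so fields must not contain whitespace.
csv_file="mappings.csv"
arr_standard_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f1) )
arr_skill_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f2) )

echo "${#arr_standard_ids[@]} standards, ${#arr_skill_ids[@]} skills"
# => 4 standards, 4 skills
```

Note the assumption baked into the word-splitting trick: it is safe here because the identifiers contain no spaces or glob characters, but a CSV with free-form fields would need a real parser.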
Lookup tables using associative arrays
After reading in mappings.csv, both arr_standard_ids and arr_skill_ids are one-dimensional array variables. Some operations on an $array in Bash include:
- Accessing the value at a given index $i: ${array[$i]}
- Obtaining all keys (indices): ${!array[@]}
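Those two operations on a throwaway array:

```shell
#!/usr/bin/env bash
array=( alpha beta gamma )

i=1
echo "${array[$i]}"   # value at index 1
# => beta

echo "${!array[@]}"   # all indices of the array
# => 0 1 2
```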
One-dimensional arrays are not very ergonomic for lookups, so the next step is to build a map-like structure in which a known standard identifier maps to a skill identifier.
To build this map, an associative array is declared and the elements of the previously read arrays are zipped into key/value pairs:
declare -A map_standard_skill
for i in "${!arr_standard_ids[@]}"
do
  map_standard_skill["${arr_standard_ids[$i]}"]="${arr_skill_ids[$i]}"
done
A value in an associative array is accessed in the same manner as in a one-dimensional array, except that the indices can be arbitrary strings:
standard_id="FR-SYS-1-A.1.c"
echo "${map_standard_skill[$standard_id]}"
# => edf5dc7e-c5da-4fbf-9cf1-de8396537204
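One caveat worth noting: looking up a standard that has no mapping expands to an empty string, which would silently produce an empty skill_id in the output. A hedged sketch of a guard (the lookup_skill helper is hypothetical, not part of the post's script), using the ${var+x} expansion to distinguish a missing key from an empty value:

```shell
#!/usr/bin/env bash
declare -A map_standard_skill
map_standard_skill["FR-SYS-1-A.1.c"]="edf5dc7e-c5da-4fbf-9cf1-de8396537204"

lookup_skill () {
  local standard_id="$1"
  # ${map[key]+x} expands to "x" only when the key exists.
  if [[ -n "${map_standard_skill[$standard_id]+x}" ]]
  then
    echo "${map_standard_skill[$standard_id]}"
  else
    echo "no mapping for $standard_id" >&2
    return 1
  fi
}

lookup_skill "FR-SYS-1-A.1.c"
# => edf5dc7e-c5da-4fbf-9cf1-de8396537204
```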
3 Rs: Reading, ‘riting, and regex
Now that the lookup table is built, the script reads an input asset file to build a list of replacements. For example, if an asset defines standard_id: FR-SYS-1-A.1.c, the script should run the following replacement:
s/standard_id: FR-SYS-1-A\.1\.c/skill_id: edf5dc7e-c5da-4fbf-9cf1-de8396537204/
(Note the necessity of escaping special characters such as the dots in the pattern; ignore it at your own peril.)
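To see why escaping matters: in a regular expression an unescaped . matches any character, so the pattern can match identifiers it shouldn't. A small demonstration against a made-up identifier:

```shell
#!/usr/bin/env bash
other="FR-SYS-1-Ax1xc"   # a hypothetical identifier that is NOT A.1.c

# Unescaped, each "." matches ANY character, so the pattern matches:
[[ $other =~ FR-SYS-1-A.1.c ]] && echo "unescaped: matched"

# Escaped, "\." only matches a literal dot, so it does not:
[[ $other =~ FR-SYS-1-A\.1\.c ]] || echo "escaped: no match"
```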
A process_file function will take the filename as an argument, reading the file to build the list of replacements before writing to the file using sed:
process_file () {
  local file
  file="$1"
  declare -a substitutions
  while read -r line
  do
    # Do something to add to substitutions
  done < "$file"
  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}
for file in *.yaml; do
  process_file "$file"
done
Bash provides the =~ operator to match a string against an extended regular expression. If the expression contains capture groups, each captured group is available in the BASH_REMATCH array. This lets us search for a line with a standard_id YAML key and capture its value in order to look up the skill_id:
process_file () {
  local file
  file="$1"
  declare -a substitutions
  while read -r line
  do
    if [[ $line =~ ^standard_id:\ (.*)$ ]]
    then
      lookup="${BASH_REMATCH[1]}" # standard_id value
      skill_id="${map_standard_skill[$lookup]}"
      find="$line"
      replace="skill_id: $skill_id"
      substitutions+=("s/$find/$replace/")
    fi
  done < "$file"
  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}
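The =~ match and BASH_REMATCH behavior can be seen in isolation with a single sample line:

```shell
#!/usr/bin/env bash
line="standard_id: FR-SYS-1-A.1.c"

# Index 0 holds the whole match; index 1 holds the first capture group.
if [[ $line =~ ^standard_id:\ (.*)$ ]]
then
  echo "whole match: ${BASH_REMATCH[0]}"
  echo "capture 1:   ${BASH_REMATCH[1]}"
fi
# => whole match: standard_id: FR-SYS-1-A.1.c
# => capture 1:   FR-SYS-1-A.1.c
```

(One subtlety that makes this work on indented YAML: read -r without IFS= strips leading whitespace, so the ^standard_id anchor still matches lines indented in the file.)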
To our peril, we've so far ignored the need to escape special characters; this is remedied with an escape function and command substitution:
escape () {
  echo "${1//\./\\.}"
}
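For example (only . is escaped here; the identifiers in this data set contain no other regex metacharacters, but a more general script would escape characters like [, *, and / too):

```shell
#!/usr/bin/env bash
# Replace every "." with "\." so sed treats it as a literal dot.
escape () {
  echo "${1//\./\\.}"
}

escape "standard_id: FR-SYS-1-A.1.c"
# => standard_id: FR-SYS-1-A\.1\.c
```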
process_file () {
  local file
  file="$1"
  declare -a substitutions
  while read -r line
  do
    if [[ $line =~ ^standard_id:\ (.*)$ ]]
    then
      lookup="${BASH_REMATCH[1]}" # standard_id value
      skill_id="${map_standard_skill[$lookup]}"
      find=$(escape "$line")
      replace=$(escape "skill_id: $skill_id")
      substitutions+=("s/$find/$replace/")
    fi
  done < "$file"
  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}
for file in *.yaml; do
  process_file "$file"
done
Putting it together
Now that each step of the script has been worked through in isolation, we can glue the pieces together into the final product:
#!/usr/bin/env bash

# 1. Read in mappings CSV:
csv_file="mappings.csv"
arr_standard_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f1) )
arr_skill_ids=( $(tail -n +2 "$csv_file" | cut -d ',' -f2) )

# 2. Build associative array of standard id to skill id
declare -A map_standard_skill
for i in "${!arr_standard_ids[@]}"
do
  map_standard_skill["${arr_standard_ids[$i]}"]="${arr_skill_ids[$i]}"
done

# Convenience function for processing
escape () {
  echo "${1//\./\\.}"
}

# 3. Process an asset $file
process_file () {
  local file
  file="$1"
  declare -a substitutions
  while read -r line
  do
    if [[ $line =~ ^standard_id:\ (.*)$ ]]
    then
      lookup="${BASH_REMATCH[1]}" # standard_id value
      skill_id="${map_standard_skill[$lookup]}"
      find=$(escape "$line")
      replace=$(escape "skill_id: $skill_id")
      substitutions+=("s/$find/$replace/")
    fi
  done < "$file"
  for sub in "${substitutions[@]}"
  do
    sed -i "$sub" "$file"
  done
}

for file in *.yaml; do
  process_file "$file"
done
YAML and CSV assets are flexible and portable, but there is not always one clear method to make large changes to them. Bash scripts are similarly flexible and portable, making them a useful option when the need arises for bulk updates. Relying on the UNIX toolchain allows these scripts to be simple, composable, and stable. And who doesn’t love hacking together a little Bash script now and again!?