Introduction
In today’s data-driven world, integrating job posting data into your infrastructure often requires converting it between different formats. This article provides a step-by-step guide for developers and data engineers on how to convert job postings from JSON to XML using only Bash commands. Whether you’re working with APIs, legacy systems, or data pipelines, this guide will help you streamline the process and ensure your data is ready to use.
We'll start by setting up the required tools (jq and xmlstarlet), downloading job data in JSON format, and writing a Bash script to perform the conversion. By the end, you'll have an automated process for transforming job postings into structured XML, ready for further processing or integration.
1. Understanding JSON and XML Formats
JSON (JavaScript Object Notation) and XML (Extensible Markup Language) are two widely used formats for data representation and exchange. JSON is a lightweight, human-readable format often used in web APIs and modern applications due to its simplicity and ease of parsing. XML, on the other hand, is more verbose and structured, and is widely used in enterprise applications, document storage, and configuration files. While both formats serve similar purposes, their structural differences create challenges when converting data between them.
For this tutorial, consider the following JSON structure, which represents a shortened version of a job posting in Techmap's dataset. The job posting JSON is structured hierarchically. At the top level, it contains fields like dateCreated, source, and name. Nested within these are objects like location and salary, which further break down into subfields. Arrays, such as orgTags.KEYWORDS, allow multiple values to be stored under a single key.
{
  "source": "careerjet_lu",
  "name": "Manager Accounting German Speaker - Ettelbruck",
  "url": "https://www.careerjet.lu/jobad/lucd2a076d8a39c18a101fa654cd21b470",
  "dateCreated": "2025-02-01T00:07:41+0000",
  "location": {...},
  "company": {
    "name": "Abiomis", ...
  },
  "position": {
    "contractType": "Permanent",
    "workType": "FullTime"
  },
  "salary": {
    "text": "90000 (EUR per YEAR)",
    "minValue": 90000.0, ...
  },
  "text": "Manager Accounting German Speaker ...",
  "html": "<h2>Manager Accounting German Speaker</h2> ...",
  "json": {
    "schemaOrg": {
      "title": "Manager Accounting German Speaker - Ettelbruck",
      "datePosted": "2025-02-01T00:07:41Z", ...
    }, ...
  },
  "orgTags": {
    "KEYWORDS": [
      "Accounting"
    ], ...
  }, ...
}
When converting JSON to XML, key challenges arise due to structural differences. JSON uses arrays and nested objects, which do not have direct equivalents in XML. While JSON represents arrays with square brackets ([ ]), XML relies on repeating elements. Additionally, JSON allows flexible key-value pairs, whereas XML enforces a hierarchical structure with defined tags. This means that an array of objects in JSON must be converted into multiple repeated XML elements, requiring careful transformation to maintain data integrity.
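To make the array mapping concrete, here is a minimal sketch using jq (which we install in the next section). The skills field and the <skill> tag are invented for illustration and are not part of the Techmap schema:

```shell
# Sketch: map a JSON array to repeated XML elements with jq.
# "skills" and <skill> are illustrative names, not Techmap fields.
echo '{"skills": ["Bash", "jq", "XML"]}' | jq -r '
  .skills | map("<skill>" + . + "</skill>") | join("")
'
# Prints: <skill>Bash</skill><skill>jq</skill><skill>XML</skill>
```

Each array entry becomes one repeated element, which is the standard way XML expresses a JSON array.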
Another challenge is handling data types and attributes. JSON does not differentiate between attributes and elements, but XML does. For example, in XML, some values can be stored as attributes within tags, while others must be enclosed as elements. And JSON does not natively support attributes like XML does, which means extra processing is required to determine whether a JSON key should be represented as an XML element or attribute. Ensuring correct data representation while maintaining readability in both formats is a critical aspect of JSON-to-XML conversion.
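As an illustration, the same salary value could be modeled either way in XML. The tags below are hypothetical, and the conversion script later in this guide sticks to child elements only:

```shell
# Two valid XML representations of the same data (hypothetical tags).
cat <<'EOF'
<salary currency="EUR">90000</salary>                          <!-- attribute + text -->
<salary><currency>EUR</currency><value>90000</value></salary>  <!-- elements only -->
EOF
```

Using elements everywhere keeps the transformation logic simple, at the cost of slightly more verbose output.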
Finally, encoding and formatting differences must be considered. JSON supports a more compact structure with fewer constraints, while XML enforces stricter syntax, including required closing tags and predefined namespaces. Special characters like "<", ">", and "&" (especially in the html field) must be escaped in XML but not in JSON, requiring transformation logic to prevent parsing errors. Despite these challenges, proper schema mapping and transformation techniques enable seamless interoperability between JSON-based web services and XML-based systems.
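A quick sketch of such escaping logic in jq (the sample input is invented). Note that "&" must be replaced first, otherwise the "&" produced by "&lt;" and "&gt;" would itself be escaped again:

```shell
# Sketch: escaping XML special characters in jq (invented sample input).
# "&" is escaped first so already-produced entities are not double-escaped.
echo '{"html": "<h2>R&D Manager</h2>"}' | jq -r '
  .html
  | gsub("&"; "&amp;")
  | gsub("<"; "&lt;")
  | gsub(">"; "&gt;")
'
# Prints: &lt;h2&gt;R&amp;D Manager&lt;/h2&gt;
```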
2. Setting Up The Environment
To convert job postings from JSON format to XML using Bash commands, we need essential command-line utilities: jq for processing JSON data and xmlstarlet for formatting and validating the XML output. These tools allow efficient transformation of structured data directly in the terminal without requiring additional programming languages or libraries. This section will guide you through installing these utilities on different operating systems.
On Debian-based Linux distributions such as Ubuntu, you can install jq and xmlstarlet using the APT package manager. Open a terminal and run the following command:
sudo apt update && sudo apt install -y jq xmlstarlet
For macOS users, Homebrew is the recommended package manager. With Homebrew installed, you can install jq and xmlstarlet using:
brew install jq xmlstarlet
On Red Hat-based distributions like CentOS, Fedora, or Rocky Linux, you can use dnf or yum:
sudo dnf install -y jq xmlstarlet # For Fedora
sudo yum install -y jq xmlstarlet # For CentOS/RHEL
After installation, you can verify that both tools are correctly set up by running:
jq --version
xmlstarlet --version
With these utilities installed, your system is ready to process job postings in JSON format and transform them into XML using simple Bash scripts.
3. Downloading the JSON Job Postings
To obtain job postings in JSON format, you need access to a reliable data source. Techmap provides a free Luxembourg job postings feed on AWS Data Exchange alongside paid data feeds for other countries such as the US. This section will guide you through subscribing to the feed and downloading the data using the AWS Command Line Interface (CLI).
- Subscribing to the Techmap Data Feed: To access the job postings, visit the Techmap Luxembourg Data Feed and subscribe to it. Once subscribed, AWS will provide you with an S3 bucket alias, which you will use to download the dataset. The data is organized in compressed daily export files per country, where each file contains all job postings in JSON Lines format for a specific day.
- Setting Up AWS CLI for Downloading Data: Before downloading job postings, ensure that you have the AWS CLI installed and configured with the necessary permissions.
- Downloading the Job Postings: Once your AWS CLI is set up, export your S3 bucket alias and specify the month of data you want to download. To get some test data without downloading all files, we focus on a single month.
export YOUR_BUCKET_ALIAS=<YOUR_BUCKET_ALIAS>
export YEAR_MONTH=2025-02
To list available files for the selected month, use:
aws s3api list-objects-v2 \
  --request-payer requester \
  --bucket $YOUR_BUCKET_ALIAS \
  --prefix "lu/techmap_jobs_lu_$YEAR_MONTH-" | grep Key
To download a single file, run:
aws s3 cp \
  --request-payer requester \
  s3://$YOUR_BUCKET_ALIAS/lu/techmap_jobs_lu_$YEAR_MONTH-01.jsonl.gz .
For downloading all job postings from the month:
aws s3 sync \
  s3://$YOUR_BUCKET_ALIAS/lu/ . \
  --request-payer requester \
  --exclude "*" \
  --include "techmap_jobs_lu_$YEAR_MONTH-*.jsonl.gz"
With these steps completed, you will have downloaded the compressed data files with the job postings in JSON format, ready for transformation into XML.
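Before writing the converter, it can help to peek at what you downloaded. The snippet below is a sketch that builds a synthetic one-line .jsonl.gz file (so it runs without the real feed) and then shows the inspection pipeline you would point at a real techmap_jobs_*.jsonl.gz file:

```shell
# Build a synthetic one-line JSON-Lines file as a stand-in for a real download
printf '%s\n' '{"source":"careerjet_lu","name":"Manager Accounting","dateCreated":"2025-02-01T00:07:41+0000"}' \
  | gzip > sample_jobs.jsonl.gz

# Inspection pipeline: decompress, take the first posting, extract a field
gunzip -c sample_jobs.jsonl.gz | head -n 1 | jq -r '.name'
# Prints: Manager Accounting
```

Because each line is a complete JSON document, head and jq can inspect individual postings without decompressing the whole archive to disk.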
4. Converting JSON to XML Using Bash
Now, we’ll walk through the process of converting job postings from JSON to XML using Bash. The script we’ll use leverages jq for JSON processing and xmlstarlet for XML formatting. It reads JSON Lines (.jsonl.gz) files, extracts relevant fields, and converts them into XML format. It also handles nested objects, arrays, and special characters, ensuring that the XML output adheres to the required structure. We'll extract various types of fields so that you can customize the output later to suit your needs.
Create a file called convert-json-to-xml.sh to convert the JSON files to XML.
#!/bin/bash

# Ensure jq and xmlstarlet are installed
if ! command -v jq &> /dev/null || ! command -v xmlstarlet &> /dev/null; then
  echo "Error: Missing required tools. Install them using: sudo apt install jq xmlstarlet"
  exit 1
fi

# Loop over all .jsonl.gz files in the current directory
for file in techmap_jobs_*.jsonl.gz; do
  # Extract the country code (e.g., lu) and date (e.g., 2025-02-01) from the filename
  countrycode=$(echo "$file" | sed -E 's/techmap_jobs_([a-zA-Z]{2})_.*/\1/')
  date=$(echo "$file" | sed 's/.*_\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)\.jsonl\.gz/\1/')

  # Define output file name
  output_xml="techmap_jobs_${countrycode}_${date}.xml"

  # Initialize XML file
  echo '<?xml version="1.0" encoding="UTF-8"?>' > "$output_xml"
  echo "<jobs>" >> "$output_xml"

  # Process the .jsonl.gz file and append one <job> element per posting
  gunzip -c "$file" | jq -c '.' | while read -r line; do
    echo "$line" | jq -r '
      # Function to escape XML special characters ("&" must come first)
      def escape_xml:
        gsub("&"; "&amp;") | gsub("<"; "&lt;") | gsub(">"; "&gt;") | gsub("\""; "&quot;") | gsub("\u0027"; "&apos;");

      # Function to convert arrays to a string with a separator
      def array_to_string($separator):
        if type == "array" then map(tostring) | join($separator) else "" end;

      # Function to safely wrap content in CDATA
      def safe_cdata:
        if . == null or . == "" then "<![CDATA[]]>" else "<![CDATA[" + . + "]]>" end;

      # Convert JSON to XML
      . as $job |
      "<job>" +
      "<source>\($job.source // "" | escape_xml)</source>" +
      "<countryCode>\($job.sourceCC // "" | escape_xml)</countryCode>" +
      "<dateCreated>\($job.dateCreated // "" | escape_xml)</dateCreated>" +
      "<name>\($job.name // "" | escape_xml)</name>" +
      "<url>\($job.url // "" | escape_xml)</url>" +
      "<languageLocale>\($job.locale // "" | escape_xml)</languageLocale>" +
      "<referenceID>\($job.referenceID // "" | escape_xml)</referenceID>" +
      "<contact>\($job.contact // "" | tostring | escape_xml)</contact>" +
      "<salary>\($job.salary // "" | tostring | escape_xml)</salary>" +
      "<position>\($job.position // "" | tostring | escape_xml)</position>" +
      "<location>\($job.location.orgAddress.street // "" | escape_xml), \($job.location.orgAddress.city // "" | escape_xml), \($job.location.orgAddress.state // "" | escape_xml), \($job.location.orgAddress.country // "" | escape_xml)</location>" +
      "<companyName>\($job.company.nameOrg // "" | escape_xml)</companyName>" +
      "<companyURL>\($job.company.url // "" | escape_xml)</companyURL>" +
      "<industry>\($job.company.info.industry // "" | escape_xml)</industry>" +
      "<categories>\($job.orgTags.CATEGORIES | array_to_string(";") | escape_xml)</categories>" +
      "<industries>\($job.orgTags.INDUSTRIES | array_to_string(";") | escape_xml)</industries>" +
      "<contractTypes>\($job.orgTags.CONTRACT_TYPES | array_to_string(";") | escape_xml)</contractTypes>" +
      "<experienceRequirements>\($job.orgTags.EXPERIENCE_REQUIREMENTS | array_to_string(";") | escape_xml)</experienceRequirements>" +
      "<careerLevels>\($job.orgTags.CAREER_LEVELS | array_to_string(";") | escape_xml)</careerLevels>" +
      "<workTypes>\($job.orgTags.WORK_TYPES | array_to_string(";") | escape_xml)</workTypes>" +
      "<text>\($job.text // "" | escape_xml)</text>" +
      "<html>\($job.html // "" | safe_cdata)</html>" +
      "<schemaOrg>\($job.json.schemaOrg // "" | tostring | escape_xml)</schemaOrg>" +
      "</job>"
    ' >> "$output_xml"
  done

  # Close the root XML element
  echo "</jobs>" >> "$output_xml"

  # Format XML for readability
  xmlstarlet fo "$output_xml" > "${output_xml}.formatted"
  mv "${output_xml}.formatted" "$output_xml"

  # Compress the XML file
  gzip -f "$output_xml"

  echo "Processed: $file -> ${output_xml}.gz"
done
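After running the converter, you may want to confirm that each generated file is well-formed before shipping it downstream. Here is a minimal sketch using a tiny hand-made file in place of a real techmap_jobs_*.xml output:

```shell
# Create a stand-in for a generated XML file
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
  '<jobs><job><name>Test</name></job></jobs>' > sample.xml

# xmlstarlet val exits non-zero for malformed XML, so it works in scripts
if xmlstarlet val -q sample.xml; then
  echo "sample.xml is well-formed"
else
  echo "sample.xml is NOT well-formed" >&2
fi
```

Adding such a check before the gzip step would let the script skip or flag broken output files instead of compressing them silently.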
The script provided is highly customizable, allowing you to adapt it to your specific needs. You can easily add or remove data fields by modifying the jq commands within the script. For example, if you want to include additional fields like jobType or salaryRange, you can extend the XML generation logic. Similarly, if certain fields are not required, you can exclude them to simplify the output. This flexibility makes the script a versatile tool for integrating job posting data into various workflows, whether you’re building a data pipeline, feeding an API, or analyzing job market trends.
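As a sketch of such an extension, a hypothetical <workType> element could be derived from position.workType (a field present in the sample posting above) with a one-line addition to the jq template:

```shell
# Sketch: one extra element for the jq template. <workType> is a
# hypothetical tag; adapt the JSON path to whatever field you need.
echo '{"position": {"contractType": "Permanent", "workType": "FullTime"}}' | jq -r '
  "<workType>\(.position.workType // "")</workType>"
'
# Prints: <workType>FullTime</workType>
```

In the full script, this line would be concatenated into the "<job>" + ... string like the existing elements, with escape_xml applied if the value can contain special characters.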
Once you’ve customized the script to suit your requirements, you can seamlessly integrate it into your existing workflows. For instance, you can schedule the script to run periodically using a cron job, ensuring that your XML files are always up-to-date with the latest job postings.
5. Automating the Conversion Process
Automating the conversion of job postings from JSON to XML saves time and ensures that your data is always up-to-date. By setting up a cron job, you can schedule the script to run at regular intervals, downloading the latest job posting files from AWS S3 and converting them to XML using the script we’ve developed.
Setting Up a Cron Job
A cron job is a time-based task scheduler in Unix-like operating systems. To set up a cron job for the conversion script, follow these steps:
- Open the crontab editor by running:
  crontab -e
- Add a new line to schedule the script. For example, to run the script daily at 8 AM, add:
  0 8 * * * /path/to/your/script.sh
  Here, 0 8 * * * means "at 8:00 AM every day." Replace /path/to/your/script.sh with the full path to your script.
- Save and exit the crontab editor. The cron job is now scheduled and will run automatically at the specified time.
Automating Daily Downloads from AWS S3
To ensure that the script processes the latest job postings, you can automate the daily download of files from AWS S3. Modify the script to include the AWS CLI commands for downloading the files before processing them. Here’s an example of how to integrate the download step:
#!/bin/bash
# Set AWS S3 bucket alias and date
export YOUR_BUCKET_ALIAS=<YOUR_BUCKET_ALIAS>
export YEAR_MONTH=$(date +%Y-%m)
# Download the latest files from AWS S3
aws s3 sync \
s3://$YOUR_BUCKET_ALIAS/lu/ . \
--request-payer requester \
--exclude "*" \
--include "techmap_jobs_lu_$YEAR_MONTH-*.jsonl.gz"
# Run the conversion script
/path/to/your/script.sh
This script downloads the latest files for the current month and then runs the conversion script. You can schedule this combined script using a cron job as described earlier.
Conclusion
In this tutorial, we explored the process of converting JSON job postings into XML using Bash and command-line utilities like jq and xmlstarlet. We examined the key differences between JSON and XML, highlighting the challenges of converting job postings between these formats. Then, we set up the necessary tools across different operating systems, obtained job postings in JSON format from the Techmap data feed, and created a Bash script to transform them into structured XML.
Looking ahead, you can refine the script to handle more complex data structures, optimize performance for large datasets, and integrate validation mechanisms. Whether you’re analyzing job market trends or feeding data into APIs, this approach provides a solid foundation for handling job posting data effectively.