Introduction
In today’s data-driven world, extracting actionable insights from job postings can provide a competitive edge. However, this data isn’t always available in the format you need.
This guide will walk you through the process of downloading job posting data from AWS Data Exchange, converting JSON files to CSV, and importing the data into Google Sheets for further analysis.
Understanding Job Postings
Job postings are typically aimed at potential new employees: they create awareness that a company exists and is hiring for a specific role, and they provide insight into what the company does, plans, wants, uses, and builds. Their content generally includes information such as:
- Company Information: Industry, company size, location, goals, products, achievements, vision, mission, values, and culture.
- Project Information: The project, product, or solution the employee will work on, or the department the employee will be part of.
- Job Information: Specific tasks, role, responsibilities, objectives, expected start date, and workplace.
- Requirement Information: Experience, industry background, skills, competencies, qualifications, certifications, working hours, work location, and travel requirements.
- Work Environment Information: Tools and technologies used in daily operations, as well as requirements related to office culture (e.g., dress code).
- Compensation Information: Salary range, bonus structure, benefits, perks, and incentives.
- Application Information: Required application documents, application deadline, and contact information such as phone numbers, email addresses, or contact forms.
Job postings may also include details about departments or teams the employee will collaborate with, career paths, opportunities for advancement, personal budgets for hardware or conferences, or social activities within the team or company.
Origin of Job Postings
Companies distribute their job postings across various platforms, including their own websites, career pages of HR tools (e.g., Personio or SAP SuccessFactors), job boards (e.g., Monster or StepStone), job aggregators (e.g., Indeed or CareerJet), job meta-search engines (e.g., Trovit or JobRapido), and social media platforms (e.g., LinkedIn or Facebook).
Techmap's job posting data is sourced from many of these platforms, cleaned, normalized, and stored in JSON-formatted data files in AWS S3. This data is offered on the AWS Data Exchange (ADX) platform. ADX is a service that enables customers to securely find, subscribe to, and use third-party data in the cloud. It allows businesses to leverage a wide variety of datasets for analytics, machine learning, and other data-driven applications.
In the remainder of this guide, we will use the free Luxembourg data feed from Techmap on ADX, which is in JSON format, and convert it into CSV.
Techmap's JSON format for Job Postings
Our data is exported daily in JSON format and stored in gzip-compressed files. In the uncompressed files, each line contains a job posting in JSON format, including many direct fields such as name, text, html, url, and dateCreated, as well as more complex data structures for company, location, json (from the job page), and others.
The json field contains several subfields, such as schemaOrg for data in the Schema.org Microdata JSON format, as well as source-specific JSON data found on the webpage in fields like pageData, pageData2, and others.
You can find the complete data structure in our Data Overview, which provides detailed information about the fields and formats, including nested fields and their global frequency.
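Once you have downloaded a data file (see Step 1 below), a quick way to get a feel for this structure is to inspect the first posting with jq; the file name below is just an example and should be adjusted to a file you actually have.
# Show the top-level field names of the first job posting
gunzip -c techmap_jobs_lu_2025-02-01.jsonl.gz | head -n 1 | jq 'keys'
# Show a few selected fields of the first job posting
gunzip -c techmap_jobs_lu_2025-02-01.jsonl.gz | head -n 1 | jq '{name, url, dateCreated, company: .company.nameOrg}'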
Prerequisites for the Conversion
To get started with downloading and processing job posting data from AWS Data Exchange, you’ll need to set up a few tools and accounts. This guide assumes basic familiarity with cloud services and command-line operations. Before diving into the steps, make sure you have the following prerequisites ready:
- AWS Account: Sign up for an AWS account if you don't already have one.
- AWS CLI: Install and configure the AWS CLI on your machine.
- Bash commands: Install the command-line tools needed to uncompress files and process JSON in bash. On Ubuntu, this can be done by executing:
sudo apt update && sudo apt install gzip jq -y
- Google Account: A Google account to use Google Sheets. [Optional]
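A quick way to confirm that these prerequisites are in place is to check the tool versions and your AWS credentials:
aws --version                  # AWS CLI is installed
jq --version                   # jq is installed
gzip --version | head -n 1     # gzip is installed
aws sts get-caller-identity    # AWS CLI is configured with valid credentials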
Step 1: Downloading Data from ADX
In this step, we'll subscribe to the Techmap data feed on AWS Data Exchange (ADX) and download the job posting data using the AWS CLI. Follow the instructions below to get started.
1.1. Subscribe to the Data Feed
To access the job posting data, you’ll first need to subscribe to the free Luxembourg data feed provided by Techmap. Here’s how:
- Navigate to the free Luxembourg data feed from Techmap on the AWS Data Exchange.
- Subscribe to the data feed to gain access to the job posting data. AWS will provide you with an alias for our AWS S3 bucket, through which you can access the data files.
1.2. Download Data Using AWS CLI
Once subscribed, you can set the AWS S3 bucket alias and filter the files to a specific month of interest as follows:
export YOUR_BUCKET_ALIAS=<YOUR_BUCKET_ALIAS>
export YEAR_MONTH=2025-02
Then, you can use the AWS CLI to download the data files.
List all files
aws s3api list-objects-v2 \
--request-payer requester \
--bucket $YOUR_BUCKET_ALIAS \
--prefix "lu/techmap_jobs_lu_$YEAR_MONTH-" | grep Key
Download one individual file
aws s3 cp \
--request-payer requester \
s3://$YOUR_BUCKET_ALIAS/lu/techmap_jobs_lu_$YEAR_MONTH-01.jsonl.gz .
Download all files
aws s3 sync \
s3://$YOUR_BUCKET_ALIAS/lu/ . \
--request-payer requester \
--exclude "*" \
--include "techmap_jobs_lu_$YEAR_MONTH-*.jsonl.gz"
Step 2: Converting JSON to CSV
Now that we have the data files containing job postings in JSON format, we can use a bash script to loop through all the files, uncompress them, and convert the relevant parts of the JSON into cell data in a CSV file using jq. We'll extract various types of fields so that you can customize the output later to suit your needs.
First, create a file called convert-json-to-csv.sh to convert the JSON files to CSV.
#!/bin/bash
# Loop over all .jsonl.gz files in the current directory
for file in techmap_jobs_*.jsonl.gz; do
# Extract the country code (e.g., lu) and date (e.g., 2025-02-01) from the filename
countrycode=$(echo "$file" | sed -E 's/techmap_jobs_([a-zA-Z]{2})_.*/\1/')
date=$(echo "$file" | sed 's/.*_\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)\.jsonl\.gz/\1/')
# Create the output CSV file with the same name but .csv extension
output_file="techmap_jobs_${countrycode}_${date}.csv"
# Write the header to the CSV file (only if the file doesn't already exist)
if [ ! -f "$output_file" ]; then
echo "source,countryCode,dateCreated,name,url,languageLocale,contact,salary,position,location,companyName,companyURL,industry,text,html,schemaOrg" > "$output_file"
fi
# Process the .jsonl.gz file and write to the corresponding CSV file
gunzip -c "$file" | jq -r '
# Build one CSV row per posting from the selected fields
[
# Extract simple data fields
.source,
.sourceCC,
.dateCreated,
.name,
.url,
.locale,
# Convert non-scalar fields (objects) to strings, using "" when missing - drill down if you like
(.contact // "" | tostring),
(.salary // "" | tostring),
(.position // "" | tostring),
# Concatenate location fields into a single string
([.location.orgAddress.street, .location.orgAddress.city, .location.orgAddress.state, .location.orgAddress.country] | map(select(. != null and . != "")) | join(", ")),
# Flatten company object
.company.nameOrg,
.company.url,
.company.info.industry,
# Extract large data fields
.text,
.html,
(.json.schemaOrg // "" | tostring)
] | @csv' >> "$output_file"
gzip -f "$output_file"
echo "Processed: $file -> ${output_file}.gz"
done
echo "CSV files created."
Before executing the script, run chmod +x convert-json-to-csv.sh to make it executable; without this permission the script cannot be run directly.
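For reference, the complete sequence to prepare and run the script from the directory containing the downloaded .jsonl.gz files looks like this:
chmod +x convert-json-to-csv.sh   # make the script executable (one-time step)
./convert-json-to-csv.sh          # convert all techmap_jobs_*.jsonl.gz files in the current directory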
Once the script is executed, it will convert all JSON data files into CSV files, with each file corresponding to a single day's data. These CSV files will be ready for further analysis or integration into tools like Google Sheets or Excel.
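To spot-check the result, you can preview the header and the beginning of the first rows of one of the generated files (the date is just an example):
gunzip -c techmap_jobs_lu_2025-02-01.csv.gz | head -n 3 | cut -c 1-200   # first lines, truncated for readability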
Step 3: Importing CSV files into your Spreadsheet Program
Importing the CSV files is straightforward once you've uncompressed the .gz files (or removed the gzip command at the end of the script). Most spreadsheet programs read CSV directly, so the data fits into your existing workflow for analysis, visualization, and sharing.
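For example, to uncompress all generated CSV files for the Luxembourg feed used in this guide:
gunzip -f techmap_jobs_lu_*.csv.gz   # restores the plain .csv files for import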
In Google Sheets (via Google Workspace or Google Drive), you can import the CSV file through File > Import, either uploading it or selecting it from your Drive. The data is loaded directly into the spreadsheet, where you can further organize, analyze, and visualize it as needed.
You can also import the CSV into programs like Microsoft Excel or Tables on macOS. In Excel, go to File > Import, select the CSV, and load it into a worksheet. In Tables, choose File > Import and open the CSV.
Furthermore, you can create a Python script to automatically upload the CSV files to your Google Drive using the Google Sheets API with the gspread library.
Conclusion
In this guide, we've downloaded job posting data from AWS Data Exchange, converted the JSON files to CSV files, and imported the data into a Google Spreadsheet. This process provides a foundation for more advanced data analysis and visualization tasks.
For further exploration, consider automating this workflow using cloud functions or integrating machine learning models to extract deeper insights from the job postings.
Happy analyzing!