In today’s data-driven business landscape, job postings are a goldmine of insights for companies looking to gain a competitive edge. From understanding market trends to optimizing recruitment strategies, the data embedded in job postings can drive actionable decisions. However, this data often comes in JSON format, which isn’t always ideal for analysis.
This comprehensive guide will walk you through the process of downloading job posting data from AWS Data Exchange, converting JSON files to CSV, and transforming them into Parquet format for efficient storage and analysis. Whether you're a data engineer, analyst, or business leader, this guide will help you unlock the full potential of job posting data.
Why Job Posting Data Matters for Businesses
Job postings are more than just recruitment tools—they are a reflection of a company’s goals, strategies, and market positioning. By analyzing job postings, businesses can:
- Identify Market Trends: Understand which skills, roles, and industries are in demand.
- Benchmark Competitors: Gain insights into competitors’ hiring strategies and organizational priorities.
- Optimize Recruitment: Tailor job descriptions and requirements to attract top talent.
- Drive Strategic Decisions: Use data to inform business expansion, product development, and workforce planning.
However, to extract these insights, you need the data in a format that’s easy to work with. That’s where converting JSON to Parquet comes in.
Understanding Job Posting Data
Job postings typically contain a wealth of structured and unstructured data, including:
- Company Information: Industry, size, location, goals, products, and culture.
- Job Details: Role, responsibilities, required skills, qualifications, and compensation.
- Application Requirements: Deadlines, required documents, and contact information.
- Work Environment: Tools, technologies, and workplace policies.
This data is often distributed across platforms like company websites, job boards (e.g., Indeed, LinkedIn), and HR tools (e.g., SAP SuccessFactors). Techmap aggregates this data, cleans it, and makes it available on AWS Data Exchange (ADX) in JSON format.
Why Convert JSON to Parquet?
While JSON is a flexible format, it’s not always the most efficient for large-scale data analysis. Parquet, on the other hand, is a columnar storage format optimized for performance and storage efficiency. Here’s why Parquet is a better choice:
- Faster Queries: Parquet’s columnar structure allows for quicker data retrieval.
- Reduced Storage Costs: Parquet files are highly compressed, saving storage space.
- Compatibility: Parquet works seamlessly with big data tools like Apache Spark, AWS Athena, and Pandas.
By converting JSON to Parquet, you can streamline your data workflows and improve analysis efficiency.
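As a quick illustration of the columnar benefit, here is a minimal Python sketch; it assumes pandas and pyarrow are installed, and the filename is illustrative, matching the output of the conversion script later in this guide. Only the requested columns are read from disk:

```python
import pandas as pd

# Read just three columns from a converted posting file; Parquet's columnar
# layout means the remaining columns are never loaded from disk.
df = pd.read_parquet(
    "techmap_jobs_lu_2025-02-01.parquet",  # illustrative filename from Step 2 below
    columns=["dateCreated", "name", "companyName"],
)
print(df.head())
```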
Step 1: Downloading Job Posting Data from AWS Data Exchange
1.1. Subscribe to the Techmap Data Feed
To access the job posting data, you’ll need to subscribe to the free Luxembourg data feed from Techmap on AWS Data Exchange. Here’s how:
- Visit the Techmap Luxembourg Data Feed on AWS Data Exchange.
- Subscribe to the feed to gain access to the data. AWS will provide an alias for the S3 bucket where the data is stored.
1.2. Download Data Using AWS CLI
Once subscribed, use the AWS CLI to download the data. Here’s how:
- Set your S3 bucket alias and specify the month of interest:
export YOUR_BUCKET_ALIAS=<YOUR_BUCKET_ALIAS>
export YEAR_MONTH=2025-02
- List all files for the specified month (note the double quotes so that $YEAR_MONTH is expanded):
aws s3api list-objects-v2 \
  --request-payer requester \
  --bucket $YOUR_BUCKET_ALIAS \
  --prefix "lu/techmap_jobs_lu_$YEAR_MONTH-" | grep Key
- Download a single file:
aws s3 cp \
  --request-payer requester \
  s3://$YOUR_BUCKET_ALIAS/lu/techmap_jobs_lu_$YEAR_MONTH-01.jsonl.gz .
- Download all files for the month:
aws s3 sync \
  s3://$YOUR_BUCKET_ALIAS/lu/ . \
  --request-payer requester \
  --exclude "*" \
  --include "techmap_jobs_lu_$YEAR_MONTH-*.jsonl.gz"
Step 2: Converting JSON to Parquet via CSV
Once you’ve downloaded the JSON files, the next step is to convert them first into CSV format for easier manipulation and then into Parquet format. Here’s how:
- Create a bash script (convert-json-to-parquet.sh) to automate the conversion:
#!/bin/bash
# Ensure jq and csv2parquet are installed
if ! command -v jq &> /dev/null || ! command -v csv2parquet &> /dev/null; then
echo "Error: Missing required tools. Install jq (e.g., sudo apt install jq) and csv2parquet."
exit 1
fi
# Loop over all .jsonl.gz files in the current directory
for file in techmap_jobs_*.jsonl.gz; do
# Extract the country code (e.g., lu) and date (e.g., 2025-02-01) from the filename
countrycode=$(echo "$file" | sed -E 's/techmap_jobs_([a-zA-Z]{2})_.*/\1/')
date=$(echo "$file" | sed 's/.*_\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)\.jsonl\.gz/\1/')
# Define output file names
output_csv="techmap_jobs_${countrycode}_${date}.csv"
output_parquet="techmap_jobs_${countrycode}_${date}.parquet"
# Write the header to the CSV file (only if the file doesn't already exist)
if [ ! -f "$output_csv" ]; then
echo "source,countryCode,dateCreated,name,url,languageLocale,referenceID,contact,salary,position,location,companyName,companyURL,industry,orgTags.CATEGORIES,orgTags.INDUSTRIES,orgTags.CONTRACT_TYPES,orgTags.EXPERIENCE_REQUIREMENTS,orgTags.CAREER_LEVELS,orgTags.WORK_TYPES,text,html,schemaOrg" > "$output_csv"
fi
# Process the .jsonl.gz file and write to the corresponding CSV file
gunzip -c "$file" | jq -r '
# Build one CSV row per posting by selecting the needed fields (the raw .json field is kept so schemaOrg can be extracted below)
[
.source,
.sourceCC,
.dateCreated,
.name,
.url,
.locale,
.referenceID,
# Convert non-scalar fields (objects) to strings with tostring - drill down into sub-fields if you like
(.contact // "" | tostring),
(.salary // "" | tostring),
(.position // "" | tostring),
# Concatenate location fields into a single string
([.location.orgAddress.street, .location.orgAddress.city, .location.orgAddress.state, .location.orgAddress.country] | map(select(. != null and . != "")) | join(", ")),
# Flatten company object
.company.nameOrg,
.company.url,
.company.info.industry,
# Handle null arrays and join them with a semicolon if not null
(.orgTags.CATEGORIES // [] | join(";")),
(.orgTags.INDUSTRIES // [] | join(";")),
(.orgTags.CONTRACT_TYPES // [] | join(";")),
(.orgTags.EXPERIENCE_REQUIREMENTS // [] | join(";")),
(.orgTags.CAREER_LEVELS // [] | join(";")),
(.orgTags.WORK_TYPES // [] | join(";")),
# Extract large data fields
.text,
.html,
(.json.schemaOrg // "" | tostring)
] | @csv' >> "$output_csv"
# Convert the CSV to Parquet (Parquet is already compressed internally, so it is not gzipped)
csv2parquet "$output_csv" -o "$output_parquet"
gzip -f "$output_csv" # or delete the intermediate CSV instead: rm "$output_csv"
echo "Processed: $file -> ${output_csv}.gz -> ${output_parquet}"
done
- Make the script executable:
chmod +x convert-json-to-parquet.sh
- Run the script to generate the Parquet files:
./convert-json-to-parquet.sh
Step 3: Importing Data into Your Analytics Workflow
Once your job posting data is in Parquet format, it becomes highly versatile and can be integrated into various analytics workflows. Parquet’s columnar storage and efficient compression make it ideal for large-scale data processing and analysis. Below are detailed examples of how you can use Parquet files in different analytics environments:
- Analyzing Data with Python and Pandas: Python’s Pandas library is a popular tool for data analysis. Parquet files can be read directly into a Pandas DataFrame for further manipulation and analysis (see the Pandas sketch after this list).
- Querying Data with Apache Spark: Apache Spark is a powerful distributed computing framework that works seamlessly with Parquet files and is ideal for processing large datasets (see the PySpark sketch after this list).
- Visualizing Data with Tableau or Power BI: Parquet files can be connected to business intelligence (BI) tools like Tableau or Power BI for interactive visualizations.
- Analyzing Data with AWS Athena: AWS Athena is a serverless query service that allows you to analyze data directly from Parquet files stored in S3.
- Building Machine Learning Models: Parquet files can be used as input for machine learning pipelines to predict trends, classify job postings, or recommend roles.
- Integrating with Data Warehouses: Parquet files can be loaded into data warehouses like Snowflake, BigQuery, or Redshift for centralized analysis.
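As a concrete starting point, here is a minimal Pandas sketch; it assumes pandas and pyarrow are installed (pip install pandas pyarrow) and that the Parquet files produced by the Step 2 script sit in the current directory:

```python
import glob

import pandas as pd

# Load every daily Parquet file for the month into one DataFrame
files = sorted(glob.glob("techmap_jobs_lu_2025-02-*.parquet"))
df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

# Top ten hiring companies by number of postings
print(df["companyName"].value_counts().head(10))

# Postings per contract type (the Step 2 script joins tags with ";")
contract_types = df["orgTags.CONTRACT_TYPES"].dropna().str.split(";").explode()
print(contract_types.value_counts())
```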
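Along the same lines, a minimal PySpark sketch, assuming a local Spark installation (pip install pyspark) and the same file naming as in Step 2:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("job-postings").getOrCreate()

# Spark reads all matching Parquet files as a single DataFrame
jobs = spark.read.parquet("techmap_jobs_lu_2025-02-*.parquet")

# Postings per company, most active first
(
    jobs.groupBy("companyName")
    .agg(F.count("*").alias("postings"))
    .orderBy(F.desc("postings"))
    .show(20, truncate=False)
)

spark.stop()
```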
Conclusion
Converting job posting data from JSON to Parquet is a powerful way to unlock actionable insights for your business. Whether you’re using Python for quick insights, Spark for large-scale processing, or BI tools for visualization, Parquet ensures your data is efficient, scalable, and ready for action.
For more advanced use cases, consider automating your workflows with cloud functions or exploring machine learning models to extract deeper insights from your data.
Happy data wrangling!