Position Overview
Monolith AI is seeking an experienced QA Engineer to lead load testing efforts for a critical system
release focused on improving concurrency and high request load handling. This fast-paced, short-
term engagement requires someone who can quickly understand complex distributed systems,
design comprehensive load tests, and work collaboratively with a rapidly growing engineering team
to ensure our new environment meets performance requirements.
Primary Responsibilities
• Design and Implement Automated Load Testing Framework
◦ Develop comprehensive load tests for FastAPI endpoints, Temporal workflows/
activities, and AWS service interactions
◦ Create realistic test scenarios simulating concurrent workflow execution patterns,
including graph-based workflow orchestration
◦ Build automated test suites that measure system behavior under varying concurrency
levels and request loads
• Performance Analysis and Bottleneck Identification
◦ Monitor and analyze system performance across the entire stack (API layer,
Temporal workers, AWS services)
◦ Identify concurrency limitations in Temporal workflow execution, AWS service
limits (Athena, ECS), and inter-component communication
◦ Document performance characteristics including response times, throughput limits,
and failure modes under load
• Collaborate on Non-Functional Requirements (NFR) Definition
◦ Work with Customer Success and Product teams to understand business
requirements and translate them into measurable performance criteria
◦ Iterate on acceptable concurrency thresholds, latency targets, and throughput
requirements
◦ Validate that proposed NFRs are realistic and achievable given architectural
constraints
• System Documentation and Knowledge Extraction
◦ Build an understanding of the existing system through code review, discussions with the
development team, and exploratory testing
◦ Create clear documentation of test methodologies, results, and recommendations for
future testing
• Recommendation and Optimization Guidance
◦ Provide actionable recommendations for removing identified bottlenecks
◦ Suggest configuration optimizations for Temporal (worker pools, task queues) and
AWS services (Athena concurrency, ECS capacity)
• Rapid Communication and Status Reporting
◦ Maintain daily/frequent communication with the Tech Lead regarding project
progress, blockers, and findings
◦ Quickly escalate issues that could impact the aggressive timeline
◦ Present findings and recommendations to technical and non-technical stakeholders
• Cross-Component Integration Testing
◦ Test complex scenarios involving graph execution triggering node workflows across
multiple system boundaries
◦ Validate S3 read/write operations under concurrent load
◦ Ensure inter-component communication (API → Temporal, Temporal Activity →
API triggers) performs reliably at scale
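As an illustration of measuring system behavior under varying concurrency levels, here is a minimal sketch using only the Python standard library. It is not Monolith AI's framework: `fake_request` is a hypothetical stand-in for a real HTTP call to an endpoint under test, and the concurrency levels and request counts are placeholder values. A real suite would more likely use one of the load testing tools named under Required Qualifications.

```python
import asyncio
import statistics
import time


async def fake_request(delay_s: float = 0.01) -> None:
    """Hypothetical stand-in for one HTTP call to an endpoint under test."""
    await asyncio.sleep(delay_s)


async def timed_call(semaphore: asyncio.Semaphore, latencies: list) -> None:
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        start = time.perf_counter()
        await fake_request()
        latencies.append(time.perf_counter() - start)


async def run_level(concurrency: int, total_requests: int) -> dict:
    """Issue total_requests calls with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies: list = []
    wall_start = time.perf_counter()
    await asyncio.gather(
        *(timed_call(semaphore, latencies) for _ in range(total_requests))
    )
    wall = time.perf_counter() - wall_start
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile.
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "concurrency": concurrency,
        "requests": len(ordered),
        "throughput_rps": len(ordered) / wall,
        "p50_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }


def sweep(levels=(1, 5, 20), total_requests=40) -> list:
    """Run each concurrency level in turn and collect one summary per level."""
    return [asyncio.run(run_level(c, total_requests)) for c in levels]


if __name__ == "__main__":
    for row in sweep():
        print(row)
```

Comparing throughput and p95 latency across the rows gives a first indication of where adding concurrency stops helping, which is the kind of signal the bottleneck analysis above would build on.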
Key Performance Indicators
• Test Coverage and Execution
◦ Complete automated load test suite covering all critical components within first 3
weeks
◦ Execute baseline and progressive load tests identifying maximum sustainable
concurrency levels
• Bottleneck Identification and Impact
◦ Identify and document top 5-7 performance bottlenecks with clear impact analysis
◦ Provide actionable remediation recommendations with estimated effort and impact
for each bottleneck
• NFR Definition and Validation
◦ Collaborate with stakeholders to define measurable NFRs within first 2 weeks
◦ Validate that the system meets the agreed NFR criteria, or document the gaps, by project end
• Documentation and Knowledge Transfer
◦ Deliver comprehensive test documentation, results analysis, and system performance
characteristics
◦ Conduct knowledge transfer sessions so the team can maintain and extend the testing
framework
• Project Velocity and Communication
◦ Meet weekly milestone targets in this fast-paced 2-month engagement
◦ Maintain proactive communication rhythm (daily standups, weekly detailed reports
to Tech Lead)
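To make "validate or document gaps against agreed NFR criteria" concrete, the following is a small sketch of how measured results might be compared with agreed thresholds. The `Nfr` class, the `find_gaps` helper, the metric names, and every number are hypothetical placeholders; the actual NFRs are to be defined with stakeholders during the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nfr:
    """One hypothetical non-functional requirement: a metric and its limit."""
    metric: str
    limit: float
    higher_is_better: bool = False  # e.g. throughput; latency is lower-is-better


def find_gaps(nfrs, measured):
    """Return (metric, limit, observed) for each NFR that is missed or unmeasured."""
    gaps = []
    for nfr in nfrs:
        observed = measured.get(nfr.metric)
        if observed is None:
            # No measurement exists yet, so the NFR cannot be claimed as met.
            gaps.append((nfr.metric, nfr.limit, None))
            continue
        if nfr.higher_is_better:
            ok = observed >= nfr.limit
        else:
            ok = observed <= nfr.limit
        if not ok:
            gaps.append((nfr.metric, nfr.limit, observed))
    return gaps


if __name__ == "__main__":
    # Placeholder values for illustration only.
    nfrs = [
        Nfr("p95_latency_s", 2.0),
        Nfr("throughput_rps", 50.0, higher_is_better=True),
        Nfr("error_rate", 0.01),
    ]
    measured = {"p95_latency_s": 1.4, "throughput_rps": 35.0}
    for metric, limit, observed in find_gaps(nfrs, measured):
        print(f"GAP {metric}: limit={limit}, observed={observed}")
```

Treating an unmeasured metric as a gap keeps the final report honest about what was not covered in the time available.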
Required Qualifications
Experience:
• 4+ years of experience in QA/performance testing roles
• 2+ years of hands-on experience with load testing distributed systems and microservices
architectures
• Proven experience with load testing tools (e.g., k6, JMeter, Locust, Gatling, Artillery)
• Experience testing workflow orchestration systems (Temporal, Airflow, Prefect, or similar)
• Demonstrated ability to test systems integrating with AWS services (particularly Athena,
ECS, S3)
Technical Skills:
• Strong proficiency in Python (required for test automation and working with FastAPI/
Temporal)
• Experience with REST API testing and performance validation
• Understanding of distributed systems concepts: concurrency, queueing, backpressure, rate
limiting
• Familiarity with AWS infrastructure and service limits
• Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or
similar)
• Proficiency with Git and CI/CD pipelines
• Ability to read and understand code in order to design effective tests
Immediate Availability:
• Ability to start in early January 2025 and commit to focused 3-month engagement
• Availability for full-time contract work during project duration
Preferred Qualifications
• Direct experience with Temporal.io (workflows, activities, workers)
• Experience with containerized workloads and Docker/ECS
• Prior work in fast-paced startup or scale-up environments
• Experience with infrastructure-as-code (Terraform, CloudFormation)
• Background in Site Reliability Engineering (SRE) or DevOps practices
• Familiarity with data processing pipelines and analytics systems
• Previous contract/consulting experience with rapid knowledge acquisition
• Experience with graph-based workflow systems or DAG execution engines
• Knowledge of AWS service limits and optimization strategies
Essential Soft Skills
Self-Direction and Initiative:
• Ability to operate independently in an ambiguous, fast-moving environment with minimal
documentation
• Proactive problem-solving mindset; doesn't wait for perfect information before taking action
• Comfortable making pragmatic decisions quickly in a time-constrained project
Communication and Collaboration:
• Exceptional communication skills for extracting knowledge through conversations with
existing team members
• Ability to translate technical findings into clear, actionable recommendations for diverse
audiences
• Comfortable asking clarifying questions and challenging assumptions respectfully
• Strong written communication for documentation and status updates
Adaptability and Learning Agility:
• Quick learner who can rapidly understand complex, poorly documented systems
• Flexible and comfortable with changing priorities in a 15-person team that's doubling in size
• Thrives in fast-paced environments with aggressive timelines
Pragmatism and Results Orientation:
• Focused on delivering practical, actionable outcomes within tight timeframes
• Understands the balance between thoroughness and speed in a 2-month engagement
• Comfortable with "good enough" when perfect isn't achievable within constraints
Stakeholder Management:
• Skilled at managing expectations with technical leadership about realistic timelines and
trade-offs
• Diplomatic when delivering difficult news about performance limitations or bottlenecks
• Collaborative approach when working with CS and Product on NFR definition
Key Challenges in This Role
• Rapid Knowledge Acquisition with Limited Documentation
◦ The existing system lacks comprehensive documentation, requiring you to quickly
build understanding through code review, system exploration, and frequent
discussions with the development team
◦ Success requires comfort with ambiguity and strong investigative skills
• Aggressive Timeline with High Impact
◦ A 3-month timeline to design tests, execute comprehensive load testing, identify
bottlenecks, and deliver actionable recommendations is extremely tight
◦ Must balance thoroughness with pragmatism; prioritize ruthlessly to ensure critical
areas are covered
• Complex Distributed System with Multiple Integration Points
◦ The system involves multiple layers (FastAPI, Temporal, AWS services) with
complex inter-component communication patterns (graph → node workflows)
◦ Must understand the entire stack sufficiently to design realistic, comprehensive load
tests that expose real-world bottlenecks