Posted Apr 16, 2026

AI Consultant for Basic Testing of AI Tools

------------------------PROJECT UPDATE--PLEASE READ-----------------------

Thank you for your interest. We do not have much experience in this area, which is exactly why we need help, but we do know what we need. Because our time is limited, we are updating the posting with Addendum 1 and Addendum 2, as well as this introductory note.

Introductory note

Once we are in touch with freelancers/providers who can offer the requested information, we would like to see which products they have already built, so that we can understand which of those existing products we could reuse or adapt. We are interested in solutions ranging from very simple ones up to somewhat more complex ones, including agents. We are also interested in extracts/integrations for certain web pages and in one mobile application for parking. For now, we are extending our project with the two addenda below.

ADDENDUM 1 – Clarification of the future project (business side)

I am looking for paid assistance from a person or company who can:
- refer me to an expert or company (or let me know if you have done this yourself) that has already built a similar project (multi‑LLM comparison, RAG, legal/medical domain), whether it is a public SaaS product or a private in‑house solution, or
- point me to an existing software product with comparable capabilities that can be demonstrated.

The task is to:
- connect me with such a person/company, or
- point me to such software (in production or as a custom solution for another client), so that this software can be presented (demo, walkthrough) and its capabilities clearly shown.

If such software is available for sale or licensing, I am also interested in exploring purchase/licensing options.

ADDENDUM 2 – Precise technical minimal scope (MVP)

Minimal scope (MVP) of the system I want to build/use:

1. Orchestration
- Implementation of a central orchestrator, preferably using Vellum Workflows, but I am also open to another commercial or custom orchestrator (no open‑source frameworks like LangChain, LlamaIndex, etc. in the core). [skywork](https://skywork.ai/blog/vellum-ai-review-prompt-management-evaluations-orchestration/)
- Clearly separated modes of operation:
  - Normal mode: queries go only to the primary model (OpenAI).
  - Compare mode: manual trigger to compare OpenAI vs Anthropic on the same prompt.
  - Web‑check mode: manual trigger that sends a Perplexity/web‑research call (never run in parallel by default, only when explicitly requested).

2. LLM providers
- Integration with:
  - OpenAI (primary model for generation and reasoning).
  - Anthropic (secondary model for answer comparison/sanity‑check).
  - Perplexity (used exclusively for additional web‑check/research).
- Configurable parameters per model: temperature, max tokens, timeout, number of retries.

3. RAG layer
- Ingestion pipeline for documents (PDF, DOCX) with basic cleaning (encoding, removal of headers/footers where feasible).
- Document chunking + metadata (e.g. source, date, author, document type, jurisdiction/medical domain).
- Vector database, with a reasoned justification for the choice: [datacamp](https://www.datacamp.com/tutorial/pgvector-tutorial)
  - primarily pgvector on PostgreSQL, or
  - alternatively Pinecone as a managed solution.
- RAG queries must return citations in the answer (link to document + ID + page/paragraph range).

4. Database model (SQL)
Minimum entities:
- user (at least 2 users)
- case/matter (legal or medical question)
- user profile memory (preferences, answer style, language, etc.)
- case memory (history of queries and key conclusions per case)
- session summaries (session‑level summaries for long‑term context retention).

5. Security and audit
- Authentication and authorization for 2 users, with roles: admin, user.
- Audit log for every call: timestamp, model, provider, user_id, case_id, used document_ids, mode type (normal/compare/web‑check).
- Encryption in transit (HTTPS/TLS).
- Backup strategy for databases (SQL + vector store) with an approximate RPO/RTO.

6. Evaluations
- Prepare and implement a minimal test set (at least 15 legal/medical questions) with reference answers or at least expected key citations.
- Evaluation of:
  - citation accuracy (the model must cite real documents and relevant sections),
  - basic guardrails against hallucinations (e.g. answer “I do not know / not present in the documents” when there is no relevant context).

7. Requirements for candidates
- No open‑source LLM frameworks (LangChain, LlamaIndex, etc.) in the core orchestration. I prefer custom code or a commercial platform.
- Vellum experience is a plus, but not mandatory; I am open to strong alternative suggestions. [skywork](https://skywork.ai/blog/vellum-ai-review/)
- Please apply only if you have already built a similar multi‑LLM RAG system (legal/medical domain is a strong plus) and can show the architecture or anonymized examples.

------------
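
To make the three modes and the audit record from Addendum 2 more concrete, here is a minimal Python sketch of how they could fit together. Everything named in it is an assumption for illustration only: `Mode`, `AuditEntry`, `route`, the placeholder model names, and the `call_model` callback are hypothetical, no real provider SDK or Vellum API is called, and web‑check mode is assumed to call Perplexity alone. It is not a required design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Callable


class Mode(Enum):
    NORMAL = "normal"
    COMPARE = "compare"
    WEB_CHECK = "web-check"


# Providers called in each mode (assumption: web-check calls Perplexity only).
PROVIDERS_BY_MODE = {
    Mode.NORMAL: ["openai"],
    Mode.COMPARE: ["openai", "anthropic"],
    Mode.WEB_CHECK: ["perplexity"],
}

# Placeholder model names; real model identifiers would come from configuration.
MODEL_NAMES = {
    "openai": "primary-model-placeholder",
    "anthropic": "secondary-model-placeholder",
    "perplexity": "web-check-model-placeholder",
}


@dataclass
class AuditEntry:
    """One audit-log row per provider call, with the fields listed in section 5."""
    timestamp: str
    model: str
    provider: str
    user_id: int
    case_id: int
    document_ids: list
    mode: str


def route(
    prompt: str,
    mode: Mode,
    user_id: int,
    case_id: int,
    document_ids: list,
    call_model: Callable[[str, str], str],
    audit_log: list,
) -> dict:
    """Send the prompt to the providers for this mode and log every call.

    `call_model(provider, prompt)` is a hypothetical hook where the real
    provider client (or orchestrator step) would be plugged in.
    """
    answers = {}
    for provider in PROVIDERS_BY_MODE[mode]:
        answers[provider] = call_model(provider, prompt)
        audit_log.append(
            AuditEntry(
                timestamp=datetime.now(timezone.utc).isoformat(),
                model=MODEL_NAMES[provider],
                provider=provider,
                user_id=user_id,
                case_id=case_id,
                document_ids=list(document_ids),
                mode=mode.value,
            )
        )
    return answers
```

In this sketch, compare and web‑check only run when the caller passes the corresponding `Mode` explicitly, which mirrors the "manual trigger" requirement; a stub such as `lambda provider, prompt: "..."` can stand in for `call_model` during a demo.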
