How to Hire an AI Developer in 2026 (Founder's Guide)
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
Most AI developer interviews test the wrong things. Here is the 4-stage interview I would run, the take-home that filters real engineers, and the red flags I see weekly on Upwork - written from the side of the AI engineer being hired.
The hire that works vs the one that wastes 3 months
Most AI hires fail the same way. A founder reads ten LinkedIn profiles that all sound similar, picks the one with the most fluent intro call, signs a six-week engagement, and three months later has a half-built prototype, a $9,000 OpenAI bill from a tight loop nobody caught, no eval suite, and no idea whether the system actually works. The developer was not lying. They just were not the right kind of AI developer for the job.
I write this from the side being hired. I am a senior AI engineer based in Pristina, I have shipped LLM products end-to-end for clients in the US and Western Europe, and I sit on the candidate side of roughly two scoping calls a week. The patterns that fail are consistent enough that you can filter for them in fifteen minutes if you know what to look for.
The failed hire usually looks like one of four archetypes:
- The prompt engineer pretending to be an engineer. Fluent with ChatGPT, no code repository older than three months, cannot read a stack trace.
- The API plumber. Can wire OpenAI to a Next.js route in an hour, has no opinion on retrieval, no concept of evals, no idea what the system costs per call.
- The demo wizard. Builds a Loom-worthy prototype in a week, ships a system in month three that hallucinates 40% of the time and has no observability to detect it.
- The resume-stuffed classical ML engineer. Ten years of experience that all predates transformers, treats LLMs as a curiosity, will quietly try to solve your problem with a fine-tuned BERT.
The hire that works looks different. They have shipped a product you can click on. They talk about cost per call and latency in the first twenty minutes without being asked. They have an opinion on when not to use RAG. They have scar tissue from a specific production failure and can tell the story with the metric attached. They estimate work in days, not in weeks of vague exploration. The pre-call email already includes a question that shows they read your spec carefully.
Most of this post is about how to surface the second profile and filter out the first. The work is not particularly hard. It just requires not skipping the steps that LinkedIn and a charismatic intro call are designed to make you skip.
What an "AI developer" actually is in 2026
The label "AI developer" covers at least five distinct jobs in 2026, with overlapping skills and very different price tags. The single biggest mistake founders make is hiring one archetype for a problem that needs a different one - paying senior AI engineer rates for a job a prompt engineer could finish in a week, or paying prompt engineer rates for a job that demands a software engineer who happens to know LLMs.
Below is the taxonomy I use when a founder asks "what kind of AI person do I need." Read it once and the rest of the post stops being abstract.
The 5 archetypes of "AI developer" - pick the right one for your problem
| Archetype | Strongest at | Typical rate | Hire when |
|---|---|---|---|
| Prompt engineer | Prompt design, chained workflows, no-code agents | $60 – $120 / hr | One workflow inside an existing product |
| AI integrator | Wiring LLMs into an existing app, vendor APIs | $80 – $150 / hr | Bolting AI onto a shipped SaaS or internal tool |
| AI engineer | RAG, agents, evals, observability, full systems | $120 – $250 / hr | Building a real AI product end-to-end |
| ML engineer | Fine-tuning, training, custom model deployment | $180 – $350 / hr | Domain model, custom training, on-device inference |
| Applied research engineer | Novel architectures, frontier research | $300 – $600+ / hr | Almost never - FAANG-scale problems only |
Prompt engineer - when this is enough
A pure prompt engineer is the right hire when the problem is one well-defined workflow that lives inside a product someone else already built. Think: improving the prompt behind a customer-support autoresponder, designing the agent loop for a Zapier-style internal automation, or producing the system prompt and few-shot examples for a customer-facing chatbot built on an off-the-shelf platform. Engagement length is usually one to three weeks, output is mostly markdown and config, and the cost is the cheapest of any AI hire.
Warning sign: a candidate who only fits this profile but is being considered for a job that requires shipping a system. Prompt engineers can write fluent prompts and still not be able to ship production software. That is fine, and it is exactly why the hiring bar should match the actual scope.
AI integrator - wires LLMs into existing product
The integrator is the right hire when you have a working product and you want to add AI features without rebuilding the stack. They are comfortable with the AI SDK, OpenAI and Anthropic APIs, streaming responses, server actions or edge functions, and the basic plumbing of getting structured output into a UI. They will not architect a large RAG system from scratch, but they will ship a feature against an existing codebase in a week or two.
This is the right hire for most SaaS founders adding their first AI feature. The job is mostly integration engineering with light AI opinions - exactly what a strong full-stack developer who has shipped two or three LLM features can deliver.
AI engineer - builds RAG, agents, evals, with software-eng foundations
This is the most commonly needed and most commonly misjudged hire. An AI engineer in 2026 is a software engineer first who has built and shipped LLM-backed systems in production. They have opinions on chunking strategy, on when to add reranking, on prompt caching, on observability, on agent design, and on the failure modes that actually bite. They can talk through architecture, but they can also write the code and ship it. They have a strong cost and latency mental model and budget accordingly. The closest single title in older taxonomies is "senior full-stack engineer with LLM specialty."
This is the archetype I belong to and the one I am most often evaluating other candidates against. It is also where the largest skill variance hides - the gap between a real senior AI engineer and a confident integrator is often invisible on a resume.
ML engineer - fine-tuning, training, deployment of custom models
The ML engineer is the right hire when the project genuinely needs a model trained or fine-tuned on your data - domain-specific classifiers, custom embedding models, on-device inference, vision models, anything where an API call to a frontier model will not cut it on cost, latency, or accuracy. The hiring pool is smaller, the rate is higher, and the timelines are longer.
Most teams do not need this hire. Founders who think they do usually discover, after a scoping conversation, that a strong prompt and a good retrieval pipeline against an off-the-shelf model gets them 95% of the way there at one tenth the cost.
Applied research engineer - only at FAANG-scale problems
If you are reading a founder's guide to hiring AI developers, you do not need this person. They work at OpenAI, Anthropic, DeepMind, and a small number of well-funded labs, and they do not solve product problems. Skip.
Where to source - ranked by signal
Sourcing channel matters as much as filter. The highest signal channels are also the slowest, and the highest volume channels are also the noisiest. Pick a mix that matches the urgency.
Personal portfolio with shipped AI products (highest signal). Someone who built and shipped products that you can click on, log into, and pay for gives you a clearer signal than any interview. Look for a portfolio with working URLs, not just "case studies" in Notion. Bonus signal if they have published the architecture in a blog post - public writing is one of the cleanest forcing functions for clear thinking. If the products are also their own, you know they ship without supervision.
GitHub with real LLM repos (medium-high). A GitHub account with two or three real AI repos - eval suites, RAG implementations, agent frameworks, model wrappers - tells you more than a polished resume. Pay attention to commit history, not stars. Stars correlate with marketing, commits correlate with depth.
Toptal and Upwork (medium, expensive). Toptal screens hard at the top of the funnel and is reasonably reliable for senior contractors, but you pay a 2x markup over the rate the engineer receives. Upwork has a wide range - actual seniors exist on it, but you need to filter aggressively and the average profile is junior. Both platforms work better when you already know what good looks like.
LinkedIn outbound (low signal, high volume). Searchable, fast, and almost completely uncalibrated. The "AI engineer" title got bolted onto a million profiles in 2023 and 2024 and never came off. Use LinkedIn to source, never to evaluate.
Discord, Twitter, and build communities (selective). The LLM developer scene has visible nodes - the LangChain Discord, the Anthropic and OpenAI dev communities, the AI Engineer Summit crowd, smaller groups around specific frameworks. People who show up in those spaces and answer technical questions in public are usually the real deal. Slower to recruit from, but the signal is high.
Agencies (low signal-to-noise per dollar). Most AI agencies in 2026 are integrators dressed up as engineering firms. You pay 2x to 3x the rate of an equivalent freelance hire and get a project manager you did not ask for. An agency is the right move only when the contract structure and accountability actually justify the markup - which is rarely the case for an early-stage team.
Resume red flags
These are the patterns that consistently correlate with weak hires. Treat them as warning signs, not deal-breakers - but the more that stack up, the less likely the candidate is the right hire.
- "ChatGPT power user" in the skills section. Anyone with real LLM engineering chops would not list a consumer product as a skill.
- AutoGPT, BabyAGI, or AgentGPT projects from 2023. Demoware from the agent-hype cycle, almost never shipped to real users, almost never had evals.
- "Built a chatbot in a weekend." Everyone built a chatbot in a weekend. It is not evidence of shipping production AI.
- 10+ years of AI experience. Often means classical ML pre-2018 - useful in a different job, frequently a poor fit for shipping LLM products.
- Specializes in 30 frameworks. LangChain, LlamaIndex, Haystack, AutoGen, CrewAI, DSPy, Semantic Kernel, Pinecone, Weaviate, Qdrant, Chroma, Milvus, and on and on. Real engineers pick two or three tools and have strong opinions about why.
- No GitHub, no portfolio, no public writing. The best AI engineers I know all have at least one of the three. None of them have zero.
- Claims to have built "an AGI" or "reasoning system from scratch." Marketing language, almost never engineering language.
- Every project is a demo, none is a product. Demos are easy. Products with paying users are hard. The skills do not overlap as much as people assume.
Resume green flags
The mirror image - what to weight heavily when you see it.
- Shipped product URLs you can click on. Working, billable, used by real customers, ideally with the engineer named as the builder.
- Eval setups in public repos. A folder named /evals with real test cases is one of the strongest senior signals in the LLM world.
- Public writing on tradeoffs. Blog posts that name a problem, the options considered, and what was chosen. Tradeoff thinking is the single hardest skill to fake.
- Named clients with verifiable engagements. Not "worked with Fortune 500 companies," but specific named companies, ideally with the product they shipped together.
- Open-source contributions to real AI tooling. PRs to LangChain, LlamaIndex, the AI SDK, vector DB clients, eval frameworks - not stars on tutorial repos.
- Cost and latency mentioned unprompted. Any candidate who brings up token cost, prompt caching, or p95 latency in the first conversation has shipped real systems.
- Talks about failure cases, not capability. Capability-talk is marketing. Failure-mode talk is engineering.
The 4-stage interview I would run
Most AI developer interviews waste both sides' time. The process below surfaces senior signal in roughly 8 hours of interview and take-home time spread across 2 to 3 weeks, with the paid trial on top. If you cannot run a full process, run the first two stages and accept the higher noise.
Stage 1 - Screen (15 minutes)
Five questions on a video call, asked in order. The goal is to filter out the bottom 70% of candidates fast. There are no trick questions, and the answers do not need to be perfect - you are looking for the texture of the response.
- Walk me through the most recent AI product you shipped to production, with the metric you optimized for. Senior answer: specific product, specific metric, specific failure mode they fought. Junior answer: a demo, no metric, no failure.
- When would you NOT use RAG? Senior answer: when the corpus is small enough to fit in context, when the answer is computable rather than retrievable, when latency budget cannot tolerate the extra hop. Junior answer: blank stare or "you should always use RAG."
- How do you evaluate retrieval recall? Senior answer: labeled query-document pairs, recall@k, regression on the eval set as part of CI. Junior answer: "we look at the outputs."
- Walk me through a production AI failure you debugged. Senior answer: a specific story with a metric, a hypothesis, a fix, and the cost of the bug. Junior answer: hypothetical or generic.
- What does your last AI feature cost per call, and how do you know? Senior answer: a number, a breakdown by input and output tokens, and the tooling they used to track it. Junior answer: "we did not really track that."
Three or more senior answers and you advance to stage two. Two or fewer and you stop. The signal-to-noise on these five questions is higher than any take-home you can design.
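To make the retrieval-recall answer concrete, here is a minimal Python sketch of recall@k over a labeled query set. The queries, document IDs, and function names are invented for illustration; a real eval would load the labels from a file and run in CI.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(runs[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


# Hypothetical labeled set: query -> IDs of documents a human marked relevant.
labels = {
    "refund policy": {"doc-07", "doc-12"},
    "reset password": {"doc-03"},
}
# Hypothetical retriever output: query -> ranked document IDs.
runs = {
    "refund policy": ["doc-07", "doc-01", "doc-44", "doc-12", "doc-09"],
    "reset password": ["doc-18", "doc-22", "doc-03", "doc-05", "doc-31"],
}

score = mean_recall_at_k(runs, labels, k=3)
print(f"recall@3 = {score:.2f}")  # prints "recall@3 = 0.75"
```

A candidate who has done this for real will usually go one step further and describe how the number is tracked over time, so a retrieval change that lowers it fails the build.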
Stage 2 - Take-home (4 to 8 hours, paid)
Pay a flat $400 to $800 for the take-home. Unpaid take-homes longer than two hours filter out exactly the senior candidates you want. Give a real spec: build a small RAG over these 50 markdown documents, ship with one eval and one observability hook. Provide the documents and a test query set. Give them five days to deliver.
What you are looking for in the submission, in priority order:
- The eval. Did they ship one? Is it real? Does it run? What metric did they pick and why? Eval discipline is the single strongest senior signal in the LLM world.
- Failure-mode handling. What happens when the retrieval misses? What happens when the model refuses? Is there a fallback?
- Cost and latency awareness. Did they pick a model deliberately, or default to the most expensive one? Did they use prompt caching where it would have helped?
- The README. Did they explain the tradeoffs they made and the things they explicitly cut? Senior engineers always document what they did not build.
- Code quality. Is the code readable? Are the files reasonably structured? Are there tests on the non-probabilistic parts?
What you are not grading: pixel-perfect UI, framework choice (within reason), or whether they used your favorite vector DB. This is an engineering and judgment test, not a stack-purity test.
Stage 3 - Architecture interview (60 minutes)
A live whiteboard or shared doc session designing a real system. Pick one of: a customer-support agent with retrieval over your knowledge base, a meeting-summary pipeline with action-item extraction, an autonomous email triage agent with a human review queue. Walk through the system together - data flow, retrieval, prompt structure, eval, observability, cost model, failure modes, deployment.
The signal is in what they choose to talk about without being prompted. A senior AI engineer will, unprompted, bring up: chunking strategy, embedding model selection, reranking, prompt caching, structured output, tool calling, evaluation methodology, the human-in-the-loop surface, latency budget, cost per request, and the failure modes specific to the design. They will also push back on parts of the spec - "I would not do this with an agent, I would do it with two LLM calls" - which is what you want.
Stage 4 - Reference + paid trial (1 to 2 weeks)
Always check at least one reference. The question that matters most: "would you hire them again, and for what kind of work?" The shape of the answer tells you more than any scorecard.
Before committing to a long engagement, run a small paid trial. Two weeks, a scoped piece of real work, real money. You learn more about working with someone in two weeks of real work than in any interview process. They learn whether they want to work with you. The optionality is worth the cost.
What to test that nobody tests
Most AI developer interviews skip the things that distinguish good from great. The list below is what I would push on if I had limited time.
- Eval thinking. Can they design an eval for a problem they have never seen before? Can they explain why the eval matters more than the prompt?
- Cost awareness. Can they estimate the per-call cost of a feature given a token count, a model, and a cache hit rate? Do they reach for prompt caching, Batch APIs, and smaller models when appropriate? My OpenAI API cost post covers the math they should already know.
- Latency budgeting. Do they know what the user experience tolerates? Can they design a streaming UI that masks a 4-second backend call? Do they reach for smaller models on the hot path and larger ones for offline tasks?
- Prompt-injection awareness. Do they think about adversarial inputs? Do they sanitize tool inputs? Do they treat retrieved documents as untrusted?
- The "I would not use AI here" instinct. The strongest signal of all. A senior AI engineer will, with some regularity, tell you that the right answer to a problem is not an LLM but a regex, a SQL query, or a hard-coded rule. Anyone whose answer to every problem is "build an agent" should not run your AI roadmap.
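The cost-awareness check above comes down to simple arithmetic. The sketch below is one way to estimate it in Python, under two assumptions that are mine, not any vendor's: the prices are placeholders, and the cache hit rate is treated as the share of input tokens billed at a cheaper cached price.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Prices in USD per one million tokens. All values here are placeholders."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    cache_hit_rate: float,
    pricing: Pricing,
) -> float:
    """Expected USD cost of one call.

    cache_hit_rate is the share of input tokens billed at the cached price.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (
        uncached * pricing.input_per_m
        + cached * pricing.cached_input_per_m
        + output_tokens * pricing.output_per_m
    ) / 1_000_000


# Placeholder prices, not any vendor's real price list.
example = Pricing(input_per_m=2.00, cached_input_per_m=0.50, output_per_m=8.00)

per_call = cost_per_call(
    input_tokens=6_000, output_tokens=500, cache_hit_rate=0.5, pricing=example
)
monthly = per_call * 100_000  # assumed volume: 100,000 calls per month
print(f"${per_call:.4f} per call, ${monthly:,.2f} per month")
# prints "$0.0115 per call, $1,150.00 per month"
```

The point of the exercise is not the exact figure but whether the candidate can produce one in a few minutes and explain which input moves it most.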
Comp ranges (2026)
Real numbers from real engagements across geographies and seniority levels. The ranges below assume genuine senior skill - five or more years of software engineering with at least two years of shipped LLM work. Mid-level rates are typically 40 to 60% of these numbers.
| Region / structure | Rate (senior) | Notes |
|---|---|---|
| US senior AI engineer (in-house) | $200K – $400K loaded | Salary + equity + benefits, SF / NYC weighted |
| US AI engineer via agency | $200 – $400 / hr | Markup over engineer rate is 1.8x to 2.5x |
| US freelance senior | $150 – $300 / hr | Direct, no agency layer |
| Western Europe senior (in-house) | €110K – €180K | London, Amsterdam, Berlin, Zurich vary by 30% |
| Western Europe freelance senior | €90 – €180 / hr | Lower in southern Europe |
| Eastern Europe / Balkans freelance senior | $80 – $140 / hr | Kosovo, Serbia, Romania, Poland - CET / EET timezones |
| Hybrid retainer (10 – 20 hrs / week) | $8K – $15K / month | Most underrated structure for pre-seed teams |
| Fixed-scope AI MVP | $25K – $75K | 6 – 12 weeks, one senior, end-to-end |
For full context on how AI features compare to the rest of an early-stage build, the MVP cost guide breaks out the line items. The single most useful framing: the right structure is usually fractional senior, not full-time mid. A senior AI engineer at 15 hours a week typically out-ships a full-time mid-level engineer at 40 hours a week on AI-specific work, especially in the first 90 days of a project.
In-house vs freelance vs agency vs fractional
The decision matrix below maps a stage to a structure. Almost every founder I talk to defaults to one of these too early or too late, and the mismatch costs months.
| Structure | Best for | Watch out for |
|---|---|---|
| Freelance project | One bounded project, 6 – 16 weeks, defined scope | Scope creep, single point of failure if they leave |
| Fractional retainer | Ongoing AI work, no full-time need yet | Schedule conflicts with their other clients |
| Agency | Funded team that needs process and accountability | 2x – 3x markup, generic stack opinions |
| In-house full-time | Post-PMF roadmap with 12+ months of focused work | Hiring lead time, ramp time, retention risk |
| Two-week paid trial | Any of the above, as a de-risking step | Only blocker is not budgeting for it |
Practical heuristic: if you have less than $50K of AI budget in your runway, hire freelance or fractional. Between $50K and $250K, run a fractional retainer with the goal of converting to full-time once the roadmap is clear. Above $250K of dedicated AI engineering budget, hire in-house and use freelancers to fill gaps. Agencies make sense in a narrow band - Series A and beyond, with a real reason process needs to be bought rather than built.
Contract structures that protect both sides
AI contracts have the same shape as good software contracts, with two specific additions for the probabilistic nature of the work. The clauses below are what I include in my own master service agreement and what I look for when signing one from a client.
IP assignment in the master agreement. All work product transfers to the client on payment. Pre-existing tools, libraries, and templates remain with the engineer with a perpetual license back to the client. Standard, non-negotiable, keep it boring.
Eval-based acceptance criteria. This is the AI-specific clause that prevents 80% of disputes. The work is accepted when the system passes a written eval suite at agreed thresholds - for example, "recall@5 of 0.85 on the 50-question test set, p95 latency under 4 seconds, no factual regressions on the regression suite." Without an eval-based acceptance gate, "done" becomes a matter of taste, which is a fight nobody wins.
Milestone payments. 30% on signing, 30% on a defined midpoint deliverable, 40% on acceptance against the eval. Never accept "all on completion" - it misaligns incentives on both sides. Never accept "all upfront" - same reason in reverse.
Model spend pass-through. Model and infrastructure spend is invoiced at cost, not marked up. If the engineer is using their own OpenAI key, they bill it line-item with the receipt attached. Mixing engineering rate and infra spend hides cost in ways that always come back to bite.
Exit clauses. Either side can terminate with 14 days' notice. On termination, the client pays for completed work, receives all code and prompts, and the engineer provides a 4-hour handoff session. No long termination tails, no liquidated damages, no claw-backs on completed milestones.
Sample acceptance paragraph you can adapt: "The Phase 2 deliverable is accepted when the system achieves the metrics defined in Appendix A on the held-out eval set, runs end-to-end in the client environment without intervention, and includes the observability instrumentation listed in Appendix B. Acceptance review will occur within 7 business days of delivery. If acceptance is denied, the Engineer will receive written notice with specific deficiencies and 14 days to remediate before the milestone is reassessed."
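An eval-based acceptance clause is easiest to enforce when the gate is a script both sides can run. The Python sketch below checks the example thresholds quoted in the clause (recall@5 of 0.85, p95 latency under 4 seconds, no regressions). The measurements are hypothetical, and the nearest-rank percentile is one reasonable definition among several, so the contract should name the one it uses.

```python
import math
from typing import Dict, List


def p95(latencies_s: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def acceptance_report(
    recall_at_5: float,
    latencies_s: List[float],
    regressions: int,
) -> Dict[str, bool]:
    """Check each criterion against the example thresholds from the clause."""
    return {
        "recall@5 >= 0.85": recall_at_5 >= 0.85,
        "p95 latency < 4s": p95(latencies_s) < 4.0,
        "no regressions": regressions == 0,
    }


# Hypothetical measurements from one eval run (latencies in seconds).
latencies = [1.2, 1.4, 1.1, 2.0, 1.8, 1.3, 1.6, 2.2, 1.9, 3.1,
             1.5, 1.7, 1.2, 2.4, 1.0, 1.4, 2.8, 1.3, 1.6, 5.2]
report = acceptance_report(recall_at_5=0.88, latencies_s=latencies, regressions=0)
accepted = all(report.values())

for criterion, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}  {criterion}")
print("accepted" if accepted else "not accepted")
```

With these sample numbers all three checks pass: the single 5.2-second outlier sits above the 95th percentile, which is 3.1 seconds here.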
The first 30 days - what good looks like
The shape of a competent AI engineer's first month is recognizable. If you have one and you are not seeing this pattern, the hire is in trouble and you should address it before day 45.
Week 1. They read your existing codebase, your data, and your current AI usage. They produce a short report - usually a Loom walkthrough and a written one-pager - covering what they found, what they propose to change in the first month, and what they recommend cutting. They have a working local environment. They have run the existing system end-to-end and have specific observations.
Week 2. A first eval suite exists, even a small one. It is checked into the repo. The CI runs it. A first observability hook - log of prompts, completions, latencies, and costs - is live. They have shipped a small visible improvement, often a prompt fix or a retrieval tweak, with the eval delta attached.
Week 3. The first meaningful feature or rework is in progress. They are estimating in days, not weeks. They are pushing back on parts of the original spec with specific reasoning. They are asking for access to things they actually need (production data, an evaluation budget, a vendor API key) and not asking for things that would be a distraction.
Week 4. The feature ships behind a feature flag. Evals confirm it is not a regression. Cost and latency instrumentation confirms the unit economics. There is a written post-launch note covering what worked, what did not, and what they would do next. The relationship feels easier than it did on day one.
If the first 30 days look more like "exploring options," "evaluating frameworks," or "setting up infrastructure" without a single shipped change, the hire is the wrong shape for the work. Have the hard conversation early, not at day 90.
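For a sense of scale, the Week 2 observability hook can start as a thin wrapper around the model call. The Python sketch below is a minimal version under stated assumptions: `fake_model` stands in for a real client, the prices are placeholders, and the in-memory list stands in for whatever log pipeline the team already uses.

```python
import json
import time
from typing import Callable, Dict, List

# In-memory sink for the sketch; a real system would write to a log pipeline.
CALL_LOG: List[Dict] = []


def logged_call(
    model_fn: Callable[[str], Dict],
    prompt: str,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> str:
    """Call model_fn, then record prompt, completion, latency, and cost."""
    start = time.perf_counter()
    result = model_fn(prompt)
    latency_s = time.perf_counter() - start
    cost = (
        result["input_tokens"] * usd_per_m_input
        + result["output_tokens"] * usd_per_m_output
    ) / 1_000_000
    CALL_LOG.append({
        "prompt": prompt,
        "completion": result["text"],
        "latency_s": round(latency_s, 4),
        "cost_usd": cost,
    })
    return result["text"]


def fake_model(prompt: str) -> Dict:
    """Hypothetical stand-in for a model client: text plus token counts."""
    return {"text": "ok", "input_tokens": 1_000, "output_tokens": 200}


# Placeholder prices in USD per million tokens.
answer = logged_call(fake_model, "Summarize the ticket.", 2.00, 8.00)
print(json.dumps(CALL_LOG[-1], indent=2))
```

Even something this small is enough to answer "what does a call cost and how slow is it" by the end of the second week, which is the bar the timeline above sets.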
Frequently asked questions
What does an AI developer actually do?
An AI developer in 2026 ships products that use LLMs, retrieval, and agents as first-class components. Day-to-day that means prompt design, eval harnesses, retrieval pipelines, tool calling, observability, cost and latency control, and the regular software engineering around all of it - APIs, queues, auth, databases, frontends. The job is mostly software engineering with strong opinions about probabilistic systems.
How much should I pay an AI developer in 2026?
US senior AI engineers cost $200K to $400K loaded in-house, or $200 to $400 per hour through an agency. Western European seniors run €110K to €180K in-house. Eastern European and Balkan senior freelancers - including Kosovo - run $80 to $140 per hour. A part-time fractional engagement at 10 to 20 hours per week typically lands at $8K to $15K per month.
Should I hire a freelance AI developer, an agency, or in-house?
Freelance for a defined project under six months with a focused scope. Agency for funded teams that need process accountability and a contract that survives staff churn. In-house once you have paying customers and a 12+ month roadmap that genuinely needs 30+ focused engineering hours per week. Fractional is the right answer more often than founders expect - a senior AI engineer at 15 hours per week beats a full-time mid for most pre-seed teams.
What are the biggest red flags in an AI developer interview?
Heavy ChatGPT-user energy with no code repos, AutoGPT or agent-framework demos with no eval setup, weekend-chatbot projects framed as production experience, claims of 10+ years of AI experience that turn out to mean pre-2018 classical ML, and resumes that list 30 frameworks but no shipped products. Real AI engineers talk about tradeoffs, failure modes, and cost - not about how impressive the model is.
Should I give a take-home assignment when hiring an AI developer?
Yes, paid, and small - 4 to 8 hours of work for a flat fee of $400 to $800. Give a real spec: build a small RAG over a provided document set, ship with one eval and one observability hook. The signal is in what they choose to evaluate, how they handle the obvious failure cases, and what they explicitly cut for scope. Unpaid take-homes longer than two hours filter out the senior people you want.
How do I test for real LLM experience versus prompt fluency?
Ask them to walk you through a production AI failure they personally debugged, with the metric, the hypothesis, and the fix. Ask when they would not use RAG. Ask how they evaluate retrieval recall. Ask them to estimate the per-call cost of a feature they have shipped. Anyone who has actually shipped LLM products has scar tissue and specific numbers. Prompt-fluency candidates have neither.
How long should the hiring process take?
Two to three weeks from first message to signed contract for a freelance or fractional hire - a 15-minute screen, a paid take-home returned in a week, a 60-minute architecture interview, and one reference call. Full-time hires take six to twelve weeks with sourcing, multiple loops, and offer negotiation. Anything faster than two weeks for a full-time hire skips steps whose signal you will miss later; anything slower than four weeks for a freelancer loses candidates to other offers.
Can I hire a senior AI developer remotely from Eastern Europe?
Yes, and it is the highest-leverage hire most founders are not making. Kosovo, Serbia, Romania, and Poland all have strong senior AI engineers at 40 to 60% of US rates, working in CET or EET, which overlaps comfortably with both the US East Coast and Western Europe. The contract structure matters more than the geography - fixed-scope milestones, IP assignment in the master agreement, and a paid two-week trial before committing to anything longer.
If you want to skip the search, my services pages cover the specific shapes I work in: AI integration services, AI agent development, and the direct hiring routes - hire an AI developer in Kosovo, AI engineer in Pristina, or freelance AI engineer in Europe. Recent shipped products include Caldra AI, OmniAPI, and others on the homepage - each one is the kind of project the playbook above was designed to filter for.