
Your research agent cannot tell the difference between a dataset and a liability.

HHS's SACHRP found that AI's ability to combine datasets makes it "very difficult to abide by the implicit promise to research subjects that data will not be associated with them as individuals." Northeastern University now requires a dedicated AI Systems form for every human subjects study. IRBs are catching up. Your research agents need authorization policies now.

Research AI agent guardrails defined

Research AI agent guardrails are runtime authorization policies that intercept data access, web scraping, and information extraction tool calls made by autonomous research agents. They enforce source allowlists, extraction limits, IRB protocol boundaries, PII redaction, and citation requirements. Every access is logged for reproducibility and compliance verification.
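The interception pattern described above can be sketched in a few lines. This is an illustrative sketch only, not Veto's API: the `Decision`, `guarded_call`, and policy names are hypothetical, and a real deployment would evaluate policies server-side.

```python
# Hypothetical sketch of a runtime policy gate: evaluate a policy
# before every tool call, log every attempt, block denied calls.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str          # "allow", "deny", or "require_approval"
    reason: str = ""

def guarded_call(tool: Callable, args: dict,
                 policy: Callable[[str, dict], Decision],
                 audit_log: list):
    """Evaluate the policy first; log the attempt either way."""
    decision = policy(tool.__name__, args)
    audit_log.append({"tool": tool.__name__, "args": args,
                      "decision": decision.action})
    if decision.action == "deny":
        raise PermissionError(decision.reason)
    if decision.action == "require_approval":
        raise PermissionError(f"needs approval: {decision.reason}")
    return tool(**args)
```

The key property is that the agent never reaches the tool directly: every call passes through the gate, so allow, deny, and approval outcomes all leave an audit entry.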

The risks of autonomous research agents

Research agents operate with broad autonomy: they crawl websites, query APIs, combine datasets, and synthesize information from dozens of sources. The Common Rule (45 CFR 46) explicitly calls for reexamination of identifiability because AI's ability to combine datasets creates re-identification risks that did not exist when the regulations were written. A breach of confidentiality in human subjects research must be reported to the IRB as an unanticipated problem.

Source reliability

Agents pull data from unverified sources, paywalled content, or sites with inaccurate information. LLMs hallucinate citations and fabricate statistics. Without source validation, bad data propagates through your research pipeline.

Re-identification risk

AI agents combining multiple "anonymized" datasets can re-identify individuals. The 45 CFR 46.104(d)(4) exemption for de-identified data does not account for AI's combinatorial capability. What looks anonymized to a human is linkable to a model.
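A toy example with synthetic data shows why. Two tables that look harmless on their own become identifying once joined on shared quasi-identifiers (the records and names below are fabricated for illustration):

```python
# Synthetic demonstration of quasi-identifier linkage: a de-identified
# study export joined with a public roll re-attaches names to diagnoses.
medical = [  # "anonymized" study export: no names
    {"zip": "02115", "birth_year": 1961, "sex": "F", "diagnosis": "X"},
]
voter = [    # public record: names plus the same quasi-identifiers
    {"zip": "02115", "birth_year": 1961, "sex": "F", "name": "J. Doe"},
]

def link(a, b, keys=("zip", "birth_year", "sex")):
    """Join records that agree on every quasi-identifier."""
    return [{**ra, **rb} for ra in a for rb in b
            if all(ra[k] == rb[k] for k in keys)]

reidentified = link(medical, voter)
# A single unique match is enough to re-identify the subject.
```

An agent performs this join in one tool call, across many more sources than a human analyst would ever attempt, which is why merge operations need review before execution.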

IRB protocol violations

Agents accessing data outside approved IRB protocols create compliance violations. Any AI-based software interacting with human subjects data may require both a security assessment and IRB review before use. Institutions are adding mandatory AI disclosure to IRB submissions.

Data extraction liability

Scraping protected data, exceeding API rate limits, or extracting PII without consent creates legal liability under CFAA, GDPR, and institutional data use agreements. Agents do not inherently know when to stop.

Source validation and data extraction policies

Define policies that validate sources, enforce extraction limits, and ensure IRB compliance before your agent accesses any data.

veto-policy.yaml
name: research-agent-guardrails
description: IRB compliance, source validation, and data extraction controls

rules:
  # Source allowlist enforcement
  - name: approved-sources-only
    tools: ["web_scrape", "fetch_url", "download_pdf"]
    condition: >
      not matches(args.url, context.approved_source_patterns)
    action: require_approval
    constraints:
      approver_role: "pi"
    response:
      message: "URL not in approved source list. PI approval required."
    audit:
      log_arguments: true

  # Block paywalled and authenticated content
  - name: block-paywalled-content
    tools: ["web_scrape", "fetch_url"]
    condition: "args.requires_auth or args.paywall_detected"
    action: deny
    response:
      error: "Cannot scrape authenticated or paywalled content"

  # Data extraction limits
  - name: extraction-record-limit
    tools: ["extract_data", "query_dataset"]
    condition: "args.record_count > 500"
    action: require_approval
    constraints:
      approver_role: "pi"
    response:
      message: "Extracting more than 500 records requires PI approval"

  # PII field blocking
  - name: block-pii-extraction
    tools: ["extract_data", "query_dataset"]
    condition: >
      any(args.fields, ['ssn', 'email', 'phone', 'address',
      'date_of_birth', 'ip_address'])
    action: deny
    response:
      error: "PII fields cannot be extracted without de-identification"

  # IRB protocol boundary
  - name: irb-protocol-check
    tools: ["query_dataset", "access_human_subjects_data"]
    condition: "not context.irb_protocol_approved"
    action: deny
    response:
      error: "Human subjects data access requires approved IRB protocol"

  # De-identification enforcement for research datasets
  - name: deidentification-required
    tools: ["query_dataset", "export_dataset"]
    condition: >
      args.dataset_type == 'human_subjects' and
      not args.de_identified
    action: deny
    response:
      error: "Human subjects data must be de-identified per IRB protocol"

  # Rate limiting per source
  - name: source-rate-limit
    tools: ["web_scrape", "fetch_url", "api_call"]
    condition: "context.requests_to_domain_last_hour > 60"
    action: deny
    response:
      error: "Rate limit reached for this domain. Try again later."

  # Citation requirement
  - name: require-citation-metadata
    tools: ["extract_data", "web_scrape"]
    action: allow
    constraints:
      require_metadata: ["source_url", "access_timestamp", "page_title"]
    audit:
      log_arguments: true
      log_response: true

  # Cross-dataset combination controls
  - name: dataset-combination-review
    tools: ["merge_datasets", "join_data"]
    condition: "args.dataset_count > 1"
    action: require_approval
    constraints:
      approver_role: "pi"
    response:
      message: "Combining datasets may create re-identification risk. PI review required."

IRB compliance and the Common Rule

The Common Rule (45 CFR 46) governs human subjects research. AI agents create new compliance challenges that existing regulations are being updated to address.

| Requirement | AI Agent Challenge | Veto Policy |
| --- | --- | --- |
| Identifiability (46.102(e)(7)) | AI can re-identify "anonymized" data by combining datasets | Dataset combination review, de-identification enforcement, PI approval for merges |
| Informed Consent (46.116) | Participants may not know AI will access their data | IRB protocol verification, consent status check before data access |
| Risk Minimization (46.111(a)(1)) | Autonomous agents can access more data than necessary | Field-level restrictions, extraction limits, PII blocking |
| Data Confidentiality (46.111(a)(7)) | AI tools may transmit data to third-party servers | Data flow controls, block external API calls with subject data |
| IRB Review (46.109) | AI use requires security assessment plus IRB approval | Protocol check on every human subjects data access |

Features for research agents

Source validation

Validate URLs against allowlists and blocklists. Detect paywalls and authentication requirements. Require PI approval for unapproved sources. Block scraping of protected content entirely.
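Allowlist matching can be as simple as hostname patterns. The sketch below is illustrative, assuming glob-style patterns against the URL's hostname; the pattern list is an example, not a recommended production configuration.

```python
# Sketch of URL allowlist matching on hostnames (example patterns only).
from fnmatch import fnmatch
from urllib.parse import urlparse

APPROVED = ["arxiv.org", "*.arxiv.org", "pubmed.ncbi.nlm.nih.gov", "*.gov"]

def source_allowed(url: str) -> bool:
    """Allow only URLs whose hostname matches an approved pattern."""
    host = urlparse(url).hostname or ""
    return any(fnmatch(host, pat) for pat in APPROVED)
```

Matching on the parsed hostname rather than the raw URL string avoids trivial bypasses such as an approved domain appearing in the path or query string of an unapproved site.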

Extraction controls

Cap records extracted per source. Rate-limit API calls per domain. Require approval for bulk operations. Block PII fields from extraction unless explicitly approved by the IRB protocol.

Citation tracking

Require citation metadata (source URL, access date, page title) on every extraction. Log complete provenance for every data point. Export citation data in APA, MLA, Chicago, or custom formats.
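The metadata fields required by the `require-citation-metadata` rule map naturally onto a small record type. The formatter below is a rough APA-style example for illustration, not one of Veto's export formats:

```python
# Sketch of the citation metadata required on every extraction.
from dataclasses import dataclass

@dataclass
class Citation:
    source_url: str
    access_timestamp: str   # ISO 8601
    page_title: str

    def apa_like(self) -> str:
        """Rough APA-style rendering: title, retrieval date, URL."""
        date = self.access_timestamp[:10]
        return f"{self.page_title}. Retrieved {date}, from {self.source_url}"
```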

Reproducibility audit trails

Every source accessed, data extracted, and decision made is logged. Complete logs enable other researchers to verify your agent's data pipeline. Export for supplementary materials or compliance reporting.
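An append-only JSONL file is one simple shape for such a trail. The field names below are assumptions for illustration, not a fixed Veto schema:

```python
# Illustrative append-only JSONL audit trail for reproducibility.
import json

def log_access(path: str, tool: str, args: dict, decision: str) -> None:
    """Append one structured entry per tool call; never rewrite old lines."""
    entry = {"tool": tool, "args": args, "decision": decision}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")

def replay(path: str) -> list:
    """Reload the full trail so another researcher can verify the pipeline."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Because each line is an independent JSON object, the trail can be diffed, grepped, and shipped as supplementary material without any special tooling.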

Build vs buy for research AI

Both a DIY build and Veto are measured against the same capabilities:
Source allowlist enforcement
PII field blocking
IRB protocol verification
Dataset combination review
Rate limiting per domain
Citation metadata tracking
Reproducibility audit trails
Time to compliance: weeks for a DIY build, hours with Veto.


Frequently asked questions

How do source validation policies work?
Source validation policies check URLs against configurable allowlists before your agent makes a request. You define patterns for approved academic databases (arXiv, PubMed, JSTOR), government sources, and licensed content. Requests to unapproved sources are blocked or routed to the PI for review. All access attempts are logged for audit.
Can Veto enforce rate limits across multiple agents?
Yes. Rate limits are enforced at the project level, shared across all agents using the same API key. This prevents multiple research agents from overwhelming a single API or violating terms of service through distributed requests. Per-domain and global rate limits are both configurable.
How does Veto help with IRB compliance?
Veto policies verify that an approved IRB protocol exists before allowing access to human subjects data. De-identification requirements are enforced automatically. Dataset combination operations require PI review to assess re-identification risk. All access is logged immutably, providing the audit trail IRBs require for continuing review.
What about re-identification risks when combining datasets?
Veto requires PI approval for any dataset merge or join operation. The policy evaluates the number and types of datasets being combined and flags operations that could create re-identification risk. This addresses SACHRP's concern about AI's ability to link 'anonymized' data across sources, implementing the reexamination of identifiability called for in 45 CFR 46.102(e)(7).
Can I use Veto with browser automation tools?
Yes. Veto integrates with Playwright, Puppeteer, and Selenium-based agents. URL navigation, form submissions, and data extraction can all be intercepted and validated against your policies. Source allowlists, rate limits, and PII blocking apply to browser-driven research agents identically.

Research agents need research-grade controls.