A vendor evaluation team at a discrete manufacturer spent four months on a proof of concept for a new inventory and production system. The vendor configured a sandbox environment, the team loaded a representative dataset, and a series of demonstrations walked through the standard workflows with curated examples. At the end of the four months, the team had a glossy document confirming that the system could do what the vendor said it could do. What they did not have was any evidence that the system worked under the conditions of their actual operation: their actual users, their actual data quality, their actual workflows that diverged from the vendor's standard, their actual integration with the production line. The proof of concept proved nothing about the operation that mattered. The procurement decision proceeded on faith.

This pattern is common, and it is the wrong way to evaluate manufacturing software. A better approach is a tightly scoped pilot that runs against real operational conditions for a defined period, produces verifiable evidence that the system performs under those conditions, and creates a fact base that supports the full rollout decision. A thirty-day software pilot built around two SKUs at one location can deliver more useful information than a four-month proof of concept that lives in a sandbox, and the manufacturing software pilot plan that follows is designed to make that thirty-day window productive.

Why Two SKUs Are Enough

The instinct on a pilot is to include enough scope to feel representative. The instinct is wrong. A pilot with twenty SKUs across three locations involves so many variables that no observed result can be cleanly attributed to a cause. If something goes wrong, the team spends the pilot debugging the pilot rather than evaluating the system. The right pilot scope is narrow enough that every variable is known and every result is interpretable, and that scope is two SKUs at one location.

The two SKUs should be chosen with intent. The first should be a high-volume, simple SKU that exercises the system's core flows: receiving, stocking, consumption, and dispatch. This SKU validates that the system can handle the operational tempo and that the basic workflows are clean. The second should be a lower-volume but more complex SKU, ideally a manufactured product with a multi-level bill of materials (BOM) that exercises production planning, available-to-promise (ATP) calculation, and material reservation. This SKU validates that the system can handle the complexity that the high-volume SKU does not expose.
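
To illustrate why a multi-level BOM exercises more of the system than a simple SKU, here is a minimal Python sketch that flattens a two-level BOM into purchased-component quantities and derives how many finished units current stock could support. The item names, quantities, and the simplified "buildable units" check are illustrative assumptions; a real ATP calculation would also account for scheduled receipts, reservations, and sub-assembly stock.

```python
from collections import defaultdict

# Hypothetical two-level BOM: parent -> {component: quantity per parent unit}
BOM = {
    "cabinet": {"door_assembly": 2, "panel": 4},
    "door_assembly": {"panel": 1, "hinge": 2},
}

def explode(item, qty=1, totals=None):
    """Flatten a multi-level BOM into purchased-component quantities."""
    totals = defaultdict(int) if totals is None else totals
    components = BOM.get(item)
    if components is None:
        # No BOM entry, so treat the item as a purchased part.
        totals[item] += qty
        return totals
    for component, per_unit in components.items():
        explode(component, qty * per_unit, totals)
    return totals

def buildable_units(item, on_hand):
    """Units of `item` buildable from component stock alone (simplified, assumed rule)."""
    per_unit = explode(item)
    return min(on_hand.get(c, 0) // q for c, q in per_unit.items())

flat = dict(explode("cabinet"))
units = buildable_units("cabinet", {"panel": 60, "hinge": 30})
```

In this example the panel appears at both levels of the BOM, which is exactly the kind of case a single-level SKU never exposes.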

Together, these two SKUs cover most of the operational surface area without bringing in the long tail of edge cases that distract a pilot. The edge cases will be there in the full rollout, but they do not need to be tested in the pilot. The pilot is testing whether the system can do the work, not every possible permutation of the work.

The same logic applies to the location choice. Pick one location, ideally one that is operationally typical rather than the largest or the most complex. Multi-location behavior can be tested in the next phase, after the single-location pilot has confirmed the foundation is solid.

Setting Up the Pilot in Week One

The first week of the pilot is for setup. Speed matters here because every day spent on setup is a day not spent collecting operational evidence. A modern manufacturing system should be able to support a two-SKU, one-location pilot setup in three to five days, including the initial data load. Vendors who require multiple weeks of professional services for this scope are signaling that their system is not designed for rapid deployment, which is itself useful evidence.

The setup steps are straightforward. Create the organization, configure the location, define the two items with their thresholds and supplier links, build the BOM for the manufactured SKU, configure the user roles for the pilot team, and load the current stock balances. The user role configuration matters more than it sounds. The pilot should run with the same role-based access control that production would use, including location scoping for users who should not see other locations. Running the pilot with everyone as administrator simulates a different system than the one you are actually evaluating.

The data load should reflect real conditions: real supplier lead times, real BOM quantities including waste factors, and real stock balances as of the cutover date. Synthetic data hides problems that real data would expose, and the point of the pilot is to expose problems while they are still cheap to fix. The data load should also include enough historical movement data to make the consumption analytics meaningful, ideally three to six months of prior movements imported as a backfill.
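
As a rough illustration of why waste factors belong in the data load, the following Python sketch computes gross component requirements for a production run. The component names, quantities, and the convention that waste is a fraction added on top of the net quantity are assumptions made for the example; the system under evaluation may define waste differently, which is worth confirming during setup.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class BomLine:
    component: str
    qty_per_unit: Decimal   # net quantity consumed per finished unit
    waste_factor: Decimal   # assumed convention: fraction of net quantity lost (0.05 = 5%)

def gross_requirements(bom, units_to_build):
    """Return the gross quantity of each component needed for a production run."""
    return {
        line.component: line.qty_per_unit * units_to_build * (Decimal("1") + line.waste_factor)
        for line in bom
    }

# Hypothetical BOM lines for the complex pilot SKU.
pilot_bom = [
    BomLine("steel_sheet", Decimal("2"), Decimal("0.05")),
    BomLine("fastener", Decimal("8"), Decimal("0")),
]

requirements = gross_requirements(pilot_bom, units_to_build=100)
```

Loading the BOM without the waste factor would understate the steel requirement by five percent in this example, and that gap would surface later as an unexplained stock discrepancy.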

By the end of week one, the system should be configured, loaded, and ready to start parallel operation. The pilot users have access, the BOM is activated, and the first ATP calculations are running. Any setup issues encountered are the first data points from the pilot.

Parallel Operation in Weeks Two and Three

Weeks two and three are for parallel operation. The pilot SKUs continue to be managed in the existing system, and they are also managed in the new system, with operators recording every movement in both places. This is more work than running only the new system, but the parallel operation is what produces the verifiable evidence that justifies the full rollout decision. Without it, every observation in the new system is uncalibrated against operational reality.

The operations in scope for the parallel run should cover the full lifecycle of the two SKUs: receiving inbound stock, recording transfers between sub-locations within the pilot site, consuming materials in production runs, recording produced quantities back into stock, dispatching outbound shipments, and processing any returns or adjustments. Each operation should be performed in both systems, with the timestamp and the operator recorded in both. At the end of each day, the two systems should be reconciled, and any discrepancy should be investigated and explained.
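
The daily reconciliation can be sketched in a few lines of Python. This is a minimal illustration that assumes both systems can export the day's movements as records with a SKU, a movement type, and a signed quantity; the field names and sample values are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

def totals_by_key(movements):
    """Sum signed quantities per (sku, movement type)."""
    totals = defaultdict(Decimal)
    for m in movements:
        totals[(m["sku"], m["type"])] += Decimal(m["qty"])
    return totals

def reconcile_day(legacy_movements, pilot_movements):
    """Return (key, legacy_total, pilot_total) for every key where the systems disagree."""
    legacy = totals_by_key(legacy_movements)
    pilot = totals_by_key(pilot_movements)
    discrepancies = []
    for key in sorted(legacy.keys() | pilot.keys()):
        legacy_qty = legacy.get(key, Decimal("0"))
        pilot_qty = pilot.get(key, Decimal("0"))
        if legacy_qty != pilot_qty:
            discrepancies.append((key, legacy_qty, pilot_qty))
    return discrepancies

# Hypothetical exports for one day of parallel operation.
legacy_day = [
    {"sku": "SKU-A", "type": "receipt", "qty": "500"},
    {"sku": "SKU-A", "type": "dispatch", "qty": "-120"},
    {"sku": "SKU-B", "type": "consumption", "qty": "-42"},
]
pilot_day = [
    {"sku": "SKU-A", "type": "receipt", "qty": "500"},
    {"sku": "SKU-A", "type": "dispatch", "qty": "-120"},
    {"sku": "SKU-B", "type": "consumption", "qty": "-44.1"},
]

day_discrepancies = reconcile_day(legacy_day, pilot_day)
```

Here the only flagged key is the SKU-B consumption, where the new system records 44.1 against 42 in the existing one. A gap like that might come from waste-adjusted consumption, and explaining it is the work of the investigation step.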

The reconciliation process is where most pilot value is created. A discrepancy that resolves to a user error tells you something about user interface clarity. A discrepancy that resolves to a calculation difference tells you how the new system handles operations like waste-adjusted consumption or partial receipts. A discrepancy that resolves to a missing feature tells you whether the new system can support your operation at all. We have written separately about how dispatch discrepancies and reconciliation work.

The pilot users should be operating the new system the way they would operate it in production, not the way the vendor demonstrated it. They should be using their normal workflows, asking the questions they would normally ask, and pushing the system in the ways their daily work pushes any system. The vendor should not be in the room during operations. The vendor's job during the pilot is to be available for questions and to fix bugs, not to drive the operation.

Pilot Evaluation Criteria

The fourth week of the pilot is for evaluation. The pilot evaluation criteria should be defined at the start, not at the end, and they should be specific enough to support a clean decision. The categories that matter are operational accuracy, user experience, alerting effectiveness, and integration readiness. Each one should have measurable indicators that the pilot data can be assessed against.
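
One way to make "defined at the start" concrete is to write the criteria down as data before the pilot begins. The Python sketch below is only an illustration: the ninety-five percent reconciliation threshold comes from this article, while the other indicator names and thresholds are hypothetical placeholders that each team would need to set for itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    category: str
    indicator: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold

PILOT_CRITERIA = [
    Criterion("operational accuracy", "first_pass_reconciliation_rate", 0.95),
    # The thresholds below are hypothetical examples, not recommendations.
    Criterion("user experience", "share_of_operators_rating_usable", 0.80),
    Criterion("alerting effectiveness", "false_alerts_per_day", 2.0, higher_is_better=False),
    Criterion("integration readiness", "share_of_required_api_operations_covered", 1.0),
]

def evaluate(criteria, observed):
    """Map each indicator to True or False depending on whether it met its threshold."""
    return {c.indicator: c.passes(observed[c.indicator]) for c in criteria}

# Hypothetical end-of-pilot observations.
results = evaluate(PILOT_CRITERIA, {
    "first_pass_reconciliation_rate": 0.97,
    "share_of_operators_rating_usable": 0.85,
    "false_alerts_per_day": 3.5,
    "share_of_required_api_operations_covered": 1.0,
})
```

Fixing the thresholds before week one keeps the evaluation honest, because the team cannot adjust the bar after seeing the results.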

Operational accuracy is measured by the reconciliation results. Across the two SKUs over the two weeks of parallel operation, what percentage of movements reconciled cleanly between the two systems on the first attempt? What was the source of the discrepancies that required investigation? What was the average time to resolve a discrepancy? The right system should produce reconciliation rates above ninety-five percent by the end of the pilot, with the residual discrepancies explained by known causes that have known mitigations.
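
These indicators are simple to compute once each movement's reconciliation outcome is logged. The Python sketch below assumes a hypothetical log with one record per movement; the record fields and cause labels are illustrative, not taken from any specific system.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class MovementCheck:
    movement_id: str
    matched_first_attempt: bool
    cause: Optional[str] = None               # e.g. "user_error", "calculation", "missing_feature"
    time_to_resolve: Optional[timedelta] = None

def summarize(checks):
    """Compute first-pass rate, discrepancy causes, and average resolution time."""
    total = len(checks)
    clean = sum(1 for c in checks if c.matched_first_attempt)
    investigated = [c for c in checks if not c.matched_first_attempt]
    resolved = [c.time_to_resolve for c in investigated if c.time_to_resolve is not None]
    return {
        "first_pass_rate": clean / total if total else 0.0,
        "causes": Counter(c.cause for c in investigated),
        "avg_time_to_resolve": sum(resolved, timedelta()) / len(resolved) if resolved else None,
    }

# Hypothetical pilot log: 97 clean movements and 3 investigated discrepancies.
checks = [MovementCheck(f"M{i}", True) for i in range(97)] + [
    MovementCheck("M97", False, "user_error", timedelta(minutes=30)),
    MovementCheck("M98", False, "calculation", timedelta(minutes=90)),
    MovementCheck("M99", False, "user_error", timedelta(minutes=60)),
]

summary = summarize(checks)
```

With this sample log the first-pass rate is 97 percent, two of the three discrepancies trace to user error, and the average resolution time is one hour.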

User experience is measured by how the pilot users describe their daily work in the new system. Did the workflows match their mental model of the operation? Did the system surface the information they needed when they needed it? Were the alerts useful, or were they noise? The signal here is qualitative but essential, because a system that the operators dislike will be worked around regardless of how well it performs technically.

Alerting effectiveness is measured by what the system caught and what it missed. The pilot should produce a number of stockout-proximity warnings, threshold crossings, ATP shortfalls, and other operational signals. Did the alerts fire when they should have? Did they fire when they should not have? Was the alert volume manageable, or did it produce the kind of fatigue that leads operators to ignore everything? We have explored this in our piece on alert fatigue in operations, and the pilot is the right time to test the system's alert behavior under real conditions.
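
A simple way to quantify these questions is to compare the alerts that actually fired against the events the pilot team judged alert-worthy in hindsight. The Python sketch below assumes each event can be given a stable identifier; the identifier format and sample events are hypothetical.

```python
def alert_effectiveness(fired, expected):
    """Compare alerts that fired against events the team judged alert-worthy."""
    fired, expected = set(fired), set(expected)
    true_alerts = fired & expected
    return {
        "missed": sorted(expected - fired),   # should have fired but did not
        "noise": sorted(fired - expected),    # fired but should not have
        "precision": len(true_alerts) / len(fired) if fired else None,
        "recall": len(true_alerts) / len(expected) if expected else None,
    }

# Hypothetical event identifiers: sku, signal type, pilot day.
expected_events = {
    "SKU-A:threshold:day3",
    "SKU-B:atp_shortfall:day9",
    "SKU-A:stockout_proximity:day12",
    "SKU-B:threshold:day15",
}
fired_alerts = {
    "SKU-A:threshold:day3",
    "SKU-A:threshold:day5",
    "SKU-B:atp_shortfall:day9",
    "SKU-B:threshold:day15",
}

report = alert_effectiveness(fired_alerts, expected_events)
```

Low precision points to noise and the risk of alert fatigue, while low recall points to signals the system missed. Both numbers are worth tracking, since tuning one often moves the other.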

Integration readiness is measured by what the API can do. During the pilot, attempt at least one integration with an existing system, such as the accounting platform, the production line controller, or the customer ordering system. The integration does not need to be production-grade, but it should validate that the API supports the operations the integration requires and that the data flows in the direction needed. An integration that proves easy in the pilot is likely to be easy in production. An integration that proves difficult in the pilot is likely to be a major project later.

Decision Criteria and Next Steps

By the end of the thirty days, the pilot should have produced enough evidence to support a clear decision. The reconciliation data shows whether the system can keep accurate stock. The user feedback shows whether the team can work with it. The alert behavior shows whether the system surfaces the right operational signals. The integration test shows whether the system will fit into the broader architecture. These four data sources, taken together, support a defensible decision in either direction.

If the evidence is positive, the next phase is a planned expansion. Add more SKUs at the same location to test the volume and complexity scaling. Then add a second location to test multi-site behavior. Then enable the production planning and material requirements planning (MRP) features in full. Each phase builds on the verified foundation of the previous one, which is the only way to scale a system rollout without accumulating risk.

If the evidence is mixed or negative, the pilot has saved the organization from a much larger commitment. The cost of running a focused thirty-day pilot is small compared to the cost of a failed full rollout, and the information value is high regardless of which direction the decision goes. The pilot that proves the system does not fit your operation is just as valuable as the pilot that proves it does, because both outcomes prevent worse outcomes downstream.

The shift this enables is the shift from manufacturing software trials that are theatrical to manufacturing software trials that are evidentiary. A thirty-day pilot run against real operational conditions produces a fact base that supports a real decision. The four-month proof of concept that lives in a sandbox produces a glossy document that supports nothing.


FalOrb is built for rapid setup and ledger-first onboarding, with scoped roles and parallel operation that make a thirty-day pilot practical at any manufacturing site. Book a 30-minute walkthrough or email us at [email protected] to see how it applies to your operation.