# Trading System Development

This post acts as the anchor for an ongoing discussion of trading system development.

Comments related to the design, testing, validation, analysis, and management of trading systems are welcome.

## This Post Has 17 Comments

Howard —

I’m trying to size tests correctly. Let’s say I have a test period of 784 days. This would be 784 daily bars.

My variable space produces 648 possible rulesets (five variables with 3, 3, 4, 6, and 3 candidate values: 3 * 3 * 4 * 6 * 3 = 648). This is 83% of the test period size.

Am I evaluating the scenario generally correctly by calculating this way? And if so, 83% of the window would seem to be too much, but what is an acceptable consumption?

Thank you.

The higher the ratio of alternatives being evaluated to number of data points, the higher the likelihood of overfitting the model to the data.
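The arithmetic in the question can be checked in a few lines. This is only a sketch of the ratio being discussed; the variable names are illustrative, and the counts per variable are taken from the question.

```python
import math

# Number of candidate values for each of the five variables (from the question).
values_per_variable = [3, 3, 4, 6, 3]
n_rulesets = math.prod(values_per_variable)   # size of the exhaustive grid

n_bars = 784                                  # daily bars in the test period
ratio = n_rulesets / n_bars

print(f"{n_rulesets} rulesets over {n_bars} bars: ratio {ratio:.2f}")
```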

To illustrate, imagine fitting a polynomial through 10 data points. As the complexity of the polynomial increases, the fit to the 10 points increases. But the ability of the polynomial to correctly estimate the value of a new data point decreases.
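A minimal sketch of that polynomial illustration, assuming a hypothetical linear relationship with Gaussian noise (the data, the noise level, and all function names are invented for the example). It compares a straight line with the degree 9 polynomial that passes through all 10 points. The polynomial fits the 10 points exactly, yet its error on new points between them is typically much larger.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares straight line (2 fitted coefficients)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a fitted model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
true_fn = lambda x: 2.0 * x                    # hypothetical underlying relationship
train_x = [i / 9 for i in range(10)]           # the 10 data points used for fitting
train_y = [true_fn(x) + rng.gauss(0, 0.3) for x in train_x]
test_x = [(i + 0.5) / 9 for i in range(9)]     # new points between the old ones
test_y = [true_fn(x) + rng.gauss(0, 0.3) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

for name, model in [("line (degree 1)", line), ("polynomial (degree 9)", poly)]:
    print(f"{name}: fitted-data MSE {mse(model, train_x, train_y):.4f}, "
          f"new-data MSE {mse(model, test_x, test_y):.4f}")
```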

The technique for determining the value of the model is independent validation — one-time testing of the model on data that has not been used in choosing the values of the variables. Walk forward testing is the gold standard for validation of trading systems.

Hi, Howard –

I have all of your books except the AmiBroker one. They are superb, readable, and dense with expertise.

I have two questions:

1. In a walk forward test, is each in-sample period optimized, and are those optimized parameters then tested in the subsequent out-of-sample period? So, if there were five segments (IN-OUT-1, IN-OUT-2, IN-OUT-3, IN-OUT-4, IN-OUT-5), would there be five separate optimizations?

Or is there a single optimization at the start, and one is testing the “final” parameters sequentially?

I’m getting confused on the specific process.

2. CAR25 — is there a calculation for it that I am missing?

Thank you very much.

The walk forward process is a sequence of steps. Each step uses a period of time and the data from that period. The period and associated data are separated into two parts.

The earlier part, called the in-sample period and in-sample data, is examined and tested with the intent of creating the best fit of the model to the data. The fit is accomplished through selecting different combinations of indicators, parameters, rules, and logic from those possible with the model. In machine learning terminology, this is the “learning” phase.

The later part, called the out-of-sample period and out-of-sample data, is only tested. This is the “validation” phase. The model that was fit during the learning phase is used to evaluate the out-of-sample data.

For a system built by fitting a model to data using algorithmic techniques, the “best” model from all the alternative models considered and tested in-sample is selected for the one-time test of the out-of-sample data. Those OOS results are the best estimate of future performance of the system.

The stepwise walk forward process advances through the time and data, moving forward in time in steps, each as long as the length of the out-of-sample period. At each step, a fresh “best” model is fitted to the then-current in-sample data.

Note that “best” is with respect to an objective function. Each set of trades is evaluated and scored using the objective function. The objective function can be as simple as “net profit,” which is the default in some trading system development platforms. Or it can be considerably more complex, weighting trading frequency, win-loss ratio, and so forth.

For the example in your question, there would be five separate optimizations. Each optimization results in a new model. And each produces one set of OOS trades.
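The stepping described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular platform: `walk_forward_windows`, `walk_forward`, `run_system`, the momentum rule, and the window lengths are all hypothetical, and net profit stands in for the objective function.

```python
import random

def walk_forward_windows(n_bars, is_len, oos_len):
    """Yield (is_start, is_end, oos_start, oos_end) index ranges, end-exclusive.

    Each step advances by the length of the out-of-sample period.
    """
    start = 0
    while start + is_len + oos_len <= n_bars:
        is_end = start + is_len
        yield start, is_end, is_end, is_end + oos_len
        start += oos_len

def walk_forward(data, is_len, oos_len, candidates, objective, run_system):
    """Re-optimize on every in-sample window, then test once out-of-sample.

    candidates: list of parameter sets to try (hypothetical)
    run_system(params, bars) -> list of trade results (hypothetical)
    objective(trades) -> score, higher is better
    """
    chosen, oos_trades = [], []
    for is_a, is_b, oos_a, oos_b in walk_forward_windows(len(data), is_len, oos_len):
        in_sample = data[is_a:is_b]
        best = max(candidates, key=lambda p: objective(run_system(p, in_sample)))
        chosen.append(best)
        oos_trades.extend(run_system(best, data[oos_a:oos_b]))  # one-time test
    return chosen, oos_trades

# Toy usage: daily percentage changes as data and a simple momentum rule as
# the "system" (take the next bar's change when the recent sum is positive).
def run_system(lookback, bars):
    return [bars[i] for i in range(lookback, len(bars))
            if sum(bars[i - lookback:i]) > 0]

rng = random.Random(0)
data = [rng.gauss(0.03, 1.0) for _ in range(1500)]
chosen, oos_trades = walk_forward(data, 1000, 100, [2, 5, 10, 20], sum, run_system)
print(len(chosen), "optimizations;", len(oos_trades), "out-of-sample trades")
```

With 1500 bars, a 1000-bar in-sample period, and a 100-bar out-of-sample period, the loop produces five in-sample/out-of-sample segments, so five separate optimizations, and the pooled `oos_trades` list is the set used to estimate future performance.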

Design of the objective function is important. Generating alternative models to test is easy. The intelligence is in the objective function. There are many considerations, including these features:

A. The alternatives tested can be sorted into the order given by the objective function. That order should be the same order as the person developing the system prefers them to be. That is, the developer should be confident that the highest ranked alternative is the one that the developer would have chosen given an opportunity to examine them all. This is important because the developer will not have an opportunity to examine the specific model chosen as best that will be used for the out-of-sample test. Or for live trading when this process is extended from development to trading.

B. The objective function should be based only on the distribution of trade results, not on the sequence in which those trades occur. We expect that the future will resemble the past — that OOS results will be similar to IS results — but only as far as the distribution. We cannot expect the sequence to be repeated. Consequently, the objective function should not include terms such as drawdown of the equity curve. That said, the developer must work within the constraints of the platform, and use the best objective function that can be formulated within its programming language.
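As a rough sketch of points A and B, an objective function can be written so that it depends only on the distribution of trade results and is then used to sort the alternatives. The scoring formula here (mean divided by standard deviation, scaled by the square root of the trade count) and the trade lists are assumptions chosen for illustration, not a recommended objective function.

```python
import random
import statistics

def objective(trades):
    """Score a set of trade results using distribution statistics only.

    Hypothetical form: mean trade divided by standard deviation, scaled by
    the square root of the number of trades.
    """
    if len(trades) < 2:
        return float("-inf")
    spread = statistics.stdev(trades)
    if spread == 0:
        return float("-inf")
    return statistics.mean(trades) / spread * len(trades) ** 0.5

# Hypothetical per-trade percentage results for three alternative models.
alternatives = {
    "A": [1.2, -0.8, 0.9, 1.5, -0.4, 0.7],
    "B": [3.0, -2.9, 2.5, -2.2, 3.1, -2.0],
    "C": [0.3, 0.2, -0.1, 0.4, 0.1, 0.2],
}

# Sort the alternatives into the order given by the objective function.
ranked = sorted(alternatives, key=lambda k: objective(alternatives[k]), reverse=True)
print("ranking, best first:", ranked)

# Reordering the trades does not change the score.
shuffled = alternatives["A"][:]
random.Random(1).shuffle(shuffled)
print("order-independent:", abs(objective(shuffled) - objective(alternatives["A"])) < 1e-9)
```

The developer would then ask whether this ranking matches the order they would have chosen by inspection. If it does not, the objective function needs rework.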

---

CAR25 is the estimate of profit for a set of risk-normalized trades.

Risk normalization is the process of determining the risk of a set of trades at a standard position size, then adjusting position size (to a value called safe-f) where the risk of drawdown over a user-selected time horizon is within the tolerance of the developer. After adjusting the trades to be taken with position size safe-f, CAR25 estimates the distribution of the future profit and the compound annual rate of return that might be realized.

CAR25 can be computed for any set of trades — real, paper, out-of-sample, in-sample, or hypothetical. Risk normalization means that any set of trades of those evaluated have the same risk of loss of funds. Given that the risk is the same, a rational trader would trade the one that provides the highest rate of return. CAR25 is a universal objective function. In the terminology of modeling and simulation, the alternative trading system with the highest CAR25 dominates all other alternatives. It is the one that should be allocated the trading funds.

In order for CAR25 to be a reasonable estimate of future performance, the set of trades that are used to compute CAR25 should be the best estimate of future performance. From the trading system development process, that is the set of out-of-sample trades produced by a mechanical process of fitting the model to the data and walking forward, with a single OOS test for each walk forward step.
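The risk normalization described above can be approximated with a small Monte Carlo sketch. Several things here are assumptions rather than statements from this reply: that CAR25 denotes the 25th percentile of the simulated compound annual return, that the risk tolerance is expressed as a limit on the 95th percentile of maximum drawdown (20% over a two-year horizon in this example), that trades are resampled with replacement, and every function name and trade value. Treat it as one possible reading, not the reference calculation.

```python
import random

def percentile(values, p):
    """Simple rank-based percentile, p in [0, 1]."""
    ordered = sorted(values)
    return ordered[int(p * (len(ordered) - 1))]

def simulate(trades, fraction, sequences, years):
    """Return (max drawdowns, CARs) for resampled sequences at one position fraction."""
    drawdowns, cars = [], []
    for seq in sequences:
        equity = peak = 1.0
        max_dd = 0.0
        for i in seq:
            equity *= 1.0 + fraction * trades[i]
            peak = max(peak, equity)
            max_dd = max(max_dd, 1.0 - equity / peak)
        drawdowns.append(max_dd)
        cars.append(equity ** (1.0 / years) - 1.0)
    return drawdowns, cars

def risk_normalize(trades, trades_per_year, years=2, dd_limit=0.20,
                   dd_pct=0.95, n_runs=500, seed=0):
    """Find the largest tested fraction whose tail drawdown stays within dd_limit."""
    rng = random.Random(seed)
    n = int(trades_per_year * years)
    # The same resampled sequences are reused for every fraction tested.
    sequences = [[rng.randrange(len(trades)) for _ in range(n)] for _ in range(n_runs)]
    safe_f, car25 = None, None
    for step in range(1, 41):                  # fractions 0.05, 0.10, ..., 2.00
        f = step * 0.05
        dds, cars = simulate(trades, f, sequences, years)
        if percentile(dds, dd_pct) > dd_limit:
            break                              # larger fractions only deepen drawdowns
        safe_f, car25 = f, percentile(cars, 0.25)
    return safe_f, car25

# Hypothetical out-of-sample trade results, as fractional gain per trade.
trades = [0.021, -0.034, 0.012, 0.045, -0.018, 0.008, -0.027, 0.031, 0.015, -0.009]
safe_f, car25 = risk_normalize(trades, trades_per_year=50)
print(f"safe-f about {safe_f:.2f}, CAR25 about {car25:.1%}")
```

Because every alternative is scaled to the same drawdown tolerance before its return is measured, the resulting CAR25 values can be compared directly across systems.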

Best regards, Howard

Hi Howard,

Firstly, thank you for all of your work and publications to date, I have enjoyed all of your books and I am looking forward to your next publication.

I have just finished your book Quantitative Technical Analysis (QTA), and it seems that shortly after its publication AmiBroker 6.0 became available, in which Monte Carlo analysis can now be calculated within the TSDP.

Whilst CAR25 at safe-f cannot be calculated directly as described in QTA, it is possible to calculate CAR25 and MDD95 for a given position size (or fixedFraction), either from the generated trade list or from the daily changes in equity.

I am trying to generate / develop a suitable Objective Function using these newly available tools. Currently my ObFn = CAR25 * Sqrt(N) / MDD95; where N is equal to the number of trades per year and the CAR25 and MDD95 are generated from the list of trades.

This seems to be working relatively well; however, I am wondering if you have any suggestions to improve either this ObFn or the one you have provided in QTA, given this new functionality in AmiBroker?

Also, because I am trying to develop a system that trades a portfolio of stocks rather than a single issue, I am struggling with the problem of serial correlation between the stocks when a market crashes, e.g. the 2008 GFC, the 2011 credit crisis, and the more recent equity market turmoil of late 2014 and late 2015.

I have tried optimizing and using walk forward optimization of the system on individual issues; however, there are never enough trades on each issue for the system to be accurately risk assessed and tradable.

Any suggestions you might have would be greatly appreciated!

Many Thanks

Matt

A forum discussion I participate in asked

“Why do past data extrapolations over estimate profit and underestimate risk? Obviously real time participation of the system developer will alter the price action of the issue traded and availability of units at “expected” prices will alter (sometimes considerably) from the design phase. Is there anything less obvious? What are the real reasons?”

In-sample results are always good. Some models are intended to memorize. Some are intended to learn. Trading systems require models that have learned features in the data that precede profitable trades. The only way to tell whether learning has happened is to supply the model with data that has not been used in the fitting process and check the accuracy of the trades that are identified.

Traditional trading system development platforms (AmiBroker, Ninja, TradeStation) produce models that are decision trees. At the beginning of the development process, a decision tree model is not well fit to the data. In-sample testing and adjusting of rules and parameters improves the fit of the model to the data and the profitability of the resulting trades. The more attention given to fitting the model to the data, adjusting rules and parameters to give good in-sample results, the closer the model comes to memorizing. We can think of the trades as signals hidden among the price and volume data. Strong signals produce profitable trades, but most of the strong signals have already been discovered and are incorporated in the trading applications of large firms (hedge funds, banks, etc.). What remains are weak signals among a noisy background.

As we search for models that fit the data more closely, we improve in-sample performance then test out-of-sample. Any adjustment to the model based on out-of-sample results causes the previously out-of-sample data to become in-sample data in an expanded fitting process. That iteration contaminates the out-of-sample data. At some point, we decide we have found a good system and want to trade it. The results we achieved during development are in-sample plus contaminated out-of-sample. Both sets of results include trades that result from recognition of the genuine signal as well as trades that result from fit to the random noise in the data.

Signals appear to be unique conditions, particularly the impulse signals we see as Buy and Sell. In fact, they form a distribution. Think mean, variance, kurtosis, skew. Use the entire distribution to the extent you can. At the very least, think beyond the mean.
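A small sketch of looking beyond the mean: the function below summarizes a set of trade results by its first four moments. The trade values are hypothetical, and population (not sample) moments are used for simplicity.

```python
def distribution_summary(trades):
    """Mean, variance, skewness, and excess kurtosis (population moments)."""
    n = len(trades)
    mean = sum(trades) / n
    m2 = sum((t - mean) ** 2 for t in trades) / n
    m3 = sum((t - mean) ** 3 for t in trades) / n
    m4 = sum((t - mean) ** 4 for t in trades) / n
    return {
        "mean": mean,
        "variance": m2,
        "skew": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

# Hypothetical per-trade percentage results, including one large loss.
trades = [1.2, -0.8, 0.9, 1.5, -0.4, 0.7, -3.5, 0.6, 1.1, 0.3]
for name, value in distribution_summary(trades).items():
    print(f"{name}: {value:.3f}")
```

The single large loss barely moves the mean but shows up clearly as negative skew, which is the kind of information the mean alone hides.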

We rely on the distribution of signals in the future data being similar to the distribution of signals in the development (in-sample and out-of-sample) data. That is the point when we discuss the need for stationarity. The future must resemble the past if we want the trades in the future to resemble the trades in the past.

As I, and essentially all professional data scientists, regularly state: in-sample results have no value in estimating future performance. They always underestimate risk and overestimate profit, because we do not stop developing until the fit is good and the trades are profitable. If validation is done using a single pass over previously unexamined data, the trades in the validation period are reasonable estimates of future performance to the extent that the future resembles the past. If there have been several uses of the validation data, it suffers from overfitting just as does the formal in-sample data.

Trading systems are designed to recognize inefficiencies in the price series and give signals that allow traders to make a profit by buying and selling at the inefficient prices. In so doing, each profitable trade removes some of the inefficiency the system was developed to recognize and profit from. Eventually enough traders will recognize that inefficiency and remove so much of it that small traders cannot make profits greater than costs of operation — data subscriptions, salaries, slippage, commission, taxes, etc. Eventually the profitability — particularly the risk-normalized profitability — of every system deteriorates and results are poorer than even the best out-of-sample results.

Hi Howard.

From my very basic knowledge and lack of great terminology, my understanding of working with price data and trying to optimize a trading system is that you need an in-sample series of data and an out-of-sample series of data. The goal is to develop a profitable system in the in-sample series that will duplicate itself in the out-of-sample series.

From what I know, the idea is to take the optimized parameters that best fit your objective function, use them in the system, and forward test it on the out-of-sample data.

Where I get confused is the idea of walk forward testing. In this case you’re using some sort of variable that can be optimized and you’re looking for a parameter to best fit your objective function, producing a series of in-sample and out-of-sample results to observe whether the parameter being optimized has discovered a pattern in the price series that proves to be a profitable model.

Now, in my head this is overfitting as I understand it, but as I said, I’m very much a novice at this quantitative trading approach.

Can you explain a little more about how one uses these changing optimizations or validates the results?

Thanks,

Richard

Hi Richard —

The purpose of walk forward testing is to give confidence that the system that was developed using historical data will be profitable when traded. That is, to give confidence that what worked in the past will continue to work in the future.

“Worked” means profitable with acceptable risk.

“In the past” means in the in-sample period used to develop the system and select the parameters.

“In the future” means in the out-of-sample period that follows the in-sample period. And eventually in the out-of-sample period that is real trading.

The walk forward process performs several sets of “learn in-sample” then “test out-of-sample,” stepping forward in time with each new learn / test.

As you describe, the parameters are chosen using a process of generating many alternative models — each of the alternatives is different than the others and unique in some way — different issues, or rules, or parameters. Each is scored using the objective function of your choice — the higher the score, the better that alternative meets your preferences of trading accuracy, holding period, profitability, and risk.

The walk forward process helps the developer understand how the system changes over time.

We need the system to be stationary over the length of the in-sample period plus out-of-sample period. The in-sample results will almost always be good. When the out-of-sample results are also good, the system is stationary. When the out-of-sample results are not good, that is an indication that the model was over-fit to the data in-sample and has not learned a persistent profitable pattern. That system must be returned to development — it is not tradable.

There is no way to tell how long the system will remain stationary, except by making some test runs.

As the walk forward process continues, the values of the parameters usually change. Well behaved systems change by small amounts. It is possible that the parameters are nearly the same over several learn / test sequences. If so, this is an indication that the system is stationary over long periods, which is a good thing. When the parameters do change, that is an indication of a regime change. Watching the change from one learn / test period to the next suggests that the system will need periodic retuning to keep the model synchronized with the data.

At the end of the walk forward runs, there is a set of trades from each of the out-of-sample periods. That entire set is the best estimate of future performance of the system. Use only the out-of-sample trades. The in-sample trades have no value in estimating future performance.

Questions often arise —

Is walk forward optimizing? Yes, definitely.

Is it data mining? Yes, definitely.

Is it over fitting? It is learning through fitting a model to some data. It is possible that the model is overfit. The resulting model could be underfit, well fit, or overfit. The test is whether the out-of-sample results are good. If they are good, the model is well fit. If they are poor, the model is poorly fit — perhaps overfit — perhaps underfit.

Why do the parameters change as the walk forward process moves through the data? Because the data changes. The model must change to stay synchronized with the data.

How long should the in-sample period be? There is no way to tell without doing some test runs. No rule of thumb will help. No single answer fits all systems.

How long should the out-of-sample period be? At a minimum, it must be long enough to generate enough trades to give data for analysis and to provide confidence. There is no maximum length. As long as the trades continue to meet your criteria, continue to use the system. When the trade performance deteriorates, reoptimize — just as if you were making one of the walk forward tests.

The parameters change throughout the test runs. What values should be used? Each out-of-sample period uses the values of the parameters chosen using the previous in-sample period.

Should I average the values a parameter takes on over all of the out-of-sample periods? No. You cannot expect the average to be best for any of the periods, or for the future. As the data changes, the model changes. Use the most recent model and its parameters.

What will happen when I trade this system? Nothing can be known for certain. The best estimate of future performance is the out-of-sample results from the walk forward run.

Best,

Howard

Hi Howard – Thank you.

If only there was a fourth dimension… That would make it easier 🙂 Though I like the idea of the broad approach, and then narrowing it down using the two parameter approach to check for the “gradually descending sides”.

The curse of dimensionality – I haven’t heard this before. Does this relate to over-fitting… and different dimensions… Does it mean that one parameter that is optimized in isolation may not work against another parameter that is optimized separately?

If so, is the only way around this to optimize them together?

Hi David —

The Wikipedia entry for Curse of Dimensionality gives some background.

Those of us who are developing trading systems are relatively fortunate. There is a dimension associated with each component of a trading system that can be modified during the development process. These include issue to trade, time frame, choice of indicator, values of indicator parameters, rules, values of trigger levels associated with rules, etc. We fix many of these early in the design — for example, the system is designed to trade SPY using end-of-day data, entering when recent price is over-extended looking for a return to normal over a few days. The variables left to determine are which indicator, how long a lookback period, what trigger levels. We seldom have more than a dozen or so dimensions in the parameter space we are testing. Compare with machine learning / data mining problems where there are often thousands of dimensions.

Fitting a model to a set of data involves solution to a set of simultaneous equations. When there are more variables (m columns) than data points (n rows), there are many equally good solutions — all likely to overfit. One year of daily data has 252 rows — each the price, indicator, and pattern data for one day. Increasing the number of rows requires either data of finer resolution (say, hourly bars) or longer test periods (several years). Both are potentially helpful, but with drawbacks. Intra-day data may be difficult or expensive to obtain, and the patterns less profitable. Measuring intra-day might imply trading intra-day, which might not be feasible. Extending the test period might result in overlapping of several regimes and loss of stationarity — trying to fit a single model when no one model is profitable throughout.

Limiting the number of columns might be the better approach. Determine which variables can be fixed for the test period and use fewer variable indicators. Begin the search / optimization using broad ranges, then narrower. Reserve some data to use for out-of-sample testing.

Best,

Howard

Thank you, sir. I did see that article – but couldn’t quite get my head around it.

Your explanation is much better.

I really appreciate having a forum to help me understand these things!

Hey Howard,

Hope you are well! I’d like to kick off the conversation, if I may. I have only a very rudimentary understanding of code and validating Trading Systems, so apologies if this question is basic. But I have lots of questions, and hope using a public forum like this means that others, like me, can benefit too.

I was just going back over your book QTS with a fresh set of eyes, saw your 2D bar graph of optimized inputs, and a lot of things finally made sense. You mentioned it is wise to use the one at the top of a “mound” of near-by good results, so to speak, instead of the one outlier good result. This is one step to help find a robust trading system. Excellent!

So then I started using the 3D Graph in Amibroker after optimizing two variables. I can still see the “mound”, although it’s now in 3D. Even better!

My question is: How do you analyse three or more optimized variables using this method? It seems to make sense – choosing the variables that are surrounded by good results – but on 3+ variables I can’t figure it out.

Would you simply use the 2 variable method separately for each of the third variables? That could take a very long time…

Best regards – David

Hi David —

The values being optimized are parameters of the system. Each combination of parameters results in a unique system. Associated with each is a value for the objective function that you have chosen to rank alternatives.

If there are two parameters, we can create a plot showing one parameter on the x axis, the second parameter on the y axis, and the objective function as a contour, similar to lines of equal elevation on a topographical map. The “best” system is the one with the highest value for the objective function. The best topography is a mound with a broad top and gradually descending sides. The best system has parameter values corresponding to the center of the mound.

The performance of a system changes as the data changes. Different characteristics of the data might cause the best values of the parameters to change. When the top of the objective function mound is broad and smooth, system results change smoothly. When the mound has sharp drop-offs, such as cliffs, a change in the data may cause the best parameters to change so that they fall off the cliff. System performance using the previous best values of the parameters is poor.
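One way to prefer the center of a broad mound over an isolated peak, without a plot, is to score each grid point by the average of its neighborhood. This is a sketch of that idea only; the 3x3 averaging window, the function name, and the synthetic objective surface are assumptions for illustration.

```python
def neighborhood_score(scores, x, y):
    """Average objective value over the 3x3 block of grid cells centered on (x, y)."""
    block = [scores[(i, j)]
             for i in (x - 1, x, x + 1)
             for j in (y - 1, y, y + 1)
             if (i, j) in scores]
    return sum(block) / len(block)

# Hypothetical objective values on a 7x7 grid of two parameters:
# a broad mound centered on (2, 2) plus one isolated spike at (5, 5).
scores = {(x, y): max(0.0, 10.0 - 2.0 * ((x - 2) ** 2 + (y - 2) ** 2) ** 0.5)
          for x in range(7) for y in range(7)}
scores[(5, 5)] = 12.0

raw_best = max(scores, key=scores.get)
smooth_best = max(scores, key=lambda p: neighborhood_score(scores, *p))
print("highest single value:", raw_best)
print("highest neighborhood average:", smooth_best)
```

The single highest value sits on the spike, where a small shift in the best parameters falls off a cliff, while the neighborhood average selects the center of the mound.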

Your question is how to best choose the best parameter values and how to visualize them.

One possibility is exactly as you suggest — pair-wise. Do some broad testing to identify the likely best values for all of the parameters, set all except two of them at values you expect to be reasonable, and optimize the pair. Then fix those two values and proceed to another pair, and so forth. This process works well when the parameters are independent. If there are interdependencies, the process of cycling through the pairs might not converge.
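The pair-wise cycling can be sketched as a loop that optimizes two parameters at a time while holding the rest fixed, and stops when a full cycle changes nothing. The function name, parameter names, grids, and toy objective below are hypothetical. The toy objective has independent parameters, which is the case where this approach behaves well.

```python
from itertools import combinations, product

def optimize_pairwise(objective, grids, start, max_cycles=10):
    """Optimize two parameters at a time, holding the others fixed.

    grids: dict mapping parameter name -> list of candidate values
    start: dict of initial values, one per parameter
    Returns (params, converged); converged is False if the cycling never settled.
    """
    params = dict(start)
    for _ in range(max_cycles):
        before = dict(params)
        for a, b in combinations(grids, 2):
            best_pair = max(
                product(grids[a], grids[b]),
                key=lambda ab: objective({**params, a: ab[0], b: ab[1]}),
            )
            params[a], params[b] = best_pair
        if params == before:
            return params, True
    return params, False

# Toy objective with independent parameters and a single peak (hypothetical).
def toy_objective(p):
    return -((p["lookback"] - 10) ** 2
             + (p["entry"] - 30) ** 2
             + (p["exit"] - 70) ** 2)

grids = {
    "lookback": list(range(2, 21, 2)),
    "entry": list(range(10, 51, 10)),
    "exit": list(range(50, 91, 10)),
}
start = {"lookback": 2, "entry": 10, "exit": 50}
result, converged = optimize_pairwise(toy_objective, grids, start)
print(result, "converged:", converged)
```

The `max_cycles` guard exists for the interdependent case, where the cycling may never settle.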

Another possibility is to begin with some broad analysis to determine whether the system seems to be stable around the set of best parameters, then optimize all of them in one pass, trusting your objective function. Follow that by sensitivity analysis to see what happens when the parameter values are changed.

Having a large number of parameters gives the model the best opportunity to be fit to the data, but also increases the likelihood that the model will be overfit to the data — that it will fit very well in-sample, but poorly out-of-sample. And having a large number of parameters increases the computational complexity and the “curse of dimensionality.” Additionally, data points are further apart as the dimensionality of the model increases, making it less likely that the n-dimensional mound will be smooth.

In summary, do your best to reduce the number of parameters in your model, use an objective function that you trust to rank alternatives as you would rank them subjectively, and pay close attention to the stationarity of the system. I have posted a video of a presentation I made, “The Importance of Being Stationary.”

Best regards,

Howard

Howard, I can’t thank you enough. Time and time again I’ve seen you take the time to reply to many people in clear detail. Thank you so much for clarifying this for me and for generously offering your time to do so.

Hi Dr. Bandy,

“You mentioned it is wise to use the one at the top of a “mound” of near-by good results, so to speak, instead of the one outlier good result. This is one step to help find a robust trading system. Excellent!”

Regarding David’s comment, I understand the idea, but my question is:

When should we do that? In walk forward validation the platform automatically chooses the alternative with the best objective function score. So should we do the walk forward manually, or is this an additional validation of the system?

Thanks a lot, Dr. Bandy, for your time and your wisdom.

You are correct.

The walk forward process is a sequence of steps. In each step:

1. The in-sample data is used to fit the model to the conditions for that period. Each of the many alternative models is evaluated using the objective function.

2. The single potential model with the highest objective function score for the in-sample period is used to compute trading signals for the following out-of-sample period.

The trader — the person developing the system — must have confidence that the objective function being used:

1. Assigns scores that rank the alternatives tested in the same order that the trader would rank them if done subjectively.

2. Chooses an alternative that is best — or at least satisfactory.

Ideally, the objective function prefers models that are located in stable areas of the parameter space being searched. When there are only one or two parameters, we can see the objective value as a function of the parameters on a plot. For three or more parameters, we can see some of the performance by using pairs of parameters, but that is both tedious and limiting.

The platform used for development provides an assortment of built-in objective functions, and may provide tools for construction of custom objective functions. Given a choice, choose or design an objective function that rewards good performance in alignment with your preferences. And be aware that some objective functions use terms that tend toward unstable results: terms that are dependent on the order in which the trades are added to the sequence. Begin by avoiding all position sizing in the model. Rather, use trades of a fixed size to isolate the percentage change per trade for stocks or dollar-per-contract change per trade for commodities. Given fixed trades, final equity will be independent of sequence. But drawdown will be sequence-dependent. So (try to) avoid objective functions that include terms that are sequence-dependent, such as drawdown.
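The difference between sequence-independent and sequence-dependent terms is easy to see with fixed-size trades. In this sketch the trade amounts and function names are hypothetical; the same six trades are evaluated in two different orders.

```python
def final_equity_change(trades):
    """Total gain from fixed-size trades: a plain sum, so order cannot matter."""
    return sum(trades)

def max_drawdown(trades):
    """Largest peak-to-trough drop of the cumulative equity curve."""
    equity = peak = worst = 0.0
    for t in trades:
        equity += t
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

# The same six hypothetical fixed-size trade results in two different orders.
alternating = [300, -200, 300, -200, 300, -200]
losses_clustered = [300, 300, 300, -200, -200, -200]

for name, seq in [("alternating", alternating), ("losses clustered", losses_clustered)]:
    print(f"{name}: final {final_equity_change(seq)}, max drawdown {max_drawdown(seq)}")
```

Both orderings end with the same final equity, but the drawdown is three times larger when the losses cluster, even though the distribution of trades is identical.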

Two techniques might be useful to test robustness of the system:

1. Test the robustness of the system to changes in the data by adding small amounts of random noise to the data.

2. Test the robustness to changes in the parameters by adding small amounts of random noise to the parameter values.
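Both robustness tests can be sketched with one helper that rescores the system many times under small random perturbations. Everything named here is an assumption for illustration: the helper functions, the 1% relative noise level, the synthetic price series, and the toy scoring rule standing in for a real backtest.

```python
import random
import statistics

def add_price_noise(prices, rel_noise, rng):
    """Jitter each price by a small random percentage."""
    return [p * (1.0 + rng.gauss(0.0, rel_noise)) for p in prices]

def add_param_noise(params, rel_noise, rng):
    """Jitter each numeric parameter by a small random percentage."""
    return {k: v * (1.0 + rng.gauss(0.0, rel_noise)) for k, v in params.items()}

def robustness(score_fn, prices, params, rel_noise=0.01, n_trials=200, seed=0):
    """Score the system under noisy data and, separately, under noisy parameters."""
    rng = random.Random(seed)
    data_scores = [score_fn(add_price_noise(prices, rel_noise, rng), params)
                   for _ in range(n_trials)]
    param_scores = [score_fn(prices, add_param_noise(params, rel_noise, rng))
                    for _ in range(n_trials)]
    return {
        "base": score_fn(prices, params),
        "data_noise_mean": statistics.mean(data_scores),
        "data_noise_stdev": statistics.stdev(data_scores),
        "param_noise_mean": statistics.mean(param_scores),
        "param_noise_stdev": statistics.stdev(param_scores),
    }

def toy_score(prices, params):
    """Hypothetical system: after a one-bar drop larger than `dip`, hold for one bar."""
    total = 0.0
    for i in range(1, len(prices) - 1):
        if prices[i] / prices[i - 1] - 1.0 < -params["dip"]:
            total += prices[i + 1] / prices[i] - 1.0
    return total

# Synthetic price series (a random walk), used only to exercise the helper.
rng = random.Random(42)
prices = [100.0]
for _ in range(500):
    prices.append(prices[-1] * (1.0 + rng.gauss(0.0003, 0.01)))

report = robustness(toy_score, prices, {"dip": 0.01})
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

A system whose score collapses or swings widely under such small perturbations is sitting near a cliff rather than on a plateau.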

Note that the size of the search space grows quickly as additional parameters are added and as the granularity of the search tests more values in each parameter.

In high dimension search spaces, individual points are increasingly distant from each other and isolated. At high dimension, there are fewer plains and more cliffs. Try to keep the number of parameters as low as practical for the model.
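Both effects can be illustrated numerically. The sketch below assumes 10 candidate values per parameter and 200 random points in a unit cube; these numbers and the function names are arbitrary choices for the illustration.

```python
import math
import random

def grid_size(values_per_param, n_params):
    """Number of combinations in an exhaustive grid search."""
    return values_per_param ** n_params

def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each random point in the unit cube to its nearest neighbor."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(pts):
        total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / n_points

for d in (1, 2, 4, 8):
    print(f"{d} parameters: {grid_size(10, d):>11,} grid points, "
          f"mean nearest-neighbor distance {mean_nearest_neighbor_distance(200, d):.3f}")
```

The grid grows by a factor of 10 with each added parameter, while the same 200 points become steadily more isolated from one another as the number of dimensions rises.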

Best regards, Howard