When you create those models, are they georeferenced to some known degree of precision, or just created from, for example, a photogrammetry process? If the photo coordinates rely on normal GPS accuracy only, I suspect ContextCapture makes a best guess at where the model should be placed. Photo GPS accuracy is roughly +/- 3m, though, so I suspect that’s your initial error margin.
Using PPK (post-processed kinematic, i.e. post-processing of the coordinate data) you can improve the photo GPS accuracy considerably, depending on how closely the timestamp matches the moment the image was actually taken (a normal camera’s time precision is a bit dodgy). Some add-on boxes for your camera (for drones, planes, cars, etc.) have their own high-grade timer (microsecond precision) and positioning hardware (like IMUs, etc.) that you can use to post-process those images to a much higher accuracy (triangulation and adjustment based on the accurate time you took the picture and the GPS satellite data, coupled with data from the CORS station closest to that location), going from +/- 3m down to roughly 1-3cm lat/long and 2-5cm height. Once you use those accurate coordinates instead of the camera’s internal ones when you create your mesh model, it will also be better georeferenced and end up in a more correct place.
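To put a number on why the timestamp matters so much, here is a rough sketch: the position error is simply how far the platform travels during the timing error. The helper name `positionErrorMeters` and all the values (10 m/s drone, 50 ms camera clock error, 1 microsecond timer error) are my own assumptions for illustration, not measurements.

```typescript
// Hypothetical helper: distance travelled by the platform during a timestamp error.
function positionErrorMeters(
  speedMetersPerSecond: number,
  timestampErrorSeconds: number
): number {
  return Math.abs(speedMetersPerSecond * timestampErrorSeconds);
}

// Assumed example values, not measurements.
const droneSpeed = 10; // m/s
const cameraClockError = 0.05; // 50 ms, a sloppy internal camera clock
const externalTimerError = 0.000001; // 1 microsecond, a high-grade external timer

// About half a metre of along-track error from the camera clock alone.
const cameraClockErrorMeters = positionErrorMeters(droneSpeed, cameraClockError);

// About a hundredth of a millimetre with the external timer.
const externalTimerErrorMeters = positionErrorMeters(droneSpeed, externalTimerError);

console.log(cameraClockErrorMeters, externalTimerErrorMeters);
```

So with a sloppy clock, the timing alone can eat the whole centimetre-level budget, no matter how good the GPS solution is.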
Next is to look at the terrain models. I don’t know the accuracy of the Cesium World Terrain model, to be honest, so take this with a grain of salt. Some terrain models are more accurate than others, and some are also quantized, meaning they’re based off a grid that is calculated to some degree of complexity and precision. If a terrain model is accurate to a 30m grid, it can still be accurate in height at the grid points (say +/- 1m, nominally), but anything between the grid points is calculated, which is a problem if the terrain is especially bumpy (high in contrast) where your model is situated. Some quantization methods are good at giving high resolution in complex areas and low resolution in flats, and anything in between; it’s a minefield in itself. World terrain models are usually good for generic stuff, but if you want high precision you often have to provide your own (and if your model is georeferenced properly, the same process can create DTMs, models, point clouds, etc.).
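As a small illustration of the "anything between the grid points is calculated" problem, here is a sketch that uses plain bilinear interpolation inside one grid cell. That is only one possible way a terrain provider might fill in a cell, and I am not claiming Cesium World Terrain does it like this. The function name, the corner heights and the "true" height are all made-up values.

```typescript
// Bilinear interpolation inside one grid cell.
// h00, h10, h01, h11 are the heights at the four corner posts;
// u and v are the fractional positions inside the cell, each in [0, 1].
function bilinearHeight(
  h00: number,
  h10: number,
  h01: number,
  h11: number,
  u: number,
  v: number
): number {
  const south = h00 + (h10 - h00) * u; // interpolate along the southern edge
  const north = h01 + (h11 - h01) * u; // interpolate along the northern edge
  return south + (north - south) * v; // blend between the two edges
}

// Assumed example: a 30 m cell whose corner posts are accurate,
// with a mound in the middle that no post samples.
const interpolatedHeight = bilinearHeight(100, 102, 101, 103, 0.5, 0.5);
const assumedTrueHeight = 109.5; // hypothetical surveyed height at the cell centre
const missedByMeters = assumedTrueHeight - interpolatedHeight;

console.log(interpolatedHeight, missedByMeters);
```

Every corner post can be spot on and the model placed in the middle of the cell still floats or sinks by metres, which is why bumpy terrain is where this hurts.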
Anyway, take a screenshot of what you’re seeing, and we can try to help you a bit further. But I’d definitely start by looking at getting the georeferencing right (especially if this is a dynamic application that updates the data from time to time), or (much cheaper) look into height adjustments for 3D tilesets, and adjust them statically until you’re happy (and apply that adjustment every time you load your scene). More on that here:
https://sandcastle.cesium.com/gallery/3D%20Tiles%20Adjust%20Height.html
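For reference, the idea behind a static height adjustment can be sketched without any library: take the tileset's position as longitude/latitude, compute the point on the ellipsoid surface and the point at the desired height offset, and use the difference as a translation. The code below is my own standalone version of that math on the WGS84 ellipsoid; the names `geodeticToEcef` and `heightOffsetTranslation` and the example coordinates are hypothetical. As far as I recall, the Sandcastle does the equivalent with `Cesium.Cartesian3.fromRadians`, `Cesium.Cartesian3.subtract` and `Cesium.Matrix4.fromTranslation`, assigning the result to `tileset.modelMatrix`, but check the linked example rather than trusting my memory.

```typescript
type Vec3 = { x: number; y: number; z: number };

// WGS84 ellipsoid constants.
const WGS84_A = 6378137.0; // semi-major axis in metres
const WGS84_F = 1 / 298.257223563; // flattening
const WGS84_E2 = WGS84_F * (2 - WGS84_F); // first eccentricity squared

// Convert longitude/latitude (radians) and ellipsoidal height (metres) to ECEF.
function geodeticToEcef(lonRad: number, latRad: number, height: number): Vec3 {
  const sinLat = Math.sin(latRad);
  const cosLat = Math.cos(latRad);
  const n = WGS84_A / Math.sqrt(1 - WGS84_E2 * sinLat * sinLat);
  return {
    x: (n + height) * cosLat * Math.cos(lonRad),
    y: (n + height) * cosLat * Math.sin(lonRad),
    z: (n * (1 - WGS84_E2) + height) * sinLat,
  };
}

// Hypothetical helper: translation that moves a model up or down by
// heightOffset metres along the ellipsoid normal at the given position.
function heightOffsetTranslation(
  lonRad: number,
  latRad: number,
  heightOffset: number
): Vec3 {
  const surface = geodeticToEcef(lonRad, latRad, 0);
  const raised = geodeticToEcef(lonRad, latRad, heightOffset);
  return {
    x: raised.x - surface.x,
    y: raised.y - surface.y,
    z: raised.z - surface.z,
  };
}

// Assumed example: lower a model at 10°E, 60°N by 2.5 m.
const degreesToRadians = Math.PI / 180;
const translation = heightOffsetTranslation(
  10 * degreesToRadians,
  60 * degreesToRadians,
  -2.5
);
const translationLength = Math.hypot(translation.x, translation.y, translation.z);

console.log(translation, translationLength);
```

The translation always has the same length as the offset you asked for, so you can tweak a single number until the model sits right and then hard-code it on every load.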
Cheers,
Alex