closes https://linear.app/tryghost/issue/MOM-83
- added additional labs flag to allow internal testing prior to private beta release
- bumped Koenig packages containing support for @-link feature
We want to use a randomly generated 64 byte secret for the hmac, and
utf8 encoding isn't nice to work with for this, so we're going to use a
base64 string and decode it into a buffer for the secret.
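A minimal sketch of the base64 round trip described above, using Node's built-in `crypto` module. The `sign` helper, the SHA-256 digest and the example value are assumptions made for illustration; they are not taken from this change.

```javascript
const crypto = require('node:crypto');

// Generate a random 64 byte secret and keep it as a base64 string,
// which is easier to store in config than raw bytes.
const secretBase64 = crypto.randomBytes(64).toString('base64');

// Decode the base64 string back into a Buffer before using it as the key.
const secret = Buffer.from(secretBase64, 'base64');

// Hypothetical helper: HMAC a value with the decoded secret.
// SHA-256 is an assumption here, not something this change specifies.
function sign(value, key) {
    return crypto.createHmac('sha256', key).update(value).digest('hex');
}

const signature = sign('example-tier-id', secret);
```

Decoding to a Buffer means the HMAC key is the original 64 random bytes rather than the bytes of the encoded string.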
ref
https://linear.app/tryghost/issue/KTLO-45/deploy-members-caching-solution-to-a-single-site-to-validate-and-test
Currently we only cache publicly available content. Any content that is
accessed by a logged in member is only cached for that specific member
based on their cookie. As a result, almost all requests from logged in
members bypass our caching layer and reach Ghost, which adds unnecessary
load to Ghost and its database.
This change adds experimental headers that allow our CDN to understand
which tier to cache the content against, and securely tell the CDN which
tier a logged in member has access to. With these changes, we can cache
the member content against the tier, rather than the individual member,
which should result in a higher cache HIT ratio and reduce the load on
Ghost.
For requests to the frontend of the site, Ghost will set a custom
`X-Member-Cache-Tier` header to the ID of the tier of the member who is
accessing the content. This tells the CDN which tier to cache the
content against.
For requests to either the `/members/?token=...` endpoint (the magic link
endpoint) or `/members/api/member`, Ghost will set a `ghost-access` and
`ghost-access-hmac` cookie with the ID of the tier of the logged in
member. With these two pieces of information, our CDN can serve cached
content to logged in members.
These headers are experimental, and can only be enabled via Ghost's
config. To enable these headers, set `cacheMembersContent:enabled` to
`true` and provide an HMAC key in `cacheMembersContent:hmacSecret`.
ref https://linear.app/tryghost/issue/CFR-29
- Removed the mobiledoc and lexical columns from the posts input
serializer, meaning they will no longer be queried for.
Get helpers are essentially a gateway to the Content API. We already
strip out the mobiledoc and lexical fields in the output
serializer/returned response, but this means we're passing the mobiledoc
and lexical fields back from the db. This is pointless and these fields
are substantial in size - by far the largest fields in the whole ghost
db - leading to slowed performance.
I've updated the posts input serializer to strip out the lexical and mobiledoc
columns so we stop doing a `select *` with every query.
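As a rough illustration of the idea (not the actual serializer code), stripping the heavy columns before the query is built could look like this. The helper name and the example column list are hypothetical.

```javascript
// Columns we never want to fetch for get helper queries.
const EXCLUDED_COLUMNS = ['mobiledoc', 'lexical'];

// Hypothetical helper: return the column list without the heavy fields,
// so the query selects named columns instead of `select *`.
function stripHeavyColumns(columns) {
    return columns.filter(column => !EXCLUDED_COLUMNS.includes(column));
}

// Example column list (illustrative only, not the full posts schema).
const allColumns = ['id', 'title', 'slug', 'mobiledoc', 'lexical', 'html'];
const selected = stripHeavyColumns(allColumns);
```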
ref ENG-824
- the bug caused resize prefixes to be added to images served from
outside of Ghost.
- the prefix is now only added to images served by Ghost; other image
URLs are served as is.
- we can determine that by checking whether imageName doesn't exist,
meaning the source is a third party.
- this mostly affects edge-case users, e.g. where a feature image URL was
passed in via the API and doesn't get served by Ghost.
refs CFR-21
Reorganised middleware execution so that member data is not redundantly loaded for static assets or the sitemap.
---------
Co-authored-by: Michael Barrett <mike@ghost.org>
fix https://linear.app/tryghost/issue/SLO-104/cannot-read-properties-of-undefined-reading-0-an-unexpected-error
- if the request body didn't contain the correct keys, it'd just HTTP
500 out of there
- this adds some optional chaining so we end up with undefined if
anything isn't as expected, and the following if-statement does the
rest of the check for us
- this also adds a breaking test (the first E2E test for authentication, yay!)
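A small sketch of the optional chaining pattern described above. The body shape (`setup[0].email`) and the helper names are hypothetical, chosen only to show how a missing key ends up as `undefined` and falls through to the existing check.

```javascript
// Hypothetical body shape: { setup: [{ email: '...' }] }
function getEmail(body) {
    // Any missing level yields undefined instead of throwing a TypeError.
    return body?.setup?.[0]?.email;
}

// The following if-statement then handles the "not as expected" case.
function validate(body) {
    const email = getEmail(body);
    if (!email) {
        return {status: 400, message: 'No email provided'};
    }
    return {status: 200, email};
}
```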
fix https://linear.app/tryghost/issue/SLO-101/http-500-with-invalid-multipart-data
- previously, busboy would error out if we supplied a body that was
invalid (such as an empty FormData)
- we would then return an HTTP 500 to the user, which causes all manner
of problems
- now we catch errors from busboy and return a nice BadRequestError
fix https://linear.app/tryghost/issue/SLO-85/fix-http-500-on-contentposts
- in the event we give an incorrectly formatted value in a filter, MySQL will
throw an error and we'll return an HTTP 500 error
- we can capture this error and return a more useful error to the user
- ideally we'd do this in a validation step before attempting the query,
but parsing this out of NQL and detecting which columns are DATETIME
could be quite tricky
fix https://linear.app/tryghost/issue/SLO-87/cannot-read-properties-of-undefined-reading-createimpl-an-unexpected
refs https://github.com/jsdom/jsdom/issues/3709
- in the event we are given some HTML to parse, and that fails, we
currently return an HTTP 500 because it's unhandled
- the instance we saw was due to `<constructor>` crashing jsdom; we've
opened an issue for that
- in terms of handling the error gracefully, we can surround the code
in a try-catch and return a more suitable error. I've gone for a
ValidationError for now - you could debate whether a different one is
more appropriate
- also added Sentry error capturing so we're not blind to these;
ultimately we should make sure the parser can handle all
user-submitted data
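The try-catch approach can be sketched roughly as follows. The `ValidationError` class below is a local stand-in (including its 422 status code) and `parseHtmlSafely` is a hypothetical name; the parser is passed in so the sketch does not depend on jsdom.

```javascript
// Local stand-in for a handled error type (assumption, not Ghost's class).
class ValidationError extends Error {
    constructor(message, options) {
        super(message, options);
        this.name = 'ValidationError';
        this.statusCode = 422;
    }
}

// Hypothetical wrapper: any parser crash becomes a handled error
// instead of bubbling up as an unhandled HTTP 500.
function parseHtmlSafely(html, parser) {
    try {
        return parser(html);
    } catch (err) {
        // Keep the original error as `cause` so it can still be reported.
        throw new ValidationError(`Unable to parse HTML: ${err.message}`, {cause: err});
    }
}
```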
fix https://linear.app/tryghost/issue/SLO-95/unexpected-end-of-multipart-data-for-broken-image-upload-request
- in the event the client sends an invalid body to the image or media
upload endpoints, Dicer will throw an error if the boundary data is
malformed
- previously, we've just been bubbling that up as an InternalServerError
and that results in an HTTP 500
- we can capture errors produced by Dicer and return a handled
BadRequestError, as it's the client's fault
- also includes breaking tests
fix https://linear.app/tryghost/issue/SLO-94/unexpected-field-when-given-broken-image-upload-request
- in the event the body of an image or media upload request is malformed
(broken metadata / blob or something), we get a MulterError and this
bubbles up as an InternalServerError and spits out an HTTP 500
- we can capture this and return a BadRequestError, as it's the client's
fault for not providing the correct body
- this implements that and adds breaking tests
fix https://linear.app/tryghost/issue/SLO-93/undefined-path-error-with-bad-image-upload
- in the event we receive a request to upload an image that doesn't
contain an image, we still try to unlink the files
- this is a dangling promise, so it doesn't cause an explicit HTTP
error, but it does show up as a console error
- fixed it by checking for the path, and early returning if it doesn't
exist
- also added a test that would fail without this
refs
[ENG-827](https://linear.app/tryghost/issue/ENG-827/🐛-crash-on-resizing-animated-gif)
Added a timeout to the image resizing middleware to prevent crashes when
an image is taking too long to resize. When the timeout is reached and
the image has not been resized, the middleware will return the original
image.
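One way to picture the timeout fallback is a `Promise.race` between the resize and a timer. This is a sketch under assumptions: `resizeWithTimeout` and its parameters are hypothetical names, and the real middleware may be structured differently.

```javascript
// Hypothetical wrapper: resolve with the resized image, or with the
// original image if resizing takes longer than `timeoutMs`.
// If `resize` itself rejects first, the returned promise rejects too.
function resizeWithTimeout(resize, original, timeoutMs) {
    let timer;
    const timeout = new Promise((resolve) => {
        timer = setTimeout(() => resolve(original), timeoutMs);
    });

    return Promise.race([resize(original), timeout])
        // Always clear the timer so it does not keep the process alive.
        .finally(() => clearTimeout(timer));
}
```

Note that the race only stops waiting for the resize; it does not cancel the underlying work.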
ref
https://linear.app/tryghost/issue/ENG-851/implement-a-minimal-but-complete-version-of-redirect-caching-to
ref https://app.incident.io/ghost/incidents/55
Often immediately after sending an email, sites receive a large volume
of requests to LinkRedirect endpoints from members clicking on the links in
the email.
We currently don't cache any of these requests in our CDN, because we
also record click events, update the member's `last_seen_at` timestamp,
and send webhooks in response to these clicks, so Ghost needs to handle
each of these requests itself. This means that each of these LinkRedirect requests
hits Ghost, and currently all these requests hit the database to look up
where to redirect the member to.
Each one of these requests can make up to 11 database queries, which can
quickly exhaust Ghost's database connection pool. Even though the
LinkRedirect lookup query is fairly cheap and quick, these queries aren't
prioritized over the "record" queries Ghost needs to handle, so they can
get stuck behind other queries in the queue and eventually time out.
The result is that members are unable to actually reach the destination
of the link they clicked on, instead receiving a 500 error in Ghost, or
it can take a long time (60s+) for the redirect to happen.
This PR uses our existing `adapterManager` to cache the redirect lookups
either in-memory or in Redis (if configured; by default there is no caching). This only removes 1 out of
11 queries per redirect request, so it won't reduce the load on the DB
drastically, but it at least decouples the serving of the LinkRedirect from
the DB so the member can be redirected even if the DB is under heavy
load.
Local load testing results have shown a decrease in response times from
60 seconds to ~50ms for the redirect requests when handling 500 requests
per second, and reduced the 500 error rate to 0.
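The lookup caching can be sketched as a thin wrapper that checks a cache before falling back to the database. Everything named here (`MemoryCache`, `CachedRedirectLookup`, `findInDb`, `getByPath`) is hypothetical and only illustrates the shape of the idea, not the adapter interface used in this PR.

```javascript
// Hypothetical in-memory cache with a simple get/set interface.
class MemoryCache {
    constructor() {
        this.store = new Map();
    }
    get(key) {
        return this.store.get(key);
    }
    set(key, value) {
        this.store.set(key, value);
    }
}

// Hypothetical wrapper: serve the redirect from the cache when possible,
// so the lookup no longer depends on a free database connection.
class CachedRedirectLookup {
    constructor({cache, findInDb}) {
        this.cache = cache;
        this.findInDb = findInDb;
    }

    async getByPath(path) {
        const cached = this.cache.get(path);
        if (cached !== undefined) {
            return cached;
        }
        const redirect = await this.findInDb(path);
        if (redirect) {
            // Only cache hits; unknown paths still go to the database.
            this.cache.set(path, redirect);
        }
        return redirect;
    }
}
```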
ref https://linear.app/tryghost/issue/ENG-826
- Changed staff deletion logic to do a bulk insert when adding a tag to
the users' associated posts
Staff deletion logic has really poor performance at scale because we do
individual updates for every post. If a user has dozens+ posts
(especially in a large db with thousands of posts), this can take >60s
and look like a timeout. Ultimately this should probably be a jobbed off
process, but for the time being we can improve this by doing a bulk
insert.
Note that this update uses the pattern for the bulk tagging of posts
from the right click (bulk) actions in the posts lists in Admin. With
bulk actions, **we do not trigger webhooks or the post.edited events**.
We will document this and follow up on this separately.
ref
https://linear.app/tryghost/issue/ENG-845/error-attempted-to-set-lexical-on-the-deleted-record
ref
[https://linear.app/tryghost/issue/ENG-854/🐛-deleting-imported-posts-makes-ghost-unresponsive](https://linear.app/tryghost/issue/ENG-854/%F0%9F%90%9B-deleting-imported-posts-makes-ghost-unresponsive)
- When deleting a post in the editor's Post Settings Menu, if the post
has unsaved changes (indicated by the hasDirtyAttributes property in the
editor), Admin will crash because it tries to save a post revision
before leaving the editor, but the post has already been deleted so
saving fails.
- This can occur when editing a post and quickly deleting it from the
Post Settings Menu before saving is completed.
- It can also occur when attempting to delete an imported post, as the
editor will parse the lexical from the server and may make some minor,
invisible-to-the-user changes to the lexical string locally (e.g. JSON
formatting, or updating the JSON to use extended versions of base lexical
nodes), which triggers the same error.
- This fix bypasses the attempt to save a post revision when leaving the
editor if the post is already deleted, which allows the transition back
to the Posts route to succeed.
refs f39d1d3aa3
- similar to the commit above, the JSON parser changed between Node 18
and Node 20, so the error message changed too
- we actually just want to check the error is forwarded to the user, so
we can do that by getting the error message from JSON.parse and checking
against that
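A sketch of how a test can derive the expected message from the runtime itself rather than hard-coding it. The helper name and the sample input are hypothetical.

```javascript
// Return whatever message this Node version's JSON.parse produces for
// the given input, or null if the input is valid JSON.
function getJsonParseError(input) {
    try {
        JSON.parse(input);
    } catch (err) {
        return err.message;
    }
    return null;
}

// Hypothetical invalid body; the message text differs between Node
// versions, so the test compares against this value instead of a literal.
const invalidBody = '{"title": ';
const expectedMessage = getJsonParseError(invalidBody);
```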
ref 78311591d0
- updated tests to not click a button on the setup/done screen that is no longer shown
- fixed setup flow showing an alert bar due to not handling the `TransitionAborted` error that is thrown by the setup/done->dashboard redirect
ref https://linear.app/tryghost/issue/KTLO-1/members-spam-signups
- Some customers are seeing many spammy signups ("hundreds a day"). Our
hypothesis is that bots and/or email link checkers are able to sign up by
simply following the link in the email without even loading the page in
a browser.
- Currently new members sign up by clicking a magic link in an email,
which is a simple GET request. When the user (or a bot) clicks that link, Ghost
creates the member and signs them in for the first time.
- This change, behind an alpha flag, requires a new member to click the
link in the email, which takes them to a new frontend route `/confirm_signup/`, then submit a form on the page which sends a POST request to the
server. If JavaScript is enabled, the form will be submitted
automatically so the only change to the user is an extra flash/redirect
before being signed in and redirected to the homepage.
- This change is behind the alpha flag `membersSpamPrevention` so we can
test it out on a few customers' sites and see if it helps reduce the
spam signups. With the flag off, the signup flow remains the same as
before.
ref https://linear.app/tryghost/issue/ENG-790/remove-use-of-sub-queries-in-email-analytics
- the `delivered_at` column is typically entirely or almost entirely filled with values, meaning the `IS NOT NULL` query matches a huge number of rows that MySQL has to fetch from the index to count
- using `IS NULL` switches that behaviour around, as it now matches very few rows, which has been shown in testing to be considerably quicker
- after switching to `IS NULL` the query returns an "undelivered" count rather than a "delivered" count. To keep the rest of the system behaviour the same, we calculate the delivered count by subtracting the query result from the total number of emails sent, which we can fetch using a very fast primary key lookup query on the `emails` table
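The arithmetic in the last point, as a tiny sketch. The function name and the numbers are made up for illustration.

```javascript
// Hypothetical helper: derive the delivered count from the total number
// of emails sent and the (small) number of rows where delivered_at IS NULL.
function getDeliveredCount(totalSent, undeliveredCount) {
    return totalSent - undeliveredCount;
}

// Illustrative numbers only: 10000 sent, 37 not yet delivered.
const delivered = getDeliveredCount(10000, 37);
```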
ref https://linear.app/tryghost/issue/CFR-13
- enabled saving traces on browser test failure; this makes troubleshooting a lot easier
- updated handling in offers tests to ensure the tier has fully loaded in the UI (not just `networkidle`)
- updated publishing test to examine the publish button reaction to the save action response instead of a 300ms pause
In general, our tests use a lot of watching for 'networkidle' - and sometimes just raw timeouts - which do not scale well into running tests on CI. In particular, 'networkidle' does not work if we're expecting to see React components' state updates propagate and re-render. We should always instead look to the content which encapsulates the response and the UI updates. This is something we should tackle on a larger scale.
ref ENG-774
ref https://linear.app/tryghost/issue/ENG-774
Staff Tokens will have both a `user` and an `apiKey` present on the
`loadedPermissions`.
The check here for `apiKey` was written when we could assume that an
`apiKey` was an Admin Integration - so it completely overwrote the
previous `allowed` list. When we added the concept of Staff Tokens -
this resulted in a privilege escalation.
This is a good lesson in not using proxies or indicators for data, as
changes elsewhere can invalidate them - if we had been specific and
checked the role of the current actor we wouldn't've had this bug!
ref https://linear.app/tryghost/issue/CFR-4/
- added request queueing middleware (express-queue) to handle high
request volume
- added new config option `optimization.requestQueue`
- added new config option `optimization.requestConcurrency`
- added logging of request queue depth - `req.queueDepth`
We've done a fair amount of investigation around improving Ghost's
resiliency to high request volume. While we believe this to be partly
due to database connection contention, it also seems Ghost gets
overwhelmed by the requests themselves. Implementing a simple queueing
system gives us a lever to control the volume of requests Ghost
is actually ingesting at any given time, and options besides
simply increasing database connection pool size.
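To illustrate the general idea of a concurrency-limited queue (this is not express-queue's implementation, and `createRequestQueue` is a hypothetical name), a minimal version might look like:

```javascript
// Minimal sketch: run at most `concurrency` jobs at once and queue the rest.
function createRequestQueue({concurrency}) {
    let active = 0;
    const pending = [];

    function runNext() {
        if (active >= concurrency || pending.length === 0) {
            return;
        }
        active += 1;
        const job = pending.shift();
        // Each job gets a `done` callback to release its slot.
        job(() => {
            active -= 1;
            runNext();
        });
    }

    return {
        push(job) {
            pending.push(job);
            runNext();
        },
        // Number of jobs waiting for a slot (comparable to a queue depth).
        get depth() {
            return pending.length;
        },
        get active() {
            return active;
        }
    };
}
```

Lowering `concurrency` trades latency for stability: requests wait in the queue instead of all competing for database connections at once.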
---------
Co-authored-by: Michael Barrett <mike@ghost.org>
ref ENG-728
ref https://linear.app/tryghost/issue/ENG-728
This is not used anywhere and makes the code more complicated; removing it is a good
step toward simplifying permissions and pulling them out of the database.
ref ENG-728
ref https://linear.app/tryghost/issue/ENG-728
This is NOT a functionality change. The Post#permissible method unit
tests have been updated to pass `true` as `hasUserPermission` and we can
see that the permission functionality remains the same.
The permissible method of the post model is responsible for removing
permission based on the data that is being modified, but the permissions
module is set up to allow the permissible method to grant permission.
This means that we call permissible even if the current actor doesn't
have permission, which results in code that is hard to understand and
manage.
Instead, we are going to return early if an actor does not have
permission. This will allow permissible method signatures to be greatly
simplified (removing the need for the hasUserPermission, hasApiKeyPermission
& hasMemberPermission arguments).
fixes https://linear.app/tryghost/issue/ENG-746/http-500-responses-when-handle-image-sizes-middleware-hits-missing
- in the event a request comes in for a resized image, but the source
image does not exist, we return a rendered 404 page
- we do this because we pass the NotFoundError to `next`, which skips
over the static asset code where we return a plaintext 404
- also included a breaking test that ensures we go to the next middleware
without an error
ref ENG-761
ref https://linear.app/tryghost/issue/ENG-761
Creating these pipelines is expensive, and we don't want to do it
repeatedly for the same controller. Adding caching should reduce the
amount of time spent setting up pipelines for each usage of the `get`
helper.
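A sketch of the caching idea, keyed on the controller object. `buildPipeline` is a stand-in for the expensive setup step and `getCachedPipeline` is a hypothetical name; neither is taken from the actual code.

```javascript
let buildCount = 0;

// Stand-in for the expensive pipeline construction step (hypothetical).
function buildPipeline(controller) {
    buildCount += 1;
    const pipeline = {};
    for (const method of Object.keys(controller)) {
        pipeline[method] = (...args) => controller[method](...args);
    }
    return pipeline;
}

// A WeakMap keyed by controller means entries disappear along with
// the controller object itself.
const pipelineCache = new WeakMap();

function getCachedPipeline(controller) {
    if (!pipelineCache.has(controller)) {
        pipelineCache.set(controller, buildPipeline(controller));
    }
    return pipelineCache.get(controller);
}

// Hypothetical controller used twice, built once.
const postsController = {browse: options => ({posts: [], options})};
const first = getCachedPipeline(postsController);
const second = getCachedPipeline(postsController);
```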
ref [ENG-747](https://linear.app/tryghost/issue/ENG-747/)
ref https://linear.app/tryghost/issue/ENG-747
H'okay - so what we're trying to do here is make get helper queries more
cacheable. The way we're doing that is by modifying the filter used when
we're trying to remove a single post from the query.
The idea is that we can remove that restriction on the filter, increase
the number of posts fetched by 1 and then filter the fetched posts back
down. This means that queries which differ only in the post they exclude
will be rewritten to make _exactly_ the same query, and so share a cache!
We've been purposefully restrictive in the types of filters we
manipulate, so that we only deal with the simplest cases and the code is
easier to understand.
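Roughly, the rewrite could be sketched like this. The function names and the regular expression are hypothetical and intentionally handle only the simplest filter shape (`...+id:-<id>` with no OR groups or parentheses); the real implementation may differ.

```javascript
// Matches `id:-<id>` on its own or as the last AND-ed clause of a filter.
const EXCLUDE_ID = /^(?:(.+)\+)?id:-([a-zA-Z0-9]+)$/;

// Hypothetical rewrite: drop the exclusion and fetch one extra post.
function makeCacheable({filter, limit}) {
    const match = EXCLUDE_ID.exec(filter || '');
    // Be restrictive: leave anything with OR (`,`) or grouping untouched.
    if (!match || (match[1] && /[,()]/.test(match[1]))) {
        return {query: {filter, limit}, excludeId: null};
    }
    const [, rest, excludeId] = match;
    const query = {limit: limit + 1};
    if (rest) {
        query.filter = rest;
    }
    return {query, excludeId};
}

// Filter the fetched posts back down to the originally requested set.
function trimResults(posts, excludeId, limit) {
    return posts.filter(post => post.id !== excludeId).slice(0, limit);
}
```

With this shape, `tag:news+id:-abc` and `tag:news+id:-def` both become the same `tag:news` query with `limit + 1`, so one cached response can serve both.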
closes ENG-632
- This listens to a new property in the `milestones` config that sets the minimum Milestone value for which we want to use the Slack notification service