
The red flag came when the server reported that a site was using 57992.5 MB of disk space, while the same site locally came to 957 MB. Something serious was clearly wrong and I needed to find the cause quickly. With a discrepancy this large between a local environment and a server, the first job is to identify what is actually consuming the space on the server.

 

Approaches for discovering the issue(s)

 

1. Analyse Large Directories

Use a tool like du (disk usage) in a Unix/Linux environment to find out which directories are using the most space. This command will help you spot any unusually large directories.

du -ah /path/to/site | sort -rh | head -20

 

2. Check Logs and Caches

Logs: Servers often generate detailed logs which can grow very large over time. Check directories such as /var/log or specific application logs within the site directory.

Cache Files: Web applications, especially content management systems like Drupal, WordPress, etc., store cache files that can increase in size.  Check your application's cache directory.
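As a rough sketch of the log check, the function below lists files above a size threshold, largest first. The name large_files and the +100M default are illustrative choices, and -printf assumes GNU find (standard on most Linux servers).

```bash
# List files larger than a threshold under a directory, largest first.
# Usage: large_files DIR [MIN_SIZE]   (MIN_SIZE uses find's -size syntax, e.g. +100M)
# NOTE: large_files is a hypothetical helper name; -printf is a GNU find extension.
large_files() {
  local dir="${1:-.}" min_size="${2:-+100M}"
  # %s = size in bytes, %p = path; unreadable directories are skipped silently
  find "$dir" -type f -size "$min_size" -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn \
    | head -n 20
}
```

For example, large_files /var/log +50M or large_files ./logs would show which individual log files deserve a closer look.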

 

3. Database Size

The size of the database can also vary greatly between development and production environments. Check the size of the database on the server. For MySQL/MariaDB, you can use:

SELECT table_schema AS "Database", ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS "Size (MB)" FROM information_schema.TABLES GROUP BY table_schema;
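If one database stands out, a per-table breakdown can show where the space goes. The sketch below wraps a similar information_schema query in a shell function; largest_tables is a hypothetical name, and it assumes the mysql client can already authenticate (for example through ~/.my.cnf). The database name is interpolated directly into the query, so only pass trusted values.

```bash
# Show the largest tables in one database, biggest first.
# Usage: largest_tables DB_NAME [LIMIT]
# ASSUMPTION: credentials come from ~/.my.cnf or similar; DB_NAME is trusted input.
largest_tables() {
  local db="$1" limit="${2:-20}"
  mysql --table -e "
    SELECT table_name AS 'Table',
           ROUND((data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
    FROM information_schema.TABLES
    WHERE table_schema = '${db}'
    ORDER BY (data_length + index_length) DESC
    LIMIT ${limit};"
}
```

In a CMS such as Drupal, cache and log tables are often the ones that surface at the top of this list.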

 

4. Review File Uploads

Users may upload more and larger files in a live environment than in a local testing environment. Check directories that contain user uploads, such as files or uploads.
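A quick way to compare upload directories is to print the total size and file count for each one. This is a minimal sketch; dir_summary is a hypothetical helper, and the paths in the usage note are common CMS defaults that may not match your site.

```bash
# Print total size, file count and path for each directory given.
# Usage: dir_summary DIR [DIR...]
dir_summary() {
  local dir size count
  for dir in "$@"; do
    [ -d "$dir" ] || { echo "skip: $dir (not a directory)" >&2; continue; }
    size=$(du -sh "$dir" | cut -f1)                    # human-readable total
    count=$(find "$dir" -type f | wc -l | tr -d ' ')   # number of regular files
    printf '%s\t%s files\t%s\n' "$size" "$count" "$dir"
  done
}
```

Running dir_summary httpdocs/sites/default/files (Drupal) or dir_summary httpdocs/wp-content/uploads (WordPress) on both machines gives two figures that are easy to compare.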

 

5. Version Control Anomalies

Sometimes, version control systems like Git can cause issues. Check if there are large files tracked by Git that shouldn’t be, or if the .git folder itself has become bloated.
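To see whether a few large files are responsible for a bloated repository, the following sketch lists the biggest blobs in its history. largest_git_blobs is a hypothetical helper name; the git subcommands and format placeholders are standard, and it should work for both normal and bare repositories.

```bash
# List the largest blobs anywhere in a repository's history (size in bytes, then path).
# Usage: largest_git_blobs [REPO_DIR] [COUNT]
largest_git_blobs() {
  local repo="${1:-.}" count="${2:-10}"
  git -C "$repo" rev-list --objects --all \
    | git -C "$repo" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '$1 == "blob" { size = $3; $1 = $2 = $3 = ""; sub(/^ +/, ""); print size "\t" $0 }' \
    | sort -rn \
    | head -n "$count"
}
```

For an overall figure, git -C REPO_DIR count-objects -vH reports the packed and loose object sizes in human-readable form.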

 

6. Backup Files

Check for backup files that might be stored within the site's directory structure. Automated backup processes can sometimes leave behind large files.
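One way to hunt for stray backups is to search for common archive and dump extensions. The extension list below is an assumption about how backups tend to be named, so adjust it to your own tooling; find_backups is a hypothetical helper name and sort -h assumes GNU sort.

```bash
# List likely backup files under a directory, largest first.
# Usage: find_backups [DIR]
# ASSUMPTION: backups use one of the extensions listed here.
find_backups() {
  local dir="${1:-.}"
  find "$dir" -type f \( \
      -name '*.sql' -o -name '*.sql.gz' -o -name '*.tar.gz' \
      -o -name '*.tgz' -o -name '*.zip' -o -name '*.bak' \
    \) -exec du -h {} + \
    | sort -rh
}
```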

 

7. Temporary Files

Temporary files and directories can accumulate, especially if cleanup processes fail. Check for these and clear them if necessary.
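Before clearing anything, it helps to list what is stale. This sketch reports files not modified for more than a given number of days; old_files is a hypothetical name, and the ./tmp and 7-day defaults are illustrative.

```bash
# List files older than N days under a directory, largest first.
# Usage: old_files [DIR] [DAYS]
old_files() {
  local dir="${1:-./tmp}" days="${2:-7}"
  # -mtime +N matches files last modified more than N whole days ago
  find "$dir" -type f -mtime +"$days" -exec du -h {} + | sort -rh
}
```

Once the list has been reviewed, the same find expression with -delete in place of the -exec clause removes the files.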

 

8. Media and Static Content

Compare the size and count of media files like images, videos, and PDFs between the local and server environments. Tools like rsync can help you compare directories effectively.
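As an alternative to a file-by-file comparison, a per-extension summary can be run on both machines and the two outputs compared side by side. media_summary is a hypothetical helper; it assumes GNU find for -printf and ignores filenames containing tabs or newlines.

```bash
# Summarise files by extension: extension, file count, total bytes (largest total first).
# Usage: media_summary [DIR]
media_summary() {
  local dir="${1:-.}" tab
  tab=$(printf '\t')
  find "$dir" -type f -printf '%s\t%f\n' \
    | awk -F '\t' '{
        n = split($2, parts, ".")
        ext = (n > 1) ? tolower(parts[n]) : "(none)"   # case-insensitive extension
        count[ext]++
        bytes[ext] += $1
      }
      END { for (e in count) printf "%s\t%d\t%.0f\n", e, count[e], bytes[e] }' \
    | sort -t "$tab" -k3,3nr
}
```

If rsync is preferred, a dry run such as rsync -avn --delete local/ user@server:/path/to/site/ lists what differs without changing anything; the host and path there are placeholders.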

 

9. Binary Executables and Dependencies

Sometimes, servers might have additional dependencies or compiled binaries that are larger or more numerous than those on a local machine.
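Dependency directories are easy to measure on their own. The sketch below assumes the dependencies live in directories named node_modules (npm) or vendor (Composer); dependency_dirs is a hypothetical helper name.

```bash
# Report the size of each top-most node_modules or vendor directory, largest first.
# Usage: dependency_dirs [DIR]
# ASSUMPTION: dependencies live in directories named node_modules or vendor.
dependency_dirs() {
  local dir="${1:-.}"
  # -prune stops find descending into a match, so nested copies are not double-counted
  find "$dir" -type d \( -name node_modules -o -name vendor \) -prune \
    -exec du -sh {} + \
    | sort -rh
}
```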

 

Steps I took and responses

Step 1 - Analyse Large Directories

I started with step 1, Analyse Large Directories, and ran the following command:

du -ah ./ | sort -rh | head -20

Response

du -ah ./ | sort -rh | head -20
42G	./
38G	./logs
36G	./logs/access_ssl_log.processed
1.7G	./tmp
806M	./git
788M	./repo/{site}
788M	./repo
656M	./tmp/themes
638M	./tmp/themes/custom
606M	./logs/access_ssl_log.webstat
525M	./tmp/httpdocs
483M	./logs/access_ssl_log
447M	./git/{site}.git
442M	./git/{site}.git/objects
428M	./git/{site}.git/objects/pack
415M	./httpdocs
405M	./dev
404M	./tmp/themes/custom/claroness
398M	./tmp/themes/custom/claroness/node_modules
270M	./git/{site}-v1-old.git

Okay, wow. In the logs directory, the file ./logs/access_ssl_log.processed is consuming 36G of the 42G total.

What is this file?

The file ./logs/access_ssl_log.processed is most likely an access log recording HTTPS (SSL/TLS) requests to the server. The .processed suffix suggests the data has already been through some log-processing step, although the name alone does not say which. Here is a general breakdown of its likely purpose and contents:

Purpose of the File
Access Logging

It records incoming requests that the server has processed over HTTPS. This may include data like IP addresses of clients, user agents (browser types), URLs accessed, timestamps of each request, status codes returned by the server, and potentially more detailed transaction data.

Security and Monitoring

Such logs are vital for security audits, monitoring server performance, tracking user activities, and diagnosing issues with the web services offered over SSL/TLS.

Compliance and Analysis

For businesses that must comply with various regulations, maintaining access logs can be crucial for compliance with laws requiring data on user interactions. These logs are also used for traffic analysis and understanding user behavior.

 

Processed Data

The .processed extension hints that this file might have been:

Filtered

Only specific kinds of data or data meeting certain criteria may have been retained, removing extraneous or non-essential information.

Aggregated or Summarised

Data could be processed to provide summaries, such as daily or hourly statistics.

Anonymised

Sensitive information might have been removed or obscured to protect user privacy, especially in contexts where logs are used for analysis and need to comply with privacy regulations.

 

Managing Such a File
File Size Management

As the du output above shows, these files can grow significantly. Regular maintenance such as truncation, compression, and archiving is necessary to manage disk space effectively.

Security

Given the potentially sensitive nature of the data (e.g., IP addresses, URLs accessed), access to this file should be controlled and monitored.

If you need specific details about how this file is generated or processed, you would likely need to review the configuration of the web server or the application that writes to this log. This could involve examining web server settings (like Apache or Nginx) or scripts/processes that handle log files.

 

How do you reduce its size?

To handle a log file that has grown to a substantial size like 36 GB, the recommended approach to empty the file depends on whether you want to keep the log data for future reference or not. Here are two main strategies:

1. Truncate the File

If you don't need to keep the contents of the file, the simplest and most efficient way to empty it without deleting it (so that running processes can keep writing to it without disruption) is to truncate it. The following command, run from the same directory as the du command above, sets the size of the file to zero bytes:

truncate -s 0 ./logs/access_ssl_log.processed

 

2. Archive and then truncate

If you need to preserve the data for compliance or analysis purposes, you should first archive the file and then truncate it:

Archive the Log File: You can use a tool like tar or gzip to compress the file and save it elsewhere. For example:

tar -czf archive_name.tar.gz ./logs/access_ssl_log.processed

or

gzip -c ./logs/access_ssl_log.processed > /path/to/archive/access_ssl_log.processed.gz

Once the archive has been created successfully, truncate the original file as described in 'Truncate the File' above.
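The two steps can be combined so that the log is only emptied once the archive has been written and verified. This is a minimal sketch, not a hardened tool: archive_and_truncate is a hypothetical name, and any lines written to the log between the compression and the truncation would be lost.

```bash
# Compress a log into DEST_DIR, verify the archive, then empty the original.
# Usage: archive_and_truncate LOG_FILE DEST_DIR
# CAVEAT: entries written while gzip is running are not captured in the archive.
archive_and_truncate() {
  local log="$1" dest_dir="$2" archive
  archive="$dest_dir/$(basename "$log").$(date +%Y%m%d-%H%M%S).gz"
  if gzip -c "$log" > "$archive" && gzip -t "$archive"; then
    truncate -s 0 "$log"          # archive is intact, safe to empty the log
    echo "archived to $archive"
  else
    echo "archive failed; $log left untouched" >&2
    rm -f "$archive"
    return 1
  fi
}
```

For a 36 GB log, make sure DEST_DIR sits on a volume with enough free space for the compressed copy.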

Additional Considerations

Disk Space: Make sure you have enough disk space to create an archive if you choose to compress and store the log file.
Log Management Policies: Follow your organisation's log management policies and procedures, especially regarding data retention and privacy.
Automating Log Rotation: To prevent this issue in the future, consider setting up log rotation using a tool like logrotate. This utility can automatically compress, archive, and delete old log files according to rules specified in its configuration.
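As an illustration of the log rotation point, the sketch below writes a small logrotate configuration for a single log file. The schedule (weekly, eight rotations) is an assumed example rather than a recommendation, and write_logrotate_conf is a hypothetical helper. If a hosting control panel already manages these logs, its own rotation settings are probably the better place to make the change.

```bash
# Write a logrotate config for one log file.
# Usage: write_logrotate_conf LOG_PATH CONF_PATH
# ASSUMPTION: weekly rotation keeping 8 compressed copies suits the site.
write_logrotate_conf() {
  local log_path="$1" conf_path="$2"
  cat > "$conf_path" <<EOF
$log_path {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
}
```

The copytruncate directive copies the log and then truncates it in place, which suits processes that keep the file open. Running logrotate -d CONF_PATH performs a dry run that shows what would happen without touching any files.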

 

Step 2 - Check Logs and Caches

I inspected and cleared both the database caches and the cache directories.

 
