Case Harvester is a project designed to mine the Maryland Judiciary Case Search (MJCS) and build a near-complete database of Maryland court cases that can be queried and analyzed without the limitations of the MJCS interface. It is designed to leverage Amazon Web Services (AWS) for scalability and performance.

Our database of cases (with criminal defendant names redacted) is available to the public through a site built using our Case Explorer software. REST and GraphQL APIs are also available, as are monthly exports of the database tables.

NOTE: Unless you are modifying Case Harvester for specific purposes, please do not run your own instance, so that the MJCS is spared unnecessary load. Instead, use the options described above for viewing the data, or, if you have an AWS account, you are also welcome to clone our database directly.

Architecture

Case Harvester is split into three main components: spider, scraper, and parser. Each component is part of a pipeline that finds, downloads, and parses case data from the MJCS. The following diagram shows at a high level how these components interact:

The spider component is responsible for discovering new case numbers. It does this by submitting search queries to the MJCS and iterating through the results. Because the MJCS returns a maximum of 500 results per query, the search algorithm splits any query that hits that cap into a set of narrower queries, which are then submitted. Each of those queries is split again if it also returns 500 results, and so forth, until the MJCS has been exhaustively searched for case numbers. Each discovered case number is recorded in a PostgreSQL database and then added to a queue for scraping.

The spider is launched as Elastic Container Service (ECS) Fargate tasks run at regularly scheduled intervals. These tasks run Case Harvester from a Docker image pulled from an Elastic Container Registry (ECR). Periodically, the spider saves its state using a combination of DynamoDB and S3, which allows failed or canceled spider runs to be resumed.

The scraper component downloads and stores the case details for every case number discovered by the spider. The full HTML for each case is added to an S3 bucket. Version information is kept for each case, including a timestamp of when each version was downloaded, so changes to a case can be recorded and referenced. The scraper is a continuously running ECS service that processes case numbers from the SQS scraper queue.

The parser component is a Lambda function that parses the fields of information in the HTML case details for each case and stores that data in the PostgreSQL database. Case details in the MJCS are formatted differently depending on the county and the type of case (e.g. district vs. circuit court, criminal vs. civil), and on whether the case is in one of the new MDEC-compatible formats; the MJCS assigns a code to each of these case types. Each new object added to the scraper's S3 bucket triggers a parser Lambda invocation, which allows for significant scaling.
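The spider's recursive query-splitting strategy described above can be sketched as follows. This is a simplified illustration, not the project's actual code: `search` stands in for an MJCS search request, and splitting by appending digits assumes purely numeric case-number prefixes, whereas the real spider's query parameters differ.

```python
# Simplified sketch of the spider's recursive query splitting.
# MAX_RESULTS mirrors the 500-result cap the MJCS imposes on searches.
MAX_RESULTS = 500

def split_query(query):
    """Narrow a query into more specific sub-queries (digit prefixes here)."""
    return [query + c for c in "0123456789"]

def discover_case_numbers(search, query, found):
    """Recursively search until every query returns fewer than MAX_RESULTS."""
    results = search(query)
    if len(results) >= MAX_RESULTS:
        # Capped result set: the query is too broad, so split and recurse.
        for narrower in split_query(query):
            discover_case_numbers(search, narrower, found)
    else:
        # Uncapped result set: these case numbers are complete for this query.
        found.update(results)
```

Because a capped response gives no way to page past the 500th hit, exhaustiveness comes from narrowing queries until none of them hits the cap.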
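One way to picture the scraper's version-keeping is to lean on S3 bucket versioning: writing each case's HTML to the same key means every re-download becomes a new, retrievable object version. The sketch below is hypothetical, not the project's actual code; the `versioned_put_args` helper, the metadata field name, and the assumption that versioning is enabled on the bucket are all illustrative.

```python
# Hypothetical sketch: build put_object() arguments so that, on a
# version-enabled bucket, each re-scrape of a case adds a new version.
from datetime import datetime, timezone

def versioned_put_args(case_number, html):
    """Build keyword arguments for boto3's s3.put_object()."""
    return {
        "Key": case_number,              # one key per case number
        "Body": html.encode("utf-8"),
        "Metadata": {
            # Records when this particular version was downloaded.
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }

# e.g. s3.put_object(Bucket="scraper-bucket", **versioned_put_args(num, html))
```

With this layout, listing an object's versions yields the full download history of a case, which is what makes change tracking possible.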
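A minimal sketch of the S3-triggered parser entry point might look like the following. The event shape is the standard S3 notification payload Lambda receives; the `fetch`, `parse`, and `store` hooks are hypothetical stand-ins for the S3 download, HTML parsing, and PostgreSQL insert, not the project's actual functions.

```python
# Sketch of an S3-event-driven parser Lambda. Each new object in the
# scraper bucket produces one invocation carrying one or more records.
import urllib.parse

def objects_from_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    return [
        (r["s3"]["bucket"]["name"],
         # S3 URL-encodes keys in notifications; '+' encodes a space.
         urllib.parse.unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]

def handler(event, context=None, fetch=None, parse=None, store=None):
    # fetch/parse/store are illustrative hooks for download, HTML parsing,
    # and database insert, respectively.
    for bucket, key in objects_from_event(event):
        store(parse(fetch(bucket, key)))
```

Because each new object fires its own invocation, concurrency scales with the scraper's write rate rather than with any fixed worker pool.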