Creating a crawler
Crawlers are ordered by customers. To create a new crawler, we start with an issue in the create-new-scrapy-crawler project.
The subject should contain the name of the crawler, which must match the regular expression /[a-z0-9-]+/.
This name is used as the name of the repository, the name of the JSON file, and the name of the Scrapy crawler.
Subject: create crawler <spidername> http://example.com
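The naming rule and subject format above can be checked mechanically. The following is a minimal sketch, not part of our tooling: the function name parse_subject and the exact subject layout ("create crawler", one space, the name, one space, the URL) are assumptions made for illustration. Note that the regular expression is applied to the whole name, since an unanchored search would also accept names that merely contain a valid substring.

```python
import re

# Pattern from the naming rule; fullmatch() applies it to the whole name.
NAME_RE = re.compile(r"[a-z0-9-]+")

# Assumed subject layout: "create crawler <spidername> <url>".
SUBJECT_RE = re.compile(r"create crawler (?P<name>\S+) (?P<url>https?://\S+)")


def parse_subject(subject):
    """Return (name, url) from an issue subject, or raise ValueError."""
    match = SUBJECT_RE.fullmatch(subject.strip())
    if match is None:
        raise ValueError("unexpected subject format: %r" % subject)
    name = match.group("name")
    if NAME_RE.fullmatch(name) is None:
        raise ValueError("invalid crawler name: %r" % name)
    return name, match.group("url")


if __name__ == "__main__":
    # "example-jobs" is a made-up crawler name used only for this demo.
    print(parse_subject("create crawler example-jobs http://example.com"))
```

A name containing uppercase letters or underscores, such as Example_Jobs, is rejected with a ValueError.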
Developers are asked whether they can take on the task. The developer who takes it should assign themselves to the issue and start working. They should create a repository in the JobCrawler group, named <spidername>.
Example
root@scrapy-runner:~# scrapy startproject spidername
New Scrapy project 'spidername', using template directory '/usr/local/lib/python3.5/dist-packages/scrapy/templates/project', created in:
/root/spidername
You can start your first spider with:
cd spidername
scrapy genspider example example.com
root@scrapy-runner:~/spidername# scrapy genspider spidername cross-solution.de
Cannot create a spider with the same name as your project
Why can't a spider have the same name as the project? Please complete the example.
Please follow the PEP 8 Style Guide for Python Code.
Please add a .gitlab-ci.yml to your code. Example:
before_script:
- export PATH=$PATH:/usr/local/bin
- run-tests.sh ${CI_PROJECT_NAME}
This tests the code against the PEP 8 style guide and executes the crawler on the test and deployment machine whenever changes are pushed to master.
In addition, create a .gitignore file:
*~
*.pyc
# ignore .idea and build directory
.idea
build
Crawlers have to be version controlled with Git. The location in our GitLab is: https://gitlab.cross-solution.de/scrapy/JobCrawler