Creating a crawler

Crawlers are ordered by customers. To create a new crawler, start with an issue in the create-new-scrapy-crawler project. The subject should contain the name of the crawler, which must match the regular expression /[a-z0-9-]+/. Use the same name for the repository, the JSON file and the Scrapy crawler.

Subject: create crawler <spidername> http://example.com
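
The naming rule can be checked before the issue is filed. The following is a minimal sketch, assuming the subject always has the form shown above; the function name `parse_subject` is hypothetical and not part of any existing tooling.

```python
import re

# Subject format from this page: "create crawler <spidername> <url>".
# The name part must consist only of lowercase letters, digits and hyphens.
SUBJECT_PATTERN = re.compile(r"create crawler ([a-z0-9-]+) (\S+)")


def parse_subject(subject):
    """Return (spidername, url) if the subject is well-formed, else None."""
    match = SUBJECT_PATTERN.fullmatch(subject.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_subject("create crawler my-spider http://example.com"))
    print(parse_subject("create crawler My_Spider http://example.com"))
```

`fullmatch` is used rather than `search` so that a name containing uppercase letters or underscores is rejected as a whole instead of matching on a valid substring.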

Developers are asked whether they can take on the task. The developer who takes it assigns themselves to the issue and starts working. They create a repository in the JobCrawler group and name it <spidername>.

Example

root@scrapy-runner:~# scrapy startproject spidername
 New Scrapy project 'spidername', using template directory '/usr/local/lib/python3.5/dist-packages/scrapy/templates/project', created in:
 /root/spidername

 You can start your first spider with:
     cd spidername
     scrapy genspider example example.com

 root@scrapy-runner:~/spidername# scrapy genspider spidername cross-solution.de
 Cannot create a spider with the same name as your project

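Open question: scrapy genspider refuses a spider name that is identical to the project name, which conflicts with the rule that the crawler name is used everywhere. Why does this restriction exist? Please complete the example.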
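
Until the example is completed, one possible workaround is to write the spider module by hand instead of using genspider, since the message comes from the genspider command itself. The sketch below only renders the source text of such a module. The class name `JobSpider`, the start URL and the body of `parse` are placeholders, not an agreed convention.

```python
from string import Template

# Source of a minimal spider module. Everything except the "name"
# attribute is a placeholder and has to be adapted to the target site.
SPIDER_TEMPLATE = Template('''\
import scrapy


class JobSpider(scrapy.Spider):
    name = "${spider_name}"
    allowed_domains = ["${domain}"]
    start_urls = ["https://${domain}/"]

    def parse(self, response):
        # Placeholder: yield only the page title.
        yield {"title": response.css("title::text").extract_first()}
''')


def render_spider(spider_name, domain):
    """Return the Python source of a spider named spider_name."""
    return SPIDER_TEMPLATE.substitute(spider_name=spider_name, domain=domain)


if __name__ == "__main__":
    print(render_spider("spidername", "cross-solution.de"))
```

The printed source can be saved as a file in the project's spiders directory. Whether a spider named like the project behaves well in the rest of the toolchain is an assumption that still has to be verified.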

Please follow the PEP 8 Style Guide for Python Code.

Please add a .gitlab-ci.yml to your code. Example:

before_script:
   - export PATH=$PATH:/usr/local/bin
   - run-tests.sh ${CI_PROJECT_NAME}

This tests the code against the PEP 8 style guide and runs the crawler on the test and deployment machine whenever changes are pushed to master.

In addition, create a .gitignore file:

*~
*.pyc

# ignore .idea and build directory
.idea
build

Crawlers have to be version controlled with Git. Their location in our GitLab is: https://gitlab.cross-solution.de/scrapy/JobCrawler